Elasticsearch's distributed document storage principle
Route a document to a shard
When indexing a document, the document is stored in a primary shard. How does Elasticsearch know which shard a document should be stored in? When we create a document, how does it decide whether the document should be stored in shard 1 or shard 2?
First of all, it is definitely not random, or we would not know where to look when we want to retrieve the document later. In fact, the shard is determined by the following formula:
shard = hash(routing) % number_of_primary_shards
routing is a variable value. It defaults to the document's _id, but it can also be set to a custom value. The routing value is passed through the hash function to generate a number, and this number is divided by number_of_primary_shards (the number of primary shards) to get the remainder. The remainder, which always lies between 0 and number_of_primary_shards-1, is the number of the shard where the document is located.
This explains why we need to decide on the number of primary shards when we create the index and never change it afterwards: if the number changed, all previous routing values would become invalid and the documents could never be found again.
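The routing formula can be sketched in a few lines of Python. This is only an illustration: zlib.crc32 stands in for whatever hash function Elasticsearch uses internally, and route_to_shard and the document IDs are made-up names. Python's built-in hash() is avoided because it is randomized per process for strings.

```python
import zlib


def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return hash(routing) % number_of_primary_shards.

    zlib.crc32 is an assumed stand-in, not Elasticsearch's real hash function.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# Hypothetical document IDs; by default the _id is the routing value.
doc_ids = [f"doc-{n}" for n in range(10)]

# Shard assignment when the index has 5 primary shards.
with_5_shards = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}

# The same IDs routed as if the index had 6 primary shards instead.
with_6_shards = {doc_id: route_to_shard(doc_id, 6) for doc_id in doc_ids}

# Documents whose computed shard differs would be looked up in the wrong place.
misrouted = [d for d in doc_ids if with_5_shards[d] != with_6_shards[d]]
print(f"{len(misrouted)} of {len(doc_ids)} documents would be looked up on a different shard")
```

Because the remainder depends on the divisor, the same routing value generally maps to a different shard once the shard count changes, which is why the count is fixed at index creation.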
All document APIs (get, index, delete, bulk, update and mget) accept a routing parameter, through which we can customize the document-to-shard mapping. A custom routing value can be used to ensure that all related documents (for example, all documents belonging to the same user) are stored in the same shard.
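To see why a custom routing value keeps related documents together, here is a small sketch. The shard count, the documents, and the helper names shard_for and place are all hypothetical, and zlib.crc32 is again only a stand-in for the real hash function.

```python
import zlib
from collections import defaultdict

NUMBER_OF_PRIMARY_SHARDS = 3  # assumed value for illustration


def shard_for(routing: str) -> int:
    """Apply the routing formula with a stand-in hash function."""
    return zlib.crc32(routing.encode("utf-8")) % NUMBER_OF_PRIMARY_SHARDS


# Hypothetical documents belonging to two users.
documents = [
    {"_id": "order-1", "user": "alice"},
    {"_id": "order-2", "user": "bob"},
    {"_id": "order-3", "user": "alice"},
    {"_id": "order-4", "user": "bob"},
    {"_id": "order-5", "user": "alice"},
]


def place(docs, routing_field=None):
    """Group document IDs by shard, routing on routing_field (or _id if None)."""
    shards = defaultdict(list)
    for doc in docs:
        routing = doc[routing_field] if routing_field else doc["_id"]
        shards[shard_for(routing)].append(doc["_id"])
    return dict(shards)


print("routed by _id: ", place(documents))
print("routed by user:", place(documents, routing_field="user"))
```

When routing on the user field, every document of a given user hashes to the same value and therefore lands on the same shard, whereas routing on _id may scatter them.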
How primary and replica shards interact
We can send requests to any node in the cluster. Every node is capable of handling any request. Every node knows the location of every document in the cluster, so it can forward a request directly to the node that holds it. The node that receives a request from the client acts as the coordinating node for that request.
When sending requests, it is good practice to round-robin through all the nodes in the cluster in order to spread the load.
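A client-side round-robin can be as simple as the following sketch. The node names and the make_node_picker helper are hypothetical, and no real client library is involved.

```python
from itertools import cycle


def make_node_picker(nodes):
    """Return a function that yields the nodes in rotation, one per call."""
    rotation = cycle(nodes)
    return lambda: next(rotation)


# Hypothetical node names for a three-node cluster.
pick_node = make_node_picker(["node-1", "node-2", "node-3"])

for request in ["index doc 1", "get doc 1", "delete doc 2", "get doc 3"]:
    # Whichever node is picked acts as the coordinating node for this request.
    print(f"{request} -> {pick_node()}")
```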
Create, index and delete documents
Create, index, and delete requests are all write operations, which must complete on the primary shard before they can be replicated to the associated replica shards.
When the client receives a successful response, the document change has been performed on the primary shard and all replica shards, and the change is safe.
There are some optional request parameters that allow you to influence this process, possibly improving performance at the expense of data safety. These options are seldom used because Elasticsearch is already fast, but they are described here for completeness:
consistency
By default, the primary shard requires a quorum (in other words, a majority) of shard copies to be active before it even attempts a write operation (a shard copy can be either the primary shard or a replica shard). This is to avoid performing writes during a network partition, which could lead to data inconsistencies. The quorum is defined as:
int( (primary + number_of_replicas) / 2 ) + 1
The consistency parameter can be set to one (write operations are allowed as long as the primary shard is OK), all (the primary shard and all replica shards must be OK before a write is allowed), or quorum. The default is quorum, meaning that a majority of shard copies must be OK before a write is allowed.
Note that in the quorum formula, number_of_replicas refers to the number of replica shards specified in the index settings, not to the number of replica shards that are currently active. If your index settings specify three replica shards, the quorum works out to:
int( (primary + 3 replicas) / 2 ) + 1 = 3
If you start only two nodes at this point, the number of active shard copies will not reach the quorum, so you will be unable to index or delete any documents.
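The arithmetic above can be checked with a short sketch. The function names are made up for illustration, and the model simply counts active shard copies, assuming that the primary is one of them.

```python
def quorum(number_of_replicas: int) -> int:
    """int((primary + number_of_replicas) / 2) + 1, with one primary."""
    primary = 1
    return int((primary + number_of_replicas) / 2) + 1


def required_active_copies(consistency: str, number_of_replicas: int) -> int:
    """How many shard copies must be active before a write is attempted."""
    if consistency == "one":
        return 1  # only the primary shard
    if consistency == "all":
        return 1 + number_of_replicas  # the primary plus every replica
    if consistency == "quorum":
        return quorum(number_of_replicas)
    raise ValueError(f"unknown consistency value: {consistency!r}")


def write_allowed(active_copies: int, number_of_replicas: int,
                  consistency: str = "quorum") -> bool:
    """True if enough shard copies are active for the chosen consistency."""
    return active_copies >= required_active_copies(consistency, number_of_replicas)


# Three configured replicas: the quorum is 3.
print(quorum(3))
# Two nodes can hold at most two copies of a shard, so the write is refused.
print(write_allowed(active_copies=2, number_of_replicas=3))
```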
timeout
What happens if there are not enough shard copies? Elasticsearch waits, hoping that more shards will appear. By default it waits up to 1 minute. If you want, you can make it give up earlier with the timeout parameter: 100 means 100 milliseconds, and 30s means 30 seconds.
Other things to pay attention to
New indexes have 1 replica shard by default, which means that two active shard copies should be required to satisfy the quorum. However, these default settings would prevent us from doing anything on a single node. To avoid this problem, the quorum is enforced only when number_of_replicas is greater than 1.
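The single-node exception can be expressed as a small variation on the quorum formula. This sketch assumes that when the quorum is not enforced, only the primary shard needs to be active; effective_quorum is a hypothetical name.

```python
def effective_quorum(number_of_replicas: int) -> int:
    """Active shard copies needed under quorum consistency.

    The quorum is enforced only when number_of_replicas is greater than 1.
    Otherwise this sketch assumes the primary alone is sufficient.
    """
    if number_of_replicas > 1:
        return int((1 + number_of_replicas) / 2) + 1
    return 1


for replicas in range(4):
    print(f"number_of_replicas={replicas}: {effective_quorum(replicas)} active copies needed")
```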
Retrieve a document
Documents can be retrieved from the primary shard or from any of its replica shards.
When processing read requests, the coordinating node balances the load by round-robining through all shard copies, choosing a different one for each request.
While a document is being indexed, it may already exist on the primary shard but not yet have been copied to the replica shards. In this case, a replica shard might report that the document does not exist, while the primary shard would return it successfully. Once the index request has returned success to the user, the document is available on the primary shard and all replica shards.
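The visibility window described above can be illustrated with a toy model. ShardCopy and read_round_robin are hypothetical names, and the "replication" here is just a dictionary copy.

```python
from itertools import cycle


class ShardCopy:
    """A toy shard copy holding documents in a dict."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def get(self, doc_id):
        return self.docs.get(doc_id)  # None means "document not found"


def read_round_robin(copies, doc_id, number_of_reads):
    """Serve each read from the next shard copy in rotation."""
    rotation = cycle(copies)
    results = []
    for _ in range(number_of_reads):
        copy = next(rotation)
        results.append((copy.name, copy.get(doc_id) is not None))
    return results


primary, replica = ShardCopy("primary"), ShardCopy("replica")

# The document is on the primary, but replication has not finished yet.
primary.docs["1"] = {"title": "hello"}
during_indexing = read_round_robin([primary, replica], "1", 2)

# Replication completes; only now would the index request report success.
replica.docs["1"] = dict(primary.docs["1"])
after_success = read_round_robin([primary, replica], "1", 2)

print(during_indexing)
print(after_success)
```

In this model the read served by the replica misses the document only while the index request is still in flight.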
Partially update a document
Here are the steps to partially update a document:
1. The client sends an update request to Node 1.
2. Node 1 forwards the request to Node 3, where the primary shard is located.
3. Node 3 retrieves the document from the primary shard, modifies the JSON in the _source field, and attempts to reindex the document on the primary shard. If the document has been modified by another process in the meantime, it retries step 3, giving up after retry_on_conflict attempts.
4. If Node 3 successfully updates the document, it forwards the new version of the document in parallel to the replica shards on Node 1 and Node 2 to be reindexed. Once all replica shards report success, Node 3 reports success to the coordinating node, which reports success to the client.
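Step 3 is a retrieve-modify-reindex loop guarded by a version check. The following is a simplified model of that loop, not Elasticsearch's implementation: PrimaryShard, VersionConflict, partial_update and the before_index hook (used only to simulate a concurrent writer) are all hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the document changed between retrieval and reindexing."""


class PrimaryShard:
    """A toy primary shard storing doc_id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return self.docs[doc_id]

    def index(self, doc_id, source, expected_version):
        current_version, _ = self.docs[doc_id]
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self.docs[doc_id] = (current_version + 1, source)
        return current_version + 1


def partial_update(shard, doc_id, changes, retry_on_conflict=0, before_index=None):
    """Retrieve, merge and reindex, retrying up to retry_on_conflict times."""
    attempts = retry_on_conflict + 1
    for attempt in range(attempts):
        version, source = shard.get(doc_id)
        merged = {**source, **changes}
        if before_index:
            before_index(attempt)  # test hook: lets another writer sneak in
        try:
            return shard.index(doc_id, merged, expected_version=version)
        except VersionConflict:
            if attempt == attempts - 1:
                raise  # out of retries: give up


shard = PrimaryShard()
shard.docs["1"] = (1, {"views": 0, "title": "a"})


def concurrent_writer(attempt):
    """Simulate another process updating the document during the first attempt."""
    if attempt == 0:
        version, source = shard.get("1")
        shard.index("1", {**source, "views": 1}, expected_version=version)


new_version = partial_update(shard, "1", {"title": "b"},
                             retry_on_conflict=1, before_index=concurrent_writer)
print(new_version, shard.docs["1"])
```

The first attempt fails because the concurrent writer bumped the version, and the retry re-reads the document so that neither change is lost.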
The update API also accepts the routing, replication, consistency, and timeout parameters.
Document-based replication
When the primary shard forwards changes to its replica shards, it does not forward the update request. Instead, it forwards the new version of the full document. Keep in mind that these changes are forwarded to the replica shards asynchronously, and there is no guarantee that they will arrive in the same order in which they were sent. If Elasticsearch forwarded only the change requests, the changes could be applied in the wrong order, resulting in a corrupted document.
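The difference can be seen in a small simulation. It assumes that each full document carries a version number and that a replica discards any document older than the one it already holds; both helper functions are hypothetical.

```python
def apply_full_documents(messages):
    """Replica keeps the highest-versioned full document it has received."""
    version, doc = 0, None
    for message_version, message_doc in messages:
        if message_version > version:  # an older version arriving late is ignored
            version, doc = message_version, message_doc
    return doc


def apply_change_requests(initial, changes):
    """Replica blindly applies each change request in arrival order."""
    doc = dict(initial)
    for change in changes:
        change(doc)
    return doc


def set_count_to_10(doc):
    doc["count"] = 10


def increment_count(doc):
    doc["count"] += 1


initial = {"count": 1}
# On the primary: set count to 10 (version 2), then increment it (version 3).
full_docs_out_of_order = [(3, {"count": 11}), (2, {"count": 10})]
changes_out_of_order = [increment_count, set_count_to_10]

print(apply_full_documents(full_docs_out_of_order))
print(apply_change_requests(initial, changes_out_of_order))
```

With full documents, the late-arriving older version is dropped and the replica ends with count 11, matching the primary. With reordered change requests, the replica ends with count 10.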