• notice
  • Congratulations on the launch of the Sought Tech site

Elasticsearch's distributed document storage principle

Route a document to a shard

When indexing a document, the document is stored in a primary shard. How does Elasticsearch know which shard a document should be stored in? When we create a document, how does it decide whether the document should be stored in shard 1 or shard 2?

First of all it's definitely not random, or we won't know where to look when we want to get documents in the future. In fact, this process is determined according to the following formula:

shard = hash(routing) % number_of_primary_shards

routingIs a variable value, the default is the document _id, it can also be set to a custom value. routingGenerate a number through the hashfunction, and then divide this number by number_of_primary_shards(the number of primary shards) to get the remainder. The remainder of this distribution between 0to is the location of the shard where the document we are looking for is located.number_of_primary_shards-1

This explains why we need to determine the number of primary shards when we create the index, and never change this number: because if the number changes, all previously routed values will be invalid, and the document will never be found again. arrive.

All document APIs ( getindexdeletebulkupdateand mget) accept routinga , through which we can customize the document-to-shard mapping. A custom routing parameter can be used to ensure that all related documents—for example, all documents belonging to the same user—are stored in the same shard.

How primary and replica shards interact

We can send requests to any node in the cluster. Each node has the ability to handle arbitrary requests. Each node is aware of the location of any document in the cluster, so it can forward requests directly to the desired node. Each cluster will have one master node, which we will call it 协调节点(coordinating node).

When sending requests, in order to scale the load, it is better to poll all nodes in the cluster.

Create, index and delete documents

New, index and delete requests are all write operations and must be completed on the primary shard before they can be replicated to the relevant replica shards.

When the client receives a successful response, the document change has been performed on the primary shard and all replica shards, and the change is safe.

There are some optional request parameters that allow you to influence this process, possibly improving performance at the expense of data security. These options are rarely used because Elasticsearch is already fast, but for completeness they are elaborated here as follows:

consistency

consistency, that is, consistency. By default , the primary shard will require quorum (or in other words, a majority) of shard replicas must be active before even attempting to perform a write operation state, the write operation will be performed (where the shard copy can be the primary shard or the replica shard). This is to avoid write , which could lead to data inconsistencies. The specified quantity is:

int( (primary + number_of_replicas) / 2 ) + 1

consistencyThe value of the parameter can be set to one(write operations are allowed as long as the status of the primary shard is ok ), allwrite ), or quorum. The default is quorumthat most shard replicas are in good condition to allow write operations.

Note that 规定数量the calculation formula number_of_replicasrefers to the number of replica shards set in the index settings, not the number of replica shards that are currently processing active. If your index settings specify that the current index has three replica shards, the calculation result of the specified number is:

int( (primary + 3 replicas) / 2 ) + 1 = 3

If you only start two nodes at this point, then the number of active shard replicas will not reach the specified number, so you will not be able to index and delete any documents.

timeout

What happens if there are not enough replica shards? Elasticsearch will wait, hoping for more shards to appear. By default it waits up to 1 minute. If you want, you can make it terminate earlier with the timeoutparameter : 100100ms, which 30sis 30s.

Other things to pay attention to

New indexes have 1one replica shard by default, which means that 规定数量two active shard replicas should be required to satisfy them. However, these default settings prevent us from doing anything on a single node. To avoid this problem, it is required number_of_replicasthat the specified number will only be executed if it is greater than 1.

retrieve a document

Documents can be retrieved from the primary shard or from any other replica shard

When processing read requests, the coordinator node will balance the load by polling all replica shards for each request.

When documents are retrieved, documents that have been indexed may already exist on the primary shard but have not yet been replicated to the replica shards. In this case, the replica shard may report that the document does not exist, but the primary shard may successfully return the document. Once the indexing request is successfully returned to the user, the document is available on both the primary and replica shards.

Partially update documentation

Here are the steps to partially update a document:

  1. The client sends an update request to Node 1.

  2. It forwards the request to Node 3 where the primary shard is located.

  3. Node 3 retrieves the document from the primary shard, modifies the JSON in the _sourcefield , and attempts to reindex the primary shard's document. If the document has been modified by another process, it will retry step 3 retry_on_conflictand .

  4. If Node 3 successfully updates the document, it forwards the new version of the document to the replica shards on Node 1 and Node 2 in parallel for re-indexing. Once all replica shards return success, Node 3 also returns success to the coordinator node, which returns success to the client.

updateThe API also accepts routingreplicationconsistencyand timeoutparameters .

document-based replication

When the primary shard forwards changes to replica shards, it does not forward update requests. Instead, it forwards a new version of the full document. Keep in mind that these changes will be forwarded to replica shards asynchronously, and there is no guarantee that they will arrive in the same order in which they were sent. If Elasticsearch only forwards change requests, it is possible to apply changes in the wrong order, resulting in corrupted documents.



Tags

Technical otaku

Sought technology together

Related Topic

1 Comments

author

buy atorvastatin generic & lt;a href="https://lipiws.top/"& gt;buy generic atorvastatin 40mg& lt;/a& gt; atorvastatin price

Gdusct

2024-03-07

Leave a Reply

+