Zookeeper principle and Zookeeper election mechanism and process
Zookeeper principle
The role of the ZooKeeper cluster
ZooKeeper is a high-availability cluster based on master-slave replication; usually 3 or more servers form a ZooKeeper cluster. In the architecture diagram provided by the official ZooKeeper documentation, the cluster provides services to the outside world as a single whole.
Each Server represents a machine running the ZooKeeper service. The servers that make up the ZooKeeper cluster keep the current server state in memory and maintain communication with one another.
Zookeeper roles include Leader, Follower, and Observer.
Leader
A running Zookeeper cluster has only one Leader service, and the Leader service has two main responsibilities:
Responsible for write operations of cluster data
Initiate and maintain heartbeats with each Follower and Observer to monitor the running state of the cluster.
All write operations in the ZooKeeper cluster must go through the Leader. Only after the Leader has completed the write is it broadcast to the other Followers, and the write request is considered successful only when more than half of the voting nodes (Observer nodes excluded) have written it successfully.
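As a minimal sketch of this more-than-half rule (illustrative only, not ZooKeeper's actual implementation; the class and method names below are made up), a proposal commits once acknowledgements from the voting members exceed half of their number:

```java
// Hypothetical sketch of the "more than half" commit rule used for writes.
// Not ZooKeeper's real code; names are invented for illustration.
public class QuorumCommitSketch {

    private final int votingMembers;   // Leader + Followers, Observers excluded
    private int ackCount = 0;

    public QuorumCommitSketch(int votingMembers) {
        this.votingMembers = votingMembers;
    }

    // The Leader itself counts as the first acknowledgement of its own proposal.
    public void leaderProposes() {
        ackCount = 1;
    }

    // Called once per Follower that has written the proposal to its log.
    public boolean followerAcked() {
        ackCount++;
        return hasQuorum();
    }

    // Commit only when acknowledgements exceed half of the voting members.
    public boolean hasQuorum() {
        return ackCount > votingMembers / 2;
    }

    public static void main(String[] args) {
        QuorumCommitSketch proposal = new QuorumCommitSketch(5); // 5 voting servers
        proposal.leaderProposes();                    // 1 ack (the Leader)
        System.out.println(proposal.followerAcked()); // 2 acks -> false
        System.out.println(proposal.followerAcked()); // 3 acks -> true, commit
    }
}
```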
Follower
A running ZooKeeper cluster usually has multiple Followers. Each Follower maintains a connection with the Leader through heartbeats, and the Follower service has two main responsibilities:
Responsible for read operations of cluster data
Participate in cluster leader election
After a Follower receives a client request, it first determines whether the request is a read or a write. If it is a read request, the Follower reads the data from its local node and returns it to the client; if it is a write request, the Follower forwards it to the Leader for processing. In addition, if the Leader fails, the Followers take part in the cluster election.
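The read/write split described above can be sketched as follows; this is only an illustration of the routing rule, and the Request, readLocally, and forwardToLeader names are hypothetical rather than ZooKeeper APIs:

```java
// Hypothetical sketch of how a Follower routes requests, per the description above.
public class FollowerRoutingSketch {

    enum RequestType { READ, WRITE }

    record Request(RequestType type, String path, byte[] data) {}

    String handle(Request request) {
        if (request.type() == RequestType.READ) {
            return readLocally(request.path());        // serve reads from the local copy
        }
        return forwardToLeader(request);               // writes always go through the Leader
    }

    private String readLocally(String path) {
        return "local value of " + path;               // placeholder
    }

    private String forwardToLeader(Request request) {
        return "forwarded write for " + request.path(); // placeholder
    }
}
```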
Observer
A running ZooKeeper cluster may have any number of Observers (including none), and the Observer's main responsibility is to serve reads of cluster data. An Observer is similar to a Follower, but it has no voting rights.
During operation, a ZooKeeper cluster may need to support more concurrent client operations, which requires more service instances; but more voting instances complicate the voting phase, and a longer election is not conducive to rapid recovery from cluster failures. The Observer was therefore introduced into ZooKeeper: it does not participate in voting, but it accepts client connections, responds to read requests, and forwards write requests to the Leader. Adding Observer nodes improves the throughput of the ZooKeeper cluster while preserving the stability of the system.
ZAB protocol
The ZAB (ZooKeeper Atomic Broadcast) protocol is the Zookeeper atomic broadcast protocol.
ZooKeeper relies mainly on the ZAB protocol to achieve distributed data consistency. Based on this protocol, ZooKeeper implements a primary-backup system architecture to keep the data of the replicas in the cluster consistent.
Epoch & Zxid
Epoch is the cluster's current epoch number. Every Leader change produces a new epoch number, generated by adding 1 to the previous one. When a crashed Leader recovers, it will find that its epoch number is smaller than the current one, which indicates that the cluster has already produced a new Leader; the old Leader then rejoins the cluster as a Follower.
Zxid is the transaction id of the ZAB protocol. It is a 64-bit number: the lower 32 bits are a monotonically increasing counter, incremented by 1 for each client transaction request, and the upper 32 bits store the Epoch. Each time a new Leader is elected, it takes the largest Zxid from its local log, extracts the Epoch from the upper 32 bits and adds 1 to obtain the new Epoch, and restarts the lower 32-bit counter from 0. The Zxid identifies a proposal; to guarantee ordering, Zxids must be monotonically increasing.
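Since the layout is fully specified above (upper 32 bits Epoch, lower 32 bits counter), it can be demonstrated with a few bit operations. This is a minimal sketch, not code from ZooKeeper itself:

```java
// Sketch of the Zxid layout described above: upper 32 bits = Epoch, lower 32 bits = counter.
public final class ZxidSketch {

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 42);                   // epoch 3, 42nd transaction
        System.out.println(epochOf(zxid));             // 3
        System.out.println(counterOf(zxid));           // 42
        // A new Leader takes the highest Epoch it has seen, adds 1, and resets the counter.
        long firstZxidOfNewEpoch = makeZxid(epochOf(zxid) + 1, 0);
        System.out.println(Long.toHexString(firstZxidOfNewEpoch)); // 400000000
    }
}
```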
2 modes of ZAB protocol
The ZAB protocol has 2 modes: recovery mode and broadcast mode.
Recovery mode (cluster master election)
When the cluster first starts, when it restarts, or when the Leader crashes, the cluster begins to elect a new Leader; this process is called recovery mode.
Broadcast mode (data synchronization)
Once a Leader has been elected, it broadcasts the latest cluster state to the other Followers; this is broadcast mode. This synchronization step ends once more than half of the Followers have synchronized state with the Leader. The state synchronization here refers to data synchronization, which ensures that more than half of the machines in the cluster hold the same data state as the Leader.
The 4 phases of the ZAB protocol
1. Leader Election
At the start of an election, all nodes are in the election phase. When a node receives more than half of the votes, it is elected quasi-Leader (prospective Leader).
The purpose of the election phase is only to produce a quasi-Leader; the quasi-Leader becomes the actual Leader only after the cluster reaches the broadcast phase.
2. Discovery
During the discovery phase, the Followers communicate with the prospective Leader and report the transaction proposals they have most recently received. The prospective Leader then generates a new Epoch and tries to have the Followers accept this Epoch and update their local copy of it.
A follower in the discovery phase will only connect to one leader. If node 1 thinks that node 2 is the leader, then node 1 will try to connect to node 2. If the connection is rejected, the cluster will re-enter the election phase.
The main purpose of the discovery phase is to discover the latest proposals received by the majority of nodes.
3. Synchronization
In the synchronization phase, the latest proposal information that the quasi-Leader obtained in the previous phase is synchronized to all replicas in the cluster. Only when more than half of the nodes have completed synchronization does the quasi-Leader become the actual Leader.
After the synchronization phase is completed, the cluster master selection operation is completed, and a new leader will be generated.
4. Broadcast
In the broadcast phase, the ZooKeeper cluster can officially provide transaction services to the outside world, and the Leader broadcasts messages to keep the other Followers informed of its state. If a new node joins, the Leader synchronizes its state to the new node.
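As a mnemonic for the order of the four phases described above (this enum is only an illustration, not an actual ZooKeeper type):

```java
// Mnemonic for the ZAB phase order described above; not a real ZooKeeper type.
public enum ZabPhase {
    ELECTION,        // produce a prospective (quasi) Leader via majority vote
    DISCOVERY,       // prospective Leader learns the newest proposals and issues a new Epoch
    SYNCHRONIZATION, // replicas catch up; the quasi-Leader is confirmed
    BROADCAST;       // normal operation: the Leader broadcasts committed proposals

    // A crash or loss of quorum at any point sends the cluster back to ELECTION.
    ZabPhase next() {
        return this == BROADCAST ? BROADCAST : values()[ordinal() + 1];
    }
}
```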
Zookeeper election mechanism and process
Zookeeper election mechanism
Each server may propose itself as the Leader and first votes for itself; it then compares its vote with the votes received from the other servers, and the server with the strongest vote wins. The specific election process is as follows (a small tallying sketch follows these steps):
After each server starts, it asks the other servers whom to vote for; each server replies with the Leader it recommends based on its own state, returning that Leader's id and Zxid. When the cluster first starts, every server recommends itself as the Leader.
When a server has received replies from all the other servers, it works out which server has the largest Zxid and votes for that server in the next round.
The server with the most votes in this process is the winner. If the winner's votes exceed half of the number of servers in the cluster, it is elected Leader; otherwise voting continues until a Leader is elected.
Leader waits for other servers to connect.
Each Follower connects to the Leader and sends it its largest Zxid.
The Leader determines the synchronization point according to each Follower's Zxid. At this point, the election phase is complete.
After the election phase is complete, the Leader notifies the other Followers that the cluster has reached the Uptodate state. After receiving the Uptodate message, a Follower begins accepting client requests and providing external service.
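A minimal sketch of one tallying round under these rules (illustrative only; the class and method names are made up, and the map keys/values stand for server ids):

```java
// Hypothetical tally of one voting round: every server names the candidate it supports,
// and a Leader is elected only when one candidate holds more than half of all votes.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class VoteTallySketch {

    static Optional<Long> electedLeader(Map<Long, Long> votes, int clusterSize) {
        // votes: voter server id -> candidate server id it currently supports
        Map<Long, Integer> tally = new HashMap<>();
        for (long candidate : votes.values()) {
            tally.merge(candidate, 1, Integer::sum);
        }
        return tally.entrySet().stream()
                .filter(e -> e.getValue() > clusterSize / 2)   // "more than half" rule
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // Three servers have started in a five-server cluster; all now support server 3.
        Map<Long, Long> votes = Map.of(1L, 3L, 2L, 3L, 3L, 3L);
        System.out.println(electedLeader(votes, 5)); // Optional[3]
    }
}
```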
Java implementation of ZAB protocol
The Java implementation of the ZAB protocol differs slightly from its definition. In the actual implementation, the election phase uses Fast Leader Election: a node first proposes to all servers that it wants to become the Leader, and when the other servers receive the proposal they evaluate the Epoch information; if they accept the other party's proposal, they send back a message confirming it. In addition, the Java implementation merges the discovery phase and the synchronization phase into a single recovery phase (Recovery). The Java implementation of the ZAB protocol therefore has only 3 phases: Fast Leader Election, Recovery, and Broadcast.
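The decision between two competing votes can be sketched as below, modeled on the (Epoch, Zxid, server id) comparison order that Fast Leader Election is commonly described as using; the class and record names are illustrative, not ZooKeeper's actual classes:

```java
// Illustrative vote comparison in the spirit of Fast Leader Election:
// a vote wins if it carries a larger Epoch, then a larger Zxid, then a larger server id.
public class FastLeaderElectionSketch {

    record Vote(long epoch, long zxid, long serverId) {}

    // Returns true when the newly received vote should replace our current vote.
    static boolean newVoteWins(Vote received, Vote current) {
        if (received.epoch() != current.epoch()) {
            return received.epoch() > current.epoch();
        }
        if (received.zxid() != current.zxid()) {
            return received.zxid() > current.zxid();
        }
        return received.serverId() > current.serverId();
    }

    public static void main(String[] args) {
        Vote s1 = new Vote(1, 100, 1);
        Vote s2 = new Vote(1, 100, 2);
        // Same Epoch and Zxid, so the larger server id (S2) wins the comparison.
        System.out.println(newVoteWins(s2, s1)); // true
    }
}
```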
The specific election process
Take the election in a cluster of 5 servers as an example:
S1 starts
S1 proposes itself as the Leader and votes for itself, then sends its vote to the other servers. Since no other server has started yet, S1 receives no replies and remains in the Looking state.
S2 starts
S2 proposes itself as the Leader and votes for itself, then exchanges votes with S1. Because both servers start with the same Zxid, the tie is broken by the server id; S2's id is larger, so S2 wins, but its votes do not yet exceed half of the cluster, so both S1 and S2 remain in the Looking state.
S3 starts
S3 proposes itself as the Leader and votes for itself, then exchanges votes with S1 and S2. By the same comparison S3 wins, and it now holds more than half of the votes (3 of 5), so S3 becomes the Leader and S1 and S2 become Followers.
S4 starts
S4 proposes itself as the Leader and votes for itself, then exchanges votes with S1, S2, and S3, finds that S3 is already the Leader, and becomes a Follower.
S5 starts
S5 proposes itself as the Leader and votes for itself, then exchanges votes with S1, S2, S3, and S4, finds that S3 is already the Leader, and becomes a Follower.
A few questions about ZooKeeper clusters
Why is an odd number of Servers generally deployed in a ZooKeeper cluster?
This is because, after several servers in a ZooKeeper cluster go down, the cluster can keep serving as long as the number of remaining servers is greater than the number of downed servers; with n servers in the cluster, the number of surviving servers must be greater than n/2.
Suppose there are 3 servers: at most 1 ZooKeeper server may be down. With 4 servers, still only 1 may be down. With 5 servers, at most 2 may be down, and with 6 servers, still only 2 may be down. The fault tolerance of 2n and 2n-1 servers is therefore the same, so an odd number of Servers is sufficient.
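These numbers all follow from the same arithmetic; here is a quick worked check (a sketch, not ZooKeeper code):

```java
// Worked check of the fault-tolerance numbers above: with n servers, the cluster
// needs a strict majority alive, so it tolerates (n - 1) / 2 failures.
public class QuorumMath {
    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            int quorum = n / 2 + 1;          // minimum servers that must stay up
            int tolerated = (n - 1) / 2;     // maximum servers that may be down
            System.out.println(n + " servers: quorum " + quorum + ", tolerates " + tolerated + " down");
        }
        // 3 -> tolerates 1, 4 -> 1, 5 -> 2, 6 -> 2: an even size buys no extra fault tolerance.
    }
}
```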
Why does ZooKeeper use the more-than-half (majority) mechanism in elections?
Because the majority mechanism is what prevents cluster split-brain.
Cluster split-brain refers to the situation where a network failure leaves the servers of a cluster unable to communicate normally, so the cluster is divided into several small clusters; each sub-cluster then elects its own master, producing a "split-brain".
Suppose a cluster of 6 servers is deployed across 2 computer rooms, with 3 servers in each. Under normal circumstances there is only one Leader, but when the network between the two rooms is cut, the 3 servers in each room believe that the 3 servers in the other room are offline. Without the more-than-half mechanism, each room would elect its own Leader and provide services to the outside world, and when the network recovered there would be 2 Leaders: the original single brain (the Leader) has split into 2 brains, which is the split-brain phenomenon. During the split, both "brains" may serve clients, which causes problems such as data inconsistency. ZooKeeper's more-than-half mechanism makes 2 Leaders impossible, because a Leader cannot be elected with half of the votes or fewer; no matter how the machines are distributed across the computer rooms, split-brain cannot occur.
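A quick worked check of the 3 + 3 split above: with 6 servers the election quorum is 6/2 + 1 = 4 votes, but after the partition each computer room can gather at most 3 votes, and 3 < 4, so neither side can elect a Leader and split-brain is avoided.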