Replica Assignment assigns SPUs to a replica set and Replica Election coordinates their roles. The election algorithm manages replica sets in an attempt to designate one active leader at all times. SPUs have a powerful multi-threaded engine that can process a large number of leaders and followers at the same time.
If an SPU becomes incapacitated, the election algorithm identifies all impacted replica sets and triggers a re-election. The following section describes the algorithm utilized by each replica set as it elects a new leader.
The Leader and Followers of a Replica Set have different responsibilities.
All followers are in hot standby and ready to take over as leader.
Each data stream has a Live Replica Set (LRS) that describes the SPUs actively replicating data records in their local data store. LRS status can be viewed with the show partitions CLI command.
Replica election covers two core cases:
an SPU goes offline
a known SPU comes back online
When an SPU goes offline, the SC identifies all impacted Replica Sets and triggers an election:
set Replica Set status to Election
choose leader candidate from the followers in LRS based on smallest lag behind the previous leader:
leader candidate found:
no eligible leader candidate available:
All SPUs in the Replica Set receive the proposed leader candidate and perform the following operations:
SPU that matches leader candidate tries to promote follower replica to leader replica:
follower to leader promotion successful
follower to leader promotion failed
Other SPUs ignore the message
The SC performs the following operations:
The SC chooses the next Leader Candidate from the LRS list and the process repeats.
If no eligible leader candidate left:
All SPU followers update their LRS:
When a known SPU comes back Online, the SC identifies all impacted Replica Sets and triggers a refresh.
For all Replica Sets with status Offline, the SC performs the following operations:
choose leader candidate from follower membership list based on smallest lag behind the previous leader:
no eligible leader candidate left:
The algorithm repeats the same steps as in the “SPU goes Offline” section.
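The candidate selection described above can be sketched in a few lines. This is a minimal illustration, not the SC's actual code: the `Follower` type, its fields, and the `try_promote` callback are hypothetical names introduced here. It ranks online followers by their lag behind the previous leader and tries each in turn until a promotion succeeds.

```rust
/// A follower as seen by the SC (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct Follower {
    pub spu_id: u32,
    /// Log End Offset last reported by this follower.
    pub leo: u64,
    /// Whether the SPU hosting this follower is currently online.
    pub online: bool,
}

/// Orders the online followers in the LRS by how far they lag behind the
/// previous leader's LEO, smallest lag first. Ties keep their LRS order.
pub fn ranked_candidates(lrs: &[Follower], previous_leader_leo: u64) -> Vec<u32> {
    let mut eligible: Vec<&Follower> = lrs.iter().filter(|f| f.online).collect();
    eligible.sort_by_key(|f| previous_leader_leo.saturating_sub(f.leo));
    eligible.iter().map(|f| f.spu_id).collect()
}

/// Proposes candidates in order until one promotion succeeds.
/// Returns `None` when no eligible leader candidate is left.
pub fn elect<F>(lrs: &[Follower], previous_leader_leo: u64, mut try_promote: F) -> Option<u32>
where
    F: FnMut(u32) -> bool,
{
    ranked_candidates(lrs, previous_leader_leo)
        .into_iter()
        .find(|&spu_id| try_promote(spu_id))
}
```

Here `try_promote` stands in for the round trip in which the SC proposes a candidate and the matching SPU reports whether the follower-to-leader promotion succeeded.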
Each SPU has a Leader Controller that manages leader replicas, and a Follower Controller that manages follower replicas. The SPU uses Rust's async framework to run a large number of leader and follower operations concurrently.
Each Replica Set has a communication channel over which the leader and followers exchange replica information. It is the responsibility of the followers to establish a connection to the leader. Once a connection between two SPUs is created, it is shared by all replica sets.
For example, consider three replica sets a, b, and c that are distributed across SPU-1, SPU-2, and SPU-3:
The first follower (b or c) from SPU-1 that tries to communicate with its leader in SPU-2 opens a TCP connection. All subsequent communication from SPU-1 to SPU-2, irrespective of the replica set, reuses the same connection.
Hence, each SPU pair will have at most two connections, one opened in each direction. For example:
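A minimal sketch of connection sharing, assuming a per-SPU table keyed by the leader's SPU id. The `FollowerConnections` and `Connection` types are hypothetical and only model the bookkeeping; a real SPU would hold a TCP socket in place of the `Connection` struct.

```rust
use std::collections::HashMap;

/// Identifier of an SPU (hypothetical alias for illustration).
pub type SpuId = u32;

/// Stand-in for a TCP connection between two SPUs.
#[derive(Debug)]
pub struct Connection {
    pub from: SpuId,
    pub to: SpuId,
}

/// Connections opened by the followers on one SPU, keyed by the leader's SPU.
pub struct FollowerConnections {
    local: SpuId,
    by_leader: HashMap<SpuId, Connection>,
}

impl FollowerConnections {
    pub fn new(local: SpuId) -> Self {
        Self { local, by_leader: HashMap::new() }
    }

    /// Returns the shared connection to `leader`, opening it on first use.
    /// Later calls for any replica set reuse the same entry.
    pub fn connect(&mut self, leader: SpuId) -> &Connection {
        let local = self.local;
        self.by_leader
            .entry(leader)
            .or_insert_with(|| Connection { from: local, to: leader })
    }

    /// Number of distinct outgoing connections held by this SPU.
    pub fn open_connections(&self) -> usize {
        self.by_leader.len()
    }
}
```

With followers b and c on SPU-1 both calling `connect(2)`, SPU-1 holds a single connection to SPU-2. If SPU-2 also hosts a follower whose leader is on SPU-1, it opens one connection in the opposite direction, which gives the two-connection upper bound per SPU pair.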
Replicas use offsets to indicate the position of a record in a data stream. Offsets start at zero and are incremented by one each time a new record is appended.
Log End Offset (LEO) represents the offset of the last record in the local store of a replica. A record is considered committed only when it has been replicated by all live replicas. The Live Replica Set (LRS) is the set of active replicas in the membership list. High Watermark (HW) is the offset of the last record committed by the LRS.
The synchronization algorithm collects the LEOs, computes the HW, and manages the LRS.
In this example:
If Follower-2 goes offline: LRS = 2 and HW = 3.
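Because a record is committed only once every live replica has it, the HW can be computed as the smallest LEO in the LRS. The sketch below illustrates that calculation; the `Replica` type and the LEO values used in the usage note are assumptions chosen for illustration, not figures from the original example.

```rust
/// A replica in the membership list (hypothetical type for illustration).
#[derive(Debug, Clone)]
pub struct Replica {
    pub name: &'static str,
    /// Log End Offset: offset of the last record in the local store.
    pub leo: u64,
    /// Whether the replica is currently part of the LRS.
    pub live: bool,
}

/// Returns (LRS size, HW). The HW is the smallest LEO among live replicas.
/// The HW is `None` when no replica is live.
pub fn lrs_and_hw(replicas: &[Replica]) -> (usize, Option<u64>) {
    let live_leos: Vec<u64> = replicas
        .iter()
        .filter(|r| r.live)
        .map(|r| r.leo)
        .collect();
    (live_leos.len(), live_leos.iter().copied().min())
}
```

Assuming a leader with LEO 4, Follower-1 with LEO 3, and Follower-2 with LEO 2, all live, the function yields an LRS of 3 and an HW of 2. Marking Follower-2 as not live yields an LRS of 2 and an HW of 3.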
All replica followers send their replica status, LEO and HW, to their leader. The leader:
Replica followers receive the data records, LEO, and HW from the leader and perform the following operations:
And the cycle repeats.
If the leader detects that a follower's HW lags behind the LRS HW by more than a maximum number of records, the leader removes that follower from the LRS.
Followers removed from the LRS are ineligible for election but continue to receive records. If a follower catches up with the leader, it is added back to the LRS and once again becomes eligible for election.
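The membership rule above can be sketched as a periodic check run by the leader. This is an illustration under stated assumptions: `FollowerStatus` and `max_lag` are hypothetical names, and "catching up" is interpreted here as the follower's HW reaching the LRS HW.

```rust
/// Follower state tracked by the leader (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct FollowerStatus {
    pub spu_id: u32,
    /// High Watermark last reported by the follower.
    pub hw: u64,
    /// Whether the follower is currently in the LRS.
    pub in_lrs: bool,
}

/// Re-evaluates LRS membership for every follower.
/// `max_lag` is an assumed threshold, not a documented setting.
pub fn refresh_lrs(followers: &mut [FollowerStatus], lrs_hw: u64, max_lag: u64) {
    for f in followers.iter_mut() {
        let lag = lrs_hw.saturating_sub(f.hw);
        if f.in_lrs && lag > max_lag {
            // Too far behind: removed from the LRS, ineligible for election.
            f.in_lrs = false;
        } else if !f.in_lrs && lag == 0 {
            // Caught up: added back to the LRS, eligible again.
            f.in_lrs = true;
        }
    }
}
```

A follower that is behind by less than the threshold keeps its current status, so a removed follower is not re-added until it has fully caught up.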
If a leader goes offline, an election is triggered and one of the followers takes over as leader. The rest of the followers connect to the new leader and synchronize their data store.
When the failed leader rejoins the replica set, it detects the new leader and turns itself into a follower. The replica set continues under the new leadership until a new election is triggered.
Replica leaders receive data records from producers and send them to consumers.
Consumers can choose to receive either COMMITTED or UNCOMMITTED records. The second method is discouraged as it cannot deterministically survive various failure scenarios.
By default, only COMMITTED records are sent to consumers.
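The difference between the two modes can be illustrated with a small sketch in which a record's index in the log is its offset. The `Isolation` enum and `visible_records` function are hypothetical names for illustration and do not represent the consumer API.

```rust
/// Which records a consumer asks for (hypothetical enum mirroring the text).
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum Isolation {
    /// Only records up to and including the HW.
    #[default]
    Committed,
    /// Every record in the leader's local store, up to the LEO.
    Uncommitted,
}

/// Returns the slice of the leader's log that a consumer may read.
/// `hw` is `None` when no record has been committed yet.
pub fn visible_records<T>(log: &[T], hw: Option<u64>, isolation: Isolation) -> &[T] {
    match isolation {
        Isolation::Uncommitted => log,
        Isolation::Committed => match hw {
            Some(hw) => {
                // The HW is an inclusive offset, so the slice ends at hw + 1.
                let end = (hw as usize + 1).min(log.len());
                &log[..end]
            }
            None => &log[..0],
        },
    }
}
```

With four records in the log and an HW of 1, a COMMITTED consumer sees the first two records, while an UNCOMMITTED consumer sees all four, including two that could still be lost if the leader fails before they are replicated.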