Paxos

Source: Internet. Published by: China Academic Papers Database. Editor: 程序博客网. Time: 2024/06/15 21:50

Basic Paxos

implementing a state machine

A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state.

An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers.

Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.
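The determinism argument above can be illustrated with a minimal sketch. The `KeyValueStateMachine` class, the command format, and `run_replica` are hypothetical names invented for this example; they are not part of Paxos itself.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str, int]  # (operation, key, amount): an assumed toy format


class KeyValueStateMachine:
    """A toy deterministic state machine whose state is a dictionary."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def apply(self, command: Command) -> int:
        """Perform one step: consume a command, update the state, return an output."""
        op, key, amount = command
        if op == "set":
            self.state[key] = amount
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + amount
        # Any other operation leaves the state unchanged.
        return self.state.get(key, 0)


def run_replica(log: List[Command]) -> Tuple[Dict[str, int], List[int]]:
    """Execute the whole log on a fresh replica and return (final state, outputs)."""
    machine = KeyValueStateMachine()
    outputs = [machine.apply(command) for command in log]
    return machine.state, outputs


log = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]
replica_a = run_replica(log)
replica_b = run_replica(log)
```

Both replicas end in the same state with the same outputs, which is why a client can use the output from any of them. The hard part, agreeing on the order of the log, is what the consensus algorithm provides.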

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals). Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm. It will usually succeed. It might fail because another server also believes itself to be the leader and has a different idea of what the 135th command should be.

Suppose the leader knows the values chosen in instances 1–134, 138, and 139 of the consensus algorithm. It then executes phase 1 of instances 135–137 and of all instances greater than 139. Suppose that the outcome of these executions determines the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.

The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “no-op” command that leaves the state unchanged. Once these no-op commands have been chosen, commands 138–140 can be executed.
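The gap scenario can be sketched as follows. The helpers `executable_prefix` and `gaps` and the `NOOP` marker are hypothetical names for illustration. The sketch only tracks which instances are chosen and assumes each no-op has already gone through phase 2.

```python
from typing import Dict, List

NOOP = "no-op"  # hypothetical marker for a command that leaves the state unchanged


def executable_prefix(chosen: Dict[int, str]) -> int:
    """Return the largest n such that instances 1..n are all chosen."""
    n = 0
    while (n + 1) in chosen:
        n += 1
    return n


def gaps(chosen: Dict[int, str]) -> List[int]:
    """Return the unchosen instances below the highest chosen instance."""
    if not chosen:
        return []
    return [i for i in range(1, max(chosen) + 1) if i not in chosen]


# Scenario from the text: instances 1-135 and 138-140 have been chosen.
chosen = {i: f"cmd{i}" for i in list(range(1, 136)) + [138, 139, 140]}
before = executable_prefix(chosen)
missing = gaps(chosen)

# Assumption: a no-op is recorded only after it has been chosen via phase 2.
for instance in missing:
    chosen[instance] = NOOP
after = executable_prefix(chosen)
```

Here `before` is 135, `missing` is `[136, 137]`, and `after` is 140, matching the walk-through above.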

When the leader fails to receive the expected response to its phase 2 messages in instance 141, it will retransmit those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands.

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.)

two phase

Phase 1a: Prepare

A Proposer (the leader) creates a proposal identified with a number N. Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors.

Phase 1b: Promise

If the proposal's number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number less than N. Otherwise, the Acceptor can ignore the received proposal.

Phase 2a: Accept Request

If a Proposer receives promises from a Quorum of Acceptors, it needs to set a value for its proposal. If any Acceptors had previously accepted a proposal, they will have sent the accepted proposal number and value to the Proposer, which must now set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors.

If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose any value for its proposal.
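The value-selection rule of Phase 2a can be sketched as a small function. The `Promise` record and `choose_value` are hypothetical names. The sketch assumes each promise carries the highest-numbered proposal its acceptor has accepted, or `None` if it has accepted nothing.

```python
from typing import List, NamedTuple, Optional


class Promise(NamedTuple):
    """A Phase 1b reply, reduced to the fields Phase 2a needs (assumed layout)."""
    accepted_n: Optional[int]       # number of the proposal previously accepted, if any
    accepted_value: Optional[str]   # value of that proposal, if any


def choose_value(promises: List[Promise], own_value: str) -> str:
    """Pick the value for the Accept Request from a quorum of promises."""
    accepted = [p for p in promises if p.accepted_n is not None]
    if not accepted:
        # No acceptor in the quorum has accepted anything: the proposer is free.
        return own_value
    # Otherwise the proposer must adopt the value with the highest proposal number.
    return max(accepted, key=lambda p: p.accepted_n).accepted_value


free_choice = choose_value([Promise(None, None), Promise(None, None)], "mine")
forced_choice = choose_value(
    [Promise(3, "a"), Promise(None, None), Promise(5, "b")], "mine"
)
```

In the second call the proposer cannot use its own value; it has to re-propose `"b"`, the value reported with the highest number.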

Phase 2b: Accepted

If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only if it has not already promised to consider only proposals having an identifier greater than N. In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Otherwise, it can ignore the Accept Request.
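The acceptor side of both phases can be sketched as one small class. This is a single-instance, in-memory sketch under simplifying assumptions: no networking, no stable storage, and integer proposal numbers. The class name, method names, and message tuples are hypothetical.

```python
from typing import Optional, Tuple


class Acceptor:
    """In-memory acceptor for one consensus instance (illustrative only)."""

    def __init__(self) -> None:
        self.promised_n: Optional[int] = None      # highest number promised so far
        self.accepted_n: Optional[int] = None      # number of the accepted proposal
        self.accepted_value: Optional[str] = None  # value of the accepted proposal

    def on_prepare(self, n: int) -> Optional[Tuple]:
        """Phase 1b: promise only if n is higher than anything seen before."""
        if self.promised_n is not None and n <= self.promised_n:
            return None  # ignore the Prepare
        self.promised_n = n
        # The promise reports any previously accepted proposal to the proposer.
        return ("promise", n, self.accepted_n, self.accepted_value)

    def on_accept(self, n: int, value: str) -> Optional[Tuple]:
        """Phase 2b: accept unless a higher-numbered promise has been made."""
        if self.promised_n is not None and n < self.promised_n:
            return None  # ignore the Accept Request
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return ("accepted", n, value)


acceptor = Acceptor()
first_promise = acceptor.on_prepare(1)
first_accept = acceptor.on_accept(1, "x")
second_promise = acceptor.on_prepare(2)
stale_accept = acceptor.on_accept(1, "y")
```

After promising proposal 2, the acceptor ignores the stale Accept Request for proposal 1, and its second promise reports the earlier accepted pair `(1, "x")`. A real acceptor would also have to persist these three fields before replying.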

roles

Acceptor

Acceptors are collected into groups called Quorums.

Learner

Learners act as the replication factor for the protocol.

proof of safety

Only a single value is chosen.

Assume that two different values, v1 and v2, are chosen. According to the Paxos algorithm, the only way for a value to be chosen is for a majority of acceptors to accept the same accept request from a proposer. Hence, a majority set of acceptors A1 must have accepted an accept request with a proposal [n1, v1], and similarly a majority set of acceptors A2 must have accepted an accept request with a proposal [n2, v2].

If the two proposal numbers are the same, i.e., n1 = n2, then because the two sets A1 and A2 must intersect in at least one acceptor, this acceptor must have accepted two different proposals with the same proposal number. This is impossible because, according to the Paxos algorithm, an acceptor would ignore prepare and accept requests whose proposal number is identical to that of a prepare or accept request it has already accepted.

If n1 ≠ n2, without loss of generality assume that n1 < n2. We first further assume that n1 and n2 are for consecutive proposal rounds. The majority set A1 must have accepted the accept request with proposal number n1 before the majority set A2 accepted the accept request with proposal number n2, because an acceptor would ignore a prepare or accept request if it contains a proposal number smaller than the one it has acknowledged in response to a prepare request. Furthermore, according to the Paxos algorithm, the value selected by a proposer for the accept request must either come from the earlier proposal with the highest proposal number or be a value of its own if no earlier proposal is included in the acknowledgement messages. Because A1 and A2 must intersect in at least one acceptor, this acceptor must have accepted both the accept request for the proposal [n1, v1] and the accept request for the proposal [n2, v2]. This is impossible because that acceptor would have included the proposal [n1, v1] in its acknowledgement to the prepare request for the proposal with proposal number n2, and the proposer must then have selected the value v1 instead of v2.

When n1 and n2 are not consecutive, the same argument applies by induction over the intermediate rounds: every proposal numbered between n1 and n2 is likewise forced to carry the value v1.
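The proof relies on the fact that any two majorities of the same acceptor set share at least one member. A brute-force check for small clusters, with hypothetical helper names, might look like this:

```python
from itertools import combinations
from typing import List, Set


def majorities(acceptors: List[str]) -> List[Set[str]]:
    """Enumerate every subset that contains more than half of the acceptors."""
    need = len(acceptors) // 2 + 1
    return [
        set(group)
        for size in range(need, len(acceptors) + 1)
        for group in combinations(acceptors, size)
    ]


def all_majorities_intersect(acceptors: List[str]) -> bool:
    """True if every pair of majorities has at least one acceptor in common."""
    quorums = majorities(acceptors)
    return all(a & b for a in quorums for b in quorums)


five = ["a1", "a2", "a3", "a4", "a5"]
five_ok = all_majorities_intersect(five)
```

The check is only a sanity test, not a proof. The general statement follows from counting: two sets that each hold more than half of the acceptors cannot be disjoint.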

Error case

node failure

case 1: failure of an acceptor when a Quorum of acceptors remains live

The simplest error case is the failure of an Acceptor when a Quorum of Acceptors remains live. In this case, the protocol requires no recovery. No additional rounds or messages are required.

case 2: proposer fails after proposing a value before agreement is reached

The next failure case is when a Proposer fails after proposing a value, but before agreement is reached.

Re-election not shown.

case 3: current leader fails and later recovers

The current leader may fail and later recover, but the other Proposers have already re-elected a new leader.

This is a liveness exception.

Let's try to find the solution to this from libevent Paxos.

If the old leader comes back after a new leader has been elected, it will receive a heartbeat with a higher view id from the new leader. The old leader then updates its own view and tries to catch up by requesting the missing history from other nodes. In terms of the protocol, it can have a proposer issue a proposal to learn about the missing values.

On the other hand, if the old leader recovers before a new leader has been elected, the election will be aborted.

Note: How does the cohort handle the message sent from the old leader?

network partition

case 1: concurrent monotonic proposals

case 2: concurrent monotonic proposals (majorities overlap)

Note: To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.
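The learner rule in the note can be sketched as a vote counter. The `Learner` class and its `on_accepted` method are hypothetical. The sketch assumes a fixed, known number of acceptors and a single consensus instance.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class Learner:
    """Counts Accepted messages and detects when a proposal reaches a majority."""

    def __init__(self, num_acceptors: int) -> None:
        self.majority = num_acceptors // 2 + 1
        # Maps a proposal (number, value) to the acceptors that reported accepting it.
        self.votes: Dict[Tuple[int, str], Set[str]] = defaultdict(set)
        self.chosen: Optional[str] = None

    def on_accepted(self, acceptor_id: str, n: int, value: str) -> Optional[str]:
        """Record one Accepted message; return the chosen value once it is known."""
        self.votes[(n, value)].add(acceptor_id)  # a set ignores duplicate messages
        if self.chosen is None and len(self.votes[(n, value)]) >= self.majority:
            self.chosen = value
        return self.chosen


learner = Learner(num_acceptors=3)
after_one = learner.on_accepted("a1", 1, "x")
after_duplicate = learner.on_accepted("a1", 1, "x")
after_two = learner.on_accepted("a2", 1, "x")
```

Votes are counted per proposal, so acceptances of different proposal numbers are never added together, and a retransmitted message from the same acceptor does not count twice.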

case 3: sequential monotonic proposals

message loss

Phase 1a, 1b:

When the leader fails to receive the expected response in Phase 1 upon timeout, it will retransmit the Prepare message.

Phase 2a:

If the Accept Request message gets lost, start another round with a higher proposal number.
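One common way to obtain a higher proposal number that is also distinct from every other proposer's numbers is to pair a round counter with a server id. This scheme is an assumption for illustration, not something the text prescribes, and `next_proposal_number` is a hypothetical helper.

```python
from typing import Tuple

ProposalNumber = Tuple[int, int]  # (round, server_id), compared lexicographically


def next_proposal_number(highest_seen: ProposalNumber, server_id: int) -> ProposalNumber:
    """Return a number greater than highest_seen and unique to this server."""
    round_no = highest_seen[0] + 1
    return (round_no, server_id)


seen = (3, 2)                           # highest number this proposer has observed
retry = next_proposal_number(seen, 1)   # server 1 starts a new round
rival = next_proposal_number(seen, 2)   # server 2 does the same concurrently
```

Python compares tuples element by element, so `retry` is greater than `seen`, and `retry` and `rival` differ even though both servers picked the same round.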

Phase 2b:

Because of message loss, a value could be chosen with no learner ever finding out. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal.

0 0