1 The Paxos Algorithm and the ZAB Protocol

"All consistency protocols are, in essence, either Paxos or a variant of it." (Mike Burrows, author of Google Chubby)



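There is a small island called Paxos, home to a group of residents. Everything on the island is decided by a special group of people called senators (Senator). The total number of senators (Senator Count) is fixed and cannot change. Every change to the island's affairs must go through a proposal (Proposal), and each proposal carries a number (PID) that only ever increases and never goes backwards. A proposal takes effect only when more than half of the senators, that is at least (Senator Count)/2 + 1, agree to it. A senator agrees only to proposals whose number is greater than his current number; the current number covers both proposals that have taken effect and those that have not. If a senator receives a proposal whose number is less than or equal to his current number, he rejects it and tells the proposer: "Your proposal has already been raised by someone else." The current number is the one each senator records in his own notebook and keeps updating. The parliament cannot guarantee that the numbers in all senators' notebooks are always identical. Its goal is to make sure that all senators reach the same view on every proposal.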


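Now consider how a conflict is resolved. Suppose there are three senators, S1 to S3, and S1 and S2 raise a proposal at the same time: proposal No. 1, which sets the electricity price. S1 wants 1 yuan/kWh and S2 wants 2 yuan/kWh. S3 happens to receive S1's proposal first, so he agrees to it and records number 1 in his notebook. Right afterwards he receives S2's proposal, checks his notebook, and finds that its number is less than or equal to his current number, 1. He therefore rejects it: "Sorry, this proposal has already been raised." S2's proposal is rejected, and S1 formally announces that proposal No. 1 has taken effect. S2 asks S1 or S3 for the content of decree No. 1 and updates his notebook, after which he may go on to raise proposal No. 2.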
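The conflict above can be replayed with a minimal sketch. The class and method names (`Senator`, `Parliament`, `vote`, `countVotes`) are hypothetical and exist only for illustration; the sketch assumes each proposer also votes for his own proposal, which the story does not state explicitly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a senator and his notebook.
final class Senator {
    private long currentPid = 0; // highest proposal number recorded so far

    // Agree only if the proposal number is strictly greater than the recorded one.
    boolean vote(long pid) {
        if (pid <= currentPid) {
            return false; // "this proposal has already been raised"
        }
        currentPid = pid; // update the notebook
        return true;
    }
}

final class Parliament {
    // More than half of the senators: (Senator Count) / 2 + 1.
    static int majority(int senatorCount) {
        return senatorCount / 2 + 1;
    }

    // Ask each voter in order and count how many agree to proposal pid.
    static int countVotes(long pid, List<Senator> voters) {
        int yes = 0;
        for (Senator s : voters) {
            if (s.vote(pid)) {
                yes++;
            }
        }
        return yes;
    }

    public static void main(String[] args) {
        Senator s1 = new Senator(), s2 = new Senator(), s3 = new Senator();
        // S3 hears from S1 first, then from S2; both use proposal number 1.
        int s1Votes = countVotes(1, Arrays.asList(s1, s3));
        int s2Votes = countVotes(1, Arrays.asList(s2, s3));
        System.out.println("S1 passes: " + (s1Votes >= majority(3)));
        System.out.println("S2 passes: " + (s2Votes >= majority(3)));
    }
}
```

S1 collects two votes out of three and reaches the majority, while S2 only has his own vote because S3's notebook already holds number 1.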

1. Prepare phase: the proposer puts forward a proposal.
2. Accept phase: the senators give their feedback and the proposal is accepted.
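The two phases can be sketched from the point of view of a single voter (an "acceptor" in textbook Paxos terms). This is a simplified textbook-style sketch, not ZooKeeper code: the class name `Acceptor` and its methods are hypothetical, and a full implementation would also return any previously accepted number and value in the reply to `prepare`.

```java
// Minimal single-decree Paxos acceptor (illustrative sketch only).
final class Acceptor {
    private long promised = -1;   // highest proposal number promised so far
    private String acceptedValue; // value accepted so far, null if none

    // Prepare phase: promise to ignore proposals numbered at or below pid's predecessors.
    // Returns true when the promise is granted.
    synchronized boolean prepare(long pid) {
        if (pid <= promised) {
            return false; // a proposal with this or a higher number was already seen
        }
        promised = pid;
        return true;
    }

    // Accept phase: accept the value unless a higher-numbered promise was made meanwhile.
    synchronized boolean accept(long pid, String value) {
        if (pid < promised) {
            return false;
        }
        promised = pid;
        acceptedValue = value;
        return true;
    }

    synchronized String acceptedValue() {
        return acceptedValue;
    }
}
```

A proposer would call `prepare` on every acceptor, and only after a majority has promised would it call `accept` with its value.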


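ZAB (ZooKeeper Atomic Broadcast) is an atomic message broadcast protocol and can be seen as a simplification of Paxos. ZAB requires that only one node in the whole ensemble, the Leader, handles all client transaction requests and issues proposals. It also defines a crash recovery mechanism: once the Leader crashes, the ensemble stops accepting transaction requests until a fast election produces a new Leader. Having a single Leader guarantees a global order of transactions, avoids several nodes issuing competing proposals, speeds up convergence, and lets a majority form as quickly as possible.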
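One way to see how a single Leader yields a global order is the transaction id (zxid). In ZooKeeper the zxid is a 64-bit number whose high 32 bits hold the Leader's epoch and whose low 32 bits hold a counter within that epoch. The helper below is a sketch of that layout; `ZxidSketch` and its method names are hypothetical and are not the ZooKeeper API.

```java
// Sketch of the zxid layout: epoch in the high 32 bits, counter in the low 32 bits.
final class ZxidSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long lastOfOldLeader = makeZxid(1, 100);
        long firstOfNewLeader = makeZxid(2, 0);
        // Every transaction of a newer epoch sorts after all transactions of older epochs.
        System.out.println(firstOfNewLeader > lastOfOldLeader);
    }
}
```

Because the epoch occupies the high bits, plain numeric comparison of two zxids orders transactions first by Leader epoch and then by the order in which that Leader issued them.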


2 The ZooKeeper Source Code Implementation



    public class ZooKeeperServer
        ...

        public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOException {
            // We have the request, now process and setup for next
            InputStream bais = new ByteBufferInputStream(incomingBuffer);
            BinaryInputArchive bia = BinaryInputArchive.getArchive(bais);
            RequestHeader h = new RequestHeader();
            h.deserialize(bia, "header");
            // Through the magic of byte buffers, txn will not be
            // pointing to the start of the txn
            incomingBuffer = incomingBuffer.slice();
            if (h.getType() == OpCode.auth) {
                ...
            } else {
                if (h.getType() == OpCode.sasl) {
                    ...
                }
                else { // an ordinary request
                    Request si = new Request(cnxn, cnxn.getSessionId(), h.getXid(),
                      h.getType(), incomingBuffer, cnxn.getAuthInfo());
                    si.setOwner(ServerCnxn.me);
                    // Always treat packet from the client as a possible
                    // local request.
                    setLocalSessionFlag(si);
                    submitRequest(si);
                }
            }
            cnxn.incrOutstandingRequests(h);
        }

        public void submitRequest(Request si) {
            if (firstProcessor == null) {
                synchronized (this) {
                    try {
                        // Since all requests are passed to the request
                        // processor it should wait for setting up the request
                        // processor chain. The state will be updated to RUNNING
                        // after the setup.
                        while (state == State.INITIAL) {
                            wait(1000);
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Unexpected interruption", e);
                    }
                    if (firstProcessor == null || state != State.RUNNING) {
                        throw new RuntimeException("Not started");
                    }
                }
            }
            // firstProcessor has been initialized at this point
            try {
                touch(si.cnxn);
                boolean validpacket = Request.isValid(si.type);
                if (validpacket) {
                    firstProcessor.processRequest(si);
                    if (si.cnxn != null) {
                        incInProcess();
                    }
                }
                ...
        }

        protected RequestProcessor firstProcessor;

        protected void setupRequestProcessors() {
            RequestProcessor finalProcessor = new FinalRequestProcessor(this);
            RequestProcessor syncProcessor = new SyncRequestProcessor(this,
                    finalProcessor);
            ((SyncRequestProcessor)syncProcessor).start();
            firstProcessor = new PrepRequestProcessor(this, syncProcessor);
            ((PrepRequestProcessor)firstProcessor).start();
        }

        public synchronized void startup() {
            if (sessionTracker == null) {
                createSessionTracker();
            }
            startSessionTracker();
            setupRequestProcessors();
            registerJMX();
            setState(State.RUNNING);
            notifyAll();
        }
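The chain built in setupRequestProcessors() is a chain of responsibility: each processor handles a request and then hands it to the next one. The following self-contained sketch shows only that wiring pattern. The names `Processor`, `Stage`, and `ChainSketch` are hypothetical, and the real processors run their own threads and queues, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a request processor.
interface Processor {
    void process(String request);
}

final class ChainSketch {
    // A stage records that it saw the request, then passes it on.
    static final class Stage implements Processor {
        private final String name;
        private final Processor next; // null for the last stage
        private final List<String> trace;

        Stage(String name, Processor next, List<String> trace) {
            this.name = name;
            this.next = next;
            this.trace = trace;
        }

        @Override
        public void process(String request) {
            trace.add(name + ":" + request);
            if (next != null) {
                next.process(request);
            }
        }
    }

    // Build the chain back to front, as setupRequestProcessors() does.
    static List<String> run(String request) {
        List<String> trace = new ArrayList<>();
        Processor fin = new Stage("Final", null, trace);
        Processor sync = new Stage("Sync", fin, trace);
        Processor prep = new Stage("Prep", sync, trace);
        prep.process(request);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("create"));
    }
}
```

Building the chain from the last processor to the first means every stage can receive its successor through the constructor, so no stage needs to be mutated after creation.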






The Leader's processor chain: LeaderRequestProcessor -> PrepRequestProcessor -> ProposalRequestProcessor -> CommitProcessor -> Leader.ToBeAppliedRequestProcessor -> FinalRequestProcessor.



    // PrepRequestProcessor
    public void run() {
        try {
            while (true) {
                Request request = submittedRequests.take();
                ...
                pRequest(request);
            }
        ...
    }

    protected void pRequest(Request request) throws RequestProcessorException {
        request.setHdr(null);
        request.setTxn(null);
        try {
            switch (request.type) {
            case OpCode.createContainer:
            case OpCode.create:
            case OpCode.create2:
                CreateRequest create2Request = new CreateRequest();
                pRequest2Txn(request.type, zks.getNextZxid(), request, create2Request, true);
                break;
            case OpCode.createTTL:
                CreateTTLRequest createTtlRequest = new CreateTTLRequest();
                pRequest2Txn(request.type, zks.getNextZxid(), request, createTtlRequest, true);
                break;
            case OpCode.deleteContainer:
            case OpCode.delete:
                DeleteRequest deleteRequest = new DeleteRequest();
                pRequest2Txn(request.type, zks.getNextZxid(), request, deleteRequest, true);
                break;
            case OpCode.setData:
                SetDataRequest setDataRequest = new SetDataRequest();
                pRequest2Txn(request.type, zks.getNextZxid(), request, setDataRequest, true);
                break;
            case OpCode.reconfig:
                ReconfigRequest reconfigRequest = new ReconfigRequest();
                ByteBufferInputStream.byteBuffer2Record(request.request, reconfigRequest);
                pRequest2Txn(request.type, zks.getNextZxid(), request, reconfigRequest, true);
                break;
            case OpCode.setACL:
                SetACLRequest setAclRequest = new SetACLRequest();
                pRequest2Txn(request.type, zks.getNextZxid(), request, setAclRequest, true);
                break;
            case OpCode.check:
                CheckVersionRequest checkRequest = new CheckVersionRequest();
                pRequest2Txn(request.type, zks.getNextZxid(), request, checkRequest, true);
                break;
            case OpCode.multi:
                ...
                break;
            //create/close session don't require request record
            case OpCode.createSession:
            case OpCode.closeSession:
                if (!request.isLocalSession()) {
                    pRequest2Txn(request.type, zks.getNextZxid(), request,
                                 null, true);
                }
                break;
            //All the rest don't need to create a Txn - just verify session
            case OpCode.sync:
            case OpCode.exists:
            case OpCode.getData:
            case OpCode.getACL:
            case OpCode.getChildren:
            case OpCode.getChildren2:
            case OpCode.ping:
            case OpCode.setWatches:
            case OpCode.checkWatches:
            case OpCode.removeWatches:
                zks.sessionTracker.checkSession(request.sessionId,
                        request.getOwner());
                break;
            default:
                LOG.warn("unknown type " + request.type);
                break;
            }
        }
        ...
        request.zxid = zks.getZxid();
        nextProcessor.processRequest(request);
    }

    protected void pRequest2Txn(int type, long zxid, Request request,
                                Record record, boolean deserialize)
        throws KeeperException, IOException, RequestProcessorException
    {
        request.setHdr(new TxnHeader(request.sessionId, request.cxid, zxid,
                Time.currentWallTime(), type));
        switch (type) {
            case OpCode.create:
            case OpCode.create2:
            case OpCode.createTTL:
            case OpCode.createContainer: {
                pRequest2TxnCreate(type, request, record, deserialize);
                break;
            }
            case OpCode.deleteContainer: {
                ...
                addChangeRecord(parentRecord);
                addChangeRecord(new ChangeRecord(request.getHdr().getZxid(), path, null, -1, null));
                break;
            }
            case OpCode.delete:
                ...
                break;
            case OpCode.setData:
                zks.sessionTracker.checkSession(request.sessionId, request.getOwner());
                SetDataRequest setDataRequest = (SetDataRequest)record;
                if(deserialize)
                    ByteBufferInputStream.byteBuffer2Record(request.request, setDataRequest);
                path = setDataRequest.getPath();
                validatePath(path, request.sessionId);
                nodeRecord = getRecordForPath(path);
                checkACL(zks, nodeRecord.acl, ZooDefs.Perms.WRITE, request.authInfo);
                int newVersion = checkAndIncVersion(nodeRecord.stat.getVersion(), setDataRequest.getVersion(), path);
                request.setTxn(new SetDataTxn(path, setDataRequest.getData(), newVersion));
                nodeRecord = nodeRecord.duplicate(request.getHdr().getZxid());
                nodeRecord.stat.setVersion(newVersion);
                addChangeRecord(nodeRecord);
                break;
            case OpCode.reconfig:
                ...
                break;
            case OpCode.setACL:
                zks.sessionTracker.checkSession(request.sessionId, request.getOwner());
                SetACLRequest setAclRequest = (SetACLRequest)record;
                if(deserialize)
                    ByteBufferInputStream.byteBuffer2Record(request.request, setAclRequest);
                path = setAclRequest.getPath();
                validatePath(path, request.sessionId);
                List<ACL> listACL = fixupACL(path, request.authInfo, setAclRequest.getAcl());
                nodeRecord = getRecordForPath(path);
                checkACL(zks, nodeRecord.acl, ZooDefs.Perms.ADMIN, request.authInfo);
                newVersion = checkAndIncVersion(nodeRecord.stat.getAversion(), setAclRequest.getVersion(), path);
                request.setTxn(new SetACLTxn(path, listACL, newVersion));
                nodeRecord = nodeRecord.duplicate(request.getHdr().getZxid());
                nodeRecord.stat.setAversion(newVersion);
                addChangeRecord(nodeRecord);
                break;
            case OpCode.createSession:
                request.request.rewind();
                int to = request.request.getInt();
                request.setTxn(new CreateSessionTxn(to));
                request.request.rewind();
                if (request.isLocalSession()) {
                    // This will add to local session tracker if it is enabled
                    zks.sessionTracker.addSession(request.sessionId, to);
                } else {
                    // Explicitly add to global session if the flag is not set
                    zks.sessionTracker.addGlobalSession(request.sessionId, to);
                }
                zks.setOwner(request.sessionId, request.getOwner());
                break;
            case OpCode.closeSession:
                // We don't want to do this check since the session expiration thread
                // queues up this operation without being the session owner.
                // this request is the last of the session so it should be ok
                //zks.sessionTracker.checkSession(request.sessionId, request.getOwner());
                Set<String> es = zks.getZKDatabase()
                        .getEphemerals(request.sessionId);
                synchronized (zks.outstandingChanges) {
                    for (ChangeRecord c : zks.outstandingChanges) {
                        if (c.stat == null) {
                            // Doing a delete
                            es.remove(c.path);
                        } else if (c.stat.getEphemeralOwner() == request.sessionId) {
                            es.add(c.path);
                        }
                    }
                    for (String path2Delete : es) {
                        addChangeRecord(new ChangeRecord(request.getHdr().getZxid(), path2Delete, null, 0, null));
                    }
                    zks.sessionTracker.setSessionClosing(request.sessionId);
                }
                LOG.info("Processed session termination for sessionid: 0x"
                        + Long.toHexString(request.sessionId));
                break;
            case OpCode.check:
                zks.sessionTracker.checkSession(request.sessionId, request.getOwner());
                CheckVersionRequest checkVersionRequest = (CheckVersionRequest)record;
                if(deserialize)
                    ByteBufferInputStream.byteBuffer2Record(request.request, checkVersionRequest);
                path = checkVersionRequest.getPath();
                validatePath(path, request.sessionId);
                nodeRecord = getRecordForPath(path);
                checkACL(zks, nodeRecord.acl, ZooDefs.Perms.READ, request.authInfo);
                request.setTxn(new CheckVersionTxn(path, checkAndIncVersion(nodeRecord.stat.getVersion(),
                        checkVersionRequest.getVersion(), path)));
                break;
            default:
                LOG.warn("unknown type " + type);
                break;
        }
    }

    // Handling of create requests
    private void pRequest2TxnCreate(int type, Request request, Record record, boolean deserialize) throws IOException, KeeperException {
        if (deserialize) {
            ByteBufferInputStream.byteBuffer2Record(request.request, record);
        }
        int flags;
        String path;
        List<ACL> acl;
        byte[] data;
        long ttl;
        if (type == OpCode.createTTL) {
            CreateTTLRequest createTtlRequest = (CreateTTLRequest)record;
            flags = createTtlRequest.getFlags();
            path = createTtlRequest.getPath();
            acl = createTtlRequest.getAcl();
            data = createTtlRequest.getData();
            ttl = createTtlRequest.getTtl();
        } else {
            CreateRequest createRequest = (CreateRequest)record;
            flags = createRequest.getFlags();
            path = createRequest.getPath();
            acl = createRequest.getAcl();
            data = createRequest.getData();
            ttl = 0;
        }
        CreateMode createMode = CreateMode.fromFlag(flags);
        validateCreateRequest(createMode, request);
        String parentPath = validatePathForCreate(path, request.sessionId);
        List<ACL> listACL = fixupACL(path, request.authInfo, acl);
        ChangeRecord parentRecord = getRecordForPath(parentPath);
        checkACL(zks, parentRecord.acl, ZooDefs.Perms.CREATE, request.authInfo);
        int parentCVersion = parentRecord.stat.getCversion();
        if (createMode.isSequential()) {
            path = path + String.format(Locale.ENGLISH, "%010d", parentCVersion);
        }
        validatePath(path, request.sessionId);
        try {
            if (getRecordForPath(path) != null) {
                throw new KeeperException.NodeExistsException(path);
            }
        } catch (KeeperException.NoNodeException e) {
            // ignore this one
        }
        boolean ephemeralParent = EphemeralType.get(parentRecord.stat.getEphemeralOwner()) == EphemeralType.NORMAL;
        if (ephemeralParent) {
            throw new KeeperException.NoChildrenForEphemeralsException(path);
        }
        int newCversion = parentRecord.stat.getCversion()+1;
        if (type == OpCode.createContainer) {
            request.setTxn(new CreateContainerTxn(path, data, listACL, newCversion));
        } else if (type == OpCode.createTTL) {
            request.setTxn(new CreateTTLTxn(path, data, listACL, newCversion, ttl));
        } else {
            request.setTxn(new CreateTxn(path, data, listACL, createMode.isEphemeral(),
                    newCversion));
        }
        StatPersisted s = new StatPersisted();
        if (createMode.isEphemeral()) {
            s.setEphemeralOwner(request.sessionId);
        }
        parentRecord = parentRecord.duplicate(request.getHdr().getZxid());
        parentRecord.childCount++;
        parentRecord.stat.setCversion(newCversion);
        addChangeRecord(parentRecord);
        addChangeRecord(new ChangeRecord(request.getHdr().getZxid(), path, s, 0, listACL));
    }

    // ChangeRecords are stored in zks as shared data, visible to the other processors
    private void addChangeRecord(ChangeRecord c) {
        synchronized (zks.outstandingChanges) {
            zks.outstandingChanges.add(c);
            zks.outstandingChangesForPath.put(c.path, c);
        }
    }

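PrepRequestProcessor receives requests and handles each request type differently. Transaction (write) requests are turned into ChangeRecord entries; read requests only have their session checked.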
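Two small steps in the code above are easy to try in isolation: naming a sequential node by appending the parent's cversion as ten zero-padded digits, and the optimistic version check used by setData and setACL. The sketch below mirrors those steps under stated assumptions: `PrepSketch` is a hypothetical class, and its `checkAndIncVersion` assumes that an expected version of -1 matches any current version, with `IllegalStateException` standing in for ZooKeeper's bad-version error.

```java
import java.util.Locale;

final class PrepSketch {
    // Same formatting expression as in pRequest2TxnCreate above.
    static String sequentialPath(String path, int parentCVersion) {
        return path + String.format(Locale.ENGLISH, "%010d", parentCVersion);
    }

    // Assumed behaviour: -1 means "any version"; otherwise the versions must match.
    static int checkAndIncVersion(int currentVersion, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != currentVersion) {
            throw new IllegalStateException("bad version: expected "
                    + expectedVersion + " but node is at " + currentVersion);
        }
        return currentVersion + 1;
    }

    public static void main(String[] args) {
        System.out.println(sequentialPath("/locks/lock-", 7)); // /locks/lock-0000000007
        System.out.println(checkAndIncVersion(3, 3));          // 4
    }
}
```

The version check is what lets a client do a compare-and-set: a write that names a stale version fails instead of silently overwriting newer data.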

ProposalRequestProcessor: forwards the request to the two processors it wraps internally, SyncRequestProcessor -> AckRequestProcessor.

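SyncRequestProcessor: runs as a thread. It logs each request, rolls the log and takes a snapshot when the conditions are met, and syncs the requests into the ZKDatabase.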

    public void run() {
        try {
            int logCount = 0;
            // we do this in an attempt to ensure that not all of the servers
            // in the ensemble take a snapshot at the same time
            // (the random offset keeps the nodes from snapshotting simultaneously)
            int randRoll = r.nextInt(snapCount/2);
            while (true) {
                Request si = null;
                if (toFlush.isEmpty()) {
                    si = queuedRequests.take();
                } else {
                    si = queuedRequests.poll();
                    if (si == null) {
                        flush(toFlush);
                        continue;
                    }
                }
                if (si == requestOfDeath) {
                    break;
                }
                if (si != null) {
                    // track the number of records written to the log
                    if (zks.getZKDatabase().append(si)) {
                        logCount++;
                        if (logCount > (snapCount / 2 + randRoll)) {
                            randRoll = r.nextInt(snapCount/2);
                            // roll the log
                            zks.getZKDatabase().rollLog();
                            // take a snapshot
                            if (snapInProcess != null && snapInProcess.isAlive()) {
                                // only one snapshot thread may run at a time
                                LOG.warn("Too busy to snap, skipping");
                            } else {
                                snapInProcess = new ZooKeeperThread("Snapshot Thread") {
                                        public void run() {
                                            try {
                                                zks.takeSnapshot();
                                            } catch(Exception e) {
                                                LOG.warn("Unexpected exception", e);
                                            }
                                        }
                                    };
                                snapInProcess.start();
                            }
                            logCount = 0;
                        }
                    } else if (toFlush.isEmpty()) {
                        // optimization for read heavy workloads
                        // iff this is a read, and there are no pending
                        // flushes (writes), then just pass this to the next
                        // processor
                        if (nextProcessor != null) {
                            nextProcessor.processRequest(si);
                            if (nextProcessor instanceof Flushable) {
                                ((Flushable)nextProcessor).flush();
                            }
                        }
                        continue;
                    }
                    toFlush.add(si);
                    if (toFlush.size() > 1000) {
                        flush(toFlush);
                    }
                }
            }
        } catch (Throwable t) {
            handleException(this.getName(), t);
        } finally{
            running = false;
        }
        LOG.info("SyncRequestProcessor exited!");
    }
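The snapshot condition in the loop above, `logCount > (snapCount / 2 + randRoll)`, can be isolated into a small counter. The sketch below reproduces only that arithmetic; `SnapshotTrigger` and `onAppend` are hypothetical names, and log rolling and the snapshot thread itself are left out.

```java
import java.util.Random;

// Decides when a snapshot is due, using the same randomized threshold as above.
final class SnapshotTrigger {
    private final int snapCount; // must be at least 2 so that snapCount / 2 > 0
    private final Random random;
    private int randRoll;
    private int logCount = 0;

    SnapshotTrigger(int snapCount, Random random) {
        this.snapCount = snapCount;
        this.random = random;
        this.randRoll = random.nextInt(snapCount / 2);
    }

    // Call once per transaction appended to the log.
    // Returns true when the threshold is crossed and a snapshot should be taken.
    boolean onAppend() {
        logCount++;
        if (logCount > snapCount / 2 + randRoll) {
            randRoll = random.nextInt(snapCount / 2); // new offset for the next round
            logCount = 0;
            return true;
        }
        return false;
    }
}
```

With `snapCount = 10`, for example, `randRoll` lies between 0 and 4, so a snapshot is triggered somewhere between the 6th and the 10th appended transaction. Each server draws its own offset, which is what spreads the snapshots out across the ensemble.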

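AckRequestProcessor: turns the incoming request into an Ack for the Leader, signalling that this server agrees to the request, and reports it through processAck. The Leader itself, in its main loop, exchanges pings with the other nodes and receives their Acks through processAck, recording which nodes have voted. It then tries to commit; once the commit succeeds, it broadcasts that the request has been committed.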

    // AckRequestProcessor
    public void processRequest(Request request) {
        QuorumPeer self = leader.self;
        if(self != null)
            leader.processAck(self.getId(), request.zxid, null);
        else
            LOG.error("Null QuorumPeer");
    }

    // Leader
    synchronized public void processAck(long sid, long zxid, SocketAddress followerAddr) {
        if (!allowedToCommit) return; // last op committed was a leader change - from now on
                                      // the new leader should commit
        if (LOG.isTraceEnabled()) {
            LOG.trace("Ack zxid: 0x{}", Long.toHexString(zxid));
            for (Proposal p : outstandingProposals.values()) {
                long packetZxid = p.packet.getZxid();
                LOG.trace("outstanding proposal: 0x{}",
                        Long.toHexString(packetZxid));
            }
            LOG.trace("outstanding proposals all");
        }
        ...
        if (outstandingProposals.size() == 0) {
            if (LOG.isDebugEnabled()) {
                LOG.debug("outstanding is 0");
            }
            return;
        }
        ...
        Proposal p = outstandingProposals.get(zxid);
        ...
        // record who has acknowledged this Proposal
        p.addAck(sid);
        boolean hasCommitted = tryToCommit(p, zxid, followerAddr);
        ...
        if (hasCommitted && p.request!=null && p.request.getHdr().getType() == OpCode.reconfig){
            long curZxid = zxid;
            while (allowedToCommit && hasCommitted && p!=null){
                curZxid++;
                p = outstandingProposals.get(curZxid);
                if (p !=null) hasCommitted = tryToCommit(p, curZxid, null);
            }
        }
    }

    synchronized public boolean tryToCommit(Proposal p, long zxid, SocketAddress followerAddr) {
        ...
        // in order to be committed, a proposal must be accepted by a quorum
        // getting a quorum from all necessary configurations
        if (!p.hasAllQuorums()) { // has more than half acknowledged?
            return false;
        }
        // remove the proposal only after the quorum check has passed, so that
        // later acks can still find it while it is waiting for a quorum
        outstandingProposals.remove(zxid);
        if (p.request != null) {
            toBeApplied.add(p);
        }
        if (p.request == null) {
            LOG.warn("Going to commmit null: " + p);
        } else if (p.request.getHdr().getType() == OpCode.reconfig) {
            ...
        } else {
            commit(zxid);
            inform(p);
        }
        zk.commitProcessor.commit(p.request);
        if(pendingSyncs.containsKey(zxid)){
            for(LearnerSyncRequest r: pendingSyncs.remove(zxid)) {
                sendSync(r);
            }
        }
        return  true;
    }

    public void commit(long zxid) {
        synchronized(this){
            lastCommitted = zxid;
        }
        QuorumPacket qp = new QuorumPacket(Leader.COMMIT, zxid, null, null);
        sendPacket(qp);
    }
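The core of the Ack handling is a per-proposal set of server ids plus a majority test. The sketch below shows that bookkeeping for a fixed ensemble size. `ProposalSketch` and its methods are hypothetical; ZooKeeper's own Proposal checks the quorum through its configured quorum verifiers, which this sketch replaces with a plain "more than half" rule.

```java
import java.util.HashSet;
import java.util.Set;

// Tracks which servers have acknowledged one proposal.
final class ProposalSketch {
    private final int ensembleSize;
    private final Set<Long> acks = new HashSet<>();

    ProposalSketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Record the ack of server sid; duplicate acks from the same server count once.
    void addAck(long sid) {
        acks.add(sid);
    }

    // True once strictly more than half of the ensemble has acknowledged.
    boolean hasQuorum() {
        return acks.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        ProposalSketch p = new ProposalSketch(5);
        p.addAck(1); // the leader's own ack, as sent by AckRequestProcessor
        p.addAck(2);
        System.out.println(p.hasQuorum()); // false: 2 of 5
        p.addAck(3);
        System.out.println(p.hasQuorum()); // true: 3 of 5
    }
}
```

Using a set rather than a counter matters because a follower may resend an Ack, and a repeated Ack must not push the proposal over the threshold.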


    // CommitProcessor
    @Override
    public void run() {
        Request request;
        try {
            while (!stopped) {
                synchronized(this) {
                    while (
                        !stopped &&
                        ((queuedRequests.isEmpty() || isWaitingForCommit() || isProcessingCommit()) &&
                         (committedRequests.isEmpty() || isProcessingRequest()))) {
                        wait();
                    }
                }
                /*
                 * Processing queuedRequests: Process the next requests until we
                 * find one for which we need to wait for a commit. We cannot
                 * process a read request while we are processing write request.
                 */
                while (!stopped && !isWaitingForCommit() &&
                       !isProcessingCommit() &&
                       (request = queuedRequests.poll()) != null) {
                    if (needCommit(request)) { // write request: needs further handling
                        nextPending.set(request);
                    } else { // read request: hand it straight to the next processor
                        sendToNextProcessor(request);
                    }
                }
                /*
                 * Processing committedRequests: check and see if the commit
                 * came in for the pending request. We can only commit a
                 * request when there is no other request being processed.
                 */
                processCommitted();
            }
        }
        ...
    }

    protected void processCommitted() {
        Request request;
        if (!stopped && !isProcessingRequest() &&
                (committedRequests.peek() != null)) {
            request = committedRequests.poll();
            Request pending = nextPending.get();
            if (pending != null &&
                pending.sessionId == request.sessionId &&
                pending.cxid == request.cxid) {
                // we want to send our version of the request.
                // the pointer to the connection in the request
                pending.setHdr(request.getHdr());
                pending.setTxn(request.getTxn());
                pending.zxid = request.zxid;
                // Set currentlyCommitting so we will block until this
                // completes. Cleared by CommitWorkRequest after
                // nextProcessor returns.
                currentlyCommitting.set(pending);
                nextPending.set(null);
                sendToNextProcessor(pending);
            } else {
                // this request came from someone else so just
                // send the commit packet
                currentlyCommitting.set(request);
                // sendToNextProcessor hands the request to the next processor
                // on a worker thread and also does some cleanup
                sendToNextProcessor(request);
            }
        }
    }


    // Leader.ToBeAppliedRequestProcessor
    public void processRequest(Request request) throws RequestProcessorException {
        next.processRequest(request);
        // The only requests that should be on toBeApplied are write
        // requests, for which we will have a hdr. We can't simply use
        // request.zxid here because that is set on read requests to equal
        // the zxid of the last write op.
        if (request.getHdr() != null) {
            long zxid = request.getHdr().getZxid();
            Iterator<Proposal> iter = leader.toBeApplied.iterator();
            if (iter.hasNext()) {
                Proposal p = iter.next();
                if (p.request != null && p.request.zxid == zxid) {
                    iter.remove();
                    return;
                }
            }
            LOG.error("Committed request not found on toBeApplied: "
                      + request);
        }
    }

FinalRequestProcessor: like PrepRequestProcessor, it handles each request type differently, here for the last step. It produces the result and triggers the watcher mechanism.





FollowerRequestProcessor: handles read requests itself and forwards write requests to the Leader.
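The routing decision a Follower makes can be sketched as a simple classification of operations. The enum values and the `route` method below are hypothetical and cover only a handful of operations. The sketch also simplifies in one respect: in ZooKeeper a Follower still passes a write request down its own chain so that it can match the commit later, which is not modelled here.

```java
// Sketch of how a follower splits reads from writes.
final class FollowerRoutingSketch {
    enum Op { CREATE, DELETE, SET_DATA, SET_ACL, GET_DATA, EXISTS, GET_CHILDREN }

    enum Route { LOCAL, FORWARD_TO_LEADER }

    static Route route(Op op) {
        switch (op) {
            case CREATE:
            case DELETE:
            case SET_DATA:
            case SET_ACL:
                // Writes change state, so only the Leader may turn them into proposals.
                return Route.FORWARD_TO_LEADER;
            default:
                // Reads are answered from the follower's own copy of the data.
                return Route.LOCAL;
        }
    }

    public static void main(String[] args) {
        System.out.println(route(Op.GET_DATA)); // LOCAL
        System.out.println(route(Op.SET_DATA)); // FORWARD_TO_LEADER
    }
}
```

Serving reads locally is what lets read throughput grow with the number of servers, while all writes still pass through the single Leader and keep their global order.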



