Chapter 3: QuorumPeer Leader Election

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 15:20

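I. Creating the election module.
The start method of QuorumPeer calls startLeaderElection to set up everything the election needs. This method also initializes the current vote so that the server votes for itself as leader by default.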

    synchronized public void startLeaderElection() {
        try {
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        } catch(IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        this.electionAlg = createElectionAlgorithm(electionType);
    }

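II. The election procedure. The election algorithm is selected from the startup parameters covered in the previous chapter. By default it is the FastLeaderElection implementation class, a variant of the Paxos algorithm.
1. Create the set of votes received from LOOKING nodes (recvset) and the set of votes from nodes that have already finished the election (outofelection). Increment the logical clock logicalclock by 1 to mark the start of a new election round, and set the initial proposal through updateProposal().
2. The initial proposal names the server itself as Leader. Broadcast the proposal to the other servers and enter the receive loop.
3. Repeat the following steps until the election completes or is forcibly stopped.
4. Take the next vote sent by another server from the receive queue. The poll has a timeout.
5. If no vote arrives within the timeout: resend the notifications if all pending messages have been delivered, otherwise reconnect to all servers. Then double the timeout (capped at maxNotificationInterval) and go to the next iteration.
6. If a vote arrives from a member of the voting view, enter the switch statement that handles it according to the sender's state.
7. If the vote comes from a LOOKING node, there are three cases. If the sender's electionEpoch is greater than the local logicalclock, set logicalclock to that value, clear the received-vote set, compare the received vote with this server's own initial vote, adopt the winner as the current proposal, and broadcast it to the other servers. If the sender's electionEpoch is smaller than the local logicalclock, ignore the vote. If they are equal, compare the received vote with the current proposal by peerEpoch, then zxid, then serverId; if the received vote is larger it becomes the current proposal and is broadcast to the other servers.
8. The previous step also puts the vote into recvset. Now check whether recvset is enough to elect a Leader, which is decided by the containsQuorum method of QuorumVerifier. By default this requires more than half of the servers to have voted for the current proposal. If so, go to the next step; otherwise go to the next iteration.
9. If recvset elects a Leader, keep polling the receive queue with a 200 ms timeout (finalizeWait). If a vote that beats the current proposal arrives, put it back into the receive queue, abandon this attempt to finish the election, and go to the next iteration.
10. If no better vote turned up in step 9, the Leader is now determined. Set the server's state according to the result (LEADING, or the learner state), clear the receive queue, and return the vote for the Leader.
11. If the vote comes from a FOLLOWING or LEADING node, handle it with steps 12 and 13. (Votes from OBSERVING nodes are only logged.)
12. If the vote's electionEpoch equals the local logicalclock, put the vote into recvset and check whether a majority of the servers in recvset voted for the node named in the vote, and whether that node has itself reported the LEADING state. If both hold, accept that node as Leader. Otherwise go to the next step.
13. Put the vote into the outofelection set and check whether a majority of the servers in that set agree on the node named in the vote, and whether that node also considers itself the Leader. If both hold, adopt the vote's electionEpoch as logicalclock, accept the Leader, and finish the election. Otherwise go to the next iteration.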
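The comparison in step 7 and the majority check in step 8 can be sketched in a few lines. This is a simplified illustration, not the ZooKeeper source: the class ElectionSketch, its Vote type, and the methods beats and hasQuorum are hypothetical names, and the quorum check assumes a plain majority with equal weights.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class ElectionSketch {

    /** Simplified vote: the proposed leader plus that leader's zxid and peer epoch. */
    static final class Vote {
        final long leader;
        final long zxid;
        final long peerEpoch;

        Vote(long leader, long zxid, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.peerEpoch = peerEpoch;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Vote)) return false;
            Vote v = (Vote) o;
            return leader == v.leader && zxid == v.zxid && peerEpoch == v.peerEpoch;
        }

        @Override
        public int hashCode() {
            return Objects.hash(leader, zxid, peerEpoch);
        }
    }

    /** Step 7: a received vote wins on higher peerEpoch, then higher zxid, then higher server id. */
    static boolean beats(Vote received, Vote current) {
        if (received.peerEpoch != current.peerEpoch) return received.peerEpoch > current.peerEpoch;
        if (received.zxid != current.zxid) return received.zxid > current.zxid;
        return received.leader > current.leader;
    }

    /** Step 8 (assumed simple majority): more than half of the ensemble voted for the proposal. */
    static boolean hasQuorum(Map<Long, Vote> recvset, Vote proposal, int ensembleSize) {
        int count = 0;
        for (Vote v : recvset.values()) {
            if (v.equals(proposal)) count++;
        }
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Vote mine = new Vote(1, 0x10, 1);     // server 1 initially votes for itself
        Vote fromTwo = new Vote(2, 0x10, 1);  // server 2 votes for itself
        Map<Long, Vote> recvset = new HashMap<>();

        // Same epoch and zxid, so the higher server id wins and server 1 switches its proposal.
        Vote proposal = beats(fromTwo, mine) ? fromTwo : mine;
        recvset.put(2L, fromTwo);
        recvset.put(1L, proposal);

        System.out.println("proposed leader = " + proposal.leader);
        System.out.println("quorum in 3-server ensemble = " + hasQuorum(recvset, proposal, 3));
    }
}
```

Comparing peerEpoch and zxid before the server id means the server with the most recent data is preferred, and the id only breaks ties.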

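I put a lot of effort into drawing a flowchart of this procedure:
[Figure: flowchart of the election procedure; image not available]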

    HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
    HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
    int notTimeout = finalizeWait;

    synchronized(this){
        logicalclock++;
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }

    LOG.info("New election. My id =  " + self.getId() +
            ", proposed zxid=0x" + Long.toHexString(proposedZxid));
    sendNotifications();

    /*
     * Loop in which we exchange notifications until we find a leader
     */
    while ((self.getPeerState() == ServerState.LOOKING) &&
            (!stop)){
        /*
         * Remove next notification from queue, times out after 2 times
         * the termination time
         */
        Notification n = recvqueue.poll(notTimeout,
                TimeUnit.MILLISECONDS);

        /*
         * Sends more notifications if haven't received enough.
         * Otherwise processes new notification.
         */
        if(n == null){
            if(manager.haveDelivered()){
                sendNotifications();
            } else {
                manager.connectAll();
            }

            /*
             * Exponential backoff
             */
            int tmpTimeOut = notTimeout*2;
            notTimeout = (tmpTimeOut < maxNotificationInterval?
                    tmpTimeOut : maxNotificationInterval);
            LOG.info("Notification time out: " + notTimeout);
        }
        else if(self.getVotingView().containsKey(n.sid)) {
            /*
             * Only proceed if the vote comes from a replica in the
             * voting view.
             */
            switch (n.state) {
            case LOOKING:
                // If notification > current, replace and send messages out
                if (n.electionEpoch > logicalclock) {
                    logicalclock = n.electionEpoch;
                    recvset.clear();
                    if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                    } else {
                        updateProposal(getInitId(),
                                getInitLastLoggedZxid(),
                                getPeerEpoch());
                    }
                    sendNotifications();
                } else if (n.electionEpoch < logicalclock) {
                    if(LOG.isDebugEnabled()){
                        LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                + Long.toHexString(n.electionEpoch)
                                + ", logicalclock=0x" + Long.toHexString(logicalclock));
                    }
                    break;
                } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                        proposedLeader, proposedZxid, proposedEpoch)) {
                    updateProposal(n.leader, n.zxid, n.peerEpoch);
                    sendNotifications();
                }

                if(LOG.isDebugEnabled()){
                    LOG.debug("Adding vote: from=" + n.sid +
                            ", proposed leader=" + n.leader +
                            ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                            ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                }

                recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                if (termPredicate(recvset,
                        new Vote(proposedLeader, proposedZxid,
                                logicalclock, proposedEpoch))) {

                    // Verify if there is any change in the proposed leader
                    while((n = recvqueue.poll(finalizeWait,
                            TimeUnit.MILLISECONDS)) != null){
                        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)){
                            recvqueue.put(n);
                            break;
                        }
                    }

                    /*
                     * This predicate is true once we don't read any new
                     * relevant message from the reception queue
                     */
                    if (n == null) {
                        self.setPeerState((proposedLeader == self.getId()) ?
                                ServerState.LEADING: learningState());

                        Vote endVote = new Vote(proposedLeader,
                                                proposedZxid,
                                                logicalclock,
                                                proposedEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                }
                break;
            case OBSERVING:
                LOG.debug("Notification from observer: " + n.sid);
                break;
            case FOLLOWING:
            case LEADING:
                /*
                 * Consider all notifications from the same epoch
                 * together.
                 */
                if(n.electionEpoch == logicalclock){
                    recvset.put(n.sid, new Vote(n.leader,
                                                  n.zxid,
                                                  n.electionEpoch,
                                                  n.peerEpoch));

                    if(ooePredicate(recvset, outofelection, n)) {
                        self.setPeerState((n.leader == self.getId()) ?
                                ServerState.LEADING: learningState());

                        Vote endVote = new Vote(n.leader,
                                n.zxid,
                                n.electionEpoch,
                                n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                }

                /*
                 * Before joining an established ensemble, verify
                 * a majority is following the same leader.
                 */
                outofelection.put(n.sid, new Vote(n.version,
                                                    n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch,
                                                    n.state));

                if(ooePredicate(outofelection, outofelection, n)) {
                    synchronized(this){
                        logicalclock = n.electionEpoch;
                        self.setPeerState((n.leader == self.getId()) ?
                                ServerState.LEADING: learningState());
                    }
                    Vote endVote = new Vote(n.leader,
                                            n.zxid,
                                            n.electionEpoch,
                                            n.peerEpoch);
                    leaveInstance(endVote);
                    return endVote;
                }
                break;
            default:
                LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                        n.state, n.sid);
                break;
            }
        } else {
            LOG.warn("Ignoring notification from non-cluster member " + n.sid);
        }
    }
    return null;
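The "Exponential backoff" branch in the loop above is easy to overlook, so here is a standalone sketch of how the poll timeout grows when no notification arrives. The class BackoffSketch and the method nextTimeout are hypothetical names. The 200 ms starting value matches the finalizeWait timeout mentioned in step 9; the 60000 ms cap is an assumed value for maxNotificationInterval and should be checked against the ZooKeeper version in use.

```java
final class BackoffSketch {
    // Starting timeout, matching the 200 ms finalizeWait described in the steps above.
    static final int FINALIZE_WAIT_MS = 200;
    // Assumed cap standing in for maxNotificationInterval.
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60000;

    /** Doubles the timeout after an empty poll, never exceeding the cap. */
    static int nextTimeout(int currentMs) {
        int doubled = currentMs * 2;
        return doubled < MAX_NOTIFICATION_INTERVAL_MS ? doubled : MAX_NOTIFICATION_INTERVAL_MS;
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT_MS;
        // Simulate ten consecutive polls that return nothing.
        for (int i = 1; i <= 10; i++) {
            timeout = nextTimeout(timeout);
            System.out.println("after empty poll " + i + ": " + timeout + " ms");
        }
    }
}
```

Doubling keeps an isolated server from flooding the network with notifications, while the cap bounds how long it can take to notice that peers have come back.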

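QuorumCnxManager manages the connections used during the election.
The election port is set by the second n in server.x=[hostname]:n:n. Connections are always initiated by the server with the larger serverId toward the server with the smaller serverId, and the server with the smaller serverId only needs to accept. This reduces the number of connections in use.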
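As a rough illustration of the two rules above, the sketch below extracts the election port from a server.x line and decides which side of a pair opens the connection. It is not taken from QuorumCnxManager: the class ElectionConnSketch and its methods are hypothetical, the host name and ports in main are made-up sample values, and the parser only handles the basic hostname:n:n form (no IPv6 literals or extra suffixes).

```java
final class ElectionConnSketch {

    /** Returns the second port (the election port) from a line like server.1=host:2888:3888. */
    static int electionPort(String configLine) {
        String value = configLine.substring(configLine.indexOf('=') + 1).trim();
        String[] parts = value.split(":");
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected hostname:n:n but got: " + value);
        }
        return Integer.parseInt(parts[2]);
    }

    /** The server with the larger id opens the connection; the smaller one only accepts. */
    static boolean initiates(long myId, long remoteId) {
        return myId > remoteId;
    }

    /** With one connection per pair, n servers need n * (n - 1) / 2 election connections. */
    static int connectionCount(int servers) {
        return servers * (servers - 1) / 2;
    }

    public static void main(String[] args) {
        // Sample values for illustration only.
        System.out.println("election port = " + electionPort("server.1=zoo1:2888:3888"));
        System.out.println("server 3 connects to server 1: " + initiates(3, 1));
        System.out.println("server 1 connects to server 3: " + initiates(1, 3));
        System.out.println("connections for 5 servers = " + connectionCount(5));
    }
}
```

Fixing the direction by id gives exactly one connection per pair of servers, instead of two when both sides dial each other.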