Hadoop Source Code Analysis: DataNode Handshake, Registration, Block Reports, and Heartbeats

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 17:54

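The previous article, Hadoop源码分析之DataNode的启动与停止 (Hadoop Source Code Analysis: Starting and Stopping the DataNode), walked through the overall DataNode startup process. This article focuses on the exchanges with the NameNode that take place during that startup.

IPC Object Creation

The DataNode class has a member variable namenode of type DatanodeProtocol. DatanodeProtocol is the interface used for IPC between a DataNode and the NameNode; in this exchange the DataNode is the client and the NameNode is the server. Through this interface the DataNode reports information to the NameNode and keeps it in sync. The return values of some of its methods also carry commands from the NameNode, and the DataNode acts on them by moving, deleting, or recovering blocks on its local disks, or by performing other operations. The namenode object is initialized in DataNode.startDataNode() as follows: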

this.namenode = (DatanodeProtocol)
    RPC.waitForProxy(DatanodeProtocol.class,
                     DatanodeProtocol.versionID,
                     nameNodeAddr, conf);

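RPC.waitForProxy() calls RPC.getProxy() to create the object used for IPC (for background on IPC, see the earlier posts in this series on Hadoop IPC), and the returned object is assigned to namenode. RPC.waitForProxy(), in the overload that takes explicit timeouts, looks like this: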

static VersionedProtocol waitForProxy(Class<? extends VersionedProtocol> protocol,
                                      long clientVersion, InetSocketAddress addr,
                                      Configuration conf, int rpcTimeout,
                                      long connTimeout) throws IOException {
  long startTime = System.currentTimeMillis();
  IOException ioe;
  while (true) {
    try {
      return getProxy(protocol, clientVersion, addr, conf, rpcTimeout);
    } catch(ConnectException se) {  // namenode has not been started
      LOG.info("Server at " + addr + " not available yet, Zzzzz...");
      ioe = se;
    } catch(SocketTimeoutException te) {  // namenode is busy
      LOG.info("Problem connecting to server: " + addr);
      ioe = te;
    }
    // check if timed out
    if (System.currentTimeMillis()-connTimeout >= startTime) {
      throw ioe;
    }
    // wait for retry
    try {
      Thread.sleep(1000);
    } catch (InterruptedException ie) {
      // IGNORE
    }
  }
}

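waitForProxy() is called instead of calling getProxy() directly because RPC.getProxy() has to open a network connection to the NameNode, and the first attempt may not succeed. waitForProxy() therefore calls getProxy() inside a while loop, so the connection can be attempted repeatedly until it is established or the connection timeout expires. If a getProxy() call succeeds, the method returns immediately. If it fails, the thread sleeps for one second and tries again. Once the total time spent retrying exceeds the timeout, the last exception is thrown.

If waitForProxy() returns without an exception, the namenode object has been created and IPC methods can be invoked on it.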
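The retry-until-deadline behavior described above can be reproduced outside Hadoop. The following is a minimal sketch, not Hadoop code: the class RetryUntilTimeout, the method callWithRetry, and its parameters are hypothetical names chosen for illustration, and a plain Callable stands in for RPC.getProxy().

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

/** Hypothetical stand-alone illustration of the retry-until-deadline pattern. */
final class RetryUntilTimeout {

    /** Calls attempt until it succeeds or timeoutMs has elapsed since the first try. */
    static <T> T callWithRetry(Callable<T> attempt, long timeoutMs, long sleepMs)
            throws Exception {
        long start = System.currentTimeMillis();
        while (true) {
            IOException last;
            try {
                return attempt.call();          // success: return immediately
            } catch (ConnectException e) {      // server not reachable yet
                last = e;
            }
            if (System.currentTimeMillis() - start >= timeoutMs) {
                throw last;                     // deadline passed: give up
            }
            Thread.sleep(sleepMs);              // wait before the next attempt
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // The simulated server refuses the first two connection attempts.
        String proxy = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("not up yet");
            }
            return "proxy";
        }, 5_000, 10);
        System.out.println(proxy + " obtained after " + calls[0] + " attempts");
    }
}
```

As in waitForProxy(), the deadline is only checked after a failed attempt, so at least one attempt is always made even when the timeout is zero.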

Handshake

The handshake takes place during DataNode startup: DataNode.startDataNode() calls handshake(), which performs the handshake with the NameNode. The code of DataNode.handshake() is:

private NamespaceInfo handshake() throws IOException {
  NamespaceInfo nsInfo = new NamespaceInfo();
  while (shouldRun) {
    try {
      nsInfo = namenode.versionRequest(); // remote call: fetch the NameNode's information
      break;
    } catch(SocketTimeoutException e) {  // namenode is busy
      LOG.info("Problem connecting to server: " + getNameNodeAddr());
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {}
    }
  }
  if (!isPermittedVersion(nsInfo)) {
    String errorMsg = "Shutting down. Incompatible version or revision." +
        "DataNode version '" + VersionInfo.getVersion() +
        "' and revision '" + VersionInfo.getRevision() +
        "' and NameNode version '" + nsInfo.getVersion() +
        "' and revision '" + nsInfo.getRevision() +
        " and " + CommonConfigurationKeys.HADOOP_RELAXED_VERSION_CHECK_KEY +
        " is " + (relaxedVersionCheck ? "enabled" : "not enabled") +
        " and " + CommonConfigurationKeys.HADOOP_SKIP_VERSION_CHECK_KEY +
        " is " + (noVersionCheck ? "enabled" : "not enabled");
    LOG.fatal(errorMsg);
    notifyNamenode(DatanodeProtocol.NOTIFY, errorMsg);
    throw new IOException( errorMsg );
  }
  assert FSConstants.LAYOUT_VERSION == nsInfo.getLayoutVersion() :
    "Data-node and name-node layout versions must be the same."
    + "Expected: "+ FSConstants.LAYOUT_VERSION + " actual "+ nsInfo.getLayoutVersion();
  return nsInfo;
}

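This method calls DatanodeProtocol.versionRequest() to request information about the NameNode it is connected to. The return value is of type NamespaceInfo, a class that mainly holds version information for the whole HDFS instance. Its declaration and member variables are: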

public class NamespaceInfo extends StorageInfo implements Writable {
  /** Revision number of the build */
  String revision;
  /** Hadoop version */
  String version;
  /** Used for the version check before a DataNode upgrade */
  int distributedUpgradeVersion;
}

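NamespaceInfo extends StorageInfo, so besides the three member variables above it also carries layoutVersion, namespaceID, and cTime. On the server side, the NameNode handles the request in NameNode.versionRequest(), which calls FSNamesystem.getNamespaceInfo() to create and return a NamespaceInfo object.

After DatanodeProtocol.versionRequest() returns, DataNode.handshake() calls DataNode.isPermittedVersion() to compare the version information of the NameNode and the DataNode; that method is covered in the post Hadoop源码分析之DataNode的启动与停止. Beyond the version information, the DataNode's layoutVersion must also match the NameNode's, because layoutVersion is the version number of the structure in which HDFS stores its information. If the two differ, the program exits and the DataNode stops.

Registration

Registration is a mandatory step in DataNode startup; it is what allows the DataNode to work as part of a specific HDFS instance. During registration the DataNode obtains a registration ID, which the NameNode uses to recognize DataNodes that have already registered. On the DataNode side, registration is performed by DataNode.register(), which was analyzed in the post Hadoop源码分析之DataNode的启动与停止. Here we look at NameNode.register():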

/**
 * Handles DataNode registration.
 */
public DatanodeRegistration register(DatanodeRegistration nodeReg
                                     ) throws IOException {
  verifyVersion(nodeReg.getVersion());
  namesystem.registerDatanode(nodeReg);

  return nodeReg;
}

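The method takes one parameter, nodeReg, an instance of DatanodeRegistration. This class holds all the information the NameNode needs to identify and verify a DataNode when communicating with it, and the DataNode passes this information along whenever it invokes a NameNode method over RPC.

NameNode.register() first calls verifyVersion() to confirm that the DataNode's layoutVersion matches the NameNode's, and throws an exception if it does not. It then calls FSNamesystem.registerDatanode(), which carries out the actual registration logic: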

public synchronized void registerDatanode(DatanodeRegistration nodeReg
                                          ) throws IOException {
  String dnAddress = Server.getRemoteAddress(); // address of the calling DataNode
  if (dnAddress == null) {
    // Mostly called inside an RPC.
    // But if not, use address passed by the data-node.
    dnAddress = nodeReg.getHost();
  }

  // check if the datanode is allowed to be connect to the namenode
  // (decided from the include and exclude files)
  if (!verifyNodeRegistration(nodeReg, dnAddress)) {
    throw new DisallowedDatanodeException(nodeReg);
  }

  String hostName = nodeReg.getHost();

  // update the datanode's name with ip:port
  DatanodeID dnReg = new DatanodeID(dnAddress + ":" + nodeReg.getPort(),
                                    nodeReg.getStorageID(),
                                    nodeReg.getInfoPort(),
                                    nodeReg.getIpcPort());
  nodeReg.updateRegInfo(dnReg);
  nodeReg.exportedKeys = getBlockKeys();

  NameNode.stateChangeLog.info(
                               "BLOCK* registerDatanode: "
                               + "node registration from " + nodeReg.getName()
                               + " storage " + nodeReg.getStorageID());

  DatanodeDescriptor nodeS = datanodeMap.get(nodeReg.getStorageID());
  DatanodeDescriptor nodeN = host2DataNodeMap.getDatanodeByName(nodeReg.getName());

  if (nodeN != null && nodeN != nodeS) { // the DataNode is registering with a new storage ID
    NameNode.LOG.info("BLOCK* registerDatanode: "
                      + "node from name: " + nodeN.getName());
    // nodeN previously served a different data storage,
    // which is not served by anybody anymore.
    removeDatanode(nodeN);
    // physically remove node from datanodeMap
    wipeDatanode(nodeN);
    nodeN = null;
  }

  if (nodeS != null) { // repeated registration
    if (nodeN == nodeS) {
      // The same datanode has been just restarted to serve the same data
      // storage. We do not need to remove old data blocks, the delta will
      // be calculated on the next block report from the datanode
      NameNode.stateChangeLog.debug("BLOCK* registerDatanode: "
                                    + "node restarted");
    } else {
      // nodeS is found
      /* The registering datanode is a replacement node for the existing
        data storage, which from now on will be served by a new node.
        If this message repeats, both nodes might have same storageID
        by (insanely rare) random chance. User needs to restart one of the
        nodes with its data cleared (or user can just remove the StorageID
        value in "VERSION" file under the data directory of the datanode,
        but this is might not work if VERSION file format has changed
      */
      NameNode.stateChangeLog.info( "BLOCK* registerDatanode: "
                                    + "node " + nodeS.getName()
                                    + " is replaced by " + nodeReg.getName() +
                                    " with the same storageID " +
                                    nodeReg.getStorageID());
    }
    // update cluster map
    clusterMap.remove(nodeS);
    nodeS.updateRegInfo(nodeReg);
    nodeS.setHostName(hostName);

    // resolve network location
    resolveNetworkLocation(nodeS);
    clusterMap.add(nodeS);

    // also treat the registration message as a heartbeat
    synchronized(heartbeats) {
      if( !heartbeats.contains(nodeS)) {
        heartbeats.add(nodeS);
        //update its timestamp
        nodeS.updateHeartbeat(0L, 0L, 0L, 0);
        nodeS.isAlive = true;
      }
    }
    return;
  }

  // this is a new datanode serving a new data storage
  if (nodeReg.getStorageID().equals("")) {
    // this data storage has never been registered
    // it is either empty or was created by pre-storageID version of DFS
    nodeReg.storageID = newStorageID();
    NameNode.stateChangeLog.debug(
                                  "BLOCK* registerDatanode: "
                                  + "new storageID " + nodeReg.getStorageID() + " assigned");
  }
  // register new datanode
  DatanodeDescriptor nodeDescr
    = new DatanodeDescriptor(nodeReg, NetworkTopology.DEFAULT_RACK, hostName);
  resolveNetworkLocation(nodeDescr);
  unprotectedAddDatanode(nodeDescr);
  clusterMap.add(nodeDescr);

  // also treat the registration message as a heartbeat:
  // the node is added to the list checked by the heartbeat monitor
  synchronized(heartbeats) {
    heartbeats.add(nodeDescr);
    nodeDescr.isAlive = true;
    // no need to update its timestamp
    // because its is done when the descriptor is created
  }

  if (safeMode != null) {
    safeMode.checkMode();
  }
  return;
}

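The method first obtains the network address of the client (the DataNode) that invoked the registration. It then checks whether the requesting DataNode is allowed to connect to the NameNode, which is decided using the NameNode's include and exclude files. When DataNodes are managed from the NameNode side, nodes can be added or removed dynamically, so the HDFS cluster can grow or shrink at run time. For this to work, the NameNode must explicitly manage which DataNodes connect to it. This keeps every DataNode under the cluster's control and also prevents a misconfigured DataNode from connecting to the wrong NameNode. HDFS provides the ${dfs.hosts} and ${dfs.hosts.exclude} settings for this purpose, each of which specifies a file path. The file named by ${dfs.hosts} is called the include file and lists the DataNodes that may connect to the NameNode; the file named by ${dfs.hosts.exclude} is called the exclude file and lists the DataNodes that may not. In both files each line describes one DataNode, and the line may contain the DataNode's IP address, the value of the name member of DatanodeID, or the value of the hostname member of DatanodeInfo.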
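The include/exclude decision can be sketched in a few lines. This is an illustration only, not the body of verifyNodeRegistration(): the class HostAdmission and its methods are hypothetical, the two files are assumed to be already loaded into sets, and the sketch assumes that an empty include list admits every host that is not excluded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of an include/exclude admission check. */
final class HostAdmission {
    private final Set<String> include;  // entries read from the ${dfs.hosts} file
    private final Set<String> exclude;  // entries read from the ${dfs.hosts.exclude} file

    HostAdmission(Set<String> include, Set<String> exclude) {
        this.include = include;
        this.exclude = exclude;
    }

    /** Assumption: an empty include list means "no restriction". */
    boolean isAllowed(String ip, String name, String hostName) {
        boolean included = include.isEmpty() || matches(include, ip, name, hostName);
        boolean excluded = matches(exclude, ip, name, hostName);
        return included && !excluded;
    }

    /** A line may hold the IP address, the ip:port name, or the host name. */
    private static boolean matches(Set<String> list, String ip, String name, String hostName) {
        return list.contains(ip) || list.contains(name) || list.contains(hostName);
    }

    public static void main(String[] args) {
        Set<String> include = new HashSet<>(Arrays.asList("10.0.0.1", "dn2"));
        Set<String> exclude = new HashSet<>(Arrays.asList("dn2"));
        HostAdmission admission = new HostAdmission(include, exclude);
        System.out.println(admission.isAllowed("10.0.0.1", "10.0.0.1:50010", "dn1")); // true
        System.out.println(admission.isAllowed("10.0.0.2", "10.0.0.2:50010", "dn2")); // false: excluded
        System.out.println(admission.isAllowed("10.0.0.3", "10.0.0.3:50010", "dn3")); // false: not included
    }
}
```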

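Next, the information the NameNode has gathered about the registering DataNode is used to update the DataNode's registration information, which is stored in the nodeReg parameter. NameNode.register() returns this nodeReg object to the caller.

The registration itself is then processed. There are three cases:

1. The DataNode has not yet registered with the NameNode.

2. The DataNode has already registered with the NameNode, and this is a repeated registration.

3. The DataNode has already registered with the NameNode, but this registration uses a new storage ID (storageID), which indicates that the node's storage has been cleared and its previous block replicas have been deleted.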

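FSNamesystem.registerDatanode() handles all three cases, relying on two member variables of FSNamesystem: datanodeMap and host2DataNodeMap. FSNamesystem.datanodeMap records every DataNode registered with this NameNode as StorageID -> DatanodeDescriptor pairs. It is declared as a TreeMap, so the DatanodeDescriptor for a given storageID can be looked up quickly. FSNamesystem.host2DataNodeMap maps the name of each registered DataNode to its DatanodeDescriptor, so a DataNode's registration record can be found by name. In the registration method, nodeS is the descriptor found in datanodeMap by storageID, and nodeN is the descriptor found in host2DataNodeMap by the DataNode's host name and port.

Why does the NameNode keep two mappings of the registration information? Suppose a DataNode has registered with the NameNode, is later formatted, and is then restarted. It must send a new registration request, and at that point nodeS (which is null) and nodeN are not equal. This is case 3 above: nodeN is a stale record, so FSNamesystem.removeDatanode() and FSNamesystem.wipeDatanode() are called to clear what the old node had registered on the NameNode, and nodeN is set to null. From there on the processing is the same as in case 1.

If nodeS is not null, we are in case 2: the DataNode is registering again, and only its network location and heartbeat information need to be updated.

In case 1, a DatanodeDescriptor is created, the node's position in the network topology is resolved, the node is added to datanodeMap and host2DataNodeMap, and its heartbeat information is updated.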
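The way the two lookups separate the three cases can be seen in a stripped-down model. The sketch below is a simplification for illustration, not the FSNamesystem code: RegistrationSketch, Node, Outcome, and registeredCount() are hypothetical, plain maps replace the Hadoop data structures, and topology and heartbeat bookkeeping are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of the two-map lookup used during registration. */
final class RegistrationSketch {
    enum Outcome { NEW_NODE, RE_REGISTERED, STORAGE_REPLACED }

    static final class Node {
        String name;             // host:port
        final String storageId;
        Node(String name, String storageId) {
            this.name = name;
            this.storageId = storageId;
        }
    }

    private final Map<String, Node> datanodeMap = new TreeMap<>();      // storageID -> node
    private final Map<String, Node> host2DataNodeMap = new HashMap<>(); // name -> node

    Outcome register(String name, String storageId) {
        Node nodeS = datanodeMap.get(storageId);
        Node nodeN = host2DataNodeMap.get(name);
        Outcome outcome = Outcome.NEW_NODE;

        if (nodeN != null && nodeN != nodeS) {
            // Case 3: the host is known but under another storage ID; drop the stale record.
            datanodeMap.remove(nodeN.storageId);
            host2DataNodeMap.remove(nodeN.name);
            outcome = Outcome.STORAGE_REPLACED;
        }
        if (nodeS != null) {
            // Case 2: the storage ID is already registered; refresh the existing record.
            host2DataNodeMap.remove(nodeS.name);
            nodeS.name = name;
            host2DataNodeMap.put(name, nodeS);
            return Outcome.RE_REGISTERED;
        }
        // Case 1 (and the tail of case 3): create and index a fresh record.
        Node fresh = new Node(name, storageId);
        datanodeMap.put(storageId, fresh);
        host2DataNodeMap.put(name, fresh);
        return outcome;
    }

    int registeredCount() {
        return datanodeMap.size();
    }

    public static void main(String[] args) {
        RegistrationSketch ns = new RegistrationSketch();
        System.out.println(ns.register("dn1:50010", "S1")); // NEW_NODE
        System.out.println(ns.register("dn1:50010", "S1")); // RE_REGISTERED
        System.out.println(ns.register("dn1:50010", "S2")); // STORAGE_REPLACED
        System.out.println(ns.registeredCount());           // 1
    }
}
```

Looking a node up by name alone could not detect a changed storage ID, and looking it up by storage ID alone could not find the stale record left behind by a reformatted host, which is why both indexes are consulted.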

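Block Reports

Once started, the DataNode enters DataNode.offerService(). In this method the DataNode loops, sending heartbeats, reporting recently received blocks, and reporting all of its blocks. The code is: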

public void offerService() throws Exception {
  while (shouldRun) {
    try {
      long startTime = now();

      //
      // Every so often, send heartbeat or block-report
      //
      if (startTime - lastHeartbeat > heartBeatInterval) {
        // send a heartbeat at a fixed interval
        lastHeartbeat = startTime;
        DatanodeCommand[] cmds = namenode.sendHeartbeat(
            dnRegistration,         // registration that identifies this DataNode
            data.getCapacity(),     // storage capacity of the DataNode
            data.getDfsUsed(),      // space currently in use
            data.getRemaining(),    // space remaining
            xmitsInProgress.get(),  // number of threads currently copying blocks
            getXceiverCount());     // number of service threads in DataXceiverServer
        myMetrics.addHeartBeat(now() - startTime);
        // the NameNode returns commands with the heartbeat; the DataNode executes them
        if (!processCommand(cmds))
          continue;
      }

      // check if there are newly received blocks
      Block [] blockArray=null;
      String [] delHintArray=null;
      synchronized(receivedBlockList) { // holds the blocks received since the last report
        synchronized(delHints) {
          int numBlocks = receivedBlockList.size();
          if (numBlocks > 0) {
            if(numBlocks!=delHints.size()) {
              LOG.warn("Panic: receiveBlockList and delHints are not of the same length" );
            }
            // Send newly-received blockids to namenode
            blockArray = receivedBlockList.toArray(new Block[numBlocks]);
            delHintArray = delHints.toArray(new String[numBlocks]);
          }
        }
      }
      if (blockArray != null) {
        if(delHintArray == null || delHintArray.length != blockArray.length ) {
          LOG.warn("Panic: block array & delHintArray are not the same" );
        }
        namenode.blockReceived(dnRegistration, blockArray, delHintArray); // report recently received blocks
        synchronized (receivedBlockList) { // after reporting, remove the reported blocks from the list
          synchronized (delHints) {
            for(int i=0; i<blockArray.length; i++) {
              receivedBlockList.remove(blockArray[i]);
              delHints.remove(delHintArray[i]);
            }
          }
        }
      }

      // at a longer interval, the DataNode reports every block it manages
      if (startTime - lastBlockReport > blockReportInterval) {
        if (data.isAsyncBlockReportReady()) {
          // Create block report
          long brCreateStartTime = now();
          Block[] bReport = data.retrieveAsyncBlockReport();

          // Send block report
          long brSendStartTime = now();
          DatanodeCommand cmd = namenode.blockReport(dnRegistration,
                  BlockListAsLongs.convertToArrayLongs(bReport));

          // Log the block report processing stats from Datanode perspective
          long brSendCost = now() - brSendStartTime;
          long brCreateCost = brSendStartTime - brCreateStartTime;
          myMetrics.addBlockReport(brSendCost);
          LOG.info("BlockReport of " + bReport.length
              + " blocks took " + brCreateCost + " msec to generate and "
              + brSendCost + " msecs for RPC and NN processing");

          // If we have sent the first block report, then wait a random
          // time before we start the periodic block reports.
          if (resetBlockReportTime) {
            lastBlockReport = startTime -
                R.nextInt((int)(blockReportInterval));
            resetBlockReportTime = false;
          } else {
            /* say the last block report was at 8:20:14. The current report
             * should have started around 9:20:14 (default 1 hour interval).
             * If current time is :
             *   1) normal like 9:20:18, next report should be at 10:20:14
             *   2) unexpected like 11:35:43, next report should be at
             *      12:20:14
             */
            lastBlockReport += (now() - lastBlockReport) /
                                blockReportInterval * blockReportInterval;
          }
          processCommand(cmd);
        } else {
          data.requestAsyncBlockReport();
          if (lastBlockReport > 0) { // this isn't the first report
            long waitingFor =
                startTime - lastBlockReport - blockReportInterval;
            String msg = "Block report is due, and been waiting for it for " +
                (waitingFor/1000) + " seconds...";
            if (waitingFor > LATE_BLOCK_REPORT_WARN_THRESHOLD) {
              LOG.warn(msg);
            } else if (waitingFor > LATE_BLOCK_REPORT_INFO_THRESHOLD) {
              LOG.info(msg);
            } else if (LOG.isDebugEnabled()) {
              LOG.debug(msg);
            }
          }
        }
      }

      // start block scanner
      if (blockScanner != null && blockScannerThread == null &&
          upgradeManager.isUpgradeCompleted()) {
        LOG.info("Starting Periodic block scanner");
        blockScannerThread = new Daemon(blockScanner);
        blockScannerThread.start();
      }

      //
      // There is no work to do;  sleep until hearbeat timer elapses,
      // or work arrives, and then iterate again.
      //
      long waitTime = heartBeatInterval - (System.currentTimeMillis() - lastHeartbeat);
      synchronized(receivedBlockList) {
        if (waitTime > 0 && receivedBlockList.size() == 0) {
          try {
            receivedBlockList.wait(waitTime);
          } catch (InterruptedException ie) {
          }
          delayBeforeBlockReceived();
        }
      } // synchronized
    } catch(RemoteException re) {
      String reClass = re.getClassName();
      if (UnregisteredDatanodeException.class.getName().equals(reClass) ||
          DisallowedDatanodeException.class.getName().equals(reClass) ||
          IncorrectVersionException.class.getName().equals(reClass)) {
        LOG.warn("DataNode is shutting down: " +
                 StringUtils.stringifyException(re));
        shutdown();
        return;
      }
      LOG.warn(StringUtils.stringifyException(re));
    } catch (IOException e) {
      LOG.warn(StringUtils.stringifyException(e));
    }
  } // while (shouldRun)
} // offerService

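The code above starts with the heartbeat: every heartBeatInterval the DataNode sends a heartbeat to the NameNode. Commands from the NameNode to the DataNode are returned with the heartbeat, which is why DatanodeProtocol.sendHeartbeat() returns a DatanodeCommand[]. DataNode.offerService() reports blocks to the NameNode through two methods, DatanodeProtocol.blockReceived() and DatanodeProtocol.blockReport(), which are handled on the server by NameNode.blockReceived() and NameNode.blockReport() respectively. The NameNode-side handling of block reports is fairly involved and will be covered in a later analysis of block management on the NameNode.

DatanodeProtocol.blockReport() takes two parameters. The first is the DataNode's registration object; the second is a long array holding information about every block on that DataNode. Why a long array rather than a Block array? Block has three member variables, all of type long, and BlockListAsLongs.convertToArrayLongs() flattens the three long values of every Block into a single array. The server then does not need to create Block objects when it receives the report. Because three values are sent per Block, the length of the array is always a multiple of 3. The details are in the BlockListAsLongs class.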
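The flattening described above amounts to writing three consecutive longs per block. The following sketch shows the idea in isolation; it is not the BlockListAsLongs implementation. BlockPacking and its methods are hypothetical, and the three field names (id, numBytes, generationStamp) and their order in the array are assumptions made for the example.

```java
/** Hypothetical sketch of packing blocks into a flat long array. */
final class BlockPacking {
    static final int LONGS_PER_BLOCK = 3;

    /** Stand-in for Block; the field names are assumed for illustration. */
    static final class Block {
        final long id;
        final long numBytes;
        final long generationStamp;
        Block(long id, long numBytes, long generationStamp) {
            this.id = id;
            this.numBytes = numBytes;
            this.generationStamp = generationStamp;
        }
    }

    /** Flattens the blocks so that block i occupies slots 3*i .. 3*i+2. */
    static long[] toLongs(Block[] blocks) {
        long[] packed = new long[blocks.length * LONGS_PER_BLOCK];
        for (int i = 0; i < blocks.length; i++) {
            int base = i * LONGS_PER_BLOCK;
            packed[base] = blocks[i].id;
            packed[base + 1] = blocks[i].numBytes;
            packed[base + 2] = blocks[i].generationStamp;
        }
        return packed;
    }

    /** Number of blocks represented by a packed array. */
    static int blockCount(long[] packed) {
        return packed.length / LONGS_PER_BLOCK;
    }

    /** Reads one field of one block directly, without creating a Block object. */
    static long fieldAt(long[] packed, int blockIndex, int field) {
        return packed[blockIndex * LONGS_PER_BLOCK + field];
    }

    public static void main(String[] args) {
        Block[] report = { new Block(101L, 4096L, 7L), new Block(102L, 1024L, 9L) };
        long[] packed = toLongs(report);
        System.out.println(packed.length);          // 6
        System.out.println(fieldAt(packed, 1, 1));  // 1024
    }
}
```

The receiver can read any field by index arithmetic alone, which is the saving the paragraph above refers to.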

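Reading this method raises a question. offerService() is called from DataNode.run(), that is, as soon as the DataNode thread starts. The while loop in DataNode.run() is conditioned on shouldRun, and so is the loop in offerService(). Isn't that redundant?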
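Separately, one piece of arithmetic in offerService() is worth working through: the statement that advances lastBlockReport by a whole number of intervals. The sketch below replays the two scenarios from the source comment (last report at 8:20:14, one-hour interval). The class BlockReportSchedule and the helper atClock() are hypothetical; only the realignment formula is taken from the code above.

```java
/** Hypothetical worked example of the block report realignment formula. */
final class BlockReportSchedule {

    /** Same arithmetic as in offerService(): integer division drops the partial interval. */
    static long realign(long lastReport, long now, long interval) {
        return lastReport + (now - lastReport) / interval * interval;
    }

    /** Milliseconds since midnight for a wall-clock time. */
    static long atClock(int hours, int minutes, int seconds) {
        return ((hours * 60L + minutes) * 60L + seconds) * 1000L;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long last = atClock(8, 20, 14);

        // 1) normal: the report runs at 9:20:18, slightly late
        long normal = realign(last, atClock(9, 20, 18), hour);
        System.out.println(normal == atClock(9, 20, 14));        // true
        System.out.println(normal + hour == atClock(10, 20, 14)); // true: next report due

        // 2) unexpected: the report only runs at 11:35:43
        long late = realign(last, atClock(11, 35, 43), hour);
        System.out.println(late == atClock(11, 20, 14));         // true
        System.out.println(late + hour == atClock(12, 20, 14));  // true: next report due
    }
}
```

Because only whole intervals are added, the reports stay aligned to the original :20:14 phase no matter how late a particular report runs.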

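Heartbeat

A DataNode periodically sends heartbeats to the NameNode to signal that it is working normally. If the NameNode receives no heartbeat from a DataNode for a long time, it considers that DataNode dead. When the NameNode has commands for the DataNode to execute, it returns them from the method that handles the heartbeat, that is, as the return value of DatanodeProtocol.sendHeartbeat().

On the NameNode, heartbeats are handled by NameNode.sendHeartbeat(), which takes six parameters:

nodeReg: the registration information of the DataNode on the NameNode;
capacity: the storage capacity of the DataNode;
dfsUsed: the space the DataNode currently has in use;
remaining: the space remaining on the DataNode;
xmitsInProgress: the number of threads on the DataNode currently copying blocks;
xceiverCount: the number of service threads in the DataNode's DataXceiverServer.

NameNode.sendHeartbeat() first calls NameNode.verifyRequest() to compare the build versions of the DataNode and the NameNode and to check that the DataNode has registered with this NameNode. It then calls FSNamesystem.handleHeartbeat() to process the heartbeat. The code of FSNamesystem.handleHeartbeat() is: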

DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
    long capacity, long dfsUsed, long remaining,
    int xceiverCount, int xmitsInProgress) throws IOException {
  DatanodeCommand cmd = null;
  synchronized (heartbeats) {
    synchronized (datanodeMap) {
      DatanodeDescriptor nodeinfo = null;
      try {
        nodeinfo = getDatanode(nodeReg); // look up this DataNode in datanodeMap
      } catch(UnregisteredDatanodeException e) {
        return new DatanodeCommand[]{DatanodeCommand.REGISTER};
      }

      // Check if this datanode should actually be shutdown instead.
      // If its state is AdminStates.DECOMMISSIONED, the node is not
      // allowed to connect to the NameNode.
      if (nodeinfo != null && shouldNodeShutdown(nodeinfo)) {
        setDatanodeDead(nodeinfo);
        throw new DisallowedDatanodeException(nodeinfo);
      }

      if (nodeinfo == null || !nodeinfo.isAlive) {
        return new DatanodeCommand[]{DatanodeCommand.REGISTER};
      }

      updateStats(nodeinfo, false); // subtract what this DataNode reported in its previous heartbeat
      nodeinfo.updateHeartbeat(capacity, dfsUsed, remaining, xceiverCount); // update the DataNode's figures
      updateStats(nodeinfo, true);  // add what this DataNode reported this time

      //check lease recovery
      cmd = nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
      if (cmd != null) {
        return new DatanodeCommand[] {cmd};
      }

      // commands to return
      ArrayList<DatanodeCommand> cmds = new ArrayList<DatanodeCommand>();
      //check pending replication (replica copy command)
      cmd = nodeinfo.getReplicationCommand(
            maxReplicationStreams - xmitsInProgress);
      if (cmd != null) {
        cmds.add(cmd);
      }
      //check block invalidation (block deletion command)
      cmd = nodeinfo.getInvalidateBlocks(blockInvalidateLimit);
      if (cmd != null) {
        cmds.add(cmd);
      }
      // check access key update
      if (isAccessTokenEnabled && nodeinfo.needKeyUpdate) {
        cmds.add(new KeyUpdateCommand(accessTokenHandler.exportKeys()));
        nodeinfo.needKeyUpdate = false;
      }
      // check for balancer bandwidth update
      if (nodeinfo.getBalancerBandwidth() > 0) {
        cmds.add(new BalancerBandwidthCommand(nodeinfo.getBalancerBandwidth()));
        // set back to 0 to indicate that datanode has been sent the new value
        nodeinfo.setBalancerBandwidth(0);
      }
      if (!cmds.isEmpty()) {
        return cmds.toArray(new DatanodeCommand[cmds.size()]);
      }
    }
  }

  //check distributed upgrade
  cmd = getDistributedUpgradeCommand();
  if (cmd != null) {
    return new DatanodeCommand[] {cmd};
  }
  return null;
}

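The method first looks up the registration record for the DataNode in datanodeMap (there may be none) to determine whether the node has registered. If it has not, the method returns DatanodeCommand.REGISTER to tell the DataNode to register. It then uses FSNamesystem.shouldNodeShutdown() to check whether the DataNode is in the AdminStates.DECOMMISSIONED state. If so, the node is not allowed to connect to the NameNode: a DataNode that has been decommissioned stays in the decommissioned state, and its heartbeat is rejected as invalid.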
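The order of these checks can be condensed into a small decision function. This is a simplified sketch, not the Hadoop implementation: HeartbeatDecision, NodeInfo, and the Reply values are hypothetical, and the DISALLOWED result stands in for the DisallowedDatanodeException that handleHeartbeat() throws.

```java
/** Hypothetical condensation of the checks at the top of handleHeartbeat(). */
final class HeartbeatDecision {
    enum Reply { REGISTER, DISALLOWED, ACCEPT }

    /** Stand-in for the NameNode's per-DataNode record. */
    static final class NodeInfo {
        final boolean alive;
        final boolean decommissioned;
        NodeInfo(boolean alive, boolean decommissioned) {
            this.alive = alive;
            this.decommissioned = decommissioned;
        }
    }

    /** A null argument models "no record found for this DataNode". */
    static Reply decide(NodeInfo info) {
        if (info == null) {
            return Reply.REGISTER;      // unknown node: ask it to register
        }
        if (info.decommissioned) {
            return Reply.DISALLOWED;    // the real code throws an exception here
        }
        if (!info.alive) {
            return Reply.REGISTER;      // known but marked dead: register again
        }
        return Reply.ACCEPT;            // go on to update statistics and collect commands
    }

    public static void main(String[] args) {
        System.out.println(decide(null));                       // REGISTER
        System.out.println(decide(new NodeInfo(true, true)));   // DISALLOWED
        System.out.println(decide(new NodeInfo(false, false))); // REGISTER
        System.out.println(decide(new NodeInfo(true, false)));  // ACCEPT
    }
}
```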

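Next, the data carried by the heartbeat is used to update the load statistics of the whole HDFS cluster. FSNamesystem.updateStats() is called twice. The first call subtracts the figures this DataNode reported in its previous heartbeat. The node's updateHeartbeat() method then stores the newly reported figures on the DataNode's record, and the second updateStats() call adds those figures back in. After these three steps the cluster-wide statistics are up to date.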
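The subtract, update, add sequence keeps running totals correct without rescanning every node. The sketch below shows the pattern with three counters; it is an illustration under simplifying assumptions, not the FSNamesystem code. ClusterStats, NodeStats, and onHeartbeat() are hypothetical names, and only capacity, used, and remaining space are tracked.

```java
/** Hypothetical sketch of maintaining cluster totals from per-node heartbeats. */
final class ClusterStats {
    private long capacityTotal;
    private long usedTotal;
    private long remainingTotal;

    /** Stand-in for the per-DataNode record; starts at zero like a fresh descriptor. */
    static final class NodeStats {
        long capacity;
        long used;
        long remaining;
    }

    /** Adds (add == true) or subtracts (add == false) one node's current figures. */
    private void updateStats(NodeStats node, boolean add) {
        long sign = add ? 1L : -1L;
        capacityTotal += sign * node.capacity;
        usedTotal += sign * node.used;
        remainingTotal += sign * node.remaining;
    }

    /** Replaces the node's old contribution to the totals with the new one. */
    void onHeartbeat(NodeStats node, long capacity, long used, long remaining) {
        updateStats(node, false);   // remove the previous heartbeat's figures
        node.capacity = capacity;   // store the newly reported figures
        node.used = used;
        node.remaining = remaining;
        updateStats(node, true);    // add the new figures
    }

    long capacityTotal() { return capacityTotal; }
    long usedTotal() { return usedTotal; }
    long remainingTotal() { return remainingTotal; }

    public static void main(String[] args) {
        ClusterStats stats = new ClusterStats();
        NodeStats dn1 = new NodeStats();
        NodeStats dn2 = new NodeStats();
        stats.onHeartbeat(dn1, 100, 30, 70);
        stats.onHeartbeat(dn2, 200, 0, 200);
        stats.onHeartbeat(dn1, 100, 50, 50);  // dn1 has written 20 more units
        System.out.println(stats.capacityTotal());   // 300
        System.out.println(stats.usedTotal());       // 50
        System.out.println(stats.remainingTotal());  // 250
    }
}
```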

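Finally, the method collects the commands that have accumulated for this DataNode since its previous heartbeat, such as instructions to delete blocks or to replicate blocks, and returns them.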

Reference

《Hadoop技术内幕：深入理解Hadoop Common和HDFS架构设计与实现原理》

0 0