Hadoop Source Code Illustrated: NetworkTopology

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 00:50

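NetworkTopology defines a member variable clusterMap, an InnerNode that serves as the root of the tree. Its methods work by calling the corresponding InnerNode method through clusterMap and then updating a few other fields.

For example, NetworkTopology.add(Node) is carried out by clusterMap.add(node), after which NetworkTopology's fields numOfRacks and depthOfAllLeaves are updated.

All of the methods below run inside try/finally while holding the lock netLock. To save space, the try and finally statements are omitted from the listings.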
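The omitted locking has the usual acquire / try / finally shape. A minimal sketch, assuming netLock is a `java.util.concurrent.locks.ReadWriteLock` (the article only names the lock, so the lock type, the class `LockedTopologySketch`, and its two methods are assumptions for illustration):

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for NetworkTopology; only the locking shape matters here.
class LockedTopologySketch {
    // Assumption: a read/write lock. The article only says "the lock netLock".
    private final ReadWriteLock netLock = new ReentrantReadWriteLock();
    private int numOfRacks = 0;

    // A mutating method takes the write lock.
    void addRack() {
        netLock.writeLock().lock();
        try {
            numOfRacks++;
        } finally {
            netLock.writeLock().unlock(); // released even if the body throws
        }
    }

    // A read-only method takes the read lock.
    int getNumOfRacks() {
        netLock.readLock().lock();
        try {
            return numOfRacks;
        } finally {
            netLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedTopologySketch t = new LockedTopologySketch();
        t.addRack();
        t.addRack();
        System.out.println(t.getNumOfRacks()); // prints 2
    }
}
```

Releasing in finally is what makes it safe to drop the boilerplate from the listings: the body between lock and unlock is the only part that differs from method to method.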

    /** Check if the tree contains Node node */
    public boolean contains(Node node) {
        Node parent = node.getParent();
        for (int level = node.getLevel(); parent != null && level > 0;
             parent = parent.getParent(), level--) {
            if (parent == clusterMap)
                return true;
        }
        return false;
    }

    /** Check if two nodes are on the same rack */
    public boolean isOnSameRack(Node node1, Node node2) {
        return node1.getParent() == node2.getParent();
    }

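contains: starting from the leaf node, walk upward one parent at a time. If the walk reaches the root node, i.e. clusterMap, the node is part of the topology.

isOnSameRack: if two (leaf/DataNode) nodes have the same parent, they are on the same rack.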
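Both checks can be tried on a toy tree. The sketch below assumes a hypothetical `TreeNodeSketch` class that carries only a parent link and a level; the real Node and InnerNode types hold more state, and the host and rack names are made up for the example.

```java
// Hypothetical minimal tree node, for illustration only.
class TreeNodeSketch {
    final String name;
    final TreeNodeSketch parent;
    final int level; // the root is level 0

    TreeNodeSketch(String name, TreeNodeSketch parent) {
        this.name = name;
        this.parent = parent;
        this.level = (parent == null) ? 0 : parent.level + 1;
    }

    // Same walk as contains(): climb at most `level` parents looking for the root.
    static boolean contains(TreeNodeSketch root, TreeNodeSketch node) {
        TreeNodeSketch p = node.parent;
        for (int level = node.level; p != null && level > 0; p = p.parent, level--) {
            if (p == root) return true;
        }
        return false;
    }

    // Two leaves are on the same rack exactly when they share a parent.
    static boolean isOnSameRack(TreeNodeSketch a, TreeNodeSketch b) {
        return a.parent == b.parent;
    }

    public static void main(String[] args) {
        TreeNodeSketch root = new TreeNodeSketch("", null);
        TreeNodeSketch d1 = new TreeNodeSketch("d1", root);
        TreeNodeSketch r1 = new TreeNodeSketch("r1", d1);
        TreeNodeSketch r2 = new TreeNodeSketch("r2", d1);
        TreeNodeSketch h1 = new TreeNodeSketch("h1", r1);
        TreeNodeSketch h2 = new TreeNodeSketch("h2", r1);
        TreeNodeSketch h3 = new TreeNodeSketch("h3", r2);
        // A node hanging off a different root is not part of this tree.
        TreeNodeSketch stray = new TreeNodeSketch("stray", new TreeNodeSketch("other", null));

        System.out.println(contains(root, h1));    // true
        System.out.println(contains(root, stray)); // false
        System.out.println(isOnSameRack(h1, h2));  // true
        System.out.println(isOnSameRack(h1, h3));  // false
    }
}
```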

getDistance

    /** Return the distance between two nodes.
     * It is assumed that the distance from one node to its parent is 1.
     * The distance between two nodes is calculated by summing up their distances
     * to their closest common ancestor. */
    public int getDistance(Node node1, Node node2) {
        if (node1 == node2) return 0;
        Node n1 = node1, n2 = node2;
        int dis = 0;
        int level1 = node1.getLevel(), level2 = node2.getLevel();
        // Lift the deeper node until both are at the same level.
        while (n1 != null && level1 > level2) {
            n1 = n1.getParent();
            level1--;
            dis++;
        }
        while (n2 != null && level2 > level1) {
            n2 = n2.getParent();
            level2--;
            dis++;
        }
        // Climb both sides together until they share a parent.
        while (n1 != null && n2 != null && n1.getParent() != n2.getParent()) {
            n1 = n1.getParent();
            n2 = n2.getParent();
            dis += 2;
        }
        // The final +2 counts the two edges up to the shared parent.
        return dis + 2;
    }

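Distance between nodes: the number of edges traversed in the topology graph when going from one node to the other.

In the network topology figure, every leaf node is at level 3, so the work is concentrated in the third while loop. Take <h1, h3> as the example for walking through the distance calculation: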
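A sketch of that walk, assuming a hypothetical topology in which h1 and h2 sit on /d1/r1, h3 on /d1/r2, and h4 on /d2/r3, so that every leaf is at level 3. The rack assignment and the class `DistanceSketch` are assumptions for illustration, not taken from the figure.

```java
class DistanceSketch {
    // Hypothetical minimal node: only the parent link and the level are needed.
    static final class N {
        final String name;
        final N parent;
        final int level; // the root is level 0

        N(String name, N parent) {
            this.name = name;
            this.parent = parent;
            this.level = (parent == null) ? 0 : parent.level + 1;
        }
    }

    // The same three loops as getDistance in the listing.
    static int getDistance(N node1, N node2) {
        if (node1 == node2) return 0;
        N n1 = node1, n2 = node2;
        int dis = 0;
        int level1 = node1.level, level2 = node2.level;
        while (n1 != null && level1 > level2) { n1 = n1.parent; level1--; dis++; }
        while (n2 != null && level2 > level1) { n2 = n2.parent; level2--; dis++; }
        while (n1 != null && n2 != null && n1.parent != n2.parent) {
            n1 = n1.parent;
            n2 = n2.parent;
            dis += 2;
        }
        return dis + 2;
    }

    public static void main(String[] args) {
        N root = new N("", null);
        N d1 = new N("d1", root), d2 = new N("d2", root);
        N r1 = new N("r1", d1), r2 = new N("r2", d1), r3 = new N("r3", d2);
        N h1 = new N("h1", r1), h2 = new N("h2", r1);
        N h3 = new N("h3", r2), h4 = new N("h4", r3);

        // Same rack: the third loop does not run, result is 0 + 2.
        System.out.println(getDistance(h1, h2)); // 2
        // <h1, h3>: parents r1 != r2, so climb once (dis = 2);
        // then r1 and r2 share parent d1, loop stops, result is 2 + 2.
        System.out.println(getDistance(h1, h3)); // 4
        // Different data centers: climb twice (dis = 4), then the root is shared.
        System.out.println(getDistance(h1, h4)); // 6
    }
}
```

With all leaves at the same level the first two loops never run, which matches the remark that only the third loop matters for this example.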

pseudoSortByDistance

    /* swap two array items */
    static private void swap(Node[] nodes, int i, int j) {
        Node tempNode;
        tempNode = nodes[j];
        nodes[j] = nodes[i];
        nodes[i] = tempNode;
    }

    /** Sort nodes array by their distances to reader.
     * It linearly scans the array; if a local node is found, swap it with the first element of the array.
     * If a local rack node is found, swap it with the first element following the local node.
     * If neither local node nor local rack node is found, put a random replica location at position 0.
     * It leaves the rest of the nodes untouched. */
    public void pseudoSortByDistance(Node reader, Node[] nodes) {
        int tempIndex = 0; // local-node flag: set to 1 once a local node is found
        if (reader != null) {
            int localRackNode = -1; // index in nodes of the node on the reader's rack; -1 means none found
            // scan the array to find the local node & local rack node
            for (int i = 0; i < nodes.length; i++) {
                if (tempIndex == 0 && reader == nodes[i]) { // [01] local node found in the array
                    // [02] swap the local node to position 0; needed only if it is not already there
                    if (i != 0) swap(nodes, 0, i);
                    tempIndex = 1; // [03] the array has a local node
                    if (localRackNode != -1) { // [04] a rack-local node was already seen before the local node
                        if (localRackNode == 0) { // [05] that rack-local node was the first element
                            localRackNode = i; // [06] the swap at [02] moved it to index i, so track it there
                        }
                        break; // [07] a rack-local node preceded the local node (first element or not): stop scanning
                    }
                } else if (localRackNode == -1 && isOnSameRack(reader, nodes[i])) { // [08] local rack node found
                    localRackNode = i; // [09] remember the index of the rack-local node
                    if (tempIndex != 0) break; // [10] the local node was already found too: stop scanning
                }
            }

            // [11] swap the local rack node and the node at position tempIndex
            if (localRackNode != -1 && localRackNode != tempIndex) {
                swap(nodes, tempIndex, localRackNode); // [12]
                tempIndex++;
            }
        }

        // [13] put a random node at position 0 if it is not a local/local-rack node
        if (tempIndex == 0 && nodes.length != 0) {
            swap(nodes, 0, r.nextInt(nodes.length)); // [14]
        }
    }

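Assume reader=h0 (to line up with zero-based indexes, the first DataNode is called h0) and the array of DataNodes to sort contains h0, h1, h2. The h0 in the array is the local node. As the walkthroughs below show, h1 is on the reader's rack and h2 is not.

There are 6 (3! = 3*2*1 = 6) possible orderings of the array: [h0, h1, h2] [h0, h2, h1] [h1, h0, h2] [h1, h2, h0] [h2, h0, h1] [h2, h1, h0]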

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

A. [h0, h1, h2]

[01] nodes[0]=h0=reader: the array contains the local node

[02] i=0, so the local node does not need to be swapped to the front; it is already there

[03] tempIndex=1

[04] localRackNode=-1 ... move on to the next node in the array, h1

[08] h1 is on the same rack as reader.

[09] localRackNode=1: the node on the reader's rack is at index 1, i.e. nodes[1].

[10] tempIndex=1, break

------------------------------------------------------------------------------------------------------------------------------------------------------------------

nodes: [h0, h1, h2]

localRackNode = 1: the rack-local node is nodes[1]=h1

tempIndex = 1: a local node exists

[..] Since localRackNode = tempIndex = 1, none of the remaining steps run. Result: [h0, h1, h2]

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

B. [h0, h2, h1]

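[01] nodes[0]=h0=reader: the array contains the local node

[02] i=0, so the local node does not need to be swapped to the front; it is already there

[03] tempIndex=1

[04] localRackNode=-1 ... move on to the next node in the array, h2

[..] h2 is neither the local node nor on the local rack. Move on to the next node in the array, h1

[08] h1 is on the same rack as reader.

[09] localRackNode=2: the node on the reader's rack is at index 2, i.e. nodes[2].

[10] tempIndex=1, break

------------------------------------------------------------------------------------------------------------------------------------------------------------------

nodes: [h0, h2, h1]

localRackNode = 2: the rack-local node is nodes[2]=h1

tempIndex = 1: a local node exists

[11] localRackNode != tempIndex

[12] Swap the elements at localRackNode and tempIndex, i.e. the nodes at indexes 2 (the value of localRackNode) and 1 (the value of tempIndex). h2 and h1 trade places: [h0, h1, h2]

PS: when the nodes are swapped, do the values of localRackNode and tempIndex get swapped too? No. These two method-local variables only act as markers; the goal is to reorder the array.

[..] tempIndex++, so [13] does not run. Result: [h0, h1, h2]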

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

C. [h1, h0, h2]

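[08] h1 is on the same rack as reader.

[09] localRackNode=0: the node on the reader's rack is at index 0, i.e. nodes[0].

[10] tempIndex=0, so no break

[01] nodes[1]=h0=reader: the array contains the local node

[02] i=1, swap the first node (h1) with the local node (h0): [h0, h1, h2]

[03] tempIndex=1

[04][05] localRackNode=0, i.e. the first element of the array was on the reader's rack (the local node was not first).

[06] Set localRackNode=i=1. The rack-local node (formerly first) was swapped with the local node, so its index is now the index the local node had before the swap.

[07] break: whenever a rack-local node precedes the local node, whether or not it was the first element, the loop exits.

------------------------------------------------------------------------------------------------------------------------------------------------------------------

nodes: [h1, h0, h2] -> [h0, h1, h2]

localRackNode = 1: the rack-local node is nodes[1]=h1

tempIndex = 1: a local node exists

localRackNode = tempIndex, same as A: [h0, h1, h2]. Result: [h0, h1, h2]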

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

D. [h1, h2, h0]

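[08] h1 is on the same rack as reader.

[09] localRackNode=0: the node on the reader's rack is at index 0, i.e. nodes[0].

[10] tempIndex=0, so no break

[..] h2 is neither the local node nor on the local rack. Move on to the next node in the array, h0

[01] nodes[2]=h0=reader: the array contains the local node

[02] i=2, swap the first node (h1) with the local node (h0): [h0, h2, h1]

[03] tempIndex=1

[04][05] localRackNode=0, i.e. the first element of the array was on the reader's rack (the local node was not first).

[06] Set localRackNode=i=2. The rack-local node (formerly first) was swapped with the local node, so its index is now the index the local node had before the swap.

[07] break

------------------------------------------------------------------------------------------------------------------------------------------------------------------

nodes: [h1, h2, h0] -> [h0, h2, h1]

localRackNode = 2: the rack-local node is nodes[2]=h1

tempIndex = 1: a local node exists

localRackNode != tempIndex, same as B: [h0, h2, h1] -> [h0, h1, h2]. Result: [h0, h1, h2]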

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

E. [h2, h0, h1]

[..] h2 is neither the local node nor on the local rack. Move on to the next node in the array, h0

[01] nodes[1]=h0=reader: the array contains the local node

[02] i=1!=0, swap the first node (h2) with the local node (h0): [h0, h2, h1]

[03] tempIndex=1

[04] localRackNode=-1 ... move on to the next node in the array, h1

[08] h1 is on the same rack as reader.

[09] localRackNode=2: the node on the reader's rack is at index 2, i.e. nodes[2].

[10] tempIndex=1, break

------------------------------------------------------------------------------------------------------------------------------------------------------------------

nodes: [h2, h0, h1] -> [h0, h2, h1]

localRackNode = 2: the rack-local node is nodes[2]=h1

tempIndex = 1: a local node exists

localRackNode != tempIndex, same as B: [h0, h2, h1] -> [h0, h1, h2]. Result: [h0, h1, h2]

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

F. [h2, h1, h0]

[..] h2 is neither the local node nor on the local rack. Move on to the next node in the array, h1

[08] h1 is on the same rack as reader.

[09] localRackNode=1: the node on the reader's rack is at index 1, i.e. nodes[1].

[10] tempIndex=0, so no break

[01] nodes[2]=h0=reader: the array contains the local node

[02] i=2!=0, swap the first node (h2) with the local node (h0): [h0, h1, h2]

[03] tempIndex=1

[04] localRackNode=1

[07] break

------------------------------------------------------------------------------------------------------------------------------------------------------------------

nodes: [h2, h1, h0] -> [h0, h1, h2]

localRackNode = 1: the rack-local node is nodes[1]=h1

tempIndex = 1: a local node exists

localRackNode = tempIndex, same as A: [h0, h1, h2]. Result: [h0, h1, h2]

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

**[*, *, *] -> [h0, h1, h2]**

When the array contains both the local node and a rack-local node, it always ends up ordered as [h0, h1, h2].
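The six walkthroughs can be checked mechanically. The sketch below copies the logic of pseudoSortByDistance onto a hypothetical `Host` type that stores a rack path as a string; `Host`, the rack names, and the class `PseudoSortSketch` are assumptions for illustration, standing in for the real Node objects.

```java
import java.util.Arrays;
import java.util.Random;

class PseudoSortSketch {
    // Hypothetical simplified node: a host name plus the rack it sits on.
    static final class Host {
        final String name;
        final String rack;
        Host(String name, String rack) { this.name = name; this.rack = rack; }
        @Override public String toString() { return name; }
    }

    static final Random r = new Random();

    static boolean isOnSameRack(Host a, Host b) { return a.rack.equals(b.rack); }

    static void swap(Host[] nodes, int i, int j) {
        Host tmp = nodes[j];
        nodes[j] = nodes[i];
        nodes[i] = tmp;
    }

    // Same control flow as the listing, steps [01]-[14].
    static void pseudoSortByDistance(Host reader, Host[] nodes) {
        int tempIndex = 0;
        if (reader != null) {
            int localRackNode = -1;
            for (int i = 0; i < nodes.length; i++) {
                if (tempIndex == 0 && reader == nodes[i]) {
                    if (i != 0) swap(nodes, 0, i);
                    tempIndex = 1;
                    if (localRackNode != -1) {
                        if (localRackNode == 0) localRackNode = i;
                        break;
                    }
                } else if (localRackNode == -1 && isOnSameRack(reader, nodes[i])) {
                    localRackNode = i;
                    if (tempIndex != 0) break;
                }
            }
            if (localRackNode != -1 && localRackNode != tempIndex) {
                swap(nodes, tempIndex, localRackNode);
                tempIndex++;
            }
        }
        if (tempIndex == 0 && nodes.length != 0) {
            swap(nodes, 0, r.nextInt(nodes.length));
        }
    }

    public static void main(String[] args) {
        Host h0 = new Host("h0", "/d1/r1"); // the reader
        Host h1 = new Host("h1", "/d1/r1"); // same rack as the reader
        Host h2 = new Host("h2", "/d1/r2"); // different rack
        Host[] all = {h0, h1, h2};
        int[][] perms = {{0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}};
        for (int[] p : perms) {
            Host[] nodes = {all[p[0]], all[p[1]], all[p[2]]};
            pseudoSortByDistance(h0, nodes);
            System.out.println(Arrays.toString(nodes)); // [h0, h1, h2] every time
        }
    }
}
```

The random swap never fires in these six runs because tempIndex is 1 or 2 by the time step [13] is reached.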

**Loop iterations:**

A. [h0, h1, h2]

C. [h1, h0, h2] -> [h0, h1, h2]

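If the first two elements of the array are the local node and the rack-local node (in either order), the loop breaks before looking at any later element. Only two iterations run and only two positions are settled, which is why this is called a pseudo sort.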

**tempIndex == localRackNode**

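When step [11] is reached, tempIndex is either 0 (no local node) or 1 (a local node exists).

localRackNode is the position in the array of the node on the reader's rack. It can be any index of the array.

If localRackNode equals tempIndex (0 or 1), the rack-local node is the first element nodes[0] or the second element nodes[1].

localRackNode = tempIndex = 0 means there is no local node and the rack-local node is nodes[0].

With no local node but a rack-local node present, the rack-local node belongs in the first position of the array, and it is already there. Note that with the code as listed, tempIndex is still 0 when [13] is reached in this case, so that check passes as well and nodes[0] may be swapped with a randomly chosen element.

localRackNode = tempIndex = 1 means the rack-local node is nodes[1] and the local node is nodes[0].

A local node exists, and the local node ranks ahead of the rack-local node, so the local node belongs in the first position of the array.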

**localRackNode !=-1, tempIndex != localRackNode**

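tempIndex=0, localRackNode=1,2,...: there is no local node, so the rack-local node belongs in the first position. A swap is performed that moves the rack-local node to nodes[0].

tempIndex=1, localRackNode=2,...: a local node exists, and because the local node ranks ahead of the rack-local node, the local node is already in the first position.

localRackNode must then be at least the third element, nodes[2] (it cannot be 0, which is taken by the local node, and it cannot be 1 because of the != constraint).

The rack-local node is therefore swapped into the second position, nodes[1], right behind the local node.

In other words, when both a local node and a rack-local node exist, the resulting order is local node, rack-local node, ...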
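The cases above depend only on the two flag values, so the code after the scan loop can be looked at in isolation. A small sketch, assuming a hypothetical helper `finalStep` that reports which swap steps [11] to [14] would perform instead of performing it:

```java
class FinalStepSketch {
    // Mirrors only steps [11]-[14]; the scan loop is not repeated here.
    // Returns a description of the swap that would happen for the given flags.
    static String finalStep(int tempIndex, int localRackNode, int length) {
        StringBuilder out = new StringBuilder();
        if (localRackNode != -1 && localRackNode != tempIndex) {          // [11]
            out.append("swap(").append(tempIndex).append(", ")
               .append(localRackNode).append(")");                        // [12]
            tempIndex++;
        }
        if (tempIndex == 0 && length != 0) {                              // [13]
            out.append("swap(0, random)");                                // [14]
        }
        return out.length() == 0 ? "no swap" : out.toString();
    }

    public static void main(String[] args) {
        System.out.println(finalStep(1, 1, 3));  // no swap: local node at 0, rack-local at 1
        System.out.println(finalStep(1, 2, 3));  // swap(1, 2): rack-local moves behind the local node
        System.out.println(finalStep(0, 2, 3));  // swap(0, 2): no local node, rack-local moves to the front
        System.out.println(finalStep(0, -1, 3)); // swap(0, random): neither kind of node was found
        System.out.println(finalStep(0, 0, 3));  // swap(0, random): tempIndex was never incremented
    }
}
```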

Test case 3 - pseudoSort

    public void printLocalNodeAndLocalRack(DatanodeDescriptor[] testNodes) {
        cluster.pseudoSortByDistance(dataNodes[0], testNodes);
        assertTrue(testNodes[0] == dataNodes[0]);
        assertTrue(testNodes[1] == dataNodes[1]);
        assertTrue(testNodes[2] == dataNodes[2]);
    }

    public void nodeIndex(DatanodeDescriptor[] testNodes, int[] nodeIndex) {
        for (int i = 0; i < 3; i++) {
            testNodes[i] = dataNodes[nodeIndex[i]];
        }
    }

    public void testPseudoSortByDistance() throws Exception {
        DatanodeDescriptor[] testNodes = new DatanodeDescriptor[3];
        int[][] index = new int[][]{
            {0,1,2}, {0,2,1}, {1,0,2}, {1,2,0}, {2,0,1}, {2,1,0},
        };
        for (int i = 0; i < index.length; i++) {
            nodeIndex(testNodes, index[i]);
            printLocalNodeAndLocalRack(testNodes);
        }
    }

    public void testPseudoSortByDistance() throws Exception {
        DatanodeDescriptor[] testNodes = new DatanodeDescriptor[3];

        // array contains both local node & local rack node
        testNodes[0] = dataNodes[1];
        testNodes[1] = dataNodes[2];
        testNodes[2] = dataNodes[0];
        cluster.pseudoSortByDistance(dataNodes[0], testNodes);
        assertTrue(testNodes[0] == dataNodes[0]);
        assertTrue(testNodes[1] == dataNodes[1]);
        assertTrue(testNodes[2] == dataNodes[2]);

        // array contains local node
        testNodes[0] = dataNodes[1];
        testNodes[1] = dataNodes[3];
        testNodes[2] = dataNodes[0];
        cluster.pseudoSortByDistance(dataNodes[0], testNodes);
        assertTrue(testNodes[0] == dataNodes[0]);
        assertTrue(testNodes[1] == dataNodes[1]);
        assertTrue(testNodes[2] == dataNodes[3]);

        // array contains local rack node
        testNodes[0] = dataNodes[5];
        testNodes[1] = dataNodes[3];
        testNodes[2] = dataNodes[1];
        cluster.pseudoSortByDistance(dataNodes[0], testNodes);
        assertTrue(testNodes[0] == dataNodes[1]);
        assertTrue(testNodes[1] == dataNodes[3]);
        assertTrue(testNodes[2] == dataNodes[5]);
    }