HDFS File Operation Streams (2): Read Operations

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 18:33

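As we know, HDFS stores files in blocks, and each block has multiple replicas. The replicas of a block are stored on different machines, and the blocks that make up a file are also spread across different DataNodes. So how does HDFS read a file? For example, when a client is about to read one block of a file and that block has several replicas, which replica should the client read?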

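In the previous article I described how HDFS implements the file write stream and how a write proceeds, so this article focuses on the read path. By walking through how this distributed file stream works, I hope to make it clear how HDFS reads a file.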

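Anyone who has used the HDFS API knows that to read a file we call the open(Path,int) method of DistributedFileSystem. It returns a stream object of type FSDataInputStream, whose actual runtime class is the subclass DFSDataInputStream. Let's first look at the class hierarchy diagram of these classes to get to know them better.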

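Since DistributedFileSystem's open method returns an object of type FSDataInputStream, I will not cover the three methods of DFSDataInputStream. Instead I will take FSDataInputStream's file-reading methods as examples and walk through how they are processed, to uncover step by step how HDFS reads a file.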

Based on how this stream is created, I drew the following sequence diagram:

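From the sequence diagram above we can see that when the underlying DFSInputStream is created, it calls its own openInfo() method. This internal method calls the remote method getBlockLocations() through the ClientProtocol protocol to fetch, from the NameNode, the locations of the first several blocks of the file being opened. The code is as follows: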

Here, the default value of prefetchSize is the size of ten blocks. It can also be set in the configuration file through dfs.read.prefetch.size.
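To make the prefetch window more concrete, here is a small sketch of my own that counts how many blocks a request for the range [offset, offset + prefetchSize) touches. The class and method names are hypothetical, and the 64 MB block size is only an assumed example value; in HDFS the actual lookup is done by the NameNode.

```java
// Hypothetical helper, not part of Hadoop: counts the blocks touched by a prefetch window.
class PrefetchWindow {
    // Number of blocks overlapping [offset, offset + prefetchSize), clipped to the file length.
    static long blocksCovered(long fileLength, long blockSize, long offset, long prefetchSize) {
        if (offset >= fileLength || prefetchSize <= 0) {
            return 0;
        }
        long end = Math.min(fileLength, offset + prefetchSize); // exclusive end of the window
        long firstBlock = offset / blockSize;
        long lastBlock = (end - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // assumed block size
        long prefetchSize = 10 * blockSize;   // the default described above
        long fileLength = 25 * blockSize;     // assumed file length
        // Window at the start of the file.
        System.out.println(blocksCovered(fileLength, blockSize, 0, prefetchSize));
        // Window near the end of the file, clipped by the file length.
        System.out.println(blocksCovered(fileLength, blockSize, 20 * blockSize, prefetchSize));
    }
}
```

Under these assumptions, a window that starts in the middle of a block can touch one more block than the window size alone suggests, and a window near the end of the file is clipped by the file length.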

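In fact, FSDataInputStream's read(long,byte[],int,int) and read(byte[],int,int) methods share the same underlying implementation; the former performs a random (positional) read and the latter a sequential read. Also, although read(long,byte[],int,int) is a random read, it does not affect the file's current read pointer once it completes. To give you an intuitive picture, here is a sequence diagram of the operation (using read(long,byte[],int,int) as the example):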
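The difference between the two calls can be illustrated with a toy in-memory stream. This is only a sketch of the semantics described above, not Hadoop code; the class ToyInputStream and its fields are names I made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

// Toy in-memory stream (not Hadoop code) contrasting sequential and positional reads.
class ToyInputStream {
    private final byte[] data;
    private long pos = 0; // the "file pointer" used by sequential reads

    ToyInputStream(byte[] data) {
        this.data = data;
    }

    long getPos() {
        return pos;
    }

    // Sequential read: starts at the file pointer and advances it.
    int read(byte[] buf, int off, int len) {
        int n = read(pos, buf, off, len);
        if (n > 0) {
            pos += n;
        }
        return n;
    }

    // Positional read: starts at the given position and leaves the file pointer alone.
    int read(long position, byte[] buf, int off, int len) {
        if (position >= data.length) {
            return -1;
        }
        int n = (int) Math.min(len, data.length - position);
        System.arraycopy(data, (int) position, buf, off, n);
        return n;
    }

    public static void main(String[] args) {
        ToyInputStream in = new ToyInputStream("0123456789".getBytes(StandardCharsets.US_ASCII));
        byte[] buf = new byte[4];
        in.read(buf, 0, 4);      // sequential: the pointer moves forward
        in.read(7L, buf, 0, 3);  // positional: the pointer is not touched
        System.out.println(in.getPos());
    }
}
```

After both calls the file pointer reflects only the sequential read, which is the behavior the paragraph above describes for read(long,byte[],int,int).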

The underlying stream implements read(byte[],int,int) as follows:

public synchronized int read(byte buf[], int off, int len) throws IOException {
      checkOpen();
      if (closed) {
        throw new IOException("Stream closed");
      }
      failures = 0;

      // Only read if the current file pointer has not reached the end of the file
      if (pos < getFileLength()) {
        int retries = 2;
        while (retries > 0) {
          try {

            if (pos > blockEnd) { // the file pointer is outside the current block, so seek to the block that contains it
               LOG.debug("the current file pos is not at current block,so start to seek the right block.");
              currentNode = blockSeekTo(pos);
            }
            int realLen = Math.min(len, (int) (blockEnd - pos + 1)); // number of bytes that can actually be read from this block

            int result = readBuffer(buf, off, realLen); // read the data

            if (result >= 0) {
              pos += result;
            } else {
              // got a EOS from reader though we expect more data on it.
              throw new IOException("Unexpected EOS from the reader");
            }
            if (stats != null && result != -1) {
              stats.incrementBytesRead(result);
            }
            return result;
          } catch (ChecksumException ce) {
            throw ce;
          } catch (IOException e) {
            if (retries == 1) {
              LOG.warn("DFS Read: " + StringUtils.stringifyException(e));
            }
            blockEnd = -1;
            if (currentNode != null) { addToDeadNodes(currentNode); }
            if (--retries == 0) {
              throw e;
            }
          }
        }
      }
      return -1;
    }

private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
      if (target >= getFileLength()) {
        throw new IOException("Attempted to read past end of file");
      }

      if ( blockReader != null ) {
        blockReader.close();
        blockReader = null;
      }

      if (s != null) {
        s.close();
        s = null;
      }

      // Get the location information of the block that contains file position target
      LocatedBlock targetBlock = getBlockAt(target);
      assert (target==this.pos) : "Wrong postion " + pos + " expect " + target;

      long offsetIntoBlock = target - targetBlock.getStartOffset(); // offset of file position target within that block

      DatanodeInfo chosenNode = null;
      while (s == null) {

        // Choose one of the block's replicas, i.e. one DataNode that holds this block
        DNAddrPair retval = chooseDataNode(targetBlock);

        chosenNode = retval.info;
        InetSocketAddress targetAddr = retval.addr;

        try {
          s = socketFactory.createSocket();
          LOG.debug("start to connect to datanode: "+targetAddr);
          NetUtils.connect(s, targetAddr, socketTimeout);
          s.setSoTimeout(socketTimeout);
          Block blk = targetBlock.getBlock();

          LOG.debug("create a BlockReader for Block["+blk.getBlockId()+"] of file["+src+"].");

          // Create a reader for this block
          blockReader = BlockReader.newBlockReader(s, src, blk.getBlockId(),blk.getGenerationStamp(),offsetIntoBlock, blk.getNumBytes() - offsetIntoBlock, buffersize, verifyChecksum, clientName);

          return chosenNode;

        } catch (IOException ex) {
          // Put chosen node into dead list, continue
          LOG.debug("Failed to connect to " + targetAddr + ":"
                    + StringUtils.stringifyException(ex));
          addToDeadNodes(chosenNode);
          if (s != null) {
            try {
              s.close();
            } catch (IOException iex) {
            }
          }
          s = null;
        }
      }
      return chosenNode;
    }
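
The listing above does not show chooseDataNode itself, so the following is only a simplified model of the idea behind it and of the deadNodes bookkeeping: take the first replica in the list that has not been marked dead. That selection rule, the ordering of the replica list, and the string node names are my assumptions for the illustration, not code taken from Hadoop.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model, not Hadoop code: pick the first replica that is not marked dead.
class ReplicaChooser {
    private final Set<String> deadNodes = new HashSet<>();

    void addToDeadNodes(String node) {
        deadNodes.add(node);
    }

    // replicas is assumed to be ordered by preference (for example, closest first).
    // Returns null when every replica is currently marked dead.
    String chooseDataNode(List<String> replicas) {
        for (String node : replicas) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ReplicaChooser chooser = new ReplicaChooser();
        List<String> replicas = Arrays.asList("dn1:50010", "dn2:50010", "dn3:50010");
        System.out.println(chooser.chooseDataNode(replicas)); // first choice
        chooser.addToDeadNodes("dn1:50010");                  // e.g. the connection attempt failed
        System.out.println(chooser.chooseDataNode(replicas)); // falls back to the next replica
    }
}
```

This mirrors the loop in blockSeekTo: a failed connection puts the chosen node on the dead list, and the next iteration asks for another replica.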

private synchronized int readBuffer(byte buf[], int off, int len)throws IOException {
      IOException ioe;

      boolean retryCurrentNode = true;

      while (true) {
        // retry as many times as seekToNewSource allows.
        try {
          return blockReader.read(buf, off, len); // read up to len bytes from the block reader
        } catch ( ChecksumException ce ) {
          LOG.warn("Found Checksum error for " + currentBlock + " from " +
                   currentNode.getName() + " at " + ce.getPos());
          reportChecksumFailure(src, currentBlock, currentNode);
          ioe = ce;
          retryCurrentNode = false;
        } catch ( IOException e ) {
          if (!retryCurrentNode) {
            LOG.warn("Exception while reading from " + currentBlock +
                     " of " + src + " from " + currentNode + ": " +
                     StringUtils.stringifyException(e));
          }
          ioe = e;
        }
        boolean sourceFound = false;
        if (retryCurrentNode) {
          sourceFound = seekToBlockSource(pos);
        } else {
          addToDeadNodes(currentNode);
          sourceFound = seekToNewSource(pos);
        }
        if (!sourceFound) {
          throw ioe;
        }
        retryCurrentNode = false;
      }
    }

private LocatedBlock getBlockAt(long offset) throws IOException {
      assert (locatedBlocks != null) : "locatedBlocks is null";

      // Look up, in the cache, the location of the block that contains file position offset
      int targetBlockIdx = locatedBlocks.findBlock(offset);

      if (targetBlockIdx < 0) { // not in the cache
        targetBlockIdx = LocatedBlocks.getInsertIndex(targetBlockIdx);
        // fetch more blocks
        LocatedBlocks newBlocks;

        // Ask the NameNode for the locations of the blocks that hold the content of file src from offset to offset+prefetchSize
        newBlocks = callGetBlockLocations(namenode, src, offset, prefetchSize);
        assert (newBlocks != null) : "Could not find target position " + offset;

        // Add the newly fetched block locations to the cache
        locatedBlocks.insertRange(targetBlockIdx, newBlocks.getLocatedBlocks());
      }

      LocatedBlock blk = locatedBlocks.get(targetBlockIdx);

      // Update the current file pointer and the current block information
      this.pos = offset;
      this.blockEnd = blk.getStartOffset() + blk.getBlockSize() - 1;
      this.currentBlock = blk.getBlock();

      return blk;
    }
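
The negative return value of findBlock and the call to getInsertIndex in getBlockAt follow the familiar binary-search convention: a miss is reported as -(insertionPoint) - 1. The toy cache below is my own sketch of that convention, not the LocatedBlocks class; each entry is just a {startOffset, size} pair, and the sizes are made-up values.

```java
import java.util.ArrayList;
import java.util.List;

// Toy block-location cache (not Hadoop code). Each entry is {startOffset, size}, sorted by startOffset.
class ToyBlockCache {
    private final List<long[]> blocks = new ArrayList<>();

    // Index of the cached block containing offset, or -(insertionPoint) - 1 if none does.
    int findBlock(long offset) {
        int lo = 0;
        int hi = blocks.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long start = blocks.get(mid)[0];
            long end = start + blocks.get(mid)[1]; // exclusive
            if (offset < start) {
                hi = mid - 1;
            } else if (offset >= end) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    // Converts a negative findBlock result back into the insertion point.
    static int getInsertIndex(int searchResult) {
        return -(searchResult + 1);
    }

    void insert(int index, long start, long size) {
        blocks.add(index, new long[] {start, size});
    }

    public static void main(String[] args) {
        ToyBlockCache cache = new ToyBlockCache();
        cache.insert(0, 0, 64);          // cached block covering [0, 64)
        int idx = cache.findBlock(100);  // offset 100 is not cached yet
        if (idx < 0) {
            idx = getInsertIndex(idx);
            cache.insert(idx, 64, 64);   // pretend the NameNode returned a block covering [64, 128)
        }
        System.out.println(idx + " " + cache.findBlock(100));
    }
}
```

Keeping the list sorted is what lets a miss tell the caller exactly where the newly fetched locations belong.
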
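From the source code above we can see that a single call to FSDataInputStream's read(long,byte[],int,int) or read(byte[],int,int) does not actively read across blocks. In other words, the underlying DFSInputStream only tries to read up to len bytes within the current block: it reads whatever the current block still has, until len is satisfied, and if the current block holds fewer than len bytes it does not jump to the next block to continue but simply returns. In addition, BlockReader is mainly used to receive the block data sent by a particular DataNode. Its implementation is simple, and you can read its source code if you are interested.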
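The practical consequence is that a caller who needs exactly len bytes has to loop. Here is a toy model of that behavior; it is not Hadoop code, and the class name, the in-memory data, and the tiny block size are assumptions chosen only to keep the example short. The realLen computation follows the same min(len, blockEnd - pos + 1) shape as the listing.

```java
// Toy model (not Hadoop code): a stream whose read() never crosses a block boundary.
class BlockBoundedStream {
    private final byte[] data;
    private final int blockSize;
    private int pos = 0;

    BlockBoundedStream(byte[] data, int blockSize) {
        this.data = data;
        this.blockSize = blockSize;
    }

    // Reads at most len bytes, stopping at the end of the block that contains pos.
    int read(byte[] buf, int off, int len) {
        if (pos >= data.length) {
            return -1;
        }
        // Index of the last byte of the current block (the final block may be short).
        int blockEnd = Math.min(data.length, (pos / blockSize + 1) * blockSize) - 1;
        int realLen = Math.min(len, blockEnd - pos + 1);
        System.arraycopy(data, pos, buf, off, realLen);
        pos += realLen;
        return realLen;
    }

    // Caller-side loop: keep reading until len bytes arrive or the stream ends.
    static int readAll(BlockBoundedStream in, byte[] buf, int off, int len) {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        BlockBoundedStream in = new BlockBoundedStream(new byte[10], 4); // blocks of 4, 4 and 2 bytes
        byte[] buf = new byte[10];
        System.out.println(in.read(buf, 0, 3));     // fits inside the first block
        System.out.println(in.read(buf, 3, 6));     // asks for 6, but the first block has 1 byte left
        System.out.println(readAll(in, buf, 4, 6)); // the loop crosses the remaining blocks
    }
}
```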

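Now let's take a brief look at the readFully(long,byte[],int,int) and readFully(long,byte[]) methods. Their underlying implementations are the same: both keep calling read(long,byte[],int,int) until len bytes have been read. Note that if the file does not contain len bytes starting from position, an exception is thrown. Their sequence diagram is as follows: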
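The loop just described can be sketched as follows. This is my own minimal version, not the Hadoop source: PositionedReader is a hypothetical stand-in for a stream that offers read(long,byte[],int,int), and throwing EOFException is my choice for "an exception" in this sketch.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

// Sketch of a readFully-style loop over a positional read (toy code, not Hadoop's).
class ReadFullySketch {
    // Hypothetical stand-in for a stream that supports read(long, byte[], int, int).
    interface PositionedReader {
        int read(long position, byte[] buf, int off, int len) throws IOException;
    }

    // Keeps calling the positional read until exactly len bytes have been copied into buf.
    static void readFully(PositionedReader in, long position, byte[] buf, int off, int len)
            throws IOException {
        int nread = 0;
        while (nread < len) {
            int n = in.read(position + nread, buf, off + nread, len - nread);
            if (n < 0) {
                throw new EOFException("End of file reached before reading fully.");
            }
            nread += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = {10, 11, 12, 13, 14, 15, 16, 17};
        // Toy reader that returns at most 3 bytes per call, to force several iterations.
        PositionedReader reader = (p, b, o, l) -> {
            if (p >= file.length) {
                return -1;
            }
            int n = (int) Math.min(Math.min(l, 3), file.length - p);
            System.arraycopy(file, (int) p, b, o, n);
            return n;
        };
        byte[] buf = new byte[5];
        readFully(reader, 2, buf, 0, 5);
        System.out.println(Arrays.toString(buf));
        try {
            readFully(reader, 6, buf, 0, 5); // only 2 bytes remain from position 6
        } catch (EOFException e) {
            System.out.println("EOF: " + e.getMessage());
        }
    }
}
```

Because each underlying read may return fewer bytes than requested, the loop advances both the file position and the buffer offset by the number of bytes actually read.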

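That concludes my introduction to the implementation of HDFS I/O streams. I hope it helps readers who are interested in improving HDFS file read and write speed.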