DataNode: An Analysis of BlockSender


1 The format of the data sent by BlockSender

BlockSender reads a block from the DataNode's disk and sends it to the receiver. Note that the data it sends is organized in a defined structure.


The data sent by BlockSender consists of two parts: a checksum header (ChecksumHeader) and a sequence of packets.

1.1 ChecksumHeader

The checksum header describes the checksum scheme used by the current DataNode. As shown below, it has two fields:

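CHECKSUM_TYPE: checksum type

There are three checksum types: null checksum, CRC32, and CRC32C. The type is encoded in 1 byte, with null checksum, CRC32, and CRC32C mapped to 0, 1, and 2 respectively.

BYTES_PER_CHECKSUM: checksum chunk size

The number of data bytes that produce one checksum value. Taking CRC32 as an example, every 512 bytes of data typically produce one 4-byte checksum. These 512 bytes are called a checksum chunk. The chunk is the smallest unit in which HDFS reads and writes block data.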
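To make the two header fields concrete, here is a minimal sketch that encodes and decodes such a header. The class `ChecksumHeaderSketch` and its methods are hypothetical and not part of Hadoop. The 1-byte type comes from the description above; the 4-byte big-endian width for BYTES_PER_CHECKSUM is an assumption based on how Hadoop's DataChecksum writes its header.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not a Hadoop class: models the checksum header as
// 1 byte of CHECKSUM_TYPE followed by BYTES_PER_CHECKSUM.
final class ChecksumHeaderSketch {
    static final int TYPE_NULL = 0;
    static final int TYPE_CRC32 = 1;
    static final int TYPE_CRC32C = 2;

    // Assumption: BYTES_PER_CHECKSUM is stored as a 4-byte big-endian int.
    static byte[] encode(int type, int bytesPerChecksum) {
        return ByteBuffer.allocate(5)
                .put((byte) type)
                .putInt(bytesPerChecksum)
                .array();
    }

    static int typeOf(byte[] header) {
        return header[0] & 0xFF;
    }

    static int bytesPerChecksumOf(byte[] header) {
        return ByteBuffer.wrap(header, 1, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] header = encode(TYPE_CRC32, 512);
        // Prints the decoded type id and chunk size: "1 512"
        System.out.println(typeOf(header) + " " + bytesPerChecksumOf(header));
    }
}
```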

1.2 Packet sequence (packets)

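BlockSender splits the block into a series of packets and sends them one after another. Once all data has been sent, it sends an empty packet to mark the end. Each packet consists of a variable-length packet header, checksum data, and some number of bytes of actual data.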

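1.2.1 Packet header

The packet header describes the current packet. It starts with a 4-byte total packet length and a 2-byte header length, followed by a header serialized with ProtoBuf that carries these fields:

=> the position of the packet within the block

=> the sequence number of the packet in the pipeline

=> whether the packet is the last packet of the block

=> the length of the data portion of the packet

=> whether the DataNode needs to sync the data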

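1.2.2 Checksum data

Checksum data is computed over the actual data. The data is divided into checksum chunks, each chunk produces one checksum, and the checksum data contains the checksums of all chunks. Size of checksum data = (actual data length + chunk size - 1) / chunk size * checksum length, using integer division, so a partial trailing chunk still gets its own checksum.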
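The size rule above amounts to a ceiling division, which can be sketched as follows. `ChunkMath` is a hypothetical name; the rounding mirrors the `numberOfChunks` helper that the BlockSender code later in this article calls.

```java
// Hypothetical helper illustrating the checksum-size arithmetic.
final class ChunkMath {
    // Number of checksum chunks needed to cover dataLen bytes (rounds up).
    static int numberOfChunks(long dataLen, int chunkSize) {
        return (int) ((dataLen + chunkSize - 1) / chunkSize);
    }

    // Total bytes of checksum data for dataLen bytes of actual data.
    static int checksumDataLen(long dataLen, int chunkSize, int checksumSize) {
        return numberOfChunks(dataLen, chunkSize) * checksumSize;
    }

    public static void main(String[] args) {
        // 512-byte chunks with 4-byte CRC32 checksums
        System.out.println(checksumDataLen(512, 512, 4));  // 4: exactly one chunk
        System.out.println(checksumDataLen(513, 512, 4));  // 8: one full chunk plus a partial one
        System.out.println(checksumDataLen(0, 512, 4));    // 0: the empty end-of-block packet
    }
}
```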

 

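1.2.3 Actual data

The actual data in a packet is the data stored in the block file. It is transferred in units of checksum chunks, where one chunk is the span of data that produces one checksum. Within a packet, chunks and their checksums are sent separately: the checksums of all chunks go first, followed by all the chunks themselves.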
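The "all checksums first, then all chunks" layout can be illustrated with a small sketch. `PacketBodySketch` is hypothetical and uses `java.util.zip.CRC32` as a stand-in for the DataNode's checksum implementation; it builds only the packet body and omits the packet header.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch: lays out a packet body as [checksums...][data...].
final class PacketBodySketch {
    static final int CHECKSUM_SIZE = 4; // a CRC32 value is 4 bytes

    static byte[] build(byte[] data, int chunkSize) {
        int numChunks = (data.length + chunkSize - 1) / chunkSize;
        ByteBuffer body = ByteBuffer.allocate(numChunks * CHECKSUM_SIZE + data.length);
        CRC32 crc = new CRC32();
        // First pass: one checksum per chunk, written back to back.
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            body.putInt((int) crc.getValue());
        }
        // Then all of the actual data, chunk after chunk.
        body.put(data);
        return body.array();
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000]; // two chunks: 512 bytes + 488 bytes
        byte[] body = build(data, 512);
        System.out.println(body.length); // 1008 = 2 * 4 + 1000
    }
}
```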

 

2 The BlockSender implementation

Sending a block is implemented mainly by BlockSender. The process has three stages: preparing to send, sending the data, and cleaning up.

2.1 Preparing to send

BlockSender does most of its preparation work in the constructor:

  BlockSender(ExtendedBlock block, long startOffset, long length,
              boolean corruptChecksumOk, boolean verifyChecksum,
              boolean sendChecksum, DataNode datanode, String clientTraceFmt,
              CachingStrategy cachingStrategy)
      throws IOException {
    try {
      this.block = block;
      this.corruptChecksumOk = corruptChecksumOk;
      this.verifyChecksum = verifyChecksum;
      this.clientTraceFmt = clientTraceFmt;

      /*
       * If the caching strategy does not specify drop-behind, initialize
       * dropCacheBehindLargeReads from the configuration key
       * dfs.datanode.drop.cache.behind.reads.
       */
      if (cachingStrategy.getDropBehind() == null) {
        this.dropCacheBehindAllReads = false;
        this.dropCacheBehindLargeReads =
            datanode.getDnConf().dropCacheBehindReads;
      } else {
        this.dropCacheBehindAllReads =
            this.dropCacheBehindLargeReads =
                cachingStrategy.getDropBehind().booleanValue();
      }
      /*
       * If the caching strategy does not specify readahead, fall back to the
       * configuration key dfs.datanode.readahead.bytes, which defaults to
       * 4 MB.
       */
      if (cachingStrategy.getReadahead() == null) {
        this.alwaysReadahead = false;
        this.readaheadLength = datanode.getDnConf().readaheadLength;
      } else {
        this.alwaysReadahead = true;
        this.readaheadLength = cachingStrategy.getReadahead().longValue();
      }

      this.datanode = datanode;
      // If the checksum data needs to be verified
      if (verifyChecksum) {
        // To simplify implementation, callers may not specify verification
        // without sending.
        Preconditions.checkArgument(sendChecksum,
            "If verifying checksum, currently must also send it.");
      }

      final Replica replica;
      final long replicaVisibleLength;
      synchronized (datanode.data) {
        replica = getReplica(block, datanode);
        replicaVisibleLength = replica.getVisibleLength();
      }
      // if there is a write in progress
      ChunkChecksum chunkChecksum = null;
      if (replica instanceof ReplicaBeingWritten) {
        final ReplicaBeingWritten rbw = (ReplicaBeingWritten) replica;
        waitForMinLength(rbw, startOffset + length);
        chunkChecksum = rbw.getLastChecksumAndDataLen();
      }

      if (replica.getGenerationStamp() < block.getGenerationStamp()) {
        throw new IOException("Replica gen stamp < block genstamp, block="
            + block + ", replica=" + replica);
      } else if (replica.getGenerationStamp() > block.getGenerationStamp()) {
        if (DataNode.LOG.isDebugEnabled()) {
          DataNode.LOG.debug("Bumping up the client provided"
              + " block's genstamp to latest " + replica.getGenerationStamp()
              + " for block " + block);
        }
        block.setGenerationStamp(replica.getGenerationStamp());
      }
      if (replicaVisibleLength < 0) {
        throw new IOException("Replica is not readable, block="
            + block + ", replica=" + replica);
      }
      if (DataNode.LOG.isDebugEnabled()) {
        DataNode.LOG.debug("block=" + block + ", replica=" + replica);
      }

      // Whether transferTo mode is allowed
      this.transferToAllowed = datanode.getDnConf().transferToAllowed &&
          (!is32Bit || length <= Integer.MAX_VALUE);

      // Obtain a reference before reading data
      this.volumeRef = datanode.data.getVolume(block).obtainReference();

      /*
       * (corruptChecksumOK, meta_file_exist): operation
       * True,  True:  will verify checksum
       * True,  False: No verify, e.g., need to read data from a corrupted file
       * False, True:  will verify checksum
       * False, False: throws IOException file not found
       */
      // Load the checksum info from the meta file: the block's checksum
      // algorithm, the checksum length, and how many bytes produce one
      // checksum value (that is, the checksum chunk size).
      DataChecksum csum = null;
      if (verifyChecksum || sendChecksum) {
        LengthInputStream metaIn = null;
        boolean keepMetaInOpen = false;
        try {
          metaIn = datanode.data.getMetaDataInputStream(block);
          if (!corruptChecksumOk || metaIn != null) {
            if (metaIn == null) {
              // need checksum but meta-data not found
              throw new FileNotFoundException("Meta-data not found for " +
                  block);
            }
            if (metaIn.getLength() > BlockMetadataHeader.getHeaderSize()) {
              checksumIn = new DataInputStream(new BufferedInputStream(
                  metaIn, HdfsConstants.IO_FILE_BUFFER_SIZE));

              csum = BlockMetadataHeader.readDataChecksum(checksumIn, block);
              keepMetaInOpen = true;
            }
          } else {
            LOG.warn("Could not find metadata file for " + block);
          }
        } finally {
          if (!keepMetaInOpen) {
            IOUtils.closeStream(metaIn);
          }
        }
      }
      if (csum == null) {
        csum = DataChecksum.newDataChecksum(DataChecksum.Type.NULL, 512);
      }

      /*
       * If chunkSize is very large, then the metadata file is mostly
       * corrupted. For now just truncate bytesPerchecksum to blockLength.
       */
      int size = csum.getBytesPerChecksum();
      if (size > 10*1024*1024 && size > replicaVisibleLength) {
        csum = DataChecksum.newDataChecksum(csum.getChecksumType(),
            Math.max((int) replicaVisibleLength, 10*1024*1024));
        size = csum.getBytesPerChecksum();
      }
      chunkSize = size; // checksum chunk size
      checksum = csum; // checksum algorithm
      checksumSize = checksum.getChecksumSize(); // length of one checksum
      length = length < 0 ? replicaVisibleLength : length;

      // end is either last byte on disk or the length for which we have a
      // checksum
      long end = chunkChecksum != null ? chunkChecksum.getDataLength()
          : replica.getBytesOnDisk();
      if (startOffset < 0 || startOffset > end
          || (length + startOffset) > end) {
        String msg = " Offset " + startOffset + " and length " + length
            + " don't match block " + block + " (blockLen " + end + " )";
        LOG.warn(datanode.getDNRegistrationForBP(block.getBlockPoolId()) +
            ":sendBlock() : " + msg);
        throw new IOException(msg);
      }

      // Compute offset and endOffset. offset marks where reading starts
      // within the block and endOffset marks where it ends. The requested
      // range often does not begin on a chunk boundary, so the preparation
      // step moves offset back to the start of its chunk and endOffset
      // forward to the end of its chunk. Reads can then proceed chunk by
      // chunk, which keeps checksum handling simple.
      offset = startOffset - (startOffset % chunkSize);
      if (length >= 0) {
        // Ensure endOffset points to end of chunk.
        long tmpLen = startOffset + length;
        if (tmpLen % chunkSize != 0) {
          tmpLen += (chunkSize - tmpLen % chunkSize);
        }
        if (tmpLen < end) {
          // will use on-disk checksum here since the end is a stable chunk
          end = tmpLen;
        } else if (chunkChecksum != null) {
          // last chunk is changing. flag that we need to use in-memory checksum
          this.lastChunkChecksum = chunkChecksum;
        }
      }
      endOffset = end;

      // Skip the checksum stream ahead to the checksum that matches offset
      if (offset > 0 && checksumIn != null) {
        long checksumSkip = (offset / chunkSize) * checksumSize;
        // note blockInStream is seeked when created below
        if (checksumSkip > 0) {
          // Should we use seek() for checksum file as well?
          IOUtils.skipFully(checksumIn, checksumSkip);
        }
      }
      seqno = 0;

      if (DataNode.LOG.isDebugEnabled()) {
        DataNode.LOG.debug("replica=" + replica);
      }
      blockIn = datanode.data.getBlockInputStream(block, offset); // seek to offset
      if (blockIn instanceof FileInputStream) {
        blockInFd = ((FileInputStream) blockIn).getFD();
      } else {
        blockInFd = null;
      }
    } catch (IOException ioe) {
      IOUtils.closeStream(this);
      IOUtils.closeStream(blockIn);
      throw ioe;
    }
  }
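The chunk alignment performed at the end of the constructor is easier to follow with concrete numbers. The sketch below extracts just that arithmetic into a hypothetical `ChunkAlignSketch` class. It is a simplification: the in-progress-write case (`lastChunkChecksum`) is left out, and `end` stands for the readable length of the replica.

```java
// Hypothetical extraction of the constructor's offset arithmetic.
final class ChunkAlignSketch {
    // Move the start position back to the beginning of its chunk.
    static long alignedOffset(long startOffset, int chunkSize) {
        return startOffset - (startOffset % chunkSize);
    }

    // Move the end position forward to the end of its chunk, capped at 'end'.
    static long alignedEndOffset(long startOffset, long length, int chunkSize, long end) {
        long tmpLen = startOffset + length;
        if (tmpLen % chunkSize != 0) {
            tmpLen += chunkSize - tmpLen % chunkSize;
        }
        return Math.min(tmpLen, end);
    }

    // Bytes to skip in the checksum stream so it lines up with the offset.
    static long checksumSkip(long offset, int chunkSize, int checksumSize) {
        return (offset / chunkSize) * checksumSize;
    }

    public static void main(String[] args) {
        // Request bytes [1000, 3000) of a 10000-byte replica, 512-byte chunks.
        long offset = alignedOffset(1000, 512);
        long endOffset = alignedEndOffset(1000, 2000, 512, 10000);
        System.out.println(offset);                       // 512
        System.out.println(endOffset);                    // 3072
        System.out.println(checksumSkip(offset, 512, 4)); // 4
    }
}
```

The client therefore receives slightly more than it asked for (bytes 512 to 3072 here) and has to discard the extra bytes at both ends after verifying the checksums.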

2.2 Sending the block

Sending starts with readahead and cache dropping, both handled by manageOsCache:

  /*
   * Readahead and drop-behind management of the OS cache.
   */
  private void manageOsCache() throws IOException {
    if (blockInFd == null) return;

    // Trigger readahead when the conditions are met
    if ((readaheadLength > 0) && (datanode.readaheadPool != null) &&
          (alwaysReadahead || isLongRead())) {
      // Conditions met: call ReadaheadPool.readaheadStream to start readahead
      curReadahead = datanode.readaheadPool.readaheadStream(
          clientTraceFmt, blockInFd, offset, readaheadLength, Long.MAX_VALUE,
          curReadahead);
    }

    // Drop the data that has already been read from the OS cache, since it
    // is no longer needed
    if (dropCacheBehindAllReads ||
        (dropCacheBehindLargeReads && isLongRead())) {
      // Position at which the next drop becomes due
      long nextCacheDropOffset = lastCacheDropOffset + CACHE_DROP_INTERVAL_BYTES;
      if (offset >= nextCacheDropOffset) {
        // The read position has reached the drop position, so drop everything
        // between the last drop position and the current read position
        long dropLength = offset - lastCacheDropOffset;
        NativeIO.POSIX.getCacheManipulator().posixFadviseIfPossible(
            block.getBlockName(), blockInFd, lastCacheDropOffset,
            dropLength, NativeIO.POSIX.POSIX_FADV_DONTNEED);
        lastCacheDropOffset = offset;
      }
    }
  }
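The drop-behind bookkeeping in manageOsCache can be simulated without any native calls. In the sketch below, `DropBehindSketch` is hypothetical; it records the (offset, length) ranges that would be handed to `posix_fadvise(POSIX_FADV_DONTNEED)` instead of issuing the call. The 1 MB value for `CACHE_DROP_INTERVAL_BYTES` is an assumption, since the listing above does not show the constant's definition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the drop-behind interval logic.
final class DropBehindSketch {
    // Assumption: drop the cache once per 1 MB of progress.
    static final long CACHE_DROP_INTERVAL_BYTES = 1024 * 1024;

    private long lastCacheDropOffset;
    private final List<long[]> drops = new ArrayList<>();

    DropBehindSketch(long initialOffset) {
        this.lastCacheDropOffset = initialOffset;
    }

    // Called before each packet with the current read offset.
    void onOffset(long offset) {
        long nextCacheDropOffset = lastCacheDropOffset + CACHE_DROP_INTERVAL_BYTES;
        if (offset >= nextCacheDropOffset) {
            // Record {start, length} of the range that would be dropped.
            drops.add(new long[] {lastCacheDropOffset, offset - lastCacheDropOffset});
            lastCacheDropOffset = offset;
        }
    }

    List<long[]> drops() {
        return drops;
    }

    public static void main(String[] args) {
        DropBehindSketch sketch = new DropBehindSketch(0);
        // Advance in 64 KB packets through 4 MB of data.
        for (long off = 0; off <= 4L * 1024 * 1024; off += 64 * 1024) {
            sketch.onOffset(off);
        }
        System.out.println(sketch.drops().size()); // 4
    }
}
```

Batching the drops this way keeps the number of fadvise calls low: one call per interval rather than one per packet.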

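  /**
   * Reads the block and its metadata, then sends the data to a client or to
   * another datanode.
   * out: the stream the data is written to
   * baseStream: the underlying stream, used for transferTo when possible
   * throttler: limits the rate at which data is sent
   */
  long sendBlock(DataOutputStream out, OutputStream baseStream,
                 DataTransferThrottler throttler) throws IOException {
    TraceScope scope = datanode.tracer.
        newScope("sendBlock_" + block.getBlockId());
    try {
      return doSendBlock(out, baseStream, throttler);
    } finally {
      scope.close();
    }
  }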

 

  private long doSendBlock(DataOutputStream out, OutputStream baseStream,
        DataTransferThrottler throttler) throws IOException {
    if (out == null) {
      throw new IOException("out stream is null");
    }
    initialOffset = offset;
    long totalRead = 0;
    OutputStream streamForSendChunks = out;

    lastCacheDropOffset = initialOffset;

    if (isLongRead() && blockInFd != null) {
      // Advise that this file descriptor will be accessed sequentially.
      NativeIO.POSIX.getCacheManipulator().posixFadviseIfPossible(
          block.getBlockName(), blockInFd, 0, 0,
          NativeIO.POSIX.POSIX_FADV_SEQUENTIAL);
    }

    // Readahead and drop-behind
    manageOsCache();

    final long startTime = ClientTraceLog.isDebugEnabled() ? System.nanoTime() : 0;
    try {
      int maxChunksPerPacket;
      // Size a packet buffer large enough to hold one packet. The two send
      // modes, transferTo and ioStream, need different buffer sizes. In
      // transferTo mode the block file is sent to the client by zero-copy and
      // never passes through the buffer, so the buffer only has to hold the
      // header and checksum data. In ioStream mode it has to hold both the
      // actual data and the checksum data.
      int pktBufSize = PacketHeader.PKT_MAX_HEADER_LEN;
      boolean transferTo = transferToAllowed && !verifyChecksum
          && baseStream instanceof SocketOutputStream
          && blockIn instanceof FileInputStream;
      if (transferTo) {
        FileChannel fileChannel = ((FileInputStream) blockIn).getChannel();
        blockInPosition = fileChannel.position();
        streamForSendChunks = baseStream;
        maxChunksPerPacket = numberOfChunks(TRANSFERTO_BUFFER_SIZE);

        // Smaller packet size to only hold checksum when doing transferTo
        pktBufSize += checksumSize * maxChunksPerPacket;
      } else {
        maxChunksPerPacket = Math.max(1,
            numberOfChunks(HdfsConstants.IO_FILE_BUFFER_SIZE));
        // Packet size includes both checksum and data
        pktBufSize += (chunkSize + checksumSize) * maxChunksPerPacket;
      }

      ByteBuffer pktBuf = ByteBuffer.allocate(pktBufSize);
      // Call sendPacket in a loop to send the packet sequence
      while (endOffset > offset && !Thread.currentThread().isInterrupted()) {
        // Readahead and drop-behind
        manageOsCache();
        long len = sendPacket(pktBuf, maxChunksPerPacket, streamForSendChunks,
            transferTo, throttler);
        offset += len; // advance offset
        totalRead += len + (numberOfChunks(len) * checksumSize);
        seqno++;
      }
      // If this thread was interrupted, then it did not send the full block.
      if (!Thread.currentThread().isInterrupted()) {
        try {
          // Send an empty packet to mark the end of the block
          sendPacket(pktBuf, maxChunksPerPacket, streamForSendChunks, transferTo,
              throttler);
          out.flush();
        } catch (IOException e) { // socket error
          throw ioeToSocketException(e);
        }

        sentEntireByteRange = true;
      }
    } finally {
      if ((clientTraceFmt != null) && ClientTraceLog.isDebugEnabled()) {
        final long endTime = System.nanoTime();
        ClientTraceLog.debug(String.format(clientTraceFmt, totalRead,
            initialOffset, endTime - startTime));
      }
      close();
    }
    return totalRead;
  }
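To see how much the two modes differ in buffer size, the sizing logic from doSendBlock can be pulled out into a standalone sketch. `PacketBufferSizeSketch` is hypothetical, and all three constant values are assumptions chosen for illustration (a 33-byte maximum header, a 64 KB transferTo window, a 4 KB I/O buffer); the real values come from PacketHeader, BlockSender, and the cluster configuration.

```java
// Hypothetical extraction of the packet buffer sizing in doSendBlock.
final class PacketBufferSizeSketch {
    // Assumed values, for illustration only.
    static final int PKT_MAX_HEADER_LEN = 33;
    static final int TRANSFERTO_BUFFER_SIZE = 64 * 1024;
    static final int IO_FILE_BUFFER_SIZE = 4096;

    static int numberOfChunks(long dataLen, int chunkSize) {
        return (int) ((dataLen + chunkSize - 1) / chunkSize);
    }

    static int pktBufSize(boolean transferTo, int chunkSize, int checksumSize) {
        int size = PKT_MAX_HEADER_LEN;
        if (transferTo) {
            // Only the header and the checksums pass through the buffer.
            int maxChunksPerPacket = numberOfChunks(TRANSFERTO_BUFFER_SIZE, chunkSize);
            size += checksumSize * maxChunksPerPacket;
        } else {
            // Header, checksums, and the data itself are all buffered.
            int maxChunksPerPacket = Math.max(1, numberOfChunks(IO_FILE_BUFFER_SIZE, chunkSize));
            size += (chunkSize + checksumSize) * maxChunksPerPacket;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(pktBufSize(true, 512, 4));  // 545  = 33 + 4 * 128
        System.out.println(pktBufSize(false, 512, 4)); // 4161 = 33 + 516 * 8
    }
}
```

Under these assumed values, a transferTo packet carries 64 KB of data with a buffer of only 545 bytes, while an ioStream packet carries 4 KB of data and needs a buffer slightly larger than the data itself.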

 

  private int sendPacket(ByteBuffer pkt, int maxChunks, OutputStream out,
      boolean transferTo, DataTransferThrottler throttler) throws IOException {
    int dataLen = (int) Math.min(endOffset - offset,
                             (chunkSize * (long) maxChunks));
    // Number of checksum chunks in this packet
    int numChunks = numberOfChunks(dataLen);
    // Length of the checksum data
    int checksumDataLen = numChunks * checksumSize;
    // Packet length
    int packetLen = dataLen + checksumDataLen + 4;
    boolean lastDataPacket = offset + dataLen == endOffset && dataLen > 0;

    // Write the packet header into the buffer
    int headerLen = writePacketHeader(pkt, dataLen, packetLen);

    // Position of the packet header in the buffer
    int headerOff = pkt.position() - headerLen;
    // Position of the checksum data in the buffer
    int checksumOff = pkt.position();
    byte[] buf = pkt.array();

    if (checksumSize > 0 && checksumIn != null) {
      // Read the checksum data into the buffer
      readChecksum(buf, checksumOff, checksumDataLen);

      // write in progress that we need to use to get last checksum
      if (lastDataPacket && lastChunkChecksum != null) {
        int start = checksumOff + checksumDataLen - checksumSize;
        byte[] updatedChecksum = lastChunkChecksum.getChecksum();

        if (updatedChecksum != null) {
          System.arraycopy(updatedChecksum, 0, buf, start, checksumSize);
        }
      }
    }

    int dataOff = checksumOff + checksumDataLen;
    // In normal mode, read the data into the buffer
    if (!transferTo) { // normal transfer
      IOUtils.readFully(blockIn, buf, dataOff, dataLen);
      // Verify the data against the checksums
      if (verifyChecksum) {
        verifyChecksum(buf, dataOff, dataLen, numChunks, checksumOff);
      }
    }

    try {
      if (transferTo) {
        SocketOutputStream sockOut = (SocketOutputStream) out;
        // First write the header and checksum data to the socket stream
        sockOut.write(buf, headerOff, dataOff - headerOff);

        // Then use transferTo to write the data to the stream by zero-copy
        FileChannel fileCh = ((FileInputStream) blockIn).getChannel();
        LongWritable waitTime = new LongWritable();
        LongWritable transferTime = new LongWritable();
        sockOut.transferToFully(fileCh, blockInPosition, dataLen,
            waitTime, transferTime);
        datanode.metrics.addSendDataPacketBlockedOnNetworkNanos(waitTime.get());
        datanode.metrics.addSendDataPacketTransferNanos(transferTime.get());
        blockInPosition += dataLen;
      } else {
        // In normal mode, write header, checksums, and data from the buffer
        out.write(buf, headerOff, dataOff + dataLen - headerOff);
      }
    } catch (IOException e) {
      if (e instanceof SocketTimeoutException) {
        // Write timed out: nothing is logged and the block is not marked suspect
      } else {
        String ioem = e.getMessage();
        if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
          LOG.error("BlockSender.sendChunks() exception: ", e);
        }
        datanode.getBlockScanner().markSuspectBlock(
              volumeRef.getVolume().getStorageID(),
              block);
      }
      throw ioeToSocketException(e);
    }

    if (throttler != null) { // rebalancing so throttle
      throttler.throttle(packetLen);
    }

    return dataLen;
  }
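The transferTo branch of sendPacket relies on `FileChannel.transferTo` from the JDK. The sketch below shows the same loop-until-done pattern in isolation. `TransferToSketch` is hypothetical, and it writes to an in-memory channel so that it runs anywhere; with that sink the JDK copies through user space, and the kernel-level zero-copy path is only possible when the target is a socket channel, as it is in the DataNode.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of a "transfer fully" loop over FileChannel.transferTo.
final class TransferToSketch {
    // transferTo may move fewer bytes than requested, so loop until done.
    static long transferFully(FileChannel src, long position, long count,
                              WritableByteChannel dst) throws IOException {
        long done = 0;
        while (done < count) {
            long n = src.transferTo(position + done, count - done, dst);
            if (n <= 0) {
                break; // end of file reached or no progress possible
            }
            done += n;
        }
        return done;
    }

    // Writes 'content' to a temp file and sends [position, position + count).
    static byte[] demo(byte[] content, long position, long count) {
        try {
            Path tmp = Files.createTempFile("blk_", ".data");
            try {
                Files.write(tmp, content);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ);
                     WritableByteChannel out = Channels.newChannel(sink)) {
                    transferFully(ch, position, count, out);
                }
                return sink.toByteArray();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = new byte[1000];
        System.out.println(demo(content, 100, 200).length); // 200
    }
}
```

Note that the channel position is passed explicitly on every call, which is why BlockSender tracks `blockInPosition` itself instead of relying on the file channel's own position.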

