Spark Source Code Analysis: Spark Streaming Data Receiving


In the previous post we covered the preparation work that happens when a Spark Streaming application starts up, namely launching the receivers on the executors. Here we walk through what happens from the moment data is received until it is stored.

Receiving begins in the receiver's onStart method. Taking SocketReceiver as our example again, onStart spawns a thread that calls receive, which does the actual work of pulling data in.
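For reference, SocketReceiver's onStart is only a few lines; roughly (paraphrasing the Spark source):

def onStart() {
  // The blocking socket read must not run on the caller's thread,
  // so a daemon thread drives receive()
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

And the receive method it launches: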

def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    ...
  } catch {
    ...
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
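The bytesToObjects function used above is supplied when the receiver is constructed and turns the raw socket stream into an iterator of records; for socketTextStream it yields one String per line. A minimal line-based converter in that spirit (a sketch, not Spark's exact implementation, which uses an internal NextIterator):

import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

// Wrap the stream in a reader and expose it as an iterator of lines,
// ending when the peer closes the connection (readLine returns null)
def bytesToLines(inputStream: InputStream): Iterator[String] = {
  val reader = new BufferedReader(
    new InputStreamReader(inputStream, StandardCharsets.UTF_8))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}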

Notice that every received record is handed to store. store is an overloaded method with several variants, roughly as follows:

def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}

def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None)
}

def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
  supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
}

def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None)
}
...

There are more overloads beyond these, but each of them delegates to one of the supervisor's push methods, and in ReceiverSupervisorImpl those all eventually funnel into ReceiverSupervisorImpl#pushAndReportBlock, shown below.
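One subtlety hides in "eventually": pushSingle does not turn every record into its own block. In ReceiverSupervisorImpl it hands each record to a BlockGenerator, which buffers records and, on a timer that fires every spark.streaming.blockInterval (200 ms by default), turns the buffer into one block and pushes it down the same path as pushArrayBuffer. A deliberately simplified generator illustrating the idea (not Spark's actual BlockGenerator class):

import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Buffer single records; every blockIntervalMs, swap the buffer out and
// hand the accumulated batch to pushBlock as one block
class NaiveBlockGenerator[T](pushBlock: ArrayBuffer[T] => Unit,
                             blockIntervalMs: Long = 200) {
  private var buffer = new ArrayBuffer[T]
  private val timer = Executors.newSingleThreadScheduledExecutor()

  def addData(item: T): Unit = synchronized { buffer += item }

  def start(): Unit = {
    timer.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val block = NaiveBlockGenerator.this.synchronized {
          val full = buffer
          buffer = new ArrayBuffer[T]
          full
        }
        if (block.nonEmpty) pushBlock(block)
      }
    }, blockIntervalMs, blockIntervalMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = timer.shutdown()
}

Whichever push method produced the block, pushAndReportBlock is where it ends up: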

def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block via the ReceivedBlockHandler; this is where the
  // write-ahead log mechanism comes in
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  val numRecords = blockStoreResult.numRecords
  // Wrap the result in a ReceivedBlockInfo, which carries the streamId
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send the ReceiverTracker an AddBlock message; this case class
  // carries the block's metadata (blockInfo)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}

Pay attention to the receivedBlockHandler used above. It is initialized as follows:

private val receivedBlockHandler: ReceivedBlockHandler = {
  // If the write-ahead log is enabled (it is off by default),
  // receivedBlockHandler is a WriteAheadLogBasedBlockHandler;
  // otherwise it is a BlockManagerBasedBlockHandler
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
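As the check above shows, the receiver write-ahead log is off by default and refuses to start without a checkpoint directory. Enabling it takes one configuration flag plus a checkpoint call; a minimal sketch (the app name and HDFS path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-demo")
  // Selects WriteAheadLogBasedBlockHandler in the code above
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(5))
// Without this, the SparkException above is thrown
ssc.checkpoint("hdfs:///checkpoints/wal-demo")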

So a different handler subclass is created depending on whether the write-ahead log is enabled. Let's take WriteAheadLogBasedBlockHandler's storeBlock method as the example:

def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
  var numRecords = None: Option[Long]
  // Serialize the block with the BlockManager so that it can be
  // inserted into both the BlockManager and the write-ahead log
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }

  // Store the block in the BlockManager. The default storage level here is
  // a _SER, _2 level: the data is serialized and one replica is kept on
  // another executor's BlockManager for fault tolerance
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }

  // Store the block in the write-ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }

  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}

In this method the block is first serialized, then written to the BlockManager and to the write-ahead log. Note that the two writes are not sequential: each runs in its own Future, and the method blocks until both have completed, returning the WAL record handle from the second.
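The zip-and-await pattern at the end is worth isolating. A standalone sketch of the same idiom (the names are my own; toBlockManager and toWal stand in for the two writes):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run two independent writes in parallel; return only when both have
// succeeded, keeping the second result (the WAL handle). If either
// future fails, the zipped future fails and Await.result rethrows.
def writeBoth(toBlockManager: => Unit, toWal: => String): String = {
  val bmFuture  = Future { toBlockManager }
  val walFuture = Future { toWal }
  val combined  = bmFuture.zip(walFuture).map(_._2)
  Await.result(combined, 30.seconds)
}

Because the two writes overlap, the total cost is the slower of the two rather than their sum.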

Now back to the pushAndReportBlock method above. So far we have covered its "push" half; the "report" half is next. A ReceivedBlockInfo is built and sent to the ReceiverTracker inside an AddBlock message, and on the ReceiverTracker side the message is handled like this:

case AddBlock(receivedBlockInfo) =>
  context.reply(addBlock(receivedBlockInfo))

private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}

def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = synchronized {
  try {
    writeToLog(BlockAdditionEvent(receivedBlockInfo))
    getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
    logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
      s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    true
  } catch {
    case e: Exception =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}

The key line is getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo: the incoming blockInfo is appended to a per-stream queue, and those queues live in a map keyed by streamId.
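A minimal sketch of the structure behind getReceivedBlockQueue (BlockInfo stands in for Spark's ReceivedBlockInfo; the real tracker keeps these as its "unallocated" blocks until a batch claims them):

import scala.collection.mutable

case class BlockInfo(streamId: Int, blockId: String)

class ReceivedBlockQueues {
  // One queue of pending block infos per stream, created on demand
  private val queues = new mutable.HashMap[Int, mutable.Queue[BlockInfo]]

  private def queueFor(streamId: Int): mutable.Queue[BlockInfo] =
    queues.getOrElseUpdate(streamId, new mutable.Queue[BlockInfo])

  // Mirrors the `getReceivedBlockQueue(streamId) += info` line above
  def addBlock(info: BlockInfo): Unit = synchronized {
    queueFor(info.streamId) += info
  }
}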

In summary, once a Receiver has received data, it does two main things:
1. Store the received data in the BlockManager (and, when the write-ahead log is enabled, in the WAL as well).
2. Report the block: send its blockInfo to the ReceiverTracker as the payload of an AddBlock message; the ReceiverTracker stores it in a Map[streamId, Queue[blockInfo]] structure.
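To connect all of this back to user code, here is a minimal driver that exercises exactly the receive path traced above (host, port, and batch interval are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // socketTextStream creates the SocketReceiver discussed above;
    // local[2] matters because the receiver itself occupies one core
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Each line the receiver reads goes through store, is batched into a block, and is stored and reported via pushAndReportBlock; the blocks reported to the ReceiverTracker then back the RDDs of each 5-second batch.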
