Spark Source Code Analysis: Data Reception in Spark Streaming
Source: Internet. Editor: 程序博客网. Date: 2024/05/21 10:39
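The previous post covered the preparation that happens after a Spark Streaming application starts, namely launching the receivers on the executors.
This post follows the path from receiving data to storing it.
Data reception begins in the receiver's onStart method. We again use SocketReceiver as the example: its onStart method starts a thread, and that thread calls the receive method, which handles the incoming data: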
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    ...
  } catch {
    ...
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
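The shape of this loop can be tried without a Spark cluster. The following is a minimal sketch, not Spark code: `InMemorySink`, `linesOf` and the `isStopped` callback are hypothetical stand-ins for `Receiver.store`, `bytesToObjects` and the receiver's stop flag, and an in-memory byte stream plays the role of the socket.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object ReceiveLoopSketch {
  // Hypothetical stand-in for Receiver.store: collects items in memory.
  final class InMemorySink {
    val items = ArrayBuffer.empty[String]
    def store(item: String): Unit = { items += item; () }
  }

  // Turns a byte stream into an iterator of text lines
  // (plays the role of bytesToObjects in the original code).
  def linesOf(in: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }

  // Same loop shape as receive(): store each record until stopped or exhausted,
  // and always close the input in the finally block.
  def receive(in: InputStream, sink: InMemorySink, isStopped: () => Boolean): Unit = {
    try {
      val iterator = linesOf(in)
      while (!isStopped() && iterator.hasNext) {
        sink.store(iterator.next())
      }
    } finally {
      in.close()
    }
  }

  // Feeds three lines through the loop and returns what was stored.
  def demo(): Seq[String] = {
    val in = new ByteArrayInputStream("a\nb\nc\n".getBytes(StandardCharsets.UTF_8))
    val sink = new InMemorySink
    receive(in, sink, () => false)
    sink.items.toSeq
  }
}
```

Calling `ReceiveLoopSketch.demo()` pushes each line of the input through `store` in arrival order.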
As shown here, once data has been received, the store method is called to store it. store is overloaded and has many variants, roughly as follows:
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}

def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None)
}

def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
  supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
}

def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None)
}
...
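There are further overloads, but they all call one of the supervisor's pushXXXX methods. In ReceiverSupervisorImpl, these methods all eventually call ReceiverSupervisorImpl#pushAndReportBlock (pushSingle gets there indirectly: single records are first batched into blocks by a BlockGenerator):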
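The funnel described above, where several entry points wrap their input and hand it to one common method, can be sketched in plain Scala. This is a simplified model under stated assumptions: the `Supervisor` class and its `reported` log are hypothetical, and only the block type names `ArrayBufferBlock` and `IteratorBlock` are borrowed from the Spark code quoted in this post.

```scala
import scala.collection.mutable.ArrayBuffer

object PushFunnelSketch {
  // Simplified stand-ins for the block types that storeBlock pattern-matches on.
  sealed trait Block
  final case class ArrayBufferBlock(data: ArrayBuffer[Any]) extends Block
  final case class IteratorBlock(data: Iterator[Any]) extends Block

  // Hypothetical supervisor: each push variant wraps its input in a Block
  // and delegates to the single pushAndReportBlock method.
  final class Supervisor {
    val reported = ArrayBuffer.empty[String]

    def pushArrayBuffer(buf: ArrayBuffer[Any]): Unit =
      pushAndReportBlock(ArrayBufferBlock(buf))

    def pushIterator(it: Iterator[Any]): Unit =
      pushAndReportBlock(IteratorBlock(it))

    // The one place where storing and reporting would happen;
    // here it only records which kind of block arrived.
    private def pushAndReportBlock(block: Block): Unit = {
      val kind = block match {
        case ArrayBufferBlock(_) => "array-buffer"
        case IteratorBlock(_)    => "iterator"
      }
      reported += kind
      ()
    }
  }

  def demo(): Seq[String] = {
    val supervisor = new Supervisor
    supervisor.pushArrayBuffer(ArrayBuffer[Any](1, 2, 3))
    supervisor.pushIterator(Iterator[Any]("x"))
    supervisor.reported.toSeq
  }
}
```

Keeping one common method means the store-then-report logic lives in a single place, whichever overload the receiver used.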
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block in the BlockManager; this is where the write-ahead log mechanism comes in
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  val numRecords = blockStoreResult.numRecords
  // Wrap the result in a ReceivedBlockInfo object, which carries the streamId
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send an AddBlock message to the ReceiverTracker; this case class carries the block's blockInfo
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
Note receivedBlockHandler here. It is initialized as follows:
private val receivedBlockHandler: ReceivedBlockHandler = {
  // If the write-ahead log is enabled (it is disabled by default),
  // receivedBlockHandler is a WriteAheadLogBasedBlockHandler.
  // If it is not enabled, receivedBlockHandler is a BlockManagerBasedBlockHandler.
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
Depending on whether the write-ahead log is enabled, a different subclass is instantiated. Here we take the storeBlock method of WriteAheadLogBasedBlockHandler as the example:
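The selection logic can be modelled without Spark as a function from configuration to handler. In this sketch the `Handler` types and `chooseHandler` are hypothetical, and a plain `Map` stands in for `SparkConf`. The key `spark.streaming.receiver.writeAheadLog.enable` is, to my knowledge, the switch that `WriteAheadLogUtils.enableReceiverLog` reads, but treat it as an assumption and check it against your Spark version.

```scala
object HandlerSelectionSketch {
  // Hypothetical stand-ins for the two ReceivedBlockHandler implementations.
  sealed trait Handler
  case object BlockManagerBased extends Handler
  final case class WriteAheadLogBased(checkpointDir: String) extends Handler

  // Assumed name of the receiver write-ahead log switch (off when absent).
  val WalEnableKey = "spark.streaming.receiver.writeAheadLog.enable"

  // Mirrors the branch structure above: WAL enabled requires a checkpoint
  // directory, otherwise fall back to the BlockManager-only handler.
  def chooseHandler(conf: Map[String, String], checkpointDir: Option[String]): Handler = {
    val walEnabled = conf.get(WalEnableKey).exists(_.toBoolean)
    if (walEnabled) {
      val dir = checkpointDir.getOrElse(
        throw new IllegalStateException(
          "Cannot enable receiver write-ahead log without checkpoint directory set."))
      WriteAheadLogBased(dir)
    } else {
      BlockManagerBased
    }
  }
}
```

With an empty configuration `chooseHandler` returns `BlockManagerBased`; with the switch set to "true" it returns a `WriteAheadLogBased` handler, or throws if no checkpoint directory was supplied.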
def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
  var numRecords = None: Option[Long]
  // Serialize the block so that it can be inserted into both
  // First serialize the data with the BlockManager
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }

  // Store the block in block manager
  // The default storage level is of the _SER_2 kind: the data is serialized and a replica
  // is copied to the BlockManager on another executor for fault tolerance
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }

  // Store the block in write ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }

  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}
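In this method the block is first serialized. The serialized data is then written to the BlockManager and to the write-ahead log in two separate Futures, and the method waits for both to complete before returning the log record handle.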
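The zip-and-await pattern used by storeBlock can be isolated with the standard library alone. The sketch below is an illustration under stated assumptions: `DualStore`, `WalHandle` and the two in-memory queues are hypothetical stand-ins for the BlockManager and the write-ahead log, and the default 5-second timeout is an arbitrary choice.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object DualWriteSketch {
  // Hypothetical handle returned by the log write; here it only records the length.
  final case class WalHandle(length: Int)

  final class DualStore(implicit ec: ExecutionContext) {
    // In-memory stand-ins for the BlockManager and the write-ahead log.
    val memoryStore = new ConcurrentLinkedQueue[Array[Byte]]()
    val walLog = new ConcurrentLinkedQueue[Array[Byte]]()

    def storeBoth(bytes: Array[Byte], timeout: FiniteDuration = 5.seconds): WalHandle = {
      // Both writes start immediately and run concurrently.
      val storeInMemory: Future[Unit] = Future { memoryStore.add(bytes); () }
      val storeInWal: Future[WalHandle] = Future {
        walLog.add(bytes)
        WalHandle(bytes.length)
      }
      // zip completes only when both futures succeed; keep just the log handle.
      val combined = storeInMemory.zip(storeInWal).map(_._2)
      Await.result(combined, timeout)
    }
  }

  def demo(): (WalHandle, Int, Int) = {
    val store = new DualStore()(ExecutionContext.global)
    val handle = store.storeBoth("abc".getBytes("UTF-8"))
    (handle, store.memoryStore.size, store.walLog.size)
  }
}
```

Because `zip` fails as soon as either side fails, the caller gets a handle only if the data reached both destinations, which is the guarantee the write-ahead log path relies on.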
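Now back to the pushAndReportBlock method above. So far we have covered its push part; what remains is its report part.
The method wraps the result in a blockInfo object and then sends an AddBlock message to the ReceiverTracker. When the ReceiverTracker receives this message, it is handled as follows: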
// ReceiverTracker: handling of the AddBlock message
case AddBlock(receivedBlockInfo) =>
  context.reply(addBlock(receivedBlockInfo))

// ReceiverTracker: delegates to the ReceivedBlockTracker
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}

// ReceivedBlockTracker
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = synchronized {
  try {
    writeToLog(BlockAdditionEvent(receivedBlockInfo))
    getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
    logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
      s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    true
  } catch {
    case e: Exception =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
The most important line in this code is getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
That is, the received blockInfo is appended to a queue, and that queue is held in a map keyed by streamId.
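The map-of-queues structure can be sketched with Scala's mutable collections. `Tracker`, `BlockInfo` and `pending` below are hypothetical simplifications of ReceivedBlockTracker and ReceivedBlockInfo; the sketch leaves out the write to the log and keeps only the per-stream queue lookup.

```scala
import scala.collection.mutable

object BlockQueueSketch {
  // Hypothetical, trimmed-down version of ReceivedBlockInfo.
  final case class BlockInfo(streamId: Int, blockId: String, numRecords: Long)

  final class Tracker {
    // One queue of not-yet-processed block infos per input stream.
    private val queues = mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]

    // Returns the queue for a stream, creating it on first use.
    private def queueFor(streamId: Int): mutable.Queue[BlockInfo] =
      queues.getOrElseUpdate(streamId, mutable.Queue.empty[BlockInfo])

    // Appends the info to its stream's queue; synchronized like the original.
    def addBlock(info: BlockInfo): Boolean = synchronized {
      queueFor(info.streamId) += info
      true
    }

    // Snapshot of the blocks waiting for a given stream, in arrival order.
    def pending(streamId: Int): List[BlockInfo] = synchronized {
      queueFor(streamId).toList
    }
  }
}
```

Blocks from different receivers land in different queues, so each stream's pending blocks can later be consumed independently and in arrival order.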
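In summary, after a Receiver receives data it does two main things:
1. It stores the received data in the BlockManager.
2. It sends the blockInfo of the received data to the ReceiverTracker as the argument of an AddBlock message. The ReceiverTracker stores it in a data structure of the form Map[streamId, BlockQueue[blockInfo]].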