Demystifying Spark Streaming's Runtime Mechanism and Architecture, Advanced: Jobs and Fault Tolerance (Part 3)
Source: Internet  Editor: 程序博客网  Time: 2024/05/17 07:35
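Key points of this installment:
1. The architecture and runtime mechanism of Spark Streaming jobs
2. The fault-tolerance mechanism of Spark Streaming
We have already touched on Spark Streaming in earlier posts. Spark Streaming is a sub-framework that runs on top of Spark Core. Below, we use a simple example to walk through its runtime mechanism and architecture step by step.
- Spark Streaming runtime mechanism and architecture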
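    // Sina Weibo: http://weibo.com/ilovepains/
    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    import scala.Tuple2;

    public class WordCountOnline {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf()
                    .setMaster("spark://Master:7077")
                    .setAppName("WordCountOnline");
            // One batch every 5 seconds
            JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Read lines of text from a socket on host "Master", port 9999
            JavaReceiverInputDStream<String> lines = jsc.socketTextStream("Master", 9999);

            // Split each line into words.
            // Spark 1.x signature; since Spark 2.0, call() returns Iterator<String>.
            JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public Iterable<String> call(String line) throws Exception {
                    return Arrays.asList(line.split(" "));
                }
            });

            // Map each word to a (word, 1) pair
            JavaPairDStream<String, Integer> pairs = words.mapToPair(
                    new PairFunction<String, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(String word) throws Exception {
                    return new Tuple2<String, Integer>(word, 1);
                }
            });

            // Sum the counts per word within each batch
            JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(
                    new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer v1, Integer v2) throws Exception {
                    return v1 + v2;
                }
            });

            wordsCount.print();

            jsc.start();
            jsc.awaitTermination();
            jsc.close();
        }
    }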
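This is a Spark Streaming word-count example.
In a Spark Streaming program, StreamingContext is the entry point to all Spark Streaming functionality and the core of job scheduling. Let's look at part of the StreamingContext initialization source: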
    // StreamingContext.scala, line 183
    private[streaming] val scheduler = new JobScheduler(this)
As we can see, constructing a StreamingContext initializes a JobScheduler. JobScheduler in turn initializes a JobGenerator and declares the receiverTracker variable, as follows:
    // JobScheduler.scala, line 50
    private val jobGenerator = new JobGenerator(this)
    val clock = jobGenerator.clock
    val listenerBus = new StreamingListenerBus()

    // These two are created only when scheduler starts.
    // eventLoop not being null means the scheduler has been started and not stopped
    var receiverTracker: ReceiverTracker = null
Next, let's look at part of the source behind jsc.socketTextStream("Master", 9999), which creates the DStream:
    // StreamingContext.scala, line 327
    def socketTextStream(
        hostname: String,
        port: Int,
        storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
      ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
      socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
    }

    // StreamingContext.scala, line 345
    def socketStream[T: ClassTag](
        hostname: String,
        port: Int,
        converter: (InputStream) => Iterator[T],
        storageLevel: StorageLevel
      ): ReceiverInputDStream[T] = {
      new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
    }
From the code above, socketTextStream delegates to the more general socketStream method, which ultimately constructs a SocketInputDStream. Let's look at SocketInputDStream next:
    private[streaming]
    class SocketInputDStream[T: ClassTag](
        ssc_ : StreamingContext,
        host: String,
        port: Int,
        bytesToObjects: InputStream => Iterator[T],
        storageLevel: StorageLevel
      ) extends ReceiverInputDStream[T](ssc_) {

      def getReceiver(): Receiver[T] = {
        new SocketReceiver(host, port, bytesToObjects, storageLevel)
      }
    }
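SocketInputDStream defines the getReceiver method, which creates the receiver that will take in the data. Of course, everything we have seen so far is only method definition or object initialization; nothing has actually started executing yet.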
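The "defined now, executed later" behavior described above can be illustrated with a small plain-Java sketch. It does not use Spark at all: DeferredStartSketch and its methods are hypothetical names chosen only to mirror the roles of socketTextStream, getReceiver and start, so treat it as an analogy under those assumptions rather than as how Spark is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a streaming context (not a Spark class):
// it only records receiver factories until start() is called.
class DeferredStartSketch {
    private final List<Supplier<String>> receiverFactories = new ArrayList<>();
    private final List<String> startedReceivers = new ArrayList<>();

    // Plays the role of socketTextStream(...): registers a definition, runs nothing.
    void socketTextStream(String host, int port) {
        receiverFactories.add(() -> "SocketReceiver(" + host + ":" + port + ")");
    }

    // Plays the role of start(): only now are the factories (getReceiver) invoked.
    void start() {
        for (Supplier<String> factory : receiverFactories) {
            startedReceivers.add(factory.get());
        }
    }

    List<String> startedReceivers() {
        return startedReceivers;
    }

    public static void main(String[] args) {
        DeferredStartSketch ctx = new DeferredStartSketch();
        ctx.socketTextStream("Master", 9999);
        System.out.println(ctx.startedReceivers()); // prints [] because nothing has run yet
        ctx.start();
        System.out.println(ctx.startedReceivers()); // prints [SocketReceiver(Master:9999)]
    }
}
```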
Now let's look at jsc.start(), the method that actually starts the program:
    def start(): Unit = synchronized {
      state match {
        case INITIALIZED =>
          startSite.set(DStream.getCreationSite())
          StreamingContext.ACTIVATION_LOCK.synchronized {
            StreamingContext.assertNoOtherContextIsActive()
            try {
              validate()

              // Start the streaming scheduler in a new thread, so that thread local properties
              // like call sites and job groups can be reset without affecting those of the
              // current thread.
              ThreadUtils.runInNewThread("streaming-start") {
                sparkContext.setCallSite(startSite.get)
                sparkContext.clearJobGroup()
                sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
                scheduler.start()
              }
              state = StreamingContextState.ACTIVE
            } catch {
              case NonFatal(e) =>
                logError("Error starting the context, marking it as stopped", e)
                scheduler.stop(false)
                state = StreamingContextState.STOPPED
                throw e
            }
            StreamingContext.setActiveContext(this)
          }
          shutdownHookRef = ShutdownHookManager.addShutdownHook(
            StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
          // Registering Streaming Metrics at the start of the StreamingContext
          assert(env.metricsSystem != null)
          env.metricsSystem.registerSource(streamingSource)
          uiTab.foreach(_.attach())
          logInfo("StreamingContext started")
        case ACTIVE =>
          logWarning("StreamingContext has already been started")
        case STOPPED =>
          throw new IllegalStateException("StreamingContext has already been stopped")
      }
    }
As we can see, jsc.start() does a lot of work, but the part to focus on is scheduler.start():
    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      logDebug("Starting JobScheduler")
      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      // attach rate controllers of input streams to receive batch completion updates
      for {
        inputDStream <- ssc.graph.getInputStreams
        rateController <- inputDStream.rateController
      } ssc.addStreamingListener(rateController)

      listenerBus.start(ssc.sparkContext)
      // JobScheduler.scala, line 80
      receiverTracker = new ReceiverTracker(ssc)
      inputInfoTracker = new InputInfoTracker(ssc)
      receiverTracker.start()
      // JobScheduler.scala, line 83
      jobGenerator.start()
      logInfo("Started JobScheduler")
    }
We can now see that JobScheduler's start method initializes receiverTracker and calls its start method:
    // ReceiverTracker.scala, line 149
    def start(): Unit = synchronized {
      if (isTrackerStarted) {
        throw new SparkException("ReceiverTracker already started")
      }

      if (!receiverInputStreams.isEmpty) {
        endpoint = ssc.env.rpcEnv.setupEndpoint(
          "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
        if (!skipReceiverLaunch) launchReceivers()
        logInfo("ReceiverTracker started")
        trackerState = Started
      }
    }

    // ReceiverTracker.scala, line 413
    private def launchReceivers(): Unit = {
      val receivers = receiverInputStreams.map(nis => {
        val rcvr = nis.getReceiver()
        rcvr.setReceiverId(nis.id)
        rcvr
      })

      runDummySparkJob()

      logInfo("Starting " + receivers.length + " receivers")
      endpoint.send(StartAllReceivers(receivers))
    }
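At this point we can see that when StreamingContext runs start, it calls JobScheduler's start method, which initializes ReceiverTracker and calls its start method. ReceiverTracker's start in turn uses RPC messaging to get the executor processes on the Worker nodes to begin receiving data continuously and to report the metadata back to the driver.
Now let's go back to line 83 of JobScheduler.scala and look at the jobGenerator.start() method: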
    // JobGenerator.scala, line 79
    def start(): Unit = synchronized {
      if (eventLoop != null) return // generator has already been started

      // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
      // See SPARK-10125
      checkpointWriter

      eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
        override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = {
          jobScheduler.reportError("Error in job generator", e)
        }
      }
      eventLoop.start()

      if (ssc.isCheckpointPresent) {
        restart()
      } else {
        startFirstTime()
      }
    }
At this point, Spark Streaming has started ReceiverTracker to receive data, and JobGenerator, the job generator, produces the jobs that run on the cluster.
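Both JobScheduler.start() and JobGenerator.start() in the source above create an EventLoop and override onReceive and onError. The following is a simplified plain-Java sketch of that pattern: a queue drained by a single dedicated thread. EventLoopSketch and EventLoopDemo are hypothetical names, and the event names posted in the demo are arbitrary strings; this is an illustration of the general technique, not Spark's actual EventLoop implementation.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified event loop: events are queued by any thread and handled
// one at a time, in order, by a single worker thread.
abstract class EventLoopSketch<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean stopped = false;

    EventLoopSketch(String name) {
        worker = new Thread(() -> {
            try {
                while (!stopped) {
                    E event = queue.take(); // blocks until an event is posted
                    try {
                        onReceive(event);
                    } catch (RuntimeException e) {
                        onError(e); // a failing handler does not kill the loop
                    }
                }
            } catch (InterruptedException e) {
                // interrupted by stop(): leave the loop
            }
        }, name);
        worker.setDaemon(true);
    }

    void start() { worker.start(); }

    void post(E event) { queue.add(event); }

    void stop() {
        stopped = true;
        worker.interrupt();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    protected abstract void onReceive(E event);

    protected abstract void onError(Throwable e);
}

class EventLoopDemo {
    // Posts two events and returns what the handler recorded, in order.
    static List<String> run() {
        List<String> handled = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(2);
        EventLoopSketch<String> loop = new EventLoopSketch<String>("JobSchedulerSketch") {
            @Override
            protected void onReceive(String event) {
                handled.add("handled " + event);
                done.countDown();
            }

            @Override
            protected void onError(Throwable e) {
                handled.add("error " + e.getMessage());
            }
        };
        loop.start();
        loop.post("JobStarted");
        loop.post("JobCompleted");
        try {
            done.await(); // wait until both events have been processed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        loop.stop();
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [handled JobStarted, handled JobCompleted]
    }
}
```

Because a single thread drains the queue, handlers never run concurrently with each other, which keeps the scheduler's state changes sequential even though events may be posted from many threads.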
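Of course, the source code also makes heavy use of thread pools. In the author's view, the biggest benefit is that a pool avoids the cost of creating new threads while reusing existing threads heavily (the same idea as a database connection pool).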
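The thread-reuse point can be checked with a short sketch using the standard java.util.concurrent API. ThreadPoolReuseSketch and its run method are hypothetical names; the sketch only shows that many tasks end up on a small, fixed set of threads, and it makes no claim about how Spark sizes or configures its own pools.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ThreadPoolReuseSketch {
    // Runs `tasks` small tasks on a pool of `poolSize` threads and returns
    // the names of the distinct threads that actually executed them.
    static Set<String> run(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> threadNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown(); // accept no new tasks, finish the queued ones
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return threadNames;
    }

    public static void main(String[] args) {
        Set<String> names = run(2, 100);
        // 100 tasks were executed, but never by more than 2 distinct threads.
        System.out.println("100 tasks ran on " + names.size() + " thread(s): " + names);
    }
}
```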
- Spark Streaming fault tolerance:
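Under the hood, a Spark Streaming DStream is essentially a sequence of RDDs. Because of this, it has two main fault-tolerance mechanisms: checkpointing and lineage-based recovery. When the lineage chain becomes too complex or too long, a checkpoint is needed.
Which one to use depends on the RDD dependencies. If the dependencies are all narrow, lineage-based recovery is usually used because it is simple and efficient: a lost partition can be rebuilt from its own parent partition. Wide dependencies generally involve a shuffle, so rebuilding one lost partition may require recomputing many parent partitions; that is when checkpointing should be considered.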
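The trade-off between replaying lineage and reading a checkpoint can be shown with a toy model in plain Java. This is not Spark code: NodeSketch, compute, checkpoint and LineageDemo are hypothetical names, a "dataset" is just a list of integers, and the counter only tracks how many transformations are re-executed to rebuild a result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy dataset node: remembers its parent and the transformation that
// derives it (its lineage), and optionally a saved copy of its data.
class NodeSketch {
    static int transformsExecuted = 0; // counts recomputation work

    private final NodeSketch parent; // null for the source node
    private final Function<List<Integer>, List<Integer>> transform;
    private List<Integer> checkpointed; // saved result, if any

    NodeSketch(NodeSketch parent, Function<List<Integer>, List<Integer>> transform) {
        this.parent = parent;
        this.transform = transform;
    }

    // Rebuilds this node's data: from the checkpoint if present,
    // otherwise by replaying the lineage back to the source.
    List<Integer> compute() {
        if (checkpointed != null) {
            return checkpointed;
        }
        List<Integer> input = (parent == null) ? new ArrayList<>() : parent.compute();
        transformsExecuted++;
        return transform.apply(input);
    }

    // Materializes the data once so later recoveries can stop here.
    void checkpoint() {
        checkpointed = compute();
    }
}

class LineageDemo {
    // Returns {transforms replayed without a checkpoint, transforms replayed with one}.
    static int[] run() {
        NodeSketch source = new NodeSketch(null, in -> Arrays.asList(1, 2, 3));
        NodeSketch doubled = new NodeSketch(source,
                in -> in.stream().map(x -> x * 2).collect(Collectors.toList()));
        NodeSketch plusOne = new NodeSketch(doubled,
                in -> in.stream().map(x -> x + 1).collect(Collectors.toList()));

        NodeSketch.transformsExecuted = 0;
        plusOne.compute(); // replays source, doubled and plusOne
        int viaLineage = NodeSketch.transformsExecuted;

        doubled.checkpoint(); // save the intermediate result
        NodeSketch.transformsExecuted = 0;
        plusOne.compute(); // replays only plusOne
        int afterCheckpoint = NodeSketch.transformsExecuted;

        return new int[] {viaLineage, afterCheckpoint};
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("lineage only: " + counts[0]);     // prints lineage only: 3
        System.out.println("with checkpoint: " + counts[1]);  // prints with checkpoint: 1
    }
}
```

With a short chain the saving is small, but the same reasoning explains why a long or shuffle-heavy lineage is worth cutting with a checkpoint: recovery cost stops growing with the length of the chain.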