Spark Study Notes (18): Handling Empty RDDs in Spark Streaming
Source: Internet  Editor: 程序博客网  Date: 2024/05/18 02:44
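In this installment:
1 Handling empty RDDs in Spark Streaming
2 Stopping a Spark Streaming program
1 Handling empty RDDs in Spark Streaming
In a Spark Streaming application, whichever DStream you use, the work underneath is always done on RDDs.
We start the analysis from a fragment of an application: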
...
val lines = ssc.socketTextStream("Master", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach { record =>
      val sql = "insert into streaming_itemcount(item,count) values('" + record._1 + "'," + record._2 + ")"
      val stmt = connection.createStatement()
      stmt.executeUpdate(sql)
      stmt.close() // release the statement once the row is written
    }
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
...
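The program has one problem: inside wordCounts.foreachRDD, it starts processing without first checking whether rdd is empty.
Even when rdd is empty, the job still acquires CPU cores and other compute resources and runs the computation inside. That is clearly wasteful.
Spark does define an EmptyRDD, whose compute method throws an exception, but Spark applications do not actually use EmptyRDD in practice.
So before processing each rdd, the code should check whether it is empty.
Now look at RDD.isEmpty: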
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
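The two operands of isEmpty behave differently, which a small experiment makes visible. This is a minimal sketch, assuming Spark is on the classpath and a local[2] master is acceptable; the object name IsEmptyDemo and the sample data are illustrative and not part of the original application.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IsEmptyDemo {
  // Returns isEmpty for: an RDD with no partitions, an RDD whose partitions
  // hold no records, and an RDD with records.
  def run(sc: SparkContext): (Boolean, Boolean, Boolean) = {
    // Zero partitions: the first operand is true, so take(1) is never evaluated.
    val noPartitions = sc.emptyRDD[Int].isEmpty()
    // Four partitions, no records: take(1) must run a job to find that out.
    val emptyPartitions = sc.parallelize(Seq.empty[Int], 4).isEmpty()
    // At least one record: take(1) returns it, so the RDD is not empty.
    val withRecords = sc.parallelize(Seq(1, 2, 3), 2).isEmpty()
    (noPartitions, emptyPartitions, withRecords)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("IsEmptyDemo")
    val sc = new SparkContext(conf)
    try println(run(sc)) // prints (true,true,false)
    finally sc.stop()
  }
}
```

In the word-count fragment, wordCounts comes out of reduceByKey, so its partition count is normally non-zero and the check most likely falls through to take(1). The check is therefore cheap but not free.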
So the application code above needs only one extra line:
wordCounts.foreachRDD { rdd =>
if (!rdd.isEmpty) {
rdd.foreachPartition { partitionOfRecords => {
... }
...
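The guard can also be factored out so that every output operation gets it. The following is a sketch under assumptions: GuardedSink, saveRdd and save are hypothetical names, and the write function stands in for the JDBC code (including the ConnectionPool calls) from the fragment above. It additionally skips individual empty partitions, which goes one step beyond the original one-line fix.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object GuardedSink {
  // Calls `write` once per non-empty partition.
  // Returns false when the whole RDD was skipped because it was empty.
  // `write` runs on executors, so it and anything it captures must be serializable.
  def saveRdd[T](rdd: RDD[T])(write: Iterator[T] => Unit): Boolean =
    if (rdd.isEmpty()) {
      false
    } else {
      rdd.foreachPartition { records =>
        // A non-empty RDD can still contain empty partitions; skip those too,
        // so no connection is borrowed for nothing.
        if (records.hasNext) write(records)
      }
      true
    }

  // Applies the guarded write to every batch of the stream.
  def save[T](stream: DStream[T])(write: Iterator[T] => Unit): Unit =
    stream.foreachRDD { rdd =>
      saveRdd(rdd)(write) // result ignored here; the guard itself is the point
      ()
    }
}
```

With this in place the application would call something like GuardedSink.save(wordCounts) { records => ... }, with the body borrowing a connection, writing the records, and returning the connection.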
2 Stopping a Spark Streaming program
First look at StreamingContext.stop:
def stop(
    stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
  ): Unit = synchronized {
  stop(stopSparkContext, false)
}
To stop a Spark Streaming application properly, use the other stop overload:
def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
  var shutdownHookRefToRemove: AnyRef = null
  if (AsynchronousListenerBus.withinListenerThread.value) {
    throw new SparkException("Cannot stop StreamingContext within listener thread of" +
      " AsynchronousListenerBus")
  }
  synchronized {
    try {
      state match {
        case INITIALIZED =>
          logWarning("StreamingContext has not been started yet")
        case STOPPED =>
          logWarning("StreamingContext has already been stopped")
        case ACTIVE =>
          scheduler.stop(stopGracefully)
          // Removing the streamingSource to de-register the metrics on stop()
          env.metricsSystem.removeSource(streamingSource)
          uiTab.foreach(_.detach())
          StreamingContext.setActiveContext(null)
          waiter.notifyStop()
          if (shutdownHookRef != null) {
            shutdownHookRefToRemove = shutdownHookRef
            shutdownHookRef = null
          }
          logInfo("StreamingContext stopped successfully")
      }
    } finally {
      // The state should always be Stopped after calling `stop()`, even if we haven't started yet
      state = STOPPED
    }
  }
  if (shutdownHookRefToRemove != null) {
    ShutdownHookManager.removeShutdownHook(shutdownHookRefToRemove)
  }
  // Even if we have already stopped, we still need to attempt to stop the SparkContext because
  // a user might stop(stopSparkContext = false) and then call stop(stopSparkContext = true).
  if (stopSparkContext) sc.stop()
}
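The stopGracefully parameter defaults to false; in production it should be true. In practice, set spark.streaming.stopGracefullyOnShutdown to true in the configuration, so that work already in progress runs to completion before the application stops and no data is left partially processed.
How does a Spark Streaming program achieve this? StreamingContext.stopOnShutdown calls the stop method shown above.
StreamingContext.stopOnShutdown: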
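As an illustration of the setting just described, the configuration can be applied in code as well as in a configuration file. This is a minimal sketch, assuming Spark Streaming is on the classpath; the object name GracefulStopConfig, the helper gracefulConf, the local[2] master and the 5-second batch interval are illustrative choices.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopConfig {
  // Builds a conf under which the shutdown hook stops the context gracefully.
  def gracefulConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

  def main(args: Array[String]): Unit = {
    val conf = gracefulConf("GracefulStopConfig").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // ... define DStreams and output operations, then call ssc.start() ...

    // An explicit graceful stop, independent of the shutdown hook:
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

The configuration key only affects the stop triggered at JVM shutdown. A direct call to stop still needs stopGracefully = true passed explicitly, since the single-argument overload forwards false.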
private def stopOnShutdown(): Unit = {
  val stopGracefully = conf.getBoolean("spark.streaming.stopGracefullyOnShutdown", false)
  logInfo(s"Invoking stop(stopGracefully=$stopGracefully) from shutdown hook")
  // Do not stop SparkContext, let its own shutdown hook stop it
  stop(stopSparkContext = false, stopGracefully = stopGracefully)
}
StreamingContext.start registers a shutdown hook that calls stopOnShutdown:
StreamingContext.start:
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
So as soon as the StreamingContext starts, it installs a hook which guarantees that, at shutdown, the stop method taking the stopGracefully parameter is called.
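The hook bookkeeping can be modeled without Spark at all. The sketch below is a toy stand-in, not Spark's implementation: MiniContext and its members are hypothetical names, and it uses the plain JVM Runtime shutdown-hook API where Spark uses its own ShutdownHookManager. It shows the same three moves: register the hook in start, read the graceful flag when the hook fires, and unregister the hook when stop is called first.

```scala
object ShutdownHookSketch {
  sealed trait State
  case object Initialized extends State
  case object Active extends State
  case object Stopped extends State

  // Toy stand-in for StreamingContext; only state and hook handling are modeled.
  class MiniContext(gracefulOnShutdown: Boolean) {
    private var state: State = Initialized
    private var hook: Option[Thread] = None
    @volatile var lastStopWasGraceful: Option[Boolean] = None

    def currentState: State = synchronized { state }

    def start(): Unit = synchronized {
      if (state == Initialized) {
        // Plays the role of addShutdownHook(...)(stopOnShutdown) in start().
        val t = new Thread(new Runnable {
          def run(): Unit = stop(gracefulOnShutdown)
        })
        Runtime.getRuntime.addShutdownHook(t)
        hook = Some(t)
        state = Active
      }
    }

    def stop(graceful: Boolean): Unit = {
      val toRemove = synchronized {
        if (state == Active) lastStopWasGraceful = Some(graceful)
        state = Stopped // always Stopped after stop(), even if never started
        val h = hook
        hook = None
        h
      }
      // Unregister outside the lock, as the real stop() does with its hook ref.
      toRemove.foreach { t =>
        try Runtime.getRuntime.removeShutdownHook(t)
        catch { case _: IllegalStateException => () } // JVM is already shutting down
      }
    }
  }
}
```

Removing the hook on an explicit stop keeps a later JVM shutdown from invoking stop a second time, which is the purpose of shutdownHookRefToRemove in the real method.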