Spark Study Notes (18): Handling Empty RDDs in Spark Streaming

Source: Internet | Editor: 程序博客网 | Time: 2024/05/18 02:44

Topics covered:

1 Handling empty RDDs in Spark Streaming

2 Stopping a Spark Streaming program

1 Handling empty RDDs in Spark Streaming

In a Spark Streaming application, whichever DStream you use, the underlying work is always done on RDDs.

Let us start the analysis from a fragment of an application:

...
val lines = ssc.socketTextStream("Master", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach { record =>
      val sql = "insert into streaming_itemcount(item,count) values('" + record._1 + "'," + record._2 + ")"
      val stmt = connection.createStatement()
      stmt.executeUpdate(sql)
      stmt.close() // release the statement once the insert has run
    }
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
...

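The program has one problem: inside wordCounts.foreachRDD, the rdd is processed without first checking whether it is empty.

Even when the rdd is empty, the job still acquires CPU cores and other compute resources and runs the computation inside the block. That is clearly wasteful.

Spark does define EmptyRDD, whose compute method throws an exception, but in practice Spark applications do not use EmptyRDD.

Before processing each rdd, the application should therefore check whether it is empty.

Now look at RDD.isEmpty: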

def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
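To make the two halves of that expression concrete, here is a minimal sketch in plain Scala with no Spark dependency. `MiniRDD` and `partitionsScanned` are hypothetical names invented for this illustration, and the `take` shown here is a simplification (it walks partitions one at a time), so treat it as a model of the idea rather than Spark's implementation.

```scala
// Hypothetical stand-in for an RDD: each inner Seq models one partition.
final class MiniRDD[T](val partitions: Seq[Seq[T]]) {
  // Counts how many partitions take() had to look at (illustration only).
  var partitionsScanned: Int = 0

  // Simplified take: walk partitions in order and stop once n elements are found.
  def take(n: Int): Seq[T] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    val it = partitions.iterator
    while (buf.length < n && it.hasNext) {
      partitionsScanned += 1
      buf ++= it.next().take(n - buf.length)
    }
    buf.toSeq
  }

  // Same shape as the RDD.isEmpty expression quoted above.
  def isEmpty: Boolean = partitions.isEmpty || take(1).isEmpty
}

object IsEmptyDemo {
  def main(args: Array[String]): Unit = {
    val noPartitions = new MiniRDD[Int](Seq.empty)
    val emptyPartitions = new MiniRDD[Int](Seq(Seq.empty, Seq.empty))
    val withData = new MiniRDD[Int](Seq(Seq(1, 2), Seq(3)))

    // No partitions: the first condition answers without scanning anything.
    println(s"${noPartitions.isEmpty} after ${noPartitions.partitionsScanned} scans")
    // Partitions exist but hold no records: every partition is checked.
    println(s"${emptyPartitions.isEmpty} after ${emptyPartitions.partitionsScanned} scans")
    // Data present: the scan stops at the first record found.
    println(s"${withData.isEmpty} after ${withData.partitionsScanned} scans")
  }
}
```

The point of the sketch is that the check is only free when there are no partitions at all; once partitions exist, emptiness has to be established by actually looking for a first record.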

So the application code above can be fixed by adding one line:

wordCounts.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    rdd.foreachPartition { partitionOfRecords =>
      ...
    }
  }
}
...
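For experimenting with this guard locally, the following is a rough, self-contained sketch. Several things in it are assumptions rather than part of the original program: it replaces the socket source with `queueStream` so that an empty batch can be fed deliberately, it prints instead of writing to a database, and the `local[2]` master, 1-second batch interval and 5-second timeout are arbitrary values for a local run.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EmptyBatchGuard {
  def main(args: Array[String]): Unit = {
    // Assumed settings for a local test run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("EmptyBatchGuard")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Feed one non-empty batch followed by one empty batch.
    val queue = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("a b a")),
      ssc.sparkContext.emptyRDD[String]
    )

    val wordCounts = ssc.queueStream(queue)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Stand-in for the database write in the original snippet.
        rdd.foreachPartition { records =>
          records.foreach { case (word, count) => println(s"$word -> $count") }
        }
      } else {
        println("empty batch, skipping output")
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

Note that after `reduceByKey` the RDD normally still has partitions, so `isEmpty` has to run a small `take(1)` job; the guard saves the per-partition work (such as fetching connections), not every bit of scheduling.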

2 Stopping a Spark Streaming program

First look at StreamingContext.stop:

def stop(
    stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
  ): Unit = synchronized {
  stop(stopSparkContext, false)
}

To stop a Spark Streaming application properly, use the other stop overload:

def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
  var shutdownHookRefToRemove: AnyRef = null
  if (AsynchronousListenerBus.withinListenerThread.value) {
    throw new SparkException("Cannot stop StreamingContext within listener thread of" +
      " AsynchronousListenerBus")
  }
  synchronized {
    try {
      state match {
        case INITIALIZED =>
          logWarning("StreamingContext has not been started yet")
        case STOPPED =>
          logWarning("StreamingContext has already been stopped")
        case ACTIVE =>
          scheduler.stop(stopGracefully)
          // Removing the streamingSource to de-register the metrics on stop()
          env.metricsSystem.removeSource(streamingSource)
          uiTab.foreach(_.detach())
          StreamingContext.setActiveContext(null)
          waiter.notifyStop()
          if (shutdownHookRef != null) {
            shutdownHookRefToRemove = shutdownHookRef
            shutdownHookRef = null
          }
          logInfo("StreamingContext stopped successfully")
      }
    } finally {
      // The state should always be Stopped after calling `stop()`, even if we haven't started yet
      state = STOPPED
    }
  }
  if (shutdownHookRefToRemove != null) {
    ShutdownHookManager.removeShutdownHook(shutdownHookRefToRemove)
  }
  // Even if we have already stopped, we still need to attempt to stop the SparkContext because
  // a user might stop(stopSparkContext = false) and then call stop(stopSparkContext = true).
  if (stopSparkContext) sc.stop()
}
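The control flow of this method can be easier to follow in miniature. The sketch below is plain Scala with no Spark dependency; `MiniContext` and its counters are hypothetical names made up for illustration. It only models two behaviours of the listing above: the state always ends up stopped (via `finally`), and the SparkContext stop is attempted even on a second call.

```scala
object StopStateDemo {
  sealed trait State
  case object Initialized extends State
  case object Active extends State
  case object Stopped extends State

  // Hypothetical miniature of the stop() control flow; not Spark's class.
  final class MiniContext {
    var state: State = Initialized
    var schedulerStops = 0    // how many times the "scheduler" was stopped
    var sparkContextStops = 0 // how many times the "SparkContext" was stopped

    def start(): Unit = synchronized {
      if (state == Initialized) state = Active
    }

    def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
      synchronized {
        try {
          state match {
            case Initialized => println("not started yet")
            case Stopped     => println("already stopped")
            case Active =>
              schedulerStops += 1
              println(s"scheduler stopped (graceful=$stopGracefully)")
          }
        } finally {
          state = Stopped // always ends in Stopped, whatever the starting state
        }
      }
      // Attempted on every call, including calls made after the first stop.
      if (stopSparkContext) sparkContextStops += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val ctx = new MiniContext
    ctx.start()
    ctx.stop(stopSparkContext = false, stopGracefully = true)
    ctx.stop(stopSparkContext = true, stopGracefully = false)
    println(s"state=${ctx.state}, schedulerStops=${ctx.schedulerStops}, " +
      s"sparkContextStops=${ctx.sparkContextStops}")
  }
}
```

In this model the second call finds the context already stopped, so the scheduler is stopped only once, while the SparkContext stop still goes through.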

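The stopGracefully parameter defaults to false. In production it should be true. The practical way to do this is to set spark.streaming.stopGracefullyOnShutdown to true in the configuration file, so that work already in progress finishes before the program stops and data processing stays complete.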
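Besides a configuration file, the same key can be set programmatically on the SparkConf. A minimal sketch follows; the object name and app name are placeholders chosen for this example.

```scala
import org.apache.spark.SparkConf

object GracefulShutdownConf {
  // Builds a SparkConf with graceful shutdown enabled; the app name is a placeholder.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("GracefulShutdownExample")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

  def main(args: Array[String]): Unit = {
    val conf = build()
    // Reads the flag back the same way stopOnShutdown does.
    println(conf.getBoolean("spark.streaming.stopGracefullyOnShutdown", false))
  }
}
```

This flag only governs the stop triggered at JVM shutdown. Code that stops the context itself can simply pass the argument directly, for example `ssc.stop(stopSparkContext = true, stopGracefully = true)`.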

How does a Spark Streaming program achieve this? StreamingContext.stopOnShutdown calls the stop method shown above.

StreamingContext.stopOnShutdown:

private def stopOnShutdown(): Unit = {
  val stopGracefully = conf.getBoolean("spark.streaming.stopGracefullyOnShutdown", false)
  logInfo(s"Invoking stop(stopGracefully=$stopGracefully) from shutdown hook")
  // Do not stop SparkContext, let its own shutdown hook stop it
  stop(stopSparkContext = false, stopGracefully = stopGracefully)
}

StreamingContext.start registers a hook that calls stopOnShutdown:

StreamingContext.start:

def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}

So when the StreamingContext starts, it registers a hook that guarantees that, on shutdown, the stop method taking the stopGracefully parameter is called.
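The same register-on-start pattern can be tried with the plain JVM shutdown hook API. The sketch below does not use Spark's ShutdownHookManager; `MiniService` and its fields are hypothetical names for this illustration, and the graceful flag is passed as a constructor argument instead of being read from a configuration.

```scala
object ShutdownHookDemo {
  // Hypothetical service standing in for a StreamingContext.
  final class MiniService(gracefulOnShutdown: Boolean) {
    // Records how the service was stopped; None while it is still running.
    @volatile var stoppedGracefully: Option[Boolean] = None

    def stop(stopGracefully: Boolean): Unit = synchronized {
      if (stoppedGracefully.isEmpty) {
        stoppedGracefully = Some(stopGracefully)
        println(s"stopped (graceful=$stopGracefully)")
      }
    }

    // Plays the role of stopOnShutdown: take the flag, then delegate to stop.
    def stopOnShutdown(): Unit = stop(gracefulOnShutdown)

    // Registers the hook at start time, as StreamingContext.start does.
    def start(): Unit = {
      val hook = new Thread(new Runnable {
        def run(): Unit = stopOnShutdown()
      })
      Runtime.getRuntime.addShutdownHook(hook)
    }
  }

  def main(args: Array[String]): Unit = {
    val service = new MiniService(gracefulOnShutdown = true)
    service.start()
    println("main is returning; the JVM runs the registered hook on exit")
  }
}
```

Because the hook is registered in `start`, the caller never has to remember to stop the service on exit; the only decision left to configuration is whether that stop is graceful.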
