Spark Study Notes (21): A Closer Look at Dynamic Batch Size


In this installment:

1. A closer look at dynamic batch size

2. RateController analysis

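1. A Closer Look at Dynamic Batch Size

The dynamic batch size approach has not actually been implemented in Spark Streaming. The solution proposed in the paper is fixed-point iteration.

The paper includes an important figure (not reproduced here).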

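The basic idea: data is sent to the Controller in 100 ms batches. At first the Controller passes it straight through to the Job Generator, which hands the jobs to the Job Processor. The Job Generator does more than deliver processing results: it also sends job statistics back to the Controller. On receiving those statistics, the Controller dynamically changes the batch size used for the data it feeds to subsequent jobs.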
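The fixed-point idea can be illustrated with a small sketch. It assumes a made-up linear model of processing time (processingTimeMs) and a target ratio rho between processing time and batch interval; the model, the names, and the numbers are illustrative assumptions, not taken from the paper or from Spark. The iteration repeatedly replaces the batch interval x with processingTimeMs(x) / rho until it settles where processing takes exactly rho of the interval.

```scala
object FixedPointBatchSize {
  // Hypothetical workload model (an assumption for illustration only):
  // 40 ms of fixed overhead plus 0.6 ms of work per ms of buffered data.
  def processingTimeMs(batchMs: Double): Double = 40.0 + 0.6 * batchMs

  // Fixed-point iteration x_{n+1} = processingTimeMs(x_n) / rho.
  // At the fixed point, processing time equals rho * batch interval.
  def solve(initialMs: Double, rho: Double, iterations: Int): Double = {
    var x = initialMs
    for (_ <- 1 to iterations) {
      x = processingTimeMs(x) / rho
    }
    x
  }
}
```

With rho = 0.8 the update becomes x -> 50 + 0.75 x, which converges to 200 ms from a 100 ms start because the slope 0.75 is below 1. A real controller would work from measured job statistics rather than a known model.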

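Window operations also need some adjustment, as the paper illustrates in a figure (not reproduced here).

Experiments show that the approach works well for Filter, Reduce, Join, and Window workloads.

It can also adapt dynamically when other jobs suddenly join; the paper gives an example figure (not reproduced here).

One open question is whether the algorithm is complex enough that its own running time becomes a cost.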

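2. RateController Analysis

Spark Streaming provides RateController. ReceiverRateController and DirectKafkaRateController are its subclasses.

Rate adjustment is triggered when the last job of a batch completes: if the estimated consumption rate has changed, the new value is propagated.

The main flow of rate adjustment is shown in a flow chart (not reproduced here).

The flow is long, so for now we look only at the code at its end, the registeredBlockGenerators handling in ReceiverSupervisorImpl, and in its middle, ReceiverInputDStream.rateController.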

ReceiverSupervisorImpl:

private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv

    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      // The driver sends a new rate limit; push it to every registered BlockGenerator.
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })
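The fan-out of a rate update can be tried out without Spark. The sketch below is self-contained: the message types reuse the names from the excerpt, but they and FakeGenerator are hypothetical stand-ins defined here, not Spark's classes.

```scala
import scala.collection.mutable.ArrayBuffer

object ReceiverMessagesSketch {
  // Stand-in message types (assumptions mirroring the names in the excerpt).
  sealed trait Message
  case object StopReceiver extends Message
  final case class CleanupOldBlocks(threshTime: Long) extends Message
  final case class UpdateRateLimit(elementsPerSecond: Long) extends Message

  // Hypothetical stand-in for a block generator that only remembers its rate.
  final class FakeGenerator {
    var rate: Long = 0L
    def updateRate(newRate: Long): Unit = { rate = newRate }
  }

  val registered: ArrayBuffer[FakeGenerator] =
    ArrayBuffer(new FakeGenerator, new FakeGenerator)
  var stopped: Boolean = false

  // Same shape as the endpoint's receive: one case per message type.
  val receive: PartialFunction[Any, Unit] = {
    case StopReceiver         => stopped = true
    case CleanupOldBlocks(_)  => () // nothing to clean up in this sketch
    case UpdateRateLimit(eps) => registered.foreach(_.updateRate(eps))
  }
}
```

A single UpdateRateLimit message therefore changes the rate of every registered generator, while messages that match no case are simply not handled by the partial function.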

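Here bg is a BlockGenerator, a subclass of Spark Streaming's RateLimiter. That RateLimiter has a member named rateLimiter whose type is Google Guava's rate-limiting utility class, also called RateLimiter.

Conceptually, Guava's RateLimiter hands out permits at a configurable rate. Each acquire() blocks the current thread, if necessary, until a permit is available and then takes it. Once acquired, a permit does not need to be released.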

The code changes the rate through RateLimiter. RateLimiter.updateRate:

private[receiver] def updateRate(newRate: Long): Unit =
  if (newRate > 0) {
    if (maxRateLimit > 0) {
      // A configured maximum caps the requested rate.
      rateLimiter.setRate(newRate.min(maxRateLimit))
    } else {
      rateLimiter.setRate(newRate)
    }
  }

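If maxRateLimit is also set (that is, spark.streaming.receiver.maxRate is configured with a positive value), the smaller of newRate and maxRateLimit is used.

spark.streaming.receiver.maxRate caps the receiving rate. A fixed cap can waste resources, because the receiver stays throttled even when there is spare processing capacity, so configuring a static maximum rate is not ideal.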
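The capping rule can be written as a small pure function and checked with a few values. This is only a sketch of the decision made in updateRate; effectiveRate and RateCap are hypothetical names, and None stands for "leave the current rate unchanged".

```scala
object RateCap {
  /** Rate to apply, or None when newRate is not positive and is ignored. */
  def effectiveRate(newRate: Long, maxRateLimit: Long): Option[Long] =
    if (newRate <= 0) None
    else if (maxRateLimit > 0) Some(math.min(newRate, maxRateLimit))
    else Some(newRate)
}
```

For example, a requested rate of 500 with a maximum of 300 yields 300, while the same request with no maximum configured (a non-positive maxRateLimit) yields 500.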

Now back to ReceiverInputDStream.rateController in the middle of the flow.

ReceiverInputDStream.rateController:

override protected[streaming] val rateController: Option[RateController] = {
  if (RateController.isBackPressureEnabled(ssc.conf)) {
    Some(new ReceiverRateController(id, RateEstimator.create(ssc.conf, ssc.graph.batchDuration)))
  } else {
    None
  }
}

RateController.isBackPressureEnabled reports whether the backpressure mechanism is enabled.

RateController.isBackPressureEnabled:

object RateController {
  def isBackPressureEnabled(conf: SparkConf): Boolean =
    conf.getBoolean("spark.streaming.backpressure.enabled", false)
}

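If backpressure is enabled, the rateController member of the InputDStream subclass is set to a newly created ReceiverRateController, which is a RateController subclass. Otherwise it is None.

Creating the ReceiverRateController calls RateEstimator.create.

RateEstimator.create: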

/**
 * Return a new RateEstimator based on the value of
 * `spark.streaming.backpressure.rateEstimator`.
 *
 * The only known estimator right now is `pid`.
 *
 * @return An instance of RateEstimator
 * @throws IllegalArgumentException if there is a configured RateEstimator that doesn't match any
 *         known estimators.
 */
def create(conf: SparkConf, batchInterval: Duration): RateEstimator =
  conf.get("spark.streaming.backpressure.rateEstimator", "pid") match {
    case "pid" =>
      val proportional = conf.getDouble("spark.streaming.backpressure.pid.proportional", 1.0)
      val integral = conf.getDouble("spark.streaming.backpressure.pid.integral", 0.2)
      val derived = conf.getDouble("spark.streaming.backpressure.pid.derived", 0.0)
      val minRate = conf.getDouble("spark.streaming.backpressure.pid.minRate", 100)
      new PIDRateEstimator(batchInterval.milliseconds, proportional, integral, derived, minRate)

    case estimator =>
      throw new IllegalArgumentException(s"Unknown rate estimator: $estimator")
  }

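At present spark.streaming.backpressure.rateEstimator can only be set to pid. Four further backpressure settings are available: spark.streaming.backpressure.pid.proportional (default 1.0), spark.streaming.backpressure.pid.integral (default 0.2), spark.streaming.backpressure.pid.derived (default 0.0), and spark.streaming.backpressure.pid.minRate (default 100).

RateEstimator estimates how fast an InputDStream can consume data, and the receiving rate is adjusted to match that capacity. RateEstimator.create builds the estimator that drives the backpressure mechanism, which is better than simply capping the receiving rate.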
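To make the role of the four PID settings more concrete, here is a simplified sketch of a PID-style rate update. It uses the default values shown in RateEstimator.create, but it is an illustration written for these notes, not Spark's PIDRateEstimator; PidRateSketch, State, and step are hypothetical names, and the exact formula is an assumption.

```scala
object PidRateSketch {
  // State carried from one completed batch to the next (illustrative names).
  final case class State(latestTimeMs: Long, latestRate: Double, latestError: Double)

  def step(
      state: State,
      timeMs: Long,
      numElements: Long,
      processingDelayMs: Long,
      schedulingDelayMs: Long,
      batchIntervalMs: Long,
      proportional: Double = 1.0,
      integral: Double = 0.2,
      derived: Double = 0.0,
      minRate: Double = 100.0): State = {
    require(timeMs > state.latestTimeMs && numElements > 0 && processingDelayMs > 0)
    val secondsSinceUpdate = (timeMs - state.latestTimeMs).toDouble / 1000
    // Elements per second the batch was actually processed at.
    val processingRate = numElements.toDouble / processingDelayMs * 1000
    // Proportional term: how far the admitted rate is above the processing rate.
    val error = state.latestRate - processingRate
    // Integral term: backlog implied by the scheduling delay, as elements per second.
    val historicalError = schedulingDelayMs.toDouble * processingRate / batchIntervalMs
    // Derivative term: how quickly the error is changing.
    val dError = (error - state.latestError) / secondsSinceUpdate
    val newRate = (state.latestRate
      - proportional * error
      - integral * historicalError
      - derived * dError).max(minRate)
    State(timeMs, newRate, error)
  }
}
```

In this sketch, a stream admitted at 1000 elements per second but processed at only 500 has its rate lowered to 500 when there is no scheduling delay, and the minRate floor keeps the result from dropping below 100.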

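Next, look at the ReceiverRateController that gets created. ReceiverRateController is a subclass of RateController.

Inheritance: ReceiverRateController => RateController => StreamingListener. As a StreamingListener, RateController receives batch-completed events delivered by the streaming listener bus.

Only when backpressure is enabled is ReceiverInputDStream's rateController not None. That is what lets the RateController in the flow above handle the events it receives and ultimately adjust the rate.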

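A brief word on the waitToPush method used by BlockGenerator.

BlockGenerator is a subclass of RateLimiter. It uses waitToPush to limit the rate at which the receiver consumes data.

When BlockGenerator builds blocks, each of its methods for adding data (addData, addDataWithCallback, and addMultipleDataWithCallback) calls waitToPush.

waitToPush deserves a closer look in a later note.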

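Note: Google Guava's rate-limiting utility class RateLimiter

RateLimiter uses a flow-control algorithm known as the token bucket. It drops tokens into a bucket at a fixed rate, and a thread can proceed only after it obtains a token. For example, if you want your application's QPS to stay at or below 1000, set the RateLimiter rate to 1000 and it will add 1000 tokens to the bucket every second.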
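A minimal token bucket can be sketched in a few lines. This is an illustration of the algorithm described above, not Guava's implementation; TokenBucket and its parameters are hypothetical, and the current time is passed in explicitly so that the behaviour is deterministic.

```scala
// Minimal token bucket: tokens refill at ratePerSecond, up to capacity.
final class TokenBucket(ratePerSecond: Double, capacity: Double) {
  private var tokens: Double = capacity // start with a full bucket
  private var lastNanos: Long = 0L

  /** Take `permits` tokens at time `nowNanos` if available; never blocks. */
  def tryAcquire(permits: Int, nowNanos: Long): Boolean = synchronized {
    // Refill according to the time elapsed since the last call.
    val elapsedSeconds = (nowNanos - lastNanos).max(0L) / 1e9
    tokens = math.min(capacity, tokens + elapsedSeconds * ratePerSecond)
    lastNanos = math.max(lastNanos, nowNanos)
    if (tokens >= permits) {
      tokens -= permits
      true
    } else {
      false
    }
  }
}
```

With a rate of 1000 per second, a bucket that has just been drained refuses further requests until time passes; half a second later it holds 500 tokens again. A blocking acquire, as in Guava, would wait for the refill instead of returning false.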

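Modifier and type, method, and description:

double acquire()
Acquires a single permit from the RateLimiter, blocking until the request can be granted.

double acquire(int permits)
Acquires the given number of permits from the RateLimiter, blocking until the request can be granted.

static RateLimiter create(double permitsPerSecond)
Creates a RateLimiter with the given stable throughput, expressed in permits per second (commonly thought of as QPS, queries per second).

static RateLimiter create(double permitsPerSecond, long warmupPeriod, TimeUnit unit)
Creates a RateLimiter with the given stable throughput, in permits per second, and a warm-up period. During the warm-up period the number of permits issued per second ramps up smoothly and reaches the maximum rate at the end of the period, provided there are enough requests to saturate it.

double getRate()
Returns the stable rate, in permits per second, with which the RateLimiter is configured.

void setRate(double permitsPerSecond)
Updates the stable rate of the RateLimiter, that is, the permitsPerSecond value originally supplied to the factory method that constructed it.

String toString()
Returns a string representation of the object.

boolean tryAcquire()
Acquires a permit from the RateLimiter only if it can be obtained immediately, without delay.

boolean tryAcquire(int permits)
Acquires the given number of permits from the RateLimiter only if they can be obtained immediately, without delay.

boolean tryAcquire(int permits, long timeout, TimeUnit unit)
Acquires the given number of permits if they can be obtained within the timeout. If they could not be granted before the timeout expires, returns false immediately without waiting.

boolean tryAcquire(long timeout, TimeUnit unit)
Acquires a permit if it can be obtained within the timeout. If it could not be granted before the timeout expires, returns false immediately without waiting.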
