How to spread receivers over worker hosts in Spark streaming - draft


A problem encountered in Spark Streaming:
The goal was to start multiple receivers to increase the parallelism of data ingestion, but the following message kept appearing:
INFO ReceiverSupervisorImpl: Stopping receiver with message: Registered unsuccessfully because Driver refused to start receiver 2:

After some digging, the error turns out to be raised by the startReceiver method in org.apache.spark.streaming.receiver.ReceiverSupervisor, which calls onReceiverStart; that method is implemented in org.apache.spark.streaming.receiver.ReceiverSupervisorImpl as follows:

override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}

It looks like an exception occurs while the receiver is being registered. I did not find a good solution directly, but I found the related article below.
Reposted from: How to spread receivers over worker hosts in Spark streaming


How to spread receivers over worker hosts in Spark streaming

In Spark Streaming, you can spawn multiple receivers to increase parallelism, e.g., such that each receiver reads from one of the partitions in Kafka. Then you combine the resulting streams and process them by batches. The code is sketched as follows:


val ssc = new StreamingContext(sc, Seconds(batchInterval))
val dstream =
  if (numReceivers == 1) {
    ssc.receiverStream(new MyKafkaReceiver(storageLevel))
  } else {
    val streams = (0 until numReceivers).map { receiverNo =>
      ssc.receiverStream(new MyKafkaReceiver(storageLevel))
    }
    ssc.union(streams)
  }
dstream.foreachRDD(…) // your business logic

In this code snippet, the receivers will be “randomly” scheduled on a set of worker hosts. Sounds perfect, right? Ideally, the receivers should be spread over the worker hosts as evenly as possible so that they read from the data sources via separate network interfaces. However, this default random scheduling often cannot guarantee an even distribution. First, it depends on which executors are known to the scheduler at the moment; it takes time for executors to be created and registered, which does not happen simultaneously, especially in a large distributed system. Second, the scheduling maps receivers onto the set of executors, and a worker host may run multiple executors. Consequently, you will sometimes observe that some hosts run several receivers while others run none. This imbalance also has performance implications for the subsequent stream processing due to data locality, yielding poor resource utilization. The default scheduler favors CPU/memory over networking resources.
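
To see where the receivers actually land, one practical check (a minimal sketch, not from the original article) is to log the local hostname from each receiver's onStart(), which runs on the executor that hosts the receiver. PlacementLoggingReceiver below is a hypothetical name, and the actual data-receiving logic (such as the Kafka consumer inside MyKafkaReceiver) is elided:

import java.net.InetAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch: a receiver that only reports which worker host it was scheduled on.
class PlacementLoggingReceiver(storageLevel: StorageLevel)
    extends Receiver[String](storageLevel) {

  override def onStart(): Unit = {
    val host = InetAddress.getLocalHost.getHostName
    // Grep the executor logs for this line across the cluster to see whether
    // some hosts ended up with several receivers while others got none.
    println(s"Receiver $streamId started on host $host")
    // ... start the receiving thread(s) here and call store(...) on incoming records ...
  }

  override def onStop(): Unit = {
    // ... stop the receiving thread(s) ...
  }
}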


To address these two limitations, we first need to figure out the entire set of worker hosts, namely, the machines that run the executors in which your data receiving and processing will happen. Unfortunately, there is no convenient Spark API that directly gives us this information, so we have to solve the problem indirectly.


One approach is to make up a fake workload and attempt to schedule it at a very aggressive level of parallelism. In the workload, we do nothing but register the local hostname with an accumulator. In the end, the driver learns the set of hosts from the accumulator and then provides hints to the scheduler. Note that, per the discussion above, we must repeat this a few times, until convergence or until a maximum number of trials, in order to see all hosts. The code is as follows.


import java.net.InetAddress
import org.apache.spark.{AccumulatorParam, SparkContext}

implicit object HostsAccParam extends AccumulatorParam[Set[String]] {
  def addInPlace(t1: Set[String], t2: Set[String]): Set[String] = t1 ++ t2
  def zero(initialValue: Set[String]): Set[String] = Set[String]()
}

// estimatedNumberWorkers is a rough guess of the cluster size, assumed to be
// defined in the enclosing scope; it only controls how aggressive the fake workload is.
def getActiveWorkerHostSet(sc: SparkContext): Set[String] = {
  val hostsAccumulator = sc.accumulator[Set[String]](Set[String]())
  var foundNewHosts: Boolean = true
  var trials: Int = 0
  while (foundNewHosts && trials < 5) {
    trials += 1
    val oldSet = hostsAccumulator.value
    val dataSet = 1 to estimatedNumberWorkers * 10000 * trials
    val numTasks = estimatedNumberWorkers * 100 * trials
    // Each task registers the hostname of the executor it runs on.
    sc.parallelize(dataSet, numTasks).foreach { _ =>
      val hostname = InetAddress.getLocalHost.getHostName
      hostsAccumulator += Set(hostname)
    }
    val newSet = hostsAccumulator.value
    foundNewHosts = newSet.size > oldSet.size
  }
  hostsAccumulator.value
}
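
A brief usage sketch, assuming estimatedNumberWorkers (the rough cluster-size guess referenced inside the method above) has been defined beforehand, for example as 8:

// Usage sketch; the value 8 is only an assumed cluster-size guess.
// val estimatedNumberWorkers = 8   // defined before getActiveWorkerHostSet
val probedHosts: Set[String] = getActiveWorkerHostSet(sc)
println(s"Probed ${probedHosts.size} worker hosts: ${probedHosts.mkString(", ")}")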

The underlying assumption in the above approach is that the fake workload is aggressive “enough” relative to the available resources. However, it broke in one of my experimental environments, where the driver happened to run on a worker host with a sufficient amount of resources. As a result, all the tasks were executed on that host alone, and hence all receivers were scheduled to that very host. In practice, it is difficult to create a fake workload that is guaranteed to spread over all the worker hosts without wasting compute resources. That said, this approach may end up being worse than the default scheduling in some environments.

It turns out that a better approach is to leverage an API in SparkContext: the method getExecutorMemoryStatus(), which returns the block manager addresses registered with the driver’s block manager master. When a SparkContext is created during system initialization, it creates a driver with a block manager master, with which the block managers created in the driver and in every executor register. This covers all the hosts we need. The only concern is that the returned addresses include not only those of the executors but also that of the driver. Fortunately, we can figure out the driver address and exclude it. The code is as follows.

def getActiveWorkerHosts(sc: SparkContext): Set[String] = {
  val driverHost: String = sc.getConf.get("spark.driver.host")
  var workerSet = Set[String]()
  var foundNewHosts: Boolean = true
  val beginTimeMillis = System.currentTimeMillis
  var timeout = false
  while (foundNewHosts && !timeout) {
    Thread.sleep(3000)
    val oldSet = workerSet
    // Keys of getExecutorMemoryStatus are "host:port" block manager addresses,
    // including the driver's; strip the port and drop the driver host.
    val allHosts = sc.getExecutorMemoryStatus.map(_._1.split(":")(0)).toList
    workerSet = allHosts.diff(List(driverHost)).toSet
    foundNewHosts = workerSet.diff(oldSet).nonEmpty
    if (System.currentTimeMillis - beginTimeMillis >= 30000)
      timeout = true
  }
  workerSet
}

Now we can distribute the receivers evenly over the set of worker hosts by overriding the preferredLocation method provided in class Receiver (which returns None by default). The following code shuffles the set and then suggests to the scheduler that the receivers be run on the worker hosts in a round-robin manner:

val workerSet = getActiveWorkerHosts(sc)
val candidates = scala.util.Random.shuffle(workerSet.toSeq).toArray

val streams = (0 until numReceivers).map { receiverNo =>
  val host = candidates(receiverNo % candidates.length)
  ssc.receiverStream(new MyKafkaReceiver(storageLevel) {
    override def preferredLocation: Option[String] = Some(host)
  })
}

Furthermore, we can be a bit more considerate when the master hosts also run workers/executors, as in some economical configurations. The masters themselves already communicate a lot, so we avoid placing communication-bound receivers on those hosts if possible. In the following code, we give a higher priority to hosts that are only workers and a lower priority to those that also run master daemons. When the candidates are assigned in a round-robin manner, hosts at the rear of the array are assigned less frequently than those at the front.

val masterSet: Set[String] =
  sc.master.split("spark://")(1).split(",").map(_.split(":")(0)).toSet
val nonMasterWorkSet = workerSet -- masterSet
val bothMasterWorkerSet = workerSet & masterSet
// Worker-only hosts come first (shuffled); hosts that also run a master come last.
val prioritizedSeq = scala.util.Random.shuffle(nonMasterWorkSet.toSeq) ++ bothMasterWorkerSet.toSeq
val candidates = prioritizedSeq.toArray  // use this in place of the shuffled candidates above
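
Putting the pieces together, the overall wiring might look like the sketch below. This is only one way to assemble the helpers under the assumptions of the earlier snippets (MyKafkaReceiver, numReceivers, batchInterval and storageLevel), not the original author's exact code:

// Sketch: discover worker hosts, prioritize non-master hosts, pin receivers, then run.
val ssc = new StreamingContext(sc, Seconds(batchInterval))

val workerSet = getActiveWorkerHosts(sc)
val masterSet: Set[String] =
  sc.master.split("spark://")(1).split(",").map(_.split(":")(0)).toSet
val prioritizedSeq =
  scala.util.Random.shuffle((workerSet -- masterSet).toSeq) ++ (workerSet & masterSet).toSeq
val candidates = prioritizedSeq.toArray

val streams = (0 until numReceivers).map { receiverNo =>
  val host = candidates(receiverNo % candidates.length)
  ssc.receiverStream(new MyKafkaReceiver(storageLevel) {
    override def preferredLocation: Option[String] = Some(host)
  })
}
val dstream = ssc.union(streams)
dstream.foreachRDD(rdd => { /* your business logic */ })

ssc.start()
ssc.awaitTermination()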