Troubleshooting a Spark Streaming Job's Dependency Chain
Background
Spark Streaming has much in common with Spark: it abstracts the DStream, which is based on the RDD, and its transformations and output operations are similar to Spark's. But because of its periodic execution model, problems that may not be serious in Spark can become a big deal in Spark Streaming. This article introduces a dependency chain problem that gradually slows a streaming job down until it crashes.
Why the streaming job gradually slows down
Problem statement
The Spark Streaming job's running time gradually increases while the input data size stays almost the same. You can see the job's running time chart below.
My job looks like this:
newGenRdd = input.filter(...).map(...).join(...).countByValue()
oldRdd.zipPartitions(..., newGenRdd)
Here newGenRdd is calculated in each batch duration, and oldRdd is cached in Shark's memoryMetadataManager. Then oldRdd is zipped with newGenRdd to get a zipped RDD, which becomes the next round's oldRdd. So in each batch duration the job runs as the above code shows.
In my zipPartitions function, I filter out records that are older than a specific time, to keep the total number of records in the RDD stable.
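The per-batch update described above can be sketched as a toy simulation in plain Python (this is not Spark code; `update_state`, `RETENTION`, and the record layout are my own illustrative choices):

```python
RETENTION = 3  # assumed window: drop records not seen in the last 3 batches

def update_state(old_state, new_counts, current_batch):
    """Merge this batch's counts into the accumulated state, then drop
    records last seen too long ago (the filtering done inside zipPartitions)."""
    merged = dict(old_state)
    for key, count in new_counts.items():
        total, _ = merged.get(key, (0, current_batch))
        merged[key] = (total + count, current_batch)
    # keep only records whose last-seen batch is within the retention window
    return {k: v for k, v in merged.items()
            if current_batch - v[1] < RETENTION}

state = {}
for batch in range(5):
    new_counts = {"hot": 1}      # a key that appears in every batch
    if batch == 0:
        new_counts["old"] = 1    # a key that appears only once, then ages out
    state = update_state(state, new_counts, batch)
```

After five batches, `"old"` has been filtered out while `"hot"` keeps accumulating, so the state size stays bounded even though counts keep flowing in.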
So, by common sense, while the input data size stays almost the same, the job's running time in each batch duration should be stable. But as the above chart shows, the running time gradually increases as time passes.
Phenomenon
The ClientDriver host's network output grows gradually, and the StandaloneExecutorBackend hosts' network input grows gradually. You can see the network graph of the whole cluster below, in which sr412 is the ClientDriver and the others are StandaloneExecutorBackends.
The graph shows that only the ClientDriver's network output and the StandaloneExecutorBackends' network input increase, which indicates that in each batch duration the ClientDriver transmits data to all the slaves. There are several possibilities:
- data structures created in the ClientDriver are transmitted to the slaves for closures to use;
- dependent static data structures are transmitted to the slaves when static functions are called inside the slaves' closures;
- control messages are transmitted to the slaves by Akka.
The growth of the network traffic should also be noted: some of the data structures transmitted to the slaves might be cumulative.
The serialized task size grows gradually. Following up on the network traffic phenomenon, I dug out the serialized task size of each job's zipPartitions stage. As the below chart shows, the task size gradually grows while the input data size of each batch duration stays almost the same. The running time of the zipPartitions stage also increases, as the below chart shows.
After carefully examining my implementation of zipPartitions, in which the data structures are invariant across batch durations, I suspected a problem introduced by the Spark framework itself. I recursively dug out all the dependencies of oldRdd and newGenRdd, and found that as the job runs periodically, the dependencies of oldRdd increase rapidly, while the dependencies of newGenRdd stay the same, as the below chart shows.
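The dependency growth can be reproduced with a toy lineage graph (pure Python, my own minimal stand-in for RDD dependencies, not Spark's actual classes):

```python
class ToyRdd:
    """Minimal stand-in for an RDD: nothing but a list of parent RDDs."""
    def __init__(self, parents=()):
        self.parents = list(parents)

def count_deps(rdd):
    """Recursively count every node in the dependency chain."""
    return sum(1 + count_deps(p) for p in rdd.parents)

old_rdd = ToyRdd()
dep_history = []
for batch in range(10):
    # newGenRdd has a fixed-size lineage each batch (filter -> map -> join -> count)
    new_gen = ToyRdd([ToyRdd([ToyRdd([ToyRdd()])])])
    # zipping makes the result depend on BOTH the old RDD and the new one
    old_rdd = ToyRdd([old_rdd, new_gen])
    dep_history.append(count_deps(old_rdd))
```

Each batch, `new_gen`'s own lineage stays constant, but `old_rdd`'s total dependency count grows by a fixed amount per round, matching the linear growth observed in the chart.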
Reason
According to the above phenomena, it is obvious that the growth of the dependency chain makes the job gradually slower. Investigating Spark's code shows that in each batch duration, oldRdd adds newGenRdd's dependency chain to its own; after several rounds, oldRdd's dependency chain becomes huge, and serialization and deserialization, which were previously trivial, become time-consuming work. Take the below code as an example:
var rdd = ...
for (i <- 0 to 100)
  rdd = rdd.map(x => x)
rdd = rdd.cache()
Here, as you iteratively transform the RDD, each iteration's dependency chain is added to the most recently created rdd; in the end this rdd has a long dependency chain that includes every iteration's dependencies. The transformation also runs gradually slower.
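The cost of that long chain can be measured directly, using Python's pickle as a stand-in for Spark's task serializer (toy classes and names are my own):

```python
import pickle

class MapRdd:
    """Toy RDD: each map transformation remembers its parent RDD."""
    def __init__(self, parent=None):
        self.parent = parent

    def map(self):
        return MapRdd(parent=self)

rdd = MapRdd()
sizes = []
for _ in range(100):
    rdd = rdd.map()
    # serializing the RDD drags its whole parent chain along with it
    sizes.append(len(pickle.dumps(rdd)))
```

The serialized payload grows with every iteration even though each `map` is identical, which is exactly why the per-task serialization cost creeps up batch after batch.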
So, to conclude, the growth of dependencies makes serialization and deserialization of each task the main burden as the job runs. This also explains why task deserialization can hit a stack overflow exception even when the job itself is not complicated.
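The stack overflow can be mimicked in miniature: Python's pickler, like Spark's serializer, walks the object graph recursively, so a long enough parent chain blows the stack (a toy analogy, not Spark's Java serialization):

```python
import pickle
import sys

class Lineage:
    """One link in a dependency chain."""
    def __init__(self, parent=None):
        self.parent = parent

# build a chain far deeper than the interpreter's recursion limit allows
node = None
for _ in range(sys.getrecursionlimit() * 5):
    node = Lineage(node)

try:
    pickle.dumps(node)       # recursive traversal of the whole chain
    overflowed = False
except RecursionError:
    overflowed = True
```

The data at each link is tiny; the depth of the chain alone is enough to overflow, mirroring how an innocent-looking streaming job can crash during task deserialization.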
I also tested without combining oldRdd: each time, newGenRdd is put into Shark's memoryMetadataManager but is not zipped with oldRdd, and the job's running time stays stable.
So I think any iterative job that reuses previously calculated RDDs will meet this problem. It is sometimes hidden behind GC or shuffle problems. For a small number of iterations it is not a big deal, but if you do machine learning work that iterates a job many times, it becomes a real problem.
This issue has also been raised in the Spark User Group:
Is there some way to break down the RDD dependency chain?
Spark Memory Question
One way to break the RDD dependency chain is to write the RDD to a file and read it back into memory; this clears all the dependencies of the RDD. A way to clean the dependencies in place might also be a solution, but it is hard to implement.
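The effect of the write-out-and-read-back workaround can be simulated with the same toy classes (my own names; in real Spark, RDD.checkpoint() serves a similar lineage-truncating purpose):

```python
import os
import pickle
import tempfile

class MapRdd:
    """Toy RDD carrying both data and a parent reference."""
    def __init__(self, data, parent=None):
        self.data = data
        self.parent = parent

    def map(self):
        return MapRdd(self.data, parent=self)

rdd = MapRdd(list(range(10)))
for _ in range(100):
    rdd = rdd.map()
long_chain_size = len(pickle.dumps(rdd))

# "write to file and read back": persist only the materialized data,
# then rebuild an RDD with no parent -- the dependency chain is gone
with tempfile.NamedTemporaryFile(delete=False) as f:
    pickle.dump(rdd.data, f)
    path = f.name
with open(path, "rb") as f:
    rdd = MapRdd(pickle.load(f))
os.remove(path)
fresh_size = len(pickle.dumps(rdd))
```

The data is unchanged, but the reloaded RDD serializes to a fraction of the size, because only the materialized result survives the round trip, not the lineage.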