Spark-Dependency
1. Spark uses Dependency objects to describe how RDDs are derived from one another, and relies on them to recompute the partitions of a failed RDD. Every RDD carries a list of its dependencies:
private var dependencies_ : Seq[Dependency[_]] = null
which records, for each partition of the RDD, its parent partitions.
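The role of this list can be sketched without Spark: when a partition is lost, walking the dependency list yields the parent partitions that must be recomputed. The `Dep` trait and `partitionsToRecompute` below are illustrative stand-ins, not Spark's API:

```scala
// Minimal stand-in for Spark's Dependency: maps a child partition
// to the parent partitions it was computed from.
trait Dep {
  def getParents(partitionId: Int): Seq[Int]
}

// One-to-one mapping, as produced by map/filter.
class OneToOne extends Dep {
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

// To recover a lost child partition, collect the parent partitions
// named by every dependency in the RDD's dependency list.
def partitionsToRecompute(deps: Seq[Dep], lostPartition: Int): Seq[Int] =
  deps.flatMap(_.getParents(lostPartition))

// An RDD built by map() over one parent: losing partition 3
// requires recomputing only parent partition 3.
val needed = partitionsToRecompute(Seq(new OneToOne), 3)
```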
2. Spark has two kinds of Dependency:
1) NarrowDependency: each partition of the parent RDD is used by at most one partition of the child RDD, so no shuffle is needed. For these dependencies, getParents returns the parent partitions of a given child partition:
/**
 * :: DeveloperApi ::
 * Base class for dependencies where each partition of the parent RDD is used by at most one
 * partition of the child RDD. Narrow dependencies allow for pipelined execution.
 */
@DeveloperApi
abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) {
  /**
   * The only interface: obtain all the parent partitions of the given partition.
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]
}
Narrow dependencies come in several variants:
a. OneToOneDependency: a one-to-one mapping. Since the parent and child partition ids are identical in this dependency, getParents simply returns the id unchanged. Transformations such as map and filter produce this dependency.
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  /**
   * partitionId is simply the partition's index within the RDD, so for a
   * one-to-one mapping the parent and child partitions share the same index.
   */
  override def getParents(partitionId: Int) = List(partitionId)  // return the id unchanged
}
b. PruneDependency (see org.apache.spark.rdd.PartitionPruningRDDPartition): the child RDD contains only a filtered subset of the parent's partitions.
/**
 * Represents a dependency between the PartitionPruningRDD and its parent. In this
 * case, the child RDD contains a subset of partitions of the parents'.
 */
private[spark] class PruneDependency[T](rdd: RDD[T], @transient partitionFilterFunc: Int => Boolean)
  extends NarrowDependency[T](rdd) {

  @transient
  val partitions: Array[Partition] = rdd.partitions
    .filter(s => partitionFilterFunc(s.index)).zipWithIndex
    .map { case (split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }

  override def getParents(partitionId: Int) = {
    List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
  }
}
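The pruning arithmetic above can be reproduced standalone: filter the parent partition indices, then renumber the survivors consecutively with zipWithIndex. The helper below is a sketch of that mapping, not Spark code:

```scala
// Keep only the parent partitions whose index passes the filter, and
// renumber the survivors consecutively, as PruneDependency does.
def pruneMapping(numParentPartitions: Int, keep: Int => Boolean): Map[Int, Int] =
  (0 until numParentPartitions)
    .filter(keep)          // surviving parent indices, in order
    .zipWithIndex          // pair each with its new child index
    .map { case (parentIdx, childIdx) => childIdx -> parentIdx }
    .toMap

// Five parent partitions, keep the even indices:
// child 0 -> parent 0, child 1 -> parent 2, child 2 -> parent 4.
val m = pruneMapping(5, _ % 2 == 0)
```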
c. RangeDependency: a contiguous range of parent partitions maps to a contiguous range of child partitions. Transformations such as union produce this dependency.
/** Union
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int) = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      // partitionId must fall within the child RDD's valid partition range
      List(partitionId - outStart + inStart)  // compute the corresponding partition id in the parent RDD
    } else {
      Nil
    }
  }
}
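The formula in getParents can be checked in isolation. For rdd1.union(rdd2) with 3 and 4 partitions respectively, the child RDD gets two range dependencies; child partition 4 falls in the second range and maps to partition 1 of rdd2. The function below re-implements the same arithmetic without the RDD plumbing:

```scala
// Same formula as RangeDependency.getParents, without the RDD plumbing.
def rangeParents(inStart: Int, outStart: Int, length: Int, partitionId: Int): Seq[Int] =
  if (partitionId >= outStart && partitionId < outStart + length)
    List(partitionId - outStart + inStart)  // shift from the child range into the parent range
  else
    Nil

// union of a 3-partition RDD and a 4-partition RDD:
//   dependency on rdd1: inStart = 0, outStart = 0, length = 3
//   dependency on rdd2: inStart = 0, outStart = 3, length = 4
// Child partition 4 is outside rdd1's range but maps to rdd2's partition 1.
val fromRdd1 = rangeParents(0, 0, 3, 4)
val fromRdd2 = rangeParents(0, 3, 4, 4)
```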
2) ShuffleDependency (the wide dependency): a single parent partition can feed multiple partitions of the child RDD. Because each parent partition must be split up, a shuffle is required, and shuffles generally operate on key-value pairs.
Every shuffle is assigned a globally unique shuffleId. The partitioner argument specifies how records are redistributed, and since shuffle data crosses the network, a Serializer is needed. Note that a wide dependency has no getParents: a child partition may depend on every parent partition, so the parent partitions cannot be narrowed down per child.
/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage.
 * @param rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If set to null,
 *                   the default serializer, as specified by `spark.serializer` config option, will
 *                   be used.
 */
@DeveloperApi
class ShuffleDependency[K, V](
    @transient rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,    // the partitioner specifies how the shuffle is carried out
    val serializer: Serializer = null)  // unlike map, a shuffle crosses the network or storage, so a serializer class is needed
  extends Dependency(rdd.asInstanceOf[RDD[Product2[K, V]]]) {

  // each shuffle gets a globally unique id; context.newShuffleId() just increments a global counter
  val shuffleId: Int = rdd.context.newShuffleId()

  rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
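How the partitioner splits a parent partition can be sketched with the same logic Spark's HashPartitioner uses: a non-negative hashCode modulo the number of partitions. This is a simplified model for illustration, not the Spark class itself:

```scala
// Simplified hash partitioning: every key is routed to one of
// numPartitions shuffle outputs, so records from one parent partition
// can land in many child partitions -- hence the wide dependency.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod  // keep the result non-negative
}

// Keys 10 and -3 from the same parent partition go to different
// reducers when shuffling into 4 partitions.
val p1 = hashPartition(10, 4)   // 10 % 4 = 2
val p2 = hashPartition(-3, 4)   // -3 % 4 = -3, shifted to 1
```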