Spark Study Notes 4: RDD in Detail
Source: Internet    Editor: 程序博客网    Time: 2024/05/22 06:32
1 The canonical definition of an RDD
package org.apache.spark.rdd

import java.util.Random

import scala.collection.{mutable, Map}
import scala.collection.mutable.ArrayBuffer
import scala.reflect.{classTag, ClassTag}

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.compress.CompressionCodec
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.TextOutputFormat

import org.apache.spark._
import org.apache.spark.Partitioner._
import org.apache.spark.SparkContext._
import org.apache.spark.annotation.{DeveloperApi, Experimental}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.partial.BoundedDouble
import org.apache.spark.partial.CountEvaluator
import org.apache.spark.partial.GroupedCountEvaluator
import org.apache.spark.partial.PartialResult
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.{BoundedPriorityQueue, Utils}
import org.apache.spark.util.collection.OpenHashMap
import org.apache.spark.util.random.{BernoulliSampler, PoissonSampler, SamplingUtils}

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 * [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
 * Doubles; and
 * [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
 * can be saved as SequenceFiles.
 * These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]
 * through implicit conversions when you `import org.apache.spark.SparkContext._`.
 *
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 *
 * All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
 * to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
 * reading data from a new storage system) by overriding these functions. Please refer to the
 * [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
 * on RDD internals.
 */
abstract class RDD[T: ClassTag](
    @transient private var sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  /** Construct an RDD with just a one-to-one dependency on one parent */
  def this(@transient oneParent: RDD[_]) =
    this(oneParent.context , List(new OneToOneDependency(oneParent)))

  private[spark] def conf = sc.conf

  // =======================================================================
  // Methods that should be implemented by subclasses of RDD
  // =======================================================================

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None
The most important part of the definition above is this description of an RDD:
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
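To make the five properties concrete, here is a minimal sketch in plain Scala. It is only a toy model: ToyRdd, ToyPartition and ToySource are hypothetical names invented for illustration and are not Spark classes, and the partitioner is simplified to an optional key-to-index function.

```scala
// Toy model only: these names are hypothetical and are not Spark classes.
object ToyRddModel {
  final case class ToyPartition(index: Int)

  // The five properties from the comment above, expressed as one small trait.
  trait ToyRdd[T] {
    def partitions: Seq[ToyPartition]                // 1. a list of partitions
    def compute(split: ToyPartition): Iterator[T]    // 2. a function for computing each split
    def dependencies: Seq[ToyRdd[_]]                 // 3. dependencies on other datasets
    def partitioner: Option[Any => Int] = None       // 4. optional partitioner (simplified)
    def preferredLocations(split: ToyPartition): Seq[String] = Nil // 5. optional locations
  }

  // A source dataset: the data is already split into slices and there are no parents.
  final class ToySource[T](slices: Vector[Seq[T]], hosts: Vector[Seq[String]])
    extends ToyRdd[T] {
    def partitions: Seq[ToyPartition] = slices.indices.map(ToyPartition(_))
    def compute(split: ToyPartition): Iterator[T] = slices(split.index).iterator
    def dependencies: Seq[ToyRdd[_]] = Nil
    override def preferredLocations(split: ToyPartition): Seq[String] = hosts(split.index)
  }
}
```

In this sketch the two optional properties have defaults (None and Nil), just as in the RDD class above, so a subclass only overrides them when it has something to say.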
Next we look at the definition of each part in turn.
1.1 Returns an array of partitions
  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getPartitions: Array[Partition]
1.2 Returns the dependencies on parent RDDs
  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps
1.3 Computes a given partition
  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]
Note that one of compute's parameters is a TaskContext, so let's look at what TaskContext contains.
package org.apache.spark

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.executor.TaskMetrics
import org.apache.spark.util.TaskCompletionListener

/**
 * :: DeveloperApi ::
 * Contextual information about a task which can be read or mutated during execution.
 *
 * @param stageId stage id
 * @param partitionId index of the partition
 * @param attemptId the number of attempts to execute this task
 * @param runningLocally whether the task is running locally in the driver JVM
 * @param taskMetrics performance metrics of the task
 */
@DeveloperApi
class TaskContext(
    val stageId: Int,
    val partitionId: Int,
    val attemptId: Long,
    val runningLocally: Boolean = false,
    private[spark] val taskMetrics: TaskMetrics = TaskMetrics.empty)
  extends Serializable {

  @deprecated("use partitionId", "0.8.1")
  def splitId = partitionId

  // List of callback functions to execute when the task completes.
  @transient private val onCompleteCallbacks = new ArrayBuffer[TaskCompletionListener]

  // Whether the corresponding task has been killed.
  @volatile private var interrupted: Boolean = false

  // Whether the task has completed.
  @volatile private var completed: Boolean = false

  /** Checks whether the task has completed. */
  def isCompleted: Boolean = completed

  /** Checks whether the task has been killed. */
  def isInterrupted: Boolean = interrupted

  // TODO: Also track whether the task has completed successfully or with exception.

  /**
   * Add a (Java friendly) listener to be executed on task completion.
   * This will be called in all situation - success, failure, or cancellation.
   *
   * An example use is for HadoopRDD to register a callback to close the input stream.
   */
  def addTaskCompletionListener(listener: TaskCompletionListener): this.type = {
    onCompleteCallbacks += listener
    this
  }

  /**
   * Add a listener in the form of a Scala closure to be executed on task completion.
   * This will be called in all situation - success, failure, or cancellation.
   *
   * An example use is for HadoopRDD to register a callback to close the input stream.
   */
  def addTaskCompletionListener(f: TaskContext => Unit): this.type = {
    onCompleteCallbacks += new TaskCompletionListener {
      override def onTaskCompletion(context: TaskContext): Unit = f(context)
    }
    this
  }

  /**
   * Add a callback function to be executed on task completion. An example use
   * is for HadoopRDD to register a callback to close the input stream.
   * Will be called in any situation - success, failure, or cancellation.
   * @param f Callback function.
   */
  @deprecated("use addTaskCompletionListener", "1.1.0")
  def addOnCompleteCallback(f: () => Unit) {
    onCompleteCallbacks += new TaskCompletionListener {
      override def onTaskCompletion(context: TaskContext): Unit = f()
    }
  }

  /** Marks the task as completed and triggers the listeners. */
  private[spark] def markTaskCompleted(): Unit = {
    completed = true
    // Process complete callbacks in the reverse order of registration
    onCompleteCallbacks.reverse.foreach { _.onTaskCompletion(this) }
  }

  /** Marks the task for interruption, i.e. cancellation. */
  private[spark] def markInterrupted(): Unit = {
    interrupted = true
  }
}
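The completion-listener mechanism in TaskContext is easy to try out in isolation. The following is a minimal sketch in plain Scala, assuming a stripped-down stand-in class: ToyTaskContext and ToyTaskContextDemo are hypothetical names, not Spark's classes, and the two "resources" are just strings appended to a log. It shows that listeners registered first run last, because the callbacks are processed in reverse order of registration.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for the listener mechanism; ToyTaskContext is hypothetical, not Spark's class.
final class ToyTaskContext(val stageId: Int, val partitionId: Int) {
  private val onCompleteCallbacks = new ArrayBuffer[ToyTaskContext => Unit]
  private var completed = false

  def isCompleted: Boolean = completed

  // Register a callback; returning this allows chained registration.
  def addTaskCompletionListener(f: ToyTaskContext => Unit): this.type = {
    onCompleteCallbacks += f
    this
  }

  // Mark the task as completed and run the callbacks in reverse registration order.
  def markTaskCompleted(): Unit = {
    completed = true
    onCompleteCallbacks.reverse.foreach(cb => cb(this))
  }
}

object ToyTaskContextDemo {
  // Returns the order in which two pretend resources are released.
  def closeOrder(): List[String] = {
    val log = ArrayBuffer.empty[String]
    val ctx = new ToyTaskContext(stageId = 0, partitionId = 3)
    ctx.addTaskCompletionListener(_ => log += "close input stream")
    ctx.addTaskCompletionListener(_ => log += "close decompressor")
    ctx.markTaskCompleted()
    log.toList
  }
}
```

Running the callbacks in reverse order means that a resource opened later, which may depend on one opened earlier, is released first.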
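1.4 getPreferredLocations applies to a single partition and returns a sequence. It returns a sequence because a partition backed by an HDFS block typically has 3 replicas (the default HDFS replication factor), and all 3 of those locations can serve as preferred locations. Likewise, if a computed result is cached in memory with 2 copies, 2 preferred locations are returned.
  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
1.5 The optional partitioner
  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None
Now let's look at Partitioner itself. As the code shows, a Partitioner partitions key-value data by key, dividing the whole dataset into the specified number of partitions (numPartitions).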
package org.apache.spark

import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
import scala.reflect.{ClassTag, classTag}
import scala.util.hashing.byteswap32

import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.util.{CollectionsUtils, Utils}
import org.apache.spark.util.random.{XORShiftRandom, SamplingUtils}

/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
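One simple way to satisfy this contract is to hash the key. The sketch below is plain Scala written for illustration: ToyPartitioner and ToyHashPartitioner are hypothetical names rather than Spark's classes, and sending a null key to partition 0 is an assumption of this sketch. The point it illustrates is that hashCode can be negative, so the remainder has to be shifted into the range 0 to numPartitions - 1.

```scala
object ToyPartitioners {
  // Minimal contract, mirroring the two members of the abstract class above.
  trait ToyPartitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Hash-based variant: hashCode modulo numPartitions, shifted to be non-negative.
  final class ToyHashPartitioner(val numPartitions: Int) extends ToyPartitioner {
    require(numPartitions > 0, "numPartitions must be positive")

    def getPartition(key: Any): Int = key match {
      case null => 0 // assumption of this sketch: null keys go to partition 0
      case _ =>
        val raw = key.hashCode % numPartitions // may be negative on the JVM
        if (raw < 0) raw + numPartitions else raw
    }
  }
}
```

For example, with 4 partitions the key 10 lands in partition 2, and the key -3 lands in partition 1 rather than in an invalid partition -3.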
/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
The other Partitioner that ships with Spark is RangePartitioner, shown below. For example, the sortByKey call at the end of a sorted word count uses this partitioner.
/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * Note that the actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    @transient partitions: Int,
    @transient rdd: RDD[_ <: Product2[K,V]],
    private var ascending: Boolean = true)
  extends Partitioner {
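The idea of range partitioning can be sketched in a few lines of plain Scala. This is a simplified illustration, not Spark's implementation: ToyRangePartitioning and its two functions are hypothetical names, the keys are restricted to Int, and the sample is passed in directly instead of being drawn from an RDD. Split points are picked from the sorted sample, and a key goes to the first range whose upper bound is not smaller than the key.

```scala
object ToyRangePartitioning {
  // Pick up to (partitions - 1) split points from a sample of keys.
  // Simplification: the sample is given directly instead of being drawn from a dataset.
  def rangeBounds(sample: Seq[Int], partitions: Int): Vector[Int] = {
    val sorted = sample.distinct.sorted.toVector
    if (sorted.isEmpty || partitions <= 1) Vector.empty
    else {
      val step = sorted.length.toDouble / partitions
      (1 until partitions)
        .map(i => sorted(math.min((i * step).toInt, sorted.length - 1)))
        .distinct
        .toVector
    }
  }

  // A key goes to the first range whose upper bound is >= key;
  // keys above every bound go to the last partition.
  def getPartition(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx >= 0) idx else bounds.length
  }
}
```

With the sample 1 to 9 and 3 requested partitions, this sketch picks the bounds 4 and 7, so keys up to 4 go to partition 0, keys 5 to 7 go to partition 1, and larger keys go to partition 2. With a one-element sample only one distinct bound survives, so fewer partitions than requested are produced, which is the same kind of situation the note in the comment above describes.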
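2 Common RDDs
2.1 HadoopRDD
Each HDFS block may be stored as 3 replicas. Calling sc.textFile("HDFS://xxxxxx/ssss") gives you an RDD backed by a HadoopRDD.
Partitions: one per HDFS block. Each Hadoop InputSplit is wrapped in a HadoopPartition, so the counts match.
Dependencies: none, because this RDD reads from a data source.
Function: compute essentially defines how to read the data from HDFS, using the Hadoop API.
Preferred locations: the locations of the HDFS block.
Partitioner: none.
2.2 FilteredRDD. This is a typical narrow dependency, and its characteristics are shared by narrow dependencies in general.
• Partitions: same as the parent RDD
• Dependencies: one-to-one with the parent RDD
• Function: computes each partition of the parent RDD and filters it
• Preferred locations: none of its own (same as the parent RDD)
• Partitioner: none of its own; it reuses the parent's partitioner (prev.partitioner), since filter cannot change a partition's keys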
package org.apache.spark.rdd

import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}

private[spark] class FilteredRDD[T: ClassTag](
    prev: RDD[T],
    f: T => Boolean)
  extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override val partitioner = prev.partitioner    // Since filter cannot change a partition's keys

  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).filter(f)
}
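The FilteredRDD code is remarkably concise.
One thing does seem a little odd: prev and firstParent should be the same RDD, and for now I don't understand whether there is any difference between the two here.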
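The narrow, one-to-one dependency described for FilteredRDD can be imitated with ordinary collections. The sketch below is plain Scala and only a toy: ToyNarrowDependency and its functions are hypothetical names, and a "partitioned dataset" is modelled as a Vector of Vectors. Child partition i is computed from parent partition i alone, and the filter is applied lazily through an iterator, as in the compute method above.

```scala
object ToyNarrowDependency {
  // A parent dataset already split into partitions (plain collections, not Spark).
  type Partitions[T] = Vector[Vector[T]]

  // Compute one child partition: read the same-numbered parent partition and filter it lazily.
  def computeFiltered[T](parent: Partitions[T], index: Int, f: T => Boolean): Iterator[T] =
    parent(index).iterator.filter(f)

  // Filtering every partition keeps the partition count and the one-to-one mapping.
  def filterAll[T](parent: Partitions[T], f: T => Boolean): Partitions[T] =
    parent.indices.map(i => computeFiltered(parent, i, f).toVector).toVector
}
```

Because each child partition needs only its own parent partition, no data has to move between partitions, which is what makes the dependency narrow.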
2.3 JoinedRDD