Spark RDD Persistence Strategies

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 03:31

RDD Persistence

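An important feature of Spark is the ability to persist an RDD in memory. When an RDD is persisted, each node keeps in memory the partitions of that RDD that it computed, and later uses of the RDD read those cached partitions directly. When several operations run against the same RDD, the RDD is therefore computed only once instead of once per operation.
To persist an RDD, call its cache() or persist() method. The partitions are cached on each node the first time the RDD is computed. Spark's persistence is also fault-tolerant: if any partition of a persisted RDD is lost, Spark automatically recomputes it from the source RDD by reapplying the transformations that produced it.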
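The lifecycle described above can be sketched as follows. This is a minimal illustration rather than the author's test program: the object name PersistSketch, the local[*] master, and the in-memory data set are assumptions chosen so the example runs without an input file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; the name and data are illustrative only.
object PersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersistSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // In-memory data standing in for a real input file.
      val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

      // persist() only marks the RDD; nothing is computed or stored yet.
      squares.persist(StorageLevel.MEMORY_ONLY)

      // First action: computes the partitions and keeps them in memory.
      val total = squares.reduce(_ + _)
      // Second action: reads the cached partitions instead of recomputing.
      val n = squares.count()
      println(s"sum=$total count=$n")

      // Release the cached partitions once the RDD is no longer needed.
      squares.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Because persistence is lazy, the cache is filled by the first action and not by the persist() call itself, which is why timing comparisons need at least two actions on the same RDD.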

Test

import org.apache.spark.{SparkConf, SparkContext}

object Persist {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Persist").setMaster("local")
    val sc = new SparkContext(conf)
    // Remove .cache() to reproduce the run without caching.
    val lines = sc.textFile("C:\\Users\\qiang\\Desktop\\spark.txt").cache()

    // First action: reads the file (and fills the cache when cache() is used).
    val begintime = System.currentTimeMillis()
    val count = lines.count()
    println("count=" + count)
    val endtime = System.currentTimeMillis()
    println("time= " + (endtime - begintime))
    println("==========")

    // Second action on the same RDD.
    val begintime1 = System.currentTimeMillis()
    val count1 = lines.count()
    println("count1=" + count1)
    val endtime1 = System.currentTimeMillis()
    println("time1= " + (endtime1 - begintime1))

    sc.stop()
  }
}

Without cache() (times in milliseconds):

count=437474
time= 735
count1=437474
time1= 384

With cache() (times in milliseconds):

count=437474
time= 1360
count1=437474
time1= 29

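With cache(), the second count takes 29 ms instead of 384 ms because it reads the partitions from memory. The first count is slower (1360 ms versus 735 ms), presumably because it also has to store the partitions as it computes them.

Calling cache() is the same as calling persist() with the MEMORY_ONLY storage level.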
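The relationship between cache() and MEMORY_ONLY can be checked with getStorageLevel. The sketch below is an illustration under assumptions: the object name StorageLevelSketch and the small in-memory RDDs are hypothetical, and MEMORY_AND_DISK is shown only as one example of the other levels that persist() accepts.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; names and data are illustrative only.
object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StorageLevelSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // cache() takes no arguments and uses the default level.
      val cached = sc.parallelize(1 to 100).cache()
      println("cache() is MEMORY_ONLY: " +
        (cached.getStorageLevel == StorageLevel.MEMORY_ONLY))

      // persist() accepts an explicit level; this one may also use disk
      // for partitions that do not fit in memory.
      val spillable = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)
      val level = spillable.getStorageLevel
      println(s"useMemory=${level.useMemory} useDisk=${level.useDisk}")
    } finally {
      sc.stop()
    }
  }
}
```

A storage level can only be assigned once per RDD. To switch levels, call unpersist() first and then persist() with the new level.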

0 0