Demystifying Spark Execution


How Spark Runs

Your packaged program (a jar), also called the driver program, is submitted to the cluster. Based on its configuration, a Spark context is created, and from the way the RDDs are processed a directed acyclic graph (DAG) of operations is built. This graph is turned into tasks that are distributed to the worker nodes, where the actual RDD processing happens. Transformations only record the operation and do not compute anything; actions such as count() actually execute the recorded operations and return their result (a long, a list, and so on) to the driver program.
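
A small sketch of this laziness, assuming an already-created JavaSparkContext sc and a hypothetical HDFS path (Function is org.apache.spark.api.java.function.Function):

JavaRDD<String> lines = sc.textFile("hdfs://host:9000/path/input.txt"); // hypothetical path
// map() is a transformation: it is only recorded into the DAG, nothing runs yet
JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
    public Integer call(String s) throws Exception {
        return s.length();
    }
});
// count() is an action: it triggers execution on the workers and
// returns a plain long to the driver program
long numLines = lengths.count();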

Understanding the Driver Program

https://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark#

The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.

To explain a bit more on the different roles:
• The driver prepares the context and declares the operations on the data using RDD transformations and actions.

• The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.

• The workers are where the tasks are actually executed. They should have the resources and network connectivity required to execute the operations requested on the RDDs.
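
To make the roles concrete, here is a minimal driver sketch, assuming Spark's Java API; the app name and master URL are hypothetical. Everything up to the action only declares the RDD graph; the action is what gets submitted and turned into tasks for the workers:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MinimalDriver {
    public static void main(String[] args) {
        // the driver prepares the context
        SparkConf conf = new SparkConf()
                .setAppName("MinimalDriver")            // hypothetical app name
                .setMaster("spark://master-host:7077"); // hypothetical master URL
        JavaSparkContext sc = new JavaSparkContext(conf);

        // declaring transformations only builds the RDD graph
        long evens = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6))
                       .filter(x -> x % 2 == 0)
                       .count(); // the action submits the graph; tasks run on the workers

        System.out.println(evens); // 3, returned to the driver
        sc.stop();
    }
}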

See also the cluster overview docs, which describe the runtime architecture: https://spark.apache.org/docs/latest/cluster-overview.html

About Variables

An ordinary variable that is captured and modified inside an RDD closure is never sent back to the driver program: each worker merely receives its own copy initialized to the variable's value, and that copy is never returned. Only Spark's accumulators can propagate updates back to the driver.
Reference: https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.


An example:

import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

// sc is an existing JavaSparkContext
JavaRDD<String> textRdd = sc.textFile("hdfs://wy:9000/testpath/test.txt");
JavaRDD<String> words = textRdd.flatMap(new FlatMapFunction<String, String>() {
    public Iterator<String> call(String o) throws Exception {
        return Arrays.asList(o.split(" ")).iterator();
    }
});
final int[] num = {0};                                 // plain local variable captured by the closure
final LongAccumulator acc = sc.sc().longAccumulator(); // accumulator: updates flow back to the driver
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3}); // read-only shared variable
broadcastVar.value();
words.foreach(new VoidFunction<String>() {
    public void call(String o) throws Exception {
        num[0]++;    // modifies a per-executor copy only
        acc.add(1L); // propagated back to the driver
    }
});
System.out.println(num[0]);        // 0
System.out.println(acc.value());   // 1977
System.out.println(words.count()); // 1977

It prints 0, 1977, 1977. The accumulator supports shared read-write across tasks, and an action such as count() returns its result to the driver, so both of those print 1977. The plain variable num, once captured by the closure, is copied to the executors and never sent back, so the driver still prints the value it had before entering the closure: 0.
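
The example above creates broadcastVar but never really uses it. A sketch of reading a broadcast variable inside a closure, reusing sc and words from the example (the minimum-length threshold is hypothetical; Function is org.apache.spark.api.java.function.Function):

// one read-only copy is shipped per executor, not per task
final Broadcast<Integer> minLength = sc.broadcast(3);
JavaRDD<String> longWords = words.filter(new Function<String, Boolean>() {
    public Boolean call(String w) throws Exception {
        // value() reads the executor-local copy; treat it as read-only
        return w.length() >= minLength.value();
    }
});
System.out.println(longWords.count());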

About Forced Repartitioning

repartition() is simply coalesce() called with shuffle set to true. Both return a new RDD with the requested number of partitions. Without a shuffle, coalesce() going from 10 partitions down to 5 keeps 5 partitions in place and merges each of the remaining 5 into a neighboring one, avoiding a full redistribution of the data. repartition() instead shuffles everything first and then redistributes the records evenly across the new partitions.
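
A short sketch of the difference, assuming an existing JavaSparkContext sc:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> rdd = sc.parallelize(
        Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 10); // 10 partitions

// coalesce(5) == coalesce(5, false): merges 10 partitions down to 5
// locally, without a shuffle
JavaRDD<Integer> merged = rdd.coalesce(5);

// repartition(5) == coalesce(5, true): a full shuffle that
// redistributes every record across the 5 new partitions
JavaRDD<Integer> shuffled = rdd.repartition(5);

System.out.println(merged.getNumPartitions());   // 5
System.out.println(shuffled.getNumPartitions()); // 5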
