Study Notes on the Official Spark Programming Guide (V2.1.0)

Source: Internet; Published by: linux项目开发实例; Editor: 程序博客网; Time: 2024/06/14 00:44

Original: http://spark.apache.org/docs/latest/programming-guide.html

The guide is clear and concise. A few points to watch out for in practice:

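1. Client library dependencies (Java 7 support is being phased out; it is deprecated as of Spark 2.0.0):

Spark dependency:
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.1.0

If you need HDFS, also add:
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>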

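2. Initializing the environment

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

When running locally inside an IDE, set master to "local".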

3. Common RDD operations

Transformations

Actions

4. Shuffle operations, explained by way of reduceByKey
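The shuffle behind reduceByKey can be modeled without Spark at all. The sketch below is plain Java, not the Spark API. The names ReduceByKeySketch, pair, and partitionFor are made up for illustration, and the sketch assumes keys are assigned to output partitions by hash. It shows the two steps that matter: values are first combined inside each input partition, then the partial results are routed to whichever output partition owns the key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of reduceByKey; not Spark code. All names are hypothetical.
final class ReduceByKeySketch {

    // Small helper to build one (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Hash partitioning (an assumption of this sketch): a non-negative
    // modulus of the key's hash code picks the output partition.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Each inner list stands for one input partition of (key, value) pairs.
    static List<Map<String, Integer>> reduceByKey(
            List<List<Map.Entry<String, Integer>>> inputPartitions,
            BinaryOperator<Integer> func,
            int numOutputPartitions) {

        // One result map per output partition.
        List<Map<String, Integer>> output = new ArrayList<>();
        for (int i = 0; i < numOutputPartitions; i++) {
            output.add(new HashMap<>());
        }

        for (List<Map.Entry<String, Integer>> partition : inputPartitions) {
            // Step 1: combine values locally, inside a single input partition.
            Map<String, Integer> combined = new HashMap<>();
            for (Map.Entry<String, Integer> p : partition) {
                combined.merge(p.getKey(), p.getValue(), func);
            }
            // Step 2 (the shuffle): send each partial result to the partition
            // that owns its key and merge it with partials from elsewhere.
            for (Map.Entry<String, Integer> partial : combined.entrySet()) {
                int target = partitionFor(partial.getKey(), numOutputPartitions);
                output.get(target).merge(partial.getKey(), partial.getValue(), func);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> input = new ArrayList<>();
        input.add(Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1)));
        input.add(Arrays.asList(pair("b", 1), pair("a", 1)));
        System.out.println(reduceByKey(input, Integer::sum, 2));
    }
}
```

Running main prints [{b=2}, {a=3}]: the values for "a" started out in both input partitions and end up together in one output partition. Step 2 is the part that costs network and disk I/O in a real cluster.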

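The documentation is really well written!

I also recommend an article on InfoQ, which is very good too:

http://www.infoq.com/cn/articles/spark-core-rdd (理解Spark的核心RDD, "Understanding Spark's Core: RDD"). Summary:

5. RDD, short for Resilient Distributed Datasets, is a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned. There are several common models for data processing, including Iterative Algorithms, Relational Queries, MapReduce, and Stream Processing. For example, Hadoop MapReduce uses the MapReduce model, while Storm uses the Stream Processing model.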

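6. RDD illustrated

The figure comes from Matei Zaharia's paper An Architecture for Fast and General Data Processing on Large Clusters. In the figure, each box represents an RDD and each shaded rectangle represents a partition.

Take care to understand which of the RDD operations map, groupBy, union, and join involve a shuffle.

7. Recommended paper: An Architecture for Fast and General Data Processing on Large Clusters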
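One way to see the difference is to model a dataset as a list of partitions and ask which input partitions each output partition needs. The sketch below is plain Java, not Spark code. DependencySketch and its methods are hypothetical names, and groupBy is assumed to place records by the hash of their key. join is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of which operations need a shuffle; not Spark code.
// A "dataset" is a list of partitions, and every name here is hypothetical.
final class DependencySketch {

    // No shuffle: output partition i is computed from input partition i alone.
    static List<List<Integer>> map(List<List<Integer>> parent,
                                   Function<Integer, Integer> f) {
        List<List<Integer>> child = new ArrayList<>();
        for (List<Integer> partition : parent) {
            List<Integer> mapped = new ArrayList<>();
            for (Integer x : partition) {
                mapped.add(f.apply(x));
            }
            child.add(mapped);
        }
        return child;
    }

    // No shuffle: each output partition is exactly one partition of one parent.
    static List<List<Integer>> union(List<List<Integer>> left,
                                     List<List<Integer>> right) {
        List<List<Integer>> child = new ArrayList<>(left);
        child.addAll(right);
        return child;
    }

    // Shuffle: any output partition may need records from every input
    // partition, so records have to move between partitions.
    static List<List<Integer>> groupBy(List<List<Integer>> parent,
                                       Function<Integer, Integer> keyOf,
                                       int numPartitions) {
        List<List<Integer>> child = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            child.add(new ArrayList<>());
        }
        for (List<Integer> partition : parent) {
            for (Integer x : partition) {
                int target = Math.floorMod(keyOf.apply(x).hashCode(), numPartitions);
                child.get(target).add(x);
            }
        }
        return child;
    }
}
```

For example, with partitions [1, 2] and [3, 4], map leaves every record in its original partition, while groupBy with key x % 2 over two output partitions pulls 2 and 4 into one partition and 1 and 3 into the other, drawing on both inputs each time.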

1 0