Spark Study Notes: A First Look at Spark

Source: Internet | Published by: 马士兵java | Editor: 程序博客网 | Time: 2024/05/16 10:05

Download and unpack Spark, then set the environment variables.
Start the Spark shell with:

spark-shell --master local[2]

local[N] runs Spark locally with N worker threads.
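To confirm which master the shell is actually using, the SparkContext can be asked directly. This is a minimal sketch, assuming it is typed into spark-shell, where `sc` is predefined:

```scala
// Run inside spark-shell, where `sc` (the SparkContext) is already defined.
println(sc.master)              // the master URL the shell was started with
println(sc.defaultParallelism)  // in local[N] mode this is normally N
```

If the first line does not print the local[N] value that was intended, the master option was probably not picked up.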

Reducing the startup log output

Edit the ./conf/log4j.properties file (if it does not exist yet, copy it from ./conf/log4j.properties.template):

log4j.rootCategory=WARN, console
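The log level can also be changed from inside a running shell, without editing any file. This is a sketch that assumes a Spark version providing `SparkContext.setLogLevel` (1.4 or later, as far as I know) and that `sc` is the shell's predefined context:

```scala
// Run inside spark-shell. Affects only the current session;
// the setting in log4j.properties still applies at startup.
sc.setLogLevel("WARN")
```

The file-based setting is the better choice when the quieter output should apply to every session.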

Simple examples

val file=sc.textFile("/home/daya/test.txt")

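The home-directory shortcut "~" cannot be used here: it is expanded by the command-line shell, not by Spark, so the full path must be given.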
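One way to avoid hard-coding the home directory is to read it from the JVM's `user.home` system property. This is a small sketch, assuming it runs in spark-shell and that a file named test.txt exists in the current user's home directory:

```scala
// Run inside spark-shell, where `sc` is already defined.
val home = sys.props("user.home")          // e.g. /home/daya, resolved by the JVM
val file = sc.textFile(s"$home/test.txt")  // absolute path, no "~" involved
```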

file.count

count():Return the number of elements in the dataset

file.take(3)

take(n):Return an array with the first n elements of the dataset

file.filter(l=>l.contains("ok")).count

filter(func):Return a new dataset formed by selecting those elements of the source on which func returns true
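The three operations above can be tried without any file by building a small dataset in memory. This is a sketch for spark-shell; the sample lines are an assumption chosen for illustration, and `sc.parallelize` stands in for `sc.textFile`:

```scala
// Run inside spark-shell, where `sc` is already defined.
// Sample lines, assumed for illustration; they stand in for a text file.
val sample = sc.parallelize(Seq(
  "hadoop yarn mapreduce hadoop hello",
  "how are you",
  "yeah",
  "ok ok i'm fine"
))

val total   = sample.count()   // 4: one element per line
val first3  = sample.take(3)   // Array holding the first three lines
val okLines = sample.filter(l => l.contains("ok")).count()  // 1: only the last line contains "ok"
```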

Now a slightly more involved example. Since I have only just started learning Spark, I will explain it in more detail.

file.map(l=>l.split(" ").size).reduce((a,b)=>Math.max(a,b))

map(func):Return a new distributed dataset formed by passing each element of the source through a function func
reduce(func):Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel
size returns the number of elements in the array produced by split.
The file contents are as follows:

hadoop yarn mapreduce hadoop hello
how are you
yeah
ok ok i'm fine

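This statement first splits each line of the file on spaces and maps the line to its word count, so map produces a new dataset with 4 elements, one per line of the original file, each holding that line's word count. reduce then compares these values pairwise and returns the largest one, that is, the word count of the longest line.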
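The intermediate result of the statement can be made visible by splitting it into two steps. This is a sketch for spark-shell; it assumes the file consists of the four lines shown in the code, held in memory here so that no file is needed:

```scala
// Run inside spark-shell, where `sc` is already defined.
// The four lines are assumed to match the sample file's contents.
val lines = sc.parallelize(Seq(
  "hadoop yarn mapreduce hadoop hello",
  "how are you",
  "yeah",
  "ok ok i'm fine"
))

// Step 1: map each line to its number of space-separated words.
val wordCounts = lines.map(l => l.split(" ").size)
println(wordCounts.collect().mkString(", "))   // 5, 3, 1, 4

// Step 2: reduce keeps the larger of each pair until one value remains.
val longest = wordCounts.reduce((a, b) => Math.max(a, b))
println(longest)                               // 5
```

Math.max is commutative and associative, which is what reduce requires for a correct parallel result.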

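Spark transformations are lazy: until a result is actually needed, nothing is executed and Spark only builds an execution plan. For example, take the following operation on a file that does not exist: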

var file=sc.textFile("/home/daya/123.text")

No error is reported at this point, but an error occurs as soon as an action is run on it:

file.foreach(println)
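The same behaviour can be observed without crashing the session by wrapping the action in `scala.util.Try`. This is a sketch for spark-shell; the path is hypothetical and is assumed not to exist on the machine:

```scala
import scala.util.Try

// Run inside spark-shell, where `sc` is already defined.
// Hypothetical path, assumed not to exist.
val missing = sc.textFile("/tmp/no-such-file-for-lazy-demo.txt")  // returns at once, nothing is read

// A transformation is lazy too: this only extends the plan.
val lengths = missing.map(_.length)

// count() is an action, so Spark now tries to read the input and fails.
val result = Try(lengths.count())
println(result.isFailure)
```

Only the last step touches the file system, which is why the error appears there and not where the path was given.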

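Spark provides a web UI for the running application. By default it is served on port 4040 of the machine running the driver, so here the URL is master:4040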

The jobs that have been run are listed there.
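If port 4040 is already taken, Spark binds the UI to another port, so it can help to ask the context for the actual address. This is a sketch that assumes a Spark version in which `SparkContext.uiWebUrl` is available (2.x or later, as far as I know):

```scala
// Run inside spark-shell. uiWebUrl is an Option: it is None when the UI is disabled.
sc.uiWebUrl match {
  case Some(url) => println(s"Web UI: $url")
  case None      => println("Web UI is disabled")
}
```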
