Spark Development Notes (2017-05-04)


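Inside an operation on one RDD you cannot operate on another RDD. You want to filter dicRdd once for every value in valuesRdd, but in a distributed system each RDD is split into partitions that are shipped to JVMs on different machines, and each JVM holds a different slice of the data. From the point of view of a single JVM, there is no way to operate on another RDD as a whole from inside the slice it is processing.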

valuesRdd.foreach { i =>
  val samevalueKeys = dicRdd.filter { d => d._2.equals(i) }.map(d => d._1).collect()
}
// Error message:
// Caused by: org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
// (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
// (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
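
One common way around this restriction, sketched here on the assumption that dicRdd is small enough to fit in driver memory, is to collect it on the driver, broadcast the result, and do the lookup inside the transformation. The element types are not shown in the notes above, so the sketch assumes `valuesRdd: RDD[String]` and `dicRdd: RDD[(String, String)]`; the object name, the helper `keysByValue`, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastLookup {
  // Assumed shapes: dicRdd holds (key, value) pairs, valuesRdd holds the values to look up.
  def keysByValue(sc: SparkContext,
                  valuesRdd: RDD[String],
                  dicRdd: RDD[(String, String)]): RDD[(String, List[String])] = {
    // Runs on the driver: invert the dictionary into value -> all keys with that value.
    val valueToKeys: Map[String, List[String]] =
      dicRdd.collect().groupBy(_._2).map { case (v, pairs) => v -> pairs.map(_._1).toList }
    // Ship one read-only copy of the map to every executor.
    val bc = sc.broadcast(valueToKeys)
    // Only the broadcast value is touched inside the transformation, never another RDD.
    valuesRdd.map(v => (v, bc.value.getOrElse(v, Nil)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]"))
    // Hypothetical sample data.
    val dicRdd = sc.parallelize(Seq(("a", "1"), ("b", "2"), ("c", "1")))
    val valuesRdd = sc.parallelize(Seq("1", "2", "3"))
    keysByValue(sc, valuesRdd, dicRdd).collect().foreach(println)
    sc.stop()
  }
}
```

If the dictionary is too large to broadcast, the usual alternative is to key both RDDs by the value and use `join`, which lets Spark do the matching with a shuffle instead of a nested RDD call.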

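foreach applies an operation to every element and returns nothing.
filter tests every element against a condition, keeping the elements that satisfy it and dropping the rest.
fileRdd.foreach does not change fileRdd itself in any way.
Functional programming has the constraint that an immutable value, one defined with val, never changes once it is bound. An RDD is immutable as well, and map returns a new RDD, so the result of a map over filedata has to be assigned to a new variable: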
```
val  new_rdd = filedata.map( **************************** )
```
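
A minimal sketch of the difference between foreach and the transformations map and filter, run in local mode. The contents of `filedata` are not given in the notes, so the sample strings, the object name, and the helper `normalize` are hypothetical stand-ins.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ImmutableRddDemo {
  // map and filter each return a NEW RDD; the input RDD is left untouched.
  def normalize(filedata: RDD[String]): RDD[String] =
    filedata.map(_.trim).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("immutable-rdd").setMaster("local[*]"))
    // Hypothetical stand-in for the filedata RDD mentioned in the text.
    val filedata = sc.parallelize(Seq(" spark ", "", "hadoop"))

    // foreach returns Unit: it only runs a side effect where the data lives
    // (on a cluster the output goes to the executor logs, not the driver console).
    filedata.foreach(line => println(line))

    // The result must be bound to a new val, as the note above says.
    val newRdd = normalize(filedata)

    println(filedata.count())               // 3: the original still has all elements
    println(newRdd.collect().mkString(",")) // spark,hadoop
    sc.stop()
  }
}
```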

If looking at a file fails with "No such file or directory", check whether the file name has trailing spaces.

[myhadoop@sunlight100 ~]$ hadoop fs -du /test/spark/hyp/nulllabelurl/test/baike
du: `/test/spark/hyp/nulllabelurl/test/baike': No such file or directory
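
Trailing spaces are invisible in an ordinary listing, so one way to find them is to list the parent directory and print each name between brackets. The sketch below uses the Hadoop `FileSystem` API and assumes the Hadoop configuration (for example `core-site.xml`) is on the classpath so that `FileSystem.get` resolves to the cluster. The object name and the helper `hasEdgeWhitespace` are hypothetical, and the default parent directory is simply the one from the command above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FindPaddedNames {
  // True when a name starts or ends with whitespace.
  def hasEdgeWhitespace(name: String): Boolean = name != name.trim

  def main(args: Array[String]): Unit = {
    // Parent directory of the path that could not be found; override via the first argument.
    val parent = new Path(args.headOption.getOrElse("/test/spark/hyp/nulllabelurl/test"))
    val fs = FileSystem.get(new Configuration())
    fs.listStatus(parent).foreach { status =>
      val name = status.getPath.getName
      val flag = if (hasEdgeWhitespace(name)) "  <-- leading/trailing whitespace" else ""
      // Brackets make otherwise invisible spaces visible, e.g. [baike ].
      println(s"[$name]$flag")
    }
  }
}
```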

0 0