Spark Learning Notes 1

1, Knowledge points:

[1], addFile in Spark: when data needs to be distributed to each compute node, one approach is to put it on HDFS and have Spark read the file from HDFS; but if the file is small (e.g. a config file), that is unnecessary, because a file on HDFS occupies a whole block. This is where addFile comes in handy. To find out where addFile copied the file to, turn on sc.setLogLevel("INFO") or sc.setLogLevel("DEBUG");


import org.apache.spark.SparkFiles

val path = "/user/ip.txt"
sc.addFile(path)
// SparkFiles.get takes the file name that was added, not the original path
val rdd = sc.textFile(SparkFiles.get("ip.txt"))
Problem:
If the default filesystem is configured to be HDFS, the command above cannot find the local file path on each node. How should this be configured?
The approach I have come up with so far is to make sc read from the local filesystem by default, so that textFile finds the distributed copy of the file locally;

alternatively, pass an HDFS file path to textFile directly, although the blocks such a small file occupies then waste space. A sketch of reading the distributed copy inside the tasks follows below.
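A minimal sketch of an alternative that sidesteps the default-filesystem issue: read the distributed copy inside the tasks, where SparkFiles.get resolves to each executor's local copy (the records input path and the way ip.txt is used here are hypothetical placeholders):

import org.apache.spark.SparkFiles
import scala.io.Source

sc.addFile("/user/ip.txt")
val records = sc.textFile("hdfs:///logs/access.log")   // hypothetical input
val enriched = records.mapPartitions { iter =>
  // each executor reads its own local copy of the distributed file
  val ips = Source.fromFile(SparkFiles.get("ip.txt")).getLines().toSet
  iter.map(line => (line, ips.contains(line)))
}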

[2], > sc.setLogLevel("INFO")
Log levels: INFO | DEBUG | WARN
   

[3] A broadcast value is immutable and can be read by all the workers; an accumulator can be written by all the workers but only read by the driver. Consider this example: the workers add 1 to the error count whenever they encounter a bad record. If you then want a count of the records with errors, the driver just needs to check the accumulator's value, which is the total of bad records from all the workers.
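A minimal sketch of that example, assuming a Spark release where sc.longAccumulator is available (the records.txt path and the emptiness check standing in for a "bad record" are hypothetical):

val badRecords = sc.longAccumulator("badRecords")
val lookup = sc.broadcast(Map("spark" -> 1, "hbase" -> 2))   // read-only on every worker

val parsed = sc.textFile("records.txt").flatMap { line =>
  if (line.trim.isEmpty) {        // hypothetical "bad record" condition
    badRecords.add(1)             // workers only ever write to the accumulator
    None
  } else {
    Some(lookup.value.getOrElse(line, 0))
  }
}
parsed.count()                                   // force evaluation of the lazy RDD
println(s"bad records: ${badRecords.value}")     // only the driver reads the value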

[4] The groupByKey function provides an easy way to group data together by key. It is a special case of combineByKey. A common thing to do with a pair RDD is to groupByKey and then sum the results with groupByKey().map({ case (x, y) => (x, y.sum) }); this can actually be simplified to reduceByKey((x, y) => x + y),
and the benefit of using reduceByKey is that no big shuffle is needed: groupByKey() shuffles the data so that all the values for a key end up together (see the sketch below). Bear in mind that,
performance-wise, reduce is log(n), while fold is O(n).
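A minimal sketch on a small hypothetical pair RDD; both variants produce the same per-key sums:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val viaGroup  = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }  // shuffles every value
val viaReduce = pairs.reduceByKey((x, y) => x + y)                      // combines per node before the shuffle

viaGroup.collect()    // both yield (a,4) and (b,2)
viaReduce.collect()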

[5] cogroup (rdd.cogroup(otherRDD)): joins the RDDs together by key for processing.
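A minimal sketch with two hypothetical pair RDDs: for every key, cogroup puts the values from both sides next to each other.

val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
val grouped = left.cogroup(right)
grouped.collect()
// (a, (Iterable(1), Iterable(x))), (b, (Iterable(2), Iterable())), (c, (Iterable(), Iterable(y)))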

[6]

2, best practices:
- Avoid calling collect() on a large RDD whenever possible; this function gathers all the data into the memory of the driver's node and easily causes out-of-memory errors. Generally use take(N) instead, where N is your own estimate of what fits in memory (see the sketch below);
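A minimal sketch (the input path stands in for a hypothetical large dataset):

val bigRdd = sc.textFile("hdfs:///some/large/dataset")
val preview = bigRdd.take(1000)       // pulls at most 1000 elements onto the driver
// val everything = bigRdd.collect()  // risks OOM on the driver for a large RDD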

3, common scenarios:

[1] Read a CSV file: use the opencsv library;

e.g.:

import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader

val inFile = sc.textFile("/Users/ksankar//fdpsv3/data/Line_of_numbers.csv")
// Parse each line with opencsv; readNext() returns the fields as Array[String]
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
println(summedData.collect().mkString(","))
// output: 766.0

[2] Sequence files are binary flat files that consist of key-value pairs; they are a common way of storing data with Hadoop.
[scala]
> val data = sc.sequenceFile[String, Int](inputFile)
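A minimal round-trip sketch (the path counts.seq is a hypothetical placeholder); pair RDDs of Writable-convertible types such as (String, Int) can be written out and read back as sequence files:

val counts = sc.parallelize(Seq(("spark", 1), ("hbase", 2)))
counts.saveAsSequenceFile("counts.seq")                 // writes a directory of sequence files
val back = sc.sequenceFile[String, Int]("counts.seq")
back.collect()                                          // contains (spark,1) and (hbase,2)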

[3] HBase is a Hadoop-based database designed to support random read/write access to entries. How to load data from HBase:
[scala][RDD]
>
import spark._
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
....
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, input_table)
// Initialize the HBase table if necessary
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(input_table)) {
  val tableDesc = new HTableDescriptor(input_table)
  admin.createTable(tableDesc)
}
// Each record of the resulting RDD is an (ImmutableBytesWritable, Result) pair
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
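A small follow-up sketch for working with the resulting RDD (the column family "cf" and qualifier "col" are hypothetical placeholders):

import org.apache.hadoop.hbase.util.Bytes

println(hBaseRDD.count())   // number of rows read from the table
val values = hBaseRDD.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
}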

[4] An RDD can be saved in different file formats such as text files and sequence files; saving to compressed formats is also supported;
[scala]
>
rddOfStrings.saveAsTextFile("out.txt")
keyValueRdd.saveAsObjectFile("sequenceOut")

compressed format: saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])
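A minimal sketch of the compressed variant, using Hadoop's GzipCodec (the output path is a hypothetical placeholder):

import org.apache.hadoop.io.compress.GzipCodec

rddOfStrings.saveAsTextFile("out_gz", classOf[GzipCodec])   // part files end up gzip-compressed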



4, tasks:
[1] Use bulk-loading tools like ImportTsv to import CSV data into HBase first and then run Spark against HBase;
[2]