Spark Learning Notes 1

1, Knowledge points:

[1], addFile in Spark: when data needs to be distributed to each compute node, one approach is to put it on HDFS and have Spark read the file from HDFS; but if the file is small (e.g. a config file), that is unnecessary, because a file on HDFS occupies a whole block. This is where addFile comes in handy. To find out where addFile copied the file to, turn on sc.setLogLevel("INFO") or sc.setLogLevel("DEBUG");


import org.apache.spark.SparkFiles

val path = "/user/ip.txt"
sc.addFile(path)
// SparkFiles.get takes the file name that was added, not the original path
val rdd = sc.textFile(SparkFiles.get("ip.txt"))
Problem:
If the default filesystem is configured to be HDFS, the command above cannot find the local file path on each node. How should this be configured?
The approach I have come up with so far is to make sc read from the local filesystem by default, so that textFile finds the distributed copy of the file locally;

alternatively, pass an HDFS file path to textFile directly, although the blocks such a small file occupies then waste space. A sketch of reading the distributed copy inside the tasks follows below.
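A minimal sketch of an alternative that sidesteps the default-filesystem issue: read the distributed copy inside the tasks, where SparkFiles.get resolves to each executor's local copy (the records input path and the way ip.txt is used here are hypothetical placeholders):

import org.apache.spark.SparkFiles
import scala.io.Source

sc.addFile("/user/ip.txt")
val records = sc.textFile("hdfs:///logs/access.log")   // hypothetical input
val enriched = records.mapPartitions { iter =>
  // each executor reads its own local copy of the distributed file
  val ips = Source.fromFile(SparkFiles.get("ip.txt")).getLines().toSet
  iter.map(line => (line, ips.contains(line)))
}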

[2], > sc.setLogLevel("INFO")
Log levels: INFO | DEBUG | WARN
   

[3] A broadcast value is immutable and can be read by all the workers; an accumulator can be written by all the workers but only read by the driver. Consider this example: the workers add 1 to the error count whenever they encounter a bad record. If you then want a count of the records with errors, the driver just needs to check the accumulator's value, which is the total of bad records from all the workers.
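A minimal sketch of that example, assuming a Spark release where sc.longAccumulator is available (the records.txt path and the emptiness check standing in for a "bad record" are hypothetical):

val badRecords = sc.longAccumulator("badRecords")
val lookup = sc.broadcast(Map("spark" -> 1, "hbase" -> 2))   // read-only on every worker

val parsed = sc.textFile("records.txt").flatMap { line =>
  if (line.trim.isEmpty) {        // hypothetical "bad record" condition
    badRecords.add(1)             // workers only ever write to the accumulator
    None
  } else {
    Some(lookup.value.getOrElse(line, 0))
  }
}
parsed.count()                                   // force evaluation of the lazy RDD
println(s"bad records: ${badRecords.value}")     // only the driver reads the value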

[4] The groupByKey function provides an easy way to group data together by key. It is a special case of combineByKey. A common thing to do with a pair RDD is to groupByKey and then sum the results with groupByKey().map({ case (x, y) => (x, y.sum) }); this can actually be simplified to reduceByKey((x, y) => x + y),
and the benefit of using reduceByKey is that no big shuffle is needed: groupByKey() shuffles the data so that all the values for a key end up together (see the sketch below). Bear in mind that,
performance-wise, reduce is log(n), while fold is O(n).
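A minimal sketch on a small hypothetical pair RDD; both variants produce the same per-key sums:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val viaGroup  = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }  // shuffles every value
val viaReduce = pairs.reduceByKey((x, y) => x + y)                      // combines per node before the shuffle

viaGroup.collect()    // both yield (a,4) and (b,2)
viaReduce.collect()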

[5] cogroup (rdd.cogroup(otherRDD)): joins the RDDs together by key for processing.
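A minimal sketch with two hypothetical pair RDDs: for every key, cogroup puts the values from both sides next to each other.

val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
val grouped = left.cogroup(right)
grouped.collect()
// (a, (Iterable(1), Iterable(x))), (b, (Iterable(2), Iterable())), (c, (Iterable(), Iterable(y)))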

[6]

2, best practices:
- Avoid calling collect() on a large RDD whenever possible; this function gathers all the data into the memory of the driver's node and easily causes out-of-memory errors. Generally use take(N) instead, where N is your own estimate of what fits in memory (see the sketch below);
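A minimal sketch (the input path stands in for a hypothetical large dataset):

val bigRdd = sc.textFile("hdfs:///some/large/dataset")
val preview = bigRdd.take(1000)       // pulls at most 1000 elements onto the driver
// val everything = bigRdd.collect()  // risks OOM on the driver for a large RDD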

3, common scenarios:

[1] Read a CSV file: use the opencsv library;

e.g.:

import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader

val inFile = sc.textFile("/Users/ksankar//fdpsv3/data/Line_of_numbers.csv")
// Parse each line with opencsv; readNext() returns the fields as Array[String]
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
println(summedData.collect().mkString(","))
// output: 766.0

[2] Sequence files are binary flat files that consist of key-value pairs; they are a common way of storing data with Hadoop.
[scala]
> val data = sc.sequenceFile[String, Int](inputFile)
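A minimal round-trip sketch (the path counts.seq is a hypothetical placeholder); pair RDDs of Writable-convertible types such as (String, Int) can be written out and read back as sequence files:

val counts = sc.parallelize(Seq(("spark", 1), ("hbase", 2)))
counts.saveAsSequenceFile("counts.seq")                 // writes a directory of sequence files
val back = sc.sequenceFile[String, Int]("counts.seq")
back.collect()                                          // contains (spark,1) and (hbase,2)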

[3] HBase is a Hadoop-based database designed to support random read/write access to entries. How to load data from HBase:
[scala][RDD]
>
import spark._
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
....
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, input_table)
// Initialize the HBase table if necessary
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(input_table)) {
  val tableDesc = new HTableDescriptor(input_table)
  admin.createTable(tableDesc)
}
// Each record of the resulting RDD is an (ImmutableBytesWritable, Result) pair
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
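A small follow-up sketch for working with the resulting RDD (the column family "cf" and qualifier "col" are hypothetical placeholders):

import org.apache.hadoop.hbase.util.Bytes

println(hBaseRDD.count())   // number of rows read from the table
val values = hBaseRDD.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
}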

[4] An RDD can be saved in different file formats such as text files and sequence files; saving to compressed formats is also supported;
[scala]
>
rddOfStrings.saveAsTextFile("out.txt")
keyValueRdd.saveAsObjectFile("sequenceOut")

compressed format: saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec])
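A minimal sketch of the compressed variant, using Hadoop's GzipCodec (the output path is a hypothetical placeholder):

import org.apache.hadoop.io.compress.GzipCodec

rddOfStrings.saveAsTextFile("out_gz", classOf[GzipCodec])   // part files end up gzip-compressed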



4, tasks:
[1] Use bulk-loading tools like ImportTsv to import CSV data into HBase first and then run Spark against HBase;
[2]