Notes on the Spark Streaming official documentation

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 15:30

1. Spark Streaming quick example

Note: the code is taken from the official Spark documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

Then open a terminal window to act as the data source: nc -lk 9999

From the Spark installation directory, run the real-time word count example: ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999
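To make it clearer what the example computes for each batch, here is a minimal sketch that mimics the flatMap / map / reduceByKey steps on an ordinary Scala collection, with no Spark involved. The object name BatchWordCountSketch and the function countWords are my own, chosen only for illustration; a real DStream applies the same logic to every RDD it generates, once per batch interval.

```scala
// Plain-Scala imitation of the per-batch word count (no Spark dependency).
object BatchWordCountSketch {

  // One "batch" is simply the lines that arrived during one batch interval.
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))            // split each line into words
      .map(word => (word, 1))           // pair every word with a count of 1
      .groupBy(_._1)                    // collect the pairs that share a word
      .map { case (word, pairs) =>      // sum the counts, like reduceByKey(_ + _)
        (word, pairs.map(_._2).sum)
      }

  def main(args: Array[String]): Unit = {
    // Two simulated batches; counts are NOT carried over between batches.
    val batches = Seq(
      Seq("hello world", "hello spark"),
      Seq("spark streaming")
    )
    batches.zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${countWords(batch)}")
    }
  }
}
```

Note that each batch is counted on its own. That matches the quick example, where wordCounts.print() shows the counts for the current batch only rather than a running total.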

2. DStream data sources

1) TCP socket

See the example above.

File data sources can also be read through the StreamingContext API: streamingContext.textFileStream(dataDirectory)
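As a rough sketch of how textFileStream fits into a full program, the word count can be pointed at a directory instead of a socket. The object name FileWordCountSketch, the default directory /tmp/streaming-input, and the 5-second batch interval are assumptions of mine, not values from the documentation. It needs the Spark Streaming dependency on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileWordCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical default directory; pass a real one as the first argument.
    val dataDirectory = args.headOption.getOrElse("/tmp/streaming-input")

    val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Monitor the directory; text files that appear in it are read as lines.
    val lines = ssc.textFileStream(dataDirectory)

    // Same per-batch word count as in the socket example.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
```

Only files that show up in the directory after the job has started are picked up, so for a quick test it is easiest to write a file somewhere else and then move it into the monitored directory.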

2) Advanced Sources

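Data can also be consumed from Kafka, Flume, and Kinesis (I have never actually used Kinesis at work); this is also the typical Spark Streaming real-time processing pipeline.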
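For Kafka, a minimal sketch might look like the following. It assumes the separate spark-streaming-kafka-0-10 artifact is on the classpath and a broker is reachable at localhost:9092; the topic name notes-example-topic and the group id notes-example-group are hypothetical placeholders. Treat it as an outline under those assumptions rather than a tuned consumer.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Consumer settings; broker address and group id are assumptions.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "notes-example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("notes-example-topic") // hypothetical topic name

    // Direct stream: each element is a Kafka ConsumerRecord[String, String].
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // Word count over the message values of each batch.
    val wordCounts = stream
      .map(record => record.value())
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```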

3) Custom Sources

Data sources can be customized to fit the business scenario.
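In Spark Streaming a custom source is written by extending the Receiver class and implementing onStart and onStop. The sketch below reads lines from a TCP socket, loosely following the pattern in the official custom receiver guide; the class name LineReceiver is my own, and the error handling is deliberately simple.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.{ConnectException, Socket}
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that turns each line read from host:port into a record.
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Called when the receiver starts; must not block, so read on a new thread.
  def onStart(): Unit = {
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up here: the reading thread checks isStopped() and exits.
  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)               // hand the record over to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // The stream ended; ask Spark to restart the receiver and reconnect.
      restart("Trying to connect again")
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
```

Such a receiver would then be plugged in with ssc.receiverStream(new LineReceiver("localhost", 9999)), which returns a DStream[String] that can be processed like the lines stream in the first example.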

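My earlier work touched on Spark only superficially, and since I have not used it much at work recently, I am relearning it in my spare time. Let's keep each other motivated!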
