Spark Streaming 1: getting started with a wordcount that monitors a port or a local directory on Windows or Linux


Spark Streaming Programming Guide, the official guide for Spark 1.6.2:

http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html


Spark Streaming can read from local files, HDFS, network ports, Flume, Kafka, and other sources.
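For orientation, here is a minimal sketch of the input DStream constructors involved (the application name and variable names are just illustrative; the host, port, and directory match the examples below, and the Kafka/Flume sources live in separate helper modules that need their integration jars on the classpath):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SourcesOverview")
ssc = StreamingContext(sc, 10)

# TCP socket source: connects to host:port and reads newline-delimited UTF-8 text
socket_lines = ssc.socketTextStream("localhost", 9999)

# File source: processes new files that appear in the directory (local path or HDFS)
file_lines = ssc.textFileStream("file:///input/flume/source")

# Kafka and Flume sources are provided by pyspark.streaming.kafka (KafkaUtils)
# and pyspark.streaming.flume (FlumeUtils) respectively.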


Wordcount by listening on port 9999 under Linux:

1. Code implementation

When configuring the SparkContext, use 'local[2]': two threads are needed, one to receive data from the port and one to run the computation.

A computation is triggered every ten seconds (the batch interval).

#------------------------------ word count ------------------------------
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 10 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 10)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console,
# and also save each batch as text files with prefix "/input/" and suffix "txt"
wordCounts.pprint()
wordCounts.saveAsTextFiles("/input/", "txt")

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
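Save the code as a .py file (any name, for example network_wordcount.py) and submit it with spark-submit; start the nc listener from step 2 first so the socket receiver has something to connect to.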


2. Providing the data source

Open a terminal and run nc -lk 9999, then type some text; Spark Streaming will run a computation on it every 10 seconds.
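nc is usually not available on Windows. As an alternative, a small Python script (a sketch using only the standard library, not part of Spark) can play the same role: it listens on port 9999 and pushes a test line to whatever connects. Start it first, then launch the streaming job.

import socket
import time

# Minimal stand-in for "nc -lk 9999": listen on port 9999 and
# send one test line per second to the connected client.
# Unlike nc -lk, it serves a single connection and then exits.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 9999))
server.listen(1)

conn, addr = server.accept()       # wait for the Spark receiver to connect
for i in range(60):
    conn.sendall(b"hello word 666\n")
    time.sleep(1)
conn.close()
server.close()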


Wordcount by monitoring a local directory:

1. Code implementation

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 10 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 10)

# Local directory to monitor for new files
lines = ssc.textFileStream("file:///input/flume/source")

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
#wordCounts.pprint()
wordCounts.saveAsTextFiles("/input/", "txt")

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

2. Providing the data source

On Linux, adding files to the directory with the cp command works. Copying them by hand through a file manager is not picked up; most likely this is because textFileStream selects files by modification time (only files modified after the stream started are processed), and some copy methods preserve the original timestamp.

On Windows, files copied directly into the directory are likewise not picked up.

Writing the files into the directory from code, as in the script below, works on both Windows and Linux.

import time
import os

# Plain filesystem path; the file:/// prefix is only needed by Spark, not by os/open
path = '/input/flume/source'

# Clear out any old files in the monitored directory
for name in os.listdir(path):
    os.remove(path + '/' + name)

# Write a new file every 10 seconds
for i in range(5):
    time.sleep(10)
    f = open(path + '/data' + str(i) + '.txt', 'w')
    f.write('word,count\nhello,word\n666,666\n')
    print(i)
    f.close()
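The streaming job has to be running while this script executes; each file written after the stream starts should show up in the next 10-second batch.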

References:

http://blog.csdn.net/jianghuxiaojin/article/details/51452593

http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html

