Spark Streaming 1: Getting started — word count by monitoring a port or a local directory on Windows or Linux
Spark Streaming Programming Guide 1.6.2 (official guide):
http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html
Spark Streaming can read from local files, HDFS, network ports, Flume, Kafka, and other sources.
Word count by listening on port 9999 under Linux:
1. Implementation
When configuring the SparkContext, use 'local[2]': Spark Streaming needs at least two threads, one to receive data from the port and one to run the computation.
With a batch interval of 10 seconds, a computation runs every ten seconds.
```python
# ------------------------------ word count ------------------------------
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 10 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 10)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
wordCounts.saveAsTextFiles("/input/", "txt")

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
```
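The DStream transformations above are ordinary RDD operations applied to each batch. As a minimal plain-Python sketch (no Spark required; `batch_word_count` is a hypothetical helper, not Spark API), here is what one 10-second batch computes — flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums the 1s per key:

```python
from collections import Counter

def batch_word_count(lines):
    """Mimic flatMap -> map -> reduceByKey for one batch of text lines."""
    words = [w for line in lines for w in line.split(" ")]   # flatMap
    pairs = [(w, 1) for w in words]                          # map
    counts = Counter()                                       # reduceByKey: sum per key
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(batch_word_count(["hello spark", "hello streaming"]))
# → {'hello': 2, 'spark': 1, 'streaming': 1}
```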
2. Feeding in data
Open a terminal and run `nc -lk 9999`, then type some text; Spark Streaming will run a computation every 10 seconds.
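If `nc` is not available (on Windows, for example), a small Python script can stand in for it. This is a sketch, not part of the original post: it listens on port 9999 and sends one line per interval to whichever client connects — the Spark job is the client here, since `socketTextStream` connects out to this address.

```python
import socket
import time

def serve_lines(lines, host="localhost", port=9999, interval=10.0):
    """Listen on host:port, wait for one client (the Spark job),
    then send it one text line every `interval` seconds."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    conn, _addr = server.accept()     # blocks until socketTextStream connects
    try:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
            time.sleep(interval)
    finally:
        conn.close()
        server.close()

# serve_lines(["hello spark", "hello streaming"])  # start before the Spark job
```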
Word count by monitoring a local directory:
1. Implementation

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 10 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 10)

# Local directory to monitor for new files
lines = ssc.textFileStream("file:///input/flume/source")

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
# wordCounts.pprint()
wordCounts.saveAsTextFiles("/input/", "txt")

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
```
2. Feeding in data
Under Linux, add files to the directory with the `cp` command. Copying by hand (through a file manager) often goes undetected: `textFileStream` only picks up files that appear in the directory with a fresh modification time, so a copy that creates the file first and fills it in afterwards can be missed.
Under Windows, plain copy-and-paste is likewise not detected.
Writing files into the directory from code works on both Windows and Linux:
```python
import time
import os

path = '/input/flume/source'   # plain filesystem path; os functions cannot take file:// URIs

# Clear out any old files
for name in os.listdir(path):
    os.remove(os.path.join(path, name))

# Write a new file every 10 seconds
for i in range(5):
    time.sleep(10)
    with open(os.path.join(path, 'data%d.txt' % i), 'w') as f:
        f.write('word,count\nhello,word\n666,666')
    print(i)
```
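A likely fix for the undetected manual copies above: make the file appear in the monitored directory in a single atomic step. A common pattern (a sketch, assuming the same local `/input/flume/source` directory; `publish_atomically` is a hypothetical helper) is to write the file somewhere else first and then move it in with one `os.rename`:

```python
import os

def publish_atomically(text, filename, watched_dir="/input/flume/source",
                       tmp_dir="/tmp"):
    """Write the file outside the monitored directory, then rename it in,
    so the stream sees a complete file appear in one atomic step."""
    tmp_path = os.path.join(tmp_dir, filename)
    final_path = os.path.join(watched_dir, filename)
    with open(tmp_path, "w") as f:
        f.write(text)
    os.rename(tmp_path, final_path)   # atomic only on the same filesystem
    return final_path

# publish_atomically("word,count\nhello,word\n", "data0.txt")
```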
References:
http://blog.csdn.net/jianghuxiaojin/article/details/51452593
http://spark.apache.org/docs/1.6.2/streaming-programming-guide.html