Spark Streaming reading a data stream from Flume (pull mode)
1. Preparing the jar packages
See the official documentation: http://spark.apache.org/docs/latest/streaming-flume-integration.html
The jar versions used on the Flume side in this test are:
spark-streaming-flume-sink_2.11-2.2.0.jar
scala-library-2.11.8.jar
commons-lang3-3.5.jar
Download these jars and place them in the Flume installation's ./flume/lib/ directory.
The jar used on the Spark Streaming side is:
spark-streaming-flume-assembly_2.11-2.2.0.jar
Download it from http://search.maven.org and place it in Spark's jar dependency directory.
2. Configuring and starting Flume
Assume the Flume data source is a local log file: /tmp/log_source/src.log
Create a new config file, e.g. flume-spark.conf:
a1.channels = c1
a1.sinks = spark
a1.sources = r1

a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.spark.hostname = your_hostname
a1.sinks.spark.port = 9999
a1.sinks.spark.channel = c1

a1.sources.r1.type=exec
a1.sources.r1.channels=c1
a1.sources.r1.command=tail -F /tmp/log_source/src.log

a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/tmp/flume-spark/tmp/checkpoint
a1.channels.c1.dataDirs=/tmp/flume-spark/tmp/data
Start the agent for testing:
hadoop@1:/usr/local/flume$ bin/flume-ng agent -c conf -f conf/flume-spark.conf -n a1 -Dflume.root.logger=DEBUG,console
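Since the exec source just tails /tmp/log_source/src.log, something must be appending to that file for any data to flow. A minimal, hypothetical generator script (the path matches flume-spark.conf above; the line format, with the word "current", a date, a timestamp and a counter, mirrors the sample word-count output later in this post):

```python
import os
import time
from datetime import datetime

# Path tailed by the Flume exec source (see flume-spark.conf).
LOG_PATH = "/tmp/log_source/src.log"


def write_test_lines(n=20, path=LOG_PATH):
    """Append n space-separated test lines, e.g. 'current 20171019 11:48:59 54358'."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        for i in range(n):
            now = datetime.now()
            f.write("current %s %s %d\n"
                    % (now.strftime("%Y%m%d"), now.strftime("%H:%M:%S"), 54354 + i))
            f.flush()          # make lines visible to tail -F immediately
            time.sleep(0.01)


if __name__ == "__main__":
    write_test_lines()
```

Run it in a second terminal while the Flume agent is up, and the sink will buffer the events until Spark polls them.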
3. Receiving the data in Spark Streaming (Python)
Word-count logic, implemented in test_streaming.py:
from __future__ import print_function

import sys
import logging

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils


def flume_main():
    sc = SparkContext(appName="streaming_analysis_wordcount")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 1)
    addrs = [("your_hostname", 9999), ]
    fps = FlumeUtils.createPollingStream(ssc, addrs)
    lines = fps.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()


if __name__ == "__main__":
    flume_main()
Submit the Spark job:
hadoop@1:/data/test$ /usr/local/spark/bin/spark-submit --master yarn --deploy-mode client test_streaming.py
Once running, you should see count output such as:
-------------------------------------------
Time: 2017-10-19 11:49:00
-------------------------------------------

-------------------------------------------
Time: 2017-10-19 11:49:01
-------------------------------------------
('current', 20)
('20171019', 20)
('11:48:58', 1)
('11:48:59', 10)
('54358', 1)
('54359', 1)
('54363', 1)
('54364', 1)
('54369', 1)
('54354', 1)
...
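The flatMap/map/reduceByKey pipeline above is an ordinary per-batch word count, so its logic can be sanity-checked locally without a cluster by replaying the same transformations on a plain list of lines (pure Python; `batch_word_count` is a hypothetical helper name, not part of the PySpark API):

```python
from collections import Counter


def batch_word_count(lines):
    # Same logic as the DStream pipeline: split each line on spaces,
    # emit (word, 1) pairs, then sum the counts per word.
    pairs = [(word, 1) for line in lines for word in line.split(" ")]
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


batch = [
    "current 20171019 11:48:59 54358",
    "current 20171019 11:48:59 54359",
]
print(batch_word_count(batch))
# {'current': 2, '20171019': 2, '11:48:59': 2, '54358': 1, '54359': 1}
```

Each `pprint()` block in the streaming output is exactly this computation applied to one batch interval's worth of lines.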