Spark Streaming + Flume Integration Experiment (Push-Based)

Source: Internet  Editor: 程序博客网  Date: 2024/05/02 01:05

Software environment:

flume-ng-core-1.4.0-cdh5.0.0

spark-1.2.0-bin-hadoop2.3

Workflow:

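  • Spark Streaming: uses the spark-streaming-flume_2.10-1.2.0 plugin to start an Avro source that receives the data and processes it;
  • Flume agent: the source monitors a directory on the local file system; when a file changes, the Avro sink sends the new data to the port on which Spark Streaming is listening.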

Flume configuration:

flume-lxw-conf.properties

    #--> Name of the source
    agent_lxw.sources = sources1
    #--> Name of the channel
    agent_lxw.channels = fileChannel
    #--> Name of the sink
    agent_lxw.sinks = sink1

    # Source configuration
    ## A custom source that works like tail -f and is more reliable than the exec source
    agent_lxw.sources.sources1.type = org.apache.flume.source.taildirectory.DirectoryTailSource
    agent_lxw.sources.sources1.dirs = lxwlog
    ## Directory to monitor
    agent_lxw.sources.sources1.dirs.lxwlog.path = file:///tmp/lxw-source
    # Regular expression selecting the files to monitor (Java regex syntax)
    agent_lxw.sources.sources1.dirs.lxwlog.file-pattern = ^lxw_.*log$
    agent_lxw.sources.sources1.first-line-pattern = ^(.*)$
    agent_lxw.sources.sources1.channels = fileChannel


    # Sink 1 configuration: send the data to port 44444 on slave004.lxw1234.com
    agent_lxw.sinks.sink1.type = avro
    agent_lxw.sinks.sink1.hostname = slave004.lxw1234.com
    agent_lxw.sinks.sink1.port = 44444
    agent_lxw.sinks.sink1.channel = fileChannel
    agent_lxw.sinks.sink1.batch-size = 500
    agent_lxw.sinks.sink1.connect-timeout = 40000
    agent_lxw.sinks.sink1.request-timeout = 40000

    agent_lxw.channels.fileChannel.type = file
    #--> Directory where the checkpoint files are stored
    agent_lxw.channels.fileChannel.checkpointDir = /tmp/flume/checkpoint/site
    #--> Directory where the data files are stored
    agent_lxw.channels.fileChannel.dataDirs = /tmp/flume/data/site
    #--> Maximum capacity of the channel
    agent_lxw.channels.fileChannel.capacity = 10000
    #--> Maximum number of events per transaction
    agent_lxw.channels.fileChannel.transactionCapacity = 100
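
Two settings in this configuration are easy to get wrong: the file-pattern regex, and the relation between the sink's batch-size and the channel's transactionCapacity. The following is a minimal Scala sketch for sanity-checking both before starting the agent. The object FlumeConfCheck and its methods are hypothetical helpers, not part of Flume, and the sketch assumes that the custom source applies the pattern to the bare file name as a full match.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper for checking a few settings of the configuration above.
object FlumeConfCheck {
  // The three settings under inspection, copied from flume-lxw-conf.properties.
  val sample: String =
    """agent_lxw.sources.sources1.dirs.lxwlog.file-pattern = ^lxw_.*log$
      |agent_lxw.sinks.sink1.batch-size = 500
      |agent_lxw.channels.fileChannel.transactionCapacity = 100
      |""".stripMargin

  // Parse properties-format text into a java.util.Properties object.
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // Assumption: the pattern is matched against the whole file name.
  def matchesFilePattern(props: Properties, fileName: String): Boolean =
    fileName.matches(props.getProperty("agent_lxw.sources.sources1.dirs.lxwlog.file-pattern"))

  // True when one sink batch fits into a single channel transaction.
  def batchFitsTransaction(props: Properties): Boolean = {
    val batch = props.getProperty("agent_lxw.sinks.sink1.batch-size").trim.toInt
    val txCap = props.getProperty("agent_lxw.channels.fileChannel.transactionCapacity").trim.toInt
    batch <= txCap
  }
}
```

With the values above, a file named lxw_20150101.log matches the pattern while lxw_20150101.log.1 and access.log do not. batchFitsTransaction returns false because 500 is greater than 100. A Flume channel rejects a transaction that takes more events than its transactionCapacity, so it may be worth lowering batch-size or raising transactionCapacity if the sink reports channel errors under load.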

Spark Streaming program:

Spark_Flume.scala

 

    package com.lxw.test

    import java.nio.charset.StandardCharsets

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.flume.FlumeUtils

    object Spark_Flume {
      def main(args: Array[String]): Unit = {
        if (args.length < 2) {
          println("Usage: Spark_Flume <hostname> <port>")
          System.exit(1)
        }
        // Host and port on which the receiver listens for Avro events from Flume
        val hostname = args(0)
        val port = args(1).toInt
        val sc = new SparkContext(new SparkConf().setAppName("Spark_Flume"))
        // Process the received events in 10-second batches
        val ssc = new StreamingContext(sc, Seconds(10))
        val flumeStream = FlumeUtils.createStream(ssc, hostname, port, StorageLevel.MEMORY_AND_DISK)
        // Print the headers and body of each event; print() shows only the first elements of each batch
        flumeStream.map(e => "Event: header: " + e.event.getHeaders + " body: " + new String(e.event.getBody.array, StandardCharsets.UTF_8)).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }
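
The map step above builds its output string from the event's headers and body. ByteBuffer.array returns the entire backing array, which in general can be larger than the readable region of the buffer. The sketch below shows one way to decode only the readable bytes. It is a minimal sketch, not the author's code: EventFormat and its methods are hypothetical names, and UTF-8 is an assumed encoding for the log lines.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

// Hypothetical helper for rendering a Flume event as a single line of text.
object EventFormat {
  // Decode only the bytes between position and limit, leaving the buffer untouched.
  def bodyAsString(body: ByteBuffer): String = {
    val view = body.duplicate()              // independent position and limit
    val bytes = new Array[Byte](view.remaining())
    view.get(bytes)
    new String(bytes, StandardCharsets.UTF_8) // assumption: bodies are UTF-8 text
  }

  // Render the headers as {k1=v1, k2=v2}, sorted for a stable output.
  def headersAsString(headers: java.util.Map[CharSequence, CharSequence]): String =
    headers.asScala.toSeq.map { case (k, v) => s"$k=$v" }.sorted.mkString("{", ", ", "}")

  def format(headers: java.util.Map[CharSequence, CharSequence], body: ByteBuffer): String =
    s"Event: headers=${headersAsString(headers)} body=${bodyAsString(body)}"
}
```

In the streaming job this could be called as flumeStream.map(e => EventFormat.format(e.event.getHeaders, e.event.getBody)).print(), assuming the helper is packaged in the same jar as Spark_Flume.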

Startup:

  • Start the Spark Streaming program first:
    ./spark-submit \
    --name "spark-flume" \
    --master spark://192.168.1.130:7077 \
    --executor-memory 1G \
    --class com.lxw.test.Spark_Flume \
    /home/liuxiaowen/spark-flume.jar slave004.lxw1234.com 44444
  • Then start the Flume agent:
    flume-ng agent -n agent_lxw --conf . -f flume-lxw-conf.properties

Sample results:

Append data to the file from the command line:

[Screenshot: Spark and Flume]

Flume detects the file change:

[Screenshot: Spark and Flume]

Spark Streaming receives and processes the data:

[Screenshot: Spark and Flume]

Notes:

  1. The Spark cluster is already deployed and runs in Standalone mode;
  2. On every node of the Spark cluster, spark-streaming-flume_2.10-1.2.0.jar and flume-avro-source-1.4.0-cdh5.0.0.jar must be added to SPARK_CLASSPATH;
  3. Compiling Spark_Flume.scala depends on: spark-assembly-1.2.0-hadoop2.3.0.jar, spark-streaming-flume_2.10-1.2.0.jar, flume-avro-source-1.4.0-cdh5.0.0.jar, and flume-ng-sdk-1.4.0-cdh5.0.0.jar;
  4. The hostname passed when starting Spark Streaming (slave004.lxw1234.com) must be one of the nodes in the Spark cluster, because Spark starts a NettyServer on that machine.
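
The compile-time dependencies in note 3 are given as jar files. If the project is built with sbt instead, they could be declared roughly as follows. This is only a sketch under assumptions: the project name is hypothetical, the Scala version is assumed to be a 2.10.x release matching the _2.10 suffix of the plugin, and the Flume libraries pulled in transitively would be the ones published with the Spark artifact rather than the CDH builds listed above.

```scala
// build.sbt (a sketch; the project name is hypothetical)
name := "spark-flume"

// Assumption: a 2.10.x release, so that %% resolves to the _2.10 artifacts
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Supplied by the cluster at run time, so it is not bundled into the application jar
  "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided",
  // Resolves to spark-streaming-flume_2.10 version 1.2.0 and brings in its Flume dependencies transitively
  "org.apache.spark" %% "spark-streaming-flume" % "1.2.0"
)
```

Jars resolved this way still need to be visible to the executors, either through SPARK_CLASSPATH as in note 2 or by packaging them into the application jar.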

 
