Using Spark Streaming with Flume

First, add the dependency to your project in IDEA:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

Install Flume on Linux: extract the Flume archive, then in the conf directory copy flume-env.sh (from its template) and set JAVA_HOME to your JDK installation directory.
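A minimal sketch of that install step, assuming a Flume 1.6.0 tarball and an install directory under /home/hadoop/apps; the archive name, paths, and JDK location are assumptions, adjust them to your environment:

# extract the Flume archive (archive name and target directory are assumptions)
tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /home/hadoop/apps/
cd /home/hadoop/apps/apache-flume-1.6.0-bin/conf
# copy the env template and point JAVA_HOME at your JDK
cp flume-env.sh.template flume-env.sh
# then edit flume-env.sh and set, for example:
# export JAVA_HOME=/usr/local/jdk1.8.0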
1. Flume pushes data to Spark Streaming (push mode)
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // local[2]: at least 2 threads are required -- one to receive the data,
    // one to process the received data
    val config = new SparkConf().setAppName("FlumePushDemo").setMaster("local[2]")
    val sc = new SparkContext(config)
    val ssc = new StreamingContext(sc, Seconds(2))
    // this is the address of the node where the Spark program is started
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.10.11", 8008)
    flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Flume configuration file:
# Launch command (run from the Flume install directory):
# bin/flume-ng agent -n a1 -c conf/ -f config/flume-push.conf -Dflume.root.logger=INFO,console
# Flume pushes data to Spark

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Source
a1.sources.r1.type = exec
# tail the monitored file on Linux
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
# avro sink bound to a host and port
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.10.11
a1.sinks.k1.port = 8008
# print events to the console
a1.sinks.k2.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sinks to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
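In push mode the Spark application must already be running and listening on 192.168.10.11:8008 before the Flume agent starts, since the avro sink pushes to that receiver. Once both are up, a quick way to test is to append a few words to the monitored file (the path comes from the config above); the word counts should print in the Spark console every 2 seconds:

# append test lines to the file tailed by the exec source
echo "hello spark hello flume" >> /home/hadoop/access.log
echo "hello spark" >> /home/hadoop/access.log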

2. Spark Streaming pulls data from Flume (pull mode; this is preferable to push mode above, because Spark pulls data at the rate it can process it)
First, copy the following three jars into Flume's lib directory (see the sketch after this list):
spark-streaming-flume-sink_2.10-1.6.1.jar
scala-library-2.10.5.jar
commons-lang3-3.3.2.jar
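A minimal sketch of the copy step, assuming the three jars have been downloaded to the current directory and Flume is installed under /home/hadoop/apps/apache-flume-1.6.0-bin (the install path is an assumption):

cp spark-streaming-flume-sink_2.10-1.6.1.jar \
   scala-library-2.10.5.jar \
   commons-lang3-3.3.2.jar \
   /home/hadoop/apps/apache-flume-1.6.0-bin/lib/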
Second, create the DStream with FlumeUtils.createPollingStream:
import java.net.InetSocketAddress

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // local[2]: at least 2 threads are required -- one to receive the data,
    // one to process the received data
    val config = new SparkConf().setAppName("FlumePullDemo").setMaster("local[2]")
    val sc = new SparkContext(config)
    val ssc = new StreamingContext(sc, Seconds(2))
    // address of the node where the Flume SparkSink is listening; more addresses can be added
    val addresses: Seq[InetSocketAddress] = Seq(new InetSocketAddress("192.168.10.11", 8008))
    val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_ONLY)
    flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Flume configuration file:
# Launch command (run from the Flume install directory):
# bin/flume-ng agent -n a1 -c conf/ -f config/flume-pull.conf -Dflume.root.logger=INFO,console
# Spark pulls data from Flume

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
# sink events into the Spark-provided sink component
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.10.11
a1.sinks.k1.port = 8008
# print events to the console
a1.sinks.k2.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sinks to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
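For pull mode, start the Flume agent first so the SparkSink is listening on 192.168.10.11:8008, then start the Spark Streaming application, which polls it:

# on the Flume node (from the Flume install directory)
bin/flume-ng agent -n a1 -c conf/ -f config/flume-pull.conf -Dflume.root.logger=INFO,console
# then launch FlumePullDemo (e.g. run it from IDEA or submit it with spark-submit)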