Spark Streaming + Flume Integration Guide
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Here we explain how to configure Flume and Spark Streaming to receive data from Flume. There are two approaches to this.
Approach 1: Flume-style Push-based Approach
Flume is designed to push data between Flume agents. In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, to which Flume can push the data. Here are the configuration steps.
General Requirements
Choose a machine in your cluster such that
- When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
- Flume can be configured to push data to a port on that machine.
Due to the push model, the streaming application needs to be up, with the receiver scheduled and listening on the chosen port, for Flume to be able to push data.
Configuring Flume
Configure the Flume agent to send data to an Avro sink by having the following in the configuration file.

    agent.sinks = avroSink
    agent.sinks.avroSink.type = avro
    agent.sinks.avroSink.channel = memoryChannel
    agent.sinks.avroSink.hostname = <chosen machine's hostname>
    agent.sinks.avroSink.port = <chosen port on the machine>
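For orientation, a minimal complete agent definition might look like the following sketch (the netcat source and the capacity value are illustrative assumptions; any real source works):

    agent.sources = netcatSrc
    agent.channels = memoryChannel
    agent.sinks = avroSink

    # Illustrative source; replace with your actual data source.
    agent.sources.netcatSrc.type = netcat
    agent.sources.netcatSrc.bind = localhost
    agent.sources.netcatSrc.port = 44444
    agent.sources.netcatSrc.channels = memoryChannel

    agent.channels.memoryChannel.type = memory
    agent.channels.memoryChannel.capacity = 10000

    agent.sinks.avroSink.type = avro
    agent.sinks.avroSink.channel = memoryChannel
    agent.sinks.avroSink.hostname = <chosen machine's hostname>
    agent.sinks.avroSink.port = <chosen port on the machine>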
See Flume's documentation for more information about configuring Flume agents.
Configuring Spark Streaming Application
Linking: In your SBT/Maven project definition, link your streaming application against the following artifact (see the Linking section in the main programming guide for further information).

    groupId = org.apache.spark
    artifactId = spark-streaming-flume_2.10
    version = 1.6.1
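If you use SBT, the same artifact can be declared as follows (the %% operator appends the Scala version suffix, here _2.10):

    libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"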
Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.

    import org.apache.spark.streaming.flume._

    val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
See the API docs and the example.
Note that the hostname should be the same as the one used by the resource manager in the cluster (Mesos, YARN or Spark Standalone), so that resource allocation can match the names and launch the receiver in the right machine.
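Putting these pieces together, a minimal push-based application might look like the sketch below (the object name, batch interval, hostname, and port are illustrative placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume._

    object FlumePushExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FlumePushExample")
        // 10-second batches; tune the interval for your workload.
        val ssc = new StreamingContext(conf, Seconds(10))

        // The receiver binds to this host/port; Flume's Avro sink must point here.
        val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 9988)

        // Each element is a SparkFlumeEvent; here we simply count events per batch.
        flumeStream.count().map(cnt => "Received " + cnt + " flume events.").print()

        ssc.start()
        ssc.awaitTermination()
      }
    }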
Deploying: As with any Spark application, spark-submit is used to launch your application. However, the details are slightly different for Scala/Java applications and Python applications.

For Scala and Java applications, if you are using SBT or Maven for project management, package spark-streaming-flume_2.10 and its dependencies into the application JAR. Make sure spark-core_2.10 and spark-streaming_2.10 are marked as provided dependencies, as those are already present in a Spark installation (see the SBT sketch below). Then use spark-submit to launch your application (see the Deploying section in the main programming guide).

For Python applications, which lack SBT/Maven project management, spark-streaming-flume_2.10 and its dependencies can be added directly to spark-submit using --packages (see the Application Submission Guide). That is,

    ./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.10:1.6.1 ...

Alternatively, you can also download the JAR of the Maven artifact spark-streaming-flume-assembly from the Maven repository and add it to spark-submit with --jars.
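For example, the provided scoping described above might look like this in an SBT build definition (a sketch; versions assume Spark 1.6.1):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
    )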
Approach 2: Pull-based Approach using a Custom Sink
Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows the following.
- Flume pushes data into the sink, and the data stays buffered.
- Spark Streaming uses a reliable Flume receiver and transactions to pull data from the sink. Transactions succeed only after data is received and replicated by Spark Streaming.
This ensures stronger reliability and fault-tolerance guarantees than the previous approach. However, this requires configuring Flume to run a custom sink. Here are the configuration steps.
General Requirements
Choose a machine that will run the custom sink in a Flume agent. The rest of the Flume pipeline is configured to send data to that agent. Machines in the Spark cluster should have access to the chosen machine running the custom sink.
Configuring Flume
Configuring Flume on the chosen machine requires the following two steps.
Sink JARs: Add the following JARs to Flume's classpath (see Flume's documentation to see how) on the machine designated to run the custom sink.
(i) Custom sink JAR: Download the JAR corresponding to the following artifact (or direct link).
    groupId = org.apache.spark
    artifactId = spark-streaming-flume-sink_2.10
    version = 1.6.1
(ii) Scala library JAR: Download the Scala library JAR for Scala 2.10.5. It can be found with the following artifact detail (or, direct link).
    groupId = org.scala-lang
    artifactId = scala-library
    version = 2.10.5
(iii) Commons Lang 3 JAR: Download the Commons Lang 3 JAR. It can be found with the following artifact detail (or, direct link).
    groupId = org.apache.commons
    artifactId = commons-lang3
    version = 3.3.2
Configuration file: On that machine, configure the Flume agent to send data to the Spark sink by having the following in the configuration file.

    agent.sinks = spark
    agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
    agent.sinks.spark.hostname = <hostname of the local machine>
    agent.sinks.spark.port = <port to listen on for connection from Spark>
    agent.sinks.spark.channel = memoryChannel
Also make sure that the upstream Flume pipeline is configured to send the data to the Flume agent running this sink.
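For instance, an upstream agent would typically forward events through an Avro sink pointed at an Avro source on the chosen machine; a minimal sketch (the agent and sink names, and the port, are illustrative assumptions):

    # Upstream agent: forward events to the agent running the Spark sink.
    upstream.sinks = fwd
    upstream.sinks.fwd.type = avro
    upstream.sinks.fwd.channel = memoryChannel
    upstream.sinks.fwd.hostname = <hostname of the chosen machine>
    upstream.sinks.fwd.port = <port of an Avro source on that machine's agent>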
See Flume's documentation for more information about configuring Flume agents.
Configuring Spark Streaming Application
Linking: In your SBT/Maven project definition, link your streaming application against the spark-streaming-flume_2.10 artifact (see the Linking section in the main programming guide).

Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.

    import org.apache.spark.streaming.flume._

    val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port])
See the Scala example FlumePollingEventCount.
Note that each input DStream can be configured to receive data from multiple sinks.
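For example, a single DStream pulling from two sink machines might use the multi-address overload, as in this sketch (hostnames, port, and storage level are illustrative; it assumes the streamingContext shown above):

    import java.net.InetSocketAddress
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.flume._

    // One address per machine running the custom sink.
    val addresses = Seq(
      new InetSocketAddress("sink-host-1", 9988),
      new InetSocketAddress("sink-host-2", 9988))

    val flumeStream = FlumeUtils.createPollingStream(
      streamingContext, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)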
Deploying: This is the same as in the first approach.