flume之kafka source
For an online business system, you sometimes need to compute statistics over large volumes of data. Writing that data directly to local files (for example via log4j) can slow the production system down. A better approach is to publish the data to a message broker (for example Kafka), and attach Flume behind the broker to handle the statistics and other downstream processing.
Below, we introduce Flume's Kafka source.
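As a sketch of this pattern: with log4j2, application logs can be routed straight to a Kafka topic through the built-in KafkaAppender instead of being written to local files. The topic name and broker address below are placeholder values:

```xml
<!-- log4j2.xml: send application logs to a Kafka topic -->
<Configuration>
  <Appenders>
    <!-- "page_visits" and the broker address are illustrative -->
    <Kafka name="KafkaAppender" topic="page_visits">
      <PatternLayout pattern="%date %level %logger %message%n"/>
      <Property name="bootstrap.servers">kafka-broker:9092</Property>
    </Kafka>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="KafkaAppender"/>
    </Root>
  </Loggers>
</Configuration>
```

This keeps heavy log traffic off the local disk; Flume then consumes the topic downstream.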
I. Theory:
# -------- KafkaSource configuration --------
# Source type
agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
# Address of the ZooKeeper ensemble used by Kafka
agent.sources.kafkaSource.zookeeperConnect = 10.45.9.139:2181
# Kafka topic to consume
agent.sources.kafkaSource.topic = my-topic
# Consumer group id
agent.sources.kafkaSource.groupId = flume
# Consumer timeout. Any other Kafka consumer option can be set the same way;
# note that consumer properties use the kafka.consumer. prefix
agent.sources.kafkaSource.kafka.consumer.timeout.ms = 100
- auto.commit.enable is set to false by the source, and every batch is committed explicitly. To improve performance, you can switch to kafka.auto.commit.enable = true instead, but this may lose data if the source goes down before committing.
- consumer.timeout.ms is set to 10 by default, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency in writing batches to the channel.
Here, Flume plays the role of a Kafka consumer. If you run multiple consumers, be sure to configure them in the same consumer group; otherwise each consumer receives every message independently and the data is processed in duplicate.
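For instance, a second Flume agent consuming the same topic would reuse the same groupId, so the two agents split the topic's partitions between them rather than each receiving every message (the agent and source names here are hypothetical):

```
agent2.sources = kafkaSource2
agent2.sources.kafkaSource2.type = org.apache.flume.source.kafka.KafkaSource
agent2.sources.kafkaSource2.zookeeperConnect = 10.45.9.139:2181
agent2.sources.kafkaSource2.topic = my-topic
# Same groupId as the first agent: Kafka balances partitions across the group
agent2.sources.kafkaSource2.groupId = flume
```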
II. Example:
1. Install and configure Flume:
1) Download Flume 1.7, unpack it, and configure the JDK path and other settings;
2) Configure Flume:
agent1.sources = logsource
agent1.channels = mc1
agent1.sinks = avro-sink
agent1.sources.logsource.channels = mc1
agent1.sinks.avro-sink.channel = mc1
# source
agent1.sources.logsource.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.logsource.zookeeperConnect = ttAlgorithm-kafka-online021-jylt.qiyi.virtual:2181,ttAlgorithm-kafka-online022-jylt.qiyi.virtual:2181,ttAlgorithm-kafka-online023-jylt.qiyi.virtual:2181,ttAlgorithm-kafka-online024-jylt.qiyi.virtual:2181,ttAlgorithm-kafka-online025-jylt.qiyi.virtual:2181
agent1.sources.logsource.topic = page_visits
agent1.sources.logsource.groupId = flume
agent1.sources.logsource.kafka.consumer.timeout.ms = 100
# channel1
agent1.channels.mc1.type = memory
agent1.channels.mc1.capacity = 1000
agent1.channels.mc1.keep-alive = 60
# sink1
agent1.sinks.avro-sink.type = file_roll
agent1.sinks.avro-sink.sink.directory = /data/mysink
3) Start Flume:
flume-ng agent -c /usr/local/apache-flume-1.7.0-bin/conf -f /usr/local/flume/conf/kafka_hdfs.conf -n agent1
4) After starting, Flume reports an error:
Copy zookeeper-3.4.5.jar into Flume's lib directory, then restart Flume.
5) After restarting Flume, the log keeps printing the following messages:
Flume 1.7's lib directory:
Flume 1.6's lib directory:
Reading the Flume source code shows that the 1.7 flume-kafka-source uses Kafka's newer rebalance features. The fix for this problem is therefore to downgrade Flume to version 1.6.
6) After switching to Flume 1.6 and reconfiguring, Flume starts up and everything works normally. (Download: http://archive.apache.org/dist/flume/1.6.0/)
2. Write a Kafka producer to simulate sending messages for Flume to receive:
1) Kafka producer code:
package cn.edu.nuc.MyTestSimple.kafka;

import java.util.Date;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class TestProducer {
    public static void main(String[] args) {
        sendStringMsg();
    }

    public static void sendStringMsg() {
        try {
            long events = 1000;
            Random rnd = new Random();
            Properties props = new Properties();
            props.put("metadata.broker.list",
                    "ttAlgorithm-kafka-online001-jyltqbs.qiyi.virtual:9092,"
                  + "ttAlgorithm-kafka-online002-jyltqbs.qiyi.virtual:9092,"
                  + "ttAlgorithm-kafka-online003-jyltqbs.qiyi.virtual:9092,"
                  + "ttAlgorithm-kafka-online004-jyltqbs.qiyi.virtual:9092,"
                  + "ttAlgorithm-kafka-online005-jyltqbs.qiyi.virtual:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("request.required.acks", "1");

            ProducerConfig config = new ProducerConfig(props);
            Producer<String, String> producer = new Producer<String, String>(config);
            for (long nEvents = 0; nEvents < events; nEvents++) {
                long runtime = new Date().getTime();
                String ip = "192.168.2." + rnd.nextInt(255);
                String msg = runtime + ",www.iqiyi.com," + ip;
                // keyed by ip; sent to the page_visits topic that Flume consumes
                KeyedMessage<String, String> data =
                        new KeyedMessage<String, String>("page_visits", ip, msg);
                producer.send(data);
            }
            producer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
2) The Flume log then reports the following error:
17/10/16 18:51:35 ERROR kafka.KafkaSource: KafkaSource EXCEPTION, {}
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: mc1}
	at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
	at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:130)
	at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelException: Put queue for MemoryTransaction of capacity 100 full, consider committing more frequently, increasing capacity or increasing thread count
	at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doPut(MemoryChannel.java:84)
	at org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
	at org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:80)
	at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:189)
	... 3 more
This error points at Flume's channel component; the next article will cover it in detail. For now, here is a quick fix:
Increase capacity and transactionCapacity.
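In the configuration above, that means raising the memory channel's limits, for example as follows (the values are illustrative; transactionCapacity must be at least as large as the source's batch size):

```
agent1.channels.mc1.type = memory
# Maximum number of events held in the channel
agent1.channels.mc1.capacity = 10000
# Maximum number of events per put/take transaction
agent1.channels.mc1.transactionCapacity = 1000
```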
3) Restart Flume (a restart is actually unnecessary, since Flume reloads its configuration file automatically), and the data from Kafka now arrives.
That covers the configuration of the flume-kafka-source and the pitfalls along the way.
References:
http://kaimingwan.com/post/flume/flumecong-kafkala-xiao-xi-chi-jiu-hua-dao-hdfs
https://my.oschina.net/u/1421929/blog/498969
http://www.cnblogs.com/niuzhifa/p/6285784.html