Implementing Custom Flume Functionality
The implementation flow is as follows.

Preparation:

First bring up multiple clusters and confirm that HBase and Flume work normally. Copy the jars needed for XML handling into Flume's lib directory: dom4j-1.6.1.jar (used to parse the XML file) and jaxen-1.1-beta-7.jar (provides XPath support, so content in the XML can be located much like querying with SQL). Copy the XML file to be read onto every node, and package the project with Eclipse into a jar named core.jar (watch out for character-encoding mismatches here; inconsistent encodings cause errors). For this test I started two clusters. Then place the XML configuration file under /home/hadoop on every node of both clusters. dao.xml looks like this:
[hadoop@h71 ~]$ cat dao.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <ip name="192.168.8.71">
    <duankou>4141</duankou>
    <duankou>4040</duankou>
    <biao name="messages">
      <liezus>
        <zu name="cf">
          <ziduan name="ip">3</ziduan>
          <ziduan name="host">4</ziduan>
        </zu>
        <zu name="df">
          <ziduan name="leixing">5</ziduan>
          <ziduan name="xinxi">6</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
    <biao name="hui">
      <liezus>
        <zu name="ef">
          <ziduan name="haha">3</ziduan>
          <ziduan name="hehe">4</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
  </ip>
  <ip name="192.168.8.21">
    <duankou>5151</duankou>
    <biao name="messages">
      <liezus>
        <zu name="cf">
          <ziduan name="ip">3</ziduan>
          <ziduan name="host">4</ziduan>
        </zu>
        <zu name="df">
          <ziduan name="leixing">5</ziduan>
          <ziduan name="xinxi">6</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
    <biao name="hui">
      <liezus>
        <zu name="ef">
          <ziduan name="haha">3</ziduan>
          <ziduan name="hehe">4</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
  </ip>
</configuration>

The goal of this project is for a single sink side to open ports toward multiple clusters and insert data into the corresponding HBase tables on each cluster. It also implements resuming from a saved position, automatic table creation in HBase, and generating the desired file under the file_roll-style sink.
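The author parses this file with dom4j and jaxen's XPath. Purely as an illustration of the idea, here is a minimal sketch using the JDK's built-in XPath API instead (no third-party jars needed); the class and method names are hypothetical, not the project's actual code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DaoXmlReader {
    // Return the <duankou> (port) values under the <ip> element whose
    // name attribute matches the given address.
    public static List<String> portsFor(String xml, String ip) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate(
                "/configuration/ip[@name='" + ip + "']/duankou",
                doc, XPathConstants.NODESET);
        List<String> ports = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ports.add(nodes.item(i).getTextContent());
        }
        return ports;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration><ip name=\"192.168.8.71\">"
                + "<duankou>4141</duankou><duankou>4040</duankou></ip>"
                + "<ip name=\"192.168.8.21\"><duankou>5151</duankou></ip>"
                + "</configuration>";
        System.out.println(portsFor(xml, "192.168.8.71")); // [4141, 4040]
    }
}
```

The same XPath pattern extends to the <biao>/<zu>/<ziduan> levels to pull out table names, column families, and field indices.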
Here I start the Flume process on h71 to open three ports: the first port auto-creates the corresponding tables in h71's HBase and inserts data into them; the second port, via a file_roll-style sink, writes a messages.txt file under /home/hadoop/hui on h71; the third port creates the corresponding tables in h21's HBase and inserts data there.
Before starting the h71 Flume process that feeds these three ports, first start the source agents listening on those three ports:
hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin/conf$ cat messages5.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.21
a1.sources.r1.port = 5151
# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
a1.sinks.k1.channel = memoryChannel
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
(If a table listed in the XML does not yet exist in HBase it is created; otherwise it is skipped. Neither table existed in this HBase instance, so the process prints:
Create messages SUCCESS!
Create hui SUCCESS!
12/12/13 01:08:58 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: hui.org.apache.flume.sink.hbase.HBaseSink
Table name: messages
Column family: cf)
hbase(main):013:0> list
TABLE
hui
messages
2 row(s) in 0.0220 seconds

[hadoop@h71 conf]$ cat messages5.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
a1.sinks.k1.channel = memoryChannel
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
(The corresponding tables already exist in this HBase instance, so the process prints:
messages exists!
hui exists!
17/03/18 15:36:46 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: hui.org.apache.flume.sink.hbase.HBaseSink
Table name: messages
Column family: cf)
[hadoop@h71 conf]$ cat messages6.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4040
# Describe the sink
a1.sinks.k1.type = cn.huyanping.flume.sinks.SafeRollingFileSink
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /home/hadoop/hui
a1.sinks.k1.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

[hadoop@h71 hui]$ ls    (the /home/hadoop/hui directory is empty)
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages6.conf -n a1 -Dflume.root.logger=INFO,console
[hadoop@h71 hui]$ ls
messages.txt
(After the process starts, an empty messages.txt file is created.)
Start the sink side:
[hadoop@h71 conf]$ cat messages4.conf
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.chiwei.filemonitor.FileMonitorSource
a1.sources.r1.channels = c1
a1.sources.r1.file = /home/hadoop/messages
a1.sources.r1.positionDir = /home/hadoop
# Describe the sink
a1.sinks.k1.type = hui.avrosink.AvroSink
a1.sinks.k1.batch-size = 2
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

[hadoop@h71 ~]$ cat messages    (the source log file whose data will be imported)
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
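Per dao.xml, the serializer maps whitespace-separated fields of each syslog line to HBase columns by index (cf:ip=3, cf:host=4, df:leixing=5, df:xinxi=6; apparently zero-based, judging from the scan output on h21 later). A minimal sketch of that mapping (the class name is hypothetical; the real logic lives in com.tcloud.flume.AsyncHbaseLogEventSerializer):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LogFieldMapper {
    // Split a syslog line on whitespace and pick out the fields that
    // dao.xml assigns to each column-family:qualifier pair.
    public static Map<String, String> toColumns(String line) {
        String[] f = line.split("\\s+");
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("cf:ip", f[3]);       // <ziduan name="ip">3</ziduan>
        cols.put("cf:host", f[4]);     // <ziduan name="host">4</ziduan>
        cols.put("df:leixing", f[5]);  // <ziduan name="leixing">5</ziduan>
        cols.put("df:xinxi", f[6]);    // <ziduan name="xinxi">6</ziduan>
        return cols;
    }

    public static void main(String[] args) {
        String line = "Jan 23 19:59:00 192.168.101.254 s_sys@hui "
                + "trafficlogger: empty map for 1:4097 in classnames";
        System.out.println(toColumns(line));
        // {cf:ip=192.168.101.254, cf:host=s_sys@hui, df:leixing=trafficlogger:, df:xinxi=empty}
    }
}
```

This also explains why lines with an unusual layout (for example a single-digit day, or "::" in place of the IP) land in HBase with shifted values, as seen in the scans below.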
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages4.conf -n a1 -Dflume.root.logger=INFO,console
(After this process starts, it successfully connects to all three source ports that were opened earlier, as shown below.)
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] BOUND: /192.168.8.71:4040
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] CONNECTED: /192.168.8.71:33975
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] BOUND: /192.168.8.71:4141
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] CONNECTED: /192.168.8.71:51345
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] OPEN
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] BOUND: /192.168.8.21:5151
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] CONNECTED: /192.168.8.71:50634

[hadoop@h71 ~]$ echo "Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames" >> messages
Output on h71's sink side:
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
17/03/18 15:46:46 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:46 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4141 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4040 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: 192.168.8.21, port: 5151 }

Check the corresponding HBase tables on h21:
hbase(main):014:0> scan 'messages'
ROW                  COLUMN+CELL
 2012-12-13 01:23:00 column=cf:host, timestamp=1355379762701, value=s_sys@hui
 2012-12-13 01:23:00 column=cf:ip, timestamp=1355379762628, value=192.168.101.254
 2012-12-13 01:23:00 column=df:leixing, timestamp=1355379762741, value=trafficlogger:
 2012-12-13 01:23:00 column=df:xinxi, timestamp=1355379762791, value=empty
 2012-12-13 01:23:01 column=cf:host, timestamp=1355379763516, value=s_sys@hui
 2012-12-13 01:23:01 column=cf:ip, timestamp=1355379763488, value=::
 2012-12-13 01:23:01 column=df:leixing, timestamp=1355379763544, value=trafficlogger:
 2012-12-13 01:23:01 column=df:xinxi, timestamp=1355379763573, value=empty
2 row(s) in 0.0610 seconds

hbase(main):015:0> scan 'hui'
ROW                  COLUMN+CELL
 2012-12-13 01:23:01 column=ef:haha, timestamp=1355379763422, value=19:59:02
 2012-12-13 01:23:01 column=ef:hehe, timestamp=1355379763452, value=192.168.101.254
 2012-12-13 01:23:02 column=ef:haha, timestamp=1355379763607, value=::
 2012-12-13 01:23:02 column=ef:hehe, timestamp=1355379763635, value=s_sys@hui
2 row(s) in 0.0500 seconds

Check the corresponding HBase tables on h71:
hbase(main):012:0> scan 'messages'
ROW                  COLUMN+CELL
 2017-03-18 15:46:47 column=cf:host, timestamp=1489823233223, value=192.168.101.254
 2017-03-18 15:46:47 column=cf:ip, timestamp=1489823233185, value=19:59:02
 2017-03-18 15:46:47 column=df:leixing, timestamp=1489823233263, value=s_sys@hui
 2017-03-18 15:46:47 column=df:xinxi, timestamp=1489823233297, value=trafficlogger:
 2017-03-18 15:46:48 column=cf:host, timestamp=1489823233439, value=s_sys@hui
 2017-03-18 15:46:48 column=cf:ip, timestamp=1489823233406, value=::
 2017-03-18 15:46:48 column=df:leixing, timestamp=1489823233471, value=trafficlogger:
 2017-03-18 15:46:48 column=df:xinxi, timestamp=1489823233505, value=empty
2 row(s) in 0.3660 seconds

hbase(main):013:0> scan 'hui'
ROW                  COLUMN+CELL
 2017-03-18 15:46:47 column=ef:haha, timestamp=1489823233106, value=::
 2017-03-18 15:46:47 column=ef:hehe, timestamp=1489823233145, value=s_sys@hui
 2017-03-18 15:46:48 column=ef:haha, timestamp=1489823233544, value=::
 2017-03-18 15:46:48 column=ef:hehe, timestamp=1489823233578, value=s_sys@hui
2 row(s) in 0.0160 seconds

[hadoop@h71 hui]$ cat messages.txt
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames

A position.log file is also generated under /home/hadoop/, which is what makes resuming from the last read position possible.
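The FileMonitorSource records how many bytes of the log it has consumed in that position file, so a restart continues where it left off instead of re-reading the whole file. A minimal sketch of the mechanism under that assumption (class and file names hypothetical, not the project's actual code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResumableTail {
    // Read everything appended to `log` since the offset recorded in
    // `posFile` (0 if it does not exist yet), then persist the new
    // offset so the next invocation resumes from there.
    public static String readFrom(Path log, Path posFile) throws IOException {
        long offset = Files.exists(posFile)
                ? Long.parseLong(new String(
                        Files.readAllBytes(posFile), StandardCharsets.UTF_8).trim())
                : 0L;
        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[(int) (raf.length() - offset)];
            raf.readFully(buf);
            // Persist the new offset for the next run.
            Files.write(posFile,
                    Long.toString(raf.getFilePointer()).getBytes(StandardCharsets.UTF_8));
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("messages", ".log");
        Path pos = Files.createTempFile("position", ".log");
        Files.delete(pos); // start with no saved position
        Files.write(log, "first line\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readFrom(log, pos)); // first line
        Files.write(log, "second line\n".getBytes(StandardCharsets.UTF_8),
                java.nio.file.StandardOpenOption.APPEND);
        System.out.print(readFrom(log, pos)); // second line (only the new data)
    }
}
```

A real source would additionally sync the position file transactionally with channel puts, so an event is never lost or duplicated across a crash.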
The project code has been uploaded to http://download.csdn.net/download/m0_37739193/10154814