Implementing Custom Flume Functionality

This feature lets Flume read an XML configuration file so that, in avro sink mode, it can open multiple ports at once and, following the user-defined XML, process the data and load it into the corresponding HBase tables on multiple clusters.

The implementation flow is as follows:


Preparation:
First, start multiple clusters and confirm that HBase and Flume work normally. Copy the jars needed by dom4j (used to parse the XML file), dom4j-1.6.1.jar and jaxen-1.1-beta-7.jar (which provides XPath support so XML content can be located much like a SQL query), into Flume's lib directory. Copy the XML file to be read onto every node, and deploy the project packaged in Eclipse as core.jar (watch out for character-encoding mismatches here, which cause errors). For this test I started two clusters and placed the XML configuration file in the /home/hadoop directory on every node of both clusters. The contents of dao.xml:
[hadoop@h71 ~]$ cat dao.xml 

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <ip name="192.168.8.71">
    <duankou>4141</duankou>
    <duankou>4040</duankou>
    <biao name="messages">
      <liezus>
        <zu name="cf">
          <ziduan name="ip">3</ziduan>
          <ziduan name="host">4</ziduan>
        </zu>
        <zu name="df">
          <ziduan name="leixing">5</ziduan>
          <ziduan name="xinxi">6</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
    <biao name="hui">
      <liezus>
        <zu name="ef">
          <ziduan name="haha">3</ziduan>
          <ziduan name="hehe">4</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
  </ip>
  <ip name="192.168.8.21">
    <duankou>5151</duankou>
    <biao name="messages">
      <liezus>
        <zu name="cf">
          <ziduan name="ip">3</ziduan>
          <ziduan name="host">4</ziduan>
        </zu>
        <zu name="df">
          <ziduan name="leixing">5</ziduan>
          <ziduan name="xinxi">6</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
    <biao name="hui">
      <liezus>
        <zu name="ef">
          <ziduan name="haha">3</ziduan>
          <ziduan name="hehe">4</ziduan>
        </zu>
      </liezus>
      <fengefu> </fengefu>
    </biao>
  </ip>
</configuration>
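The project parses this file with dom4j plus jaxen's XPath support. As an illustration only, here is a minimal sketch of the same idea using the JDK's built-in XPath API; the class and method names (`DaoXmlSketch`, `firstPort`) are hypothetical, not the project's:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class DaoXmlSketch {

    // A trimmed-down dao.xml (one ip, one table) for illustration.
    static final String XML =
        "<configuration>"
      + "  <ip name='192.168.8.71'>"
      + "    <duankou>4141</duankou><duankou>4040</duankou>"
      + "    <biao name='messages'>"
      + "      <liezus><zu name='cf'>"
      + "        <ziduan name='ip'>3</ziduan><ziduan name='host'>4</ziduan>"
      + "      </zu></liezus>"
      + "      <fengefu> </fengefu>"
      + "    </biao>"
      + "  </ip>"
      + "</configuration>";

    // Locate the first <duankou> (port) configured for a given ip,
    // the way the project locates nodes with jaxen-backed XPath.
    public static String firstPort(String ipName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            XPath xp = XPathFactory.newInstance().newXPath();
            return (String) xp.evaluate(
                    "/configuration/ip[@name='" + ipName + "']/duankou[1]/text()",
                    doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstPort("192.168.8.71")); // prints 4141
    }
}
```

The same kind of path expression can pull out the tables (`biao`), column families (`zu`), and field indices (`ziduan`) for each cluster IP.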
The project aims to serve multiple cluster ports from a single sink side and insert data into the corresponding HBase tables of each cluster. It also implements resume from breakpoint, automatic table creation in HBase, and generation of the desired file in file_roll mode.
Here I start the Flume process on h71 to feed three ports: the first port should auto-create the corresponding tables in h71's HBase and insert data; the second, with a file_roll sink, should generate a messages.txt file under /home/hadoop/hui on h71; the third should create the corresponding tables in h21's HBase and insert data.
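Judging from the scan results later in the post, the `<ziduan>` values in dao.xml are 0-based indices into the separator-split log line (e.g. field 3 of a syslog line is the source IP), mapped onto HBase column qualifiers. A hypothetical sketch of that mapping follows; `mapFields` is an illustrative name, not the project's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldMapSketch {

    // Map column qualifiers to fields of a separator-split log line,
    // the way <ziduan name="ip">3</ziduan> entries in dao.xml appear to.
    public static Map<String, String> mapFields(String line, Map<String, Integer> spec, String sep) {
        String[] fields = line.split(sep + "+"); // collapse repeated separators
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : spec.entrySet()) {
            out.put(e.getKey(), fields[e.getValue()]);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> spec = new LinkedHashMap<>();
        spec.put("cf:ip", 3);   // <ziduan name="ip">3</ziduan>
        spec.put("cf:host", 4); // <ziduan name="host">4</ziduan>
        String line = "Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames";
        System.out.println(mapFields(line, spec, " "));
        // {cf:ip=192.168.101.254, cf:host=s_sys@hui}
    }
}
```

Collapsing repeated separators matters for syslog lines such as "Jan  3 ...", where the day is padded with an extra space.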

Before starting the Flume process on h71 that feeds the three ports, the source sides listening on those three ports must be started first:

hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin/conf$ cat messages5.conf 

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.21
a1.sources.r1.port = 5151

# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
a1.sinks.k1.channel = memoryChannel

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
(If a table from the XML does not exist in HBase it is created; if it exists, creation is skipped. This HBase instance did not have them, so the process prints:
Create messages SUCCESS!
Create hui SUCCESS!
12/12/13 01:08:58 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: hui.org.apache.flume.sink.hbase.HBaseSink
表名:messages
列族名:cf)
hbase(main):013:0> list
TABLE
hui
messages
2 row(s) in 0.0220 seconds
[hadoop@h71 conf]$ cat messages5.conf 

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
a1.sinks.k1.channel = memoryChannel

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
(The corresponding tables already exist in HBase, so the process prints:
messages exists!
hui exists!
17/03/18 15:36:46 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: hui.org.apache.flume.sink.hbase.HBaseSink
表名:messages
列族名:cf)


[hadoop@h71 conf]$ cat messages6.conf 

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4040

# Describe the sink
a1.sinks.k1.type = cn.huyanping.flume.sinks.SafeRollingFileSink
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /home/hadoop/hui
a1.sinks.k1.sink.rollInterval = 0

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 hui]$ ls    (the /home/hadoop/hui directory is empty)
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages6.conf -n a1 -Dflume.root.logger=INFO,console
[hadoop@h71 hui]$ ls
messages.txt
(after the process starts, an empty messages.txt file is created)

Start the sink side:
[hadoop@h71 conf]$ cat messages4.conf 

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = org.apache.flume.chiwei.filemonitor.FileMonitorSource
a1.sources.r1.channels = c1
a1.sources.r1.file = /home/hadoop/messages
a1.sources.r1.positionDir = /home/hadoop

# Describe the sink
a1.sinks.k1.type = hui.avrosink.AvroSink
a1.sinks.k1.batch-size = 2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 ~]$ cat messages    (the source log file whose data will be imported)
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan  3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames


[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages4.conf -n a1 -Dflume.root.logger=INFO,console
(After this process starts, connections to all three ports monitored by the previously started source sides succeed, as shown below:)

17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] BOUND: /192.168.8.71:4040
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] CONNECTED: /192.168.8.71:33975
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] BOUND: /192.168.8.71:4141
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] CONNECTED: /192.168.8.71:51345
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] OPEN
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] BOUND: /192.168.8.21:5151
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] CONNECTED: /192.168.8.71:50634
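The log shows one Avro RPC client per configured endpoint (the real sink creates Flume NettyAvroRpcClient instances). Conceptually, the custom hui.avrosink.AvroSink fans each event out to every host:port parsed from dao.xml; a hypothetical sketch, where `Endpoint` and `fanOut` are illustrative names and not the project's API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FanOutSketch {

    // Hypothetical stand-in for an Avro RPC client connection.
    public interface Endpoint {
        void send(String event);
    }

    // Deliver one event to every configured host:port.
    public static void fanOut(String event, Map<String, Endpoint> endpoints) {
        for (Endpoint ep : endpoints.values()) {
            ep.send(event);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> delivered = new LinkedHashMap<>();
        Map<String, Endpoint> eps = new LinkedHashMap<>();
        // The three endpoints from this walkthrough's dao.xml.
        for (String addr : new String[]{"192.168.8.71:4141", "192.168.8.71:4040", "192.168.8.21:5151"}) {
            delivered.put(addr, new ArrayList<>());
            eps.put(addr, ev -> delivered.get(addr).add(ev));
        }
        fanOut("Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map", eps);
        System.out.println(delivered.keySet()); // each endpoint received the event
    }
}
```

This explains why appending a single line to /home/hadoop/messages (below) shows up at all three destinations at once.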
[hadoop@h71 ~]$ echo "Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames" >> messages
Output on h71's sink side:
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan  3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
17/03/18 15:46:46 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:46 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4141 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4040 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: 192.168.8.21, port: 5151 }
Check the corresponding HBase tables on h21:
hbase(main):014:0> scan 'messages'
ROW                    COLUMN+CELL
 2012-12-13 01:23:00   column=cf:host, timestamp=1355379762701, value=s_sys@hui
 2012-12-13 01:23:00   column=cf:ip, timestamp=1355379762628, value=192.168.101.254
 2012-12-13 01:23:00   column=df:leixing, timestamp=1355379762741, value=trafficlogger:
 2012-12-13 01:23:00   column=df:xinxi, timestamp=1355379762791, value=empty
 2012-12-13 01:23:01   column=cf:host, timestamp=1355379763516, value=s_sys@hui
 2012-12-13 01:23:01   column=cf:ip, timestamp=1355379763488, value=::
 2012-12-13 01:23:01   column=df:leixing, timestamp=1355379763544, value=trafficlogger:
 2012-12-13 01:23:01   column=df:xinxi, timestamp=1355379763573, value=empty
2 row(s) in 0.0610 seconds

hbase(main):015:0> scan 'hui'
ROW                    COLUMN+CELL
 2012-12-13 01:23:01   column=ef:haha, timestamp=1355379763422, value=19:59:02
 2012-12-13 01:23:01   column=ef:hehe, timestamp=1355379763452, value=192.168.101.254
 2012-12-13 01:23:02   column=ef:haha, timestamp=1355379763607, value=::
 2012-12-13 01:23:02   column=ef:hehe, timestamp=1355379763635, value=s_sys@hui
2 row(s) in 0.0500 seconds
Check the corresponding HBase tables on h71:
hbase(main):012:0> scan 'messages'
ROW                    COLUMN+CELL
 2017-03-18 15:46:47   column=cf:host, timestamp=1489823233223, value=192.168.101.254
 2017-03-18 15:46:47   column=cf:ip, timestamp=1489823233185, value=19:59:02
 2017-03-18 15:46:47   column=df:leixing, timestamp=1489823233263, value=s_sys@hui
 2017-03-18 15:46:47   column=df:xinxi, timestamp=1489823233297, value=trafficlogger:
 2017-03-18 15:46:48   column=cf:host, timestamp=1489823233439, value=s_sys@hui
 2017-03-18 15:46:48   column=cf:ip, timestamp=1489823233406, value=::
 2017-03-18 15:46:48   column=df:leixing, timestamp=1489823233471, value=trafficlogger:
 2017-03-18 15:46:48   column=df:xinxi, timestamp=1489823233505, value=empty
2 row(s) in 0.3660 seconds

hbase(main):013:0> scan 'hui'
ROW                    COLUMN+CELL
 2017-03-18 15:46:47   column=ef:haha, timestamp=1489823233106, value=::
 2017-03-18 15:46:47   column=ef:hehe, timestamp=1489823233145, value=s_sys@hui
 2017-03-18 15:46:48   column=ef:haha, timestamp=1489823233544, value=::
 2017-03-18 15:46:48   column=ef:hehe, timestamp=1489823233578, value=s_sys@hui
2 row(s) in 0.0160 seconds
[hadoop@h71 hui]$ cat messages.txt 
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan  3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
A position.log file is also generated under /home/hadoop/ to provide the resume-from-breakpoint feature.
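The exact format of position.log is not shown in this post; as an assumption, the sketch below models it as a single byte offset that is read on startup, used to seek past already-delivered data, and rewritten after each read. The class and method names are hypothetical, not FileMonitorSource's:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PositionResumeSketch {

    // Read every line after the saved offset, then persist the new offset.
    public static List<String> readFrom(Path log, Path posFile) throws IOException {
        long pos = Files.exists(posFile)
                ? Long.parseLong(new String(Files.readAllBytes(posFile), StandardCharsets.UTF_8).trim())
                : 0L;
        List<String> out = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
            raf.seek(pos);                  // skip data already delivered
            String line;
            while ((line = raf.readLine()) != null) {
                out.add(line);
            }
            pos = raf.getFilePointer();     // remember how far we got
        }
        Files.write(posFile, Long.toString(pos).getBytes(StandardCharsets.UTF_8));
        return out;
    }

    // Self-contained demo: pretend "line1\n" (6 bytes) was already delivered.
    public static List<String> demo() {
        try {
            Path log = Files.createTempFile("messages", ".log");
            Path posFile = Files.createTempFile("position", ".log");
            Files.write(log, "line1\nline2\n".getBytes(StandardCharsets.UTF_8));
            Files.write(posFile, "6".getBytes(StandardCharsets.UTF_8));
            return readFrom(log, posFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [line2]
    }
}
```

Whatever the real format, the idea is the same: if the agent restarts, it resumes from the saved position instead of re-sending the whole log file.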


The project code has been uploaded to http://download.csdn.net/download/m0_37739193/10154814
