Flume Resume-from-Breakpoint (断点续传): A Deep Dive

Method 1: use a complex tail command in the exec source
I found this article through a Baidu search: https://my.oschina.net/leejun2005/blog/288136

The idea is to record the line number while tail is transferring, and on the next run resume from the recorded position, something like the following (the regex 文件已截断 matches tail's localized "file truncated" message in a Chinese locale):

agent1.sources.avro-source1.command = /usr/local/bin/tail  -n +$(tail -n1 /home/storm/tmp/n) --max-unchanged-stats=600 -F  /home/storm/tmp/id.txt | awk 'ARGIND==1{i=$0;next}{i++; if($0~/文件已截断/)i=0; print i >> "/home/storm/tmp/n";print $1"---"i}' /home/storm/tmp/n -
A few things to note:
(1) When the file is rotated, the checkpoint "pointer" has to be updated in step with it.
(2) Files need to be tracked by file name.
(3) After flume dies, the checkpoint "pointer" must carry over so the transfer can resume where it left off.
(4) If the file happens to be rotated while flume is down, there is a risk of losing data; all you can do is monitor and restart it quickly, or add logic that checks the file size and resets the pointer (a rough sketch of that check follows this list).
(5) Watch your tail version; update the coreutils package to the latest.
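
To make point (4) a bit more concrete, here is a minimal, hypothetical shell sketch (the pointer and length file names are made up, and the check is deliberately simplistic): it compares the log's current line count with the count recorded at the last checkpoint, and resets the pointer to line 1 when the file appears to have been rotated or truncated.

#!/bin/bash
# Hypothetical helper: reset the line pointer when the log looks rotated/truncated.
LOG=/home/storm/tmp/id.txt        # the file being tailed
PTR=/home/storm/tmp/n             # last transferred line number (one number per line)
LEN=/home/storm/tmp/len           # log line count recorded at the last checkpoint

last_ptr=$(tail -n1 "$PTR" 2>/dev/null || echo 1)
last_len=$(cat "$LEN" 2>/dev/null || echo 0)
cur_len=$(wc -l < "$LOG")

# Fewer lines now than at the last checkpoint almost certainly means
# the log was rotated or truncated, so start over from line 1.
if [ "$cur_len" -lt "$last_len" ]; then
    last_ptr=1
fi
echo "$cur_len" > "$LEN"

# Resume tailing from the (possibly reset) line number.
exec tail -n +"$last_ptr" -F "$LOG"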


So I gave it a try:
[hadoop@h71 ~]$ cat data.txt 

Jan 18 22:25:55 192.168.101.254 s_sys@hui syslog-ng[69]: STATS: dropped 0
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:31 192.168.101.254 s_sys@hui system: ||||IPSec event unroute-client on tunnel to gz
Feb  2 06:26:01 h107 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Feb 21 10:24:32 192.168.101.254 s_sys@hui kernel: 
Feb 21 10:24:33 192.168.101.254 s_sys@hui kernel: 
[hadoop@h71 ~]$ cat n
0

[hadoop@h71 ~]$ echo "Feb 21 10:24:33 192.168.101.254 s_sys@hui kernel: " >> data.txt
[hadoop@h71 ~]$ cat n
1
2
3
4
5
6
7


[hadoop@h71 conf]$ cat hehe.sh
/home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/bin/flume-ng agent -c /home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/conf -f /home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/conf/hbase_simple.conf -n a1 -Dflume.root.logger=INFO,console
tail -n +0 -F /home/hadoop/data.txt |awk 'ARGIND==1{next}{i++;print i > "/home/hadoop/n"}' /home/hadoop/data.txt
(At first I worried that the second command would run at the same time as the first, in which case the n file would not be able to record the line count after the flume process died; fortunately, testing showed the second command only runs once the first one has exited.)


[hadoop@h71 conf]$ cat hbase_simple.conf 

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -n +$(tail -n -1 /home/hadoop/n) -F /home/hadoop/data.txt
a1.sources.r1.port = 44444
a1.sources.r1.host = 192.168.8.71
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k1.type = hbase
a1.sinks.k1.table = messages
a1.sinks.k1.columnFamily = host
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
a1.sinks.k1.channel = memoryChannel

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 conf]$ sh hehe.sh
But:
tail -n +$(tail -n -1 /home/hadoop/n) -F /home/hadoop/data.txt works fine in Linux, yet does nothing inside flume. Could the $ be the problem?...
In Linux:
[hadoop@h71 conf]$ tail -n +$(tail -n -1 /home/hadoop/n) -F /home/hadoop/data.txt
Jan 18 22:25:55 192.168.101.254 s_sys@hui syslog-ng[69]: STATS: dropped 0
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:31 192.168.101.254 s_sys@hui system: ||||IPSec event unroute-client on tunnel to gz
Feb  2 06:26:01 h107 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Feb 21 10:24:32 192.168.101.254 s_sys@hui kernel: 
Inside flume, the sink starts up fine, but the source side pulls no data...


With no better option I changed my approach: whenever the flume process dies from a failure, look at the last line of the n file, which is the line number at which flume stopped, and then manually edit hbase_simple.conf:
a1.sources.r1.command = tail -n +0 -F /home/hadoop/data.txt   (the argument after -n is the number on the last line of the n file, plus one)
For example, if the last line of n is 7, then before restarting flume change a1.sources.r1.command in hbase_simple.conf to tail -n +8 -F /home/hadoop/data.txt
Sigh, this feels like a really clumsy method, though it was a last resort...
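
If you do go down this road, the manual edit can at least be scripted. Here is a rough, untested sketch (the sed pattern is only an assumption about how the command line is written in the conf):

#!/bin/bash
# Hypothetical wrapper: compute the resume offset from the n file,
# patch the exec source command in the conf, then start the agent.
CONF=/home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/conf/hbase_simple.conf
N_FILE=/home/hadoop/n

last=$(tail -n1 "$N_FILE")   # last line number flume had processed
start=$((last + 1))          # resume from the next line

# Rewrite only the tail offset in the exec source command line.
sed -i "s|tail -n +[0-9]* -F /home/hadoop/data.txt|tail -n +${start} -F /home/hadoop/data.txt|" "$CONF"

/home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/bin/flume-ng agent \
  -c /home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/conf \
  -f "$CONF" -n a1 -Dflume.root.logger=INFO,console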


Hmm, I then realized this approach also means a1.sources.r1.command cannot cover two files: with two files the counts cannot be kept apart and simply accumulate into one number:
[hadoop@h71 ~]$ tail -n +0 -F /home/hadoop/data.txt /home/hadoop/data2.txt |awk 'ARGIND==1{next}{i++;print i > "/home/hadoop/n"}' /home/hadoop/data.txt /home/hadoop/data2.txt
[hadoop@h71 ~]$ cat n
1
2
3
4
5
6
7
8
9
10
11
12
[hadoop@h71 ~]$ tail -n +0 -F /home/hadoop/data.txt /home/hadoop/data2.txt

==> /home/hadoop/data.txt <==
Jan 18 22:25:55 192.168.101.254 s_sys@hui syslog-ng[69]: STATS: dropped 0
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:31 192.168.101.254 s_sys@hui system: ||||IPSec event unroute-client on tunnel to gz
Feb  2 06:26:01 h107 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Feb 21 10:24:32 192.168.101.254 s_sys@hui kernel: 
Feb 21 10:24:33 192.168.101.254 s_sys@hui kernel: 
Feb 21 10:24:33 192.168.101.254 s_sys@hui kernel: 

==> data2.txt <==
Feb 01 05:54:55 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 23 20:07:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 29 06:29:39 h107 rsyslogd-2007: action 'action 17' suspended, next retry is Sun Jan 29 06:30:09 2017 [try http://www.rsyslog.com/e/2007 ]
Jan 29 20:07:01 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Reference: http://www.cnblogs.com/zhzhang/p/5778836.html
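
If two files really have to be watched, one possible workaround (just a sketch, never tested through flume; the per-file pointer file names are made up) is to let awk key off the "==> file <==" headers that tail prints in multi-file mode and keep a separate counter per file:

tail -n +0 -F /home/hadoop/data.txt /home/hadoop/data2.txt | awk '
  /^==> .* <==$/ { file = $2; next }   # header printed by tail: switch the current file
  /^$/           { next }              # skip the blank separator before each header (genuinely empty log lines are skipped too)
  {
    cnt[file]++
    n = file; sub(/.*\//, "", n)                 # basename, e.g. data.txt
    print cnt[file] > ("/home/hadoop/n_" n)      # hypothetical per-file pointer file
    fflush("")
  }'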

Later I found that the command below really works nicely under Linux; it would be great if it worked in flume, but unfortunately it does not:
a1.sources.r1.command = tail -n +$(tail -n1 /home/hadoop/n) -F /home/hadoop/data.txt 2>&1 | awk 'ARGIND==1{i=$0;next}{i++;if($0~/^tail/){i=0};print $0;print i >> "/home/hadoop/n";fflush("")}' /home/hadoop/n -
Also, the n file must be set to 1 when the flume process is started for the first time; if it is set to 0, the line number read on the next start will not be the line number where it stopped.
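
So a first start would look roughly like this (just the paths used above):

echo 1 > /home/hadoop/n   # seed the pointer file; it must contain 1, not 0 (see the note above)
/home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/bin/flume-ng agent \
  -c /home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/conf \
  -f /home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/conf/hbase_simple.conf \
  -n a1 -Dflume.root.logger=INFO,console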


Later I asked in a chat group, and someone said flume may simply not understand shell this complicated: the exec source is just a Java program, not a shell terminal, and an ordinary tail -f merely feeds flume a data stream. You would have to read the exec source's code, but it presumably just grabs the data stream produced by the command.
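
That would explain the behaviour above: $(...) is shell syntax, and if the exec source does not launch the command through a shell it is never expanded. As far as I know, the exec source has a shell property that wraps the command in a shell of your choice; I have not verified it against 1.6.0-cdh5.5.2, so treat the following as an untested sketch:

a1.sources.r1.type = exec
a1.sources.r1.shell = /bin/bash -c
a1.sources.r1.command = tail -n +$(tail -n1 /home/hadoop/n) -F /home/hadoop/data.txt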


Method 2: write your own source
Reference (GitHub): https://github.com/cwtree/flume-filemonitor-source


First create the corresponding table in hbase:
hbase(main):175:0> create 'messages','cf','host'


flume configuration:
[hadoop@h71 conf]$ cat chiwei.conf 

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.chiwei.filemonitor.FileMonitorSource
a1.sources.r1.channels = c1
a1.sources.r1.file = /home/hadoop/messages
a1.sources.r1.positionDir = /home/hadoop
a1.sinks.k1.type = logger
a1.sinks.k1.type = hbase
a1.sinks.k1.table = messages
a1.sinks.k1.columnFamily = host
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
a1.sinks.k1.channel = memoryChannel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
It watches that file for changes and periodically records the position written so far, in bytes, to a position.log file. If the process dies, then on the next start it automatically resumes from the last byte position recorded in position.log and continues writing to the downstream from there.
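
The bookkeeping idea can be sketched in a few lines of shell (this only illustrates byte-offset checkpointing, not what FileMonitorSource actually does internally; the position.log format here is made up):

POS_FILE=/home/hadoop/position.log
LOG=/home/hadoop/messages

pos=$(tail -n1 "$POS_FILE" 2>/dev/null || echo 0)   # last byte offset already shipped
size=$(stat -c %s "$LOG")                           # current size of the monitored file

if [ "$size" -gt "$pos" ]; then
    # Emit only the bytes appended since the last checkpoint...
    tail -c +$((pos + 1)) "$LOG"
    # ...then advance the checkpoint to the new end of file.
    echo "$size" >> "$POS_FILE"
fi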

This approach has one drawback. When the flume process is restarted after an unexpected stop, if exactly one log line arrived in between, I can handle it on the sink side by matching the blank line; but if several lines arrived during that window, it falls apart and that data is lost. After startup, logs arriving one line at a time are fine; the trouble is when several lines are ingested in one go. I think the source's code would need to be changed to read the data line by line rather than all at once.


Reproducing the problem:

Jan 31 10:24:36 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:37 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:38 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:39 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
(several lines fed in at once: the import into hbase fails)
null
Jan 31 10:24:40 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
(one line fed in at a time: the import into hbase succeeds)
com.tcloud.flume.AccessLog@150cf6c20170131102440
Jan 31 10:24:41 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
com.tcloud.flume.AccessLog@cb6c3320170131102441
Solution:
In AsyncHbaseLogEventSerializer I added code that iterates over the array of lines instead of matching blank lines with a regex, and that finally solved the problem...

12/12/13 18:13:04 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k1 started
Jan 18 22:25:55 192.168.101.254 s_sys@hui syslog-ng[69]: STATS: dropped 0
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:31 192.168.101.254 s_sys@hui system: ||||IPSec event unroute-client on tunnel to gz
Feb  2 06:26:01 h107 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:28 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 18 22:25:55 192.168.101.254 s_sys@hui syslog-ng[69]: STATS: dropped 0
com.tcloud.flume.AccessLog@150cf6c20170118222555
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
com.tcloud.flume.AccessLog@1ee97020170131102427
Jan 31 10:24:31 192.168.101.254 s_sys@hui system: ||||IPSec event unroute-client on tunnel to gz
com.tcloud.flume.AccessLog@16fe33520170131102431
Feb  2 06:26:01 h107 rsyslogd0: action 'action 17' resumed (module 'builtin:ompipe') [try http://www.rsyslog.com/e/0 ]
com.tcloud.flume.AccessLog@adca0820170202062601
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
com.tcloud.flume.AccessLog@1ab04ce20170220062504
Jan 31 10:24:27 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
com.tcloud.flume.AccessLog@c889db20170131102427
Jan 31 10:24:28 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
com.tcloud.flume.AccessLog@1d2b35220170131102428
Jan 31 10:24:29 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
Jan 31 10:24:29 192.168.101.254 s_sys@hui kernel: External interface ADSL is down
com.tcloud.flume.AccessLog@1913b2920170131102429
The code has been uploaded to http://download.csdn.net/download/m0_37739193/10154916


Summary:

The two methods above were bold attempts from the days when older versions of flume had no resume-from-breakpoint support. Since they were individual efforts with limited skill behind them, both have real shortcomings. It was only when flume 1.7 introduced the TaildirSource component (see my other post: http://blog.csdn.net/m0_37739193/article/details/72962192) that resume-from-breakpoint was finally supported properly, and that implementation is bound to be better: it is maintained by a team with strong developers and was surely released only after repeated testing and study. The two methods above are just meant to open up ideas and serve as a reference, though I still think the thinking behind them is quite good. Finally, now that a ready-made, well-working tool exists, there is no need to build your own wheel, and the wheel you build may well not be as good as theirs...