flume


Flume:
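   ** Flume is a highly available, highly reliable, distributed system for collecting, transporting, and aggregating massive volumes of log data. It was originally developed by Cloudera and is now an Apache project.
   ** Flume is normally deployed on Linux, which is the environment these notes assume
   ** flume.apache.org (Documentation--Flume User Guide)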

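Flume Architecture: (see figure)
Source:  collects data. The Source is where the data flow originates, and it passes the events it produces on to the Channel
Channel: the data transfer pipe that connects the source and the sink
Sink:    takes data from the Channel and writes it to the destination, which can be the Source of the next agent, HDFS, or HBase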
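
The Source -> Channel -> Sink flow can be pictured with a toy model. This is only an illustrative sketch in Python, not Flume code: the Channel class, run_pipeline, and the event dictionary layout are all made-up names, and a real agent runs the source and sink concurrently inside transactions.

```python
from collections import deque


class Channel:
    """Bounded in-memory buffer standing in for a Flume channel (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def is_full(self):
        return len(self.events) >= self.capacity

    def put(self, event):
        if self.is_full():
            raise OverflowError("channel is full")
        self.events.append(event)

    def take(self):
        # Returns None when there is nothing left to deliver.
        return self.events.popleft() if self.events else None


def run_pipeline(lines, capacity=1000):
    """Push lines through source -> channel -> sink and return what the sink wrote."""
    channel = Channel(capacity)
    destination = []  # stands in for HDFS, HBase, or the next agent's source

    def drain():
        # Sink side: take events off the channel and write them out.
        event = channel.take()
        while event is not None:
            destination.append(event["body"])
            event = channel.take()

    # Source side: wrap each line in an event and hand it to the channel.
    for line in lines:
        if channel.is_full():
            drain()
        channel.put({"body": line})
    drain()
    return destination
```

Calling run_pipeline(["GET /", "GET /index.html"], capacity=1) returns the two lines in their original order, which shows that the channel only buffers events and does not reorder them.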

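Kinds of data:
   ** logs produced by a system (usually a web application), which flume can capture in real time
   ** data produced by custom methods or commands in the system, which flume can also capture in real time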

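Ways to collect the data
   ** Traditional approach:
       script + commands ==> [periodically] upload the data to HDFS, then analyze it
       e.g.: load data local inpath ...
       Drawbacks: tedious, inefficient, easy to miss data
   ** Using the flume framework
       Advantages: simple, efficient, captures data in real time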

----Installing flume-----------------------------

1. Unpack (installing into the cdh directory is recommended)
tar zxf /opt/softwares/flume-ng-1.5.0-cdh5.3.6.tar.gz
2. Rename the template, then edit flume-env.sh
$ mv flume-env.sh.template flume-env.sh
export JAVA_HOME=/opt/modules/jdk1.7.0_67

3. Use the flume-ng command
$ bin/flume-ng
--conf        the configuration directory
--name        the name of the Agent
--conf-file   the specific configuration file
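
When an agent is launched from a script, the three options can be assembled into an argument list. The sketch below is a hypothetical Python helper (flume_ng_command is not part of Flume); it only builds the list and does not run anything. The "agent" subcommand and the -Dflume.root.logger setting are the ones used in the examples in these notes.

```python
def flume_ng_command(conf_dir, agent_name, conf_file, console_log=False):
    """Build the argument list for starting one agent (hypothetical helper)."""
    cmd = [
        "bin/flume-ng", "agent",
        "--conf", conf_dir,          # configuration directory
        "--name", agent_name,        # must match the agent name used in the config file
        "--conf-file", conf_file,    # the specific configuration file
    ]
    if console_log:
        # Send INFO-level log output to the console.
        cmd.append("-Dflume.root.logger=INFO,console")
    return cmd
```

The returned list can be handed to subprocess.run from the flume installation directory.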

====Example 1===========================================================

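Goal: have flume listen on a port and write whatever arrives on that port to the logger

1. Copy the template
[lxl@lxl01 apache-flume-1.5.0-cdh5.3.6-bin]$ cd conf/
$ cp -a flume-conf.properties.template flume-telnet.conf

2. Edit flume-telnet.conf
Delete everything below line 24, then paste in the following
# Name the components on this agent
# a1 is the agent instance name and can be anything; an agent has three parts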
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# the netcat source is built into flume: it listens on the given port and turns each line of text it receives into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
# See the "Flume Sinks--Logger Sink" section of the documentation
# Writes each event to flume's log at INFO level
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# in-memory channel
a1.channels.c1.type = memory
# maximum number of events held in the channel
a1.channels.c1.capacity = 1000
# maximum number of events per transaction
a1.channels.c1.transactionCapacity = 100

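# Bind the source and sink to the channel
# Note: the source property is 'channels' (with an 's'); the sink property is the singular 'channel'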
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

*** How the configuration file is put together:
a) name the components
b) configure the source, sink, and channel
c) bind them together
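
The 'channels' versus 'channel' distinction is easy to get wrong, so a small check over the binding lines can help. The following is a rough Python sketch, not a Flume tool: parse_properties and check_bindings are hypothetical helpers, and the parser only understands simple "key = value" lines and '#' comments.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_bindings(props, agent):
    """Return a list of problems with the source/sink to channel bindings."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        # Sources use the plural key 'channels'.
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound or not set(bound) <= channels:
            problems.append(f"source {source} is not bound to a declared channel")
    for sink in props.get(f"{agent}.sinks", "").split():
        # Sinks use the singular key 'channel'.
        bound = props.get(f"{agent}.sinks.{sink}.channel", "")
        if bound not in channels:
            problems.append(f"sink {sink} is not bound to a declared channel")
    return problems


EXAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""
```

check_bindings(parse_properties(EXAMPLE), "a1") returns an empty list; writing a1.sinks.k1.channels instead of a1.sinks.k1.channel makes it report the sink as unbound.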

---------------------

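Test:
*** Install telnet
$ su -
# yum -y install telnet

*** Start flume; '-D' sets the log level and the log output target
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-telnet.conf -Dflume.root.logger=INFO,console

*** Open another terminal window
$ netstat -an|grep 44444   --check whether a program (flume) is listening on port 44444
$ telnet localhost 44444   --connect to local port 44444; telnet is the client that talks to this port
Then type any strings...

PS:
a) To exit telnet: press 'ctrl+]', then type quit.
b) If flume-ng will not exit, open a new window, find the pid with jps (or netstat -antp|grep 44444), then run kill -9 <pid>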
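
The telnet step can also be scripted. Below is a minimal Python sketch of a client that sends lines to a TCP port and collects whatever comes back; send_lines is a hypothetical helper, and it assumes the listener answers every line with a short reply (the netcat source acknowledges each event with "OK" by default).

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each line to a TCP listener and return the raw bytes it sent back."""
    replies = b""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")
            # Assumes one short reply per line; a silent listener would time out here.
            replies += conn.recv(1024)
    return replies
```

With the agent from this example running, send_lines("localhost", 44444, ["hello flume"]) should make the line show up in the agent's console log.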

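====Example 2===================================================================

** This is a setup commonly used in production for monitoring logs

Goal: pull newly written log content in real time --> append it to the end of the corresponding file on HDFS
      In this example flume watches one file and ships the newly added content somewhere else, such as HDFS
      The file being watched is the apache log file /var/log/httpd/access_log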

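----Installing the Apache server-------

$ su -
# yum -y install httpd
# service httpd start
# service httpd status
** Edit the home page; /var/www/html is the Apache web server's document root
# vi /var/www/html/index.html
Type any content...
** Open a browser and visit http://192.168.122.128

** Grant access
# chmod 755 /var/log/httpd/

** Watch the log change live; refreshing the page produces new log entries
# su - tom
$ tail -f /var/log/httpd/access_log    --'-F' does the same but also keeps following by name when the file is rotated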

----------------------------

$ cp -a flume-telnet.conf flume-apache.conf

a2.sources = r2
a2.channels = c2
a2.sinks = k2

# define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /var/log/httpd/access_log
# '-c' tells the shell to run the command string that follows; it is required
a2.sources.r2.shell = /bin/bash -c

# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# define sinks
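# Enable multi-level directories: two levels here, "yyyymmdd/hour", with a new folder every hour
a2.sinks.k2.type = hdfs
# The directories are created automatically
a2.sinks.k2.hdfs.path=hdfs://192.168.122.128:8020/flume/%Y%m%d/%H
# File name prefix
a2.sinks.k2.hdfs.filePrefix = accesslog
# Round the timestamp down so that folders are created per time bucket
a2.sinks.k2.hdfs.round=true
# Rounding value: 1, unit: hour
a2.sinks.k2.hdfs.roundValue=1
a2.sinks.k2.hdfs.roundUnit=hour
# Use the local time when filling in the time escapes, e.g. for naming folders and files
a2.sinks.k2.hdfs.useLocalTimeStamp=true

# Number of events written to the file before it is flushed to HDFS
a2.sinks.k2.hdfs.batchSize=1000
a2.sinks.k2.hdfs.fileType=DataStream
a2.sinks.k2.hdfs.writeFormat=Text

# Avoid producing too many small files (the default settings generate lots of small files)
# Roll to a new file every 600 seconds
a2.sinks.k2.hdfs.rollInterval=600
# Roll to a new file once the current one reaches 128000000 bytes
# In a real environment with a 128M block size, this is usually set to 127M (127*1024*1024)
a2.sinks.k2.hdfs.rollSize=128000000
# Do not roll files based on the number of events
a2.sinks.k2.hdfs.rollCount=0
# Must be set to 1; otherwise a new file is started whenever block replication happens, and the three roll settings above stop taking effect
a2.sinks.k2.hdfs.minBlockReplicas=1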

# bind the sources and sinks to the channels
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
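
How hdfs.path, round, roundValue, and roundUnit combine into a target directory can be sketched in a few lines. This Python function is an illustrative approximation, not Flume's implementation: bucket_path is a hypothetical name, only the "hour" and "minute" units are handled, and it relies on %Y, %m, %d, and %H meaning the same thing in strftime as in the path above.

```python
from datetime import datetime


def bucket_path(template, when, round_value=1, round_unit="hour"):
    """Round `when` down to the bucket boundary, then expand the time escapes."""
    if round_unit == "hour":
        when = when.replace(hour=when.hour - when.hour % round_value,
                            minute=0, second=0, microsecond=0)
    elif round_unit == "minute":
        when = when.replace(minute=when.minute - when.minute % round_value,
                            second=0, microsecond=0)
    else:
        raise ValueError("this sketch only handles 'hour' and 'minute'")
    return when.strftime(template)


# An event written at 2016-12-27 17:21:18 with roundValue=1, roundUnit=hour
print(bucket_path("/flume/%Y%m%d/%H", datetime(2016, 12, 27, 17, 21, 18)))
# prints /flume/20161227/17
```

With round_value=6 the same timestamp lands in /flume/20161227/12, because 17:21 is rounded down to the 12:00 boundary.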

Test:
a) Start CDH Hadoop
$ sbin/start-dfs.sh ; sbin/start-yarn.sh ; sbin/mr-jobhistory-daemon.sh start historyserver
b) Start Apache
# service httpd start
c) Start flume
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-apache.conf
d) Refresh http://192.168.122.128
   Watch the web log: $ tail -f /var/log/httpd/access_log
   Watch HDFS:        $ bin/hdfs dfs -tail -f /flume/20161227/17/accesslog.1482830478766.tmp

PS:
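a) flume should be installed on the same machine as the web server (apache), which makes it easier to pull the data into HDFS

b) When the log file grows too large to read comfortably, it can be emptied
$ echo "" > flume.log
or view only the last 100 lines
$ tail -100 flume.log

c) flume can run in the background
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-apache.conf &
To stop it:
$ jps      --find the pid of the process named Application; that is flume's pid
$ kill -9 <pid>

d)
$ echo $SHELL      --the current user's default shell
$ cat /etc/shells --all shells available on the system; the features each shell supports differ slightly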

-------------------

Interview question: flume throws the following exception while in use. What is the cause?
07 九月 2016 10:41:39,846 ERROR [conf-file-poller-0] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run:145) - Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
   at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:251)
   at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
   at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:413)

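Cause: the Hadoop classes that the HDFS sink depends on are not on flume's classpath. Copy the following jars into flume's lib directory (sometimes flume runs fine even without copying them)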
$ cp hadoop-hdfs-2.5.0-cdh5.3.6.jar   /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib/
$ cp hadoop-common-2.5.0-cdh5.3.6.jar /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib/
$ cp commons-configuration-1.6.jar    /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib/
$ cp hadoop-auth-2.5.0-cdh5.3.6.jar   /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib/
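
To find out which jar supplies a class named in a NoClassDefFoundError, the jars can be scanned directly, since a jar is a zip archive. The sketch below is a hypothetical Python helper (jars_containing is not part of Flume or Hadoop); the directory to scan is whatever lib folder is being checked.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name):
    """Return the names of the jars in lib_dir that contain the given class."""
    # org.apache.Foo$Bar is stored in the archive as org/apache/Foo$Bar.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                hits.append(jar.name)
    return hits
```

For example, jars_containing("/opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib", "org.apache.hadoop.io.SequenceFile$CompressionType") returns an empty list as long as the jar that provides this class has not been copied in.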

====Example 3==============================================================

Use flume to monitor a directory (/home/tom/log) and pull the files that have finished rolling into HDFS in real time.

$ mkdir /home/tom/log
$ cd /home/tom/log
$ cp /var/log/httpd/access_log access_log.1
$ cp /var/log/httpd/access_log access_log.2
Goal: ingest the files access_log.1 and access_log.2

$ mkdir /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/checkpoint
$ mkdir /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/checkdata

$ cp -a flume-apache.conf flume-dir.conf

a3.sources = r3
a3.channels = c3
a3.sinks = k3

# define sources
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /home/tom/log
# Regular expression for the files to ignore
# '.' matches any character except line terminators, '*' means zero or more repetitions
a3.sources.r3.ignorePattern = ^.*\_log$

# define channels
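# The file channel stages events in files on disk and flushes them from there; slower than memory, but the data is safer
# A memory channel would also work here
a3.channels.c3.type = file
# Where the checkpoint is stored; the checkpoint holds metadata, such as what has already been taken and what has not...
a3.channels.c3.checkpointDir = /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/checkpoint
# Where the channel's data files are stored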
a3.channels.c3.dataDirs = /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/checkdata

# define sinks
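# Enable multi-level directories: two levels here, "yyyymmdd/hour", with a new folder every hour
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path=hdfs://192.168.122.128:8020/flume2/%Y%m%d/%H
a3.sinks.k3.hdfs.filePrefix = accesslog
# Round the timestamp down so that folders are created per time bucket
a3.sinks.k3.hdfs.round=true
a3.sinks.k3.hdfs.roundValue=1
a3.sinks.k3.hdfs.roundUnit=hour
# Use the local time when filling in the time escapes
a3.sinks.k3.hdfs.useLocalTimeStamp=true

a3.sinks.k3.hdfs.batchSize=1000
a3.sinks.k3.hdfs.fileType=DataStream
a3.sinks.k3.hdfs.writeFormat=Text

# Avoid producing too many small files
# Roll to a new file every 600 seconds
a3.sinks.k3.hdfs.rollInterval=600
a3.sinks.k3.hdfs.rollSize=128000000
# Do not roll files based on the number of events
a3.sinks.k3.hdfs.rollCount=0
# Set to 1; otherwise a new file is started whenever block replication happens, and the three roll settings above stop taking effect
a3.sinks.k3.hdfs.minBlockReplicas=1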

# bind the sources and sinks to the channels
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
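
The effect of the ignorePattern above can be checked against a few file names before starting the agent. This is a small Python sketch under two assumptions: the pattern is applied to the whole file name, and the backslash before '_' changes nothing, so the pattern is written here as ^.*_log$. The is_ignored helper is hypothetical.

```python
import re

# Equivalent of the ignorePattern value, assuming '\_' is just a literal underscore.
IGNORE_PATTERN = re.compile(r"^.*_log$")


def is_ignored(filename):
    """True if the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(filename) is not None


for name in ["access_log", "access_log.1", "access_log.2"]:
    print(name, is_ignored(name))
# prints:
# access_log True
# access_log.1 False
# access_log.2 False
```

So the live access_log would be skipped if it sat in the directory, while the rolled copies access_log.1 and access_log.2 are picked up.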

Test:
$ bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-dir.conf
Check the result at http://192.168.122.128:50070
** In log/, the files carrying the .COMPLETED suffix have been fully ingested
$ ls
access_log.1.COMPLETED access_log.2.COMPLETED

Create one more log file and it is picked up right away
$ cp access_log.1.COMPLETED access_log.3
$ ls
access_log.1.COMPLETED access_log.3.COMPLETED   access_log.2.COMPLETED

====Example 4================================================================

Start three agents on the same server:
agent1: monitors /var/log/httpd/access_log in real time

** flume-apache.conf

# configure agent1
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# define sources
agent1.sources.r1.type = exec
# Note: the user running the flume command must have read permission on /var/log/httpd/access_log
agent1.sources.r1.command = tail -F /var/log/httpd/access_log
agent1.sources.r1.shell = /bin/bash -c

# define channels
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# define sinks
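# avro is a serialization technology; the avro sink sends events on to the avro source at the given host and port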
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = 192.168.122.128
agent1.sinks.k1.port = 4545

# bind the sources and sinks to the channels
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1

Test:
Start Apache

Start agent1:
$ bin/flume-ng agent --conf conf/ --name agent1 --conf-file conf/flume-apache.conf
$ tail -F /var/log/httpd/access_log
Refresh the web page and watch the log change

------------------

agent2: monitors /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log in real time
$ mkdir logs
$ vi conf/hive-log4j.properties
hive.log.dir=/opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs

** flume-hive.conf

# configure agent2
agent2.sources = r2
agent2.channels = c2
agent2.sinks = k2

# define sources
agent2.sources.r2.type = exec
agent2.sources.r2.command = tail -F /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log
agent2.sources.r2.shell = /bin/bash -c

# define channels
agent2.channels.c2.type = memory
agent2.channels.c2.capacity = 1000
agent2.channels.c2.transactionCapacity = 100

# define sinks
agent2.sinks.k2.type = avro
agent2.sinks.k2.hostname = 192.168.122.128
agent2.sinks.k2.port = 4545

# bind the sources and sinks to the channels
agent2.sources.r2.channels = c2
agent2.sinks.k2.channel = c2

Test:
Start agent2:
$ bin/flume-ng agent --conf conf/ --name agent2 --conf-file conf/flume-hive.conf
$ tail -F /opt/modules/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log
Enter hive and run a few statements, then watch the log change
hive> show databases;
...

-------------------

agent3: collects the data sent over by agent1 and agent2 in real time

** flume-collector.conf

# configure agent3
agent3.sources = r3
agent3.channels = c3
agent3.sinks = k3

# define sources
agent3.sources.r3.type = avro
agent3.sources.r3.bind = 192.168.122.128
agent3.sources.r3.port = 4545

# define channels
agent3.channels.c3.type = memory
agent3.channels.c3.capacity = 1000
agent3.channels.c3.transactionCapacity = 100

# define sinks
# Enable multi-level directories: two levels here, "yyyymmdd/hour", with a new folder every hour
agent3.sinks.k3.type = hdfs
agent3.sinks.k3.hdfs.path=hdfs://192.168.122.128:8020/flume3/%Y%m%d/%H
agent3.sinks.k3.hdfs.filePrefix = accesslog

# Create folders per hour
agent3.sinks.k3.hdfs.round=true
agent3.sinks.k3.hdfs.roundValue=1
agent3.sinks.k3.hdfs.roundUnit=hour
agent3.sinks.k3.hdfs.useLocalTimeStamp=true

agent3.sinks.k3.hdfs.batchSize=1000
agent3.sinks.k3.hdfs.fileType=DataStream
agent3.sinks.k3.hdfs.writeFormat=Text

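# Avoid producing too many small files
# Roll to a new file every 600 seconds
agent3.sinks.k3.hdfs.rollInterval=600
agent3.sinks.k3.hdfs.rollSize=128000000
# Do not roll files based on the number of events
agent3.sinks.k3.hdfs.rollCount=0
# Set to 1; otherwise a new file is started whenever block replication happens, and the three roll settings above stop taking effect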
agent3.sinks.k3.hdfs.minBlockReplicas=1

# bind the sources and sinks to the channels
agent3.sources.r3.channels = c3
agent3.sinks.k3.channel = c3
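
The three roll settings act as independent triggers, and a value of 0 switches a trigger off. The Python sketch below is a simplified model of that rule, not the sink's actual code: should_roll and its parameter names are hypothetical, and the real sink rolls on a timer rather than by polling elapsed time.

```python
def should_roll(elapsed_seconds, bytes_written, events_written,
                roll_interval=600, roll_size=128000000, roll_count=0):
    """Return True when any enabled threshold has been reached (0 disables it)."""
    if roll_interval > 0 and elapsed_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False


# The "127M" value suggested for a 128M block size, in bytes.
BLOCK_ALIGNED_ROLL_SIZE = 127 * 1024 * 1024
print(BLOCK_ALIGNED_ROLL_SIZE)  # prints 133169152
```

With the settings used in these examples, a file that has been open for 60 seconds and holds one million small events is not rolled, because rollCount=0 takes the event count out of the decision.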

Test:
Start agent3:
$ bin/flume-ng agent --conf conf/ --name agent3 --conf-file conf/flume-collector.conf
In CDH Hadoop, watch the output change. Note: adjust the path to match your own run (watching the .tmp file shows the effect more clearly)
$ bin/hdfs dfs -tail -f /flume3/20161220/11/accesslog.1482203839459
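
In this fan-in layout every avro sink has to point at the host and port that the collector's avro source listens on. A quick consistency check can be sketched in Python. The helpers avro_endpoints and unmatched_sinks are hypothetical, the properties are passed in as plain dictionaries, and the comparison is a literal string match, so a collector bound to 0.0.0.0 would need special handling.

```python
def avro_endpoints(props, kind, host_key):
    """Map each avro component of the given kind to its (host, port) pair."""
    found = {}
    for key, value in props.items():
        parts = key.split(".")
        # Looks for lines such as: agent1.sinks.k1.type = avro
        if len(parts) == 4 and parts[1] == kind and parts[3] == "type" and value == "avro":
            prefix = ".".join(parts[:3])
            found[prefix] = (props[prefix + "." + host_key], props[prefix + ".port"])
    return found


def unmatched_sinks(sender_props, collector_props):
    """Return the avro sinks whose target is not an avro source of the collector."""
    listening = set(avro_endpoints(collector_props, "sources", "bind").values())
    sinks = avro_endpoints(sender_props, "sinks", "hostname")
    return sorted(name for name, target in sinks.items() if target not in listening)


senders = {
    "agent1.sinks.k1.type": "avro",
    "agent1.sinks.k1.hostname": "192.168.122.128",
    "agent1.sinks.k1.port": "4545",
    "agent2.sinks.k2.type": "avro",
    "agent2.sinks.k2.hostname": "192.168.122.128",
    "agent2.sinks.k2.port": "4545",
}
collector = {
    "agent3.sources.r3.type": "avro",
    "agent3.sources.r3.bind": "192.168.122.128",
    "agent3.sources.r3.port": "4545",
}
```

unmatched_sinks(senders, collector) returns an empty list for the three configurations above; changing one sink's port to 4546 makes that sink show up in the result.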

0 0