Flume


Flume Agent

    A Flume Agent's configuration is stored in a local configuration file, a text file in Java properties format. This file defines the properties of each source, channel, and sink, along with how they are linked together to form a data flow. The Flume Agent monitors its input (for example a port) in real time, collects the data, and prints it to the console as log output. One source can fan out to multiple channels, while each sink drains exactly one channel. Developing with Flume means writing configuration files; put plainly, it comes down to the types and properties of the Sources, Channels, and Sinks in an Agent.

    Flume types commonly used in enterprises:

    source (reads the data):
    ->exec (a single file)
    ->spoolingdir (a directory)
    ->taildir (changes to directories as well as files)
    ->kafka
    ->syslog
    ->http

    channel (the pipe):
    ->mem
    ->file
    ->kafka

    sink (sends the channel's data to the destination):
    ->hdfs
    ->hive
    ->hbase
        ->synchronous
        ->asynchronous
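    For example, the fan-out rule above maps directly onto the wiring properties; a minimal sketch (the agent and component names here are illustrative, not from a real config):

    # one source fans out to two channels; each sink drains exactly one channel
    a1.sources.s1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2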

Writing an Agent (Basic)

  1. Case 1: source: hive.log, channel: mem, sink: log

    1. Preparation

      Copy a template from the conf directory:
      cp flume-conf.properties.template hive-mem-log.properties
      Watch the Hive log as it grows:
      tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      -Dflume.root.logger=INFO,console : sets the log level and directs output to the console
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-log.properties -Dflume.root.logger=INFO,console
    3. Full configuration
      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = logger

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
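      To verify the flow end to end, append a line to the monitored log and watch the agent console print it via the logger sink; a sketch, assuming the hive.log path above is writable:

      # generate one test event
      echo "flume exec source test $(date)" >> /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log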
  2. Case 2: channel: file

    1. Preparation

      First create the directories under /opt/datas: mkdir -p flume/datas and mkdir flume/check.
      The following two parameters are required:
      a1.channels.c1.dataDirs = /opt/datas/flume/datas
      a1.channels.c1.checkpointDir = /opt/datas/flume/check
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-file-log.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = file
      a1.channels.c1.dataDirs = /opt/datas/flume/datas
      a1.channels.c1.checkpointDir = /opt/datas/flume/check

      # define sink
      a1.sinks.k1.type = logger

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
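      Unlike the memory channel, the file channel persists in-flight events on disk, so the agent can survive a restart without losing buffered data. A quick way to confirm it is working is to look inside the two directories; a sketch (the log-* naming is the file channel's own convention for its data files):

      # event data is written here as log-* files
      ls /opt/datas/flume/datas
      # periodic snapshots of the channel state land here
      ls /opt/datas/flume/check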
  3. Case 3: sink: hdfs

    1. Preparation

      Case analysis: collect data to HDFS in real time. This case monitors the Hive log file and stores it in an HDFS directory: an Exec Source watches the file for new data, a Memory Channel buffers it, and an HDFS Sink writes it out.
      The hdfs.path property (e.g. a1.sinks.k1.hdfs.path = /flume/event/size) can be specified in three ways: 1. configure a global variable; 2. rely on the Hadoop configuration files copied into conf; 3. write out the absolute HDFS path.
      HDFS directory used here: a1.sinks.k1.hdfs.path = /flume/event
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-hdfs.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
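      Once a few events have flowed through, the sink's output can be checked directly on HDFS; a sketch, assuming the sink's default FlumeData file prefix (files still being written carry a .tmp suffix):

      # list what the HDFS sink has written so far
      hdfs dfs -ls /flume/event
      # print the collected log lines
      hdfs dfs -cat /flume/event/FlumeData.*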

Problems with the Agent So Far (Intermediate)

  1. Case 4: rolling files by size

    1. Preparation

      # define sink: to roll purely by size, disable the time- and count-based triggers by setting them to 0
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event/size
      a1.sinks.k1.hdfs.fileType = DataStream
      a1.sinks.k1.hdfs.rollSize = 10240
      a1.sinks.k1.hdfs.rollInterval = 0
      a1.sinks.k1.hdfs.rollCount = 0
      (A quick sanity check of the roll size follows the full configuration below.)
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-size.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event/size
      a1.sinks.k1.hdfs.fileType = DataStream
      a1.sinks.k1.hdfs.rollSize = 10240
      a1.sinks.k1.hdfs.rollInterval = 0
      a1.sinks.k1.hdfs.rollCount = 0

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
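      A rough sanity check of the 10240-byte roll size: push a few tens of KB through the source and confirm that each closed file on HDFS ends up near that size; a sketch:

      # append roughly 25 KB to the tailed log in one go
      for i in $(seq 1 1000); do echo "roll-size test line $i"; done >> /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      # each closed file should be roughly 10240 bytes
      hdfs dfs -ls /flume/event/size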
  2. Case 5: Hive partitioned tables and changing the file name prefix

    1. Hive partitioning

      a1.sinks.k1.hdfs.path = /flume/event/date/date=%Y%m%d/hour=%H%M
    2. Changing the file name prefix

      a1.sinks.k1.hdfs.filePrefix = hive-log
    3. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-part.properties -Dflume.root.logger=INFO,console
    4. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event/date/date=%Y%m%d/hour=%H%M
      a1.sinks.k1.hdfs.fileType = DataStream
      a1.sinks.k1.hdfs.rollSize = 10240
      a1.sinks.k1.hdfs.rollInterval = 0
      a1.sinks.k1.hdfs.rollCount = 0
      a1.sinks.k1.hdfs.useLocalTimeStamp = true
      a1.sinks.k1.hdfs.filePrefix = hive-log

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
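      After some events flow, the sink creates one directory per escape-sequence value, which lines up with a Hive partition layout; a sketch of inspecting it (the date and time shown are just examples):

      # one directory per day
      hdfs dfs -ls /flume/event/date
      # files for one hour-minute bucket, prefixed with hive-log
      hdfs dfs -ls "/flume/event/date/date=20161202/hour=1030"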

Log Storage in the Enterprise (Advanced I)

  1. Case 6: source: spooldir

    1. Preparation

      Log files in production are typically generated like this:
      web_log/20161127.log.tmp  ->  20161127.log
              20161128.log.tmp
      How do we collect this data? exec can only monitor a single file.
      Requirement: dynamically detect new files appearing in a directory and collect them.
      source: spooldir monitors a directory.
      Create the directory: /opt/datas/flume/spooldir
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/dir-mem-hdfs.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = spooldir
      a1.sources.s1.spoolDir = /opt/datas/flume/spooldir

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/dir
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
    4. Test
      Copy a file into the directory; once processing finishes, its name changes to 20161202.log.COMPLETED, indicating the upload is complete.

    5. Optimization: ignore files ending in .tmp
      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = spooldir
      a1.sources.s1.spoolDir = /opt/datas/flume/spooldir
      a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/dir-ig
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
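      A quick test of the pattern: place one in-progress file and one finished file in the directory; only the finished one should be collected; a sketch (the file names are illustrative):

      cd /opt/datas/flume/spooldir
      # matches ignorePattern, so the source skips it
      echo "still being written" > 20161203.log.tmp
      # collected, then renamed with the .COMPLETED suffix
      echo "finished log" > 20161203.log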

Enterprise Architecture (Advanced II)

  1. Multi-sink configuration: one sink per channel

    1. Full configuration
      # define agent
      a1.sources = s1
      a1.channels = hdfsc1 hdfsc2
      a1.sinks = hdfs1 hdfs2

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel1
      a1.channels.hdfsc1.type = file
      a1.channels.hdfsc1.dataDirs = /opt/datas/flume/data1
      a1.channels.hdfsc1.checkpointDir = /opt/datas/flume/check1

      # define channel2
      a1.channels.hdfsc2.type = file
      a1.channels.hdfsc2.dataDirs = /opt/datas/flume/data2
      a1.channels.hdfsc2.checkpointDir = /opt/datas/flume/check2

      # define sink1
      a1.sinks.hdfs1.type = hdfs
      a1.sinks.hdfs1.hdfs.path = /flume/dir1
      a1.sinks.hdfs1.hdfs.fileType = DataStream

      # define sink2
      a1.sinks.hdfs2.type = hdfs
      a1.sinks.hdfs2.hdfs.path = /flume/dir2
      a1.sinks.hdfs2.hdfs.fileType = DataStream

      # bind source and sinks to the channels
      a1.sources.s1.channels = hdfsc1 hdfsc2
      a1.sinks.hdfs1.channel = hdfsc1
      a1.sinks.hdfs2.channel = hdfsc2
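      This works because the source's channel selector defaults to replicating: every event is copied into both channels and ends up under both HDFS paths. If the goal were instead to route events to different channels by header value, the selector can be switched to multiplexing; a minimal sketch, assuming a hypothetical header named logtype (the header name and values below are illustrative, not part of the case above):

      # route by header value instead of copying to every channel
      a1.sources.s1.selector.type = multiplexing
      a1.sources.s1.selector.header = logtype
      a1.sources.s1.selector.mapping.web = hdfsc1
      a1.sources.s1.selector.mapping.app = hdfsc2
      a1.sources.s1.selector.default = hdfsc1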
  2. flume collect

    1. Preparation

      avro sink & avro source
      Agent side: collect the logs and forward them to an avro source.
      Collector side: receive the agents' data and hand it to the HDFS sink.
      Once avro-collect.properties is configured, ship the configuration to the other machines in the cluster.
    2. Run commands

      Run the collector on the master host:
      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro-collect.properties -Dflume.root.logger=INFO,console
      Run agent1 on slave1:
      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro-agent.properties -Dflume.root.logger=INFO,console
      Run agent2 on slave2:
      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro-agent.properties -Dflume.root.logger=INFO,console
    3. Full configuration: avro-collect.properties

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = avro
      a1.sources.s1.bind = 192.168.134.191
      a1.sources.s1.port = 50505

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/avro
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
    4. Full configuration: avro-agent.properties

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = avro
      a1.sinks.k1.hostname = 192.168.134.191
      a1.sinks.k1.port = 50505

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
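      With the collector and both agents running, a simple end-to-end check is to generate an event on a slave and read it back from HDFS on the collector side; a sketch, assuming the sink's default FlumeData file prefix:

      # on slave1 or slave2: append a line to the tailed log
      echo "avro pipeline test $(date)" >> /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      # on the collector host: the event should arrive under /flume/avro
      hdfs dfs -cat /flume/avro/FlumeData.*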

Using taildir (Advanced III)

  1. About taildir
    1. Requirement
      We need to monitor individual files and a directory at the same time, but neither exec nor spoolingdir can do both, so we use taildir. It only ships with Flume 1.7 and later, so on this 1.5 build the source has to be compiled by hand.
  2. Building taildir from the GitHub source

    1. Create a new directory; avoid Chinese characters in the path
    2. Enter the directory and run

      git clone https://github.com/apache/flume.git
    3. When the download finishes, enter the flume directory

      List remote branches: git branch -r
      List all branches (local and remote): git branch -a
      Switch to the 1.7 branch: git checkout origin/flume-1.7
    4. Import into Eclipse

      1. Replace the jar versions

        Change the version from 1.5.0 to 1.5.0-cdh5.3.6 (the CDH version) and add the CDH Maven repository; the easiest route is to copy in the provided pom directly.
      2. Possible errors

        1. Delete the overwrite
        2. Copy flume-1.7/core/source/PollableSourceConstants.class into the project
    5. Build with Maven
    6. Put the resulting jar into Flume's lib directory: flume-taildir-source-1.5.0-cdh5.3.6.jar
  3. Running taildir

    1. Preparation

      Create a file:      [beifeng@hadoop-senior01 flume]$ echo " " > hadoop10.log
      Create a directory: [beifeng@hadoop-senior01 flume]$ mkdir hadoop10
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/taildir-mem-log.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
      a1.sources.s1.positionFile = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6-bin/position/taildir_position.json
      a1.sources.s1.filegroups = f1 f2
      a1.sources.s1.filegroups.f1 = /opt/datas/flume/hadoop10.log
      a1.sources.s1.headers.f1.headerKey1 = value1
      a1.sources.s1.filegroups.f2 = /opt/datas/flume/hadoop10/.*
      a1.sources.s1.headers.f2.headerKey1 = value2-1
      a1.sources.s1.headers.f2.headerKey2 = value2-2
      a1.sources.s1.fileHeader = true

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = logger

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
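      To see both file groups in action, append to the tracked file and to a new file inside the tracked directory; the logger sink should print events from both, carrying the headers configured above (the file name app1.log is illustrative):

      cd /opt/datas/flume
      # matched by filegroups.f1 (the single tracked file)
      echo "single-file group f1" >> hadoop10.log
      # matched by filegroups.f2 (any file under hadoop10/)
      echo "directory group f2" >> hadoop10/app1.log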
    4. Results
      Test
      Console output
      Contents of taildir_position.json
