Flume to HDFS example


1. Background

Machines A and B each produce log files in real time: access.log, ugchead.log, and ugctail.log.

Requirement:
Aggregate access.log, ugchead.log, and ugctail.log from machines A and B onto machine C, which writes everything to HDFS.

The directory layout required in HDFS is:
/source/access/20160101/**
/source/ugchead/20160101/**
/source/ugctail/20160101/**


2. Data flow

Agents on A and B tail the three log files with exec sources, tag every event with a static "type" header (access / ugchead / ugctail), and forward the events over Avro to the agent on C. The agent on C adds a timestamp to each event, and its HDFS sink writes the events to /source/%{type}/%Y%m%d.

3. Machines

A:hadoop1
B:hadoop2
C:hadoop3


4. Install the Hadoop cluster first

http://blog.csdn.net/u012689336/article/details/52857921
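
The HDFS sink will run on hadoop3, so the Hadoop client jars and configuration must also be available on that machine; the flume-ng script typically picks them up when the hadoop command is on the PATH or HADOOP_HOME is set. A quick sanity check that HDFS is reachable from hadoop3:

[sparkadmin@hadoop3 ~]$ hadoop fs -ls /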


5. Install Flume

This walkthrough uses apache-flume-1.5.0-bin.tar.gz.

Unpack the tarball:
[sparkadmin@hadoop1 ~]$ tar -zxvf apache-flume-1.5.0-bin.tar.gz

Create a symlink to it:
[sparkadmin@hadoop1 ~]$ ln -s apache-flume-1.5.0-bin flume

Set JAVA_HOME in flume-env.sh:
[sparkadmin@hadoop1 ~]$ cd flume/conf/
[sparkadmin@hadoop1 conf]$ mv flume-env.sh.template flume-env.sh
[sparkadmin@hadoop1 conf]$ vim flume-env.sh
JAVA_HOME=/usr/java/jdk1.7.0_79
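
The same steps (unpack, symlink, set JAVA_HOME) are repeated on hadoop2 and hadoop3, since each machine runs its own Flume agent. A quick way to confirm the installation is usable:

[sparkadmin@hadoop1 flume]$ bin/flume-ng version

If JAVA_HOME is picked up correctly, this prints the Flume release (1.5.0 here) rather than an error.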

6. Write the configuration files

Machine A: configuration file a.conf:

[sparkadmin@hadoop1 conf]$ vim a.conf

#agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /cloud/data/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
#statically add a key/value pair (key=type, value=access) to the event header
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r1.channels = c1

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /cloud/data/ugchead.log
a1.sources.r2.interceptors = i1
a1.sources.r2.interceptors.i1.type = static
a1.sources.r2.interceptors.i1.key = type
a1.sources.r2.interceptors.i1.value = ugchead
a1.sources.r2.channels = c1

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /cloud/data/ugctail.log
a1.sources.r3.interceptors = i1
a1.sources.r3.interceptors.i1.type = static
a1.sources.r3.interceptors.i1.key = type
a1.sources.r3.interceptors.i1.value = ugctail
a1.sources.r3.channels = c1

#sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop3
a1.sinks.k1.port = 44444
a1.sinks.k1.channel = c1

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000


 Machine B: configuration file b.conf (identical to a.conf):
[sparkadmin@hadoop2 conf]$ vim b.conf
#agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /cloud/data/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
#statically add a key/value pair (key=type, value=access) to the event header
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r1.channels = c1

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /cloud/data/ugchead.log
a1.sources.r2.interceptors = i1
a1.sources.r2.interceptors.i1.type = static
a1.sources.r2.interceptors.i1.key = type
a1.sources.r2.interceptors.i1.value = ugchead
a1.sources.r2.channels = c1

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /cloud/data/ugctail.log
a1.sources.r3.interceptors = i1
a1.sources.r3.interceptors.i1.type = static
a1.sources.r3.interceptors.i1.key = type
a1.sources.r3.interceptors.i1.value = ugctail
a1.sources.r3.channels = c1

#sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop3
a1.sinks.k1.port = 44444
a1.sinks.k1.channel = c1

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000


Machine C: configuration file c.conf:

[sparkadmin@hadoop3 conf]$ vim c.conf

#agent

a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop3
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# interceptor on source r1: add a timestamp to each event header
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

#sink
a1.sinks.k1.type = hdfs
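# %{type} expands to the "type" header set by the static interceptors on A and B; %Y%m%d is resolved from the timestamp added by the interceptor above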
a1.sinks.k1.hdfs.path = hdfs://hadoop1:8020/source/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileType = DataStream
#a1.sinks.k1.hdfs.fileType = CompressedStream
#a1.sinks.k1.hdfs.codeC = gzip
#do not roll files based on event count
a1.sinks.k1.hdfs.rollCount = 0
#roll to a new file every 64 MB; with compressed storage, the 64 MB threshold applies to the size before compression and the rolled file is then stored compressed
a1.sinks.k1.hdfs.rollSize = 67108864
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.channel = c1

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
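
A possible alternative to the timestamp interceptor (a sketch, assuming the Flume release in use supports the hdfs.useLocalTimeStamp property, as 1.5.0 documents): let the HDFS sink fill in %Y%m%d from the collector's local clock instead of an event header by adding to c.conf:

#a1.sinks.k1.hdfs.useLocalTimeStamp = true

In this setup the timestamp interceptor also runs on machine C, so both approaches effectively use the collector's clock for the date partition; stamping the timestamp on A and B instead would preserve the senders' event time.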


Start Flume on hadoop3 first, so the Avro collector is listening before the senders start:

[sparkadmin@hadoop3 flume]$ bin/flume-ng agent --conf conf/ --conf-file conf/c.conf --name a1 -Dflume.root.logger=INFO,console &
[1] 29875

Then start Flume on hadoop1 and hadoop2 (hadoop1 uses a.conf, hadoop2 uses b.conf):

[sparkadmin@hadoop1 flume]$ bin/flume-ng agent --conf conf/ --conf-file conf/a.conf --name a1 &

[sparkadmin@hadoop2 flume]$ bin/flume-ng agent --conf conf/ --conf-file conf/b.conf --name a1 &
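
To push some test traffic through the pipeline, append lines to one of the tailed files, for example on hadoop1 (a throwaway sketch; it assumes /cloud/data/access.log exists and is the file the exec source tails):

[sparkadmin@hadoop1 ~]$ for i in $(seq 1 100); do echo "test access line $i" >> /cloud/data/access.log; sleep 1; done

The same can be done for ugchead.log and ugctail.log, and on hadoop2.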


Verification:

[sparkadmin@hadoop3 ~]$ hadoop fs -ls /source
Found 3 items
drwxr-xr-x   - sparkadmin supergroup          0 2016-10-19 14:16 /source/access
drwxr-xr-x   - sparkadmin supergroup          0 2016-10-19 14:16 /source/ugchead
drwxr-xr-x   - sparkadmin supergroup          0 2016-10-19 14:16 /source/ugctail


[sparkadmin@hadoop3 ~]$ hadoop fs -ls /source/access
Found 1 items
drwxr-xr-x   - sparkadmin supergroup          0 2016-10-19 14:16 /source/access/20161019


[sparkadmin@hadoop3 ~]$ hadoop fs -ls /source/access/20161019
Found 1 items
-rw-r--r--   3 sparkadmin supergroup         29 2016-10-19 14:16 /source/access/20161019/events-.1476857773516.tmp
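
The .tmp suffix means the HDFS sink still has the file open; it is renamed without the suffix once the file rolls, which with this configuration happens when it reaches the 64 MB rollSize or when the agent is shut down cleanly. The contents can be inspected even while the file is open, although the most recent batch may not have been flushed yet:

[sparkadmin@hadoop3 ~]$ hadoop fs -cat /source/access/20161019/events-.1476857773516.tmp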