大数据Flume_06

来源：互联网发布：赌博游戏源码编辑：程序博客网时间：2024/06/05 02:33

Flume

一、数据模型

Flume的概念

flume是分布式的日志收集系统，它将各个服务器中的数据收集起来并送到指定的地方去，比如说送到图中的HDFS，简单来说flume就是收集日志的。

Note：收集到的数据不一定直接到HDFS，还可以暂时存储到Kafka中，然后在存储到HDFS中。

Event的概念
event的相关概念：

flume的核心是把数据从数据源(source)收集过来，在将收集到的数据送到指定的目的地(sink)。为了保证输送的过程一定成功，在送到目的地(sink)之前，会先缓存数据(channel),待数据真正到达目的地(sink)后，flume在删除自己缓存的数据。
在整个数据的传输的过程中，流动的是event，即事务保证是在event级别进行的。那么什么是event呢？—–event将传输的数据进行封装，是flume传输数据的基本单位，如果是文本文件，通常是一行记录，event也是事务的基本单位。event从source，流向channel，再到sink，本身为一个字节数组，并可携带headers(头信息)信息。event代表着一个数据的最小完整单元，从外部数据源来，向外部的目的地去。
为了方便大家理解，给出一张event的数据流向图：

一个完整的event包括：event headers、event body、event信息(即文本文件中的单行记录)，如下所以：

其中event信息就是flume收集到的日记记录。

二、简单的配置

三、启动Flume

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

example.conf 指定上面的配置文件。

a1 为agent的别名。

INFO 指定日志的级别。

condole 为日志输出位置。

四、flume架构介绍

flume之所以这么神奇，是源于它自身的一个设计，这个设计就是agent，agent本身是一个java进程，运行在日志收集节点—所谓日志收集节点就是服务器节点。
agent里面包含3个核心的组件：source—->channel—–>sink,类似生产者、仓库、消费者的架构。
source：source组件是专门用来收集数据的，可以处理各种类型、各种格式的日志数据,包括avro、thrift、exec、jms、spooling directory、netcat、sequence generator、syslog、http、legacy、自定义。
channel：source组件把数据收集来以后，临时存放在channel中，即channel组件在agent中是专门用来存放临时数据的——对采集到的数据进行简单的缓存，可以存放在memory、jdbc、file等等。
sink：sink组件是用于把数据发送到目的地的组件，目的地包括hdfs、logger、avro、thrift、ipc、file、null、hbase、solr、自定义。

五、flume的运行机制

flume的核心就是一个agent，这个agent对外有两个进行交互的地方，一个是接受数据的输入——source，一个是数据的输出sink，sink负责将数据发送到外部指定的目的地。source接收到数据之后，将数据发送给channel，chanel作为一个数据缓冲区会临时存放这些数据，随后sink会将channel中的数据发送到指定的地方—-例如HDFS等，注意：只有在sink将channel中的数据成功发送出去之后，channel才会将临时数据进行删除，这种机制保证了数据传输的可靠性与安全性。

六、flume的广义用法

flume之所以这么神奇—-其原因也在于flume可以支持多级flume的agent，即flume可以前后相继，例如sink可以将数据写到下一个agent的source中，这样的话就可以连成串了，可以整体处理了。flume还支持扇入(fan-in)、扇出(fan-out)。所谓扇入就是source可以接受多个输入，所谓扇出就是sink可以将数据输出多个目的地destination中。

七、flume应用—日志采集

对于flume的原理其实很容易理解，我们更应该掌握flume的具体使用方法，flume提供了大量内置的Source、Channel和Sink类型。而且不同类型的Source、Channel和Sink可以自由组合—–组合方式基于用户设置的配置文件，非常灵活。比如：Channel可以把事件暂存在内存里，也可以持久化到本地硬盘上。Sink可以把日志写入HDFS, HBase，甚至是另外一个Source等等。下面我将用具体的案例详述flume的具体用法。
其实flume的用法很简单—-书写一个配置文件，在配置文件当中描述source、channel与sink的具体实现，而后运行一个agent实例，在运行agent实例的过程中会读取配置文件的内容，这样flume就会采集到数据。
配置文件的编写原则：
1>从整体上描述代理agent中sources、sinks、channels所涉及到的组件

    # Name the components on this agent    a1.sources = r1    a1.sinks = k1    a1.channels = c11
2
3
4

2>详细描述agent中每一个source、sink与channel的具体实现：即在描述source的时候，需要
指定source到底是什么类型的，即这个source是接受文件的、还是接受http的、还是接受thrift
的；对于sink也是同理，需要指定结果是输出到HDFS中，还是Hbase中啊等等；对于channel
需要指定是内存啊，还是数据库啊，还是文件啊等等。

    # Describe/configure the source    a1.sources.r1.type = netcat    a1.sources.r1.bind = localhost    a1.sources.r1.port = 44444    # Describe the sink    a1.sinks.k1.type = logger    # Use a channel which buffers events in memory    a1.channels.c1.type = memory    a1.channels.c1.capacity = 1000    a1.channels.c1.transactionCapacity = 1001
2
3
4
5
6
7
8
9
10
11
12

3>通过channel将source与sink连接起来

    # Bind the source and sink to the channel    a1.sources.r1.channels = c1    a1.sinks.k1.channel = c11
2
3

启动agent的shell操作：

    flume-ng  agent -n a1  -c  ../conf   -f  ../conf/example.file      -Dflume.root.logger=DEBUG,console  1
2

参数说明： -n 指定agent名称(与配置文件中代理的名字相同)
-c 指定flume中配置文件的目录
-f 指定配置文件
-Dflume.root.logger=DEBUG,console 设置日志等级

八、具体案例：

案例1：

NetCat Source：监听一个指定的网络端口，即只要应用程序向这个端口里面写数据，这个source组件就可以获取到信息。其中 Sink：logger Channel：memory
flume官网中NetCat Source描述：

Property Name Default     Descriptionchannels       –     type           –     The component type name, needs to be netcatbind           –  日志需要发送到的主机名或者Ip地址，该主机运行着netcat类型的source在监听          port           –  日志需要发送到的端口号，该端口号要有netcat类型的source在监听      1
2
3
4
5

a) 编写配置文件：

# Name the components on this agenta1.sources = r1a1.sinks = k1a1.channels = c1# Describe/configure the sourcea1.sources.r1.type = netcata1.sources.r1.bind = 192.168.222.26a1.sources.r1.port = 44444# Describe the sinka1.sinks.k1.type = logger# Use a channel which buffers events in memorya1.channels.c1.type = memorya1.channels.c1.capacity = 1000a1.channels.c1.transactionCapacity = 100# Bind the source and sink to the channela1.sources.r1.channels = c1a1.sinks.k1.channel = c11
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

b) 启动flume agent a1 服务端

flume-ng  agent -n a1  -c /home/flumeconfig  -f /home/flumeconfig/netcat.conf   -Dflume.root.logger=DEBUG,console1

c) 使用telnet发送数据

telnet  192.168.222.26  44444  big data world！（windows中运行的）1

d) 在控制台上查看flume收集到的日志数据：
这里写图片描述

案例2：

NetCat Source：监听一个指定的网络端口，即只要应用程序向这个端口里面写数据，这个source组件就可以获取到信息。其中 Sink：hdfs Channel：file (相比于案例1的两个变化)
flume官网中HDFS Sink的描述：

a) 编写配置文件：

# Name the components on this agenta1.sources = r1a1.sinks = k1a1.channels = c1# Describe/configure the sourcea1.sources.r1.type = netcata1.sources.r1.bind = 192.168.222.108a1.sources.r1.port = 44444# Describe the sinka1.sinks.k1.type = hdfsa1.sinks.k1.hdfs.path = hdfs://weekend11:9000/dataoutputa1.sinks.k1.hdfs.writeFormat = Texta1.sinks.k1.hdfs.fileType = DataStreama1.sinks.k1.hdfs.rollInterval = 10a1.sinks.k1.hdfs.rollSize = 0a1.sinks.k1.hdfs.rollCount = 0a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%Sa1.sinks.k1.hdfs.useLocalTimeStamp = true# Use a channel which buffers events in filea1.channels.c1.type = filea1.channels.c1.checkpointDir = /usr/flume/checkpointa1.channels.c1.dataDirs = /usr/flume/data# Bind the source and sink to the channela1.sources.r1.channels = c1a1.sinks.k1.channel = c11
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

b) 启动flume agent a1 服务端

flume-ng  agent -n a1  -c /home/flumeconfig  -f /home/flumeconfig/netcat2.conf   -Dflume.root.logger=DEBUG,console1

c) 使用telnet发送数据

telnet  192.168.222.108  44444  big data world！（windows中运行的）1

d) 在HDFS中查看flume收集到的日志数据：

案例3、两个flume做集群

node01服务器中，配置文件
############################################################
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = node1
a1.sources.r1.port = 44444

# Describe the sink
# a1.sinks.k1.type = logger
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 60000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############################################################
node02服务器中，安装Flume（步骤略）
配置文件
############################################################
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node2
a1.sources.r1.port = 60000
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############################################################

先启动node02的Flume
flume-ng agent -n a1 -c conf -f avro.conf -Dflume.root.logger=INFO,console

再启动node01的Flume
flume-ng agent -n a1 -c conf -f simple.conf2 -Dflume.root.logger=INFO,console

打开telnet 测试 node02控制台输出结果

案例4、Exec Source

http://flume.apache.org/FlumeUserGuide.html#exec-source

配置文件
############################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/flume.exec.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############################################################
启动Flume
flume-ng agent -n a1 -c conf -f exec.conf -Dflume.root.logger=INFO,console
创建空文件演示 touch flume.exec.log
循环添加数据
for i in {1..50}; do echo "$i hi flume" >> flume.exec.log ; sleep 0.1; done

案例5、Spooling Directory Source

http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source

目录中有新的文件产生机会将数据输出到控制台。
配置文件
############################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

案例6、hdfs sink

http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

配置文件
############################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/logs
a1.sources.r1.fileHeader = true

# Describe the sink
***只修改上一个spool sink的配置代码块 a1.sinks.k1.type = logger
a1.sinks.k1.type=hdfs

#n1为hdfs集群的nameservice
a1.sinks.k1.hdfs.path=hdfs://n1/flume/%Y-%m-%d/%H%M

##每隔60s或者文件大小超过10M的时候产生新文件
# hdfs有多少条消息时新建文件，0不基于消息个数
a1.sinks.k1.hdfs.rollCount=0
# hdfs创建多长时间新建文件，0不基于时间
a1.sinks.k1.hdfs.rollInterval=60
# hdfs多大时新建文件，0不基于文件大小
a1.sinks.k1.hdfs.rollSize=10240
# 当目前被打开的临时文件在该参数指定的时间（秒）内，没有任何数据写入，则将该临时文件关闭并重命名成目标文件
a1.sinks.k1.hdfs.idleTimeout=3

a1.sinks.k1.hdfs.fileType=DataStream

#配置文件中使用了日期格式，设置为true
a1.sinks.k1.hdfs.useLocalTimeStamp=true

## 每五分钟生成一个目录:
# 是否启用时间上的”舍弃”，这里的”舍弃”，类似于”四舍五入”，后面再介绍。如果启用，则会影响除了%t的其他所有时间表达式
a1.sinks.k1.hdfs.round=true
# 时间上进行“舍弃”的值；
a1.sinks.k1.hdfs.roundValue=5
# 时间上进行”舍弃”的单位，包含：second,minute,hour
a1.sinks.k1.hdfs.roundUnit=minute

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############################################################

启动Flume
flume-ng agent -n a1 -f hdfs.conf -Dflume.root.logger=INFO,console
往/home/logs 目录下书写文件。
查看hdfs文件
hadoop fs -ls /flume/...
hadoop fs -get /flume/...

阅读全文

0 0