Learning Big Data Step by Step (5): Introduction, Configuration, and Use of Flume

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 05:01

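Data collection plays a central role in big data processing, and web logs are one of the main sources of the large volumes of data generated on the internet. Flume is a distributed log collection system: it gathers data from many servers and delivers it to a destination you specify, which can be a file or HDFS. For a more detailed introduction to Flume's architecture, see the post "Flume架构以及应用介绍" by 安静的技术控.

This post covers installing and configuring Flume and a few simple uses, as a first look at Flume.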

Preparation

CentOS 7

apache-flume-1.8.0-bin.tar.gz

Installation

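Note: the directories below are the ones I chose; pick whatever suits your own setup.
1. Upload the .tar.gz package to the virtual machine with Xftp.
2. Unpack it: tar zxvf apache-flume-1.8.0-bin.tar.gz
3. Move the extracted directory: mv apache-flume-1.8.0-bin /usr/share/flume
4. Configure the Flume environment variables.
Open the profile with vi /etc/profile and add:

    #Flume
    export FLUME_HOME=/usr/share/flume
    export PATH=$PATH:$FLUME_HOME/bin

Save and exit, then run source /etc/profile.
Run flume-ng version; if it prints the version information, the installation is correct.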

Worked examples

Case 1: NetCat Source. It listens on a specified network port: whenever an application writes data to that port, the source picks it up.

NetCat Source as described in the Flume documentation:

    Property Name   Default   Description
    channels        –
    type            –         The component type name, needs to be netcat
    bind            –         Host name or IP address that logs are sent to; the netcat source listens on this host
    port            –         Port that logs are sent to; the netcat source listens on this port

Configuration file:

    # Name the agent's components (agent a, one process)
    a.sources=r1
    a.channels=c1
    a.sinks=k1
    a.sources.r1.type=netcat
    a.sources.r1.bind=master
    a.sources.r1.port=8888
    a.sources.r1.channels=c1
    a.channels.c1.type=memory
    a.channels.c1.capacity=1000
    a.channels.c1.transactionCapacity=1000
    a.sinks.k1.channel=c1
    a.sinks.k1.type=logger
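Before starting the agent it can help to check that the file is wired consistently: every source lists its channel under channels, every sink names one channel, and each of those names is declared on the agent. Below is a minimal Python sketch of such a check, assuming the flat key=value layout used in this post. parse_properties and check_wiring are hypothetical helpers written for illustration; they are not part of Flume.

```python
CASE1_CONF = """
# Name the agent's components (agent a)
a.sources=r1
a.channels=c1
a.sinks=k1
a.sources.r1.type=netcat
a.sources.r1.bind=master
a.sources.r1.port=8888
a.sources.r1.channels=c1
a.channels.c1.type=memory
a.channels.c1.capacity=1000
a.channels.c1.transactionCapacity=1000
a.sinks.k1.channel=c1
a.sinks.k1.type=logger
"""


def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comment lines are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems for the given agent name (empty if none)."""
    problems = []
    sources = props.get(f"{agent}.sources", "").split()
    channels = set(props.get(f"{agent}.channels", "").split())
    sinks = props.get(f"{agent}.sinks", "").split()
    if not (sources and channels and sinks):
        problems.append(f"agent '{agent}' must declare sources, channels and sinks")
    for src in sources:
        wired = props.get(f"{agent}.sources.{src}.channels", "").split()
        if not wired:
            problems.append(f"source {src} is not bound to any channel")
        for ch in wired:
            if ch not in channels:
                problems.append(f"source {src} uses undeclared channel {ch}")
    for sink in sinks:
        ch = props.get(f"{agent}.sinks.{sink}.channel")
        if ch is None:
            problems.append(f"sink {sink} is not bound to a channel")
        elif ch not in channels:
            problems.append(f"sink {sink} uses undeclared channel {ch}")
    return problems


props = parse_properties(CASE1_CONF)
print(check_wiring(props, "a"))   # []
print(check_wiring(props, "a1"))  # one problem: nothing is declared for a1
```

The second call illustrates why the agent name passed to flume-ng has to match the prefix used in the file: under any other name there are no components to start.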

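Start Flume agent a (the server side):

    flume-ng agent -n a -c ../conf -f ../conf/netcat.conf -Dflume.root.logger=DEBUG,console
    # -Dflume.root.logger=DEBUG,console prints the log to the console
    # test with: telnet master 8888, then type e.g. 2334 / hello / 1232

From another terminal, connect and type some test data:

    telnet master 8888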
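telnet is convenient for typing by hand; for scripted tests the same thing can be done over a plain TCP socket. A minimal Python sketch, assuming the NetCat source keeps its default behaviour of answering every received line with OK (the ack-every-event property). send_lines is a hypothetical helper name.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each line newline-terminated and return the reply read after each one."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="\n")
        for line in lines:
            sock.sendall((line + "\n").encode("utf-8"))
            # Assumes one acknowledgement line per event; if acknowledgements
            # are turned off, this read waits until the timeout expires.
            replies.append(reader.readline().strip())
    return replies
```

For the agent above one would call send_lines("master", 8888, ["hello"]) and then look for the event in the agent's console log.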

Case 2: NetCat Source again, listening on a specified network port as in Case 1, but with two changes: Sink: hdfs, Channel: file.

Configuration file:

    # Name the components on this agent
    a.sources = r1
    a.sinks = k1
    a.channels = c1

    # Describe/configure the source
    a.sources.r1.type = netcat
    a.sources.r1.bind = master
    a.sources.r1.port = 8888

    # Describe the sink
    a.sinks.k1.type = hdfs
    # Output directory on HDFS
    a.sinks.k1.hdfs.path = hdfs://master:9000/output
    a.sinks.k1.hdfs.writeFormat = Text
    a.sinks.k1.hdfs.fileType = DataStream
    a.sinks.k1.hdfs.rollInterval = 10
    a.sinks.k1.hdfs.rollSize = 0
    a.sinks.k1.hdfs.rollCount = 0
    a.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
    a.sinks.k1.hdfs.useLocalTimeStamp = true

    # Use a channel which buffers events in file
    a.channels.c1.type = file
    a.channels.c1.checkpointDir = /usr/flume/checkpoint
    a.channels.c1.dataDirs = /usr/flume/data

    # Bind the source and sink to the channel
    a.sources.r1.channels = c1
    a.sinks.k1.channel = c1
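The three roll settings decide when the sink closes the current HDFS file and starts a new one; a value of 0 turns that trigger off. With rollInterval = 10, rollSize = 0 and rollCount = 0, only the 10-second timer applies. The Python sketch below is a simplified model of that decision, for illustration only. should_roll is a hypothetical function and not Flume's implementation.

```python
def should_roll(age_seconds, size_bytes, event_count,
                roll_interval=10, roll_size=0, roll_count=0):
    """Return True if any enabled trigger (value > 0) has been reached."""
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count


# Defaults mirror the Case 2 settings: only the timer is active.
print(should_roll(age_seconds=3, size_bytes=50_000_000, event_count=9999))  # False
print(should_roll(age_seconds=10, size_bytes=12, event_count=1))            # True
```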

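Start Flume agent a (the server side):

    flume-ng agent -c conf -f flume-hdfs-test01.properties --name a -Dflume.root.logger=INFO,console

Check the log data Flume collected in HDFS:

Run telnet master 8888 and type some test data, for example 123. The output directory in HDFS then contains a new file named with a timestamp, and that file holds the test data (123).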
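For the six fields used in the file prefix (%Y, %m, %d, %H, %M, %S), Flume's escape sequences have the same meaning as Python's strftime codes, so the timestamp part of the file name can be previewed offline. A small sketch under that assumption: other Flume escapes do not necessarily map to strftime, and the sink appends its own counter after the prefix, which is not modelled here. preview_prefix is a hypothetical helper.

```python
from datetime import datetime


def preview_prefix(pattern, when):
    """Expand a date pattern the way the file prefix would look at time `when`."""
    return when.strftime(pattern)


sample = datetime(2024, 6, 5, 5, 1, 0)  # arbitrary example time
print(preview_prefix("%Y-%m-%d-%H-%M-%S", sample))  # 2024-06-05-05-01-00
```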

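Case 3: Spooling Directory Source. It watches a specified directory: whenever an application adds a new file to that directory, the source picks it up, parses the file's contents, and writes them to the channel. Once the write completes, the file is either marked as completed or deleted. Here Sink: logger, Channel: memory.

Spooling Directory Source as described in the Flume documentation:

    Property Name   Default      Description
    channels        –
    type            –            The component type name, needs to be spooldir.
    spoolDir        –            The directory the Spooling Directory Source watches
    fileSuffix      .COMPLETED   Suffix used to mark a file once its contents have been written to the channel
    deletePolicy    never        When to delete a file after its contents have been written to the channel: never or immediate
    fileHeader      false        Whether to add a header storing the absolute path filename.
    ignorePattern   ^$           Regular expression specifying which files to ignore (skip)
    interceptors    –            Sets headers on events in transit; timestamp is the common choice

Two things to watch out for with the Spooling Directory Source:

① If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing. In other words, a file copied into the spool directory must not be opened and edited again.
② If a file name is reused at a later time, Flume will print an error to its log file and stop processing. In other words, do not copy a file into this directory under a name that has already been used.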
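One way to respect both rules is to finish writing the file somewhere else and then move it into the spool directory under a name that is never repeated. A minimal Python sketch, assuming staging_dir and spool_dir are on the same filesystem (where os.replace is an atomic rename). drop_into_spool and the staging directory are hypothetical, introduced only for this illustration.

```python
import os
import shutil
import uuid
from pathlib import Path


def drop_into_spool(src, staging_dir, spool_dir):
    """Copy src to a staging area, then move it into the spool dir under a unique name."""
    src = Path(src)
    # A random suffix keeps the name unique even if the same source file is sent twice.
    unique_name = f"{src.stem}-{uuid.uuid4().hex}{src.suffix}"
    staged = Path(staging_dir) / unique_name
    shutil.copyfile(src, staged)   # the slow copy happens outside the spool directory
    final = Path(spool_dir) / unique_name
    os.replace(staged, final)      # the file appears in the spool directory already complete
    return final
```

For the configuration in this case the spool directory would be /usr/local/datainput; the staging directory is any other directory on the same disk.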

Configuration file:

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the source
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /usr/local/datainput
    a1.sources.r1.fileHeader = true
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = timestamp

    # Describe the sink
    a1.sinks.k1.type = logger

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Start Flume agent a1 (the server side):

    flume-ng agent -c conf -f flume-nect-test02.properties --name a1 -Dflume.root.logger=INFO,console

The console output shows that the event headers include the timestamp.

Looking at the spooling directory, the data file has been marked after its contents were written to the channel: flume-hdfs.properties.COMPLETED
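The timestamp interceptor stores the time as milliseconds since the Unix epoch in a header named timestamp. A small Python sketch for turning such a value back into a readable time; the sample value is made up for illustration and is not taken from the run above, and header_to_datetime is a hypothetical helper.

```python
from datetime import datetime, timezone


def header_to_datetime(timestamp_header):
    """Convert a millisecond epoch string into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(timestamp_header) / 1000, tz=timezone.utc)


print(header_to_datetime("1717563660000").isoformat())  # 2024-06-05T05:01:00+00:00
```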

Case 4: Spooling Directory Source again, watching a specified directory as in Case 3, but with two changes: Sink: hdfs, Channel: file.

Configuration file:

    # Name the components on this agent
    a.sources = r1
    a.sinks = k1
    a.channels = c1

    # Describe/configure the source
    a.sources.r1.type = spooldir
    a.sources.r1.spoolDir = /usr/local/datainput
    a.sources.r1.fileHeader = true
    a.sources.r1.interceptors = i1
    a.sources.r1.interceptors.i1.type = timestamp

    # Describe the sink
    a.sinks.k1.type = hdfs
    a.sinks.k1.hdfs.path = hdfs://master:9000/output
    a.sinks.k1.hdfs.writeFormat = Text
    a.sinks.k1.hdfs.fileType = DataStream
    a.sinks.k1.hdfs.rollInterval = 10
    a.sinks.k1.hdfs.rollSize = 0
    a.sinks.k1.hdfs.rollCount = 0
    a.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H-%M-%S
    a.sinks.k1.hdfs.useLocalTimeStamp = true

    # Use a channel which buffers events in file
    a.channels.c1.type = file
    a.channels.c1.checkpointDir = /usr/flume/checkpoint
    a.channels.c1.dataDirs = /usr/flume/data

    # Bind the source and sink to the channel
    a.sources.r1.channels = c1
    a.sinks.k1.channel = c1

Start Flume agent a:

    flume-ng agent -c conf -f flume-spooldir-test01.properties --name a -Dflume.root.logger=INFO,console

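Copy a file into the spooling directory (/usr/local/datainput).

The sink's progress log can be followed on the console.

The collected log data can then be found in the output folder in HDFS.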

Case 5: receiving JSON data

Configuration file:

    c.sources=r1 r2
    c.channels=c1
    c.sinks=s1
    c.sources.r1.type=spooldir
    c.sources.r1.spoolDir=flume
    c.sources.r2.type = http
    c.sources.r2.port = 8888
    c.sources.r2.bind = 192.168.13.100
    c.sources.r2.channels = c1
    c.channels.c1.type=memory
    c.channels.c1.capacity=1000
    c.channels.c1.transactionCapacity=100
    c.sinks.s1.type=hdfs
    c.sinks.s1.hdfs.path=/flume/%y-%m-%d
    c.sinks.s1.hdfs.rollInterval=0
    c.sinks.s1.hdfs.writeFormat=Text
    c.sinks.s1.hdfs.fileType=DataStream
    c.sinks.s1.hdfs.rollCount=0
    c.sinks.s1.hdfs.rollSize=10485760
    c.sinks.s1.hdfs.useLocalTimeStamp=true
    c.sources.r1.channels=c1
    c.sinks.s1.channel=c1

Start agent c:

    flume-ng agent -c conf -f flume-http-test01.properties --name c -Dflume.root.logger=INFO,console

Send a JSON request with Postman.

Then look in the date-named folder under the flume folder in HDFS; some of the files there contain the body of the JSON request.
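The same request can be sent from a script instead of Postman. A minimal Python sketch, assuming the HTTP source uses its default JSON handler, which expects a JSON array of events, each with a headers object and a body string. build_payload and post_events are hypothetical helpers, and the source header in the sample is made up.

```python
import json
import urllib.request


def build_payload(bodies, headers=None):
    """Encode event bodies as a JSON array of {"headers": ..., "body": ...} objects."""
    events = [{"headers": dict(headers or {}), "body": body} for body in bodies]
    return json.dumps(events).encode("utf-8")


def post_events(url, bodies, headers=None, timeout=5.0):
    """POST the events to the HTTP source and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(bodies, headers),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


print(build_payload(["hello flume"], {"source": "demo"}).decode("utf-8"))
# [{"headers": {"source": "demo"}, "body": "hello flume"}]
```

With the agent from this case running, post_events("http://192.168.13.100:8888", ["hello flume"]) would deliver one event whose body ends up in the HDFS file.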

Those are some simple uses and basic configurations of Flume. As a way to collect network logs, Flume can be configured to fit whatever your requirements are.
