Flume Configuration and Examples

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 08:02

Flume, a log file collection framework

Official site: http://flume.apache.org/

I. Introduction to Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

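Flume collects data in real time and is commonly used in pipelines such as Flume/Kafka --> Storm/Spark Streaming.

Distributed: every server runs a Flume client (agent) that collects data.

Architecture: source, channel (the pipe), and sink (the destination).

Web server: the data sits in some directory on the server.

Agent: the source fetches data from the remote end and wraps it into events (an event consists of optional headers, which hold an unordered set of key-value pairs, and a byte array) --> the events are put into the channel (NIO stream)

--> the sink takes the events from the channel.

The channel provides fault tolerance and recovery.

A source can write to multiple channels, but a sink can read from only one channel.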
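
The wiring rules show up directly in the configuration syntax: a source lists its channels under the plural `channels` key, while a sink names exactly one under the singular `channel` key. The following is a minimal sketch, assuming a hypothetical agent `a1` with one spooling directory source fanned out to two memory channels, each drained by its own logger sink. The function name `write_fanout_conf`, the spool path, and all component names are illustrative only.

```shell
# Write a hypothetical fan-out agent config to the file given as $1.
# All component names (a1, s1, c1, c2, k1, k2) and the spool path are assumptions.
write_fanout_conf() {
  cat > "$1" <<'EOF'
a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

# one source may feed several channels (plural key)
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /tmp/flume/spool
a1.sources.s1.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# each sink drains exactly one channel (singular key)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c2
EOF
}
```

Calling `write_fanout_conf /tmp/fanout.conf` produces a file that could be passed to `flume-ng agent` with `-f`. With the default replicating channel selector, each event is copied to both channels.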

Optionally, a filter for data cleaning (an interceptor, in Flume's terms) can be added between the source and the channel.

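II. Setting up the Flume environment

User guide: http://flume.apache.org/FlumeUserGuide.html

Download: http://archive.cloudera.com/cdh5/cdh/5/
1. Flume versions
flume-og -> the original generation
flume-ng -> the next generation
2. Installation and deployment
1. Download and extract: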
tar -zxvf flume-ng-1.5.0-cdh5.3.6.tar.gz -C /opt/modules/cdh-5.3.6/
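2. Edit the configuration files
-- Requirements: Java, HDFS
-- Edit conf/flume-env.sh to set the Java environment variable (Flume is written in Java):
export JAVA_HOME=/opt/modules/modules/jdk1.7.0_67

-- Copy the HDFS configuration files into the flume-ng-1.5.0-cdh5.3.6/conf directory

-- Copy the HDFS-related JARs into lib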
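
The configuration steps above can be sketched as a small shell function. This is only an illustration under assumptions: `prepare_flume_conf` is a hypothetical helper, and `core-site.xml` and `hdfs-site.xml` are assumed to be the HDFS configuration files meant here. Copying the HDFS JARs is left out because the exact JAR list depends on the Hadoop version.

```shell
# Hypothetical helper: prepare a Flume conf directory for use with HDFS.
#   $1 = Flume home, $2 = JAVA_HOME to record, $3 = Hadoop config directory
prepare_flume_conf() {
  flume_home=$1; java_home=$2; hadoop_conf=$3
  mkdir -p "$flume_home/conf" || return 1
  # flume-env.sh is sourced by bin/flume-ng at startup, so JAVA_HOME goes there
  printf 'export JAVA_HOME=%s\n' "$java_home" >> "$flume_home/conf/flume-env.sh"
  # Assumption: these two files are the HDFS client configuration Flume needs
  for f in core-site.xml hdfs-site.xml; do
    cp "$hadoop_conf/$f" "$flume_home/conf/" || return 1
  done
}
```

With the paths used in this article, a call might look like `prepare_flume_conf /opt/modules/cdh-5.3.6/flume-ng-1.5.0-cdh5.3.6 /opt/modules/modules/jdk1.7.0_67 "$HADOOP_HOME/etc/hadoop"`, where the Hadoop directory is an assumption about the local layout.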

III. Examples

Example: spooling directory source with an HDFS sink

spooling directory source: This source will watch the specified directory for new files, and will parse events out of new files as they appear.

Prerequisite: the HDFS-related JARs have already been added to lib

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# define source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/data/flume/log
# regular expression for the files to ignore (skip)
a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)

# define channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/data/flume/check
a1.channels.c1.dataDirs = /opt/data/flume/data

# define sink
a1.sinks.k1.type = hdfs
# full HDFS path
a1.sinks.k1.hdfs.path = hdfs://rainbow.com.cn:8020/flume/envent/%y-%m-%d
# file format: DataStream needs no compression codec; CompressedStream requires hdfs.codeC to be set
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.filePrefix = rainbow

# combine
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
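
The `%y-%m-%d` escapes in `hdfs.path` are filled in from a timestamp, and `useLocalTimeStamp = true` makes the sink take that timestamp from the agent's local clock instead of an event header. These three escapes mean the same thing to `date`, so the directory the sink would write to today can be previewed with a rough sketch like the one below. `preview_hdfs_dir` is a hypothetical helper, not part of Flume.

```shell
# Hypothetical helper: print the directory that a path template such as
# /flume/envent/%y-%m-%d would resolve to for the current local date.
# Only escapes shared by Flume and date(1) (%y, %m, %d here) are safe to preview.
preview_hdfs_dir() {
  date "+$1"
}
```

For example, `preview_hdfs_dir 'hdfs://rainbow.com.cn:8020/flume/envent/%y-%m-%d'` prints the path with a two-digit year, month, and day. As for rolling, `rollCount = 0` and `rollInterval = 0` switch off count-based and time-based rolling, so in this configuration only `rollSize` (10240 bytes) starts a new file.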

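Start HDFS before running the agent.

Then start Flume, giving the agent configuration file after -f:
bin/flume-ng agent -c conf/ -n a1 -f <agent config file> -Dflume.root.logger=INFO,console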

Files under the check and data directories after the agent has started:

When a new file appears in the log directory, Flume writes its contents to HDFS.
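
The spooling directory source expects a file to be complete and unchanged once it appears in the watched directory. One common way to honor that, sketched below, is to copy the file in under a temporary `.tmp` name and then rename it. `deliver_log` is a hypothetical helper, and it assumes the source is configured to ignore `.tmp` files.

```shell
# Hypothetical helper: place a finished log file into the spooling directory.
#   $1 = file to deliver, $2 = spooling directory (e.g. /opt/data/flume/log)
deliver_log() {
  src=$1; spool=$2
  name=$(basename "$src")
  # Copy under a .tmp name first so a half-written file is never picked up
  cp "$src" "$spool/$name.tmp" || return 1
  # Renaming within one filesystem is atomic, so the file appears complete
  mv "$spool/$name.tmp" "$spool/$name"
}
```

A call such as `deliver_log /var/log/app/access.log /opt/data/flume/log` (the first path is made up for illustration) leaves only the fully written file in the spooling directory.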

On HDFS:

The first screenshot was taken without the following property:

# regular expression for the files to ignore (skip)
a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)

Each start then creates a new .tmp file.

With this property set, as in the second screenshot, that no longer happens.
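
To see which file names the pattern skips, it can be tried outside Flume. The sketch below uses `grep -E` rather than Java's regex engine and assumes the pattern has to match the whole file name. For this simple expression the two engines should behave alike. `is_ignored` is a hypothetical helper.

```shell
# Hypothetical helper: succeed (exit 0) if the file name in $1 is matched
# in full by the ignorePattern from the configuration.
is_ignored() {
  printf '%s\n' "$1" | grep -Eqx '([^ ]*\.tmp$)'
}

# Try a few sample names (all made up for illustration)
for name in access.log access.log.tmp 'my file.tmp'; do
  if is_ignored "$name"; then
    echo "$name: ignored"
  else
    echo "$name: picked up"
  fi
done
```

Run as written, the loop reports `access.log.tmp` as ignored and the other two names as picked up. The name containing a space is not matched because `[^ ]*` cannot cross a space.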

0 0