Flume


Flume Agent

    A Flume Agent's configuration is stored in a local configuration file, a text file in Java properties format. This file defines the properties of each source, channel, and sink, along with how they are linked together to form a data flow. The Flume Agent monitors its input (for example a port) in real time, collects the data, and prints it to the console as log output. One source can fan out to multiple channels, while each sink drains exactly one channel. Developing with Flume means writing configuration files; put plainly, it comes down to the types and properties of the Sources, Channels, and Sinks in an Agent.

    Flume types commonly used in enterprises:

    source (reads the data):
    ->exec (a single file)
    ->spoolingdir (a directory)
    ->taildir (changes to directories as well as files)
    ->kafka
    ->syslog
    ->http

    channel (the pipe):
    ->mem
    ->file
    ->kafka

    sink (sends the channel's data to the destination):
    ->hdfs
    ->hive
    ->hbase
        ->synchronous
        ->asynchronous
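    For example, the fan-out rule above maps directly onto the wiring properties; a minimal sketch (the agent and component names here are illustrative, not from a real config):

    # one source fans out to two channels; each sink drains exactly one channel
    a1.sources.s1.channels = c1 c2
    a1.sinks.k1.channel = c1
    a1.sinks.k2.channel = c2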

Writing an Agent (Basic)

  1. Case 1: source: hive.log, channel: mem, sink: log

    1. Preparation

      Copy a template from the conf directory:
      cp flume-conf.properties.template hive-mem-log.properties
      Watch the Hive log as it grows:
      tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      -Dflume.root.logger=INFO,console : sets the log level and directs output to the console
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-log.properties -Dflume.root.logger=INFO,console
    3. Full configuration
      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = logger

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
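      To verify the flow end to end, append a line to the monitored log and watch the agent console print it via the logger sink; a sketch, assuming the hive.log path above is writable:

      # generate one test event
      echo "flume exec source test $(date)" >> /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log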
  2. Case 2: channel: file

    1. Preparation

      First create the directories under /opt/datas: mkdir -p flume/datas and mkdir flume/check.
      The following two parameters are required:
      a1.channels.c1.dataDirs = /opt/datas/flume/datas
      a1.channels.c1.checkpointDir = /opt/datas/flume/check
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-file-log.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = file
      a1.channels.c1.dataDirs = /opt/datas/flume/datas
      a1.channels.c1.checkpointDir = /opt/datas/flume/check

      # define sink
      a1.sinks.k1.type = logger

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
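      Unlike the memory channel, the file channel persists in-flight events on disk, so the agent can survive a restart without losing buffered data. A quick way to confirm it is working is to look inside the two directories; a sketch (the log-* naming is the file channel's own convention for its data files):

      # event data is written here as log-* files
      ls /opt/datas/flume/datas
      # periodic snapshots of the channel state land here
      ls /opt/datas/flume/check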
  3. Case 3: sink: hdfs

    1. Preparation

      Case analysis: collect data to HDFS in real time. This case monitors the Hive log file and stores it in an HDFS directory: an Exec Source watches the file for new data, a Memory Channel buffers it, and an HDFS Sink writes it out.
      The hdfs.path property (e.g. a1.sinks.k1.hdfs.path = /flume/event/size) can be specified in three ways: 1. configure a global variable; 2. rely on the Hadoop configuration files copied into conf; 3. write out the absolute HDFS path.
      HDFS directory used here: a1.sinks.k1.hdfs.path = /flume/event
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-hdfs.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
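      Once a few events have flowed through, the sink's output can be checked directly on HDFS; a sketch, assuming the sink's default FlumeData file prefix (files still being written carry a .tmp suffix):

      # list what the HDFS sink has written so far
      hdfs dfs -ls /flume/event
      # print the collected log lines
      hdfs dfs -cat /flume/event/FlumeData.*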

Problems with the Agent So Far (Intermediate)

  1. Case 4: rolling files by size

    1. Preparation

      # define sink: to roll purely by size, disable the time- and count-based triggers by setting them to 0
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event/size
      a1.sinks.k1.hdfs.fileType = DataStream
      a1.sinks.k1.hdfs.rollSize = 10240
      a1.sinks.k1.hdfs.rollInterval = 0
      a1.sinks.k1.hdfs.rollCount = 0
      (A quick sanity check of the roll size follows the full configuration below.)
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-size.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event/size
      a1.sinks.k1.hdfs.fileType = DataStream
      a1.sinks.k1.hdfs.rollSize = 10240
      a1.sinks.k1.hdfs.rollInterval = 0
      a1.sinks.k1.hdfs.rollCount = 0

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
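      A rough sanity check of the 10240-byte roll size: push a few tens of KB through the source and confirm that each closed file on HDFS ends up near that size; a sketch:

      # append roughly 25 KB to the tailed log in one go
      for i in $(seq 1 1000); do echo "roll-size test line $i"; done >> /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      # each closed file should be roughly 10240 bytes
      hdfs dfs -ls /flume/event/size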
  2. Case 5: Hive partitioned tables and changing the file name prefix

    1. Hive partitioning

      a1.sinks.k1.hdfs.path = /flume/event/date/date=%Y%m%d/hour=%H%M
    2. Changing the file name prefix

      a1.sinks.k1.hdfs.filePrefix = hive-log
    3. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/hive-mem-part.properties -Dflume.root.logger=INFO,console
    4. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/event/date/date=%Y%m%d/hour=%H%M
      a1.sinks.k1.hdfs.fileType = DataStream
      a1.sinks.k1.hdfs.rollSize = 10240
      a1.sinks.k1.hdfs.rollInterval = 0
      a1.sinks.k1.hdfs.rollCount = 0
      a1.sinks.k1.hdfs.useLocalTimeStamp = true
      a1.sinks.k1.hdfs.filePrefix = hive-log

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
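      After some events flow, the sink creates one directory per escape-sequence value, which lines up with a Hive partition layout; a sketch of inspecting it (the date and time shown are just examples):

      # one directory per day
      hdfs dfs -ls /flume/event/date
      # files for one hour-minute bucket, prefixed with hive-log
      hdfs dfs -ls "/flume/event/date/date=20161202/hour=1030"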

Log Storage in the Enterprise (Advanced I)

  1. Case 6: source: spooldir

    1. Preparation

      Log files in production are typically generated like this:
      web_log/20161127.log.tmp  ->  20161127.log
              20161128.log.tmp
      How do we collect this data? exec can only monitor a single file.
      Requirement: dynamically detect new files appearing in a directory and collect them.
      source: spooldir monitors a directory.
      Create the directory: /opt/datas/flume/spooldir
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/dir-mem-hdfs.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = spooldir
      a1.sources.s1.spoolDir = /opt/datas/flume/spooldir

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/dir
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
    4. Test
      Copy a file into the directory; once processing finishes, its name changes to 20161202.log.COMPLETED, indicating the upload is complete.

    5. Optimization: ignore files ending in .tmp
      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = spooldir
      a1.sources.s1.spoolDir = /opt/datas/flume/spooldir
      a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/dir-ig
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
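      A quick test of the pattern: place one in-progress file and one finished file in the directory; only the finished one should be collected; a sketch (the file names are illustrative):

      cd /opt/datas/flume/spooldir
      # matches ignorePattern, so the source skips it
      echo "still being written" > 20161203.log.tmp
      # collected, then renamed with the .COMPLETED suffix
      echo "finished log" > 20161203.log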

Enterprise Architecture (Advanced II)

  1. Multi-sink configuration: one sink per channel

    1. Full configuration
      # define agent
      a1.sources = s1
      a1.channels = hdfsc1 hdfsc2
      a1.sinks = hdfs1 hdfs2

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel1
      a1.channels.hdfsc1.type = file
      a1.channels.hdfsc1.dataDirs = /opt/datas/flume/data1
      a1.channels.hdfsc1.checkpointDir = /opt/datas/flume/check1

      # define channel2
      a1.channels.hdfsc2.type = file
      a1.channels.hdfsc2.dataDirs = /opt/datas/flume/data2
      a1.channels.hdfsc2.checkpointDir = /opt/datas/flume/check2

      # define sink1
      a1.sinks.hdfs1.type = hdfs
      a1.sinks.hdfs1.hdfs.path = /flume/dir1
      a1.sinks.hdfs1.hdfs.fileType = DataStream

      # define sink2
      a1.sinks.hdfs2.type = hdfs
      a1.sinks.hdfs2.hdfs.path = /flume/dir2
      a1.sinks.hdfs2.hdfs.fileType = DataStream

      # bind source and sinks to the channels
      a1.sources.s1.channels = hdfsc1 hdfsc2
      a1.sinks.hdfs1.channel = hdfsc1
      a1.sinks.hdfs2.channel = hdfsc2
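      This works because the source's channel selector defaults to replicating: every event is copied into both channels and ends up under both HDFS paths. If the goal were instead to route events to different channels by header value, the selector can be switched to multiplexing; a minimal sketch, assuming a hypothetical header named logtype (the header name and values below are illustrative, not part of the case above):

      # route by header value instead of copying to every channel
      a1.sources.s1.selector.type = multiplexing
      a1.sources.s1.selector.header = logtype
      a1.sources.s1.selector.mapping.web = hdfsc1
      a1.sources.s1.selector.mapping.app = hdfsc2
      a1.sources.s1.selector.default = hdfsc1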
  2. flume collect

    1. Preparation

      avro sink & avro source
      Agent side: collect the logs and forward them to an avro source.
      Collector side: receive the agents' data and hand it to the HDFS sink.
      Once avro-collect.properties is configured, ship the configuration to the other machines in the cluster.
    2. Run commands

      Run the collector on the master host:
      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro-collect.properties -Dflume.root.logger=INFO,console
      Run agent1 on slave1:
      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro-agent.properties -Dflume.root.logger=INFO,console
      Run agent2 on slave2:
      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/avro-agent.properties -Dflume.root.logger=INFO,console
    3. Full configuration: avro-collect.properties

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = avro
      a1.sources.s1.bind = 192.168.134.191
      a1.sources.s1.port = 50505

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.hdfs.path = /flume/avro
      a1.sinks.k1.hdfs.fileType = DataStream

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
    4. Full configuration: avro-agent.properties

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = exec
      a1.sources.s1.command = tail -F /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      a1.sources.s1.shell = /bin/sh -c

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = avro
      a1.sinks.k1.hostname = 192.168.134.191
      a1.sinks.k1.port = 50505

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
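      With the collector and both agents running, a simple end-to-end check is to generate an event on a slave and read it back from HDFS on the collector side; a sketch, assuming the sink's default FlumeData file prefix:

      # on slave1 or slave2: append a line to the tailed log
      echo "avro pipeline test $(date)" >> /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
      # on the collector host: the event should arrive under /flume/avro
      hdfs dfs -cat /flume/avro/FlumeData.*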

Using taildir (Advanced III)

  1. About taildir
    1. Requirement
      We need to monitor individual files and a directory at the same time, but neither exec nor spoolingdir can do both, so we use taildir. It only ships with Flume 1.7 and later, so on this 1.5 build the source has to be compiled by hand.
  2. Building taildir from the GitHub source

    1. Create a new directory; avoid Chinese characters in the path
    2. Enter the directory and run

      git clone https://github.com/apache/flume.git
    3. When the download finishes, enter the flume directory

      List remote branches: git branch -r
      List all branches (local and remote): git branch -a
      Switch to the 1.7 branch: git checkout origin/flume-1.7
    4. Import into Eclipse

      1. Replace the jar versions

        Change the version from 1.5.0 to 1.5.0-cdh5.3.6 (the CDH version) and add the CDH Maven repository; the easiest route is to copy in the provided pom directly.
      2. Possible errors

        1. Delete the overwrite
        2. Copy flume-1.7/core/source/PollableSourceConstants.class into the project
    5. Build with Maven
    6. Put the resulting jar into Flume's lib directory: flume-taildir-source-1.5.0-cdh5.3.6.jar
  3. Running taildir

    1. Preparation

      Create a file:      [beifeng@hadoop-senior01 flume]$ echo " " > hadoop10.log
      Create a directory: [beifeng@hadoop-senior01 flume]$ mkdir hadoop10
    2. Run command

      bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/taildir-mem-log.properties -Dflume.root.logger=INFO,console
    3. Full configuration

      # define agent
      a1.sources = s1
      a1.channels = c1
      a1.sinks = k1

      # define source
      a1.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
      a1.sources.s1.positionFile = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6-bin/position/taildir_position.json
      a1.sources.s1.filegroups = f1 f2
      a1.sources.s1.filegroups.f1 = /opt/datas/flume/hadoop10.log
      a1.sources.s1.headers.f1.headerKey1 = value1
      a1.sources.s1.filegroups.f2 = /opt/datas/flume/hadoop10/.*
      a1.sources.s1.headers.f2.headerKey1 = value2-1
      a1.sources.s1.headers.f2.headerKey2 = value2-2
      a1.sources.s1.fileHeader = true

      # define channel
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 100
      a1.channels.c1.transactionCapacity = 100

      # define sink
      a1.sinks.k1.type = logger

      # bind source and sink to the channel
      a1.sources.s1.channels = c1
      a1.sinks.k1.channel = c1
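      To see both file groups in action, append to the tracked file and to a new file inside the tracked directory; the logger sink should print events from both, carrying the headers configured above (the file name app1.log is illustrative):

      cd /opt/datas/flume
      # matched by filegroups.f1 (the single tracked file)
      echo "single-file group f1" >> hadoop10.log
      # matched by filegroups.f2 (any file under hadoop10/)
      echo "directory group f2" >> hadoop10/app1.log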
    4. Results
      Test
      Console output
      Contents of taildir_position.json
