Flume in Practice (Importing Log Content into ODPS in Real Time)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 14:15

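Introduction:

This article uses Flume to watch each day's log file and import newly appended content (an increment of over a thousand lines per second) into ODPS, Alibaba Cloud's big data service, in real time. Most material online only shows how to enable Flume's built-in monitoring features through the configuration file, such as watching a file for new lines or listening on a port. Those features are fairly limited, and so are the destinations they can write to. This article instead builds on Flume by writing a custom sink (explained later), a fairly simple way to make Flume import data into whatever data store you need.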

Prerequisites:

Flume 1.7.0 (download link)
Java 1.8
Maven

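Preface:

First, a few key Flume concepts.
1. The event
Flume's core job is to collect data from a data source (source) and deliver it to a destination (sink). To make sure delivery succeeds, the data is buffered in a channel before it reaches the sink, and Flume deletes its buffered copy only after the data has actually arrived at the sink.
What flows through this pipeline is the event, and transactional guarantees are made at the event level. So what is an event? An event wraps the data being transported and is Flume's basic unit of transfer; for a text file it is usually one line. It is also the basic unit of a transaction. An event travels from the source to the channel and then to the sink. Its body is a byte array, and it can carry headers. An event represents the smallest complete unit of data, coming from an external source and going to an external destination.
To make this easier to follow, here is a diagram of the event data flow:
[Figure: simplified diagram of the event data flow, source → channel → sink]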
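To make the shape of an event concrete, here is a minimal sketch in plain Java. LogEvent, fromLine and the "file" header are hypothetical names invented for illustration; this is not Flume's own Event class, only a model of the structure described above: a byte-array body plus string headers. It also decodes the body with an explicit charset, which is an assumption about how the log is encoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a Flume event: an opaque byte[] body plus string headers. */
final class LogEvent {
    private final byte[] body;
    private final Map<String, String> headers;

    LogEvent(byte[] body, Map<String, String> headers) {
        this.body = body.clone();                // defensive copy of the payload
        this.headers = new HashMap<>(headers);   // defensive copy of the metadata
    }

    /** Wraps one line of a text log as an event, encoding it as UTF-8 (assumed encoding). */
    static LogEvent fromLine(String line, String sourceFile) {
        Map<String, String> headers = new HashMap<>();
        headers.put("file", sourceFile);         // header name chosen for illustration only
        return new LogEvent(line.getBytes(StandardCharsets.UTF_8), headers);
    }

    byte[] getBody() {
        return body.clone();
    }

    Map<String, String> getHeaders() {
        return Collections.unmodifiableMap(headers);
    }

    /** Decodes the body with an explicit charset instead of the platform default. */
    String bodyAsString() {
        return new String(body, StandardCharsets.UTF_8);
    }
}
```

Calling LogEvent.fromLine(line, "gps.log").bodyAsString() returns the original line, which is the round trip a sink performs when it turns an event body back into text.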

For an example of setting up monitoring purely by editing the configuration file, this article (http://blog.csdn.net/a2011480169/article/details/51544664) explains it clearly and is worth reading.

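source:
A source is the starting point of Flume's log collection; it watches log files and directories on the file system. Supported types include exec (a command), port listening, directory watching and more.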

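channel:
The channel is Flume's intermediate buffer pipeline, somewhat similar to Kafka's mechanism, so the performance of this component matters a lot.
In my project I mainly use the memory channel, because the data volume is large and I need high throughput and speed. The downside is that if the Flume process goes down, there is no way to resume from where it left off, and events still buffered in memory are lost.
Key parameters:
(1) capacity: the maximum number of events stored in the channel
(2) transactionCapacity: the maximum number of events handled in a single transaction, for example in one transfer from the channel to the sink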
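The two limits can be illustrated with a toy channel in plain Java. This is a sketch, not Flume's MemoryChannel: ToyChannel and its methods are hypothetical, and a full channel here simply rejects the put, whereas the real memory channel waits up to keep-alive seconds before failing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy in-memory channel; it only models the two limits described above. */
final class ToyChannel {
    private final int capacity;             // max events buffered in the channel
    private final int transactionCapacity;  // max events handled in one transaction
    private final Deque<String> queue = new ArrayDeque<>();

    ToyChannel(int capacity, int transactionCapacity) {
        if (transactionCapacity > capacity) {
            throw new IllegalArgumentException("transactionCapacity must not exceed capacity");
        }
        this.capacity = capacity;
        this.transactionCapacity = transactionCapacity;
    }

    /** Source side: returns false when the channel is already full. */
    boolean put(String event) {
        if (queue.size() >= capacity) {
            return false;
        }
        queue.addLast(event);
        return true;
    }

    /** Sink side: takes up to batchSize events within one transaction. */
    List<String> takeBatch(int batchSize) {
        if (batchSize > transactionCapacity) {
            throw new IllegalStateException("batch exceeds transactionCapacity");
        }
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !queue.isEmpty()) {
            batch.add(queue.pollFirst());
        }
        return batch;
    }

    int size() {
        return queue.size();
    }
}
```

With new ToyChannel(3, 2), a fourth put is rejected and takeBatch(3) throws, which mirrors why a sink's batch size has to stay within transactionCapacity.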

sink:
The sink's core job is to write the data in the channel out to a specific destination, such as HDFS, HBase, a database, Avro and so on.

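Java memory settings:
These are set mainly by editing conf/flume-env.sh.
The two main parameters are Xmx and Xms, which should be sized according to the machine's memory. As a rule of thumb, processing about 5,000 lines of Apache logs per second needs 5 to 10 GB.
-Xms  set initial Java heap size
-Xmx  set maximum Java heap size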
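To confirm that the heap settings actually reached a JVM, a small check like the following can help. It is a sketch using only the standard Runtime API; HeapCheck and toMiB are hypothetical names, and it reports the JVM it runs in, so it has to be started with the same -Xms/-Xmx options (in flume-env.sh these are usually passed through the JAVA_OPTS variable) to be meaningful.

```java
/** Prints the heap limits the running JVM actually received. */
final class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the heap may grow to (governed by -Xmx).
        System.out.println("max heap MiB:     " + toMiB(rt.maxMemory()));
        // totalMemory() is the heap currently reserved (starts near -Xms).
        System.out.println("current heap MiB: " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap MiB:    " + toMiB(rt.freeMemory()));
    }
}
```

Running it once with and once without the options makes the difference visible; the exact figures depend on the JVM and garbage collector, so they are not listed here.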

Getting started:

First, the configuration file (in the conf folder of the Flume installation directory; assume mine is called odpsSink.conf).
agent1.sources = source1
agent1.sinks = odpsSink
agent1.channels = channel1

#Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = /home/qcj/flume/flume.sh
agent1.sources.source1.channels = channel1

#Describe odpsSink
agent1.sinks.odpsSink.type=OdpsSink
agent1.sinks.odpsSink.accessId=****
agent1.sinks.odpsSink.accessKey=****
agent1.sinks.odpsSink.odpsUrl=http://service.odps.aliyun.com/api
agent1.sinks.odpsSink.project=****
agent1.sinks.odpsSink.table=****
agent1.sinks.odpsSink.channel = channel1
agent1.sinks.odpsSink.batchSize=3000

#Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive=30
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 3000

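In this configuration file, **** stands for your own settings. Since I write to ODPS, these are the ODPS settings needed to connect to it. If you import into MySQL, you can put its settings here too. Put simply, this section works like a properties file whose values are read later in the sink class you write.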
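The "it works like a properties file" idea can be sketched in plain Java. SinkSettings, forSink and getInt are hypothetical helpers, not Flume classes; the sketch only illustrates how keys under the agent1.sinks.odpsSink. prefix end up as short names such as batchSize, roughly the way the sink later looks them up by name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Extracts the settings of one sink from an agent configuration in properties format. */
final class SinkSettings {
    static Map<String, String> forSink(String conf, String agent, String sink) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(conf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected for an in-memory reader
        }
        String prefix = agent + ".sinks." + sink + ".";
        Map<String, String> settings = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                // Strip the prefix so the sink sees "batchSize", not the full key.
                settings.put(key.substring(prefix.length()), props.getProperty(key).trim());
            }
        }
        return settings;
    }

    /** "Value or default" lookup, as a sink would do for an optional batchSize. */
    static int getInt(Map<String, String> settings, String key, int defaultValue) {
        String value = settings.get(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }
}
```

For the file above, forSink(conf, "agent1", "odpsSink") would contain entries such as table, project and batchSize, and getInt(settings, "batchSize", 100) falls back to 100 only when the key is absent.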

Next, write the sink class:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.TableTunnel.UploadSession;
import com.google.common.base.Preconditions;
import domain.GpsLog;
import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import utils.JsonUtil;

import java.util.ArrayList;
import java.util.List;

public class OdpsSink extends AbstractSink implements Configurable {
    private static final Logger LOG = LoggerFactory.getLogger(OdpsSink.class);

    private String accessId;
    private String accessKey;
    private String odpsUrl;
    private String project;
    private String table;
    private int batchSize;

    public OdpsSink() {
        LOG.info("OdpsSink start...");
    }

    // Reads the agent1.sinks.odpsSink.* settings from the configuration file.
    @Override
    public void configure(Context context) {
        accessId = context.getString("accessId");
        Preconditions.checkNotNull(accessId, "accessId must be set!!");
        accessKey = context.getString("accessKey");
        Preconditions.checkNotNull(accessKey, "accessKey must be set!!");
        odpsUrl = context.getString("odpsUrl");
        Preconditions.checkNotNull(odpsUrl, "odpsUrl must be set!!");
        project = context.getString("project");
        Preconditions.checkNotNull(project, "project must be set!!");
        table = context.getString("table");
        Preconditions.checkNotNull(table, "table must be set!!");
        batchSize = context.getInteger("batchSize", 100);
        // checkArgument, not checkNotNull: a boolean expression is never null.
        Preconditions.checkArgument(batchSize > 0, "batchSize must be a positive number!!");
    }

    @Override
    public void start() {
        super.start();
    }

    @Override
    public void stop() {
        super.stop();
    }

    @Override
    public Status process() throws EventDeliveryException {
        Status result = Status.READY;
        Channel channel = getChannel();
        Transaction transaction = channel.getTransaction();
        Event event;
        String content;

        Account account = new AliyunAccount(accessId, accessKey);
        Odps odps = new Odps(account);
        odps.setEndpoint(odpsUrl);
        odps.setDefaultProject(project);
        TableTunnel tunnel = new TableTunnel(odps);

        List<GpsLog> actions = new ArrayList<GpsLog>();
        transaction.begin();
        try {
            // Take at most batchSize events; back off when the channel is empty.
            for (int i = 0; i < batchSize; i++) {
                event = channel.take();
                if (event != null) {
                    content = new String(event.getBody());
                    GpsLog gpsLog = JsonUtil.deserialize(content, GpsLog.class);
                    actions.add(gpsLog);
                } else {
                    result = Status.BACKOFF;
                    break;
                }
            }
            if (actions.size() > 0) {
                UploadSession uploadSession = tunnel.createUploadSession(project, table);
                RecordWriter recordWriter = uploadSession.openRecordWriter(0);
                Record record = uploadSession.newRecord();
                for (GpsLog temp : actions) {
                    record.setBigint(0, temp.getPktSeq());
                    // ... set the remaining columns here according to your ODPS table schema.
                    recordWriter.write(record);
                }
                recordWriter.close();
                uploadSession.commit(new Long[]{0L});
            }
            transaction.commit();
        } catch (Throwable e) {
            // Covers TunnelException and IOException too: roll back so the
            // events stay in the channel, log the cause, and back off.
            transaction.rollback();
            LOG.error("Upload to ODPS failed; transaction rolled back", e);
            result = Status.BACKOFF;
        } finally {
            transaction.close();
        }
        return result;
    }
}

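A few notes:
1. The license comment at the very top of the code could not be left out in my build, otherwise the Maven build failed. From what I observed, every Java class you depend on, and even the XML files, needed this header. This presumably comes from a license-header check in that build, so a standalone project without such a check may not need it.
2. The configure method loads the configuration, that is, the values from odpsSink.conf above.
3. process is the key method. Flume calls it repeatedly to drain the channel; in it you take the events, process them as you need, and upload them to the target store. Note: the transaction must always be closed at the end, and it must be committed or rolled back before it is closed. Otherwise Flume reports an error the next time process runs.
4. Add the pom dependencies yourself.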
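The commit-or-rollback rule in note 3 can be shown with a toy model in plain Java. This is a sketch, not Flume's Transaction API: BatchDrain, drain and the uploader callback are hypothetical, and a Deque stands in for the channel. The point is only the control flow: take up to batchSize events, try the upload, and on failure return the events to the channel instead of losing them.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Toy version of the take / upload / commit-or-rollback cycle. */
final class BatchDrain {
    enum Status { READY, BACKOFF }

    static Status drain(Deque<String> channel, int batchSize, Consumer<List<String>> uploader) {
        List<String> taken = new ArrayList<>();
        Status result = Status.READY;
        try {
            for (int i = 0; i < batchSize; i++) {
                String event = channel.pollFirst();
                if (event == null) {          // channel ran dry: ask the caller to back off
                    result = Status.BACKOFF;
                    break;
                }
                taken.add(event);
            }
            if (!taken.isEmpty()) {
                uploader.accept(taken);       // may throw, for example on a network error
            }
            // "commit": the taken events are simply not returned to the channel
        } catch (RuntimeException e) {
            // "rollback": put the events back at the head, preserving their order
            for (int i = taken.size() - 1; i >= 0; i--) {
                channel.addFirst(taken.get(i));
            }
            result = Status.BACKOFF;
        }
        return result;
    }
}
```

If the uploader throws, the channel ends up exactly as it was before the call, which is the guarantee the real rollback gives the sink.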

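Next:

After writing the sink class, package it into a jar with Maven and put it into the lib directory under the Flume installation directory. Note that the jars your sink depends on must go there too. For example, since I upload to ODPS I need to add the ODPS jars, which can be found in the .m2 folder under your user directory. If you import into MySQL, add the mysql-connector jar.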

Finally:

This is the last step, but also the most important one. Once all the jars are in lib,
run:

nohup ./flume-ng agent -c ../conf/ -f ../conf/odpsSink.conf -n agent1 -Dflume.root.logger=INFO,console

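Keep all of these parameters, and make sure the agent name after -n matches the one in the configuration file.

Then comes the crucial part: the channel settings in the configuration file.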

agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 3000

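Their meaning was explained above. One more point: the sink class I wrote has a batchSize property, and tuning Flume's performance comes down to choosing good values for these three parameters.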
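As a starting point for tuning, the relationship between the three values can be checked before launching the agent. The sketch below is plain Java with hypothetical names (SizingCheck, validate, batchesPerSecond). It encodes two constraints of the memory channel: transactionCapacity may not exceed capacity, and a sink may not take more events in one transaction than transactionCapacity allows. The throughput helper is simple arithmetic, not a measured figure.

```java
/** Checks the relationship between capacity, transactionCapacity and batchSize. */
final class SizingCheck {
    static String validate(int capacity, int transactionCapacity, int batchSize) {
        if (capacity <= 0 || transactionCapacity <= 0 || batchSize <= 0) {
            return "all three values must be positive";
        }
        if (transactionCapacity > capacity) {
            return "transactionCapacity must not exceed capacity";
        }
        if (batchSize > transactionCapacity) {
            return "batchSize must not exceed transactionCapacity";
        }
        return "ok";
    }

    /** Minimum number of batches per second needed to keep up with the event rate. */
    static long batchesPerSecond(long eventsPerSecond, int batchSize) {
        return (eventsPerSecond + batchSize - 1) / batchSize;  // ceiling division
    }
}
```

For the values used in this article, validate(10000, 3000, 3000) passes, while raising batchSize above 3000 without also raising transactionCapacity would be flagged.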

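Wrapping up:

That's all. I hope everyone can get started with Flume easily. If you have new questions or insights, feel free to leave me a comment. Thanks.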
