Using Flume Interceptors

Source: Internet  Published by: 置乱算法 arnold  Editor: 程序博客网  Time: 2024/05/16 00:45

log4j.properties configuration:

log4j.rootLogger=INFO
log4j.category.com.besttone=INFO,flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 44444

log4j.appender.flume.UnsafeMode = true

You need to add /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/flume-ng/tools/flume-ng-log4jappender-1.4.0-cdh5.0.0-jar-with-dependencies.jar to the classpath.

Then write a simple test class to try it out:

package com.besttone.flume;  
  
import java.util.Date;  
  
import org.apache.commons.logging.Log;  
import org.apache.commons.logging.LogFactory;  
  
public class WriteLog {  
    protected static final Log logger = LogFactory.getLog(WriteLog.class);  
  
    /** 
     * @param args 
     * @throws InterruptedException  
     */  
    public static void main(String[] args) throws InterruptedException {  
        while (true) {  
            // Log the current system timestamp every two seconds  
            logger.info(new Date().getTime());  
            Thread.sleep(2000);  
        }  
    }  
}  

Then write a run.sh script to run this class:

#!/bin/bash  
jarlist=`ls ./lib/*.jar`  
CLASSPATH='./bin/'  
for jar in ${jarlist}  
do  
   CLASSPATH=${CLASSPATH}:${jar}  
done  
echo ${CLASSPATH}  
  
java -classpath "$CLASSPATH" com.besttone.flume.WriteLog &  

Run run.sh with the sink set to logger, then check Flume's log file. You can see that the log4j output has been delivered to Flume:

2014-07-16 14:23:54,193 INFO org.apache.flume.sink.LoggerSink: Event: { headers:{flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8, flume.client.log4j.logger.name=com.besttone.flume.WriteLog, flume.client.log4j.timestamp=1405491834189} body: 31 34 30 35 34 39 31 38 33 34 31 38 39 1405491834189 }
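
To read the log line above, note that the logger sink prints the body as hex bytes followed by their text form, and that flume.client.log4j.log.level=20000 is the integer value of log4j's INFO level. The following is a minimal sketch, not part of Flume, that decodes such a hex dump back to text; the class name BodyHexDecoder is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class BodyHexDecoder {
    // Converts a space-separated hex dump (as printed by the logger sink) back to text.
    static String decode(String hexDump) {
        String[] parts = hexDump.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hex bytes copied from the log line above; prints 1405491834189
        System.out.println(decode("31 34 30 35 34 39 31 38 33 34 31 38 39"));
    }
}
```

The decoded text is the same millisecond timestamp that WriteLog passed to logger.info.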

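As I understand it, a Flume interceptor is attached to a source and processes each event that the source receives from the application before the event is written to the channel. In other words, it is the place to wrap, clean, or filter log events on their way in.

The built-in interceptors listed in the official documentation include: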

Timestamp Interceptor

Host Interceptor

Static Interceptor

Regex Filtering Interceptor

Regex Extractor Interceptor

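Like the interceptors in many open-source Java projects such as Spring MVC, Flume interceptors form a chain: you can assign several interceptors to one source, and they are applied in the order in which they are listed.

Timestamp Interceptor: adds a header with the key timestamp whose value is the current time in milliseconds. This interceptor is very useful when the sink is HDFS, as an example further down shows.

Host Interceptor: adds a header with the key host whose value is the hostname or IP address of the current machine.

Static Interceptor: adds a header with a user-defined key and value.

Regex Filtering Interceptor: matches the event body against a regular expression and either keeps or drops the matching events.

Regex Extractor Interceptor: matches the event body against a regular expression and adds headers with the configured keys, whose values are the matched groups.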
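
The chain behaviour described above can be sketched in plain Java. This is a simplified model, not Flume's implementation: SimpleEvent, regexFilter, timestamp, and staticHeader are hypothetical stand-ins that only mimic what the built-in interceptors do, and returning null stands for dropping an event.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

class InterceptorChainSketch {
    // Simplified stand-in for a Flume event: headers plus a text body.
    static final class SimpleEvent {
        final Map<String, String> headers = new HashMap<>();
        final String body;
        SimpleEvent(String body) { this.body = body; }
    }

    // Keeps only events whose body matches the regex; null means "drop".
    static UnaryOperator<SimpleEvent> regexFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return e -> pattern.matcher(e.body).find() ? e : null;
    }

    // Adds the current time in milliseconds under the key "timestamp".
    static UnaryOperator<SimpleEvent> timestamp() {
        return e -> {
            e.headers.put("timestamp", Long.toString(System.currentTimeMillis()));
            return e;
        };
    }

    // Adds a fixed, user-defined header.
    static UnaryOperator<SimpleEvent> staticHeader(String key, String value) {
        return e -> {
            e.headers.put(key, value);
            return e;
        };
    }

    // Applies the interceptors in list order; a dropped event skips the rest.
    static List<SimpleEvent> run(List<UnaryOperator<SimpleEvent>> chain, List<String> bodies) {
        List<SimpleEvent> kept = new ArrayList<>();
        for (String body : bodies) {
            SimpleEvent event = new SimpleEvent(body);
            for (UnaryOperator<SimpleEvent> interceptor : chain) {
                event = interceptor.apply(event);
                if (event == null) {
                    break;
                }
            }
            if (event != null) {
                kept.add(event);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleEvent> kept = run(
                Arrays.asList(regexFilter("\\{.*\\}"), timestamp(), staticHeader("env", "test")),
                Arrays.asList("1405491834189", "{\"requestTime\":1}"));
        for (SimpleEvent e : kept) {
            System.out.println(e.body + " " + e.headers.keySet());
        }
    }
}
```

Because the filter comes first in the list, the plain timestamp line is dropped before the later interceptors ever see it.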

The examples below show how to use these interceptors. First, we modify the WriteLog class from the first article:

public class WriteLog {  
    protected static final Log logger = LogFactory.getLog(WriteLog.class);  
  
    /** 
     * @param args 
     * @throws InterruptedException 
     */  
    public static void main(String[] args) throws InterruptedException {  
        // TODO Auto-generated method stub  
        while (true) {  
            logger.info(new Date().getTime());  
            logger.info("{\"requestTime\":"  
                    + System.currentTimeMillis()  
                    + ",\"requestParams\":{\"timestamp\":1405499314238,\"phone\":\"02038824941\",\"cardName\":\"测试商家名称\",\"provinceCode\":\"440000\",\"cityCode\":\"440106\"},\"requestUrl\":\"/reporter-api/reporter/reporter12/init.do\"}");  
            Thread.sleep(2000);  
  
        }  
    }  
}  

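One more log statement has been added, so each loop iteration now outputs two log lines: the first is a timestamp and the second is a JSON-formatted string.

Next we use the regex_filter and timestamp interceptors to implement the following:

1 Filter out the timestamp line that log4j outputs first, and collect only the JSON-formatted log lines.

2 Save the collected logs to HDFS, with each day's logs in a directory named after that day. For example, logs from 2014-07-25 go to /flume/events/14-07-25.

The modified flume.conf is as follows: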

tier1.sources=source1  
tier1.channels=channel1  
tier1.sinks=sink1  
  
tier1.sources.source1.type=avro  
tier1.sources.source1.bind=0.0.0.0  
tier1.sources.source1.port=44444  
tier1.sources.source1.channels=channel1  
  
tier1.sources.source1.interceptors=i1 i2  
tier1.sources.source1.interceptors.i1.type=regex_filter  
tier1.sources.source1.interceptors.i1.regex=\\{.*\\}  
tier1.sources.source1.interceptors.i2.type=timestamp  
  
tier1.channels.channel1.type=memory  
tier1.channels.channel1.capacity=10000  
tier1.channels.channel1.transactionCapacity=1000  
tier1.channels.channel1.keep-alive=30  
  
tier1.sinks.sink1.type=hdfs  
tier1.sinks.sink1.channel=channel1  
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%y-%m-%d  
tier1.sinks.sink1.hdfs.fileType=DataStream  
tier1.sinks.sink1.hdfs.writeFormat=Text  
tier1.sinks.sink1.hdfs.rollInterval=0  
tier1.sinks.sink1.hdfs.rollSize=10240  
tier1.sinks.sink1.hdfs.rollCount=0  
tier1.sinks.sink1.hdfs.idleTimeout=60  

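We added two interceptors, i1 and i2, to source1. i1 is a regex_filter whose regular expression is written as \\{.*\\}. Note the escaping: the properties file format consumes one level of backslashes, so the regex engine receives \{.*\}. Without the escaping, source1 fails to start and reports an error.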
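
The escaping can be checked outside Flume. The sketch below assumes the agent configuration is parsed with the standard java.util.Properties escaping rules; the key name regex and the class RegexEscapeDemo are only for illustration. It loads the value the way a properties parser would and then applies it to the two kinds of log lines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexEscapeDemo {
    // Parses one line of properties text and returns the value stored under "regex".
    static String loadRegex(String propertiesLine) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesLine));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("regex");
    }

    // Reports whether the regex engine accepts the expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Java source doubles each backslash again: the file text here is regex=\\{.*\\}
        String escaped = loadRegex("regex=\\\\{.*\\\\}");
        System.out.println(escaped); // the regex engine sees \{.*\}

        Pattern pattern = Pattern.compile(escaped);
        System.out.println(pattern.matcher("{\"requestTime\":1405499314238}").find());
        System.out.println(pattern.matcher("1405491834189").find());

        // A single backslash in the file is silently dropped by the parser,
        // leaving {.*}, which the regex engine may refuse to compile.
        String underEscaped = loadRegex("regex=\\{.*\\}");
        System.out.println(underEscaped + " compiles: " + compiles(underEscaped));
    }
}
```

The JSON line matches and is kept, while the bare timestamp line does not match and would be filtered out.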

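i2 is a timestamp interceptor, which adds a timestamp key to the header. We also changed sink1.hdfs.path by appending /%y-%m-%d. These escape sequences require the event header to contain the timestamp key, which is why we add the timestamp interceptor. Without it, the escape sequences cannot be resolved and the sink reports an error. Many other escape sequences are available; see the official documentation.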
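
To illustrate why the timestamp header is needed, here is a rough model of how /%y-%m-%d could be resolved from it. This is not the HDFS sink's code: DatePathSketch and expand are hypothetical names, and the time zone is passed explicitly here only to make the result reproducible.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

class DatePathSketch {
    // Appends a yy-MM-dd directory derived from the "timestamp" header (epoch milliseconds).
    static String expand(String basePath, Map<String, String> headers, ZoneId zone) {
        String timestamp = headers.get("timestamp");
        if (timestamp == null) {
            // Mirrors the failure described above when no timestamp header exists.
            throw new IllegalStateException("timestamp header is required to resolve %y-%m-%d");
        }
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yy-MM-dd").withZone(zone);
        return basePath + "/" + formatter.format(Instant.ofEpochMilli(Long.parseLong(timestamp)));
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        // Timestamp taken from the logger sink output shown earlier.
        headers.put("timestamp", "1405491834189");
        // prints /flume/events/14-07-16
        System.out.println(expand("/flume/events", headers, ZoneOffset.UTC));
    }
}
```

An event without the header makes expand throw, which corresponds to the error the sink reports when the timestamp interceptor is left out.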

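Then run WriteLog and check the files in the corresponding directory on HDFS. They contain only the JSON log lines, which matches the requirement described above.

Now recall that the spooldir source can write the file name into the event header under the key basename. Suppose an interceptor could intercept the event, read the value of that key from the header, split it into three parts, and put each part into the header. That would meet the requirement from part 8 of this series.

Unfortunately, Flume does not provide an interceptor that extracts from headers. It does have one that extracts from the body, RegexExtractorInterceptor, which looks quite powerful. Here is an example from the official documentation: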

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used

a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3

In other words, with this configuration, if the event body contains 1:2:3.4foobar5, the regular expression extracts the matched groups and sets them as headers.
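
The documentation example can be reproduced with java.util.regex alone. The sketch below is only a model of the extraction step, with the hypothetical names RegexExtractSketch and extract; it maps capture group N to the N-th configured header name, as the serializers s1, s2, s3 do.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractSketch {
    // Maps capture group i+1 to names.get(i); groups without a name are skipped.
    static Map<String, String> extract(String text, String regex, List<String> names) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (matcher.find()) {
            int count = Math.min(matcher.groupCount(), names.size());
            for (int i = 0; i < count; i++) {
                headers.put(names.get(i), matcher.group(i + 1));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        // prints {one=1, two=2, three=3}
        System.out.println(extract("1:2:3.4foobar5", "(\\d):(\\d):(\\d)",
                Arrays.asList("one", "two", "three")));
    }
}
```

With only two names configured, the third group is simply ignored, which matches the second example in the interceptor's Javadoc.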

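So I decided to build on this interceptor: it seemed that changing the code slightly, from extracting from the body to extracting from a specific header key, would be enough. The source code turned out to be tidy and easy to modify. Here is the interceptor I added, RegexExtractorExtInterceptor: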

package com.besttone.flume;  
  
import java.util.List;  
import java.util.Map;  
import java.util.regex.Matcher;  
import java.util.regex.Pattern;  
  
import org.apache.commons.lang.StringUtils;  
import org.apache.flume.Context;  
import org.apache.flume.Event;  
import org.apache.flume.interceptor.Interceptor;  
import org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer;  
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;  
import org.slf4j.Logger;  
import org.slf4j.LoggerFactory;  
  
import com.google.common.base.Charsets;  
import com.google.common.base.Preconditions;  
import com.google.common.base.Throwables;  
import com.google.common.collect.Lists;  
  
/** 
 * Interceptor that extracts matches using a specified regular expression and 
 * appends the matches to the event headers using the specified serializers</p> 
 * Note that all regular expression matching occurs through Java's built in 
 * java.util.regex package</p>. Properties: 
 * <p> 
 * regex: The regex to use 
 * <p> 
 * serializers: Specifies the group the serializer will be applied to, and the 
 * name of the header that will be added. If no serializer is specified for a 
 * group the default {@link RegexExtractorInterceptorPassThroughSerializer} will 
 * be used 
 * <p> 
 * Sample config: 
 * <p> 
 * agent.sources.r1.channels = c1 
 * <p> 
 * agent.sources.r1.type = SEQ 
 * <p> 
 * agent.sources.r1.interceptors = i1 
 * <p> 
 * agent.sources.r1.interceptors.i1.type = REGEX_EXTRACTOR 
 * <p> 
 * agent.sources.r1.interceptors.i1.regex = (WARNING)|(ERROR)|(FATAL) 
 * <p> 
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 
 * agent.sources.r1.interceptors.i1.serializers.s1.type = 
 * com.blah.SomeSerializer agent.sources.r1.interceptors.i1.serializers.s1.name 
 * = warning agent.sources.r1.interceptors.i1.serializers.s2.type = 
 * org.apache.flume.interceptor.RegexExtractorInterceptorTimestampSerializer 
 * agent.sources.r1.interceptors.i1.serializers.s2.name = error 
 * agent.sources.r1.interceptors.i1.serializers.s2.dateFormat = yyyy-MM-dd 
 * </code> 
 * </p> 
 *  
 * <pre> 
 * Example 1: 
 * </p> 
 * EventBody: 1:2:3.4foobar5</p> Configuration: 
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d) 
 * </p> 
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 s3 
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one 
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two 
 * agent.sources.r1.interceptors.i1.serializers.s3.name = three 
 * </p> 
 * results in an event with the following 
 *  
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2, three=>3 
 *  
 * Example 2: 
 *  
 * EventBody: 1:2:3.4foobar5 
 *  
 * Configuration: agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d) 
 * <p> 
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one 
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two 
 * <p> 
 *  
 * results in an event with the following 
 *  
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2 
 * </pre> 
 */  
public class RegexExtractorExtInterceptor implements Interceptor {  
  
    static final String REGEX = "regex";  
    static final String SERIALIZERS = "serializers";  
  
    // Added code: begin  
  
    static final String EXTRACTOR_HEADER = "extractorHeader";  
    static final boolean DEFAULT_EXTRACTOR_HEADER = false;  
    static final String EXTRACTOR_HEADER_KEY = "extractorHeaderKey";  
  
    // Added code: end  
  
    private static final Logger logger = LoggerFactory  
            .getLogger(RegexExtractorExtInterceptor.class);  
  
    private final Pattern regex;  
    private final List<NameAndSerializer> serializers;  
  
    // Added code: begin  
  
    private final boolean extractorHeader;  
    private final String extractorHeaderKey;  
  
    // Added code: end  
  
    private RegexExtractorExtInterceptor(Pattern regex,  
            List<NameAndSerializer> serializers, boolean extractorHeader,  
            String extractorHeaderKey) {  
        this.regex = regex;  
        this.serializers = serializers;  
        this.extractorHeader = extractorHeader;  
        this.extractorHeaderKey = extractorHeaderKey;  
    }  
  
    @Override  
    public void initialize() {  
        // NO-OP...  
    }  
  
    @Override  
    public void close() {  
        // NO-OP...  
    }  
  
    @Override  
    public Event intercept(Event event) {  
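        // Choose the text to match: the configured header's value, or the body.  
        String tmpStr;  
        if (extractorHeader) {  
            tmpStr = event.getHeaders().get(extractorHeaderKey);  
        } else {  
            tmpStr = new String(event.getBody(), Charsets.UTF_8);  
        }  
        // The header may be absent; pass the event through unchanged instead of  
        // letting regex.matcher(null) throw a NullPointerException.  
        if (tmpStr == null) {  
            return event;  
        }  
  
        Matcher matcher = regex.matcher(tmpStr);  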
        Map<String, String> headers = event.getHeaders();  
        if (matcher.find()) {  
            for (int group = 0, count = matcher.groupCount(); group < count; group++) {  
                int groupIndex = group + 1;  
                if (groupIndex > serializers.size()) {  
                    if (logger.isDebugEnabled()) {  
                        logger.debug(  
                                "Skipping group {} to {} due to missing serializer",  
                                group, count);  
                    }  
                    break;  
                }  
                NameAndSerializer serializer = serializers.get(group);  
                if (logger.isDebugEnabled()) {  
                    logger.debug("Serializing {} using {}",  
                            serializer.headerName, serializer.serializer);  
                }  
                headers.put(serializer.headerName, serializer.serializer  
                        .serialize(matcher.group(groupIndex)));  
            }  
        }  
        return event;  
    }  
  
    @Override  
    public List<Event> intercept(List<Event> events) {  
        List<Event> intercepted = Lists.newArrayListWithCapacity(events.size());  
        for (Event event : events) {  
            Event interceptedEvent = intercept(event);  
            if (interceptedEvent != null) {  
                intercepted.add(interceptedEvent);  
            }  
        }  
        return intercepted;  
    }  
  
    public static class Builder implements Interceptor.Builder {  
  
        private Pattern regex;  
        private List<NameAndSerializer> serializerList;  
  
        // Added code: begin  
  
        private boolean extractorHeader;  
        private String extractorHeaderKey;  
  
        // Added code: end  
  
        private final RegexExtractorInterceptorSerializer defaultSerializer = new RegexExtractorInterceptorPassThroughSerializer();  
  
        @Override  
        public void configure(Context context) {  
            String regexString = context.getString(REGEX);  
            Preconditions.checkArgument(!StringUtils.isEmpty(regexString),  
                    "Must supply a valid regex string");  
  
            regex = Pattern.compile(regexString);  
            regex.pattern();  
            regex.matcher("").groupCount();  
            configureSerializers(context);  
  
            // Added code: begin  
            extractorHeader = context.getBoolean(EXTRACTOR_HEADER,  
                    DEFAULT_EXTRACTOR_HEADER);  
  
            if (extractorHeader) {  
                extractorHeaderKey = context.getString(EXTRACTOR_HEADER_KEY);  
                Preconditions.checkArgument(  
                        !StringUtils.isEmpty(extractorHeaderKey),  
                        "Must supply the header key to extract from");  
            }  
            // Added code: end  
        }  
  
        private void configureSerializers(Context context) {  
            String serializerListStr = context.getString(SERIALIZERS);  
            Preconditions.checkArgument(  
                    !StringUtils.isEmpty(serializerListStr),  
                    "Must supply at least one name and serializer");  
  
            String[] serializerNames = serializerListStr.split("\\s+");  
  
            Context serializerContexts = new Context(  
                    context.getSubProperties(SERIALIZERS + "."));  
  
            serializerList = Lists  
                    .newArrayListWithCapacity(serializerNames.length);  
            for (String serializerName : serializerNames) {  
                Context serializerContext = new Context(  
                        serializerContexts.getSubProperties(serializerName  
                                + "."));  
                String type = serializerContext.getString("type", "DEFAULT");  
                String name = serializerContext.getString("name");  
                Preconditions.checkArgument(!StringUtils.isEmpty(name),  
                        "Supplied name cannot be empty.");  
  
                if ("DEFAULT".equals(type)) {  
                    serializerList.add(new NameAndSerializer(name,  
                            defaultSerializer));  
                } else {  
                    serializerList.add(new NameAndSerializer(name,  
                            getCustomSerializer(type, serializerContext)));  
                }  
            }  
        }  
  
        private RegexExtractorInterceptorSerializer getCustomSerializer(  
                String clazzName, Context context) {  
            try {  
                RegexExtractorInterceptorSerializer serializer = (RegexExtractorInterceptorSerializer) Class  
                        .forName(clazzName).newInstance();  
                serializer.configure(context);  
                return serializer;  
            } catch (Exception e) {  
                logger.error("Could not instantiate event serializer.", e);  
                Throwables.propagate(e);  
            }  
            return defaultSerializer;  
        }  
  
        @Override  
        public Interceptor build() {  
            Preconditions.checkArgument(regex != null,  
                    "Regex pattern was misconfigured");  
            Preconditions.checkArgument(serializerList.size() > 0,  
                    "Must supply a valid group match id list");  
            return new RegexExtractorExtInterceptor(regex, serializerList,  
                    extractorHeader, extractorHeaderKey);  
        }  
    }  
  
    static class NameAndSerializer {  
        private final String headerName;  
        private final RegexExtractorInterceptorSerializer serializer;  
  
        public NameAndSerializer(String headerName,  
                RegexExtractorInterceptorSerializer serializer) {  
            this.headerName = headerName;  
            this.serializer = serializer;  
        }  
    }  
}  

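A brief summary of the changes:

Two configuration parameters were added:

extractorHeader: whether to extract from a header instead of the body. The default is false, which behaves the same as the original interceptor and extracts from the event body.

extractorHeaderKey: the header key whose value is matched against the regular expression. It must be set when extractorHeader is true.

Following the approach from part 8, we package this class into a jar and place it, as a Flume plugin, in the /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib directory, then restart Flume so that the interceptor is loaded onto the classpath.

The final flume.conf is as follows: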

tier1.sources=source1  
tier1.channels=channel1  
tier1.sinks=sink1  
tier1.sources.source1.type=spooldir  
tier1.sources.source1.spoolDir=/opt/logs  
tier1.sources.source1.fileHeader=true  
tier1.sources.source1.basenameHeader=true  
tier1.sources.source1.interceptors=i1  
tier1.sources.source1.interceptors.i1.type=com.besttone.flume.RegexExtractorExtInterceptor$Builder  
tier1.sources.source1.interceptors.i1.regex=(.*)\\.(.*)\\.(.*)  
tier1.sources.source1.interceptors.i1.extractorHeader=true  
tier1.sources.source1.interceptors.i1.extractorHeaderKey=basename  
tier1.sources.source1.interceptors.i1.serializers=s1 s2 s3  
tier1.sources.source1.interceptors.i1.serializers.s1.name=one  
tier1.sources.source1.interceptors.i1.serializers.s2.name=two  
tier1.sources.source1.interceptors.i1.serializers.s3.name=three  
tier1.sources.source1.channels=channel1  
tier1.sinks.sink1.type=hdfs  
tier1.sinks.sink1.channel=channel1  
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}  
tier1.sinks.sink1.hdfs.round=true  
tier1.sinks.sink1.hdfs.roundValue=10  
tier1.sinks.sink1.hdfs.roundUnit=minute  
tier1.sinks.sink1.hdfs.fileType=DataStream  
tier1.sinks.sink1.hdfs.writeFormat=Text  
tier1.sinks.sink1.hdfs.rollInterval=0  
tier1.sinks.sink1.hdfs.rollSize=10240  
tier1.sinks.sink1.hdfs.rollCount=0  
tier1.sinks.sink1.hdfs.idleTimeout=60  
tier1.channels.channel1.type=memory  
tier1.channels.channel1.capacity=10000  
tier1.channels.channel1.transactionCapacity=1000  
tier1.channels.channel1.keep-alive=30  

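I changed the source type back to the built-in spooldir instead of the custom source from the previous part, and added an interceptor i1 whose type is the custom interceptor com.besttone.flume.RegexExtractorExtInterceptor$Builder. The regular expression splits the file name on "." into three parts and stores them in the headers one, two, and three. For example, for a.log.2014-07-31 the interceptor adds three headers: one=a, two=log, three=2014-07-31. We then reference these headers in the sink path, tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}, so that events from this file are written under /flume/events/a/2014-07-31.

This implements exactly the same requirement as in part 8.

It also shows that a custom interceptor costs very little to write, much less than a custom source: we added just one class to implement this feature.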
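
As a quick check of the regular expression and the %{one}/%{three} path, the following sketch applies both to a sample file name in plain Java. It is a simplified model under stated assumptions, not Flume's code: HeaderPathSketch, splitBasename, and expand are hypothetical names, and expand only handles %{name} placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderPathSketch {
    // Same expression as interceptors.i1.regex after properties unescaping.
    private static final Pattern BASENAME = Pattern.compile("(.*)\\.(.*)\\.(.*)");
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{(\\w+)\\}");

    // Splits a file name into the headers one, two, three.
    static Map<String, String> splitBasename(String basename) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher matcher = BASENAME.matcher(basename);
        if (matcher.find()) {
            headers.put("one", matcher.group(1));
            headers.put("two", matcher.group(2));
            headers.put("three", matcher.group(3));
        }
        return headers;
    }

    // Replaces each %{name} with the header value (empty string if absent).
    static String expand(String template, Map<String, String> headers) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = headers.get(matcher.group(1));
            matcher.appendReplacement(result, Matcher.quoteReplacement(value == null ? "" : value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = splitBasename("a.log.2014-07-31");
        System.out.println(headers); // prints {one=a, two=log, three=2014-07-31}
        System.out.println(expand("/flume/events/%{one}/%{three}", headers));
    }
}
```

Because the groups are greedy, a file name with more than two dots puts the extra dots into the first group: a.b.c.d yields one=a.b, two=c, three=d. File names that do not follow the name.ext.date pattern may therefore need a stricter expression.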
