How to Write a Flume-ng-morphline-avro-sink


Everything below is something I worked out on my own and verified by experiment. I don't know whether there is a better approach; corrections from more experienced readers are welcome. I have been working on big data for less than two months.

For work, I am doing a preliminary study of big data technologies, mainly Hadoop-related components and subprojects.

The product platform under study mainly consists of Hadoop, Flume, Solr, and Spark. The current focus is on the Flume and Solr parts, i.e.: logs collected by Flume are tokenized and passed to Solr to build an index, while data that arrives already structured from third-party platforms is indexed directly.

The platform is layered roughly as follows (I drew a simple Visio schematic as a reference only; it omits the Flume channels, and a source and channel between the data-processing layer and the data-distribution layer are missing):



Data source: Flume agents and third-party agents are deployed here. The Flume agents collect logs; the third-party agents collect information other than logs.
Collection layer: Flume and a third-party collection platform are deployed here. Flume sources receive the log events sent back by the Flume agents; the 3rd platform receives the other data sent back by the 3rd agents, structures it, and hands it to a Flume source.
Data-processing layer: determines the data type (structured or unstructured), tokenizes unstructured data with a morphline, and sends the results out through the morphline-avro-sink (see the configuration sketch after this list).
Data-distribution layer: receives the data from the morphline-avro-sink and uses different sinks to distribute it to different downstream services; the most important path delivers it to Solr for index creation.
Business layer: only Solr index creation is discussed here.
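
Purely as an illustration (not from the original post), here is what a minimal flume.conf for the processing and distribution tiers might look like. The agent names (proc, dist), hosts, ports, and file paths are invented; the MorphlineSink parameters correspond to what the code shown later in this post actually reads, and the distribution tier is sketched with the stock MorphlineSolrSink as just one possible downstream sink.

# Processing tier: avro source -> memory channel -> custom morphline-avro sink
proc.sources  = r1
proc.channels = c1
proc.sinks    = k1

proc.sources.r1.type = avro
proc.sources.r1.bind = 0.0.0.0
proc.sources.r1.port = 4545
proc.sources.r1.channels = c1

proc.channels.c1.type = memory

proc.sinks.k1.type = org.apache.flume.sink.avro.morphline.MorphlineSink
proc.sinks.k1.channel = c1
# where the distribution tier's avro source listens
proc.sinks.k1.hostname = dist-host
proc.sinks.k1.port = 4546
proc.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf
proc.sinks.k1.morphlineId = morphline1
# optional: batchSize, batchDurationMillis, handlerClass

# Distribution tier: avro source -> memory channel -> e.g. MorphlineSolrSink (other sinks omitted)
dist.sources  = r2
dist.channels = c2
dist.sinks    = solrSink

dist.sources.r2.type = avro
dist.sources.r2.bind = 0.0.0.0
dist.sources.r2.port = 4546
dist.sources.r2.channels = c2

dist.channels.c2.type = memory

dist.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
dist.sinks.solrSink.channel = c2
dist.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/solr-morphline.conf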

This post focuses on how the morphline-avro-sink was written, plus the surrounding pieces it needs.
The main points to pay attention to are:
1. Use the flume-ng-morphline-solr-sink code in the Flume source tree as a reference.
2. Since this sink ultimately has to send the data out in Avro format, MorphlineSink extends AbstractRpcSink, the same base class that Flume-ng's AvroSink extends.
3. AbstractRpcSink delivers data through an RpcClient, whose two methods RpcClient.append(Event) and RpcClient.appendBatch(List<Event>) do the actual sending, so MorphlineSink has to initialize the RpcClient properly.
4. MorphlineHandlerImpl has to install a finalChild Command in the morphline context. By default this command becomes the last command in the morphline chain and receives the output of all the preceding commands.
Below is the code for MorphlineSink, followed by MorphlineHandlerImpl; the places that differ from flume-ng-morphline-solr-sink are marked with "// DIFF" comments.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.flume.sink.avro.morphline;

import java.util.Properties;
import java.util.Map.Entry;

import org.apache.flume.Channel;
import org.apache.flume.ChannelException;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientConfigurationConstants;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.conf.Configurable;
import org.apache.flume.conf.ConfigurationException;
import org.apache.flume.instrumentation.SinkCounter;
import org.apache.flume.sink.AbstractRpcSink;
import org.kitesdk.morphline.api.Command;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Flume sink that extracts search documents from Flume events and processes them using a morphline
 * {@link Command} chain.
 */
public class MorphlineSink extends AbstractRpcSink implements Configurable {

  private RpcClient client;
  private Properties clientProps;
  private int maxBatchSize = 1000;
  private long maxBatchDurationMillis = 1000;
  private String handlerClass;
  private MorphlineHandler handler;
  private Context context;
  private SinkCounter sinkCounter;

  public static final String BATCH_SIZE = "batchSize";
  public static final String BATCH_DURATION_MILLIS = "batchDurationMillis";
  public static final String HANDLER_CLASS = "handlerClass";

  private static final Logger LOGGER = LoggerFactory.getLogger(MorphlineSink.class);

  public MorphlineSink() {
    this(null);
  }

  /** For testing only */
  protected MorphlineSink(MorphlineHandler handler) {
    this.handler = handler;
  }

  @Override
  public void configure(Context context) {
    this.context = context;
    maxBatchSize = context.getInteger(BATCH_SIZE, maxBatchSize);
    maxBatchDurationMillis = context.getLong(BATCH_DURATION_MILLIS, maxBatchDurationMillis);
    handlerClass = context.getString(HANDLER_CLASS, MorphlineHandlerImpl.class.getName());
    if (sinkCounter == null) {
      LOGGER.info("sinkCount is null");
      sinkCounter = new SinkCounter(getName());
    }
    /*LOGGER.info("sinkCount is " + sinkCounter.toString());*/

    clientProps = new Properties();
    clientProps.setProperty(RpcClientConfigurationConstants.CONFIG_HOSTS, "h1");
    clientProps.setProperty(RpcClientConfigurationConstants.CONFIG_HOSTS_PREFIX + "h1",
        context.getString("hostname") + ":" + context.getInteger("port"));
    for (Entry<String, String> entry : context.getParameters().entrySet()) {
      clientProps.setProperty(entry.getKey(), entry.getValue());
    }

    // DIFF vs. flume-ng-morphline-solr-sink: create the Avro RpcClient that the handler will use
    client = initializeRpcClient(clientProps);

    if (handler == null) {
      MorphlineHandler tmpHandler;
      try {
        tmpHandler = (MorphlineHandler) Class.forName(handlerClass).newInstance();
      } catch (Exception e) {
        throw new ConfigurationException(e);
      }
      tmpHandler.configure(context);
      handler = tmpHandler;
    }
    super.configure(context);
  }

  /**
   * Returns the maximum number of events to take per flume transaction;
   * override to customize
   */
  private int getMaxBatchSize() {
    return maxBatchSize;
  }

  /** Returns the maximum duration per flume transaction; override to customize */
  private long getMaxBatchDurationMillis() {
    return maxBatchDurationMillis;
  }

  /*@Override
  public synchronized void start() {
    LOGGER.info("Starting Morphline Sink {} ...", this);
    sinkCounter.start();
    if (handler == null) {
      MorphlineHandler tmpHandler;
      try {
        tmpHandler = (MorphlineHandler) Class.forName(handlerClass).newInstance();
      } catch (Exception e) {
        throw new ConfigurationException(e);
      }
      tmpHandler.configure(context);
      handler = tmpHandler;
    }
    super.start();
    LOGGER.info("Morphline Sink {} started.", getName());
  }

  @Override
  public synchronized void stop() {
    LOGGER.info("Morphline Sink {} stopping...", getName());
    try {
      if (handler != null) {
        handler.stop();
      }
      sinkCounter.stop();
      LOGGER.info("Morphline Sink {} stopped. Metrics: {}, {}", getName(), sinkCounter);
    } finally {
      super.stop();
    }
  }*/

  @Override
  public Status process() throws EventDeliveryException {
    int batchSize = getMaxBatchSize();
    long batchEndTime = System.currentTimeMillis() + getMaxBatchDurationMillis();
    Channel myChannel = getChannel();
    Transaction txn = myChannel.getTransaction();
    txn.begin();
    boolean isMorphlineTransactionCommitted = true;
    try {
      int numEventsTaken = 0;
      handler.beginTransaction();
      isMorphlineTransactionCommitted = false;
//      List<Event> events = Lists.newLinkedList();
      // repeatedly take and process events from the Flume queue
      for (int i = 0; i < batchSize; i++) {
        Event event = myChannel.take();
        if (event == null) {
          break;
        }
        sinkCounter.incrementEventDrainAttemptCount();
        numEventsTaken++;
//        LOGGER.info("Flume event: {}", event);
        //StreamEvent streamEvent = createStreamEvent(event);
        // DIFF vs. flume-ng-morphline-solr-sink: hand the event and the RpcClient to the handler
        handler.process(event, client);
//        events.add(event);
        if (System.currentTimeMillis() >= batchEndTime) {
          break;
        }
      }
//      handler.process(events, client);
      // update metrics
      if (numEventsTaken == 0) {
        sinkCounter.incrementBatchEmptyCount();
      }
      if (numEventsTaken < batchSize) {
        sinkCounter.incrementBatchUnderflowCount();
      } else {
        sinkCounter.incrementBatchCompleteCount();
      }
      handler.commitTransaction();
      isMorphlineTransactionCommitted = true;
      txn.commit();
      sinkCounter.addToEventDrainSuccessCount(numEventsTaken);
      return numEventsTaken == 0 ? Status.BACKOFF : Status.READY;
    } catch (Throwable t) {
      // Ooops - need to rollback and back off
      LOGGER.error("Morphline Sink " + getName() + ": Unable to process event from channel "
          + myChannel.getName() + ". Exception follows.", t);
      try {
        if (!isMorphlineTransactionCommitted) {
          handler.rollbackTransaction();
        }
      } catch (Throwable t2) {
        LOGGER.error("Morphline Sink " + getName() + ": Unable to rollback morphline transaction. " +
            "Exception follows.", t2);
      } finally {
        try {
          txn.rollback();
        } catch (Throwable t4) {
          LOGGER.error("Morphline Sink " + getName() + ": Unable to rollback Flume transaction. " +
              "Exception follows.", t4);
        }
      }
      if (t instanceof Error) {
        throw (Error) t; // rethrow original exception
      } else if (t instanceof ChannelException) {
        return Status.BACKOFF;
      } else {
        throw new EventDeliveryException("Failed to send events", t); // rethrow and backoff
      }
    } finally {
      txn.close();
    }
  }

  @Override
  public String toString() {
    int i = getClass().getName().lastIndexOf('.') + 1;
    String shortClassName = getClass().getName().substring(i);
    return getName() + " (" + shortClassName + ")";
  }

  @Override
  protected RpcClient initializeRpcClient(Properties props) {
    LOGGER.info("Attempting to create Avro Rpc client.");
    return RpcClientFactory.getInstance(props);
  }
}

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.flume.sink.avro.morphline;

import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.event.EventBuilder;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineCompilationException;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.FaultTolerance;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Metrics;
import org.kitesdk.morphline.base.Notifications;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.SharedMetricRegistries;
import com.codahale.metrics.Timer;
import com.google.common.base.Preconditions;
import com.google.common.collect.ListMultimap;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

/**
 * A {@link MorphlineHandler} that processes it's events using a morphline
 * {@link Command} chain.
 */
public class MorphlineHandlerImpl implements MorphlineHandler {

  private MorphlineContext morphlineContext;
  private Command morphline;
  private Command finalChild;
  private String morphlineFileAndId;

  private Timer mappingTimer;
  private Meter numRecords;
  private Meter numFailedRecords;
  private Meter numExceptionRecords;

  public static final String MORPHLINE_FILE_PARAM = "morphlineFile";
  public static final String MORPHLINE_ID_PARAM = "morphlineId";

  /**
   * Morphline variables can be passed from flume.conf to the morphline, e.g.:
   * agent.sinks.solrSink.morphlineVariable.zkHost=127.0.0.1:2181/solr
   */
  public static final String MORPHLINE_VARIABLE_PARAM = "morphlineVariable";

  private static final Logger LOG = LoggerFactory.getLogger(MorphlineHandlerImpl.class);

  // For test injection
  void setMorphlineContext(MorphlineContext morphlineContext) {
    this.morphlineContext = morphlineContext;
  }

  // for interceptor
  void setFinalChild(Command finalChild) {
    this.finalChild = finalChild;
  }

  @Override
  public void configure(Context context) {
    String morphlineFile = context.getString(MORPHLINE_FILE_PARAM);
    String morphlineId = context.getString(MORPHLINE_ID_PARAM);
    if (morphlineFile == null || morphlineFile.trim().length() == 0) {
      throw new MorphlineCompilationException("Missing parameter: " + MORPHLINE_FILE_PARAM, null);
    }
    morphlineFileAndId = morphlineFile + "@" + morphlineId;

    if (morphlineContext == null) {
      FaultTolerance faultTolerance = new FaultTolerance(
          context.getBoolean(FaultTolerance.IS_PRODUCTION_MODE, false),
          context.getBoolean(FaultTolerance.IS_IGNORING_RECOVERABLE_EXCEPTIONS, false),
          context.getString(FaultTolerance.RECOVERABLE_EXCEPTION_CLASSES));

      morphlineContext = new MorphlineContext.Builder()
          .setExceptionHandler(faultTolerance)
          .setMetricRegistry(SharedMetricRegistries.getOrCreate(morphlineFileAndId))
          .build();
    }

    Config override = ConfigFactory.parseMap(context.getSubProperties(MORPHLINE_VARIABLE_PARAM + "."));

    // DIFF vs. flume-ng-morphline-solr-sink: install a collector as the last command of the chain
    finalChild = new CollectorB();

    morphline = new Compiler().compile(new File(morphlineFile), morphlineId, morphlineContext,
        finalChild, override);

    this.mappingTimer = morphlineContext.getMetricRegistry().timer(
        MetricRegistry.name("morphline.app", Metrics.ELAPSED_TIME));
    this.numRecords = morphlineContext.getMetricRegistry().meter(
        MetricRegistry.name("morphline.app", Metrics.NUM_RECORDS));
    this.numFailedRecords = morphlineContext.getMetricRegistry().meter(
        MetricRegistry.name("morphline.app", "numFailedRecords"));
    this.numExceptionRecords = morphlineContext.getMetricRegistry().meter(
        MetricRegistry.name("morphline.app", "numExceptionRecords"));
  }

  @Override
  public void process(Event event, RpcClient client) {
    // LOG.info("entry into MorphlineHandlerImpl  process" + event);
    numRecords.mark();
    Timer.Context timerContext = mappingTimer.time();
    try {
      Record record = new Record();
      for (Entry<String, String> entry : event.getHeaders().entrySet()) {
        record.put(entry.getKey(), entry.getValue());
      }
      byte[] bytes = event.getBody();
      if (bytes != null && bytes.length > 0) {
        record.put(Fields.ATTACHMENT_BODY, bytes);
      }
      try {
        Notifications.notifyStartSession(morphline);
        if (!morphline.process(record)) {
          numFailedRecords.mark();
          LOG.warn("Morphline {} failed to process record: {}", morphlineFileAndId, record);
        }
        // DIFF vs. flume-ng-morphline-solr-sink: drain the records collected by the final command,
        // copy each record's first field values into event headers, and forward via the RpcClient
        Map<String, String> headers = null;
        List<Record> tmp = ((CollectorB) finalChild).getRecords();
        List<Record> records = new ArrayList<Record>();
        records.addAll(tmp);
        tmp.clear();
        //LOG.info("records     00000---------   " + records.size());
        Iterator irt = records.iterator();
        while (irt.hasNext()) {
          Record r = (Record) irt.next();
          headers = new HashMap<String, String>();
          ListMultimap<String, Object> lmt = r.getFields();
          Map<String, Collection<Object>> m = lmt.asMap();
          Iterator it = m.entrySet().iterator();
          while (it.hasNext()) {
            Entry<String, Object> entry = (Entry<String, Object>) it.next();
            if (entry.getValue() != null) {
              List v = (List) entry.getValue();
              if (v.get(0) != null) {
                headers.put(entry.getKey(), v.get(0).toString());
              }
            }
          }
          Event e = EventBuilder.withBody(event.getBody(), headers);
          client.append(e);
        }
      } catch (RuntimeException t) {
        numExceptionRecords.mark();
        morphlineContext.getExceptionHandler().handleException(t, record);
      } catch (EventDeliveryException e1) {
        numExceptionRecords.mark();
        morphlineContext.getExceptionHandler().handleException(e1, record);
      }
    } finally {
      timerContext.stop();
    }
  }

  @Override
  public void beginTransaction() {
    Notifications.notifyBeginTransaction(morphline);
  }

  @Override
  public void commitTransaction() {
    Notifications.notifyCommitTransaction(morphline);
  }

  @Override
  public void rollbackTransaction() {
    Notifications.notifyRollbackTransaction(morphline);
  }

  @Override
  public void stop() {
    Notifications.notifyShutdown(morphline);
  }

  // DIFF vs. flume-ng-morphline-solr-sink: final command that simply collects the output records
  public static final class CollectorB implements Command {

    private final List<Record> results = new ArrayList();

    public List<Record> getRecords() {
      return results;
    }

    public void reset() {
      results.clear();
    }

    @Override
    public Command getParent() {
      return null;
    }

    @Override
    public void notify(Record notification) {
    }

    @Override
    public boolean process(Record record) {
      Preconditions.checkNotNull(record);
      results.add(record);
      return true;
    }
  }
}
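
For completeness, the morphlineFile that MorphlineHandlerImpl compiles is an ordinary Kite morphline configuration. The sketch below is only a guess at what a simple log-tokenizing morphline for this setup might look like; the file path, field names, and choice of commands are assumptions, not taken from the original code. Because MorphlineHandlerImpl installs CollectorB as the finalChild of the compiled chain, the file does not need a loadSolr or other terminal command: whatever records come out of the last command listed here are what CollectorB collects.

# morphline.conf (hypothetical example)
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      # turn each line of the event body (Fields.ATTACHMENT_BODY) into a record with a "message" field
      { readLine { charset : UTF-8 } }

      # naive whitespace tokenization of the line into a few named fields
      {
        split {
          inputField : message
          outputFields : [timestamp, level, text]
          separator : " "
          isRegex : false
          addEmptyStrings : false
          trim : true
        }
      }
    ]
  }
]

With a chain like this, every record that reaches CollectorB has the first value of each of its fields copied into the headers of a new Flume event whose body is the original event body, and that event is sent with client.append(), so the distribution tier sees the extracted fields as event headers.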
It took several days of effort to get this working. I am posting it in order to:

1. Record the knowledge for my own future reference.

2. Hopefully help others who are working on the same thing.

3. Invite corrections: is there a better way, and does this implementation have hidden pitfalls?

4. Exchange ideas and improve.
