如何编写Flume-ng-morphline-avro-sink
来源:互联网 发布:淘宝会员名大全简单 编辑:程序博客网 时间:2024/06/04 19:02
以下内容都是自己瞎琢磨,并实验通过了。不知是否还有其他更好的方法,请各位大侠指正。本人研究大数据不到两个月时间。
工作需要,在预研与大数据,主要是hadoop相关组件和子项目的一些技术。
预研的产品平台主要包含hadoop、flume、solr、spark。目前重点关注flume和solr部分。即:从flume采集回日志进行分词传给solr创建索引,或第三方平台发送的已经结构化的数据直接创建索引。
平台框架类似下图(自己用visio画的简单示意图,仅供参考。图中省略了flume channel,数据处理层与数据分发层中间缺少了一个source和channel):
数据源:部署flume agent和第三方agent。flume agent主要用于采集日志。第三方agent用户采集除日志之外的其他信息。
采集层:部署flume和第三方采集平台,flume source用于接收从flume agent采集回来的日志信息。3rd platform用于接收从3rd agent采集回来的其他数据。3rd platform将数据进行结构化传递给flume source。
数据处理层:判断数据类型(结构化、非结构化),对非结构化数据使用morphline进行分词。采用morphline-avro-sink发出。
数据分发层:接收morphline-avro-sink数据,使用不同的sink将数据分发给不同的业务进行处理,其中重要的一条路径是分发给solr创建数据索引。
业务处理层:只介绍solr创建索引。
本文重点介绍morphline-avro-sink的编写过程。以及周边所需要的一些功能。
主要有以下几点是需要重点关注的:
1、参照flume源码中的flume-ng-morphline-solr-sink代码。
2、由于该sink最后是需要将数据以avro格式发出,所以MorphlineSink要继承AbstractRpcSink。因为Flume-ng的AvroSink就是继承的这个类。
3、因为AbstractRpcSink对外提供了两个接口用于数据处理:RpcClient.append(Event)和RpcClient.appendBatch(List<Event>)。所以,要在MorphlineSink中做好RpcClient的初始化。
4、MorphlineHandlerImpl中要在上下文中初始化好一个finalChild的Command。这个command默认是morphline中所有命令的最后一个,来接收之前命令的处理结果。
以下是MorphlineSink的代码,其中与flume-ng-morphline-solr-sink不同之处用红色字体标识。
/* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */package org.apache.flume.sink.avro.morphline;import java.util.Properties;import java.util.Map.Entry;import org.apache.flume.Channel;import org.apache.flume.ChannelException;import org.apache.flume.Context;import org.apache.flume.Event;import org.apache.flume.EventDeliveryException;import org.apache.flume.Transaction;import org.apache.flume.api.RpcClient;import org.apache.flume.api.RpcClientConfigurationConstants;import org.apache.flume.api.RpcClientFactory;import org.apache.flume.conf.Configurable;import org.apache.flume.conf.ConfigurationException;import org.apache.flume.instrumentation.SinkCounter;import org.apache.flume.sink.AbstractRpcSink;import org.kitesdk.morphline.api.Command;import org.slf4j.Logger;import org.slf4j.LoggerFactory;/** * Flume sink that extracts search documents from Flume events and processes them using a morphline * {@link Command} chain. */public class MorphlineSink extends AbstractRpcSink implements Configurable {private RpcClient client;private Properties clientProps; private int maxBatchSize = 1000; private long maxBatchDurationMillis = 1000; private String handlerClass; private MorphlineHandler handler; private Context context; private SinkCounter sinkCounter; public static final String BATCH_SIZE = "batchSize"; public static final String BATCH_DURATION_MILLIS = "batchDurationMillis"; public static final String HANDLER_CLASS = "handlerClass"; private static final Logger LOGGER = LoggerFactory.getLogger(MorphlineSink.class); public MorphlineSink() { this(null); } /** For testing only */ protected MorphlineSink(MorphlineHandler handler) { this.handler = handler; } @Override public void configure(Context context) { this.context = context; maxBatchSize = context.getInteger(BATCH_SIZE, maxBatchSize); maxBatchDurationMillis = context.getLong(BATCH_DURATION_MILLIS, maxBatchDurationMillis); handlerClass = context.getString(HANDLER_CLASS, MorphlineHandlerImpl.class.getName()); if (sinkCounter == null) { LOGGER.info("sinkCount is null"); sinkCounter = new SinkCounter(getName()); } /*LOGGER.info("sinkCount is " + sinkCounter.toString());*/ clientProps = new Properties(); clientProps.setProperty(RpcClientConfigurationConstants.CONFIG_HOSTS, "h1"); clientProps.setProperty(RpcClientConfigurationConstants.CONFIG_HOSTS_PREFIX + "h1", context.getString("hostname") + ":" + context.getInteger("port")); for (Entry<String, String> entry: context.getParameters().entrySet()) { clientProps.setProperty(entry.getKey(), entry.getValue()); } <span style="color:#ff0000;">client = initializeRpcClient(clientProps);</span> if (handler == null) { MorphlineHandler tmpHandler; try { tmpHandler = (MorphlineHandler) Class.forName(handlerClass).newInstance(); } catch (Exception e) { throw new ConfigurationException(e); } tmpHandler.configure(context); handler = tmpHandler; } super.configure(context); } /** * Returns the maximum number of events to take per flume transaction; * override to customize */ private int getMaxBatchSize() { return maxBatchSize; } /** Returns the maximum duration per flume transaction; override to customize */ private long getMaxBatchDurationMillis() { return maxBatchDurationMillis; } /*@Override public synchronized void start() { LOGGER.info("Starting Morphline Sink {} ...", this); sinkCounter.start(); if (handler == null) { MorphlineHandler tmpHandler; try { tmpHandler = (MorphlineHandler) Class.forName(handlerClass).newInstance(); } catch (Exception e) { throw new ConfigurationException(e); } tmpHandler.configure(context); handler = tmpHandler; } super.start(); LOGGER.info("Morphline Sink {} started.", getName()); } @Override public synchronized void stop() { LOGGER.info("Morphline Sink {} stopping...", getName()); try { if (handler != null) { handler.stop(); } sinkCounter.stop(); LOGGER.info("Morphline Sink {} stopped. Metrics: {}, {}", getName(), sinkCounter); } finally { super.stop(); } }*/ @Override public Status process() throws EventDeliveryException { int batchSize = getMaxBatchSize(); long batchEndTime = System.currentTimeMillis() + getMaxBatchDurationMillis(); Channel myChannel = getChannel(); Transaction txn = myChannel.getTransaction(); txn.begin(); boolean isMorphlineTransactionCommitted = true; try { int numEventsTaken = 0; handler.beginTransaction(); isMorphlineTransactionCommitted = false;// List<Event> events = Lists.newLinkedList(); // repeatedly take and process events from the Flume queue for (int i = 0; i < batchSize; i++) { Event event = myChannel.take(); if (event == null) { break; } sinkCounter.incrementEventDrainAttemptCount(); numEventsTaken++;// LOGGER.info("Flume event: {}", event); //StreamEvent streamEvent = createStreamEvent(event); <span style="color:#ff0000;">handler.process(event, client);</span>// events.add(event); if (System.currentTimeMillis() >= batchEndTime) { break; } }// handler.process(events, client); // update metrics if (numEventsTaken == 0) { sinkCounter.incrementBatchEmptyCount(); } if (numEventsTaken < batchSize) { sinkCounter.incrementBatchUnderflowCount(); } else { sinkCounter.incrementBatchCompleteCount(); } handler.commitTransaction(); isMorphlineTransactionCommitted = true; txn.commit(); sinkCounter.addToEventDrainSuccessCount(numEventsTaken); return numEventsTaken == 0 ? Status.BACKOFF : Status.READY; } catch (Throwable t) { // Ooops - need to rollback and back off LOGGER.error("Morphline Sink " + getName() + ": Unable to process event from channel " + myChannel.getName() + ". Exception follows.", t); try { if (!isMorphlineTransactionCommitted) { handler.rollbackTransaction(); } } catch (Throwable t2) { LOGGER.error("Morphline Sink " + getName() + ": Unable to rollback morphline transaction. " + "Exception follows.", t2); } finally { try { txn.rollback(); } catch (Throwable t4) { LOGGER.error("Morphline Sink " + getName() + ": Unable to rollback Flume transaction. " + "Exception follows.", t4); } } if (t instanceof Error) { throw (Error) t; // rethrow original exception } else if (t instanceof ChannelException) { return Status.BACKOFF; } else { throw new EventDeliveryException("Failed to send events", t); // rethrow and backoff } } finally { txn.close(); } } @Override public String toString() { int i = getClass().getName().lastIndexOf('.') + 1; String shortClassName = getClass().getName().substring(i); return getName() + " (" + shortClassName + ")"; }@Overrideprotected RpcClient initializeRpcClient(Properties props) {LOGGER.info("Attempting to create Avro Rpc client.");return RpcClientFactory.getInstance(props);}}
/* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */package org.apache.flume.sink.avro.morphline;import java.io.File;import java.util.ArrayList;import java.util.Collection;import java.util.HashMap;import java.util.Iterator;import java.util.List;import java.util.Map;import java.util.Map.Entry;import org.apache.flume.Context;import org.apache.flume.Event;import org.apache.flume.EventDeliveryException;import org.apache.flume.api.RpcClient;import org.apache.flume.event.EventBuilder;import org.kitesdk.morphline.api.Command;import org.kitesdk.morphline.api.MorphlineCompilationException;import org.kitesdk.morphline.api.MorphlineContext;import org.kitesdk.morphline.api.Record;import org.kitesdk.morphline.base.Compiler;import org.kitesdk.morphline.base.FaultTolerance;import org.kitesdk.morphline.base.Fields;import org.kitesdk.morphline.base.Metrics;import org.kitesdk.morphline.base.Notifications;import org.slf4j.Logger;import org.slf4j.LoggerFactory;import com.codahale.metrics.Meter;import com.codahale.metrics.MetricRegistry;import com.codahale.metrics.SharedMetricRegistries;import com.codahale.metrics.Timer;import com.google.common.base.Preconditions;import com.google.common.collect.ListMultimap;import com.typesafe.config.Config;import com.typesafe.config.ConfigFactory;/** * A {@link MorphlineHandler} that processes it's events using a morphline * {@link Command} chain. */public class MorphlineHandlerImpl implements MorphlineHandler {private MorphlineContext morphlineContext;private Command morphline;private Command finalChild;private String morphlineFileAndId;private Timer mappingTimer;private Meter numRecords;private Meter numFailedRecords;private Meter numExceptionRecords;public static final String MORPHLINE_FILE_PARAM = "morphlineFile";public static final String MORPHLINE_ID_PARAM = "morphlineId";/** * Morphline variables can be passed from flume.conf to the morphline, e.g.: * agent.sinks.solrSink.morphlineVariable.zkHost=127.0.0.1:2181/solr */public static final String MORPHLINE_VARIABLE_PARAM = "morphlineVariable";private static final Logger LOG = LoggerFactory.getLogger(MorphlineHandlerImpl.class);// For test injectionvoid setMorphlineContext(MorphlineContext morphlineContext) {this.morphlineContext = morphlineContext;}// for interceptorvoid setFinalChild(Command finalChild) {this.finalChild = finalChild;}@Overridepublic void configure(Context context) {String morphlineFile = context.getString(MORPHLINE_FILE_PARAM);String morphlineId = context.getString(MORPHLINE_ID_PARAM);if (morphlineFile == null || morphlineFile.trim().length() == 0) {throw new MorphlineCompilationException("Missing parameter: "+ MORPHLINE_FILE_PARAM, null);}morphlineFileAndId = morphlineFile + "@" + morphlineId;if (morphlineContext == null) {FaultTolerance faultTolerance = new FaultTolerance(context.getBoolean(FaultTolerance.IS_PRODUCTION_MODE, false),context.getBoolean(FaultTolerance.IS_IGNORING_RECOVERABLE_EXCEPTIONS,false),context.getString(FaultTolerance.RECOVERABLE_EXCEPTION_CLASSES));morphlineContext = new MorphlineContext.Builder().setExceptionHandler(faultTolerance).setMetricRegistry(SharedMetricRegistries.getOrCreate(morphlineFileAndId)).build();}Config override = ConfigFactory.parseMap(context.getSubProperties(MORPHLINE_VARIABLE_PARAM + "."));<span style="color:#ff0000;">finalChild = new CollectorB();</span>morphline = new Compiler().compile(new File(morphlineFile),morphlineId, morphlineContext, finalChild, override);this.mappingTimer = morphlineContext.getMetricRegistry().timer(MetricRegistry.name("morphline.app", Metrics.ELAPSED_TIME));this.numRecords = morphlineContext.getMetricRegistry().meter(MetricRegistry.name("morphline.app", Metrics.NUM_RECORDS));this.numFailedRecords = morphlineContext.getMetricRegistry().meter(MetricRegistry.name("morphline.app", "numFailedRecords"));this.numExceptionRecords = morphlineContext.getMetricRegistry().meter(MetricRegistry.name("morphline.app", "numExceptionRecords"));}@Overridepublic void process(Event event, RpcClient client) {// LOG.info("entry into MorphlineHandlerImpl process" + event);numRecords.mark();Timer.Context timerContext = mappingTimer.time();try {Record record = new Record();for (Entry<String, String> entry : event.getHeaders().entrySet()) {record.put(entry.getKey(), entry.getValue());}byte[] bytes = event.getBody();if (bytes != null && bytes.length > 0) {record.put(Fields.ATTACHMENT_BODY, bytes);}try {Notifications.notifyStartSession(morphline);if (!morphline.process(record)) {numFailedRecords.mark();LOG.warn("Morphline {} failed to process record: {}",morphlineFileAndId, record);}<span style="color:#ff0000;">Map<String, String> headers = null;List<Record> tmp = ((CollectorB) finalChild).getRecords();List<Record> records = new ArrayList<Record>();records.addAll(tmp);tmp.clear();//LOG.info("records 00000--------- " + records.size());Iterator irt = records.iterator();while (irt.hasNext()) {Record r = (Record) irt.next();headers = new HashMap<String, String>();ListMultimap<String, Object> lmt = r.getFields();Map<String, Collection<Object>> m = lmt.asMap();Iterator it = m.entrySet().iterator();while (it.hasNext()) {Entry<String, Object> entry = (Entry<String, Object>) it.next();if (entry.getValue() != null) {List v = (List) entry.getValue();if (v.get(0) != null) {headers.put(entry.getKey(), v.get(0).toString());}}}Event e = EventBuilder.withBody(event.getBody(), headers);client.append(e);</span>}} catch (RuntimeException t) {numExceptionRecords.mark();morphlineContext.getExceptionHandler().handleException(t,record);} catch (EventDeliveryException e1) {numExceptionRecords.mark();morphlineContext.getExceptionHandler().handleException(e1,record);}} finally {timerContext.stop();}}@Overridepublic void beginTransaction() {Notifications.notifyBeginTransaction(morphline);}@Overridepublic void commitTransaction() {Notifications.notifyCommitTransaction(morphline);}@Overridepublic void rollbackTransaction() {Notifications.notifyRollbackTransaction(morphline);}@Overridepublic void stop() {Notifications.notifyShutdown(morphline);}<span style="color:#ff0000;">public static final class CollectorB implements Command {private final List<Record> results = new ArrayList();public List<Record> getRecords() {return results;}public void reset() {results.clear();}@Overridepublic Command getParent() {return null;}@Overridepublic void notify(Record notification) {}@Overridepublic boolean process(Record record) {Preconditions.checkNotNull(record);results.add(record);return true;}}</span>}以上是费了几天的力气才完成的功能。发出来的目的:
1、知识的记录和备忘。
2、对正在做这部分的同行有点帮助。
3、请大家指正,是否有其他更好的办法,或者这种实现方式是否存在其他隐患。
4、与大家交流,希望能提高自己。
- 如何编写Flume-ng-morphline-avro-sink
- 重写Flume-NG-morphline-avro-sink
- Flume NG flume-hdfs-sink 源代码分析
- Flume-ng HDFS sink原理解析
- Flume-ng HDFS Sink “丢数据”
- Flume-ng:avro源发送日志
- 【Flume】flume中Avro Sink到Avro Source的性能测试,是否压缩,是否加密
- Flume(ng) 自定义sink实现和属性注入
- 自定义的flume-ng的postgresql数据库sink
- Flume-ng 自定义sink实现和属性注入
- Flume NG之Agent部署和sink配置HDFS
- 【Hadoop】Flume-ng源码解析之Sink组件
- Flume(ng) 自定义sink实现和属性注入
- Flume(ng) 自定义sink实现和属性注入
- Flume Sink
- 【Flume】【源码分析】深入flume-ng的三大组件——source,channel,sink
- 【Flume】【源码分析】深入flume-ng的三大组件——source,channel,sink
- 【Flume】【源码分析】深入flume-ng的三大组件——source,channel,sink
- SSh结合Easyui实现Datagrid的分页显示
- MyEclipse破解以及注册码
- Cocos2d-x中android.mk文件中cpp文件的自动生成
- CButtonST使用技巧: CButtonST简介
- Oracle 取两个表中数据的交集并集差异集合
- 如何编写Flume-ng-morphline-avro-sink
- 首先列一下,sellect、poll、epoll三者的区别
- 指针的算术运算
- 网络技术员面试自我介绍范文
- 使用BIND来搭建简单的主辅DNS服务器
- swift开发知识点滴
- tomcat 部署
- 使用easyui中的datagrid 分页时,向后台传参数page 和rows
- Codeforces Round #268 (Div. 2) C 24 Game