Storm-Kafka-HBase Basic Programming, Part 1


The previous article explained how Storm guarantees message delivery: each message is tracked through its message ID and the anchored-tuple mechanism, an ack is returned when processing completes and a fail is returned when it does not, so the message can be re-sent. Even so, as you probably know, this still cannot guarantee exactly-once semantics. Why not? Think it over. The program written in this article therefore does not guarantee exactly-once; if you need that guarantee you have to use the Trident API, which will be covered in a later article.


Writing a Storm program is fairly simple, but managing the dependency jars is a real headache. I have not found a good solution yet; the usual approach is to package the program as one fat jar (here via the maven-assembly-plugin's jar-with-dependencies descriptor in the pom.xml below), so the jar ends up quite large. Even then, some classes may still turn out to be missing. With a bit of experience you get a feel for roughly which jars a Storm program needs.

The program shown today reads data from Kafka, filters it in one bolt, passes it on to a second bolt, and finally stores it in HBase, so first we need to prepare the relevant pom.xml.

pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <groupId>com.isesol</groupId>
  <artifactId>storm</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>storm</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>1.0.2</version>
      <scope>provided</scope>
    </dependency>
    <!-- 0.9.0-kafka-2.0.1 -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.10</artifactId>
      <version>0.9.0-kafka-2.0.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <!-- <exclusion> <groupId>org.apache.zookeeper</groupId> <artifactId>zookeeper</artifactId> </exclusion> -->
      </exclusions>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.storm/storm-kafka -->
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-kafka</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>0.9.0-kafka-2.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.storm/storm-hbase -->
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-hbase</artifactId>
      <version>1.1.0</version>
      <exclusions>
        <exclusion>
          <groupId>jdk.tools</groupId>
          <artifactId>jdk.tools</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.isesol.storm.getKafka</mainClass>
            </manifest>
          </archive>
          <!-- <descriptor>assembly.xml</descriptor> -->
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
With the pom.xml ready, we can start writing the program. Reading from Kafka is done through the storm-kafka integration package, and writing to HBase through the storm-hbase integration package, so the whole flow is KafkaSpout -> bolt -> bolt -> HBaseBolt. Let's first go over the basic programming concepts, starting with a bolt:

class kafkaBolt extends BaseRichBolt {

    private Map conf;
    private TopologyContext context;
    private OutputCollector collector;

    public void execute(Tuple input) {
        try {
            String line = input.getString(0);
            // Anchor the emitted tuple to the input tuple, then ack the input
            collector.emit(input, new Values(line));
            collector.ack(input);
        } catch (Exception ex) {
            // Fail the tuple so the spout can replay it
            collector.fail(input);
        }
    }

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // Keep references to the configuration, context and output collector
        this.conf = conf;
        this.context = context;
        this.collector = collector;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // A single output field named "line"
        declarer.declare(new Fields("line"));
    }
}

kafkaBolt extends BaseRichBolt and has three methods: prepare, execute, and declareOutputFields. prepare is where you receive the configuration, topology context, and output collector; there is usually nothing to do here beyond assigning them to fields. execute is where the data is actually processed, and declareOutputFields declares the fields of the tuples sent on to the next bolt. If this is the last bolt and nothing consumes its output, there is obviously nothing to write here, because you do not need to emit any data.


Two things deserve special attention here: collector.emit(input, new Values(line)) and new Fields("line"):

In collector.emit(input, new Values(line)), the input argument is what turns the emitted tuple into an anchored tuple, and new Values(...) holds the data you want to emit, with the individual fields separated by commas. Here there is only one field, line; if you had two fields it would be new Values(line, line1).

new Fields("line") names the fields you emit and must correspond one-to-one with new Values. If you emit two fields it becomes new Fields("line", "line1"). The names themselves can be anything you like; their purpose is illustrated below and will come up again when we configure the HBase mapper.
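To make the role of these field names concrete, here is a minimal, hypothetical bolt (not part of the program in this article) that reads the field emitted by kafkaBolt both by position and by its declared name. Tuple.getStringByField is the standard accessor for the latter, and the SimpleHBaseMapper configured later in this article picks columns by these same declared names.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt, only to illustrate how downstream code uses the field
// names declared by the upstream bolt ("line" is declared by kafkaBolt above).
class fieldNameDemoBolt extends BaseRichBolt {

    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        String byIndex = input.getString(0);             // positional access
        String byName = input.getStringByField("line");  // access by the declared field name; same value as byIndex
        collector.emit(input, new Values(byName));       // anchored emit, same pattern as the bolts above
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}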


After kafkaBolt there is one more bolt, which sends the data on to the HBaseBolt:

class transferBolt extends BaseRichBolt {

    private Map conf;
    private TopologyContext context;
    private OutputCollector collector;

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.conf = stormConf;
        this.context = context;
        this.collector = collector;
    }

    public void execute(Tuple input) {
        try {
            String line = input.getString(0);
            collector.emit(input,
                    new Values(UUID.randomUUID().toString() + "-test1", UUID.randomUUID().toString(), line));
            collector.ack(input);
        } catch (Exception ex) {
            collector.fail(input);
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("rowkey", "linetest", "line"));
    }
}

As you can see, these two bolts do not really do much. kafkaBolt does no processing at all and simply re-emits the tuple it receives; transferBolt hands the data to the HBaseBolt, so I gave it one small job: adding a rowkey. (A sketch of what an actual filtering bolt might look like follows below.)
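The flow description above talks about filtering, but kafkaBolt as written forwards everything. As a purely illustrative sketch (not the author's code), an execute method that actually filters could look like the following; the "ERROR" keyword is a hypothetical condition, and filtered-out tuples are still acked so they count as fully processed:

    // Illustrative filtering variant of kafkaBolt's execute method (drop-in replacement).
    public void execute(Tuple input) {
        try {
            String line = input.getString(0);
            if (line != null && line.contains("ERROR")) {   // hypothetical filter condition
                collector.emit(input, new Values(line));    // forward only matching lines, anchored
            }
            collector.ack(input);                           // ack whether forwarded or dropped
        } catch (Exception ex) {
            collector.fail(input);                          // fail so the spout replays the tuple
        }
    }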


The complete program is as follows:


package com.isesol.storm;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.storm.*;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.hbase.bolt.HBaseBolt;
import org.apache.storm.hbase.bolt.mapper.SimpleHBaseMapper;
import org.apache.storm.shade.com.google.common.collect.Maps;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;

public class getKafka {

    public static void main(String[] args)
            throws AlreadyAliveException, InvalidTopologyException, AuthorizationException {

        String zkConnString = "datanode01.isesol.com:2181,datanode02.isesol.com:2181,datanode03.isesol.com:2181,datanode04.isesol.com:2181";
        String topicName = "2001";
        String zkRoot = "/data/storm";

        BrokerHosts hosts = new ZkHosts(zkConnString);
        SpoutConfig spoutConfig = new SpoutConfig(hosts, topicName, zkRoot, "jlwang");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
        KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

        TopologyBuilder builder = new TopologyBuilder();
        List<String> fieldNameList = new ArrayList<String>();
        fieldNameList.add("linetest");
        fieldNameList.add("line");

        builder.setSpout("kafka-reader", kafkaSpout, 1);                                            // read from Kafka; the spout is named kafka-reader
        builder.setBolt("word-splitter", new kafkaBolt(), 2).shuffleGrouping("kafka-reader");       // kafkaBolt consumes kafka-reader; the bolt is named word-splitter
        builder.setBolt("word-transfer", new transferBolt(), 2).shuffleGrouping("word-splitter");   // transferBolt consumes word-splitter

        Config conf = new Config();
        Map<String, String> HBConfig = Maps.newHashMap();
        HBConfig.put("hbase.zookeeper.property.clientPort", "2181");
        HBConfig.put("hbase.zookeeper.quorum",
                "datanode01.isesol.com:2181,datanode02.isesol.com:2181,datanode03.isesol.com:2181,datanode04.isesol.com:2181");
        HBConfig.put("zookeeper.znode.parent", "/hbase");
        conf.put("HBCONFIG", HBConfig);

        SimpleHBaseMapper mapper = new SimpleHBaseMapper();
        mapper.withColumnFamily("cf");                       // HBase column family
        mapper.withColumnFields(new Fields(fieldNameList));  // HBase columns, looked up in transferBolt's output by the names declared in new Fields
        mapper.withRowKeyField("rowkey");                    // rowkey, taken from transferBolt's "rowkey" field

        HBaseBolt hBaseBolt = new HBaseBolt("test3", mapper).withConfigKey("HBCONFIG");   // test3 is the HBase table
        hBaseBolt.withFlushIntervalSecs(10);                                              // flush to HBase every 10 seconds
        builder.setBolt("hbase", hBaseBolt, 3).shuffleGrouping("word-transfer");          // HBaseBolt consumes word-transfer and stores the data
                                                                                          // using the column, column family and rowkey defined above

        String name = getKafka.class.getSimpleName();
        if (args != null && args.length > 0) {
            conf.setNumWorkers(2);
            // conf.setMessageTimeoutSecs(900);
            LocalCluster localCluster = new LocalCluster();
            localCluster.submitTopology(name, conf, builder.createTopology());
            // StormSubmitter.submitTopology(name, conf, builder.createTopology());
            Utils.sleep(9999999);
        }
    }
}

class transferBolt extends BaseRichBolt {

    private Map conf;
    private TopologyContext context;
    private OutputCollector collector;

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.conf = stormConf;
        this.context = context;
        this.collector = collector;
    }

    public void execute(Tuple input) {
        try {
            String line = input.getString(0);
            // Generate a rowkey and an extra column value, then emit the anchored tuple and ack
            collector.emit(input,
                    new Values(UUID.randomUUID().toString() + "-test1", UUID.randomUUID().toString(), line));
            collector.ack(input);
        } catch (Exception ex) {
            collector.fail(input);
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("rowkey", "linetest", "line"));
    }
}

class kafkaBolt extends BaseRichBolt {

    private Map conf;
    private TopologyContext context;
    private OutputCollector collector;

    public void execute(Tuple input) {
        try {
            String line = input.getString(0);
            collector.emit(input, new Values(line));
            collector.ack(input);
        } catch (Exception ex) {
            collector.fail(input);
        }
    }

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.conf = conf;
        this.context = context;
        this.collector = collector;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
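Note that the main method above only ever runs the topology in a LocalCluster (and only when at least one argument is given); the StormSubmitter call for submitting to a real cluster is commented out. As a hedged sketch, the tail of main could distinguish the two cases as shown below, reusing the conf, builder and name variables from the program above (the storm jar invocation in the comment is the usual way to launch it; the jar name and details depend on your build and deployment):

    if (args != null && args.length > 0) {
        // Submit to a real cluster, e.g.:
        //   storm jar storm-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.isesol.storm.getKafka <topologyName>
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    } else {
        // Local test run
        LocalCluster localCluster = new LocalCluster();
        localCluster.submitTopology(name, conf, builder.createTopology());
        Utils.sleep(9999999);
        localCluster.shutdown();
    }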



[Figure: Storm UI topology statistics]

Comparing the figure above with the program we wrote, let's explain a few of the concepts it shows:

1. emitted is the number of tuples that have been emitted.

2. process latency is the average time spent processing a single tuple.

3. complete latency is the total time from when a message is emitted by the spout until it is fully processed and the ack comes back.

4. acked is the number of tuples confirmed to have been fully processed.

Clearly, by comparing emitted, acked, and complete latency you can see the overall throughput of the topology.




The Storm UI makes it immediately clear how much data has been processed and how efficiently. Always check how many tuples have been emitted versus how many have been processed, so you know the relative speed of the spout and the bolts. If 10,000 tuples have been emitted but only 100 processed, there is clearly a serious performance problem: data is piling up and will sooner or later cause an OOM. So watching the emit rate against the processing rate is very important (one way to cap the backlog is sketched below).
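Beyond watching the UI, a common way to keep data from piling up is to cap the number of un-acked tuples a spout may have in flight (topology.max.spout.pending) and to set the message timeout. The snippet below is a minimal sketch of that idea; the numbers are illustrative and would need tuning for a real topology, and the original program only sets the message timeout in a commented-out line:

import org.apache.storm.Config;

// Minimal sketch: keep the spout from racing far ahead of the bolts.
public class BackpressureConfigExample {
    public static Config build() {
        Config conf = new Config();
        // Cap the number of un-acked tuples a spout may have in flight;
        // the spout stops emitting until some of them are acked or failed.
        conf.setMaxSpoutPending(1000);
        // Tuples not fully acked within this many seconds are failed and replayed.
        conf.setMessageTimeoutSecs(120);
        return conf;
    }
}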
