Storm Real-Time Big Data Processing (Part 2)


In the previous post (Storm Real-Time Big Data Processing (Part 1)) I introduced Storm's basic concepts and how it works. In this post we start building our own application on the API Storm provides. Getting started with Storm application development is easy, thanks to the simple API its designers have put together for us.

1. Setting Up the Development Environment

In production, a Storm cluster runs on a distributed cluster of Linux machines. Happily, Storm also provides a local mode (Local Mode) that makes developing Storm topologies convenient, and local mode works on Windows as well, so setting up a local-mode Storm development environment is simple: with a Java development environment already in place, just install and configure the Maven project-management tool in Eclipse. That is all it takes.

2. Creating the Project

Create a new Maven Project in Eclipse. The newly created project contains a pom.xml file. Developing a Storm application requires Storm's jars; in the past you might have downloaded them yourself and imported them into the project, but Maven has done away with that chore: for any third-party jar, search the Maven Central repository, copy the dependency into pom.xml, and Maven manages the project's dependencies for you. Storm has been through many releases by now; after searching Maven Central, pick any stable release. I am using Storm 0.9.3. With the Storm dependency added, pom.xml looks like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.yistory</groupId>
  <artifactId>WordCount</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>WordCount</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.storm/storm-core -->
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>0.9.3</version>
    </dependency>
  </dependencies>
</project>
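One thing to be aware of with this version: storm-core 0.9.x still exposes its API under the backtype.storm package namespace, as the imports in the code below show. The packages were renamed to org.apache.storm starting with Storm 1.0, so examples from the two eras are not interchangeable.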

3. Writing the Storm Code

The project counts how many times each word occurs in a stream of generated sentences. The business logic is trivial; our focus is on how a Storm application is put together. Developing a Storm application is the process of building a topology, which really means constructing a directed acyclic graph (DAG). A DAG consists of nodes and directed edges; in Storm, the nodes are Spouts and Bolts, and the edges are the connections between a Spout and a Bolt, or between one Bolt and another.
Storm abstracts the data flowing through a topology as a Stream, and every stream originates at a Spout. In this example, the node that randomly generates sentences is the Spout, and a Spout takes concrete form as a class of a particular kind. There are two ways to write a Spout class in Storm. The first is to implement the IRichSpout interface: create a class, say WordEmitter, that implements the interface; Eclipse will then prompt you to implement all of the interface's methods, and choosing to add the unimplemented methods generates the following code.

package com.yistory.WordCount.spouts;

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;

public class WordEmitter implements IRichSpout {

    public void ack(Object arg0) {
        // TODO Auto-generated method stub
    }

    public void activate() {
        // TODO Auto-generated method stub
    }

    public void close() {
        // TODO Auto-generated method stub
    }

    public void deactivate() {
        // TODO Auto-generated method stub
    }

    public void fail(Object arg0) {
        // TODO Auto-generated method stub
    }

    public void nextTuple() {
        // TODO Auto-generated method stub
    }

    public void open(Map arg0, TopologyContext arg1, SpoutOutputCollector arg2) {
        // TODO Auto-generated method stub
    }

    public void declareOutputFields(OutputFieldsDeclarer arg0) {
        // TODO Auto-generated method stub
    }

    public Map<String, Object> getComponentConfiguration() {
        // TODO Auto-generated method stub
        return null;
    }
}

The second is to extend the BaseRichSpout class. Here too, Eclipse generates the method stubs automatically, as shown below.

package com.yistory.WordCount.spouts;

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;

public class WordEmitter extends BaseRichSpout {

    public void nextTuple() {
        // TODO Auto-generated method stub
    }

    public void open(Map arg0, TopologyContext arg1, SpoutOutputCollector arg2) {
        // TODO Auto-generated method stub
    }

    public void declareOutputFields(OutputFieldsDeclarer arg0) {
        // TODO Auto-generated method stub
    }
}
Both implementing IRichSpout and extending BaseRichSpout let you write the Spout's business logic. The difference is that implementing IRichSpout obliges you to implement every one of its methods, whereas extending BaseRichSpout only requires the three methods a Spout cannot do without. I recommend the latter: the Spout class stays lean rather than bloated. In fact, the Storm designers are applying the adapter pattern here, as the sketch below illustrates.
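To make the adapter pattern concrete, here is a simplified sketch of how Storm 0.9.x's BaseRichSpout adapts the IRichSpout interface (condensed from the real class; in the actual source, getComponentConfiguration() comes from a BaseComponent superclass): the optional lifecycle methods receive empty default bodies, so a subclass only has to supply open(), nextTuple(), and declareOutputFields().

package com.yistory.WordCount.sketch; // illustrative package, not part of the project

import java.util.Map;

import backtype.storm.topology.IRichSpout;

// Simplified sketch of the adapter: empty defaults for the optional
// IRichSpout methods leave only the three essential ones abstract.
public abstract class BaseRichSpoutSketch implements IRichSpout {
    public void close() {}
    public void activate() {}
    public void deactivate() {}
    public void ack(Object msgId) {}
    public void fail(Object msgId) {}
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
    // open(), nextTuple() and declareOutputFields() remain abstract,
    // inherited from IRichSpout, for subclasses to implement.
}

With that in mind, here is the complete code of our sentence-generating Spout; I have commented it, so it should be easy to follow.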

package com.yistory.WordCount.spouts;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class SentenceSpout extends BaseRichSpout {

    private static final long serialVersionUID = 4608825077450573093L;
    private ConcurrentHashMap<UUID, Values> pending;
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "connecting the dots",
        "love and loss",
        "keep looking",
        "do not settle",
        "stay hungry",
        "stay foolish"
    };
    private int index;

    /**
     * In Storm this method acts like the Spout's constructor: it is called
     * when the component is initialized, so initialization work usually
     * goes here.
     */
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.index = 0;
        this.collector = collector;
        this.pending = new ConcurrentHashMap<UUID, Values>();
    }

    /**
     * Declare the fields of the output tuples.
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    /**
     * Storm calls this method over and over while the topology runs;
     * think of it as the body of an endless loop.
     */
    public void nextTuple() {
        Values value = new Values(sentences[index]);
        UUID msgId = UUID.randomUUID();
        this.pending.put(msgId, value);
        this.collector.emit(value, msgId);
        index++;
        if (index >= sentences.length) {
            index = 0;
        }
        // sleep for 100 milliseconds
        Utils.sleep(100);
    }

    /**
     * Called once a tuple has been fully processed.
     */
    public void ack(Object msgId) {
        this.pending.remove(msgId);
    }

    /**
     * If a tuple was not processed successfully, re-emit it.
     */
    public void fail(Object msgId) {
        this.collector.emit(this.pending.get(msgId), msgId);
    }
}
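A note on the ack/fail pair above: because the Spout emits each tuple together with a unique message ID (the second argument to emit), Storm tracks the tree of tuples derived from it. When every downstream Bolt has anchored and acked its output, Storm calls the Spout's ack() and the entry can leave the pending map; if any tuple in the tree fails or times out, Storm calls fail() and the Spout re-emits the saved Values. This is the mechanism behind Storm's at-least-once processing guarantee.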
Next come the Bolts. Just as with Spouts, there are two ways to implement a Bolt in Storm: implement the IRichBolt interface, or extend the BaseRichBolt class. The trade-off between the two is the same as described above for Spouts.

This Bolt splits each sentence into words and passes them to the downstream Bolt. The code is as follows.

package com.yistory.WordCount.bolts;

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SplitSentenceBolt extends BaseRichBolt {

    private static final long serialVersionUID = 2390867112177953110L;
    private OutputCollector collector;

    /**
     * In Storm this method acts like the Bolt's constructor: it is called
     * when the component is initialized, so initialization work usually
     * goes here.
     */
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    /**
     * Declare the fields of the output tuples.
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    /**
     * Called whenever a tuple this Bolt subscribes to arrives.
     */
    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String word : words) {
            word = word.trim();
            // anchor the output tuple to the input tuple
            this.collector.emit(tuple, new Values(word));
        }
        // tell the Spout this tuple has been processed successfully
        this.collector.ack(tuple);
    }
}

This Bolt counts the occurrences of each word and passes the counts to the downstream Bolt. The code is as follows.

package com.yistory.WordCount.bolts;

import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCountBolt extends BaseRichBolt {

    private static final long serialVersionUID = 360868701353402042L;
    private OutputCollector collector;
    private HashMap<String, Integer> counters;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        counters = new HashMap<String, Integer>();
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Integer count = counters.get(word);
        if (null == count) {
            count = 0;
        }
        count++;
        this.counters.put(word, count);
        // anchor the output tuple to the input tuple
        this.collector.emit(tuple, new Values(word, count));
        // ack the input tuple to mark it successfully processed
        this.collector.ack(tuple);
    }
}
This Bolt prints the word counts when the topology finishes running (this is purely for demonstration; in production Storm keeps running until you explicitly stop it).
package com.yistory.WordCount.bolts;

import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class ReportBolt extends BaseRichBolt {

    private static final long serialVersionUID = -1884042962508663765L;
    private OutputCollector collector;
    private HashMap<String, Integer> counts;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Integer>();
    }

    /**
     * This Bolt emits nothing.
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Integer count = tuple.getIntegerByField("count");
        this.counts.put(word, count);
        // ack here too, so the tuple tree rooted at the Spout can complete
        this.collector.ack(tuple);
    }

    public void cleanup() {
        System.out.println("******count result******");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
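One caveat about cleanup(): Storm only guarantees to call it when a topology is killed in local mode, which is exactly the demo scenario here. On a real cluster the worker process may be terminated without cleanup() ever running, so production code should not rely on it.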

At this point all of the topology's nodes have been built; what remains is to wire them together into a directed acyclic graph, which is what the following code does.

package com.yistory.WordCount;

import com.yistory.WordCount.bolts.ReportBolt;
import com.yistory.WordCount.bolts.SplitSentenceBolt;
import com.yistory.WordCount.bolts.WordCountBolt;
import com.yistory.WordCount.spouts.SentenceSpout;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    private static final String SENTENCE_SPOUT_ID = "sentence-spout";
    private static final String SPLIT_BOLT_ID = "split-bolt";
    private static final String COUNT_BOLT_ID = "count-bolt";
    private static final String REPORT_BOLT_ID = "report-bolt";
    private static final String TOPOLOGY_NAME = "word-count-topology";

    public static void main(String[] args) {
        SentenceSpout spout = new SentenceSpout();
        SplitSentenceBolt splitBolt = new SplitSentenceBolt();
        WordCountBolt countBolt = new WordCountBolt();
        ReportBolt reportBolt = new ReportBolt();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout(SENTENCE_SPOUT_ID, spout);
        // SentenceSpout ---> SplitSentenceBolt
        builder.setBolt(SPLIT_BOLT_ID, splitBolt).shuffleGrouping(SENTENCE_SPOUT_ID);
        // SplitSentenceBolt ---> WordCountBolt
        builder.setBolt(COUNT_BOLT_ID, countBolt).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
        // WordCountBolt ---> ReportBolt
        builder.setBolt(REPORT_BOLT_ID, reportBolt).globalGrouping(COUNT_BOLT_ID);

        Config config = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
        // let the topology run for 10 seconds
        Utils.sleep(10000);
        cluster.killTopology(TOPOLOGY_NAME);
        cluster.shutdown();
    }
}
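The three stream groupings used above each play a role. shuffleGrouping distributes the Spout's sentences randomly and evenly across the SplitSentenceBolt tasks; fieldsGrouping on the "word" field guarantees that every tuple carrying the same word reaches the same WordCountBolt task, which is what keeps the per-word counters in its HashMap correct; and globalGrouping routes all count tuples to a single ReportBolt task, so the final report is assembled in one place.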

4. Running the Storm Topology in Local Mode

Running a Storm topology in local mode is easy: in the project's entry class (the one containing the main method), choose Run As > Java Application.
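For comparison, deploying the same topology to a real cluster mainly means replacing LocalCluster with StormSubmitter and packaging the project as a jar. Below is a minimal sketch under those assumptions; the class name WordCountClusterTopology and the worker count are mine, not part of the project above.

package com.yistory.WordCount;

import com.yistory.WordCount.bolts.ReportBolt;
import com.yistory.WordCount.bolts.SplitSentenceBolt;
import com.yistory.WordCount.bolts.WordCountBolt;
import com.yistory.WordCount.spouts.SentenceSpout;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// Hypothetical cluster-mode entry point: same wiring as WordCountTopology,
// but submitted to a running cluster instead of a LocalCluster.
public class WordCountClusterTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new SentenceSpout());
        builder.setBolt("split-bolt", new SplitSentenceBolt()).shuffleGrouping("sentence-spout");
        builder.setBolt("count-bolt", new WordCountBolt()).fieldsGrouping("split-bolt", new Fields("word"));
        builder.setBolt("report-bolt", new ReportBolt()).globalGrouping("count-bolt");

        Config config = new Config();
        config.setNumWorkers(2); // request two worker processes (illustrative)

        // submits the topology to whatever cluster the storm client points at
        StormSubmitter.submitTopology("word-count-topology", config, builder.createTopology());
    }
}

On a machine with the Storm client installed, you would package the project and submit it with something like `storm jar WordCount-0.0.1-SNAPSHOT.jar com.yistory.WordCount.WordCountClusterTopology`.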

With all of the code above in place, the overall project structure is as follows.

[Figure: screenshot of the project structure in Eclipse]
Running the project produces results like the following.

[Figure: screenshot of the console output with the word counts]
That completes a first Storm project; development on top of Storm generally follows the pattern shown here. The exposition in this post may not flow perfectly from start to finish, so you may want to read it through once and then go back over each part in detail.
