storm-[1]-Basics of Storm (Study Notes)

  • Documentation · nathanmarz/storm Wiki https://github.com/nathanmarz/storm/wiki/Documentation
  • Storm, distributed and fault-tolerant realtime computation : http://storm-project.net/
  • http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation

           You can refer to the following links when using Storm:

  • Getting Started With Storm
  • Storm is usually paired with Kafka; this project is a useful reference: https://github.com/miguno/kafka-storm-starter
  • Storm also supports multilang; I have put together a demo here: https://github.com/dirtysalt/tomb/tree/master/scala/kafka-streaming
  • http://storm.apache.org/documentation/Tutorial.html
  • http://storm.apache.org/documentation/Concepts.html

    Concepts

     The concepts discussed are:

    1. Topologies
    2. Streams
    3. Spouts
    4. Bolts
    5. Stream groupings
    6. Reliability
    7. Tasks
    8. Workers

    Topologies

     A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it).

    A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.

    Resources:

    • TopologyBuilder: use this class to construct topologies in Java
    • Running topologies on a production cluster
    • Local mode: Read this to learn how to develop and test topologies in local mode.
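
    As a minimal sketch of wiring a topology (assuming Storm 1.x package names and the usual org.apache.storm imports; MySpout and MyBolt are hypothetical placeholder components, not from the original text):

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new MySpout(), 2);             // a spout with parallelism hint 2
    builder.setBolt("counter", new MyBolt(), 4)              // a bolt with parallelism hint 4
           .fieldsGrouping("words", new Fields("word"));     // connected by a stream grouping
    StormTopology topology = builder.createTopology();       // the graph to submit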

    Streams

    A stream is an unbounded sequence of tuples. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

    Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".
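
    As a quick illustration (the "updates" stream id and field names are invented for this sketch), a component can declare both a default stream and an explicitly named stream in declareOutputFields:

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Convenience method: this stream gets the default id "default".
        declarer.declare(new Fields("word"));
        // Explicit form: this stream gets the custom id "updates".
        declarer.declareStream("updates", new Fields("user-id", "count"));
    }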

    Resources:

    • Tuple: streams are composed of tuples
    • OutputFieldsDeclarer: used to declare streams and their schemas
    • Serialization: Information about Storm's dynamic typing of tuples and declaring custom serializations

    Spouts

    A spout is a source of streams in a topology. Generally spouts will read tuples from an external source (e.g. a Kestrel queue or the Twitter API) and emit them into the topology.

    Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

    The main methods on spouts are:

    •  nextTuple: either emits a new tuple into the topology or simply returns if there are no new tuples to emit.
    •  ack and fail: These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.
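
    A minimal spout sketch following these rules (assuming Storm 1.x package names; the SentenceSpout class and its in-memory queue are illustrative, not from the original text):

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Queue<String> pending = new ConcurrentLinkedQueue<String>();

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            String sentence = pending.poll();
            if (sentence == null) {
                return;                                  // nothing to emit: simply return
            }
            // Passing a message id makes this a reliable emit: Storm will call
            // ack/fail on this spout with the same id once the tuple tree finishes.
            collector.emit(new Values(sentence), sentence);
        }

        public void ack(Object msgId)  { /* the tuple tree completed successfully */ }

        public void fail(Object msgId) {
            pending.offer((String) msgId);               // requeue so it can be replayed
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }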

    Resources:

    • IRichSpout: this is the interface that spouts must implement.
    • Guaranteeing message processing

    Bolts

    All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

    Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. 

    Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

    When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

    The main method in bolts is the execute method, which takes in a new tuple as input. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

    It's perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.
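
    Putting the pieces together, a hedged sketch of a bolt's execute method emitting to the named "updates" stream declared earlier (the stream id and fields are illustrative), anchored to the input tuple:

    public void execute(Tuple input) {
        // emit(streamId, anchor, tuple): send to the "updates" stream, anchored to input.
        _collector.emit("updates", input, new Values(input.getValue(0), 1));
        _collector.ack(input);                     // mark the input tuple as processed
    }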

    Resources:

    • IRichBolt: this is the general interface for bolts.
    • IBasicBolt: this is a convenience interface for defining bolts that do filtering or simple functions.
    • OutputCollector: bolts emit tuples to their output streams using an instance of this class
    • Guaranteeing message processing

    Stream groupings

     Part of defining a topology is specifying, for each bolt, which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.

     There are eight built-in stream groupings, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface (see the sketch after this list):

    1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
    2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
    3. Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but is load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
    4. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
    5. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
    6. None grouping: Currently, none groupings are equivalent to shuffle groupings. 
    7. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods on OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
    8. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
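
    As a sketch of how several of these groupings are wired up in code (the component classes are the hypothetical ones used throughout these notes; the parallelism hints are arbitrary):

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout());
    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("sentences");                  // 1. random, even distribution
    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));   // 2. same word -> same task
    builder.setBolt("report", new ReportBolt())
           .globalGrouping("count");                       // 5. entire stream to one task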

    Resources:

    • TopologyBuilder: use this class to define topologies
    • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped

    Workers

    Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

    Resources:

    • Config.TOPOLOGY_WORKERS: this config sets the number of workers to allocate for executing the topology

    Configuration

    Storm has a variety of configurations for tweaking the behavior of Nimbus, supervisors, and running topologies:

    •  Every configuration has a default value defined in defaults.yaml in the Storm codebase.
    •  You can override these defaults by defining a storm.yaml in the classpath of Nimbus and the supervisors.
    •  You can also define a topology-specific configuration that you submit along with your topology when using StormSubmitter. However, the topology-specific configuration can only override configs prefixed with "TOPOLOGY".

    The Java API lets you specify component specific configurations in two ways:

    1. Internally: Override getComponentConfiguration in any spout or bolt and return the component-specific configuration map.
    2. Externally: setSpout and setBolt in TopologyBuilder return an object with methods addConfiguration and addConfigurations that can be used to override the configurations for the component.

    The preference order for configuration values is:

    defaults.yaml < storm.yaml < topology specific configuration < internal component specific configuration < external component specific configuration.
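
    A hedged sketch of the two component-level mechanisms as code fragments (the tick-frequency config key and the values 10/30 are purely illustrative; the usual imports are assumed):

    // 1. Internally: override getComponentConfiguration in the spout or bolt.
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);   // illustrative setting
        return conf;
    }

    // 2. Externally: addConfiguration on the declarer returned by setBolt,
    //    which takes precedence over the internal value above.
    builder.setBolt("count", new WordCount(), 12)
           .addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 30)
           .fieldsGrouping("split", new Fields("word"));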

    Resources:

    • Config: a listing of all configurations as well as a helper class for creating topology specific configurations
    • defaults.yaml: the default values for all configurations
    • Setting up a Storm cluster: explains how to create and configure a Storm cluster
    • Running topologies on a production cluster: lists useful configurations when running topologies on a cluster
    • Local mode: lists useful configurations when using local mode

    Guaranteeing Message Processing

    How a spout achieves reliable consumption

    Take KestrelSpout consuming a Kestrel message queue as an example. When KestrelSpout opens a message from the Kestrel queue, the message is not removed from the queue; it is only marked as "pending" (awaiting an ack), and a pending message cannot be consumed by other consumers. When a message is opened, Kestrel also returns an id identifying it. Storm later calls ack or fail on KestrelSpout depending on whether the message was fully processed or timed out: on ack, KestrelSpout tells Kestrel the message can truly be taken off the queue; on fail, the message is put back on the queue (its pending state cleared, so other consumers can process it). Here is an example topology that consumes sentences from Kestrel:
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
                                                   22133,
                                                   "sentence_queue",
                                                   new StringScheme()));
    builder.setBolt("split", new SplitSentence(), 10)
           .shuffleGrouping("sentences");
    builder.setBolt("count", new WordCount(), 20)
           .fieldsGrouping("split", new Fields("word"));

    Using the API to achieve reliable processing

    Two things are required:
    • Anchoring: tell Storm that a new link has been created in the tuple tree. Anchoring is done in the emit call by passing the input tuple as the first argument, e.g. _collector.emit(tuple, new Values(word)); this anchors the emitted tuple to the input tuple.
    • Tell Storm when you have finished processing an individual tuple.

    1-Anchoring: tell Storm a new link has been created in the tuple tree

    Specifying a link in the tuple tree is called anchoring. Anchoring is done at the same time you emit a new tuple. Let's use the following bolt as an example. This bolt splits a tuple containing a sentence into a tuple for each word:

    public class SplitSentence extends BaseRichBolt {
        OutputCollector _collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        public void execute(Tuple tuple) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                _collector.emit(tuple, new Values(word));
            }
            _collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }
     An output tuple can be anchored to more than one input tuple (useful in streaming joins or aggregations). When a multi-anchored tuple fails, multiple spout tuples will be replayed:
    List<Tuple> anchors = new ArrayList<Tuple>();
    anchors.add(tuple1);
    anchors.add(tuple2);
    _collector.emit(anchors, new Values(1, 2, 3));

    2-Tell Storm when you have finished processing a tuple

    Tell Storm via the ack and fail methods on OutputCollector. As in the SplitSentence example, ack is called once, after all the word tuples have been emitted: _collector.ack(tuple);

    fail reports the failure of a downstream tuple back to the spout tuple. You can choose to call fail when catching an exception and treat the exception as the failure, so the spout tuple learns of the failure immediately instead of waiting for a time-out.

    Storm tracks every tuple in memory, so every tuple that is processed must be acked or failed; otherwise the topology will eventually run out of memory.
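
    For instance, a bolt can fail a tuple as soon as it catches an exception instead of letting the spout tuple time out; a minimal sketch (process() is a hypothetical helper, not part of the Storm API):

    public void execute(Tuple tuple) {
        try {
            process(tuple);              // hypothetical per-tuple processing step
            _collector.ack(tuple);       // tell Storm the tuple completed
        } catch (Exception e) {
            _collector.fail(tuple);      // fail fast rather than waiting for the time-out
        }
    }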

    Bolts commonly read a single input tuple, emit tuples based on it, and then ack it at the end of the execute method. The IBasicBolt interface encapsulates exactly this pattern and does the acking automatically (it does not support multi-anchoring). SplitSentence implemented as a BasicBolt looks like this:


    public class SplitSentence extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    Command Line Client

    http://storm.apache.org/releases/1.1.0/Command-line-client.html

     

    Understanding the Parallelism of a Storm Topology

    Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

    1. Worker processes
    2. Executors (threads)
    3. Tasks

    Here is a simple illustration of their relationships:

    [Figure: The relationships of worker processes, executors (threads) and tasks in Storm]

    2-Configuring the parallelism of a topology

    Number of worker processes

    • Description: How many worker processes to create for the topology across machines in the cluster.
    • Configuration option: TOPOLOGY_WORKERS
    • How to set in your code (examples):
      • Config#setNumWorkers

    Number of executors (threads)

    • Description: How many executors to spawn per component.
    • Configuration option: None (pass parallelism_hint parameter to setSpout or setBolt)
    • How to set in your code (examples):
      • TopologyBuilder#setSpout()
      • TopologyBuilder#setBolt()
      • Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.

    Number of tasks

    • Description: How many tasks to create per component.
    • Configuration option: TOPOLOGY_TASKS
    • How to set in your code (examples):
      • ComponentConfigurationDeclarer#setNumTasks()

    Here is an example code snippet to show these settings in practice:

    topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)               .setNumTasks(4)               .shuffleGrouping("blue-spout");

    This runs the bolt GreenBolt with 2 executors and 4 associated tasks, so each executor runs two tasks.

    An example of topology parallelism

    The following illustration shows what a simple topology looks like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

    [Figure: Example of a running topology in Storm]

    The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

    Config conf = new Config();
    conf.setNumWorkers(2); // use two worker processes

    topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

    topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
                   .setNumTasks(4)
                   .shuffleGrouping("blue-spout");

    topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
                   .shuffleGrouping("green-bolt");

    StormSubmitter.submitTopology(
            "mytopology",
            conf,
            topologyBuilder.createTopology()
    );

    Parallelism can also be controlled through a configuration option:

    • TOPOLOGY_MAX_TASK_PARALLELISM: caps the maximum parallelism of a single component. It is typically used to limit the number of threads spawned when testing in local mode. Set it via e.g. Config#setMaxTaskParallelism(), as sketched below.
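
    A hedged sketch of capping component parallelism for a local-mode test (the cap of 3 and the topology name are illustrative):

    Config conf = new Config();
    conf.setMaxTaskParallelism(3);       // no single component may exceed 3 threads

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count-test", conf, topologyBuilder.createTopology());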

    Changing the parallelism of a running topology

    Rebalancing: the parallelism of a running topology can be changed without restarting the cluster or the topology.

    You have two options to rebalance a topology:

    1. Use the Storm web UI to rebalance the topology.
    2. Use the CLI tool storm rebalance as described below.

    Here is an example of using the CLI tool:

    ## Reconfigure the topology "mytopology" to use 5 worker processes,
    ## the spout "blue-spout" to use 3 executors and
    ## the bolt "yellow-bolt" to use 10 executors.
    $ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
