Storm Concepts

Source: Internet. Editor: 程序博客网. Time: 2024/06/15 01:31

Concepts

This page lists the main concepts of Storm and links to resources where you can find more information. The concepts discussed are:

Topologies
Streams
Spouts
Bolts
Stream groupings
Reliability
Tasks
Workers

Topologies

The logic for a realtime application is packaged into a Storm topology. A topology is a graph of spouts and bolts that are connected with stream groupings.

These concepts are described below.

Resources:

TopologyBuilder: use this class to construct topologies in Java
Running topologies on a production cluster
Local mode: Read this to learn how to develop and test topologies in local mode.

Streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".

Resources:

Tuple: streams are composed of tuples
OutputFieldsDeclarer: used to declare streams and their schemas
Serialization: Information about Storm's dynamic typing of tuples and declaring custom serializations

Spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.
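The non-blocking rule can be illustrated with a small plain-Java model. This is only a sketch and does not use the Storm API: the class name, the offer method, and the emitted list standing in for the collector are all assumptions made for illustration. The point it shows is that nextTuple polls a buffer and returns straight away when the buffer is empty, instead of waiting for data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Plain-Java model of a spout; hypothetical, it does not use the Storm API. */
class NonBlockingSpoutSketch {
    // Hypothetical buffer that some external reader thread fills.
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    // Stands in for the collector: records what was emitted.
    private final List<String> emitted = new ArrayList<>();

    /** Called by the (hypothetical) external source to hand over a message. */
    void offer(String message) {
        pending.add(message);
    }

    /** Emits at most one tuple and returns immediately when nothing is waiting. */
    void nextTuple() {
        String message = pending.poll(); // poll() never blocks; null means "empty"
        if (message == null) {
            return; // nothing to emit this round
        }
        emitted.add(message);
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        NonBlockingSpoutSketch spout = new NonBlockingSpoutSketch();
        spout.nextTuple();      // empty buffer: returns without emitting
        spout.offer("hello");
        spout.nextTuple();      // emits "hello"
        System.out.println(spout.emitted());
    }
}
```

Because the same thread also delivers ack and fail, a nextTuple that waited on the external source would delay those calls as well, which is why the sketch polls rather than waits.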

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.

Resources:

IRichSpout: this is the interface that spouts must implement.
Guaranteeing message processing

Bolts

All processing in topologies is done in bolts. Bolts can do anything: filtering, functions, aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts.

The main method in bolts is the execute method, which takes a new tuple as input. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

It's perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.

Resources:

IRichBolt: this is the general interface for bolts.
IBasicBolt: this is a convenience interface for defining bolts that do filtering or simple functions.
OutputCollector: bolts emit tuples to their output streams using an instance of this class
Guaranteeing message processing

Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.

There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each task is guaranteed to get an equal number of tuples.
Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id" values may go to different tasks.
Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but is load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods of OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
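To make the routing rules above concrete, here is a minimal plain-Java model of how three of the groupings could pick target tasks. It is a sketch, not Storm's implementation: the class and method names are hypothetical, and hashing the grouping field values modulo the task count is an assumption that simply satisfies the stated guarantee (equal field values go to the same task).

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of how three groupings pick target tasks; not Storm's own code. */
class GroupingSketch {
    /** Fields grouping: equal field values always map to the same task index. */
    static int fieldsGrouping(List<Object> groupingFieldValues, int numTasks) {
        // floorMod keeps the index non-negative even when the hash is negative.
        return Math.floorMod(groupingFieldValues.hashCode(), numTasks);
    }

    /** Global grouping: the entire stream goes to the task with the lowest id. */
    static int globalGrouping(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    /** All grouping: every task receives a copy of the tuple. */
    static int[] allGrouping(int[] taskIds) {
        return taskIds.clone();
    }

    public static void main(String[] args) {
        int numTasks = 4;
        int first = fieldsGrouping(Arrays.asList((Object) "user-42"), numTasks);
        int second = fieldsGrouping(Arrays.asList((Object) "user-42"), numTasks);
        System.out.println(first == second);                     // prints true
        System.out.println(globalGrouping(new int[] {7, 3, 5})); // prints 3
    }
}
```

Two different "user-id" values may still land on the same task in this model, which matches the wording above: different values may, but need not, go to different tasks.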

Resources:

TopologyBuilder: use this class to define topologies
InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.
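The tracking described above can be modelled in a few lines of plain Java. This is a simplified sketch and not how Storm itself does the bookkeeping: the class, its method names, and the per-tree counter are assumptions chosen for readability, and time is passed in explicitly so the behaviour is deterministic. It shows a tree completing once every tuple in it has been acked, and failing once the message timeout has elapsed.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of tuple-tree tracking with a message timeout. Not Storm's code. */
class TupleTreeTrackerSketch {
    enum Status { PENDING, COMPLETED, FAILED }

    /** Bookkeeping for the tree rooted at one spout tuple. */
    private static final class Tree {
        int outstanding = 1;            // starts with the spout tuple itself
        final long startedAtMs;
        Status status = Status.PENDING;

        Tree(long startedAtMs) {
            this.startedAtMs = startedAtMs;
        }
    }

    private final Map<String, Tree> trees = new HashMap<>();
    private final long timeoutMs;

    TupleTreeTrackerSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** A spout emitted a tuple, which becomes the root of a new tree. */
    void spoutEmitted(String rootId, long nowMs) {
        trees.put(rootId, new Tree(nowMs));
    }

    /** A bolt emitted a tuple belonging to this tree, so one more ack is needed. */
    void childEmitted(String rootId) {
        Tree tree = trees.get(rootId);
        if (tree != null && tree.status == Status.PENDING) {
            tree.outstanding++;
        }
    }

    /** One tuple of the tree was acked; the tree completes when none remain. */
    void acked(String rootId) {
        Tree tree = trees.get(rootId);
        if (tree != null && tree.status == Status.PENDING && --tree.outstanding == 0) {
            tree.status = Status.COMPLETED;
        }
    }

    /** Fails every pending tree that has outlived the message timeout. */
    void checkTimeouts(long nowMs) {
        for (Tree tree : trees.values()) {
            if (tree.status == Status.PENDING && nowMs - tree.startedAtMs > timeoutMs) {
                tree.status = Status.FAILED; // a real system would replay the spout tuple
            }
        }
    }

    Status status(String rootId) {
        Tree tree = trees.get(rootId);
        return tree == null ? null : tree.status;
    }

    public static void main(String[] args) {
        TupleTreeTrackerSketch tracker = new TupleTreeTrackerSketch(1000);
        tracker.spoutEmitted("a", 0);
        tracker.childEmitted("a");
        tracker.acked("a");
        tracker.acked("a");
        System.out.println(tracker.status("a")); // prints COMPLETED
    }
}
```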

Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
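The arithmetic in the example above (300 tasks over 50 workers gives 6 tasks each) can be checked with a short plain-Java sketch. The round-robin assignment used here is an assumption for illustration only; it is one simple way to spread tasks evenly and is not a description of Storm's scheduler. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of spreading tasks evenly over workers; not Storm's scheduler. */
class WorkerAssignmentSketch {
    /** Assigns task ids 0..numTasks-1 to workers round-robin. */
    static List<List<Integer>> assign(int numTasks, int numWorkers) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            workers.get(task % numWorkers).add(task);
        }
        return workers;
    }

    public static void main(String[] args) {
        List<List<Integer>> workers = assign(300, 50);
        System.out.println(workers.get(0).size()); // prints 6
    }
}
```

When the task count is not a multiple of the worker count, this model leaves some workers with one task more than others, for example 7 tasks over 3 workers gives 3, 2, and 2.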

Resources:

Config.TOPOLOGY_WORKERS: this config sets the number of workers to allocate for executing the topology
