Understanding the parallelism of a Storm topology

In the past few days I have been test-driving Twitter’s Storm project, which is a distributed real-time data processing platform. One of my findings so far has been that the quality of Storm’s documentation and example code is pretty good — it is very easy to get up and running with Storm. Big props to the Storm developers! At the same time, I found the sections on how a Storm topology runs in a cluster not perfectly clear, and learned that the recent releases of Storm changed some of its behavior in a way that is not yet fully reflected in the Storm wiki and in the API docs.

In this article I want to share my own understanding of the parallelism of a Storm topology after reading the documentation and writing some first prototype code. More specifically, I describe the relationships of worker processes, executors (threads) and tasks, and how you can configure them according to your needs. The article is based on Storm release 0.8.1.

What is Storm?

For those readers unfamiliar with Storm here is a brief description taken from its homepage:

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

What makes a running topology: worker processes, executors and tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  • Worker processes
  • Executors (threads)
  • Tasks

Here is a simple illustration of their relationships:

Figure 1: The relationships of worker processes, executors (threads) and tasks in Storm

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing: each spout or bolt that you implement in your code is executed as a number of tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
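To make the relationship between tasks and executors concrete, here is a minimal sketch in plain Java. It does not use Storm at all: the class TaskDistribution and its method assign are hypothetical names of my own, and the even, contiguous split is an assumption made for illustration, not a description of how Storm's scheduler actually assigns tasks. It only demonstrates the condition #threads ≤ #tasks and the "one task per thread" default.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; it neither uses nor reproduces Storm's scheduler. */
final class TaskDistribution {

    /**
     * Splits task ids 0..numTasks-1 into numExecutors contiguous groups
     * whose sizes differ by at most one (assumed even split).
     */
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        // The article's condition: #threads <= #tasks.
        if (numExecutors < 1 || numExecutors > numTasks) {
            throw new IllegalArgumentException("need 1 <= #executors <= #tasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;
        int extra = numTasks % numExecutors; // the first 'extra' executors get one more task
        int next = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Four tasks on two executors: two tasks per thread.
        System.out.println(assign(4, 2)); // [[0, 1], [2, 3]]
        // Default case: as many tasks as executors, one task per thread.
        System.out.println(assign(6, 6));
    }
}
```

Asking for more executors than tasks, e.g. assign(2, 3), is rejected in this model because it would violate #threads ≤ #tasks.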

Configuring the parallelism of a topology

Note that in Storm’s terminology “parallelism” is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this article though I use the term “parallelism” in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. I will specifically call out when “parallelism” is used in the narrow definition of Storm.

The following table gives an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the table lists only some of them. Storm currently has the following order of precedence for configuration settings: defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration. Please take a look at the Storm documentation for more details.
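The order of precedence can be pictured as a layered merge in which each later source overrides the ones before it. The following plain-Java sketch shows that idea only. The class ConfigPrecedence and the method resolve are hypothetical, all values are made up, and the key strings "topology.workers" and "topology.tasks" are my assumption of what the constants TOPOLOGY_WORKERS and TOPOLOGY_TASKS resolve to. It also flattens everything into one map, whereas component-specific settings in Storm apply per component.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative model of "later layer wins"; not Storm's actual config loader. */
final class ConfigPrecedence {

    /** Merges layers given from lowest to highest precedence; later layers override earlier ones. */
    @SafeVarargs
    static Map<String, Object> resolve(Map<String, Object>... layersLowToHigh) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            effective.putAll(layer);
        }
        return effective;
    }

    public static void main(String[] args) {
        // All values below are made up for illustration.
        Map<String, Object> defaultsYaml = Map.of("topology.workers", 1);
        Map<String, Object> stormYaml = Map.of("topology.workers", 2);
        Map<String, Object> topologyConf = Map.of("topology.workers", 4, "topology.tasks", 2);
        Map<String, Object> internalComponentConf = Map.of("topology.tasks", 4);
        Map<String, Object> externalComponentConf = Map.of("topology.tasks", 8);

        Map<String, Object> effective = resolve(
                defaultsYaml, stormYaml, topologyConf,
                internalComponentConf, externalComponentConf);

        // The topology-specific layer wins for workers,
        // the external component-specific layer wins for tasks.
        System.out.println(effective);
    }
}
```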

What | Description | Configuration option | How to set in your code (examples)
#worker processes | How many worker processes to create for the topology across machines in the cluster. | TOPOLOGY_WORKERS | Config#setNumWorkers
#executors (threads) | How many executors to spawn per component. | ? | TopologyBuilder#setSpout() and TopologyBuilder#setBolt(). Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.
#tasks | How many tasks to create per component. | TOPOLOGY_TASKS | ComponentConfigurationDeclarer#setNumTasks()

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2) // parallelism hint: two executors
               .setNumTasks(4)                             // four tasks in total
               .shuffleGrouping("blue-spout");

In the above code we configured Storm to run the bolt GreenBolt with an initial number of two executors and four associated tasks. Storm will run two tasks per executor (thread). If you do not explicitly configure the number of tasks, Storm will run by default one task per executor.

Example of a running topology

The following illustration shows what a simple topology would look like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

Figure 2: Example of a running topology in Storm

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

// Imports as of Storm 0.8.x
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism hint

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );
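As a quick sanity check on this configuration, the arithmetic can be written out in plain Java. The sketch below is not Storm code: the class ExampleTopologyMath and its helper Component are hypothetical, and the final division assumes that executors are spread evenly over the worker processes, which is my assumption for the sake of the example.

```java
import java.util.List;

/** Plain-Java arithmetic for the example topology; no Storm classes involved. */
final class ExampleTopologyMath {

    /** One component with its executor count (parallelism hint) and task count. */
    static final class Component {
        final String id;
        final int executors;
        final int tasks;

        Component(String id, int executors, int tasks) {
            this.id = id;
            this.executors = executors;
            this.tasks = tasks;
        }
    }

    /** The three components as configured in the article's code. */
    static List<Component> exampleTopology() {
        return List.of(
                new Component("blue-spout", 2, 2),   // tasks default to the executor count
                new Component("green-bolt", 2, 4),   // setNumTasks(4)
                new Component("yellow-bolt", 6, 6)); // tasks default to the executor count
    }

    static int totalExecutors(List<Component> components) {
        int sum = 0;
        for (Component c : components) {
            sum += c.executors;
        }
        return sum;
    }

    static int totalTasks(List<Component> components) {
        int sum = 0;
        for (Component c : components) {
            sum += c.tasks;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Component> topology = exampleTopology();
        int workers = 2; // conf.setNumWorkers(2)
        int executors = totalExecutors(topology);
        System.out.println("executors = " + executors);          // 2 + 2 + 6 = 10
        System.out.println("tasks = " + totalTasks(topology));   // 2 + 4 + 6 = 12
        // Assumption: executors are spread evenly over the worker processes.
        System.out.println("executors per worker = " + executors / workers); // 10 / 2 = 5
    }
}
```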

And of course Storm comes with additional configuration settings to control the parallelism of a topology, including:

  • TOPOLOGY_MAX_TASK_PARALLELISM: This setting puts a ceiling on the number of executors that can be spawned for a single component. It is typically used during testing to limit the number of threads spawned when running a topology in local mode. You can set this option via e.g. Config#setMaxTaskParallelism().

Update Oct 18: Nathan informed me that TOPOLOGY_OPTIMIZE will be removed in a future release. I have therefore removed its entry from the configuration list above.

How to change the parallelism of a running topology

A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.

You have two options to rebalance a topology:

  1. Use the Storm web UI to rebalance the topology.
  2. Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

# Reconfigure the topology "mytopology" to use 5 worker processes,
# the spout "blue-spout" to use 3 executors and
# the bolt "yellow-bolt" to use 10 executors.

$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
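One thing worth checking before a rebalance follows from the condition #threads ≤ #tasks stated earlier: because the number of tasks of a component is fixed for the lifetime of the topology, a requested executor count larger than the task count cannot be met in full. The sketch below is plain Java with a hypothetical class RebalanceCheck. It assumes the request is simply clamped to the task count, which is my assumption and not something the sources for this article spell out. The task counts are those of the example topology as configured above.

```java
/** Illustrative check of requested executor counts against fixed task counts. */
final class RebalanceCheck {

    /**
     * Executor count that can be honored under #executors <= #tasks.
     * Clamping to the task count is an assumption made for this sketch.
     */
    static int effectiveExecutors(int requestedExecutors, int fixedTasks) {
        if (requestedExecutors < 1 || fixedTasks < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.min(requestedExecutors, fixedTasks);
    }

    public static void main(String[] args) {
        // Task counts from the example topology: blue-spout 2, yellow-bolt 6.
        System.out.println("blue-spout:  requested 3,  usable " + effectiveExecutors(3, 2));
        System.out.println("yellow-bolt: requested 10, usable " + effectiveExecutors(10, 6));
    }
}
```

If you expect to scale a component up later, it may therefore be worth giving it more tasks than executors up front via setNumTasks(), as was done for green-bolt.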

References for this article

To compile this article (and to write my related test code) I used information primarily from the following sources:

  • The Storm wiki, most notably the pages Concepts, Configuration, Running topologies on a production cluster, Local mode and Tutorial.
  • The Storm 0.8.1 API documentation, most notably the class Config.
  • The announcement of Storm 0.8.0 release by Nathan Marz on the storm-user mailing list.

Summary

My personal impression is that Storm is a very promising tool. On the one hand I like its clean and elegant design, and on the other hand I was delighted to find that a young open source tool can still have excellent documentation. In this article I tried to summarize my own understanding of the parallelism of topologies, which may or may not be 100% correct. Feel free to let me know if there are any mistakes in the description above!
