Storm Parallelism

Source: Internet. Editor: 程序博客网. Date: 2024/04/28 08:51

Understanding the Parallelism of a Storm Topology

What makes a running topology: worker processes, executors and tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

1.      Worker processes

2.      Executors (threads)

3.      Tasks

 

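Here is a simple illustration of their relationships:

A machine in the cluster may run one or more worker processes for one or more topologies. Each worker process runs executors for one specific topology.

A single worker process may run one or more executors. Each executor is a thread spawned by the worker process, and each executor runs one or more tasks of the same component (spout or bolt).

A task performs the actual data processing.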

 

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing: each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
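The relationship between tasks and executors can be sketched in plain Java. This is only an illustrative model, not Storm's scheduler code: the class `TaskDistribution` and the method `tasksPerExecutor` are hypothetical names, and the even spreading of tasks over threads is an assumption made for the sketch. It shows why the number of threads never exceeds the number of tasks.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only (hypothetical helper, not part of Storm). */
class TaskDistribution {

    /** Returns how many tasks each executor runs, spreading tasks as evenly as possible. */
    static List<Integer> tasksPerExecutor(int numTasks, int numExecutors) {
        if (numTasks < 1 || numExecutors < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // An executor without a task would have nothing to run,
        // so the model never uses more threads than tasks.
        int executors = Math.min(numExecutors, numTasks);
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < executors; i++) {
            // The first (numTasks % executors) executors take one extra task.
            counts.add(numTasks / executors + (i < numTasks % executors ? 1 : 0));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tasksPerExecutor(4, 2)); // four tasks on two executors
        System.out.println(tasksPerExecutor(6, 6)); // default case: one task per executor
    }
}
```

Because the task count is fixed for the lifetime of the topology, changing the executor count in this model only changes how the same tasks are grouped into threads.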

Configuring the parallelism of a topology

Note that in Storm's terminology "parallelism" is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this document though we use the term "parallelism" in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. We will specifically call out when "parallelism" is used in the normal, narrow definition of Storm.

The following sections give an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the lists below show only some of them. Storm currently has the following order of precedence for configuration settings, from lowest to highest:

defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration
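One way to picture this precedence order is as a stack of maps where each later layer overwrites the keys of the earlier ones. The sketch below is a simplified model, not how Storm loads its configuration internally: `ConfigPrecedence` and `resolve` are hypothetical names, and the values put into each layer are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of layered configuration (hypothetical helper, not part of Storm). */
class ConfigPrecedence {

    /** Merges layers given from lowest to highest precedence. */
    static Map<String, Object> resolve(List<Map<String, Object>> layersLowToHigh) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            effective.putAll(layer); // a later layer overwrites earlier values
        }
        return effective;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, Object> defaultsYaml = new LinkedHashMap<>();
        defaultsYaml.put("topology.workers", 1);
        defaultsYaml.put("topology.debug", false);

        Map<String, Object> stormYaml = new LinkedHashMap<>();
        stormYaml.put("topology.workers", 2);

        Map<String, Object> topologyConf = new LinkedHashMap<>();
        topologyConf.put("topology.workers", 4);

        // The topology-specific value wins; untouched keys fall through from defaults.
        System.out.println(resolve(Arrays.asList(defaultsYaml, stormYaml, topologyConf)));
    }
}
```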

Number of worker processes

·        Description: How many worker processes to create for the topology across machines in the cluster.

·        Configuration option: TOPOLOGY_WORKERS

·        How to set in your code (examples):

o   Config#setNumWorkers

Number of executors (threads)

·        Description: How many executors to spawn per component.

·        Configuration option: None (pass the parallelism_hint parameter to setSpout or setBolt)

·        How to set in your code (examples):

o   TopologyBuilder#setSpout()

o   TopologyBuilder#setBolt()

o   Note that as of Storm 0.8 the parallelism_hint parameter now specifies the initial number of executors (not tasks!) for that bolt.

Number of tasks

·        Description: How many tasks to create per component.

·        Configuration option: TOPOLOGY_TASKS

·        How to set in your code (examples):

o   ComponentConfigurationDeclarer#setNumTasks()

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

In the above code we configured Storm to run the bolt GreenBolt with an initial number of two executors and four associated tasks. Storm will run two tasks per executor (thread). If you do not explicitly configure the number of tasks, Storm will run by default one task per executor.

Example of a running topology

This example shows how a simple topology would look in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );
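As a cross-check of this configuration, the totals can be worked out by hand: the three components request 2 + 2 + 6 = 10 executors and 2 + 4 + 6 = 12 tasks, to be hosted by 2 worker processes. The plain-Java sketch below repeats that arithmetic. `ExampleTopologyTotals` is a hypothetical helper, and the even spread of executors over workers is an assumption of the sketch rather than a guarantee about Storm's scheduler.

```java
/** Arithmetic for the example topology (hypothetical helper, not part of Storm). */
class ExampleTopologyTotals {

    /** Adds up per-component counts. */
    static int sum(int... values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Largest number of executors any one worker holds if they are spread evenly. */
    static int maxExecutorsPerWorker(int totalExecutors, int workers) {
        return (totalExecutors + workers - 1) / workers; // ceiling division
    }

    public static void main(String[] args) {
        int workers = 2;
        int executors = sum(2, 2, 6); // blue-spout, green-bolt, yellow-bolt
        int tasks = sum(2, 4, 6);     // green-bolt has 4 tasks; the others default to their hint
        System.out.println("executors=" + executors
                + ", tasks=" + tasks
                + ", executors per worker<=" + maxExecutorsPerWorker(executors, workers));
    }
}
```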

And of course Storm comes with additional configuration settings to control the parallelism of a topology, including:

·        TOPOLOGY_MAX_TASK_PARALLELISM: This setting puts a ceiling on the number of executors that can be spawned for a single component. It is typically used during testing to limit the number of threads spawned when running a topology in local mode. You can set this option via e.g. Config#setMaxTaskParallelism().
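The effect of such a ceiling can be modelled as taking the smaller of the requested parallelism hint and the configured maximum. The sketch below follows the description given here and is not taken from Storm's source: `MaxParallelismCap` and `effectiveExecutors` are hypothetical names, and a `null` maximum stands for "option not set".

```java
/** Simplified model of a per-component ceiling (hypothetical helper, not part of Storm). */
class MaxParallelismCap {

    /** Returns the executor count after applying the optional ceiling. */
    static int effectiveExecutors(int parallelismHint, Integer maxTaskParallelism) {
        if (parallelismHint < 1) {
            throw new IllegalArgumentException("parallelism hint must be positive");
        }
        if (maxTaskParallelism == null) {
            return parallelismHint; // no ceiling configured
        }
        return Math.min(parallelismHint, maxTaskParallelism);
    }

    public static void main(String[] args) {
        // With a ceiling of 3, a hint of 6 is reduced while a hint of 2 is unaffected.
        System.out.println(effectiveExecutors(6, 3));
        System.out.println(effectiveExecutors(2, 3));
    }
}
```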

How to change the parallelism of a running topology

A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.

You have two options to rebalance a topology:

1.      Use the Storm web UI to rebalance the topology.

2.      Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

## Reconfigure the topology "mytopology" to use 5 worker processes,
## the spout "blue-spout" to use 3 executors and
## the bolt "yellow-bolt" to use 10 executors.

$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
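If the rebalance command has to be assembled by a program, for instance in a deployment script, the argument list can be built from the desired settings. The sketch below only constructs the arguments and does not run anything. `RebalanceCommand` and `build` are hypothetical names, and the sketch relies solely on the `-n` and `-e` options shown in the command above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds the argument list for "storm rebalance" (hypothetical helper, not part of Storm). */
class RebalanceCommand {

    static List<String> build(String topology, int numWorkers, Map<String, Integer> executors) {
        List<String> cmd = new ArrayList<>(
                Arrays.asList("storm", "rebalance", topology, "-n", String.valueOf(numWorkers)));
        // One "-e component=count" pair per component whose executor count changes.
        for (Map.Entry<String, Integer> entry : executors.entrySet()) {
            cmd.add("-e");
            cmd.add(entry.getKey() + "=" + entry.getValue());
        }
        return cmd;
    }

    public static void main(String[] args) {
        Map<String, Integer> executors = new LinkedHashMap<>(); // keeps insertion order
        executors.put("blue-spout", 3);
        executors.put("yellow-bolt", 10);
        System.out.println(String.join(" ", build("mytopology", 5, executors)));
    }
}
```

Returning a list rather than a single string keeps each argument separate, which is the form `java.lang.ProcessBuilder` expects if the command is later executed.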


 
