Storm Trident Internals


In this article I’m going to talk about the internal design of Storm Trident. It is aimed at readers who want to understand how Trident is built internally and how it ensures exactly-once semantics. If you are not familiar with Storm, please read the Storm documentation first.

Topology Creation

Trident is built on top of a normal Storm topology. When you use the Trident API to build a topology, Trident represents each of your operations internally as a node, which it uses both for optimisation and for building the topology. There are three kinds of nodes: spout nodes, partition nodes and processor nodes. Together these nodes form a directed graph, which Trident optimises by merging as many nodes as possible to reduce network transfer between them. After merging, the nodes are grouped at two levels:

  • A batch group is a maximal connected component of the graph.
  • A group is a set of nodes that can be merged into one bolt.
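
To make the notion of a batch group concrete, here is a minimal sketch that computes the connected components of a small operation graph. It assumes edge direction is ignored when deciding connectivity. The class name, the node names and the string-based graph representation are all hypothetical and are not Trident's internal types.

```java
import java.util.*;

// Hypothetical sketch: not Trident's internal classes. Nodes are plain strings.
class BatchGroups {
    /** Returns the connected components of the graph, ignoring edge direction (an assumption). */
    static List<Set<String>> connectedComponents(Set<String> nodes, List<String[]> edges) {
        // Build an undirected adjacency list.
        Map<String, List<String>> adj = new HashMap<>();
        for (String n : nodes) adj.put(n, new ArrayList<>());
        for (String[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        List<Set<String>> components = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String start : nodes) {
            if (!seen.add(start)) continue; // already assigned to a component
            Set<String> component = new TreeSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            // Breadth-first search collects every node reachable from start.
            while (!queue.isEmpty()) {
                String current = queue.poll();
                component.add(current);
                for (String next : adj.get(current)) {
                    if (seen.add(next)) queue.add(next);
                }
            }
            components.add(component);
        }
        return components;
    }

    public static void main(String[] args) {
        // Two independent streams, so two batch groups are expected.
        Set<String> nodes = new TreeSet<>(Arrays.asList("spoutA", "each1", "agg1", "spoutB", "each2"));
        List<String[]> edges = Arrays.asList(
                new String[]{"spoutA", "each1"},
                new String[]{"each1", "agg1"},
                new String[]{"spoutB", "each2"});
        System.out.println(connectedComponents(nodes, edges));
    }
}
```

In this sketch, two streams that never meet end up in separate batch groups, while anything that joins or merges them would pull them into the same one.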

As mentioned above, the Trident topology builder eventually creates a Storm topology. Here is how that topology is put together:

  1. A MasterBatchCoordinator is created for each batch group, and it is the only spout in that batch group.
  2. For each real spout (e.g. TridentKafkaSpout) you create when building the topology, Trident creates a subtopology. This subtopology consists of a TridentSpoutCoordinator and a TridentSpoutExecutor. Both are bolts, and they execute the Coordinator and the Emitter (e.g. those generated by an ITridentSpout) respectively.
  3. A SubtopologyBolt is created for each group and is executed by a TridentBoltExecutor.

Message Flow

The most interesting part of Trident is how it ensures exactly-once semantics. The basic idea is simple: exactly-once semantics are built on top of at-least-once semantics, but when state is involved, the system needs to preserve the order of transactions. Since failures are infrequent, Trident pipelines batches to avoid wasting resources on waiting. This means that stateless processors can execute later transactions before earlier transactions have been committed.
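
The idea of layering exactly-once on top of at-least-once can be sketched as follows. This is a simplified illustration, not Trident's state implementation: the class TransactionalCounter and its methods are hypothetical, and the sketch assumes that commits arrive in transaction id order, that ids start at 1, and that a replayed batch carries the same id and the same contents as the original.

```java
// Hypothetical sketch of a state that remembers the last committed transaction id.
class TransactionalCounter {
    private long lastTxid = 0; // assumption: real transaction ids start at 1
    private long count = 0;

    /**
     * Applies the partial count of one batch.
     * Returns false if the batch was already committed and is therefore skipped.
     */
    boolean commit(long txid, long batchCount) {
        if (txid == lastTxid) {
            return false; // replay of the last committed batch: applying it again would double count
        }
        if (txid < lastTxid) {
            throw new IllegalStateException("out-of-order commit: " + txid + " after " + lastTxid);
        }
        count += batchCount;
        lastTxid = txid; // in a real store, value and txid must be written atomically
        return true;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        TransactionalCounter counter = new TransactionalCounter();
        counter.commit(1, 5);
        counter.commit(1, 5); // replayed after a failure, ignored
        counter.commit(2, 3);
        System.out.println(counter.count());
    }
}
```

Because the check only compares against the last committed id, it works only if commits are strictly ordered, which is why order matters for stateful processors while stateless ones are free to run ahead.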

[Figure: message flow in a Trident topology]

The figure above shows the message flow between the components of a Trident topology. There are some interesting points here:

  • The coordination stream is emitted by a TridentBoltExecutor to indicate the end of a batch. When a downstream TridentBoltExecutor has received a coordination tuple from all of its upstream bolts, it knows that it has received every tuple of the batch. It then executes the finishBatch method of ITridentBatchBolt and emits a coordination tuple to its own downstream bolts.
  • Here we assume that TridentBoltExecutor1 contains a stateful processor, which is why the MasterBatchCoordinator sends the commit stream to it.
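
The end-of-batch detection described in the first point can be sketched as a small counter per batch. This is an illustration only: CoordinationTracker is a hypothetical class, not part of Storm, and it leaves out details such as failure handling and batch timeouts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;

// Hypothetical sketch: counts coordination tuples per batch and fires a callback
// once every upstream bolt has reported the end of that batch.
class CoordinationTracker {
    private final int upstreamCount;          // number of upstream bolts expected to report
    private final LongConsumer finishBatch;   // stands in for calling finishBatch on the bolt
    private final Map<Long, Integer> received = new HashMap<>();

    CoordinationTracker(int upstreamCount, LongConsumer finishBatch) {
        this.upstreamCount = upstreamCount;
        this.finishBatch = finishBatch;
    }

    /** Records one coordination tuple; returns true if it completed the batch. */
    boolean onCoordinationTuple(long batchId) {
        int seen = received.merge(batchId, 1, Integer::sum);
        if (seen < upstreamCount) {
            return false; // still waiting for other upstream bolts
        }
        received.remove(batchId);
        finishBatch.accept(batchId); // here the real bolt would also emit downstream
        return true;
    }

    public static void main(String[] args) {
        CoordinationTracker tracker =
                new CoordinationTracker(2, id -> System.out.println("finishBatch for batch " + id));
        tracker.onCoordinationTuple(1L); // first upstream bolt reports
        tracker.onCoordinationTuple(1L); // second upstream bolt reports, batch is complete
    }
}
```

Tracking each batch separately is what lets several batches be in flight at once, in line with the pipelining described earlier.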

Experience

At the time of writing, we are trying to deploy our first Trident topology. While writing the topology we learned some lessons:

  1. The Trident state API is not user friendly. It is not easy to understand or use.
  2. Trident is not easy to debug. The component names in a Trident topology are meaningless, so you can only guess which bolt your processor runs in, which makes it difficult to adjust the parallelism of each processor based on the Storm UI.
  3. Trident uses small batches to improve throughput. However, the batch size cannot be too large: a batch has to be processed as a whole, so a larger batch takes more resources to finish before the timeout and avoid a replay.