Storm Rationale

Source: Internet | Published by: 知道网络课答案 | Editor: 程序博客网 | Time: 2024/05/21 04:20

Rationale

The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.

However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a “Hadoop of realtime” has become the biggest hole in the data processing ecosystem.

Storm fills that hole.

Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:

Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.

Brittle: There’s little fault-tolerance. You’re responsible for keeping each worker and queue up.

Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
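The scaling problem described above can be made concrete with a small sketch. It is not taken from the original text: the function names, the CRC32-based hash, and the in-process `queue.Queue` objects are all assumptions standing in for real message queues and worker processes. It shows that when the number of queues changes, many keys are routed to a different queue, which is why every upstream worker has to be reconfigured.

```python
import queue
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a queue index for a key using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Send each message to the queue that owns its key."""
    queues = [queue.Queue() for _ in range(num_partitions)]
    for key in messages:
        queues[partition_for(key, num_partitions)].put(key)
    return queues


def drain_and_count(q):
    """A hypothetical worker: pull every message off one queue, count per key."""
    counts = {}
    while not q.empty():
        key = q.get()
        counts[key] = counts.get(key, 0) + 1
    return counts


# 200 messages spread over 20 distinct keys (10 messages per key).
messages = ["user-%d" % (i % 20) for i in range(200)]

# Two queues at first, then three once throughput outgrows them.
before = {k: partition_for(k, 2) for k in set(messages)}
after = {k: partition_for(k, 3) for k in set(messages)}
moved = sorted(k for k in before if before[k] != after[k])

# One worker per queue in the three-queue layout.
results = [drain_and_count(q) for q in route(messages, 3)]
total = sum(sum(c.values()) for c in results)
print(f"{len(moved)} of {len(before)} keys changed queues; {total} messages processed")
```

Every key in `moved` is a key whose senders must learn a new destination, and in a hand-built system that routing knowledge lives in each worker's configuration.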

Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn’t lose data, scales to huge volumes of messages, and is dead-simple to use and operate?

Storm satisfies these goals.

Why Storm is important

Storm exposes a set of primitives for doing realtime computation. Just as MapReduce greatly eases the writing of parallel batch processing, Storm’s primitives greatly ease the writing of parallel realtime computation.

The key properties of Storm are:

Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm’s small set of primitives satisfies a stunning number of use cases.

Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm’s usage of ZooKeeper for cluster coordination makes it scale to much larger cluster sizes.

Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.

Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.

Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).

Programming language agnostic: Robust and scalable realtime processing shouldn’t be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
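To make the "guarantees no data loss" property above more tangible, here is a minimal sketch of the general idea of replaying a message until it is processed successfully (at-least-once delivery). It is an illustration only and does not reproduce Storm's actual mechanism or API: `process_with_replay`, `flaky_handler`, and the `max_attempts` limit are hypothetical names and parameters introduced for this example.

```python
from collections import deque


def process_with_replay(messages, handler, max_attempts=5):
    """Deliver every message until the handler succeeds (at-least-once).

    Returns (processed, attempts), where attempts maps message -> tries.
    """
    pending = deque(messages)
    attempts = {}
    processed = []
    while pending:
        msg = pending.popleft()
        attempts[msg] = attempts.get(msg, 0) + 1
        try:
            handler(msg)
            processed.append(msg)      # "ack": the message is done
        except Exception:
            if attempts[msg] >= max_attempts:
                raise                  # give up loudly instead of dropping data
            pending.append(msg)        # "fail": put it back for replay
    return processed, attempts


failed_once = set()


def flaky_handler(msg):
    """Hypothetical handler that fails the first time it sees an even number."""
    if msg % 2 == 0 and msg not in failed_once:
        failed_once.add(msg)
        raise RuntimeError(f"transient failure on {msg}")


processed, attempts = process_with_replay(range(6), flaky_handler)
print(sorted(processed))  # every message appears, even the ones that failed once
print(attempts)           # messages that failed once were tried twice
```

A consequence of this design is that a handler may see the same message more than once, so processing steps in such a system are usually written to tolerate replays.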

0 0