Lambda Architecture


Making Sense of it All

Building a well-designed, reliable and functional big data application that caters to a variety of end-user latency requirements can be an extremely challenging proposition. It can be daunting enough to just keep up with the rapid pace of technology innovation happening in this space, let alone building applications that work for the problem at hand. “Start slow and build one application at a time” is perhaps the most common advice given to beginners today. However, there are certain high-level architectural constructs that can help you mentally visualize how different types of applications fit into the big data architecture and how some of these technologies are transforming the existing enterprise software landscape.

Lambda Architecture

Lambda Architecture is a useful framework to think about when designing big data applications. Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.

The LA aims to satisfy the need for a robust system that is fault-tolerant against both hardware failures and human mistakes, that can serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

Here’s how it looks, from a high-level perspective:


The Lambda Architecture, as seen in the picture, has three major components.

  1. Batch layer—This layer provides two functions:

    1. managing the master dataset, an immutable, append-only set of raw data
    2. pre-computing arbitrary query functions, called batch views.
  2. Serving layer—This layer indexes the batch views so that they can be queried ad hoc with low latency.

  3. Speed layer—This layer accommodates all requests that are subject to low-latency requirements. Using fast, incremental algorithms, the speed layer deals with recent data only.


  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.
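The merge in step 5 can be sketched with a toy example. Here the batch view holds pre-computed counts up to the last batch run, and the real-time view holds incremental counts for recent data only; a query sums the two. The names (`batch_view`, `realtime_view`, `query`) are illustrative, not part of any real framework's API.

```python
# Minimal sketch of query-time merging in the Lambda Architecture.
batch_view = {"page_a": 1000, "page_b": 250}   # pre-computed by the batch layer
realtime_view = {"page_a": 12, "page_c": 3}    # maintained by the speed layer

def query(key):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 1012: batch count plus recent increments
print(query("page_c"))  # 3: page_c only appears in recent data so far
```

When the next batch run completes, the data now covered by the batch view ages out of the real-time view, so the same query keeps returning a complete answer.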

Each of these layers can be realized using various big data technologies. For instance, the batch layer datasets can be in a distributed filesystem, while MapReduce can be used to create batch views that can be fed to the serving layer. The serving layer can be implemented using NoSQL technologies such as HBase, while querying can be implemented by technologies such as Apache Drill or Impala. Finally, the speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming.

Up to now, the description of the Lambda Architecture here makes use of basic capabilities that are pretty much common to all distributions powered by Hadoop. There are some things you can do, however, with a MapR cluster that improve the basic operation of the Lambda Architecture.

For instance, most Storm topologies avoid using much persisted state. This is fast and easy, since tuples can be acknowledged as soon as their effect has been recorded in memory. In the Lambda Architecture, this is not supposed to be too big of a deal, since any in-memory state that is lost due to software version upgrades or failures will be repaired within a matter of hours or so, as the affected time window ages out of the real-time part of the architecture.

When you have a MapR cluster underneath a Lambda Architecture, however, you can do a bit better than this: the time during which failures are visible drops from hours to seconds.

One way that this works is that MapR allows high-speed streaming data to be written directly to the Hadoop storage layer, while allowing stream-processing applications such as Storm or Spark Streaming to run as an independent service within the cluster. The processing application now becomes more of a subscriber to the incoming data feed. If a failure occurs, and the original application goes down, a new instance of the application can pick up the data stream within seconds of where the original application instance dropped off. An added advantage of this architecture is the availability of streaming data for batch as well as the serving layers.
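The subscriber model described above can be illustrated with a small sketch: the stream lives in durable storage, and each processing instance tracks how far it has read, so a replacement instance resumes from the last committed position rather than losing progress. The `stream`, `load_offset` and `save_offset` names are hypothetical stand-ins, not MapR, Storm, or Spark APIs.

```python
# Toy model of a replayable feed plus a consumer that commits its position.
stream = ["event-%d" % i for i in range(10)]   # durable, replayable data feed

_offset_store = {}                              # stands in for persisted offsets

def load_offset(consumer_id):
    """Where this consumer should resume reading."""
    return _offset_store.get(consumer_id, 0)

def save_offset(consumer_id, offset):
    """Commit progress so a replacement instance can pick up here."""
    _offset_store[consumer_id] = offset

def process(consumer_id, batch_size):
    """Read the next batch of events, then commit the new offset."""
    start = load_offset(consumer_id)
    batch = stream[start:start + batch_size]
    # ... stream-processing logic over `batch` would run here ...
    save_offset(consumer_id, start + len(batch))
    return batch

process("consumer-1", 4)            # original instance reads events 0-3
# the original instance crashes here; a new instance resumes seamlessly:
resumed = process("consumer-1", 4)  # new instance picks up at event 4
```

Because the feed itself is durable, the same stored events are also available to the batch and serving layers, as the paragraph above notes.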

In addition, individual processing elements can delay their acknowledgement of incoming tuples until they have logged the tuple to a log file in the distributed file system. This log file need only persist until the state of the bolt is either persisted or a summary is sent downstream. At that moment, a new log is started.

In the case of failure or orderly exit of a topology, the new version of the bolt can read this log and reconstruct the necessary state of the bolt very quickly. Once the log is read, tuples coming from the spout can be processed as if nothing ever happened. Since all tuples that arrived after the last record in the log have not been acknowledged, the spout will replay them so the bolt will get a complete set of tuples.
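The log-and-recover scheme above can be sketched as follows. This is a toy illustration, not Storm's actual bolt API: the bolt appends each tuple to its log before acknowledging it, and a restarted bolt replays the log to rebuild its in-memory state before handling new tuples.

```python
# Toy bolt that keeps a running sum and recovers it from a write-ahead log.
class LoggingBolt:
    def __init__(self, log):
        self.log = log        # stands in for a log file in the distributed FS
        self.count = 0        # the bolt's in-memory state
        for t in log:         # on (re)start, replay the log to rebuild state
            self._apply(t)

    def _apply(self, t):
        self.count += t

    def process(self, t):
        self.log.append(t)    # persist first; only then is the tuple ack-able
        self._apply(t)

bolt = LoggingBolt([])
for t in [1, 2, 3]:
    bolt.process(t)

# Simulate a crash: a new bolt instance reconstructs its state from the log.
recovered = LoggingBolt(bolt.log)
assert recovered.count == bolt.count == 6
```

Any tuples that arrived after the last log record were never acknowledged, so, as described above, the spout replays them and the recovered bolt ends up with a complete set.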

