Hadoop Introduction
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
Accessible – Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust – Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable – Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple – Hadoop allows users to quickly write efficient parallel code.
SQL (Structured Query Language) is by design targeted at structured data. Many of Hadoop's initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL. A machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster. Hadoop clusters of ten to hundreds of machines are standard. In fact, other than for development purposes, there's no reason to run Hadoop on a single server.
Large data sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with the less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.
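As a concrete illustration of this, here is a minimal sketch in plain Python (not the Hadoop API) of how raw text lines become key/value pairs. It mirrors the way Hadoop's TextInputFormat presents a file to a mapper as (byte offset, line) pairs, though for simplicity the key here is a line number rather than a byte offset.

```python
def to_key_value_pairs(lines):
    """Turn raw text lines into (key, value) pairs, Hadoop's basic
    data unit. Here the key is simply the line's position; Hadoop's
    TextInputFormat would use the byte offset instead."""
    return [(i, line.rstrip("\n")) for i, line in enumerate(lines)]

records = to_key_value_pairs(["first line\n", "second line\n"])
print(records)  # [(0, 'first line'), (1, 'second line')]
```

Once data is in this uniform (key, value) shape, the same processing functions can be applied regardless of what the original records looked like.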
Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world.
Consider data processing models such as pipelines and message queues. Pipelines can help the reuse of processing primitives; simple chaining of existing modules creates new ones. Message queues can help the synchronization of processing primitives. The programmer writes her data processing task as processing primitives in the form of either a producer or a consumer. The timing of their execution is managed by the system. Similarly, MapReduce is also a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
If the documents are all stored in one central storage server, then the bottleneck is in the bandwidth of that server.
In the mapping phase, MapReduce takes the input data and feeds each element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
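The two phases above can be sketched in plain Python. This is not Hadoop code; it is a single-process simulation of the model, using the classic word-count example, with the group-by-key step standing in for Hadoop's shuffle between the two phases.

```python
from collections import defaultdict

def mapper(_, line):
    # Mapping phase: filter/transform each input line into
    # (word, 1) pairs that the reducer can aggregate over.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reducing phase: aggregate all values seen for one key.
    return (word, sum(counts))

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):   # run the mapper on each record
            groups[k].append(v)           # "shuffle": group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())

result = map_reduce([(0, "the cat"), (1, "the dog")], mapper, reducer)
print(result)  # {'the': 2, 'cat': 1, 'dog': 1}
```

In a real Hadoop job the mappers and reducers run on different machines and the framework handles the shuffle, but the contract is the same: the mapper emits key/value pairs, and the reducer receives all values for one key.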