Hadoop Introduction
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
Accessible – Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust – Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable – Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple – Hadoop allows users to quickly write efficient parallel code.
SQL (Structured Query Language) is by design targeted at structured data. Many of Hadoop's initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL. A machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster. Hadoop clusters of ten to hundreds of machines are standard. In fact, other than for development purposes, there's no reason to run Hadoop on a single server.
Large data sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with the less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.
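As a concrete illustration of this, here is a minimal sketch in plain Python (not the Hadoop API) of how raw text lines become key/value pairs. It mirrors the way Hadoop's TextInputFormat presents a file to a mapper as (byte offset, line) pairs, though for simplicity the key here is a line number rather than a byte offset.

```python
def to_key_value_pairs(lines):
    """Turn raw text lines into (key, value) pairs, Hadoop's basic
    data unit. Here the key is simply the line's position; Hadoop's
    TextInputFormat would use the byte offset instead."""
    return [(i, line.rstrip("\n")) for i, line in enumerate(lines)]

records = to_key_value_pairs(["first line\n", "second line\n"])
print(records)  # [(0, 'first line'), (1, 'second line')]
```

Once data is in this uniform (key, value) shape, the same processing functions can be applied regardless of what the original records looked like.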
Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world.
Consider data processing models such as pipelines and message queues. Pipelines can help the reuse of processing primitives; simple chaining of existing modules creates new ones. Message queues can help the synchronization of processing primitives. The programmer writes her data processing task as processing primitives in the form of either a producer or a consumer. The timing of their execution is managed by the system. Similarly, MapReduce is also a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
If the documents are all stored in one central storage server, then the bottleneck is in the bandwidth of that server.
In the mapping phase, MapReduce takes the input data and feeds each element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
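The two phases above can be sketched in plain Python. This is not Hadoop code; it is a single-process simulation of the model, using the classic word-count example, with the group-by-key step standing in for Hadoop's shuffle between the two phases.

```python
from collections import defaultdict

def mapper(_, line):
    # Mapping phase: filter/transform each input line into
    # (word, 1) pairs that the reducer can aggregate over.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reducing phase: aggregate all values seen for one key.
    return (word, sum(counts))

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):   # run the mapper on each record
            groups[k].append(v)           # "shuffle": group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())

result = map_reduce([(0, "the cat"), (1, "the dog")], mapper, reducer)
print(result)  # {'the': 2, 'cat': 1, 'dog': 1}
```

In a real Hadoop job the mappers and reducers run on different machines and the framework handles the shuffle, but the contract is the same: the mapper emits key/value pairs, and the reducer receives all values for one key.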