Hadoop Ecosystem
https://blogs.walkingtree.in/2013/09/26/hadoop-ecosystem/
Problem Statement:
When we start learning Hadoop technology, we come across many components in the Hadoop ecosystem. It is useful to know what specific purpose each component serves within that ecosystem.
Scope of the Article:
This article describes the use of the different components in the Hadoop ecosystem.
Details:
The following diagram depicts the general Hadoop ecosystem. Not all components are mandatory, and in many cases one component complements the others.
Hadoop Distributed File System (HDFS):
HDFS is a distributed file system that spreads data across multiple servers and recovers automatically when a node fails. It is built around a write-once, read-many model. It does not support multiple concurrent writers to a file; only one writer is allowed at a time. With this file system, a typical Hadoop installation can hold petabytes of data.
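To make the idea of distributing data with automatic recovery more concrete, here is a minimal toy model in Python. It is only a sketch: the function names, the round-robin placement policy, the node names, and the sizes are assumptions made for illustration, and the real HDFS placement logic is more sophisticated than this.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Round-robin is a toy policy chosen for this sketch, not what HDFS does.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after failed_node goes down."""
    return [
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


if __name__ == "__main__":
    # Hypothetical numbers: a 300-unit file, 128-unit blocks, 4 nodes, 3 replicas.
    blocks = split_into_blocks(300, 128)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 3)
    print(blocks)
    print(placement)
    print(surviving_blocks(placement, "n3"))
```

Because every block lives on more than one node, losing a single node leaves every block readable, which is the property the file system relies on to recover from node failure.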
HBase:
HBase is a distributed, column-oriented database, whereas HDFS is a file system. HBase is built on top of HDFS and stores its data in HDFS internally, so it is not a replacement for HDFS. HBase does not support SQL, but it removes the concurrent-write limitation of HDFS. If your big data solution needs concurrent writes, HBase is an option.
MapReduce:
MapReduce is a framework for distributed, parallel data processing, and it provides a programming model for processing large data sets. MapReduce programs can be written in Java, Ruby, Python, and C++. The framework runs programs in parallel across multiple nodes of a big data cluster, and because the processing is distributed we can expect better performance and throughput. MapReduce processes data in two stages, map and reduce. Map converts the input data into an intermediate format, which is basically a set of key-value pairs. Reduce combines all the intermediate values that share a common key and generates a reduced set of key-value pairs. The framework has two components, the JobTracker and the TaskTracker. The JobTracker acts as the master and sends commands to the slaves for specific tasks. The TaskTracker takes care of the actual execution of a task and reports back to the JobTracker.
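The two stages are easier to see in a small example. The following is a minimal single-process sketch of a word count in plain Python, not code that runs on a Hadoop cluster. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit an intermediate (key, value) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one output pair."""
    return key, sum(values)


def word_count(lines):
    intermediate = []
    for line in lines:
        # On a real cluster, different input splits are mapped on different nodes.
        intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
    # prints {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

Each call to `map_phase` depends only on its own input line, and each call to `reduce_phase` depends only on one key, which is what allows a framework to spread both stages across many nodes.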
YARN:
YARN stands for ‘Yet Another Resource Negotiator’. MapReduce was rewritten to overcome the potential bottleneck of the single JobTracker in the old MapReduce, which was responsible for both job scheduling and monitoring task progress. YARN divides those two responsibilities between two separate daemons, the ResourceManager and the ApplicationMaster. Existing MapReduce programs can generally run on YARN directly, but sometimes they need some changes.
Avro:
Avro is a data serialization format that brings data interoperability among multiple components of Apache Hadoop. Most Hadoop components have started supporting the Avro data format. It works on the basic premise that data produced by one component should be readily consumable by another.
Avro has the following features:
- Rich data types
- Fast and compact serialization
- Support for many programming languages, such as Java and Python
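Avro schemas are written in JSON, which is part of what lets components in different languages agree on a data layout. The sketch below builds a small record schema as a Python dictionary and prints it as JSON using only the standard library. The record name `User`, the namespace, and the fields are hypothetical examples; no Avro library is used here, so this only illustrates what a schema looks like, not how data is serialized.

```python
import json

# A hypothetical Avro record schema showing a few of the rich data types:
# a primitive (string), a nullable union (null or int), and an array.
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
    ],
}


def field_names(schema):
    """Return the field names declared by a record schema."""
    return [field["name"] for field in schema["fields"]]


if __name__ == "__main__":
    print(json.dumps(user_schema, indent=2))
    print(field_names(user_schema))
```

A producer and a consumer that share this schema file can exchange records without either side knowing which language the other is written in.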
Pig:
Pig is a platform for big data analysis and processing. The immediate question is: MapReduce serves the same purpose, so what additional benefit does Pig provide? Pig adds one more level of abstraction to data processing, which makes data processing jobs much easier to write and maintain. At compile time, a Pig script is converted into multiple MapReduce programs, which are executed according to the logic written in the script. Pig has two pieces:
- The language used to write programs, which is named Pig Latin
- The execution environment in which Pig scripts run
Pig can process terabytes of data with half a dozen lines of code.
Hive:
Hive is a data warehousing framework on top of Hadoop. Hive lets you write SQL-like queries to process and analyze the big data stored in HDFS. It is primarily intended for people who want to process big data but do not have a programming background in Java or related technologies. During execution, Hive scripts are converted into a series of MapReduce jobs.
Sqoop:
Sqoop is a tool for transferring data from relational database environments such as Oracle, MySQL, and PostgreSQL into the Hadoop environment. It can transfer large amounts of data into a Hadoop system, and it can store the data in HDFS in Avro format.
ZooKeeper:
ZooKeeper is a distributed coordination and governing service for a Hadoop cluster. ZooKeeper runs on multiple nodes in a cluster, and in general the Hadoop nodes and ZooKeeper nodes are the same machines. ZooKeeper can send a notification when a change happens to the master or to any of its children. In Hadoop this is useful for tracking whether a particular node is down and planning the necessary communication protocol around node failure.
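The notification idea can be illustrated with a toy in-memory model. The class below is not the ZooKeeper API and does not talk to a ZooKeeper server. `ToyCoordinator` and its methods are hypothetical names invented for this sketch, which only shows the pattern of registering interest in a set of nodes and being called back when one of them appears or goes down.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._live = set()
        self._watchers = []

    def watch(self, callback):
        """Register a callback that receives (event, node) on every change."""
        self._watchers.append(callback)

    def register(self, node):
        """Record that a node has joined the cluster."""
        self._live.add(node)
        self._notify("joined", node)

    def mark_down(self, node):
        """Record that a node is gone and tell every watcher about it."""
        if node in self._live:
            self._live.remove(node)
            self._notify("left", node)

    def live_nodes(self):
        return sorted(self._live)

    def _notify(self, event, node):
        for callback in self._watchers:
            callback(event, node)


if __name__ == "__main__":
    coordinator = ToyCoordinator()
    coordinator.watch(lambda event, node: print(event, node))
    coordinator.register("datanode-1")
    coordinator.register("datanode-2")
    coordinator.mark_down("datanode-1")
    print(coordinator.live_nodes())
```

In this sketch the watchers stay registered forever and a node is removed by an explicit call. A real coordination service has to detect the failure itself, which is the harder part that ZooKeeper handles for the cluster.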
Mahout:
Mahout adds data mining and machine learning capabilities for big data. It can be used, for example, to build a recommendation engine based on users' usage patterns.
Summary:
In this article we looked at the Hadoop ecosystem and learned about its different components and their primary purposes.
References:
- http://hadoop.apache.org