Hadoop and Its Clusters

Source: Internet | Published by: 叶利钦与普京知乎 | Editor: 程序博客网 | Time: 2024/05/22 12:30

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
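The idea of handling failures at the application layer can be illustrated with a toy sketch. This is not Hadoop's actual mechanism or API: `run_with_failover`, `flaky_worker`, and `healthy_worker` are hypothetical names, and each "worker" is just a Python function standing in for a cluster node. The sketch only shows the general pattern of detecting a failed attempt in software and retrying the same task elsewhere.

```python
def run_with_failover(task, workers):
    """Try the task on each worker in turn and return the first successful result."""
    errors = []
    for worker in workers:
        try:
            return worker(task)
        except RuntimeError as exc:
            # The failure is detected in software; move on to the next worker.
            errors.append(exc)
    raise RuntimeError(f"task {task!r} failed on all {len(errors)} workers")


def flaky_worker(task):
    """Hypothetical node that is always unavailable."""
    raise RuntimeError("node unavailable")


def healthy_worker(task):
    """Hypothetical node that completes the task."""
    return f"done: {task}"


if __name__ == "__main__":
    print(run_with_failover("block-1", [flaky_worker, healthy_worker]))
```

The caller never needs reliable hardware in this sketch; it only needs at least one worker to succeed, which is the point the paragraph above makes about the library.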
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
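
The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using the classic word-count example, not the Hadoop MapReduce API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions chosen for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over the input lines in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

In a real cluster the map and reduce calls would run on many machines in parallel over data stored in HDFS; here they run sequentially so the data flow is easy to follow.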

Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
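
The notion of executing "an arbitrary DAG of tasks", as in the Tez entry above, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not the Tez API: `run_dag` and the task names are hypothetical, and the tasks run sequentially in one process. The sketch only shows that each task runs after the tasks it depends on and receives their results.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after its upstream tasks and return all results by name.

    tasks: maps a task name to a callable that takes a dict of upstream results.
    dependencies: maps a task name to the set of task names it depends on.
    """
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
    return results


if __name__ == "__main__":
    # Hypothetical four-task pipeline: one source, two branches, one join.
    tasks = {
        "extract": lambda up: [3, 1, 2],
        "sort": lambda up: sorted(up["extract"]),
        "total": lambda up: sum(up["extract"]),
        "report": lambda up: (up["sort"], up["total"]),
    }
    dependencies = {
        "sort": {"extract"},
        "total": {"extract"},
        "report": {"sort", "total"},
    }
    print(run_dag(tasks, dependencies)["report"])
```

A two-stage MapReduce job is the special case of such a graph with one map stage feeding one reduce stage; a DAG engine allows more stages and branches without writing intermediate results between separate jobs.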

0 0