Big Data Ecosystem and Components

Source: Internet | Published by: 网络喷子 | Editor: 程序博客网 | Date: 2024/04/30 02:42
  • Apache Spark Components
    1. Spark Core
       - the RDD (Resilient Distributed Dataset) data structure
       - basic I/O functionality
       - job and task scheduling and monitoring
       - memory management
       - fault recovery
       - interaction with storage systems, and so on
    2. Spark SQL
    3. Spark Streaming
    4. GraphX
    5. MLlib
       - clustering
       - classification
       - decomposition
       - regression
       - collaborative filtering
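Spark Core's RDD abstraction records transformations lazily and only evaluates them when an action is invoked. A minimal sketch of that idea in plain Python (the `SimpleRDD` class is a hypothetical illustration of the lazy-lineage pattern, not Spark's actual API):

```python
from functools import reduce as _reduce

class SimpleRDD:
    """Toy stand-in for Spark's RDD: transformations are recorded
    lazily as a lineage and only executed when an action runs."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations (the lineage)

    def map(self, fn):
        # Lazy: append to the lineage, no computation happens yet.
        return SimpleRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return SimpleRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the data.
        out = iter(self._data)
        for kind, fn in self._ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

    def reduce(self, fn):
        return _reduce(fn, self.collect())

rdd = SimpleRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())                   # [0, 4, 16, 36, 64]
print(rdd.reduce(lambda a, b: a + b))  # 120
```

Because nothing runs until `collect()` or `reduce()`, Spark can optimize and re-execute the whole pipeline, which is also what makes fault recovery from lineage possible.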
  • Zookeeper

Coordination

  • Oozie

Workflow and Scheduling

  • Pig

Scripting, data access

  • Mahout

Machine learning library.

  • Hive

Query

  • Hbase

NoSQL Database

  • Ambari

Management and monitoring. Ambari provides deployment, management, and monitoring of Hadoop clusters, giving operations staff a powerful web interface for administering them.

  • MapReduce

Distributed Processing
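MapReduce splits a job into a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that combines each group. A minimal single-process word-count sketch of the model (the function names are illustrative, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # across the network between mappers and reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's list of values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data pipeline"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'pipeline': 1}
```

In a real Hadoop cluster the map and reduce phases run in parallel on many machines, but the data flow is exactly this three-step pipeline.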

  • Sqoop

Data integration: importing and exporting data.

  • Mesos

An open source cluster manager.

  • Hadoop

Open source framework for distributed storage (HDFS) and distributed processing of large datasets across clusters of commodity hardware.

  • Cassandra

A free, open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
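Distributed databases like Cassandra place each row on a node by hashing its partition key onto a ring, so adding or removing a node only moves the keys adjacent to it. A toy consistent-hashing ring showing the placement idea (node names and the hash function are illustrative, not Cassandra's partitioner):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable hash so placement is repeatable across runs.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc of the ring ending at its token.
        self._tokens = sorted((ring_hash(n), n) for n in nodes)
        self._hashes = [t for t, _ in self._tokens]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first token >= hash(key), wrapping around.
        i = bisect.bisect_left(self._hashes, ring_hash(key))
        return self._tokens[i % len(self._tokens)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))
```

Because placement depends only on the key's hash and the node tokens, any client can compute which node owns a key without a central coordinator, which is how these systems avoid a single point of failure.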

  • Hadoop YARN

An open source cluster manager (Yet Another Resource Negotiator).

  • Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud.

  • Flume

Gathers and aggregates large amounts of log data.

  • Simba

A distributed in-memory spatial analytics engine based on Apache Spark.

  • Alluxio

Open source, memory-speed virtual distributed storage.

  • Airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.
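Workflow platforms like Airflow model a pipeline as a DAG of tasks and run each task only after its upstream dependencies have succeeded. A minimal topological-order runner illustrating that scheduling idea (a plain-Python sketch, not Airflow's operator API; requires Python 3.9+ for `graphlib`):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_workflow(tasks, deps):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    order = []
    # static_order() yields each task only after all its dependencies.
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        order.append(name)
    return order

ran = []
tasks = {
    "extract":   lambda: ran.append("extract"),
    "transform": lambda: ran.append("transform"),
    "load":      lambda: ran.append("load"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_workflow(tasks, deps))  # ['extract', 'transform', 'load']
```

Airflow adds scheduling, retries, backfills, and a monitoring UI on top of this core dependency-ordering idea.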

  • Apache Oozie

Oozie, a workflow engine for Hadoop.

  • Apache Kafka

Publish-subscribe messaging system.
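In a publish-subscribe system such as Kafka, producers publish messages to named topics and every subscriber of a topic receives each message independently. A toy in-memory broker showing the pattern (an illustration of publish-subscribe itself, not Kafka's client API or its persistent, partitioned log):

```python
from collections import defaultdict

class ToyBroker:
    """In-memory publish-subscribe: every subscriber to a topic
    receives its own copy of each published message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Fan the message out to all current subscribers of the topic.
        for handler in self._subscribers[topic]:
            handler(message)

broker = ToyBroker()
seen_a, seen_b = [], []
broker.subscribe("clicks", seen_a.append)
broker.subscribe("clicks", seen_b.append)
broker.publish("clicks", {"user": 1})
print(seen_a, seen_b)  # both subscribers received the message
```

Kafka differs from this sketch in that messages are durably appended to a partitioned log and consumers pull at their own pace, but the topic/fan-out model is the same.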

  • Tachyon

Now renamed Alluxio.

  • BlinkDB

a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data.
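Approximate query engines such as BlinkDB trade a small amount of accuracy for large speedups by answering aggregates over a random sample rather than the full dataset. A seeded sketch of the idea (plain uniform sampling; BlinkDB itself uses pre-built stratified samples with error bounds):

```python
import random

def approx_sum(values, sample_fraction, seed=42):
    # Aggregate over a sample, then scale back up to the full size.
    rng = random.Random(seed)
    k = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, k)
    return sum(sample) * (len(values) / k)

data = list(range(1, 100_001))     # exact sum is 5,000,050,000
estimate = approx_sum(data, 0.01)  # looks at only 1% of the rows
exact = sum(data)
print(f"estimate={estimate:.0f}, relative error={abs(estimate - exact) / exact:.2%}")
```

Scanning 1% of the rows makes the query roughly two orders of magnitude cheaper, while the estimate typically lands within a few percent of the true value.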

  • Shark

No longer updated; superseded by Spark SQL.

  • RabbitMQ

An open source message broker (message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP).

  • Impala

Interactive SQL query engine for data stored in Hadoop.

  • RHadoop

Machine learning library (a collection of R packages for using Hadoop from R).

  • Flume

Data transfer tool for collecting, processing, and moving log data. Similar in function to Chukwa, but smaller and more practical.

  • Avro

Data serialization system for real-time, high-volume data exchange. A newer serialization and transport tool that is expected to gradually replace Hadoop's original RPC mechanism.

  • Chukwa

Data collection tool that can gather many kinds of data and import them into Hadoop.

  • Sqoop

Data transfer tool that imports data from relational databases (MySQL, Oracle, Postgres, etc.) into HDFS, and can also export data from HDFS back into relational databases.

  • Hue

Web interface for Hadoop and its ecosystem components, providing browser-based access to HDFS, YARN, MapReduce, HBase, Hive, Pig, and more.

  • BigTop

Packaging, distribution, and testing tool for Hadoop and its surrounding components. It resolves version dependencies and conflicts between components; in practice, deployment scripts use it internally when installing via rpm or yum.

