Big Data Ecosystem and Components

Source: Internet | Published by: 网络喷子 | Editor: 程序博客网 | Date: 2024/04/30 02:42
  • Apache Spark Components
    1. Spark Core
       - the RDD (Resilient Distributed Dataset) data structure
       - basic I/O functionality
       - job and task scheduling and monitoring
       - memory management
       - fault recovery
       - interaction with storage systems, and so on
    2. Spark SQL
    3. Spark Streaming
    4. GraphX
    5. MLlib
       - clustering
       - classification
       - decomposition
       - regression
       - collaborative filtering
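Spark Core's RDD abstraction records transformations lazily and only evaluates them when an action is invoked. A minimal sketch of that idea in plain Python (the `SimpleRDD` class is a hypothetical illustration of the lazy-lineage pattern, not Spark's actual API):

```python
from functools import reduce as _reduce

class SimpleRDD:
    """Toy stand-in for Spark's RDD: transformations are recorded
    lazily as a lineage and only executed when an action runs."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations (the lineage)

    def map(self, fn):
        # Lazy: append to the lineage, no computation happens yet.
        return SimpleRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return SimpleRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the data.
        out = iter(self._data)
        for kind, fn in self._ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

    def reduce(self, fn):
        return _reduce(fn, self.collect())

rdd = SimpleRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())                   # [0, 4, 16, 36, 64]
print(rdd.reduce(lambda a, b: a + b))  # 120
```

Because nothing runs until `collect()` or `reduce()`, Spark can optimize and re-execute the whole pipeline, which is also what makes fault recovery from lineage possible.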
  • Zookeeper

Coordination

  • Oozie

Workflow and Scheduling

  • Pig

Scripting, data access

  • Mahout

Machine learning library.

  • Hive

Query

  • Hbase

NoSQL Database

  • Ambari

Management and monitoring. Ambari provides deployment, management, and monitoring of Hadoop clusters, giving operations staff a powerful web interface for administering them.

  • MapReduce

Distributed Processing
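MapReduce splits a job into a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that combines each group. A minimal single-process word-count sketch of the model (the function names are illustrative, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # across the network between mappers and reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's list of values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data pipeline"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'pipeline': 1}
```

In a real Hadoop cluster the map and reduce phases run in parallel on many machines, but the data flow is exactly this three-step pipeline.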

  • Sqoop

Data integration: importing and exporting data.

  • Mesos

An open source cluster manager.

  • Hadoop

Open source framework for distributed storage (HDFS) and distributed processing of large datasets across clusters of commodity hardware.

  • Cassandra

A free, open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
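Distributed databases like Cassandra place each row on a node by hashing its partition key onto a ring, so adding or removing a node only moves the keys adjacent to it. A toy consistent-hashing ring showing the placement idea (node names and the hash function are illustrative, not Cassandra's partitioner):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable hash so placement is repeatable across runs.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc of the ring ending at its token.
        self._tokens = sorted((ring_hash(n), n) for n in nodes)
        self._hashes = [t for t, _ in self._tokens]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first token >= hash(key), wrapping around.
        i = bisect.bisect_left(self._hashes, ring_hash(key))
        return self._tokens[i % len(self._tokens)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))
```

Because placement depends only on the key's hash and the node tokens, any client can compute which node owns a key without a central coordinator, which is how these systems avoid a single point of failure.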

  • Hadoop YARN

An open source cluster manager (Yet Another Resource Negotiator).

  • Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud.

  • Flume

Gathers and aggregates large amounts of log data.

  • Simba

A distributed in-memory spatial analytics engine based on Apache Spark.

  • Alluxio

Open source, memory-speed virtual distributed storage.

  • Airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.
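Workflow platforms like Airflow model a pipeline as a DAG of tasks and run each task only after its upstream dependencies have succeeded. A minimal topological-order runner illustrating that scheduling idea (a plain-Python sketch, not Airflow's operator API; requires Python 3.9+ for `graphlib`):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_workflow(tasks, deps):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    order = []
    # static_order() yields each task only after all its dependencies.
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        order.append(name)
    return order

ran = []
tasks = {
    "extract":   lambda: ran.append("extract"),
    "transform": lambda: ran.append("transform"),
    "load":      lambda: ran.append("load"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_workflow(tasks, deps))  # ['extract', 'transform', 'load']
```

Airflow adds scheduling, retries, backfills, and a monitoring UI on top of this core dependency-ordering idea.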

  • Apache Oozie

Oozie, a workflow engine for Hadoop.

  • Apache Kafka

Publish-subscribe messaging system.
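In a publish-subscribe system such as Kafka, producers publish messages to named topics and every subscriber of a topic receives each message independently. A toy in-memory broker showing the pattern (an illustration of publish-subscribe itself, not Kafka's client API or its persistent, partitioned log):

```python
from collections import defaultdict

class ToyBroker:
    """In-memory publish-subscribe: every subscriber to a topic
    receives its own copy of each published message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Fan the message out to all current subscribers of the topic.
        for handler in self._subscribers[topic]:
            handler(message)

broker = ToyBroker()
seen_a, seen_b = [], []
broker.subscribe("clicks", seen_a.append)
broker.subscribe("clicks", seen_b.append)
broker.publish("clicks", {"user": 1})
print(seen_a, seen_b)  # both subscribers received the message
```

Kafka differs from this sketch in that messages are durably appended to a partitioned log and consumers pull at their own pace, but the topic/fan-out model is the same.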

  • Tachyon

Now renamed Alluxio.

  • BlinkDB

a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data.
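Approximate query engines such as BlinkDB trade a small amount of accuracy for large speedups by answering aggregates over a random sample rather than the full dataset. A seeded sketch of the idea (plain uniform sampling; BlinkDB itself uses pre-built stratified samples with error bounds):

```python
import random

def approx_sum(values, sample_fraction, seed=42):
    # Aggregate over a sample, then scale back up to the full size.
    rng = random.Random(seed)
    k = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, k)
    return sum(sample) * (len(values) / k)

data = list(range(1, 100_001))     # exact sum is 5,000,050,000
estimate = approx_sum(data, 0.01)  # looks at only 1% of the rows
exact = sum(data)
print(f"estimate={estimate:.0f}, relative error={abs(estimate - exact) / exact:.2%}")
```

Scanning 1% of the rows makes the query roughly two orders of magnitude cheaper, while the estimate typically lands within a few percent of the true value.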

  • Shark

No longer updated; superseded by Spark SQL.

  • RabbitMQ

An open source message broker (message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP).

  • Impala

Interactive SQL query engine for data stored in Hadoop.

  • RHadoop

Machine learning library (a collection of R packages for using Hadoop from R).

  • Flume

Data transfer tool for collecting, processing, and moving log data. Similar in function to Chukwa, but smaller and more practical.

  • Avro

Data serialization system for real-time, high-volume data exchange. A newer serialization and transport tool that is expected to gradually replace Hadoop's original RPC mechanism.

  • Chukwa

Data collection tool that can gather many kinds of data and import them into Hadoop.

  • Sqoop

Data transfer tool that imports data from relational databases (MySQL, Oracle, Postgres, etc.) into HDFS, and can also export data from HDFS back into relational databases.

  • Hue

Web interface for Hadoop and its ecosystem components, providing browser-based access to HDFS, YARN, MapReduce, HBase, Hive, Pig, and more.

  • BigTop

Packaging, distribution, and testing tool for Hadoop and its surrounding components. It resolves version dependencies and conflicts between components; in practice, deployment scripts use it internally when installing via rpm or yum.

