『 Spark 』1. Introduction to Spark
Source: Internet  Published: vs2017 java  Editor: 程序博客网  Time: 2024/06/10 13:43
How to introduce Spark to others
Apache Spark™ is a fast and general engine for large-scale data processing.
Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools, including:
- Spark SQL for SQL and structured data processing, extending to DataFrames and Datasets
- MLlib for machine learning
- GraphX for graph processing
- Spark Streaming for stream data processing
Some background on how Spark came about
Spark started in 2009 and was open sourced in 2010. Unlike the various specialized systems (Hadoop, Storm), Spark's goals were to:
- generalize MapReduce to support new applications within the same engine
  - it is compatible with Hadoop and can run on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
- speed up iterative computation compared with Hadoop
  - use memory + disk instead of disk alone as the data storage medium
- design a new programming model, the RDD, which makes data processing more graceful
  - [RDD transformations, actions, distributed jobs, stages and tasks]
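The RDD terms listed in brackets above can be made concrete with a toy model. What follows is a minimal pure-Python sketch, not Spark's actual API: `ToyRDD` and its methods are hypothetical names invented for illustration. It assumes a single process with the data already split into partitions, and it only shows that transformations (`map`, `filter`) record work lazily while actions (`collect`, `reduce`) trigger it, running one task per partition.

```python
from functools import reduce
from itertools import chain


class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # list of lists: the source data, one list per partition
        self._lineage = lineage        # tuple of recorded per-partition steps, not yet executed

    # Transformations: record a step and return a new ToyRDD. Nothing runs yet.
    def map(self, fn):
        def step(part):
            return (fn(x) for x in part)
        return ToyRDD(self._partitions, self._lineage + (step,))

    def filter(self, pred):
        def step(part):
            return (x for x in part if pred(x))
        return ToyRDD(self._partitions, self._lineage + (step,))

    def _run_task(self, partition):
        # One "task": apply the whole recorded pipeline to a single partition.
        data = iter(partition)
        for step in self._lineage:
            data = step(data)
        return list(data)

    # Actions: run every task and return a plain Python value.
    def collect(self):
        return list(chain.from_iterable(self._run_task(p) for p in self._partitions))

    def reduce(self, fn):
        return reduce(fn, self.collect())


calls = []  # records each element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
evens_squared = rdd.map(square).filter(lambda x: x % 2 == 0)
print(calls)  # [] because transformations are lazy and nothing has run yet

total = evens_squared.reduce(lambda a, b: a + b)  # the action triggers the work
print(total)                    # 56
print(evens_squared.collect())  # [4, 16, 36]
```

In real Spark the same pipeline shape applies, but tasks run in parallel across a cluster, and operations that shuffle data between partitions split a job into multiple stages. This sketch has no shuffle, so it models a single stage only.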
Why choose Spark
- it is designed, implemented and used as a set of libraries rather than as specialized systems;
  - this makes it much more useful and maintainable
- historically, it was designed to improve upon Hadoop and Storm, so it has a strong lineage;
- documentation, community, products and trends;
- it provides SQL, DataFrames, Datasets, a machine learning library, a graph computing library and an actively growing set of third-party libraries; it is easy to use and covers many use cases in many fields;
- it supports ad-hoc exploration, which speeds up data exploration and pre-processing and helps you build your data ETL and processing jobs;
References
Intro to Apache Spark
introducing spark