Introduction to Spark

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 08:15

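Spark is a fast, general-purpose processing engine for Hadoop data. It can run on a Hadoop YARN cluster or in standalone mode, it can process Hadoop data in any format, and it is designed both for batch processing and for newer workloads such as stream processing and machine learning.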

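I. Advantages:

1. Speed:

According to the Spark project, Spark can run workloads up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster on disk.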

2. Ease of use:

Applications can be written in Java, Scala, Python, or R.

3. Generality:

Spark ships a stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, which can be combined in the same application.

4. Runs in many environments:

Spark runs on Hadoop, Mesos, standalone, or in the cloud.
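
A minimal sketch of how the deployment target is usually selected: the same application code runs against different cluster managers by changing only the master URL. The helper names `master_url` and `make_context`, the mode names, and the placeholder host and ports are assumptions for illustration; the URL schemes (`local[*]`, `yarn`, `spark://`, `mesos://`) are the ones Spark documents.

```python
def master_url(mode, host="master-host"):
    """Map a deployment mode to a Spark master URL (host and ports are placeholders)."""
    urls = {
        "local": "local[*]",                     # single machine, all cores
        "yarn": "yarn",                          # Hadoop YARN (Spark 2.x+ spelling)
        "standalone": "spark://%s:7077" % host,  # Spark's own cluster manager
        "mesos": "mesos://%s:5050" % host,       # Apache Mesos
    }
    return urls[mode]


def make_context(mode, app_name="demo"):
    """Hypothetical helper: create a SparkContext for the chosen mode."""
    # Imported here so master_url() can be used without a Spark installation.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master_url(mode))
    return SparkContext(conf=conf)
```

Calling `make_context("local")` would start Spark on the local machine; the other modes assume a reachable cluster manager at the given host.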

II. Built-in libraries:

1. Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html

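Seamlessly mix SQL queries with Spark programs:
from pyspark.sql import HiveContext
context = HiveContext(sc)  # sc: an existing SparkContext (Spark 1.x API)
results = context.sql(
  "SELECT * FROM people")
# .rdd exposes the rows as an RDD so that map() works on DataFrame results
names = results.rdd.map(lambda p: p.name)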

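Uniform data access: SQL can join tables that come from different data sources.
# Parentheses let the method chain span two lines; jsonFile is the Spark 1.x loader
(context.jsonFile("s3n://...")
  .registerTempTable("json"))
results = context.sql(
  """SELECT *
     FROM people
     JOIN json ...""")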

Hive compatibility: existing Hive UDFs can be used.

Standard connectivity: BI tools connect over JDBC or ODBC and access big data through Spark SQL.

2. Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming makes it easy to build scalable, fault-tolerant stream processing applications.
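
As a minimal sketch of such a streaming application, the following PySpark word count uses the DStream API (`StreamingContext`, `socketTextStream`). The host, port, batch interval, and the helper names `split_words`, `word_counts`, and `run` are assumptions chosen for illustration, not part of the original article.

```python
from operator import add


def split_words(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_counts(lines):
    """Turn a DStream of lines into a DStream of (word, count) pairs per batch."""
    return lines.flatMap(split_words).map(lambda w: (w, 1)).reduceByKey(add)


def run(host="localhost", port=9999, batch_seconds=1):
    """Hypothetical driver: count words arriving on a TCP socket."""
    # Imported here so the helpers above do not require a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # At least two local threads: one for the socket receiver, one for processing.
    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, batch_seconds)
    word_counts(ssc.socketTextStream(host, port)).pprint()  # print each batch
    ssc.start()
    ssc.awaitTermination()
```

To try it, something must be writing lines to the chosen port (for example `nc -lk 9999`) before `run()` is called.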

3. MLlib: http://spark.apache.org/docs/latest/ml-guide.html

Usability: usable in Java, Scala, Python, and R.

Performance: the project describes its algorithms as up to 100 times faster than MapReduce.

Easy to deploy.
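
A minimal sketch of calling MLlib from Python, assuming the RDD-based `pyspark.mllib` API and an input text file with one whitespace-separated numeric row per line. The helper names `parse_point` and `cluster`, the file layout, and the default parameter values are assumptions for illustration.

```python
def parse_point(line):
    """Parse a whitespace-separated line of numbers into a list of floats."""
    return [float(x) for x in line.split()]


def cluster(sc, path, k=2, max_iterations=10):
    """Hypothetical helper: fit k-means on a text file of numeric rows."""
    # Imported here so parse_point() can be used without a Spark installation.
    from pyspark.mllib.clustering import KMeans

    # k-means passes over the data repeatedly, so keep the parsed points in memory.
    points = sc.textFile(path).map(parse_point).cache()
    model = KMeans.train(points, k, maxIterations=max_iterations)
    return model.clusterCenters
```

Here `sc` is an existing SparkContext; the function returns the fitted cluster centers.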

4. GraphX: http://spark.apache.org/docs/latest/graphx-programming-guide.html

Graph computation (graphs and graph-parallel processing).

5. Third-party projects:

Community package index: spark-packages.org

Infrastructure projects:

SparkR - R frontend for Spark.

Zeppelin - an IPython-like notebook for Spark.

Reference:

http://spark.apache.org/

0 0