Compiling, Installing, and Deploying Spark 1.6.0 (Scala 2.11)


On January 4, 2016, Spark 1.6.0 was published on the official website, so I downloaded the release and compiled it.


With the experience from earlier builds and the Java dependencies already downloaded, compilation took only about a minute; after redeploying and reconfiguring, everything was up and running again. Very efficient.


As before, building against Scala 2.11 takes just one command: build/sbt -Dscala=2.11 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly




Testing one of the new features in Spark 1.6: the Dataset API.
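As a quick smoke test of the Dataset API (SPARK-9999), something along these lines can be run in a 1.6.0 spark-shell, where sqlContext is predefined; the case class and sample values are my own illustrative placeholders, not from the original post:

```scala
// In spark-shell (Spark 1.6.0), sqlContext is provided automatically
import sqlContext.implicits._

case class Person(name: String, age: Int)

// Create a strongly typed Dataset from a local collection
val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// Lambda functions over typed objects, still executed by the Spark SQL engine
val adults = ds.filter(_.age >= 30).map(_.name)
adults.show()
```

Unlike an RDD, the Dataset keeps a schema, so these typed operations still benefit from the Spark SQL optimizer.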


Other new features in 1.6 include:

Spark Core/SQL

  • API Updates
    • SPARK-9999 Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.
    • SPARK-10810 Session Management - Different users can share a cluster while having different configuration and temporary tables.
    • SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
    • SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
    • SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
    • SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
    • SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
    • SPARK-11778  - DataFrameReader.table supports specifying database name. For example, sqlContext.read.table(“dbName.tableName”) can be used to create a DataFrame from a table called “tableName” in the database “dbName”.
    • SPARK-10947  - With schema inference from JSON into a Dataframe, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false.
  • Performance
    • SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
    • SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
    • SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
    • SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
    • SPARK-10978 Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
    • SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
    • SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
    • SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
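Several of the Core/SQL API updates above are easy to try directly from spark-shell. A hedged sketch covering three of them; the file paths and the database/table names are placeholders of my own:

```scala
// SPARK-11197: run SQL directly over a file without registering a table.
// The data source format name prefixes the backtick-quoted path.
val df1 = sqlContext.sql("SELECT * FROM parquet.`/data/events.parquet`")

// SPARK-11778: DataFrameReader.table accepts a database-qualified name
val df2 = sqlContext.read.table("dbName.tableName")

// SPARK-10947: infer all JSON primitive values as strings during schema inference
val df3 = sqlContext.read
  .option("primitivesAsString", "true")
  .json("/data/events.json")
```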

Spark Streaming

  • API Updates
    • SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
    • SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
    • SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver before it is stored in memory, to customize what data is kept.
    • SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
  • UI Improvements
    • Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
    • Made output operations visible in the streaming tab as progress bars.
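The mapWithState transformation above (SPARK-2629) can be sketched with a minimal running-count example; this assumes a DStream of (word, 1) pairs named pairs, which is my own setup rather than anything from the post:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Mapping function: update the running count held in State for a key,
// then emit the (key, newCount) pair downstream.
def trackCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

// Replaces the older updateStateByKey pattern
val counts = pairs.mapWithState(StateSpec.function(trackCount _))
```

Because only the keys seen in each batch are touched (rather than the whole state map, as with updateStateByKey), this scales with batch size instead of total state size.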

MLlib

  • New algorithms/models
    • SPARK-8518 Survival analysis - Log-linear model for survival analysis
    • SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
    • SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
    • SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
    • SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
  • API improvements
    • ML Pipelines
      • SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
      • SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
    • R API
      • SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
      • SPARK-9681 Feature interactions in R formula - Interaction operator “:” in R formula
    • Python API - Many improvements to Python API to approach feature parity
  • Misc improvements
    • SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
    • SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
    • SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
  • Documentation improvements
    • SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
    • SPARK-11337 Testable example code - Automated testing for code in user guide examples
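Of the MLlib items, the LIBSVM data source (SPARK-10117) is among the simplest to try from spark-shell; a sketch assuming a sample LIBSVM-format file, with the path as a placeholder:

```scala
// SPARK-10117: LIBSVM exposed as a Spark SQL data source.
// The resulting DataFrame has "label" and "features" columns.
val data = sqlContext.read
  .format("libsvm")
  .load("/data/sample_libsvm_data.txt")

data.printSchema()
```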
