Compiling, Installing, and Deploying Spark 1.6.0 (Scala 2.11)


On January 4, 2016, Spark 1.6.0 was published on the official website, so I downloaded the release and compiled it.


With the experience from earlier builds and the Java dependencies already downloaded, compilation took only about a minute; after redeploying and reconfiguring, everything was up and running again. Very efficient.


As before, building against Scala 2.11 takes just one command: build/sbt -Dscala=2.11 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly




Testing one of the new features in Spark 1.6: the Dataset API.
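As a quick smoke test of the Dataset API (SPARK-9999), something along these lines can be run in a 1.6.0 spark-shell, where sqlContext is predefined; the case class and sample values are my own illustrative placeholders, not from the original post:

```scala
// In spark-shell (Spark 1.6.0), sqlContext is provided automatically
import sqlContext.implicits._

case class Person(name: String, age: Int)

// Create a strongly typed Dataset from a local collection
val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// Lambda functions over typed objects, still executed by the Spark SQL engine
val adults = ds.filter(_.age >= 30).map(_.name)
adults.show()
```

Unlike an RDD, the Dataset keeps a schema, so these typed operations still benefit from the Spark SQL optimizer.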


Other new features in 1.6 include:

Spark Core/SQL

  • API Updates
    • SPARK-9999 Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.
    • SPARK-10810 Session Management - Different users can share a cluster while having different configuration and temporary tables.
    • SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
    • SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
    • SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
    • SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
    • SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
    • SPARK-11778  - DataFrameReader.table supports specifying database name. For example, sqlContext.read.table(“dbName.tableName”) can be used to create a DataFrame from a table called “tableName” in the database “dbName”.
    • SPARK-10947  - With schema inference from JSON into a Dataframe, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false.
  • Performance
    • SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
    • SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
    • SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
    • SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
    • SPARK-10978 Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
    • SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
    • SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
    • SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
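Several of the Core/SQL API updates above are easy to try directly from spark-shell. A hedged sketch covering three of them; the file paths and the database/table names are placeholders of my own:

```scala
// SPARK-11197: run SQL directly over a file without registering a table.
// The data source format name prefixes the backtick-quoted path.
val df1 = sqlContext.sql("SELECT * FROM parquet.`/data/events.parquet`")

// SPARK-11778: DataFrameReader.table accepts a database-qualified name
val df2 = sqlContext.read.table("dbName.tableName")

// SPARK-10947: infer all JSON primitive values as strings during schema inference
val df3 = sqlContext.read
  .option("primitivesAsString", "true")
  .json("/data/events.json")
```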

Spark Streaming

  • API Updates
    • SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
    • SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
    • SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver before it is stored in memory, to customize what data is kept.
    • SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
  • UI Improvements
    • Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
    • Made output operations visible in the streaming tab as progress bars.
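The mapWithState transformation above (SPARK-2629) can be sketched with a minimal running-count example; this assumes a DStream of (word, 1) pairs named pairs, which is my own setup rather than anything from the post:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Mapping function: update the running count held in State for a key,
// then emit the (key, newCount) pair downstream.
def trackCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

// Replaces the older updateStateByKey pattern
val counts = pairs.mapWithState(StateSpec.function(trackCount _))
```

Because only the keys seen in each batch are touched (rather than the whole state map, as with updateStateByKey), this scales with batch size instead of total state size.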

MLlib

  • New algorithms/models
    • SPARK-8518 Survival analysis - Log-linear model for survival analysis
    • SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
    • SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
    • SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
    • SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
  • API improvements
    • ML Pipelines
      • SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
      • SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
    • R API
      • SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
      • SPARK-9681 Feature interactions in R formula - Interaction operator “:” in R formula
    • Python API - Many improvements to Python API to approach feature parity
  • Misc improvements
    • SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
    • SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
    • SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
  • Documentation improvements
    • SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
    • SPARK-11337 Testable example code - Automated testing for code in user guide examples
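Of the MLlib items, the LIBSVM data source (SPARK-10117) is among the simplest to try from spark-shell; a sketch assuming a sample LIBSVM-format file, with the path as a placeholder:

```scala
// SPARK-10117: LIBSVM exposed as a Spark SQL data source.
// The resulting DataFrame has "label" and "features" columns.
val data = sqlContext.read
  .format("libsvm")
  .load("/data/sample_libsvm_data.txt")

data.printSchema()
```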
