Spark 1.6 Preview


A new Dataset API

The RDD API is very flexible, but in some cases it is hard to optimize its execution. The DataFrame API optimizes execution internally, but it lacks some of the nice perks of the RDD API (e.g. it is harder to use UDFs, and there is a lack of strong types in Scala/Java).
The goal of Spark Datasets is to let developers easily write transformations over domain objects, while keeping the performance and robustness advantages of the Spark SQL execution engine.
For details, see SPARK-9999.
According to the design document, a Dataset generalizes a DataFrame: a DataFrame is simply a Dataset of Rows (Dataset[Row]), whereas a Dataset[T] can operate over user-defined types such as String, POJOs, case classes, and so on.
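
As an illustration, here is a minimal sketch of what the typed API looks like in Spark 1.6. It builds a Dataset from local objects and shows the DataFrame-to-Dataset conversion; the `Person` class and the `people.json` path are illustrative, not part of the original design doc.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Domain object used for the typed view of the data.
    case class Person(name: String, age: Long)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Build a typed Dataset directly from local objects...
        val people = Seq(Person("Ann", 34), Person("Bob", 16)).toDS()

        // ...or start from a DataFrame (Dataset[Row]) and convert with .as[T]:
        // val people = sqlContext.read.json("people.json").as[Person]

        // Transformations operate on Person objects with compile-time type checks.
        val adultNames = people.filter(_.age >= 18).map(_.name)
        adultNames.show()

        sc.stop()
      }
    }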

Requirements

  • Fast - In most cases, the performance of Datasets should be equal to or better than working with RDDs. Encoders should be as fast or faster than Kryo and Java serialization, and unnecessary conversion should be avoided.
  • Typesafe - Similar to RDDs, objects and functions that operate on those objects should provide compile-time safety where possible. When converting from data where the schema is not known at compile-time (for example data read from an external source such as JSON), the conversion function should fail-fast if there is a schema mismatch.
  • Support for a variety of object models - Default encoders should be provided for a variety of object models: primitive types, case classes, tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard conventions, such as Avro SpecificRecords, should also work out of the box.
  • Java Compatible - Datasets should provide a single API that works in both Scala and Java. Where possible, shared types like Array will be used in the API. Where not possible, overloaded functions should be provided for both languages. Scala concepts, such as ClassTags should not be required in the user-facing API.
  • Interoperates with DataFrames - Users should be able to seamlessly transition between Datasets and DataFrames, without specifying conversion boiler-plate. When names used in the input schema line-up with fields in the given class, no extra mapping should be necessary. Libraries like MLlib should not need to provide different interfaces for accepting DataFrames and Datasets as input.

Automatic memory configuration

Developers no longer need to worry about tuning memory-fraction allocation: Spark can now automatically grow and shrink the different memory regions used during application execution as needed. This gives a noticeable performance boost to operations such as joins and aggregations.
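
If you do want to override the automatic behavior, the unified memory manager can still be tuned through a few configuration keys. A sketch, assuming the 1.6 configuration names; the values shown are the documented 1.6 defaults, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Unified memory management is the default in Spark 1.6; the settings below
    // are optional overrides rather than required tuning.
    val conf = new SparkConf()
      .setAppName("unified-memory-demo")
      .setMaster("local[*]")
      // Fraction of the JVM heap shared by execution and storage (default 0.75 in 1.6).
      .set("spark.memory.fraction", "0.75")
      // Part of that region where cached blocks are protected from eviction (default 0.5).
      .set("spark.memory.storageFraction", "0.5")
    // To fall back to the pre-1.6 static split, set "spark.memory.useLegacyMode" to "true".
    val sc = new SparkContext(conf)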

Optimized state storage in Spark Streaming

Spark Streaming gets a new state-tracking API built on a "delta tracking" approach, which optimizes how large amounts of state are maintained in stateful streaming computations.
In short, the old updateStateByKey is superseded by the new trackStateByKey.
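
For reference, a minimal sketch of the new API: note that trackStateByKey was the name in the preview, and it shipped as mapWithState in the final 1.6 release, which is what is shown here. The `wordPairs` DStream is assumed to be built elsewhere in the job.

    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    // Running word count: only keys seen in the current batch are touched,
    // instead of rewriting the entire state as updateStateByKey does.
    def updateCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
      val newCount = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(newCount)
      (word, newCount)
    }

    // wordPairs is assumed to be a DStream[(String, Int)] of (word, 1) pairs.
    def runningCounts(wordPairs: DStream[(String, Int)]): DStream[(String, Int)] =
      wordPairs.mapWithState(StateSpec.function(updateCount _))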

Pipeline persistence in Spark ML

Spark ML pipelines can now be persisted and reloaded later in their current state. This is very useful, for example, for pipelines that train large models.
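
A sketch of what this looks like, using the familiar tokenizer + HashingTF + logistic regression pipeline; the paths and stage parameters are illustrative, and in the preview not every stage necessarily supports persistence yet:

    import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Assemble a simple text-classification pipeline.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](tokenizer, hashingTF, lr))

    // New in 1.6: persist the pipeline definition and reload it later.
    pipeline.save("/tmp/unfit-lr-pipeline")
    val reloaded = Pipeline.load("/tmp/unfit-lr-pipeline")

    // A fitted PipelineModel can be saved and reloaded the same way, so a large
    // trained model does not have to be retrained:
    //   val model = pipeline.fit(trainingDF)
    //   model.save("/tmp/fitted-lr-pipeline")
    //   val sameModel = PipelineModel.load("/tmp/fitted-lr-pipeline")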

Changes in Spark 1.6
