Latest Spark 2.1.0 Release Notes [Release date: 18/Dec/16]


Sub-task

  • [SPARK-1267] - Add a pip installer for PySpark
  • [SPARK-10372] - Add end-to-end tests for the scheduling code
  • [SPARK-14300] - Scala MLlib examples code merge and clean up
  • [SPARK-14480] - Remove meaningless StringIteratorReader for CSV data source for better performance
  • [SPARK-15232] - Add subquery SQL building tests to LogicalPlanToSQLSuite
  • [SPARK-15353] - Making peer selection for block replication pluggable
  • [SPARK-15453] - FileSourceScanExec to extract `outputOrdering` information
  • [SPARK-15590] - Paginate Job Table in Jobs tab
  • [SPARK-15591] - Paginate Stage Table in Stages tab
  • [SPARK-15698] - Ability to remove old metadata for Structured Streaming MetadataLog
  • [SPARK-15780] - Support mapValues on KeyValueGroupedDataset (see the mapValues sketch after this list)
  • [SPARK-15814] - Aggregator can return null result
  • [SPARK-15926] - Improve readability of DAGScheduler stage creation methods
  • [SPARK-15927] - Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.
  • [SPARK-16000] - Make model loading backward compatible with saved models using old vector columns
  • [SPARK-16104] - Do not create CSV writer object for every flush when writing
  • [SPARK-16137] - Random Forest wrapper in SparkR
  • [SPARK-16282] - Implement percentile SQL function (see the percentile/percentile_approx sketch after this list)
  • [SPARK-16283] - Implement percentile_approx SQL function
  • [SPARK-16287] - Implement str_to_map SQL function
  • [SPARK-16312] - Docs for Kafka 0.10 consumer integration
  • [SPARK-16318] - xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath
  • [SPARK-16380] - Update SQL examples and programming guide for Python language binding
  • [SPARK-16391] - KeyValueGroupedDataset.reduceGroups should support partial aggregation
  • [SPARK-16403] - Example cleanup and fix minor issues
  • [SPARK-16421] - Improve output from ML examples
  • [SPARK-16436] - checkEvaluation should support NaN values
  • [SPARK-16443] - ALS wrapper in SparkR
  • [SPARK-16444] - Isotonic Regression wrapper in SparkR
  • [SPARK-16445] - Multilayer Perceptron Classifier wrapper in SparkR
  • [SPARK-16446] - Gaussian Mixture Model wrapper in SparkR
  • [SPARK-16447] - LDA wrapper in SparkR
  • [SPARK-16508] - Fix documentation warnings found by R CMD check
  • [SPARK-16510] - Move SparkR test JAR into Spark, include its source code
  • [SPARK-16519] - Handle SparkR RDD generics that create warnings in R CMD check
  • [SPARK-16524] - Add RowBatch and RowBasedHashMapGenerator
  • [SPARK-16525] - Enable Row Based HashMap in HashAggregateExec
  • [SPARK-16577] - Add check-cran script to Jenkins
  • [SPARK-16579] - Add a spark install function
  • [SPARK-16581] - Making JVM backend calling functions public
  • [SPARK-16621] - Generate stable SQLs in SQLBuilder
  • [SPARK-16734] - Make sure examples in all language bindings are consistent
  • [SPARK-16735] - Fail to create a map contains decimal type with literals having different inferred precisions and scales
  • [SPARK-16757] - Set up caller context to HDFS and Yarn
  • [SPARK-16774] - Fix use of deprecated TimeStamp constructor (also providing incorrect results)
  • [SPARK-16776] - Fix Kafka deprecation warnings
  • [SPARK-16777] - Parquet schema converter depends on deprecated APIs
  • [SPARK-16778] - Fix use of deprecated SQLContext constructor
  • [SPARK-16779] - Fix unnecessary use of postfix operations
  • [SPARK-16800] - Fix Java Examples that throw exception
  • [SPARK-16814] - Fix deprecated use of ParquetWriter in Parquet test suites
  • [SPARK-16866] - Basic infrastructure for file-based SQL end-to-end tests
  • [SPARK-16904] - Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry
  • [SPARK-16963] - Change Source API so that sources do not need to keep unbounded state
  • [SPARK-16980] - Load only catalog table partition metadata required to answer a query
  • [SPARK-17007] - Move test data files into a test-data folder
  • [SPARK-17008] - Normalize query results using sorting
  • [SPARK-17009] - Use a new SparkSession for each test case
  • [SPARK-17011] - Support testing exceptions in queries
  • [SPARK-17015] - group-by-ordinal and order-by-ordinal test cases
  • [SPARK-17018] - literals.sql for testing literal parsing
  • [SPARK-17042] - Repl-defined classes cannot be replicated
  • [SPARK-17072] - generate table-level stats: stats generation/storing/loading
  • [SPARK-17073] - generate basic stats for column
  • [SPARK-17090] - Make tree aggregation level in linear/logistic regression configurable
  • [SPARK-17096] - Fix StreamingQueryListener to return message and stacktrace of actual exception
  • [SPARK-17138] - Python API for multinomial logistic regression
  • [SPARK-17149] - array.sql for testing array related functions
  • [SPARK-17157] - Add multiclass logistic regression SparkR Wrapper
  • [SPARK-17163] - Merge MLOR into a single LOR interface
  • [SPARK-17165] - FileStreamSource should not track the list of seen files indefinitely
  • [SPARK-17183] - put hive serde table schema to table properties like data source table
  • [SPARK-17188] - Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
  • [SPARK-17235] - MetadataLog should support purging old logs
  • [SPARK-17269] - Move finish analysis stage into its own file
  • [SPARK-17270] - Move object optimization rules into its own file
  • [SPARK-17272] - Move subquery optimizer rules into its own file
  • [SPARK-17273] - Move expression optimizer rules into a separate file
  • [SPARK-17274] - Move join optimizer rules into a separate file
  • [SPARK-17346] - Kafka 0.10 support in Structured Streaming (see the Kafka reader sketch after this list)
  • [SPARK-17372] - Running a file stream on a directory with partitioned subdirs throws NotSerializableException/StackOverflowError
  • [SPARK-17475] - HDFSMetadataLog should not leak CRC files
  • [SPARK-17513] - StreamExecution should discard unneeded metadata
  • [SPARK-17586] - Use Static member not via instance reference
  • [SPARK-17699] - from_json function for parsing json Strings into Structs (see the from_json/to_json sketch after this list)
  • [SPARK-17731] - Metrics for Structured Streaming
  • [SPARK-17764] - to_json function for parsing Structs to json Strings
  • [SPARK-17790] - Support for parallelizing R data.frame larger than 2GB
  • [SPARK-17800] - Introduce InterfaceStability annotation definition
  • [SPARK-17812] - More granular control of starting offsets (assign)
  • [SPARK-17813] - Maximum data per trigger
  • [SPARK-17830] - Annotate Spark SQL public APIs with InterfaceStability
  • [SPARK-17834] - Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
  • [SPARK-17864] - Mark data type APIs as stable, rather than DeveloperApi
  • [SPARK-17900] - Mark the following Spark SQL APIs as stable
  • [SPARK-17925] - Break fileSourceInterfaces.scala into multiple pieces
  • [SPARK-17926] - Add methods to convert StreamingQueryStatus to json
  • [SPARK-17927] - Remove dead code in WriterContainer
  • [SPARK-17946] - Python crossJoin API similar to Scala
  • [SPARK-17964] - Enable SparkR with Mesos client mode
  • [SPARK-17965] - Enable SparkR with Mesos cluster mode
  • [SPARK-17970] - Use metastore for managing filesource table partitions as well
  • [SPARK-17974] - Refactor FileCatalog classes to simplify the inheritance tree
  • [SPARK-17980] - Fix refreshByPath for converted Hive tables
  • [SPARK-17983] - Can't filter over mixed case parquet columns of converted Hive tables
  • [SPARK-17990] - ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names
  • [SPARK-17991] - Enable metastore partition pruning for unconverted hive tables by default
  • [SPARK-17992] - HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false
  • [SPARK-17994] - Add back a file status cache for catalog tables
  • [SPARK-18012] - Simplify WriterContainer code
  • [SPARK-18013] - R cross join API similar to python and Scala
  • [SPARK-18019] - Log instrumentation in GBTs
  • [SPARK-18021] - Refactor file name specification for data sources
  • [SPARK-18024] - Introduce an internal commit protocol API along with OutputCommitter implementation
  • [SPARK-18025] - Port streaming to use the commit protocol API
  • [SPARK-18026] - should not always lowercase partition columns of partition spec in parser
  • [SPARK-18042] - OutputWriter needs to return the path of the file written
  • [SPARK-18060] - Avoid unnecessary standardization in multinomial logistic regression training
  • [SPARK-18087] - Optimize insert to not require REPAIR TABLE
  • [SPARK-18101] - ExternalCatalogSuite should test with mixed case fields
  • [SPARK-18109] - Log instrumentation in GMM
  • [SPARK-18129] - Sign pip artifacts
  • [SPARK-18143] - History Server is broken because of the refactoring work in Structured Streaming
  • [SPARK-18145] - Update documentation for hive partition management in 2.1
  • [SPARK-18146] - Avoid using Union to chain together create table and repair partition commands
  • [SPARK-18151] - CLONE - MetadataLog should support purging old logs
  • [SPARK-18152] - CLONE - FileStreamSource should not track the list of seen files indefinitely
  • [SPARK-18153] - CLONE - Ability to remove old metadata for Structured Streaming MetadataLog
  • [SPARK-18154] - CLONE - Change Source API so that sources do not need to keep unbounded state
  • [SPARK-18156] - CLONE - StreamExecution should discard unneeded metadata
  • [SPARK-18164] - ForeachSink should fail the Spark job if `process` throws exception
  • [SPARK-18173] - data source tables should support truncating partition
  • [SPARK-18183] - INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
  • [SPARK-18184] - INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
  • [SPARK-18185] - Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
  • [SPARK-18191] - Port RDD API to use commit protocol
  • [SPARK-18192] - Support all file formats in structured streaming
  • [SPARK-18217] - Disallow creating permanent views based on temporary views or UDFs
  • [SPARK-18239] - Gradient Boosted Tree wrapper in SparkR
  • [SPARK-18244] - Rename partitionProviderIsHive -> tracksPartitionsInCatalog
  • [SPARK-18260] - from_json can throw a better exception when it can't find the column or be nullSafe
  • [SPARK-18264] - Build and package R vignettes
  • [SPARK-18283] - Add a test to make sure the default starting offset is latest
  • [SPARK-18295] - Match up to_json to from_json in null safety
  • [SPARK-18302] - correct several partition related behaviours of ExternalCatalog
  • [SPARK-18317] - ML, Graph 2.1 QA: API: Binary incompatible changes
  • [SPARK-18318] - ML, Graph 2.1 QA: API: New Scala APIs, docs
  • [SPARK-18319] - ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
  • [SPARK-18320] - ML 2.1 QA: API: Python API coverage
  • [SPARK-18321] - ML 2.1 QA: API: Java compatibility, docs
  • [SPARK-18322] - ML, Graph 2.1 QA: Update user guide for new features & APIs
  • [SPARK-18323] - Update MLlib, GraphX websites for 2.1
  • [SPARK-18324] - ML, Graph 2.1 QA: Programming guide update and migration guide
  • [SPARK-18325] - SparkR 2.1 QA: Check for new R APIs requiring example code
  • [SPARK-18326] - SparkR 2.1 QA: New R APIs and API docs
  • [SPARK-18332] - SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
  • [SPARK-18333] - Revert hacks in parquet and orc reader to support case insensitive resolution
  • [SPARK-18416] - State Store leaks temporary files
  • [SPARK-18422] - Fix wholeTextFiles test to pass on Windows in JavaAPISuite
  • [SPARK-18423] - ReceiverTracker should close checkpoint dir when stopped even if it was not started
  • [SPARK-18440] - Fix FileStreamSink with aggregation + watermark + append mode
  • [SPARK-18445] - Fix `Note:`/`NOTE:`/`Note that` across Scala/Java API documentation
  • [SPARK-18447] - Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
  • [SPARK-18459] - Rename triggerId to batchId in StreamingQueryStatus.triggerDetails
  • [SPARK-18460] - Include triggerDetails in StreamingQueryStatus.json
  • [SPARK-18461] - Improve docs on StreamingQueryListener and StreamingQuery.status
  • [SPARK-18477] - Enable interrupts for HDFS in HDFSMetadataLog
  • [SPARK-18505] - Simplify AnalyzeColumnCommand
  • [SPARK-18507] - Major performance regression in SHOW PARTITIONS on partitioned Hive tables
  • [SPARK-18514] - Fix `Note:`/`NOTE:`/`Note that` across R API documentation
  • [SPARK-18522] - Create explicit contract for column stats serialization
  • [SPARK-18544] - Append with df.saveAsTable writes data to wrong location
  • [SPARK-18545] - Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
  • [SPARK-18582] - Whitelist LogicalPlan operators allowed in correlated subqueries
  • [SPARK-18635] - Partition name/values not escaped correctly in some cases
  • [SPARK-18639] - Build only a single pip package
  • [SPARK-18652] - Include the example data and third-party licenses in pyspark package
  • [SPARK-18659] - Incorrect behaviors in overwrite table for datasource tables
  • [SPARK-18661] - Creating a partitioned datasource table should not scan all files for table
  • [SPARK-18679] - Regression in file listing performance
  • [SPARK-18685] - Fix all tests in ExecutorClassLoaderSuite to pass on Windows
  • [SPARK-18815] - NPE when collecting column stats for string/binary column having only null values
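
A few of the API additions above are easier to grasp with short usage sketches. First, SPARK-15780 adds mapValues to KeyValueGroupedDataset, letting you transform the value side of a grouping before aggregating. A minimal sketch against the Spark 2.1 Dataset API; the sample data is invented:

```scala
import org.apache.spark.sql.SparkSession

object MapValuesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-values").getOrCreate()
    import spark.implicits._

    // Invented sample data: (category, amount) pairs.
    val sales = Seq(("books", 10), ("books", 5), ("toys", 7)).toDS()

    // mapValues (SPARK-15780) projects each grouped value to its amount
    // without regrouping; reduceGroups then sums per key.
    val totals = sales
      .groupByKey(_._1)
      .mapValues(_._2)
      .reduceGroups(_ + _)

    totals.show()
    spark.stop()
  }
}
```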
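SPARK-16282 and SPARK-16283 add exact and approximate percentile SQL functions (the latter backed by the QuantileSummaries code moved in SPARK-17188). A minimal sketch calling both through spark.sql; the table name and values are invented:

```scala
import org.apache.spark.sql.SparkSession

object PercentileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("percentile").getOrCreate()
    import spark.implicits._

    // Invented sample column with one outlier.
    Seq(1, 2, 3, 4, 5, 100).toDF("x").createOrReplaceTempView("t")

    // percentile (SPARK-16282) is exact; percentile_approx (SPARK-16283)
    // trades accuracy for bounded memory on large inputs.
    spark.sql(
      "SELECT percentile(x, 0.5) AS median, percentile_approx(x, 0.5) AS approx_median FROM t"
    ).show()

    spark.stop()
  }
}
```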
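SPARK-17346 adds a Kafka 0.10 source for Structured Streaming, with SPARK-17812 and SPARK-17813 contributing finer-grained starting offsets and a per-trigger rate cap. A sketch assuming the spark-sql-kafka-0-10 artifact is on the classpath; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-stream").getOrCreate()

    // Placeholder broker and topic; requires spark-sql-kafka-0-10 on the classpath.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")    // SPARK-17812: control where reading starts
      .option("maxOffsetsPerTrigger", "1000") // SPARK-17813: cap data per trigger
      .load()

    // Kafka records arrive as binary key/value columns; cast value to text.
    val query = stream.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```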
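Finally, SPARK-17699 and SPARK-17764 add the from_json and to_json functions for converting between JSON string columns and struct columns. A minimal sketch; the JSON payload and schema are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object JsonColumnsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-cols").getOrCreate()
    import spark.implicits._

    // Invented column of JSON strings and the schema we expect it to follow.
    val df = Seq("""{"name":"alice","age":30}""").toDF("json")
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    // from_json (SPARK-17699): parse the string column into a struct column.
    val parsed = df.select(from_json($"json", schema).as("data"))
    parsed.select($"data.name", $"data.age").show()

    // to_json (SPARK-17764): serialize the struct column back to a JSON string.
    parsed.select(to_json($"data").as("json")).show(truncate = false)

    spark.stop()
  }
}
```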