Latest Spark 2.1.0 Release Notes [Release date: 18/Dec/16]


Sub-task

  • [SPARK-1267] - Add a pip installer for PySpark
  • [SPARK-10372] - Add end-to-end tests for the scheduling code
  • [SPARK-14300] - Scala MLlib examples code merge and clean up
  • [SPARK-14480] - Remove meaningless StringIteratorReader for CSV data source for better performance
  • [SPARK-15232] - Add subquery SQL building tests to LogicalPlanToSQLSuite
  • [SPARK-15353] - Making peer selection for block replication pluggable
  • [SPARK-15453] - FileSourceScanExec to extract `outputOrdering` information
  • [SPARK-15590] - Paginate Job Table in Jobs tab
  • [SPARK-15591] - Paginate Stage Table in Stages tab
  • [SPARK-15698] - Ability to remove old metadata for Structured Streaming MetadataLog
  • [SPARK-15780] - Support mapValues on KeyValueGroupedDataset (see the mapValues sketch after this list)
  • [SPARK-15814] - Aggregator can return null result
  • [SPARK-15926] - Improve readability of DAGScheduler stage creation methods
  • [SPARK-15927] - Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.
  • [SPARK-16000] - Make model loading backward compatible with saved models using old vector columns
  • [SPARK-16104] - Do not create CSV writer object for every flush when writing
  • [SPARK-16137] - Random Forest wrapper in SparkR
  • [SPARK-16282] - Implement percentile SQL function (see the percentile/percentile_approx sketch after this list)
  • [SPARK-16283] - Implement percentile_approx SQL function
  • [SPARK-16287] - Implement str_to_map SQL function
  • [SPARK-16312] - Docs for Kafka 0.10 consumer integration
  • [SPARK-16318] - xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath
  • [SPARK-16380] - Update SQL examples and programming guide for Python language binding
  • [SPARK-16391] - KeyValueGroupedDataset.reduceGroups should support partial aggregation
  • [SPARK-16403] - Example cleanup and fix minor issues
  • [SPARK-16421] - Improve output from ML examples
  • [SPARK-16436] - checkEvaluation should support NaN values
  • [SPARK-16443] - ALS wrapper in SparkR
  • [SPARK-16444] - Isotonic Regression wrapper in SparkR
  • [SPARK-16445] - Multilayer Perceptron Classifier wrapper in SparkR
  • [SPARK-16446] - Gaussian Mixture Model wrapper in SparkR
  • [SPARK-16447] - LDA wrapper in SparkR
  • [SPARK-16508] - Fix documentation warnings found by R CMD check
  • [SPARK-16510] - Move SparkR test JAR into Spark, include its source code
  • [SPARK-16519] - Handle SparkR RDD generics that create warnings in R CMD check
  • [SPARK-16524] - Add RowBatch and RowBasedHashMapGenerator
  • [SPARK-16525] - Enable Row Based HashMap in HashAggregateExec
  • [SPARK-16577] - Add check-cran script to Jenkins
  • [SPARK-16579] - Add a spark install function
  • [SPARK-16581] - Making JVM backend calling functions public
  • [SPARK-16621] - Generate stable SQLs in SQLBuilder
  • [SPARK-16734] - Make sure examples in all language bindings are consistent
  • [SPARK-16735] - Fail to create a map contains decimal type with literals having different inferred precisions and scales
  • [SPARK-16757] - Set up caller context to HDFS and Yarn
  • [SPARK-16774] - Fix use of deprecated TimeStamp constructor (also providing incorrect results)
  • [SPARK-16776] - Fix Kafka deprecation warnings
  • [SPARK-16777] - Parquet schema converter depends on deprecated APIs
  • [SPARK-16778] - Fix use of deprecated SQLContext constructor
  • [SPARK-16779] - Fix unnecessary use of postfix operations
  • [SPARK-16800] - Fix Java Examples that throw exception
  • [SPARK-16814] - Fix deprecated use of ParquetWriter in Parquet test suites
  • [SPARK-16866] - Basic infrastructure for file-based SQL end-to-end tests
  • [SPARK-16904] - Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry
  • [SPARK-16963] - Change Source API so that sources do not need to keep unbounded state
  • [SPARK-16980] - Load only catalog table partition metadata required to answer a query
  • [SPARK-17007] - Move test data files into a test-data folder
  • [SPARK-17008] - Normalize query results using sorting
  • [SPARK-17009] - Use a new SparkSession for each test case
  • [SPARK-17011] - Support testing exceptions in queries
  • [SPARK-17015] - group-by-ordinal and order-by-ordinal test cases
  • [SPARK-17018] - literals.sql for testing literal parsing
  • [SPARK-17042] - Repl-defined classes cannot be replicated
  • [SPARK-17072] - generate table-level stats: stats generation/storing/loading
  • [SPARK-17073] - generate basic stats for column
  • [SPARK-17090] - Make tree aggregation level in linear/logistic regression configurable
  • [SPARK-17096] - Fix StreamingQueryListener to return message and stacktrace of actual exception
  • [SPARK-17138] - Python API for multinomial logistic regression
  • [SPARK-17149] - array.sql for testing array related functions
  • [SPARK-17157] - Add multiclass logistic regression SparkR Wrapper
  • [SPARK-17163] - Merge MLOR into a single LOR interface
  • [SPARK-17165] - FileStreamSource should not track the list of seen files indefinitely
  • [SPARK-17183] - put hive serde table schema to table properties like data source table
  • [SPARK-17188] - Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx
  • [SPARK-17235] - MetadataLog should support purging old logs
  • [SPARK-17269] - Move finish analysis stage into its own file
  • [SPARK-17270] - Move object optimization rules into its own file
  • [SPARK-17272] - Move subquery optimizer rules into its own file
  • [SPARK-17273] - Move expression optimizer rules into a separate file
  • [SPARK-17274] - Move join optimizer rules into a separate file
  • [SPARK-17346] - Kafka 0.10 support in Structured Streaming (see the Kafka reader sketch after this list)
  • [SPARK-17372] - Running a file stream on a directory with partitioned subdirs throws NotSerializableException/StackOverflowError
  • [SPARK-17475] - HDFSMetadataLog should not leak CRC files
  • [SPARK-17513] - StreamExecution should discard unneeded metadata
  • [SPARK-17586] - Use Static member not via instance reference
  • [SPARK-17699] - from_json function for parsing json Strings into Structs (see the from_json/to_json sketch after this list)
  • [SPARK-17731] - Metrics for Structured Streaming
  • [SPARK-17764] - to_json function for parsing Structs to json Strings
  • [SPARK-17790] - Support for parallelizing R data.frame larger than 2GB
  • [SPARK-17800] - Introduce InterfaceStability annotation definition
  • [SPARK-17812] - More granular control of starting offsets (assign)
  • [SPARK-17813] - Maximum data per trigger
  • [SPARK-17830] - Annotate Spark SQL public APIs with InterfaceStability
  • [SPARK-17834] - Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
  • [SPARK-17864] - Mark data type APIs as stable, rather than DeveloperApi
  • [SPARK-17900] - Mark the following Spark SQL APIs as stable
  • [SPARK-17925] - Break fileSourceInterfaces.scala into multiple pieces
  • [SPARK-17926] - Add methods to convert StreamingQueryStatus to json
  • [SPARK-17927] - Remove dead code in WriterContainer
  • [SPARK-17946] - Python crossJoin API similar to Scala
  • [SPARK-17964] - Enable SparkR with Mesos client mode
  • [SPARK-17965] - Enable SparkR with Mesos cluster mode
  • [SPARK-17970] - Use metastore for managing filesource table partitions as well
  • [SPARK-17974] - Refactor FileCatalog classes to simplify the inheritance tree
  • [SPARK-17980] - Fix refreshByPath for converted Hive tables
  • [SPARK-17983] - Can't filter over mixed case parquet columns of converted Hive tables
  • [SPARK-17990] - ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names
  • [SPARK-17991] - Enable metastore partition pruning for unconverted hive tables by default
  • [SPARK-17992] - HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false
  • [SPARK-17994] - Add back a file status cache for catalog tables
  • [SPARK-18012] - Simplify WriterContainer code
  • [SPARK-18013] - R cross join API similar to python and Scala
  • [SPARK-18019] - Log instrumentation in GBTs
  • [SPARK-18021] - Refactor file name specification for data sources
  • [SPARK-18024] - Introduce an internal commit protocol API along with OutputCommitter implementation
  • [SPARK-18025] - Port streaming to use the commit protocol API
  • [SPARK-18026] - should not always lowercase partition columns of partition spec in parser
  • [SPARK-18042] - OutputWriter needs to return the path of the file written
  • [SPARK-18060] - Avoid unnecessary standardization in multinomial logistic regression training
  • [SPARK-18087] - Optimize insert to not require REPAIR TABLE
  • [SPARK-18101] - ExternalCatalogSuite should test with mixed case fields
  • [SPARK-18109] - Log instrumentation in GMM
  • [SPARK-18129] - Sign pip artifacts
  • [SPARK-18143] - History Server is broken because of the refactoring work in Structured Streaming
  • [SPARK-18145] - Update documentation for hive partition management in 2.1
  • [SPARK-18146] - Avoid using Union to chain together create table and repair partition commands
  • [SPARK-18151] - CLONE - MetadataLog should support purging old logs
  • [SPARK-18152] - CLONE - FileStreamSource should not track the list of seen files indefinitely
  • [SPARK-18153] - CLONE - Ability to remove old metadata for Structured Streaming MetadataLog
  • [SPARK-18154] - CLONE - Change Source API so that sources do not need to keep unbounded state
  • [SPARK-18156] - CLONE - StreamExecution should discard unneeded metadata
  • [SPARK-18164] - ForeachSink should fail the Spark job if `process` throws exception
  • [SPARK-18173] - data source tables should support truncating partition
  • [SPARK-18183] - INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
  • [SPARK-18184] - INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
  • [SPARK-18185] - Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
  • [SPARK-18191] - Port RDD API to use commit protocol
  • [SPARK-18192] - Support all file formats in structured streaming
  • [SPARK-18217] - Disallow creating permanent views based on temporary views or UDFs
  • [SPARK-18239] - Gradient Boosted Tree wrapper in SparkR
  • [SPARK-18244] - Rename partitionProviderIsHive -> tracksPartitionsInCatalog
  • [SPARK-18260] - from_json can throw a better exception when it can't find the column or be nullSafe
  • [SPARK-18264] - Build and package R vignettes
  • [SPARK-18283] - Add a test to make sure the default starting offset is latest
  • [SPARK-18295] - Match up to_json to from_json in null safety
  • [SPARK-18302] - correct several partition related behaviours of ExternalCatalog
  • [SPARK-18317] - ML, Graph 2.1 QA: API: Binary incompatible changes
  • [SPARK-18318] - ML, Graph 2.1 QA: API: New Scala APIs, docs
  • [SPARK-18319] - ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
  • [SPARK-18320] - ML 2.1 QA: API: Python API coverage
  • [SPARK-18321] - ML 2.1 QA: API: Java compatibility, docs
  • [SPARK-18322] - ML, Graph 2.1 QA: Update user guide for new features & APIs
  • [SPARK-18323] - Update MLlib, GraphX websites for 2.1
  • [SPARK-18324] - ML, Graph 2.1 QA: Programming guide update and migration guide
  • [SPARK-18325] - SparkR 2.1 QA: Check for new R APIs requiring example code
  • [SPARK-18326] - SparkR 2.1 QA: New R APIs and API docs
  • [SPARK-18332] - SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
  • [SPARK-18333] - Revert hacks in parquet and orc reader to support case insensitive resolution
  • [SPARK-18416] - State Store leaks temporary files
  • [SPARK-18422] - Fix wholeTextFiles test to pass on Windows in JavaAPISuite
  • [SPARK-18423] - ReceiverTracker should close checkpoint dir when stopped even if it was not started
  • [SPARK-18440] - Fix FileStreamSink with aggregation + watermark + append mode
  • [SPARK-18445] - Fix `Note:`/`NOTE:`/`Note that` across Scala/Java API documentation
  • [SPARK-18447] - Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
  • [SPARK-18459] - Rename triggerId to batchId in StreamingQueryStatus.triggerDetails
  • [SPARK-18460] - Include triggerDetails in StreamingQueryStatus.json
  • [SPARK-18461] - Improve docs on StreamingQueryListener and StreamingQuery.status
  • [SPARK-18477] - Enable interrupts for HDFS in HDFSMetadataLog
  • [SPARK-18505] - Simplify AnalyzeColumnCommand
  • [SPARK-18507] - Major performance regression in SHOW PARTITIONS on partitioned Hive tables
  • [SPARK-18514] - Fix `Note:`/`NOTE:`/`Note that` across R API documentation
  • [SPARK-18522] - Create explicit contract for column stats serialization
  • [SPARK-18544] - Append with df.saveAsTable writes data to wrong location
  • [SPARK-18545] - Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
  • [SPARK-18582] - Whitelist LogicalPlan operators allowed in correlated subqueries
  • [SPARK-18635] - Partition name/values not escaped correctly in some cases
  • [SPARK-18639] - Build only a single pip package
  • [SPARK-18652] - Include the example data and third-party licenses in pyspark package
  • [SPARK-18659] - Incorrect behaviors in overwrite table for datasource tables
  • [SPARK-18661] - Creating a partitioned datasource table should not scan all files for table
  • [SPARK-18679] - Regression in file listing performance
  • [SPARK-18685] - Fix all tests in ExecutorClassLoaderSuite to pass on Windows
  • [SPARK-18815] - NPE when collecting column stats for string/binary column having only null values
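
A few of the API additions above are easier to grasp with short usage sketches. First, SPARK-15780 adds mapValues to KeyValueGroupedDataset, letting you transform the value side of a grouping before aggregating. A minimal sketch against the Spark 2.1 Dataset API; the sample data is invented:

```scala
import org.apache.spark.sql.SparkSession

object MapValuesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-values").getOrCreate()
    import spark.implicits._

    // Invented sample data: (category, amount) pairs.
    val sales = Seq(("books", 10), ("books", 5), ("toys", 7)).toDS()

    // mapValues (SPARK-15780) projects each grouped value to its amount
    // without regrouping; reduceGroups then sums per key.
    val totals = sales
      .groupByKey(_._1)
      .mapValues(_._2)
      .reduceGroups(_ + _)

    totals.show()
    spark.stop()
  }
}
```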
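SPARK-16282 and SPARK-16283 add exact and approximate percentile SQL functions (the latter backed by the QuantileSummaries code moved in SPARK-17188). A minimal sketch calling both through spark.sql; the table name and values are invented:

```scala
import org.apache.spark.sql.SparkSession

object PercentileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("percentile").getOrCreate()
    import spark.implicits._

    // Invented sample column with one outlier.
    Seq(1, 2, 3, 4, 5, 100).toDF("x").createOrReplaceTempView("t")

    // percentile (SPARK-16282) is exact; percentile_approx (SPARK-16283)
    // trades accuracy for bounded memory on large inputs.
    spark.sql(
      "SELECT percentile(x, 0.5) AS median, percentile_approx(x, 0.5) AS approx_median FROM t"
    ).show()

    spark.stop()
  }
}
```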
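SPARK-17346 adds a Kafka 0.10 source for Structured Streaming, with SPARK-17812 and SPARK-17813 contributing finer-grained starting offsets and a per-trigger rate cap. A sketch assuming the spark-sql-kafka-0-10 artifact is on the classpath; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-stream").getOrCreate()

    // Placeholder broker and topic; requires spark-sql-kafka-0-10 on the classpath.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")    // SPARK-17812: control where reading starts
      .option("maxOffsetsPerTrigger", "1000") // SPARK-17813: cap data per trigger
      .load()

    // Kafka records arrive as binary key/value columns; cast value to text.
    val query = stream.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```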
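Finally, SPARK-17699 and SPARK-17764 add the from_json and to_json functions for converting between JSON string columns and struct columns. A minimal sketch; the JSON payload and schema are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object JsonColumnsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-cols").getOrCreate()
    import spark.implicits._

    // Invented column of JSON strings and the schema we expect it to follow.
    val df = Seq("""{"name":"alice","age":30}""").toDF("json")
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    // from_json (SPARK-17699): parse the string column into a struct column.
    val parsed = df.select(from_json($"json", schema).as("data"))
    parsed.select($"data.name", $"data.age").show()

    // to_json (SPARK-17764): serialize the struct column back to a JSON string.
    parsed.select(to_json($"data").as("json")).show(truncate = false)

    spark.stop()
  }
}
```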