Apache CarbonData


Abstract

Apache CarbonData is a new Apache Hadoop native file format for faster interactive query. It uses advanced columnar storage, index, compression and encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.

https://github.com/HuaweiBigData/carbondata

Background

Support interactive OLAP-style queries over big data in seconds.
Support fast queries on individual records, which require touching all fields.
Support fast data loading and incremental loads at intervals of minutes.
Support HDFS so that customers can leverage their existing Hadoop clusters.
Support time-based data retention.

Rationale

CarbonData contains multiple modules, which are classified into two categories:

CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and the API for reading and writing.
CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned, to abstract the execution runtime.

Features

Indexing

1. Multi-dimensional Key (B+ Tree index)

The data blocks are written to disk in sequence, and within each data block each column block is written in sequence. Finally, the metadata block for the file is written, with information about the byte position of each block in the file, the min-max statistics index, and the start and end multi-dimensional key (MDK) of each data block. Since all the data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+ Tree, and the file can be logically represented as a B+ Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
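The lookup this enables can be sketched in a few lines. This is an illustration only, not CarbonData's implementation: a binary search over the sorted start and end keys stands in for the B+ Tree traversal, and the block ranges, the tuple encoding of the MDK, and the function name blocks_for_range are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical file footer: one (start_mdk, end_mdk) pair per data block.
# Each key is a tuple of encoded dimension values; blocks are sorted file-wide.
block_ranges = [
    ((1, 1), (1, 9)),
    ((2, 1), (2, 7)),
    ((2, 8), (3, 4)),
    ((3, 5), (4, 2)),
]


def blocks_for_range(ranges, low, high):
    """Return indices of blocks whose key range overlaps [low, high]."""
    starts = [r[0] for r in ranges]
    ends = [r[1] for r in ranges]
    first = bisect_left(ends, low)      # first block whose end key >= low
    last = bisect_right(starts, high)   # one past the last block whose start key <= high
    return list(range(first, last))


# Only blocks 1 and 2 can contain keys between (2, 5) and (3, 1).
candidates = blocks_for_range(block_ranges, (2, 5), (3, 1))
print(candidates)
```

Because the keys are sorted, the search touches only the in-memory key ranges; the data blocks on disk are read only for the candidates that remain.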

2. Inverted index

Inverted indexes are widely used in search engines. In CarbonData, the inverted index helps the processing/query engine filter within a single HDFS block. Furthermore, combining bitmaps with the inverted index at query time makes it possible to accelerate operations such as count distinct.
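As a rough sketch of the idea, assuming a toy in-memory block rather than CarbonData's on-disk layout, an inverted index maps each column value to the row ids that hold it. The column data, the helper names, and the use of a Python integer as a bitmap are all hypothetical choices for this example.

```python
from collections import defaultdict


def build_inverted_index(column_values):
    """Map each distinct value to the sorted list of row ids that contain it."""
    index = defaultdict(list)
    for row_id, value in enumerate(column_values):
        index[value].append(row_id)
    return dict(index)


def to_bitmap(row_ids):
    """Encode a set of row ids as an integer bitmap (bit r set means row r matches)."""
    bitmap = 0
    for row_id in row_ids:
        bitmap |= 1 << row_id
    return bitmap


# Two columns of one hypothetical block.
city = ["paris", "oslo", "paris", "rome", "oslo", "paris"]
status = ["ok", "err", "ok", "ok", "err", "err"]

city_index = build_inverted_index(city)
status_index = build_inverted_index(status)

# Filtering: the matching rows are read straight from the index, with no scan.
paris_rows = city_index.get("paris", [])

# Count distinct cities among rows where status == "err":
# a city counts if its row bitmap intersects the filter bitmap.
err_bitmap = to_bitmap(status_index["err"])
distinct_cities_with_err = sum(
    1 for rows in city_index.values() if to_bitmap(rows) & err_bitmap
)
print(paris_rows, distinct_cities_with_err)
```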

3. MinMax index

A min-max index is created for every column so that the processing/query engine can skip scans that are not required.
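A minimal sketch of how such skipping works, assuming each block carries the minimum and maximum of one numeric column. The block contents and the function name scan_greater_than are made up for illustration.

```python
# Hypothetical blocks, each with min/max statistics for one numeric column.
blocks = [
    {"min": 10, "max": 19, "rows": [10, 12, 19]},
    {"min": 20, "max": 35, "rows": [20, 28, 35]},
    {"min": 36, "max": 50, "rows": [36, 41, 50]},
]


def scan_greater_than(blocks, threshold):
    """Return values > threshold and the number of blocks actually scanned."""
    scanned = 0
    result = []
    for block in blocks:
        if block["max"] <= threshold:
            continue  # no value in this block can match, so skip it entirely
        scanned += 1
        result.extend(v for v in block["rows"] if v > threshold)
    return result, scanned


values, blocks_scanned = scan_greater_than(blocks, 30)
print(values, blocks_scanned)
```

Only the statistics are consulted for the first block; its rows are never read.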

Global Dictionary

Besides reducing I/O, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert it (late materialization). We have observed dramatic performance improvements in OLAP analytic scenarios where tables contain many columns of string data type. The data is converted back to user-readable form just before the processing/query engine returns results to the user.
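The following sketch illustrates late materialization with a dictionary, under the assumption that codes are small integers assigned in sorted order. The column data and the helper build_dictionary are hypothetical, and the example does not reflect how CarbonData actually assigns or stores its codes.

```python
from collections import Counter


def build_dictionary(values):
    """Assign an integer code to each distinct value (sorted order, for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


country = ["us", "cn", "us", "in", "cn", "us"]

dictionary = build_dictionary(country)                      # string -> code
reverse = {code: value for value, code in dictionary.items()}  # code -> string
encoded = [dictionary[value] for value in country]          # column as stored

# The group-by runs entirely on integer codes; no strings are compared.
counts_by_code = Counter(encoded)

# Decode only the final result, just before returning it to the user.
result = {reverse[code]: n for code, n in counts_by_code.items()}
print(result)
```

The aggregation compares integers rather than strings, and decoding happens once per result row instead of once per input row.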

Column Group

Sometimes users want to run processing or queries over multiple columns of one table, for example when scanning for an individual record in a troubleshooting scenario. In this case, row format is more efficient than columnar format, since the workload touches all columns. To accelerate this, CarbonData supports storing a group of columns in row format, so the data in a column group is stored together, which enables fast retrieval.
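The difference between the two layouts can be sketched as follows. This is an in-memory toy, so it shows only the access pattern, not the I/O savings; the table contents and the function names are assumptions for the example.

```python
# A small table in columnar layout: one list per column.
columnar = {
    "id": [1, 2, 3],
    "host": ["a", "b", "c"],
    "status": [200, 500, 200],
}


def fetch_record_columnar(table, row_id):
    """Assemble one record by visiting every column separately."""
    return {name: column[row_id] for name, column in table.items()}


def make_column_group(table, names):
    """Store the chosen columns together, one tuple per row."""
    return list(zip(*(table[name] for name in names)))


def fetch_record_grouped(group, names, row_id):
    """Read one record with a single lookup into the column group."""
    return dict(zip(names, group[row_id]))


group_names = ("id", "host", "status")
group = make_column_group(columnar, group_names)

record = fetch_record_grouped(group, group_names, 1)
print(record)
```

Both functions return the same record; the grouped version reaches it with one lookup, whereas the columnar version needs one lookup per column.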

Optimized for multiple use cases

CarbonData indices and the dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, deciding on and tuning the format before loading data into CarbonData.

For example:

Use Case                    Supporting Features
Interactive OLAP query      Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index
High throughput scan        Global dictionary, Minmax index
Low latency point query     Multi-dimensional Key (B+ Tree index), Partitioning
Individual record query     Column group, Global dictionary

Big Data Processing Framework Integration

CarbonData provides InputFormat/OutputFormat interfaces for reading and writing CarbonData files, and also provides an abstract API for processing data stored in CarbonData format with data processing frameworks.
CarbonData provides deep integration with Apache Spark, including predicate push down, column pruning, and aggregation push down, so users can use Spark SQL to connect to and query CarbonData.
CarbonData can integrate with various big data query/processing frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.

Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala

0 0