Apache CarbonData
Source: Internet. Editor: 程序博客网. Time: 2024/06/05 23:51
Abstract
Apache CarbonData is a new Apache Hadoop native file format for faster interactive queries. It uses advanced columnar storage, index, compression and encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.
https://github.com/HuaweiBigData/carbondata
Background
- Support interactive OLAP-style queries over big data in seconds.
- Support fast queries on individual records, which require touching all fields.
- Support fast data loading, with incremental loads at intervals of minutes.
- Support HDFS so that customers can leverage their existing Hadoop clusters.
- Support time-based data retention.
Rationale
CarbonData contains multiple modules, which fall into two categories:
- CarbonData File Format: the core implementation of the file format, including columnar storage, index, dictionary, encoding and compression, and the API for reading and writing.
- CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned, to abstract the execution runtime.
Features
Indexing
1. Multi-dimensional Key (B+ Tree index)
- The data blocks are written to disk in sequence, and within each data block the column blocks are written in sequence. Finally, the metadata block for the file is written; it holds the byte position of each block in the file, the Min-Max statistics index, and the start and end MDK (multi-dimensional key) of each data block. Since all data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree, so the file can be logically represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
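The lookup this layout enables can be sketched as follows. This is a minimal illustration, not CarbonData's implementation: `BlockMeta`, `find_blocks` and the sample footer are hypothetical names and data, and a binary search over the sorted block boundaries stands in for the B+Tree traversal.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class BlockMeta:
    """Footer entry for one data block (hypothetical layout)."""
    offset: int       # byte position of the block in the file
    start_mdk: tuple  # smallest multi-dimensional key in the block
    end_mdk: tuple    # largest multi-dimensional key in the block


def find_blocks(footer, low, high):
    """Return the blocks whose [start_mdk, end_mdk] range overlaps [low, high].

    The blocks are sorted by key, so two binary searches over the block
    boundaries replace a scan of every block.
    """
    ends = [b.end_mdk for b in footer]
    starts = [b.start_mdk for b in footer]
    first = bisect_left(ends, low)      # first block whose end key >= low
    last = bisect_right(starts, high)   # first block whose start key > high
    return footer[first:last]


# Assumed sample footer: four blocks keyed on a two-column MDK.
footer = [
    BlockMeta(0, (1, 1), (1, 9)),
    BlockMeta(4096, (2, 0), (3, 4)),
    BlockMeta(8192, (3, 5), (5, 2)),
    BlockMeta(12288, (5, 3), (7, 7)),
]

hits = find_blocks(footer, (3, 0), (4, 0))
print([b.offset for b in hits])
```

Only the byte offsets of the candidate blocks are returned, so a reader would seek to those positions and leave the other blocks untouched.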
2. Inverted index
- Inverted indexes are widely used in search engines. This index helps the processing/query engine filter inside one HDFS block. Furthermore, combining bitmaps with the inverted index at query time makes it possible to accelerate count-distinct-like operations.
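As a rough illustration of the idea, the sketch below builds an inverted index per column and combines it with bitmaps for filtering and for a filtered count distinct. The helper names and the sample columns are assumptions made for this example; Python integers serve as simple bitmaps, which is not how CarbonData stores them.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value to the list of row ids that hold it."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def to_bitmap(row_ids):
    """Pack row ids into an int used as a bitmap (bit i set = row i matches)."""
    bitmap = 0
    for row_id in row_ids:
        bitmap |= 1 << row_id
    return bitmap


# Assumed sample data for one block.
city = ["sz", "bj", "sz", "sh", "bj", "sz"]
device = ["p9", "p9", "m8", "p9", "m8", "p9"]

city_idx = build_inverted_index(city)
device_idx = build_inverted_index(device)

# Filter city == "sz" AND device == "p9" by intersecting bitmaps,
# without comparing the column values row by row.
sz_rows = to_bitmap(city_idx["sz"])
match = sz_rows & to_bitmap(device_idx["p9"])
matching_rows = [i for i in range(len(city)) if (match >> i) & 1]

# Count distinct devices among the "sz" rows: a value counts if its
# row bitmap intersects the filter bitmap.
distinct_devices = sum(
    1 for rows in device_idx.values() if to_bitmap(rows) & sz_rows
)

print(matching_rows, distinct_devices)
```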
3. MinMax index
- For all columns, a MinMax index is created so that the processing/query engine can skip scans that are not required.
Global Dictionary
Besides I/O reduction, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (late materialization). We have observed dramatic performance improvements in OLAP analytic scenarios where the table contains many string columns. The data is converted back to user-readable form just before the processing/query engine returns results to the user.
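A minimal sketch of dictionary encoding with late materialization follows. The function names and sample column are hypothetical, and the way codes are assigned here (sorted order) is an assumption for the example, not a description of how CarbonData assigns them.

```python
from collections import Counter


def build_dictionary(values):
    """Assign each distinct string a small integer code."""
    return {v: code for code, v in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace every string with its integer code."""
    return [dictionary[v] for v in values]


# Assumed sample string column.
country = ["china", "india", "china", "usa", "india", "china"]

dictionary = build_dictionary(country)
encoded = encode(country, dictionary)

# The group-by count runs on integer codes only.
counts_by_code = Counter(encoded)

# Late materialization: decode just before returning results.
reverse = {code: v for v, code in dictionary.items()}
result = {reverse[code]: n for code, n in counts_by_code.items()}

print(result)
```

The aggregation never compares or hashes strings; only the three result keys are decoded, however many rows the column has.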
Column Group
Sometimes users want to run processing/queries over multiple columns of one table, for example scanning an individual record in a troubleshooting scenario. In this case, a row format is more efficient than a columnar format since the workload touches all columns. To accelerate this, CarbonData supports storing a group of columns in row format, so the data in a column group is stored together, which enables fast retrieval.
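The difference between the two layouts can be sketched in a few lines. The table, column names and helper functions below are assumptions made for illustration; in-memory lists stand in for on-disk column blocks.

```python
def build_column_group(columns, group):
    """Store the named columns together, one tuple per row (row format)."""
    return list(zip(*(columns[name] for name in group)))


def fetch_columnar(columns, names, row_id):
    """Columnar layout: one lookup per column to assemble a record."""
    return tuple(columns[name][row_id] for name in names)


def fetch_grouped(group_rows, row_id):
    """Column group layout: the whole record comes back in one lookup."""
    return group_rows[row_id]


# Hypothetical troubleshooting table, stored column by column.
columns = {
    "user_id": [101, 102, 103],
    "cell_id": ["c7", "c2", "c7"],
    "status": ["ok", "drop", "ok"],
}
group = ["user_id", "cell_id", "status"]
group_rows = build_column_group(columns, group)

print(fetch_columnar(columns, group, 1))
print(fetch_grouped(group_rows, 1))
```

Both calls return the same record; the grouped layout reaches it with a single access, where the columnar layout needs one access per column.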
Optimized for multiple use cases
CarbonData indices and dictionaries are highly configurable. To optimize storage for different use cases, users can configure what to index and tune the format before loading data into CarbonData. For example:
| Use Case | Supporting Features |
| --- | --- |
| Interactive OLAP query | Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index |
| High throughput scan | Global dictionary, Minmax index |
| Low latency point query | Multi-dimensional Key (B+ Tree index), Partitioning |
| Individual record query | Column group, Global dictionary |
Big Data Processing Framework Integration
CarbonData provides InputFormat/OutputFormat interfaces for reading and writing data in CarbonData files, and also provides an abstract API for processing data stored in the CarbonData format with data processing frameworks.
CarbonData provides deep integration with Apache Spark, including predicate push down, column pruning, aggregation push down, etc., so users can use Spark SQL to connect to and query CarbonData.
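To make predicate push down and column pruning concrete, here is a conceptual sketch of a reader that receives the projection and the filter from the engine. It is not the CarbonData or Spark API: `scan`, its parameters and the sample table are hypothetical.

```python
def scan(table, projection, filter_column=None, predicate=None):
    """Return the projected rows plus the set of columns the reader touched.

    table: dict of column name -> list of values (columnar layout).
    """
    n_rows = len(next(iter(table.values())))
    touched = set(projection)  # column pruning: only requested columns
    if filter_column is None:
        row_ids = range(n_rows)
    else:
        # Predicate push down: the reader evaluates the filter itself,
        # so rows that fail it are never handed to the engine.
        touched.add(filter_column)
        row_ids = [
            i for i, v in enumerate(table[filter_column]) if predicate(v)
        ]
    rows = [tuple(table[name][i] for name in projection) for i in row_ids]
    return rows, touched


# Assumed sample table.
table = {
    "name": ["ann", "bob", "cat"],
    "age": [34, 27, 41],
    "city": ["sz", "bj", "sh"],
}

# Roughly: SELECT name FROM table WHERE age > 30
rows, touched = scan(table, ["name"], "age", lambda a: a > 30)
print(rows, sorted(touched))
```

The `city` column is never read, and the row for `bob` is dropped inside the reader rather than by the engine.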
CarbonData can integrate with various big data query/processing frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala