CarbonData Dictionary Encoding
INTRODUCTION
Encoding data reduces storage space and speeds up processing.
DESCRIPTION
Most databases and big data SQL data stores compress columns with dictionary encoding: they store small integers (surrogate values) instead of full string values. However, almost all existing databases and data stores divide the data into row groups of anywhere from a few thousand to a million rows and apply dictionary encoding only within each row group. As a result, the same column value can have different surrogate values in different row groups, so surrogate values must be converted back to actual values immediately after the data is read from disk.
CarbonData instead uses global surrogate keys: a common dictionary is maintained for the full store on one machine/node. It can therefore perform all query processing work, such as grouping/aggregation and sorting, on lightweight surrogate values, and convert surrogate values to actual values only for the final result.
This improves performance in two ways. First, conversion from surrogate values to actual values is done only for the final result rows, which are far fewer than the rows read from the store. Second, all query processing and computation, such as grouping/aggregation and sorting, is done on lightweight surrogate values, which require less memory and CPU time than actual values.
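The difference between per-row-group dictionaries and one global dictionary can be sketched in a few lines of Python. This is only an illustration: `build_dictionary`, the sample row groups, and the first-appearance key assignment are assumptions made for the example, not CarbonData's actual implementation.

```python
def build_dictionary(values):
    """Assign surrogate keys 1, 2, 3, ... in order of first appearance (assumed scheme)."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary

# Two hypothetical row groups of the same "country" column.
row_group_1 = ["China", "India", "China"]
row_group_2 = ["India", "USA", "China"]

# Per-row-group dictionaries: the same value can get different surrogates.
local_1 = build_dictionary(row_group_1)  # "China" -> 1
local_2 = build_dictionary(row_group_2)  # "China" -> 3

# One global dictionary shared by every row group.
global_dict = build_dictionary(row_group_1 + row_group_2)
encoded_1 = [global_dict[v] for v in row_group_1]
encoded_2 = [global_dict[v] for v in row_group_2]

print(local_1["China"], local_2["China"])  # surrogates differ between row groups
print(encoded_1, encoded_2)                # surrogates are comparable across row groups
```

With the local dictionaries, surrogates from different row groups cannot be compared without decoding first; with the global dictionary, equal integers always mean equal values.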
ENCODING TECHNIQUE
Original Data
Figure 1
Dictionary Generation
All the Multi Dimensional Keys (MDK)* are compressed to lightweight surrogate values, which reduces memory usage. This encoding achieves data compression by storing small integers (surrogate values) instead of full string values. All nulls are assigned the default value 0; other values are assigned surrogate values accordingly.
Figure 2
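A minimal sketch of dictionary generation for one dimension column follows. The text only states that nulls map to 0; numbering the remaining values from 1 in order of first appearance, as well as the names `generate_dictionary` and `NULL_SURROGATE`, are assumptions for illustration.

```python
NULL_SURROGATE = 0  # per the text: all nulls get the default value 0

def generate_dictionary(column):
    """Map each distinct value of a column to a small integer.

    Assumption: non-null values are numbered from 1 in order of first appearance.
    """
    dictionary = {None: NULL_SURROGATE}
    next_key = 1
    for value in column:
        if value not in dictionary:
            dictionary[value] = next_key
            next_key += 1
    return dictionary

# Hypothetical "location" dimension containing one null.
location = ["Beijing", None, "Shenzhen", "Beijing"]
location_dict = generate_dictionary(location)
print(location_dict)
```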
All query processing and computation, such as grouping/aggregation and sorting, is done on lightweight surrogate values, which require less memory and CPU time than actual values.
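As a rough illustration of aggregation on surrogate values, the sketch below groups encoded rows by an integer key and decodes only the group keys at the end. The dictionary contents, the `sales` measure, and the row data are made up for the example.

```python
from collections import defaultdict

# Hypothetical global dictionary and its reverse mapping.
dictionary = {"Beijing": 1, "Shenzhen": 2}
reverse = {key: value for value, key in dictionary.items()}

# Encoded rows: (location surrogate, sales measure).
rows = [(1, 10), (2, 5), (1, 7), (2, 3), (1, 1)]

# Grouping and summing hash and compare small integers, never strings.
totals = defaultdict(int)
for location_key, sales in rows:
    totals[location_key] += sales

# Decode once per group rather than once per input row.
result = {reverse[key]: total for key, total in totals.items()}
print(result)
```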
Dictionary Encoding
After the dictionary is generated (the surrogate values for the column values), the table data is updated with the new surrogate values.
Figure 3
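The encoding step can be sketched as replacing each dimension value with its surrogate while leaving measures untouched. The table, the column names, and the first-appearance key assignment are hypothetical; only the null-to-0 rule comes from the text.

```python
def generate_dictionary(column):
    """Nulls map to 0; other values are numbered from 1 by first appearance (assumed)."""
    dictionary = {None: 0}
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary

# Hypothetical table: two dimension columns and one measure column.
table = [
    {"location": "Beijing", "month": "Jan", "sales": 10},
    {"location": "Shenzhen", "month": "Feb", "sales": 5},
    {"location": None, "month": "Jan", "sales": 7},
]
dimensions = ["location", "month"]

# One dictionary per dimension column.
dictionaries = {
    name: generate_dictionary([row[name] for row in table]) for name in dimensions
}

# Replace dimension values with surrogates; measures are kept as they are.
encoded = [
    {
        name: (dictionaries[name][value] if name in dictionaries else value)
        for name, value in row.items()
    }
    for row in table
]
print(encoded)
```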
Sorting (on MDK: Multi Dimensional Keys)
The multi dimensional keys are then sorted, and the table data is rearranged accordingly.
Figure 4
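Sorting on the MDK can be pictured as a lexicographic sort over the tuple of surrogate keys in each row. The rows and the `MDK_WIDTH` constant below are invented for the example; note that the resulting order follows the surrogate numbering, which matches the order of the actual values only if the dictionary happens to preserve it.

```python
# Encoded rows: (location key, month key, sales). The values are made up.
encoded_rows = [
    (2, 1, 5),
    (1, 2, 7),
    (1, 1, 10),
    (0, 1, 3),
]

MDK_WIDTH = 2  # assumption: the first two columns are the dimension keys

def mdk(row):
    """Return the multi dimensional key of a row as a tuple of surrogates."""
    return row[:MDK_WIDTH]

# Tuples compare element by element, so this sorts by location key, then month key.
sorted_rows = sorted(encoded_rows, key=mdk)
for row in sorted_rows:
    print(row)
```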
Blocklet Logical View
Figure 5
Conversion from surrogate values to actual values is done only for the final result rows, which are far fewer than the rows read from the store.
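The effect of decoding late can be shown with a small top-N query: every row read is processed as integers, and only the rows that reach the final result are looked up in the dictionary. The dictionary, the rows, and the choice of a top-2 query are assumptions for illustration.

```python
import heapq

# Hypothetical global dictionary and its reverse mapping.
dictionary = {"Beijing": 1, "Shenzhen": 2, "Hangzhou": 3}
reverse = {key: value for value, key in dictionary.items()}

# Encoded rows as read from the store: (location surrogate, sales).
rows = [(1, 10), (2, 5), (3, 7), (2, 3), (1, 1), (2, 9)]

# The query (top 2 rows by sales) runs entirely on encoded rows.
top_rows = heapq.nlargest(2, rows, key=lambda row: row[1])

# Only the final result rows are decoded: 2 lookups for 6 rows read.
decoded = [(reverse[key], sales) for key, sales in top_rows]
print(decoded)
```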
*(MDK) Multi Dimensional Keys are the columns that represent the dimensions of the table (the keys used to analyse the data), e.g. Location, Month.
Original article
- Dictionary Encoding - CarbonData - Apache Software Foundation