Two Elasticsearch features that can reduce data size

Source: Internet | Editor: 程序博客网 | Time: 2024/05/21 16:39

Doc Values

Doc Values stores repeated content in compressed form.
Given a simple mapping like this:

mappings = {
    'testdata': {
        '_source': {'enabled': False},  # do not keep the original JSON
        '_all': {'enabled': False},
        'properties': {
            'name': {
                'type': 'string',
                'index': 'no',  # no inverted index for this field
                'store': False,
                'dynamic': 'strict',
                'fielddata': {'format': 'doc_values'}  # on-disk columnar values
            }
        }
    }
}

Insert 1 million rows whose values are picked at random from a small set, so they repeat heavily:

import random

words = ['hello', 'world', 'there', 'here']

def read_test_data_in_batches():
    batch = []
    for i in range(10000 * 100):
        if i % 50000 == 0:
            print(i)  # progress indicator
        if len(batch) > 10000:
            yield batch
            batch = []
        batch.append({
            '_index': 'wentao-test-doc-values',
            '_type': 'testdata',
            '_source': {'name': random.choice(words)}
        })
    print(i)
    yield batch  # flush the last, partially filled batch
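The post does not show how these batches reach Elasticsearch. The following is a minimal sketch of one way to consume the generator, not the author's actual code. The names `index_in_batches` and `send_bulk` are hypothetical. The action dicts above use the `_index` / `_type` / `_source` layout that `elasticsearch.helpers.bulk(client, actions)` from the Python client accepts, so that call is one plausible sender.

```python
def index_in_batches(batches, send_bulk):
    """Send every batch through send_bulk and return the number of actions sent.

    batches   -- iterable of lists of bulk actions, for example the
                 generator read_test_data_in_batches()
    send_bulk -- hypothetical callable that ships one list of actions, e.g.
                 lambda actions: elasticsearch.helpers.bulk(client, actions)
    """
    total = 0
    for batch in batches:
        if not batch:  # skip an empty trailing batch
            continue
        send_bulk(batch)
        total += len(batch)
    return total


# Dry run with a fake sender, so no cluster is needed.
sent = []
demo_batches = [
    [{'_source': {'name': 'hello'}}] * 3,
    [{'_source': {'name': 'world'}}] * 2,
]
print(index_in_batches(demo_batches, sent.append))  # 5
```

Passing the sender in as a parameter keeps the loop testable without a running cluster. Against a real cluster the sender would wrap the bulk helper with a connected client.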

Disk usage:

size: 28.5Mi (28.5Mi)
docs: 1,000,000 (1,000,000)

Now make each word 100 times longer and insert 1 million rows again:

import random

# Same four distinct values, each repeated 100 times to make it longer
words = ['hello' * 100, 'world' * 100, 'there' * 100, 'here' * 100]

def read_test_data_in_batches():
    batch = []
    for i in range(10000 * 100):
        if i % 50000 == 0:
            print(i)  # progress indicator
        if len(batch) > 10000:
            yield batch
            batch = []
        batch.append({
            '_index': 'wentao-test-doc-values',
            '_type': 'testdata',
            '_source': {'name': random.choice(words)}
        })
    print(i)
    yield batch  # flush the last, partially filled batch

Disk usage goes down instead of up:

size: 14.4Mi (14.4Mi)
docs: 1,000,000 (1,000,000)

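This shows that Lucene compresses these strings when it stores them column-wise under the hood: values that are 100 times longer did not make the index any bigger. In a commercial columnar database, even an optimization this small would be advertised at length as dictionary encoding and the like.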
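To make the dictionary-encoding idea concrete, here is a simplified model of it. It is an illustration under stated assumptions, not Lucene's actual on-disk format. Each distinct value is stored once, and every row keeps only a small ordinal pointing into that dictionary. The helper names and the size estimate (bit-packed ordinals plus the UTF-8 bytes of the dictionary) are assumptions made for this sketch, and the row count is reduced to 100,000 to keep it quick.

```python
import random


def dictionary_encode(values):
    """Return (sorted distinct values, one ordinal per input value)."""
    dictionary = sorted(set(values))
    ordinal_of = {value: i for i, value in enumerate(dictionary)}
    ordinals = [ordinal_of[value] for value in values]
    return dictionary, ordinals


def encoded_size_bytes(dictionary, ordinals):
    """Rough estimate: dictionary bytes plus bit-packed ordinals."""
    bits_per_ordinal = max(1, (len(dictionary) - 1).bit_length())
    dictionary_bytes = sum(len(v.encode('utf-8')) for v in dictionary)
    ordinal_bytes = (len(ordinals) * bits_per_ordinal + 7) // 8
    return dictionary_bytes + ordinal_bytes


words = ['hello' * 100, 'world' * 100, 'there' * 100, 'here' * 100]
column = [random.choice(words) for _ in range(100000)]

dictionary, ordinals = dictionary_encode(column)
raw_bytes = sum(len(v.encode('utf-8')) for v in column)
print('raw bytes:    ', raw_bytes)
print('encoded bytes:', encoded_size_bytes(dictionary, ordinals))
```

In this model the per-row cost depends on the number of distinct values, not on their length, which is consistent with the longer words not inflating the index. The model does not explain why the measured size fell from 28.5Mi to 14.4Mi rather than staying flat; the post does not investigate that.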

Nested Document

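Experiments show that packing a batch of small documents into one large document, as nested documents, reduces storage space. Change the earlier mapping to this: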

mappings = {
    'testdata': {
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            'children': {
                'type': 'nested',  # each child becomes a nested document
                'properties': {
                    'name': {
                        'type': 'string',
                        'index': 'no',
                        'store': False,
                        'dynamic': 'strict',
                        'fielddata': {'format': 'doc_values'}
                    }
                }
            }
        }
    }
}

Again insert 1 million rows, but pack roughly every thousand rows into one large document:

import random

words = ['hello', 'world', 'there', 'here']

def read_test_data_in_batches():
    batch = []
    for i in range(10000 * 100):
        if i % 50000 == 0:
            print(i)  # progress indicator
        if len(batch) > 1000:
            # One parent document carrying the accumulated children
            yield [{
                '_index': 'wentao-test-doc-values2',
                '_type': 'testdata',
                '_source': {'children': batch}
            }]
            batch = []
        batch.append({'name': random.choice(words)})
    print(i)
    # Flush the remaining children as a final parent document
    yield [{
        '_index': 'wentao-test-doc-values2',
        '_type': 'testdata',
        '_source': {'children': batch}
    }]

Disk usage:

size: 2.47Mi (2.47Mi)
docs: 1,001,000 (1,001,000)

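The document count did not shrink, but disk usage is only 2.47Mi. This presumably benefits from Lucene's internal storage optimizations for nested documents.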
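The reported count of 1,001,000 documents fits the way nested objects are indexed: each child is a separate hidden Lucene document stored next to its parent. The sketch below replays the batching loop from the generator to count parents and children. The function name `count_nested_docs` is hypothetical, and the loop copies the `len(batch) > 1000` test from the code above, so each full parent actually holds 1,001 children.

```python
def count_nested_docs(total_rows, batch_limit):
    """Replay the batching loop and count parent and child documents."""
    parents = 0
    children = 0
    batch_len = 0
    for _ in range(total_rows):
        if batch_len > batch_limit:  # same test as the generator
            parents += 1
            children += batch_len
            batch_len = 0
        batch_len += 1
    # The final, partially filled batch is yielded after the loop
    parents += 1
    children += batch_len
    return parents, children


parents, children = count_nested_docs(10000 * 100, 1000)
print(parents, children, parents + children)  # 1000 1000000 1001000
```

So the index holds 1,000,000 child documents plus 1,000 parents. Assuming "Mi" means MiB, 2.47Mi works out to about 2.6 bytes per row, against about 30 bytes per row for the flat 28.5Mi index.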

0 0