Notes on the Official ES Documentation, Part 4: Fielddata Memory Control

Source: Internet. Editor: 程序博客网. Time: 2024/05/04 07:04

From the official documentation:

Once analyzed strings have been loaded into fielddata, they will sit there until evicted (or your node crashes). For that reason it is important to keep an eye on this memory usage, understand how and when it loads, and how you can limit the impact on your cluster.

Once the data of an analyzed field has been loaded, it stays in memory until it is evicted or the node crashes.

Fielddata is loaded lazily. If you never aggregate on an analyzed string, you’ll never load fielddata into memory. Furthermore, fielddata is loaded on a per-field basis, meaning only actively used fields will incur the “fielddata tax”.

Fielddata is lazily loaded: it enters memory only when you aggregate on an analyzed field. It is loaded per field, so only the fields that are actually used incur the "fielddata tax".

However, there is a subtle surprise lurking here. Suppose your query is highly selective and only returns 100 hits. Most people assume fielddata is only loaded for those 100 documents.

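There is a subtle surprise here, though. If your query is highly selective and returns only 100 hits, many people assume that fielddata is loaded for just those 100 documents.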

In reality, fielddata will be loaded for all documents in that index (for that particular field), regardless of the query’s specificity. The logic is: if you need access to documents X, Y, and Z for this query, you will probably need access to other documents in the next query.

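In fact, fielddata for that field is loaded for every document in the index. The reasoning: if this query needs documents X, Y, and Z, the next query will probably need other documents.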

Unlike doc values, the fielddata structure is not created at index time. Instead, it is populated on-the-fly when the query is run. This is a potentially non-trivial operation and can take some time. It is cheaper to load all the values once, and keep them in memory, than load only a portion of the total fielddata repeatedly.

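Unlike doc values, the fielddata structure is not built at index time; it is built at query time. This can be a non-trivial operation that takes a while, so loading all values once and keeping them in memory is cheaper than repeatedly loading part of them.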
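The loading behaviour described above can be illustrated with a toy model. This is only a sketch, not how Elasticsearch implements fielddata: the class `LazyFielddata` and its methods are hypothetical names, and documents are plain Python dicts. It shows the two points from the text: nothing is built until a field is first used, and the first use builds the structure for every document, not just the hits.

```python
from collections import defaultdict


class LazyFielddata:
    """Toy model (hypothetical): per-field values for all documents, built on first use."""

    def __init__(self, documents):
        self._documents = documents          # list of dicts, one per document
        self._loaded = {}                    # field name -> {doc_id: value}
        self.build_count = defaultdict(int)  # how many times each field was built

    def values_for(self, field, doc_ids):
        if field not in self._loaded:
            # First access: build the column for ALL documents, not only doc_ids.
            self._loaded[field] = {
                doc_id: doc.get(field)
                for doc_id, doc in enumerate(self._documents)
            }
            self.build_count[field] += 1
        column = self._loaded[field]
        return [column[doc_id] for doc_id in doc_ids]

    def loaded_entries(self, field):
        # Number of documents currently held in memory for this field.
        return len(self._loaded.get(field, {}))


docs = [
    {"tag": "a", "body": "x"},
    {"tag": "b", "body": "y"},
    {"tag": "c", "body": "z"},
]
fd = LazyFielddata(docs)
first = fd.values_for("tag", [0])      # selective "query" with a single hit
second = fd.values_for("tag", [1, 2])  # a later query reuses the loaded column
```

After the first call, all three documents are loaded for `tag` even though only one was requested, and the `body` field is still untouched because nothing has asked for it.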

The JVM heap is a limited resource that should be used wisely. A number of mechanisms exist to limit the impact of fielddata on heap usage. These limits are important because abuse of the heap will cause node instability (thanks to slow garbage collections) or even node death (with an OutOfMemory exception).

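The JVM heap is a limited resource and should be used wisely. A number of mechanisms exist to limit how much heap fielddata uses. These limits matter because abusing the heap makes nodes unstable (due to slow garbage collection) or even kills them (with an OutOfMemory exception).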

Choosing a Heap Size

There are two rules to apply when setting the Elasticsearch heap size, with the $ES_HEAP_SIZE environment variable:

No more than 50% of available RAM
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory dedicated to the heap means less available for all your other fields using doc values.

Give the heap no more than 50% of available RAM; the larger the heap, the less memory is left for fields that use doc values.

No more than 32 GB
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
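As a rough illustration of the two rules, the upper bound on the heap can be computed as the smaller of half the RAM and the 32 GB limit. This is a sketch under assumptions: `max_heap_bytes` and `as_es_heap_size` are hypothetical helper names, and 32 GB is treated as an inclusive cap. Since the text says "less than 32 GB", you may prefer to stay a little below the value this returns on large machines.

```python
GIB = 1024 ** 3
COMPRESSED_POINTER_LIMIT = 32 * GIB  # the 32 GB rule, treated here as an inclusive cap


def max_heap_bytes(total_ram_bytes):
    """Upper bound for the heap: at most 50% of RAM and at most 32 GB."""
    half_of_ram = total_ram_bytes // 2
    return min(half_of_ram, COMPRESSED_POINTER_LIMIT)


def as_es_heap_size(heap_bytes):
    """Format a byte count as a whole-gigabyte value such as '8g' (hypothetical helper)."""
    return f"{heap_bytes // GIB}g"


small = max_heap_bytes(16 * GIB)    # limited by the 50% rule
large = max_heap_bytes(128 * GIB)   # limited by the 32 GB rule
small_setting = as_es_heap_size(small)
```

A value like `small_setting` is the kind of string you would assign to the `$ES_HEAP_SIZE` environment variable mentioned above.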

The indices.fielddata.cache.size controls how much heap space is allocated to fielddata. As you are issuing queries, aggregations on analyzed strings will load into fielddata if the field wasn’t previously loaded. If the resulting fielddata size would exceed the specified size, other values will be evicted in order to make space.

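indices.fielddata.cache.size sets how much heap is given to fielddata. When a query aggregates on an analyzed field whose fielddata has not been loaded yet, it is loaded; if the result would exceed the configured size, other values are evicted to make room.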

By default, this setting is unbounded—Elasticsearch will never evict data from fielddata.

By default this setting is unbounded, so ES never evicts data from fielddata.

This default was chosen deliberately: fielddata is not a transient cache. It is an in-memory data structure that must be accessible for fast execution, and it is expensive to build. If you have to reload data for every request, performance is going to be awful.

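This default is deliberate: fielddata is not a transient cache. It is an in-memory data structure that exists for fast execution and is expensive to build. If it had to be reloaded on every request, performance would be awful.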
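The difference between a bounded and an unbounded setting can be sketched with a small simulation. This is not Elasticsearch code: `BoundedFielddataCache` is a hypothetical class, sizes are made-up byte counts, and evicting the least recently used field first is an assumption of this sketch.

```python
from collections import OrderedDict


class BoundedFielddataCache:
    """Toy cache (hypothetical): evicts least recently used fields when over max_bytes."""

    def __init__(self, max_bytes=None):
        self.max_bytes = max_bytes     # None models the unbounded default
        self._entries = OrderedDict()  # field -> size in bytes, oldest first
        self.evictions = 0

    def used_bytes(self):
        return sum(self._entries.values())

    def load(self, field, size_bytes):
        if field in self._entries:
            # Already loaded: just mark it as recently used.
            self._entries.move_to_end(field)
            return
        if self.max_bytes is not None:
            # Evict oldest entries until the new field fits (or nothing is left).
            while self._entries and self.used_bytes() + size_bytes > self.max_bytes:
                self._entries.popitem(last=False)
                self.evictions += 1
        self._entries[field] = size_bytes

    def fields(self):
        return list(self._entries)


bounded = BoundedFielddataCache(max_bytes=100)
bounded.load("title", 60)
bounded.load("tags", 30)
bounded.load("title", 60)  # touch: "title" becomes the most recently used
bounded.load("body", 40)   # does not fit, so "tags" is evicted

unbounded = BoundedFielddataCache()
for name, size in [("title", 60), ("tags", 30), ("body", 40)]:
    unbounded.load(name, size)
```

The unbounded cache never evicts and simply keeps growing, which is why the default trades heap safety for speed. The bounded one stays within its limit but must rebuild `tags` the next time it is needed.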

Monitoring fielddata

It is important to keep a close watch on how much memory is being used by fielddata, and whether any data is being evicted. High eviction counts can indicate a serious resource issue and a reason for poor performance.

Fielddata usage can be monitored:

  • per-index using the indices-stats API:
    GET /_stats/fielddata?fields=*

  • per-node using the nodes-stats API:
    GET /_nodes/stats/indices/fielddata?fields=*

By setting ?fields=*, the memory usage is broken down for each field.

The same requests with curl:

  • Node Stats
    curl -XGET 'http://localhost:9200/_nodes/stats/indices/?fields=field1,field2&pretty'

  • Indices Stats
    curl -XGET 'http://localhost:9200/_stats/fielddata/?fields=field1,field2&pretty'

  • You can use wildcards for field names
    curl -XGET 'http://localhost:9200/_stats/fielddata/?fields=field*&pretty'
    curl -XGET 'http://localhost:9200/_nodes/stats/indices/?fields=field*&pretty'
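Once a stats response has been fetched, the per-field numbers can be totalled with a few lines of Python. This is a minimal sketch: `fielddata_by_field` is a hypothetical helper, the `sample` payload is hand-written rather than real output, and the response layout (`nodes` → node id → `indices` → `fielddata` → `fields`, with `memory_size_in_bytes` and `evictions` keys) is an assumption that should be checked against the output of your own cluster.

```python
def fielddata_by_field(nodes_stats):
    """Sum fielddata memory per field across nodes, plus the total eviction count."""
    totals = {}
    evictions = 0
    for node in nodes_stats.get("nodes", {}).values():
        fielddata = node.get("indices", {}).get("fielddata", {})
        evictions += fielddata.get("evictions", 0)
        for name, stats in fielddata.get("fields", {}).items():
            totals[name] = totals.get(name, 0) + stats.get("memory_size_in_bytes", 0)
    return totals, evictions


# Hand-written sample in the assumed response shape (not real cluster output).
sample = {
    "nodes": {
        "node-1": {"indices": {"fielddata": {
            "memory_size_in_bytes": 300,
            "evictions": 2,
            "fields": {
                "field1": {"memory_size_in_bytes": 200},
                "field2": {"memory_size_in_bytes": 100},
            },
        }}},
        "node-2": {"indices": {"fielddata": {
            "memory_size_in_bytes": 50,
            "evictions": 0,
            "fields": {"field1": {"memory_size_in_bytes": 50}},
        }}},
    }
}

per_field, total_evictions = fielddata_by_field(sample)
```

A non-zero and growing eviction total is the signal the text warns about: it suggests fielddata is being rebuilt repeatedly.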
