ES caches: fielddata (avoid) and doc values explained (continuously updated)

Source: Internet | Published by: 软件安全在线检测 | Editor: 程序博客网 | Time: 2024/05/17 07:56

Fielddata (key topic)

Triggered by: sort, aggs, script
Drawbacks
- Lives on the Java heap

One of the main causes of cluster instability, and the most frequent cause of the highest-severity issues.

Understanding fielddata

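Inverted index
The inverted index is what makes ES queries fast.
This data structure holds a sorted list of all the unique terms that appear in a field, and each term points to the list of documents that contain that term.

The inverted index answers "contains" questions, not "equals" questions: it serves queries that ask which documents contain a specific term.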
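As a rough illustration of the term -> documents structure described above, here is a minimal Python sketch. It is a toy model and not how Lucene stores its index; the two sample documents and the function name `build_inverted_index` are assumptions made for the example.

```python
from collections import defaultdict

# Toy documents (assumed for illustration): doc id -> terms of one field.
docs = {
    1: ["brown"],
    2: ["brown", "fox", "quick"],
}

def build_inverted_index(docs):
    """Map each unique term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    # Terms are kept sorted, mirroring the "sorted list of unique terms".
    return {term: sorted(ids) for term, ids in sorted(index.items())}

inverted = build_inverted_index(docs)
print(inverted)         # {'brown': [1, 2], 'fox': [2], 'quick': [2]}
print(inverted["fox"])  # "which documents contain fox?" -> [2]
```

Looking up a term is a single dictionary access, which is why the "which documents contain X" question is cheap.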

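The dilemma for sort, aggs, and script
The question they need answered is the reverse one: which terms does field xx of doc1 contain?

The data structure needed: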

Docs:   Terms:
----------------------------
1       [ brown ]
2       [ brown, fox, quick ]

This is the purpose of fielddata.

Fielddata is generated at query time by reading the inverted index, inverting the term <-> docs data structure, and storing the result (doc <-> terms) in memory.
- Loading is slow, especially with big segments. It is expensive: a little of it helps performance, a lot of it hurts.
- It consumes a lot of heap space.
It is loaded on demand, per segment.
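The "uninverting" step can be sketched as follows. This is a simplified Python model under stated assumptions (a plain dict stands in for one segment's inverted index, and `uninvert` is a hypothetical helper name), not the actual Elasticsearch implementation.

```python
from collections import defaultdict

# Toy inverted index for one segment (assumed data): term -> doc ids.
inverted = {"brown": [1, 2], "fox": [2], "quick": [2]}

def uninvert(inverted_index):
    """Walk every term's postings list and regroup the terms by document."""
    doc_terms = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            doc_terms[doc_id].append(term)
    return dict(doc_terms)

# The result is what would be held in memory as fielddata in this model.
fielddata = uninvert(inverted)
print(fielddata)  # {1: ['brown'], 2: ['brown', 'fox', 'quick']}
```

Note that the whole index has to be scanned to answer a question about a single document, which is why loading gets slow on big segments.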
Eviction happens when:
- the relevant index is deleted or closed
- a segment is removed (by a merge); the fielddata then moves rather than going away (to be confirmed)
- the node restarts
- the relevant fielddata cache is cleared
- fielddata is evicted automatically to make room for other fielddata (does not happen by default, because the cache is unbounded)
  Evicting fielddata when the cache is full leads to a different problem: one request triggers fielddata loading for one field and the next request triggers loading for another, causing the first field to be evicted. This causes memory thrashing and slow garbage collections, so queries become very slow while they wait for their fielddata to be loaded.
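The thrashing pattern described above can be reproduced with a toy LRU cache. The sketch below is an assumption-laden model: the class `ToyFielddataCache`, the sizes, and the capacity are all made up for illustration and do not reflect Elasticsearch internals.

```python
from collections import OrderedDict

class ToyFielddataCache:
    """Size-bounded LRU cache that counts loads and evictions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # field name -> size
        self.loads = 0
        self.evictions = 0

    def get(self, field, size):
        if field in self.entries:
            self.entries.move_to_end(field)  # mark as most recently used
            return "hit"
        # Evict least recently used fields until the new one fits.
        while self.entries and sum(self.entries.values()) + size > self.capacity:
            self.entries.popitem(last=False)
            self.evictions += 1
        self.entries[field] = size
        self.loads += 1  # each load stands for an expensive uninverting pass
        return "load"

requests = ["catId", "jWareId"] * 5  # two fields requested alternately

small = ToyFielddataCache(capacity=100)  # fits only one 80-unit field
results = [small.get(f, size=80) for f in requests]
print(small.loads, small.evictions)  # 10 9

large = ToyFielddataCache(capacity=200)  # fits both fields
for f in requests:
    large.get(f, size=80)
print(large.loads, large.evictions)  # 2 0
```

With the small cache every single request pays the full loading cost, while the large cache loads each field once and serves the rest from memory.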

Fielddata does not go away on its own. In Elasticsearch 1.3 and later, fielddata is allowed to consume up to 60% of the Java heap per node; prior to Elasticsearch 1.3 it was unlimited.

We control this via the Fielddata Circuit Breaker, which checks incoming requests for potential fielddata usage and then blocks them if they require more memory than is currently available.

The purpose of any circuit breaker is to reject bad requests so that they never get the chance to cause a problem (e.g., allocating even more fielddata), but it's important to note that it will not clear any existing fielddata.
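To make that behaviour concrete, here is a minimal Python sketch of a breaker that rejects a request when the estimated usage would exceed the limit, and leaves already loaded fielddata untouched. The class names, the byte figures, and the estimation being passed in as a plain number are assumptions for illustration, not the real Elasticsearch code.

```python
class CircuitBreakingError(Exception):
    """Raised when a request would push fielddata past the limit."""

class ToyFielddataBreaker:
    def __init__(self, heap_bytes, limit_percent=60):
        # 60% of heap is the default mentioned in the text.
        self.limit = heap_bytes * limit_percent // 100
        self.used = 0

    def reserve(self, estimated_bytes):
        """Check the estimate before loading; reject instead of loading."""
        if self.used + estimated_bytes > self.limit:
            raise CircuitBreakingError(
                f"would use {self.used + estimated_bytes} of {self.limit} bytes"
            )
        self.used += estimated_bytes

breaker = ToyFielddataBreaker(heap_bytes=1000)  # limit is 600
breaker.reserve(500)      # accepted
try:
    breaker.reserve(200)  # 700 > 600, rejected
    rejected = False
except CircuitBreakingError:
    rejected = True

print(rejected, breaker.used)  # True 500
```

The rejected request changes nothing: `used` stays at 500, which mirrors the point that the breaker prevents new allocations but does not free existing fielddata.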

Monitoring

List each node with its fielddata usage:

# curl 172.28.141.11:9240/_cat/fielddata?v
node                   total   catId note quantity jWareId
node-87.14                0b      0b   0b       0b      0b
dm_172.28.141.11:9240 65.5mb   1.2mb   0b    1.2mb  13.8mb
node-gw-87.11             0b      0b   0b       0b      0b
node-87.12                0b      0b   0b       0b      0b
dm_172.28.141.15:9240 80.6mb 993.7kb   0b 1011.8kb    17mb
dm_172.28.141.13:9240 68.8mb   1.3mb   0b    1.3mb  14.7mb
d_172.20.71.30:9240   75.6mb 859.9kb   0b  923.5kb    16mb
gw_172.28.159.29:9203     0b      0b   0b       0b      0b
node-87.13                0b      0b   0b       0b      0b
gw_172.28.159.12:9203     0b      0b   0b       0b      0b
node-gw-87.12             0b      0b   0b       0b      0b

The id, host, and ip columns and the fielddata of other fields are omitted here. Append ?fields=catId,note to query specific fields only.
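For quick triage, output like the above can be parsed and ranked by total fielddata. The following Python sketch assumes a whitespace-separated table whose first column is the node name, and assumes the size suffixes are 1024-based; the helper names `parse_size` and `parse_cat_fielddata` are made up for this example.

```python
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}

def parse_size(text):
    """Convert strings such as '65.5mb' or '0b' to a number of bytes."""
    for suffix in ("kb", "mb", "gb", "tb", "b"):  # check 'b' last
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognized size: {text!r}")

def parse_cat_fielddata(text):
    """Parse a _cat/fielddata?v style table into a list of dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        row = {"node": cells[0]}
        for name, value in zip(header[1:], cells[1:]):
            row[name] = parse_size(value)
        rows.append(row)
    return rows

# A few rows copied from the output above.
sample = """
node                   total   catId note quantity jWareId
node-87.14                0b      0b   0b       0b      0b
dm_172.28.141.11:9240 65.5mb   1.2mb   0b    1.2mb  13.8mb
dm_172.28.141.15:9240 80.6mb 993.7kb   0b 1011.8kb    17mb
"""

rows = parse_cat_fielddata(sample)
heaviest = max(rows, key=lambda r: r["total"])
print(heaviest["node"])  # dm_172.28.141.15:9240
```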

References
Support in the Wild: My Biggest Elasticsearch Problem at Scale (Chris Earle)

Field Data: The Most Common Cause of Elasticsearch Cluster Instability at Scale

fielddata -> doc values

Reference guide: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/docvalues.html
Without repeating too much from the guide, doc values offload this burden by writing the fielddata to disk at index time, thereby allowing Elasticsearch to load the values outside of your Java heap as they are needed.

The values are read through the file system cache (Linux), which gives in-memory performance without the cost of garbage collections.

How to configure doc values
Enabled by default in v2.x; to configure manually:

"doc_values" : true   // never set this to true on an analyzed string field

Drawbacks of doc values
Cannot be used with analyzed strings.
For regular, unstructured search, you will not use any fielddata.

With that in mind, the only time that you should catch yourself using fielddata for analyzed strings is with the significant terms aggregation. All other uses of fielddata should be avoided by using a not_analyzed version of the string.

You can take advantage of both analyzed and not_analyzed strings by using multi-fields.
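A multi-field mapping along these lines might look like the sketch below, written as a Python dict that serializes to the JSON mapping body. It assumes the 2.x-era `string` type with `index: analyzed / not_analyzed`; the field name `title` and the sub-field name `raw` are hypothetical.

```python
import json

# Hypothetical 2.x-style mapping: "title" is analyzed for full-text search,
# "title.raw" is not_analyzed and backed by doc values for sort/aggs.
mapping = {
    "properties": {
        "title": {
            "type": "string",
            "index": "analyzed",
            "fields": {
                "raw": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,  # safe: the sub-field is not analyzed
                }
            },
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)
```

Searches would then target `title`, while sorting and aggregations would target `title.raw`, so no fielddata needs to be loaded onto the heap for them.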

0 0