Analyzing and Resolving an OOM Caused by TTL in Elasticsearch

Source: Internet. Editor: 程序博客网. Time: 2024/06/01 08:01

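Introduction:

When TTL is enabled on an Elasticsearch index that holds a very large number of documents, a large batch of documents expiring at the same time can drive cluster nodes into OOM. This post records one such incident, the analysis, and the steps taken to recover.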


1. Symptoms


Today our ES cluster ran out of memory. The ES version is 2.1.2, and the log looked like this:

[2017-06-21 11:10:12,250][WARN ][monitor.jvm              ] [dm_172.23.41.93:20002] [gc][young][535][56] duration [1.4s], collections [1]/[2.3s], total [1.4s]/[10.6s], memory [9.5gb]->[8.5gb]/[15.8gb], all_pools {[young] [1.3gb]->[26.5mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [8gb]->[8.3gb]/[14.1gb]}
[2017-06-21 11:14:06,057][WARN ][monitor.jvm              ] [dm_172.23.41.93:20002] [gc][young][766][83] duration [2.1s], collections [1]/[3.1s], total [2.1s]/[15.7s], memory [12.6gb]->[11.7gb]/[15.8gb], all_pools {[young] [1.3gb]->[30.1mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [11.1gb]->[11.5gb]/[14.1gb]}
[2017-06-21 11:18:20,668][WARN ][monitor.jvm              ] [dm_172.23.41.93:20002] [gc][old][985][17] duration [35.9s], collections [1]/[36.2s], total [35.9s]/[38.1s], memory [15.6gb]->[14.2gb]/[15.8gb], all_pools {[young] [1.4gb]->[96.6mb]/[1.4gb]}{[survivor] [148mb]->[0b]/[191.3mb]}{[old] [14gb]->[14.1gb]/[14.1gb]}

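Memory grew very quickly: the 16 GB heap filled up within minutes of startup and GC could not reclaim it. Restarting did not help, as the node hit OOM again almost immediately. After several restart-and-OOM cycles it was clear that the cluster could not recover on its own.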
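To put a number on "very quickly", the old-generation readings in the first and last log entries can be turned into a rough growth rate. The sketch below is only illustrative arithmetic on the values printed in the log (8.3 GB at 11:10:12 and 14.1 GB at 11:18:20, milliseconds dropped); the class and method names are made up for this example.

```java
import java.time.Duration;
import java.time.LocalTime;

public class OldGenGrowth {
    // Seconds elapsed between two wall-clock times on the same day.
    static long elapsedSeconds(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).getSeconds();
    }

    // Average growth in GB per minute between two old-generation readings.
    static double gbPerMinute(double startGb, double endGb, long seconds) {
        return (endGb - startGb) / (seconds / 60.0);
    }

    public static void main(String[] args) {
        // Old-generation sizes after GC in the first and last log entries.
        long seconds = elapsedSeconds("11:10:12", "11:18:20");
        double rate = gbPerMinute(8.3, 14.1, seconds);
        System.out.printf("%d s elapsed, old gen grew about %.2f GB/min%n", seconds, rate);
    }
}
```

At roughly 0.7 GB per minute, a 14.1 GB old generation starting near empty would fill in about 20 minutes, which is consistent with the node dying shortly after every restart.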


Looking at the indices, one of them holds 1,283,152,933 documents, with primary shards totalling 996.1 GB. Its settings and mapping are as follows:

{
  "cluster_name": "es-302",
  "metadata": {
    "indices": {
      "msg_center_log": {
        "settings": {
          "index": {
            "number_of_shards": "5",
            "creation_date": "1490263243013",
            "analysis": {
              "char_filter": {
                "extend_to_space": {
                  "mappings": [
                    "\"extend1\":\"=>\\b",
                    "\",\"venderId\":\"=>\\b",
                    "\",\"extend2\":\"=>\\b",
                    "\",\"type\"=>\\b"
                  ],
                  "type": "mapping"
                }
              },
              "analyzer": {
                "my_analyzer": {
                  "filter": "lowercase",
                  "char_filter": [
                    "extend_to_space",
                    "html_strip"
                  ],
                  "type": "custom",
                  "tokenizer": "standard"
                }
              }
            },
            "number_of_replicas": "1",
            "uuid": "ShmjnFHVS0W25E0EphFznA",
            "version": {
              "created": "2010399"
            }
          }
        },
        "mappings": {
          "message": {
            "_routing": { "required": true },
            "_ttl": { "default": 172800000, "enabled": true },
            "_timestamp": { "enabled": true },
            "_all": { "enabled": true },
            "properties": {
              "msgType": { "index": "not_analyzed", "store": true, "type": "string" },
              "source": { "index": "not_analyzed", "store": true, "type": "string" },
              "deviceId": { "index": "not_analyzed", "store": true, "type": "string" },
              "msgStatus": { "index": "not_analyzed", "store": true, "type": "string" },
              "content": { "analyzer": "my_analyzer", "store": true, "type": "string" },
              "pushType": { "index": "not_analyzed", "store": true, "type": "string" },
              "pin": { "index": "not_analyzed", "store": true, "type": "string" },
              "extend2": { "index": "not_analyzed", "store": true, "type": "string" },
              "msgNode": { "index": "not_analyzed", "store": true, "type": "string" },
              "msgTime": { "index": "not_analyzed", "store": true, "type": "string" },
              "extend1": { "index": "not_analyzed", "store": true, "type": "string" },
              "serviceNoId": { "index": "not_analyzed", "store": true, "type": "string" },
              "jmMsgType": { "index": "not_analyzed", "store": true, "type": "string" },
              "msgUniqueId": { "index": "not_analyzed", "store": true, "type": "string" },
              "timestamp": { "index": "not_analyzed", "store": true, "type": "string" }
            }
          }
        },
        "aliases": [],
        "state": "open"
      }
    },
    "cluster_uuid": "_na_",
    "templates": {}
  }
}

As the mapping shows, this index has TTL enabled, with a default TTL of 172800000 ms (2 days).


2. Memory analysis

Take a heap dump with jmap, where 9326 is the pid of the ES process:
jmap -dump:live,format=b,file=/export/dump/dump.hd 9326

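Install MAT (Eclipse Memory Analyzer) on the Linux host, change into the MAT directory, and analyze the dump file with the following commands: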

./ParseHeapDump.sh /export/dump/dump.hd
./ParseHeapDump.sh /export/dump/dump.hd org.eclipse.mat.api:suspects
./ParseHeapDump.sh /export/dump/dump.hd org.eclipse.mat.api:overview
./ParseHeapDump.sh /export/dump/dump.hd org.eclipse.mat.api:top_components

This produces three reports:

dump_Top_Components.zip
dump_System_Overview.zip
dump_Leak_Suspects.zip


Download the three files, unzip them, and open the reports in a browser. They point to org.elasticsearch.indices.ttl.IndicesTTLService$PurgerThread as the source of the problem.


3. Root cause analysis

In the ES source, the constructor of org.elasticsearch.indices.ttl.IndicesTTLService looks like this:

    @Inject
    public IndicesTTLService(Settings settings, ClusterService clusterService, IndicesService indicesService, NodeSettingsService nodeSettingsService, TransportBulkAction bulkAction) {
        super(settings);
        this.clusterService = clusterService;
        this.indicesService = indicesService;
        // Purge interval: runs once every 60 s by default
        TimeValue interval = this.settings.getAsTime("indices.ttl.interval", TimeValue.timeValueSeconds(60));
        this.bulkAction = bulkAction;
        // Number of expired documents sent per bulk delete request
        this.bulkSize = this.settings.getAsInt("indices.ttl.bulk_size", 10000);
        // Periodic task that purges TTL-expired documents (default period 60 s)
        this.purgerThread = new PurgerThread(EsExecutors.threadName(settings, "[ttl_expire]"), interval);
        nodeSettingsService.addListener(new ApplySettings());
    }

The constructor sets up PurgerThread, a periodic task that purges expired documents. The task is defined as follows:

        @Override
        public void run() {
            try {
                while (running.get()) {
                    try {
                        // Get the primary shards that need TTL processing
                        List<IndexShard> shardsToPurge = getShardsToPurge();
                        // Find the documents to delete and send bulk delete requests
                        purgeShards(shardsToPurge);
                        // ...

The purgeShards() method is as follows:

    private void purgeShards(List<IndexShard> shardsToPurge) {
        for (IndexShard shardToPurge : shardsToPurge) {
            Query query = shardToPurge.indexService().mapperService().smartNameFieldType(TTLFieldMapper.NAME).rangeQuery(null, System.currentTimeMillis(), false, true);
            Engine.Searcher searcher = shardToPurge.acquireSearcher("indices_ttl");
            try {
                logger.debug("[{}][{}] purging shard", shardToPurge.routingEntry().index(), shardToPurge.routingEntry().id());
                ExpiredDocsCollector expiredDocsCollector = new ExpiredDocsCollector();
                // Run the Lucene search that collects the documents to delete
                searcher.searcher().search(query, expiredDocsCollector);
                // All documents that need to be deleted
                List<DocToPurge> docsToPurge = expiredDocsCollector.getDocsToPurge();
                BulkRequest bulkRequest = new BulkRequest();
                for (DocToPurge docToPurge : docsToPurge) {
                    bulkRequest.add(new DeleteRequest().index(shardToPurge.routingEntry().index()).type(docToPurge.type).id(docToPurge.id).version(docToPurge.version).routing(docToPurge.routing));
                    // Send the bulk delete once it holds indices.ttl.bulk_size documents
                    bulkRequest = processBulkIfNeeded(bulkRequest, false);
                }
                processBulkIfNeeded(bulkRequest, true);
                // ...

Comparing this with the heap dump reports above, the 13.7 GB of memory is held by this list:

List<DocToPurge> docsToPurge = expiredDocsCollector.getDocsToPurge(); // documents to delete


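There were 63,191,376 TTL-expired documents waiting to be deleted at the same time. Because the collector gathers all of them into one list before any bulk request is sent, the org.elasticsearch.indices.ttl.IndicesTTLService$DocToPurge objects filled the heap.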

Class Name                                                         Objects        Shallow Heap     Retained Heap
org.elasticsearch.indices.ttl.IndicesTTLService$PurgerThread       1              136              >= 14,761,919,848
org.elasticsearch.indices.ttl.IndicesTTLService$PurgerThread       1              24               >= 14,761,340,104
org.elasticsearch.indices.ttl.IndicesTTLService$DocToPurge         63,191,376     2,022,124,032    >= 14,480,975,760
org.elasticsearch.indices.ttl.IndicesTTLService$DocToPurge         189,639,704    4,551,352,896    >= 12,465,602,816
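The heap dump figures can be checked with some quick arithmetic: how many bytes each pending delete costs, and how many bulk requests the default indices.ttl.bulk_size of 10000 would need to work through the backlog. The sketch below uses only numbers from the report and the source above; the class and method names are invented for illustration, and the retained heap is a lower bound as reported by MAT.

```java
public class PurgeFootprint {
    // Average bytes per object (integer division, rounds down).
    static long bytesPerObject(long totalBytes, long objects) {
        return totalBytes / objects;
    }

    // Number of bulk requests needed to delete `docs` documents, `bulkSize` per request.
    static long bulkRequests(long docs, long bulkSize) {
        return (docs + bulkSize - 1) / bulkSize;
    }

    public static void main(String[] args) {
        long docs = 63_191_376L;          // DocToPurge instances in the heap dump
        long shallow = 2_022_124_032L;    // shallow heap of those instances, in bytes
        long retained = 14_480_975_760L;  // retained heap (lower bound), in bytes

        System.out.println("shallow bytes per DocToPurge:  " + bytesPerObject(shallow, docs));
        System.out.println("retained bytes per DocToPurge: " + bytesPerObject(retained, docs));
        System.out.println("bulk requests at 10000 docs:   " + bulkRequests(docs, 10_000L));
    }
}
```

Each DocToPurge itself is only 32 bytes, but with the strings it references (type, id, routing) it retains about 229 bytes, and all 63 million of them have to exist at once before the first of roughly 6,320 bulk requests is sent.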


4. Solution


Dynamically update the index setting
index.ttl.disable_purge
[experimental] Disables the purge of expired docs on the current index.
to turn off TTL purging, so that ES no longer processes TTL-expired documents on that index.


PUT index1/_settings
{
    "ttl.disable_purge": true
}

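Only with this setting in place could ES start up and stay up. After that, the expired documents have to be deleted proactively by some other means.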

For the setting, see: https://www.elastic.co/guide/en/elasticsearch/reference/2.1/index-modules.html
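For the manual cleanup, the main thing to avoid is repeating what purgeShards() does, namely collecting every expired document before deleting anything. A minimal sketch of the alternative is shown below. It is not Elasticsearch code: the iterator stands in for whatever paged search over expired documents is used, the consumer stands in for a bulk delete call, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchedPurgeSketch {
    // Feeds ids to `deleteBatch` in chunks of at most `batchSize`, so that no more
    // than `batchSize` ids are buffered at any time. Returns the number of batches sent.
    static int purgeInBatches(Iterator<String> expiredIds, int batchSize,
                              Consumer<List<String>> deleteBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (expiredIds.hasNext()) {
            batch.add(expiredIds.next());
            if (batch.size() == batchSize) {
                // Hand over a copy so the buffer can be reused immediately.
                deleteBatch.accept(new ArrayList<>(batch));
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {
            // Flush the final, partially filled batch.
            deleteBatch.accept(new ArrayList<>(batch));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in data: in real use the ids would come from a paged search.
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            ids.add("doc-" + i);
        }
        int sent = purgeInBatches(ids.iterator(), 10,
                b -> System.out.println("deleting " + b.size() + " documents"));
        System.out.println(sent + " batches sent");
    }
}
```

With this shape, memory use is bounded by the batch size rather than by the number of expired documents, provided the source of ids is itself read page by page.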



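Summary: TTL was deprecated in Elasticsearch 2.0 and removed in 5.0, so it is best not to use it. If documents need to expire, create one index per day and simply delete whole indices once they are past retention. This is quick and cheap, and it is also the approach the official documentation recommends.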
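As a rough illustration of the index-per-day scheme, the sketch below derives a daily index name and lists which daily indices have aged out for a given retention period. The `prefix-yyyy.MM.dd` naming pattern, the class, and the method names are assumptions made for this example, and the actual deletion call is left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class DailyIndexSketch {
    private static final DateTimeFormatter SUFFIX = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    // Name of the index that receives documents written on `day`.
    static String indexFor(String prefix, LocalDate day) {
        return prefix + "-" + day.format(SUFFIX);
    }

    // Daily indices from `firstDay` onward whose day is more than
    // `retentionDays` days before `today`; these can be dropped whole.
    static List<String> expiredIndices(String prefix, LocalDate firstDay,
                                       LocalDate today, int retentionDays) {
        List<String> expired = new ArrayList<>();
        LocalDate cutoff = today.minusDays(retentionDays);
        for (LocalDate d = firstDay; d.isBefore(cutoff); d = d.plusDays(1)) {
            expired.add(indexFor(prefix, d));
        }
        return expired;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2017, 6, 21);
        System.out.println("write to: " + indexFor("msg_center_log", today));
        // A 2-day retention mirrors the 172800000 ms default TTL in the mapping.
        System.out.println("drop:     "
                + expiredIndices("msg_center_log", LocalDate.of(2017, 6, 16), today, 2));
    }
}
```

Deleting an index removes its files outright, so the cost does not depend on how many documents it holds, unlike TTL purging, which issues a delete for every expired document.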


