Elasticsearch Notes: Aggregations


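This post covers aggregations in Elasticsearch (ES). Aggregations run statistical analyses over your data. They play a role similar to GROUP BY in SQL, but are more flexible and more powerful.

Before we start, let's load some data. The article type in the posts index currently contains the following documents: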

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "5",
      "_score" : 1.0,
      "_source" : {
        "id" : 5,
        "name" : "生活日志",
        "author" : "wthfeng",
        "date" : "2015-09-21",
        "contents" : "这是日常生活的记录",
        "readNum" : 100
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "8",
      "_score" : 1.0,
      "_source" : {
        "name" : "ES笔记2",
        "author" : "hefeng",
        "contents" : "ES 的 search ",
        "date" : "2016-10-23",
        "readNum" : 40
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "id" : 2,
        "name" : "更新后的文档",
        "author" : "wthfeng",
        "date" : "2016-10-23",
        "contents" : "这是我的javascript学习笔记",
        "brief" : "简介，这是新加的字段",
        "readNum" : 200
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "4",
      "_score" : 1.0,
      "_source" : {
        "id" : 4,
        "name" : "javascript指南",
        "author" : "wthfeng",
        "date" : "2016-09-21",
        "contents" : "js的权威指南",
        "readNum" : 200
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "6",
      "_score" : 1.0,
      "_source" : {
        "id" : "6",
        "name" : "java笔记1",
        "author" : "hefeng",
        "contents" : "java String info",
        "date" : "2016-10-21",
        "readNum" : 12
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "id" : 1,
        "name" : "ES更新过的文档",
        "author" : "wthfeng",
        "date" : "2016-10-25",
        "contents" : "这是更新内容",
        "readNum" : 200
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "7",
      "_score" : 1.0,
      "_source" : {
        "id" : "7",
        "name" : "ES笔记1",
        "author" : "hefeng",
        "contents" : "ES search",
        "date" : "2016-09-21",
        "readNum" : 100
      }
    } ]
  }
}

We have 7 documents. Every operation below runs against this data.

Aggregation structure

Aggregations are a part of the search request on the same level as query and sort, and are specified under the aggs key, like this:

{
    "query": {},
    "aggs": {}
}

Let's start with an example:

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "readNum_stats": {
            "stats": {
                "field": "readNum"
            }
        }
    }
}

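search_type=count tells ES to skip the hits themselves and return only the total hit count along with the aggregation results. (This parameter was deprecated in ES 2.0 and removed in 5.0; on newer versions, set "size": 0 in the request body instead.) In the request body, stats asks for summary statistics on a field: count, min, max, avg, and sum. readNum_stats is a name of our own choosing, and the results come back under that name. The response: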

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "readNum_stats" : {
      "count" : 7,
      "min" : 12.0,
      "max" : 200.0,
      "avg" : 121.71428571428571,
      "sum" : 852.0
    }
  }
}

The aggregation results come back under aggregations: the min, max, average, sum, and count of the readNum field have all been computed.
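As a sanity check, the same figures can be reproduced by hand from the seven readNum values in the sample data. The snippet below is a minimal Python sketch of what stats reports, not of how Elasticsearch computes it internally. The list read_nums is an illustrative name; its values are copied from the documents above.

```python
# readNum values copied from the sample documents (ids 5, 8, 2, 4, 6, 1, 7)
read_nums = [100, 40, 200, 200, 12, 200, 100]

# The five values that the stats aggregation reports for a numeric field
stats = {
    "count": len(read_nums),
    "min": float(min(read_nums)),
    "max": float(max(read_nums)),
    "avg": sum(read_nums) / len(read_nums),
    "sum": float(sum(read_nums)),
}

print(stats)
```

The printed values should agree with the readNum_stats object in the response.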

Aggregation types

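There are two main kinds of aggregation: metric aggregations and bucket aggregations. The example above is a metric aggregation, which computes statistics over a field (min, max, average, and so on). A bucket aggregation instead groups documents by some criterion, much like GROUP BY in SQL. We cover each in turn.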

Metric aggregations

Metric aggregations work like SUM, AVG, MIN, and MAX in SQL: each produces one or more statistics. Usage is as follows:

1. min, max, avg, and sum aggregations

Each returns the corresponding statistic for the given field. Note that the field must be numeric.

① Find the lowest read count among the documents

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "minReadNum": {
            "min": {
                "field": "readNum"
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "minReadNum" : {
      "value" : 12.0
    }
  }
}

② Find the total read count

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "sum_ReadNum": {
            "sum": {
                "field": "readNum"
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "sum_ReadNum" : {
      "value" : 852.0
    }
  }
}

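These are all straightforward, so we won't list every one. There is also a metric aggregation that returns all of these values at once: the stats aggregation demonstrated in the previous section.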

2. stats and extended_stats aggregations

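The stats aggregation returns the count, min, max, average, and sum of the given field. extended_stats extends stats, adding the sum of squares, variance, standard deviation, and related values.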

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "stats_of_readNum": {
            "extended_stats": {
                "field": "readNum"
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "stats_of_readNum" : {
      "count" : 7,
      "min" : 12.0,
      "max" : 200.0,
      "avg" : 121.71428571428571,
      "sum" : 852.0,
      "sum_of_squares" : 141744.0, // sum of squares
      "variance" : 5434.775510204081, // variance
      "std_deviation" : 73.72092993312063, // standard deviation
      "std_deviation_bounds" : {
        "upper" : 269.156145580527,
        "lower" : -25.72757415195555
      }
    }
  }
}
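To see where the extra extended_stats fields come from, here is a rough Python sketch that recomputes them from the sample readNum values. It treats the variance as a population variance (dividing by the count) and the bounds as the average plus or minus two standard deviations. Both are inferred from the numbers in the response above rather than taken from the Elasticsearch source, and all variable names are illustrative.

```python
import math

# readNum values copied from the sample documents
read_nums = [100, 40, 200, 200, 12, 200, 100]

n = len(read_nums)
avg = sum(read_nums) / n
sum_of_squares = float(sum(x * x for x in read_nums))

# Population variance: mean of the squares minus the square of the mean
variance = sum_of_squares / n - avg ** 2
std_deviation = math.sqrt(variance)

# Assumption: bounds are avg +/- 2 standard deviations, which matches the response
upper = avg + 2 * std_deviation
lower = avg - 2 * std_deviation

print(sum_of_squares, variance, std_deviation, upper, lower)
```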

Bucket aggregations

1. terms aggregation

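The terms aggregation is the closest analogue to GROUP BY in SQL. Consider the following example.

Group the documents by author and count how many each author has written: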

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "author_aggs": {
            "terms": {
                "field": "author"
            }
        }
    }
}

Response:

{
  "took" : 125,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "author_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "wthfeng",
        "doc_count" : 4
      }, {
        "key" : "hefeng",
        "doc_count" : 3
      } ]
    }
  }
}

The response shows that the author wthfeng has 4 documents and hefeng has 3. The SQL equivalent would be:

select author,count(*) from article group by author
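For readers more at home in code than SQL, the grouping can also be sketched in Python with collections.Counter. This only mirrors the shape of the buckets and says nothing about how ES builds them across shards. The authors list is copied from the sample documents.

```python
from collections import Counter

# author of each sample document (ids 5, 8, 2, 4, 6, 1, 7)
authors = ["wthfeng", "hefeng", "wthfeng", "wthfeng", "hefeng", "wthfeng", "hefeng"]

counts = Counter(authors)

# most_common() sorts by count descending, like the default terms ordering
buckets = [{"key": key, "doc_count": count} for key, count in counts.most_common()]

print(buckets)
```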

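By default, buckets are sorted by document count (doc_count) in descending order. We can also sort them in ascending order, or sort by key. To sort by doc_count use _count; to sort by key use _term. For example, the following query sorts by key in ascending order: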

{
    "aggs": {
        "author_aggs": {
            "terms": {
                "field": "author",
                "order": {
                    "_term": "asc"
                }
            }
        }
    }
}
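The effect of the two sort keys can be illustrated in plain Python. This is only a sketch of the ordering semantics under the assumption that keys compare as ordinary strings. It is not output captured from ES, and the variable names are made up for the example.

```python
from collections import Counter

# author of each sample document
authors = ["wthfeng", "hefeng", "wthfeng", "wthfeng", "hefeng", "wthfeng", "hefeng"]
counts = Counter(authors)

# Analogue of "order": {"_term": "asc"}: sort buckets by key, ascending
by_term_asc = sorted(counts.items(), key=lambda bucket: bucket[0])

# Analogue of "order": {"_count": "asc"}: sort buckets by doc_count, ascending
by_count_asc = sorted(counts.items(), key=lambda bucket: bucket[1])

print(by_term_asc)
print(by_count_asc)
```

With this data both orderings put hefeng first, since it is both the alphabetically smaller key and the author with fewer documents.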

2. range aggregation

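The range aggregation groups numeric values into user-defined ranges. The lower bound is given by from (inclusive) and the upper bound by to (exclusive). Each range can be given a memorable custom name via key. For example, to group documents by read count: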

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "read_docs": {
            "range": {
                "field": "readNum",
                "ranges": [
                    {"to": 50, "key": "less 50"},
                    {"from": 50, "to": 100, "key": "50 - 100"},
                    {"from": 100, "to": 150, "key": "100 - 150"},
                    {"from": 150, "key": "more than 150"}
                ]
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "read_docs" : {
      "buckets" : [ {
        "key" : "less 50",
        "to" : 50.0,
        "to_as_string" : "50.0",
        "doc_count" : 2
      }, {
        "key" : "50 - 100",
        "from" : 50.0,
        "from_as_string" : "50.0",
        "to" : 100.0,
        "to_as_string" : "100.0",
        "doc_count" : 0
      }, {
        "key" : "100 - 150",
        "from" : 100.0,
        "from_as_string" : "100.0",
        "to" : 150.0,
        "to_as_string" : "150.0",
        "doc_count" : 2
      }, {
        "key" : "more than 150",
        "from" : 150.0,
        "from_as_string" : "150.0",
        "doc_count" : 3
      } ]
    }
  }
}
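The boundary rule (from inclusive, to exclusive) explains why the two documents with readNum 100 land in "100 - 150" and leave "50 - 100" empty. A small Python sketch of the same bucketing, with an illustrative helper in_range and data copied from the sample documents:

```python
# readNum values copied from the sample documents
read_nums = [100, 40, 200, 200, 12, 200, 100]

# Same ranges as in the request body
ranges = [
    {"to": 50, "key": "less 50"},
    {"from": 50, "to": 100, "key": "50 - 100"},
    {"from": 100, "to": 150, "key": "100 - 150"},
    {"from": 150, "key": "more than 150"},
]

def in_range(value, r):
    """True if value falls in r: 'from' is inclusive, 'to' is exclusive."""
    low = r.get("from", float("-inf"))
    high = r.get("to", float("inf"))
    return low <= value < high

buckets = {r["key"]: sum(in_range(v, r) for v in read_nums) for r in ranges}

print(buckets)
```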

3. date_range aggregation

The date_range aggregation works the same way as range, but is specialized for dates. You can also use format to specify the date format.

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "date_docs": {
            "date_range": {
                "field": "date",
                "format": "yyyy-MM",
                "ranges": [
                    {"key": "before 2016", "to": "2016-01"},
                    {"key": "first half of 2016", "from": "2016-01", "to": "2016-06"},
                    {"key": "second half of 2016", "from": "2016-06", "to": "2016-12"}
                ]
            }
        }
    }
}

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "date_docs" : {
      "buckets" : [ {
        "key" : "before 2016",
        "to" : 1.4516064E12,
        "to_as_string" : "2016-01",
        "doc_count" : 1
      }, {
        "key" : "first half of 2016",
        "from" : 1.4516064E12,
        "from_as_string" : "2016-01",
        "to" : 1.4647392E12,
        "to_as_string" : "2016-06",
        "doc_count" : 0
      }, {
        "key" : "second half of 2016",
        "from" : 1.4647392E12,
        "from_as_string" : "2016-06",
        "to" : 1.4805504E12,
        "to_as_string" : "2016-12",
        "doc_count" : 6
      } ]
    }
  }
}
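The same from-inclusive, to-exclusive rule applies to dates. The Python sketch below reproduces the bucket counts under two simplifying assumptions: a yyyy-MM bound is read as the first day of that month, and time zones are ignored. The helpers month_start and contains are illustrative names, not part of any ES client.

```python
from datetime import date

# date of each sample document (ids 5, 8, 2, 4, 6, 1, 7)
doc_dates = [
    "2015-09-21", "2016-10-23", "2016-10-23", "2016-09-21",
    "2016-10-21", "2016-10-25", "2016-09-21",
]

# Same ranges as in the request body
ranges = [
    {"key": "before 2016", "to": "2016-01"},
    {"key": "first half of 2016", "from": "2016-01", "to": "2016-06"},
    {"key": "second half of 2016", "from": "2016-06", "to": "2016-12"},
]

def month_start(text):
    """Assumption: a yyyy-MM bound means the first day of that month."""
    year, month = text.split("-")
    return date(int(year), int(month), 1)

def contains(r, day):
    """'from' is inclusive, 'to' is exclusive, as with the range aggregation."""
    low = month_start(r["from"]) if "from" in r else date.min
    high = month_start(r["to"]) if "to" in r else date.max
    return low <= day < high

buckets = {
    r["key"]: sum(contains(r, date.fromisoformat(d)) for d in doc_dates)
    for r in ranges
}

print(buckets)
```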

0 0