7-Druid-Querying


Original:

http://druid.io/docs/0.10.1/querying/querying.html

See also:

Transforming Dimension Values

The following JSON fields can be used in a query to operate on dimension values.

http://druid.io/docs/0.10.1/querying/dimensionspecs.html

Query Context

The query context is used for various query configuration parameters. The following parameters apply to all queries.

http://druid.io/docs/0.10.1/querying/query-context.html
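For example, a query can carry a context object alongside its other fields. A minimal sketch with illustrative values (timeout in milliseconds, priority as a relative weight):

 "context" : {
   "timeout"  : 60000,
   "priority" : 1
 }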

Multi-value dimensions

http://druid.io/docs/0.10.1/querying/multi-value-dimensions.html

1. Querying

Queries are made using an HTTP REST style request to queryable nodes (Broker, Historical, or Realtime). The query content is expressed as JSON:

curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>

Various third-party client libraries are also available:

http://druid.io/libraries.html

The Calcite adapter:

http://calcite.apache.org/docs/druid_adapter.html

  • Apache Calcite - SQL parser, planner and query engine whose Druid adapter can query data residing in Druid, and combine it with data in other locations; has local and remote JDBC drivers powered by Avatica
  • implydata/plyql - A command line and HTTP interface for issuing SQL queries to Druid

SQL Support for Druid

http://druid.io/docs/0.10.1/querying/sql.html


Query types

Aggregation Queries

  • Timeseries
  • TopN
  • GroupBy

Metadata Queries

  • Time Boundary
  • Segment Metadata
  • Datasource Metadata

Search Queries

  • Search

Which query type should I use?

Prefer Timeseries and TopN queries over GroupBy whenever possible. GroupBy is the most flexible query type but also has the poorest performance; for grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.

Query cancellation

 DELETE /druid/v2/{queryId}

For example, to cancel a query with ID abc123, send a DELETE request to the appropriate host and port:

 curl -X DELETE "http://host:port/druid/v2/abc123"
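Cancellation relies on the query having a known ID, which is assigned through the query context when the query is submitted. A minimal sketch (the ID value is arbitrary):

 {
   "queryType": "timeseries",
   ...
   "context": { "queryId": "abc123" }
 }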

Query Errors

If a query fails, you will get an HTTP 500 response containing a JSON object like the following:

 {
   "error" : "Query timeout",
   "errorMessage" : "Timeout waiting for task.",
   "errorClass" : "java.util.concurrent.TimeoutException",
   "host" : "druid1.example.com:8083"
 }
field | description
----- | -----------
error | A well-defined error code (see below)
errorMessage | A free-form message with more information about the error
errorClass | The class of the exception that caused the error
host | The host on which the error occurred


code | description
---- | -----------
Query timeout | The query timed out
Query interrupted | The query was interrupted, possibly due to JVM shutdown
Query cancelled | The query was cancelled through the query cancellation API
Resource limit exceeded | The query exceeded a configured resource limit (e.g. groupBy maxResults)
Unknown exception | Some other exception occurred

2. Aggregation queries: Timeseries

A timeseries query returns an array of JSON objects, where each object represents a value asked for by the query.

An example timeseries query object:

 {
   "queryType": "timeseries",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "descending": "true",
   "filter": {
     "type": "and",
     "fields": [
       { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
       { "type": "or",
         "fields": [
           { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
           { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
         ]
       }
     ]
   },
   "aggregations": [
     { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
     { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
   ],
   "postAggregations": [
     { "type": "arithmetic",
       "name": "sample_divide",
       "fn": "/",
       "fields": [
         { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
         { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
       ]
     }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
 }

There are 7 main parts to a timeseries query:

property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "timeseries" | yes
dataSource | The data source to query | yes
descending | Whether to produce a descending ordered result. Default is false (ascending) | no
intervals | ISO-8601 Intervals defining the time ranges to run the query over | yes
granularity | Defines the granularity to bucket query results (see Granularities) | yes
filter | See Filters | no
aggregations | See Aggregations | no
postAggregations | See Post Aggregations | no
context | See Context | no

The query above returns 2 data points, one for each day between 2012-01-01 and 2012-01-03. Each data point contains the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of sample_fieldName1 divided by sample_fieldName2:

 [
   {
     "timestamp": "2012-01-01T00:00:00.000Z",
     "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
   },
   {
     "timestamp": "2012-01-02T00:00:00.000Z",
     "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
   }
 ]

Zero-filling

Timeseries buckets with no values are filled with zeroes. For example, if you query the interval 2012-01-01/2012-01-04 and no data exists for 2012-01-02, you will receive:

 [
   {
     "timestamp": "2012-01-01T00:00:00.000Z",
     "result": { "sample_name1": <some_value> }
   },
   {
     "timestamp": "2012-01-02T00:00:00.000Z",
     "result": { "sample_name1": 0 }
   },
   {
     "timestamp": "2012-01-03T00:00:00.000Z",
     "result": { "sample_name1": <some_value> }
   }
 ]

Zero-filling can be disabled, so that empty buckets are omitted from the results, by setting skipEmptyBuckets = true in the query context, as in the following example:

 {
   "queryType": "timeseries",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "aggregations": [
     { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ],
   "context" : {
     "skipEmptyBuckets": "true"
   }
 }


3. Aggregation queries: TopN

TopN queries are approximate: each node ranks its top K results and returns only those top K to the broker, which merges them into the final topN. K is max(1000, threshold) and is configurable.

An example topN query object:

 {
   "queryType": "topN",
   "dataSource": "sample_data",
   "dimension": "sample_dim",
   "threshold": 5,
   "metric": "count",
   "granularity": "all",
   "filter": {
     "type": "and",
     "fields": [
       { "type": "selector", "dimension": "dim1", "value": "some_value" },
       { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
     ]
   },
   "aggregations": [
     { "type": "longSum", "name": "count", "fieldName": "count" },
     { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
   ],
   "postAggregations": [
     { "type": "arithmetic",
       "name": "average",
       "fn": "/",
       "fields": [
         { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
         { "type": "fieldAccess", "name": "count", "fieldName": "count" }
       ]
     }
   ],
   "intervals": [
     "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
   ]
 }

The parts of a topN query:

property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "topN" | yes
dataSource | The data source to query | yes
intervals | The time ranges to run the query over | yes
granularity | Defines the granularity to bucket query results | yes
filter | See Filters | no
aggregations | See Aggregations | no
postAggregations | See Post Aggregations | no
dimension | The dimension to run the topN over | yes
threshold | An integer defining the N in the topN | yes
metric | The metric the top list is computed and sorted by | yes
context | See Context | no

The results would look like so:

 [
   {
     "timestamp": "2013-08-31T00:00:00.000Z",
     "result": [
       { "dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712 },
       { "dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907 },
       { "dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143 },
       { "dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258 },
       { "dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45 }
     ]
   }
 ]

Behavior on multi-value dimensions

topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. For example, a topN on the dimension tags with filter "t1" AND "t3" would match only row1, and generate a result with three groups: t1, t2, and t3. If you only need to include values that match your filter, you can use a filtered dimensionSpec. This can also improve performance.

See Multi-value dimensions for more details.

Aliasing

The current TopN algorithm is approximate, so the returned values are not guaranteed to be exact.

The threshold at which nodes switch to the approximate algorithm can be tuned with the server parameter druid.query.topN.minTopNThreshold, but changing it requires a restart of the servers.

For exact results, use a groupBy query and sort the results yourself.

If approximate ranking is tolerable but a dimension has more than 1000 distinct values, you can still get exact aggregates for the top values by issuing two queries: the first approximately finds the top N values of the dimension, and the second queries only those N values from the first step.

Example first query:

 {
   "aggregations": [
     { "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_", "type": "longSum" }
   ],
   "dataSource": "tpch_year",
   "dimension": "l_orderkey",
   "granularity": "all",
   "intervals": [
     "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
   ],
   "metric": "L_QUANTITY_",
   "queryType": "topN",
   "threshold": 2
 }

Example second query:

 {
   "aggregations": [
     { "fieldName": "L_TAX_doubleSum", "name": "L_TAX_", "type": "doubleSum" },
     { "fieldName": "L_DISCOUNT_doubleSum", "name": "L_DISCOUNT_", "type": "doubleSum" },
     { "fieldName": "L_EXTENDEDPRICE_doubleSum", "name": "L_EXTENDEDPRICE_", "type": "doubleSum" },
     { "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_", "type": "longSum" },
     { "name": "count", "type": "count" }
   ],
   "dataSource": "tpch_year",
   "dimension": "l_orderkey",
   "filter": {
     "type": "or",
     "fields": [
       { "dimension": "l_orderkey", "type": "selector", "value": "103136" },
       { "dimension": "l_orderkey", "type": "selector", "value": "1648672" }
     ]
   },
   "granularity": "all",
   "intervals": [
     "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
   ],
   "metric": "L_QUANTITY_",
   "queryType": "topN",
   "threshold": 2
 }

4. Aggregation queries: groupBy

An example groupBy query object:

 {
   "queryType": "groupBy",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "dimensions": ["country", "device"],
   "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
   "filter": {
     "type": "and",
     "fields": [
       { "type": "selector", "dimension": "carrier", "value": "AT&T" },
       { "type": "or",
         "fields": [
           { "type": "selector", "dimension": "make", "value": "Apple" },
           { "type": "selector", "dimension": "make", "value": "Samsung" }
         ]
       }
     ]
   },
   "aggregations": [
     { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
     { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
   ],
   "postAggregations": [
     { "type": "arithmetic",
       "name": "avg_usage",
       "fn": "/",
       "fields": [
         { "type": "fieldAccess", "fieldName": "data_transfer" },
         { "type": "fieldAccess", "fieldName": "total_usage" }
       ]
     }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
   "having": {
     "type": "greaterThan",
     "aggregation": "total_usage",
     "value": 100
   }
 }
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "groupBy" | yes
dataSource | The data source to query | yes
dimensions | A JSON list of dimensions to group by | yes
limitSpec | See LimitSpec | no
having | See Having | no
granularity | Defines the granularity of the query | yes
filter | See Filters | no
aggregations | See Aggregations | no
postAggregations | See Post Aggregations | no
intervals | The time ranges to run the query over | yes
context | See Context | no


The query above groups results by day over the dimensions ["country", "device"], limits the output to 5000 rows, and keeps only groups where total_usage > 100:

 [
   {
     "version" : "v1",
     "timestamp" : "2012-01-01T00:00:00.000Z",
     "event" : {
       "country" : <some_dim_value_one>,
       "device" : <some_dim_value_two>,
       "total_usage" : <some_value_one>,
       "data_transfer" : <some_value_two>,
       "avg_usage" : <some_avg_usage_value>
     }
   },
   {
     "version" : "v1",
     "timestamp" : "2012-01-02T00:00:00.000Z",
     "event" : {
       "country" : <some_other_dim_value_one>,
       "device" : <some_other_dim_value_two>,
       "total_usage" : <some_other_value_one>,
       "data_transfer" : <some_other_value_two>,
       "avg_usage" : <some_other_avg_usage_value>
     }
   },
   ...
 ]

5. Metadata queries: Time Boundary

Time boundary queries return the earliest and latest data points of a data set:

 {
   "queryType" : "timeBoundary",
   "dataSource": "sample_datasource",
   "bound"     : < "maxTime" | "minTime" >, # optional, defaults to returning both timestamps if not set
   "filter"    : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
 }
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "timeBoundary" | yes
dataSource | The data source to query | yes
bound | Set to "maxTime" or "minTime" to return only the latest or the earliest timestamp; defaults to returning both | no
filter | See Filters | no
context | See Context | no

An example result:

 [
   {
     "timestamp" : "2013-05-09T18:24:00.000Z",
     "result" : {
       "minTime" : "2013-05-09T18:24:00.000Z",
       "maxTime" : "2013-05-09T18:37:00.000Z"
     }
   }
 ]

6. Metadata queries: Data Source Metadata

Data source metadata queries return the timestamp of the latest ingested event for the data source, not taking rollup into account:

 {
   "queryType" : "dataSourceMetadata",
   "dataSource": "sample_datasource"
 }
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "dataSourceMetadata" | yes
dataSource | The data source to query | yes
context | See Context | no

An example result:

 [
   {
     "timestamp" : "2013-05-09T18:24:00.000Z",
     "result" : {
       "maxIngestedEventTime" : "2013-05-09T18:24:09.007Z"
     }
   }
 ]

7. Search queries

A search query returns the dimension values that match the search specification.

The following query returns values of dim1 and dim2 that contain "Ke" (case-insensitive):

 {
   "queryType": "search",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "searchDimensions": [
     "dim1",
     "dim2"
   ],
   "query": {
     "type": "insensitive_contains",
     "value": "Ke"
   },
   "sort" : {
     "type": "lexicographic"
   },
   "intervals": [
     "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
   ]
 }

There are several main parts to a search query:

property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "search" | yes
dataSource | The data source to query | yes
granularity | Defines the granularity of the query | yes
filter | See Filters | no
limit | Defines the maximum number of search results to return per node | no (default to 1000)
intervals | The time ranges to run the query over | yes
searchDimensions | The dimensions to run the search over | no
query | See SearchQuerySpec | yes
sort | An object specifying how the results of the search should be sorted. Possible types are "lexicographic" (the default), "alphanumeric", "strlen", and "numeric". See Sorting Orders for more details | no
context | See Context | no


 [
   {
     "timestamp": "2012-01-01T00:00:00.000Z",
     "result": [
       { "dimension": "dim1", "value": "Ke$ha", "count": 3 },
       { "dimension": "dim2", "value": "Ke$haForPresident", "count": 1 }
     ]
   },
   {
     "timestamp": "2012-01-02T00:00:00.000Z",
     "result": [
       { "dimension": "dim1", "value": "SomethingThatContainsKe", "count": 1 },
       { "dimension": "dim2", "value": "SomethingElseThatContainsKe", "count": 2 }
     ]
   }
 ]

 

8. Select queries

Select queries return raw Druid rows and support pagination.

 
 {
   "queryType": "select",
   "dataSource": "wikipedia",
   "descending": "false",
   "dimensions": [],
   "metrics": [],
   "granularity": "all",
   "intervals": [
     "2013-01-01/2013-01-02"
   ],
   "pagingSpec": { "pagingIdentifiers": {}, "threshold": 5 }
 }
 
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "select" | yes
dataSource | The data source to query | yes
intervals | The time ranges to run the query over | yes
descending | Whether to produce a descending ordered result | no
filter | See Filters | no
dimensions | The list of dimensions to select; if left empty, all dimensions are returned | no
metrics | The list of metrics to select; if left empty, all metrics are returned | no
granularity | Defines the granularity of the query (see Granularities) | yes
pagingSpec | A JSON object indicating offsets into the scanned segments; the returned pagingIdentifiers are used for pagination in subsequent queries | yes
context | See Context | no

The format of the result is:

 
 [
   {
     "timestamp" : "2013-01-01T00:00:00.000Z",
     "result" : {
       "pagingIdentifiers" : {
         "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
       },
       "events" : [
         {
           "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 0,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "1", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "11._korpus_(NOVJ)", "language" : "sl", "newpage" : "0", "user" : "EmausBot", "count" : 1.0, "added" : 39.0, "delta" : 39.0, "variation" : 39.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 1,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "112_U.S._580", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 70.0, "delta" : 70.0, "variation" : 70.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 2,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._243", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 77.0, "delta" : 77.0, "variation" : 77.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 3,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._73", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 70.0, "delta" : 70.0, "variation" : 70.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 4,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._756", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 68.0, "delta" : 68.0, "variation" : 68.0, "deleted" : 0.0 }
         }
       ]
     }
   }
 ]

The threshold determines how many hits are returned; each hit carries an offset that serves as its index. For descending queries, the offsets are negative.

Result Pagination

 

The pagingSpec controls pagination: threshold determines how many rows appear on each page, and the initial query must include an empty pagingIdentifiers map:

 
 "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 

The returned pagingIdentifiers map is non-empty and identifies where the results left off. To fetch the next page, set pagingIdentifiers in the next query to the pagingIdentifiers value returned by the previous result.
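A sketch of the follow-up query's pagingSpec, assuming the first page returned the pagingIdentifiers shown in the result above; the offset is advanced past the last row of that page (4 → 5):

 "pagingSpec": {
   "pagingIdentifiers": {
     "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9": 5
   },
   "threshold": 5
 }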

9. Components: Datasources

A datasource defines the data that a query runs against.

Table Data Source

A single table as the source:

 

 {
   "type": "table",
   "name": "<string_value>"
 }

 

Union Data Source

 

A union of several tables (which must have the same schema) as the source:

 

 {
   "type": "union",
   "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
 }

 

Query Data Source

 

This is used for nested groupBys and is only currently supported for groupBys.

 

 {
   "type": "query",
   "query": {
     "type": "groupBy",
     ...
   }
 }
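A minimal sketch of a nested groupBy, with hypothetical names (sample_datasource, country, device, user_count): the inner query groups by country and device, and the outer query re-groups the inner results by country alone.

 {
   "queryType": "groupBy",
   "dataSource": {
     "type": "query",
     "query": {
       "queryType": "groupBy",
       "dataSource": "sample_datasource",
       "granularity": "day",
       "dimensions": ["country", "device"],
       "aggregations": [
         { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }
       ],
       "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
     }
   },
   "granularity": "day",
   "dimensions": ["country"],
   "aggregations": [
     { "type": "longSum", "name": "usage", "fieldName": "total_usage" }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
 }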

10. Components: Query Filters

Query filters are the equivalent of the WHERE clause in SQL.

Selector filter

The selector filter matches a specific dimension against a specific value. Selector filters can be combined with logical-expression filters to form more complex boolean expressions, and they support extraction functions (see Filtering with Extraction Functions).

The grammar for a selector filter is as follows:

 "filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
This is the equivalent of WHERE <dimension_string> = '<dimension_value_string>'.
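Selector filters also accept an extraction function. A minimal sketch, assuming a hypothetical last_name dimension: the substring extraction function (the same one used in the cardinality examples later in this document) reduces each value to its first character before the comparison, so this matches rows whose last_name starts with "D":

 "filter": {
   "type": "selector",
   "dimension": "last_name",
   "value": "D",
   "extractionFn": { "type": "substring", "index": 0, "length": 1 }
 }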

Column Comparison filter

Compares two dimensions to each other:

 "filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is the equivalent of WHERE <dimension_a> = <dimension_b>.

Regular expression filter

Matches a dimension against the given pattern, a Java regular expression. Regex filters support extraction functions (see Filtering with Extraction Functions):

 "filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

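For example, a sketch matching values of a hypothetical page dimension that begin with "Druid":

 "filter": { "type": "regex", "dimension": "page", "pattern": "^Druid" }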
Logical expression filters

AND

 "filter": { "type": "and", "fields": [<filter>, <filter>, ...] }

OR

 "filter": { "type": "or", "fields": [<filter>, <filter>, ...] }

NOT

The grammar for a NOT filter is as follows:

 "filter": { "type": "not", "field": <filter> }

JavaScript filter

The JavaScript filter matches a dimension against the specified JavaScript function predicate. The filter matches values for which the function returns true.

The function takes a single argument, the dimension value, and returns either true or false.

 {
   "type" : "javascript",
   "dimension" : <dimension_string>,
   "function" : "function(value) { <...> }"
 }

Example: the following matches any values of the dimension name between 'bar' and 'foo':

 {
   "type" : "javascript",
   "dimension" : "name",
   "function" : "function(x) { return(x >= 'bar' && x <= 'foo') }"
 }

The JavaScript filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

JavaScript-based functionality is disabled by default. Please refer to the Druid JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

Search filter

Matches on partial string values:

 {
   "filter": {
     "type": "search",
     "dimension": "product",
     "query": {
       "type": "insensitive_contains",
       "value": "foo"
     }
   }
 }
property | description | required?
-------- | ----------- | ---------
type | This String should always be "search" | yes
dimension | The dimension to search over | yes
query | A JSON object describing the search query spec (see below) | yes
extractionFn | Extraction function to apply to the dimension | no


Search Query Spec

Contains
property | description | required?
-------- | ----------- | ---------
type | This String should always be "contains" | yes
value | A String value to search for | yes
caseSensitive | Whether the string comparison is case-sensitive | no (default == false)
Insensitive Contains

property | description | required?
-------- | ----------- | ---------
type | This String should always be "insensitive_contains" | yes
value | A String value to search for | yes

An "insensitive_contains" search is equivalent to a "contains" search with "caseSensitive": false (or not provided).

Fragment
property | description | required?
-------- | ----------- | ---------
type | This String should always be "fragment" | yes
values | A JSON array of String values to run the search over | yes
caseSensitive | Whether strings should be compared as case-sensitive. Default: false (insensitive) | no
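A sketch of a fragment spec inside a search filter; it matches values that contain all of the listed fragments (the dimension name is hypothetical):

 {
   "filter": {
     "type": "search",
     "dimension": "product",
     "query": {
       "type": "fragment",
       "values": ["foo", "bar"]
     }
   }
 }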

In filter

The IN filter is the equivalent of the following SQL query:

  SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')

The grammar for an IN filter is as follows:

 {
   "type": "in",
   "dimension": "outlaw",
   "values": ["Good", "Bad", "Ugly"]
 }

The IN filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Like filter

Like filters can be used for basic wildcard searches, comparable to SQL LIKE. Patterns support "%" (matches any number of characters) and "_" (matches any one character).

property | type | description | required?
-------- | ---- | ----------- | ---------
type | String | This should always be "like" | yes
dimension | String | The dimension to filter on | yes
pattern | String | LIKE pattern, such as "foo%" or "___bar" | yes
escape | String | An escape character that can be used to escape special characters | no
extractionFn | Extraction function | Extraction function to apply to the dimension | no

Like filters support the use of extraction functions, see Filtering with Extraction Functions for details.

The following filter matches last_name LIKE "D%" (i.e. any last_name that starts with "D"):

 {
   "type": "like",
   "dimension": "last_name",
   "pattern": "D%"
 }
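A sketch using the escape character, assuming a hypothetical discount dimension: the escaped "%" is matched literally, so this matches values ending in "50%":

 {
   "type": "like",
   "dimension": "discount",
   "pattern": "%50\\%",
   "escape": "\\"
 }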

Bound filter

Bound filters match ranges of dimension values: greater than, less than, greater than or equal to, less than or equal to, and "between" (if both "lower" and "upper" are set).

property | type | description | required?
-------- | ---- | ----------- | ---------
type | String | This should always be "bound" | yes
dimension | String | The dimension to filter on | yes
lower | String | The lower bound for the filter | no
upper | String | The upper bound for the filter | no
lowerStrict | Boolean | Perform strict comparison on the lower bound ("<" instead of "<=") | no, default: false
upperStrict | Boolean | Perform strict comparison on the upper bound (">" instead of ">=") | no, default: false
ordering | String | Specifies the sorting order used when comparing values against the bounds (see Sorting Orders) | no, default: "lexicographic"
extractionFn | Extraction function | Extraction function to apply to the dimension | no

(1) 21 <= age <= 31

{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }

(2) foo <= name <= hoo, using the default lexicographic sorting order:

{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }

(3) 21 < age < 31:

{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(4) age < 31:

 { "type": "bound", "dimension": "age", "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(5) age >= 18:

 { "type": "bound", "dimension": "age", "lower": "18" , "ordering": "numeric" }

Interval Filter

The interval filter performs range filtering on columns that contain long millisecond values, most typically the __time column, but it also applies to long metric columns and to dimensions whose values can be parsed as long milliseconds. The filter takes ISO 8601 intervals, and matching is left-closed and right-open (i.e., start <= time < end).

property | type | description | required?
-------- | ---- | ----------- | ---------
type | String | This should always be "interval" | yes
dimension | String | The dimension to filter on | yes
intervals | Array | A JSON array containing ISO-8601 interval strings defining the time ranges to filter on | yes
extractionFn | Extraction function | Extraction function to apply to the dimension | no


The following example filters on the intervals October 1-7, 2014 and November 15-16, 2014:

 {
   "type" : "interval",
   "dimension" : "__time",
   "intervals" : [
     "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
     "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
   ]
 }

This is equivalent to the following OR of two bound filters:

 {
   "type": "or",
   "fields": [
     {
       "type": "bound",
       "dimension": "__time",
       "lower": "1412121600000",
       "lowerStrict": false,
       "upper": "1412640000000",
       "upperStrict": true,
       "ordering": "numeric"
     },
     {
       "type": "bound",
       "dimension": "__time",
       "lower": "1416009600000",
       "lowerStrict": false,
       "upper": "1416096000000",
       "upperStrict": true,
       "ordering": "numeric"
     }
   ]
 }

Column types

Druid supports filtering on timestamp, string, long, and float columns. How a filter is applied depends on the column type.

Filtering on numeric columns

Filtering for myFloatColumn = 10.1:

 "filter": {
   "type": "selector",
   "dimension": "myFloatColumn",
   "value": "10.1"
 }

Filtering for 10 <= myFloatColumn < 20:

 "filter": {
   "type": "bound",
   "dimension": "myFloatColumn",
   "ordering": "numeric",
   "lower": "10",
   "lowerStrict": false,
   "upper": "20",
   "upperStrict": true
 }

Filtering on the Timestamp Column

Filtering on a long timestamp value:

 "filter": {
   "type": "selector",
   "dimension": "__time",
   "value": "124457387532"
 }

Filtering on day of week:

 "filter": {
   "type": "selector",
   "dimension": "__time",
   "value": "Friday",
   "extractionFn": {
     "type": "timeFormat",
     "format": "EEEE",
     "timeZone": "America/New_York",
     "locale": "en"
   }
 }

Filtering on a set of ISO 8601 intervals:

 {
   "type" : "interval",
   "dimension" : "__time",
   "intervals" : [
     "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
     "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
   ]
 }

 

11. Components: Aggregations

Aggregations can be provided at ingestion time, as part of the ingestion spec, to summarize data before it enters Druid, and at query time, to compute results over the queried interval.

Count aggregator

 

 { "type" : "count", "name" : <output_name> }

Note that the count aggregator counts the number of Druid rows, which does not always equal the number of raw events ingested, because rollup can merge events at ingestion time. To count the number of ingested rows of data, include a count-type aggregator at ingestion time and a longSum aggregator over that metric at query time.
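A minimal sketch of that pattern, assuming the ingestion spec defines a count metric named "count". Ingestion-time metric:

 { "type" : "count", "name" : "count" }

Query-time aggregator over it (the output name is illustrative):

 { "type" : "longSum", "name" : "numIngestedEvents", "fieldName" : "count" }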

 

Sum aggregators

 

longSum aggregator

 

 { "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Computes the sum of values as a 64-bit, signed integer: fieldName names the metric column to sum over, and the result is emitted under output_name.

 

doubleSum aggregator

 

Computes and stores the sum of values as a 64-bit floating point value. Similar to longSum:

 

 { "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Min / Max aggregators

 

doubleMin aggregator

 

doubleMin computes the minimum of all metric values and Double.POSITIVE_INFINITY:

 

 { "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

doubleMax aggregator

 

doubleMax computes the maximum of all metric values and Double.NEGATIVE_INFINITY:

 

 { "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMin aggregator

 

longMin computes the minimum of all metric values and Long.MAX_VALUE:

 

 { "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMax aggregator

 

longMax computes the maximum of all metric values and Long.MIN_VALUE:

 

 { "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

First / Last aggregator

 

First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

 

Note that queries with first/last aggregators on a segment created with rollup enabled will return the rolled up value, and not the last value within the raw ingested data.
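A sketch of query-time usage in a timeseries query; the datasource and the price metric are hypothetical:

 {
   "queryType": "timeseries",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "aggregations": [
     { "type": "doubleLast", "name": "last_price", "fieldName": "price" }
   ],
   "intervals": [ "2013-01-01T00:00:00.000/2013-01-02T00:00:00.000" ]
 }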

 

doubleFirst aggregator

 

doubleFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
   "type" : "doubleFirst",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

doubleLast aggregator

 

doubleLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
   "type" : "doubleLast",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

longFirst aggregator

 

longFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
   "type" : "longFirst",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

longLast aggregator

 

longLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
   "type" : "longLast",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

JavaScript aggregator

 

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your JavaScript functions are expected to return floating-point values.

 

 {
   "type": "javascript",
   "name": "<output_name>",
   "fieldNames"  : [ <column1>, <column2>, ... ],
   "fnAggregate" : "function(current, column1, column2, ...) { <updates partial aggregate (current) based on the current row values> return <updated partial aggregate> }",
   "fnCombine"   : "function(partialA, partialB) { return <combined partial results>; }",
   "fnReset"     : "function() { return <initial value>; }"
 }

 

Example

 

 {
   "type": "javascript",
   "name": "sum(log(x)*y) + 10",
   "fieldNames": ["x", "y"],
   "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
   "fnCombine"   : "function(partialA, partialB) { return partialA + partialB; }",
   "fnReset"     : "function() { return 10; }"
 }

 

JavaScript-based functionality is disabled by default. Please refer to the Druid JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

 

Approximate Aggregations

 

Cardinality aggregator

 

The cardinality aggregator uses HyperLogLog to estimate the cardinality of dimensions at query time. It is slower than the hyperUnique aggregator; if you only care about the cardinality of a single dimension, using the hyperUnique aggregator is recommended instead.

 

 {
   "type": "cardinality",
   "name": "<output_name>",
   "fields": [ <dimension1>, <dimension2>, ... ],
   "byRow": <false | true> # (optional, defaults to false)
 }

Cardinality by value

When byRow = false (the default), the cardinality is computed over the union of all the values of the given dimensions.

 

  • For a single dimension, this is equivalent to:

 

 SELECT COUNT(DISTINCT(dimension)) FROM <datasource>

 

  • For multiple dimensions, this is equivalent to:

 

 SELECT COUNT(DISTINCT(value)) FROM (
   SELECT dim_1 as value FROM <datasource>
   UNION
   SELECT dim_2 as value FROM <datasource>
   UNION
   SELECT dim_3 as value FROM <datasource>
 )

 

Cardinality by row

When byRow = true, the cardinality is computed over distinct combinations of the dimension values in each row, equivalent to:

 SELECT COUNT(*) FROM (
   SELECT DIM1, DIM2, DIM3 FROM <datasource>
   GROUP BY DIM1, DIM2, DIM3
 )

 

Example

Determine the number of distinct countries people are living in or have come from:

 

 {
   "type": "cardinality",
   "name": "distinct_countries",
   "fields": [ "country_of_origin", "country_of_residence" ]
 }

 

Determine the number of distinct people (i.e. combinations of first and last name):

 

 {
   "type": "cardinality",
   "name": "distinct_people",
   "fields": [ "first_name", "last_name" ],
   "byRow" : true
 }

 

Determine the number of distinct starting characters of last names:

 

 {
   "type": "cardinality",
   "name": "distinct_last_name_first_char",
   "fields": [
     {
       "type" : "extraction",
       "dimension" : "last_name",
       "outputName" : "last_name_first_char",
       "extractionFn" : { "type" : "substring", "index" : 0, "length" : 1 }
     }
   ],
   "byRow" : true
 }

 

HyperUnique aggregator

 

Uses HyperLogLog to compute the estimated cardinality (e.g. unique users) of a dimension that has been aggregated as a "hyperUnique" metric at ingestion time:

 

 {
   "type" : "hyperUnique",
   "name" : <output_name>,
   "fieldName" : <metric_name>,
   "isInputHyperUnique" : false
 }
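A minimal sketch of the full pattern, assuming a hypothetical raw column user_id. Ingestion-time metric in the ingestion spec:

 { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }

Query-time aggregator over that metric:

 { "type" : "hyperUnique", "name" : "uv", "fieldName" : "unique_users" }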

 

12. Components: Post-Aggregations

Post-aggregations are specifications of processing applied to aggregated values before the final result is returned.

Arithmetic post-aggregator

The arithmetic post-aggregator applies an arithmetic function (+, -, *, /, or quotient) to the given fields, from left to right.

Note:

  • / division always returns 0 if dividing by 0, regardless of the numerator.
  • quotient division behaves like regular floating-point division.

An arithmetic post-aggregator may also specify an ordering, which defines the order of resulting values when sorting them:

  • If no ordering (or null) is specified, the default floating point ordering is used.
  • numericFirst ordering always returns finite values first, followed by NaN, and infinite values last.

 postAggregation : {
   "type"  : "arithmetic",
   "name"  : <output_name>,
   "fn"    : <arithmetic_function>,
   "fields": [<post_aggregator>, <post_aggregator>, ...],
   "ordering" : <null (default), or "numericFirst">
 }

Field accessor post-aggregator

The field accessor post-aggregator takes the output of an aggregator as its input.

fieldName refers to the output name of the aggregator given in the aggregations portion of the query.

 { "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

Constant post-aggregator

The constant post-aggregator always returns the specified value:

 {
   "type"  : "constant",
   "name"  : <output_name>,
   "value" : <numerical_value>
 }

Greatest / Least post-aggregators

doubleGreatest and longGreatest compute the maximum of all fields and Double.NEGATIVE_INFINITY. doubleLeast and longLeast compute the minimum of all fields and Double.POSITIVE_INFINITY.

The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of all rows for one specific column while doubleGreatest returns the highest value of multiple columns in one row. 

This is analogous to the difference between SQL's MAX aggregate and GREATEST function.

 {
   "type"  : "doubleGreatest",
   "name"  : <output_name>,
   "fields": [<post_aggregator>, <post_aggregator>, ...]
 }

HyperUnique Cardinality post-aggregator

The hyperUniqueCardinality post aggregator is used to wrap a hyperUnique object such that it can be used in post aggregations.

 {
   "type"  : "hyperUniqueCardinality",
   "name": <output name>,
   "fieldName"  : <the name field value of the hyperUnique aggregator>
 }

It can be used in a sample calculation like so:

 "aggregations" : [
   { "type" : "count", "name" : "rows" },
   { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "uniques" }
 ],
 "postAggregations" : [{
   "type"   : "arithmetic",
   "name"   : "average_users_per_row",
   "fn"     : "/",
   "fields" : [
     { "type" : "hyperUniqueCardinality", "fieldName" : "unique_users" },
     { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
   ]
 }]

Example

Computing a percentage of the total:

 {
   ...
   "aggregations" : [
     { "type" : "count", "name" : "rows" },
     { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" }
   ],
   "postAggregations" : [{
     "type"   : "arithmetic",
     "name"   : "average",
     "fn"     : "*",
     "fields" : [
       { "type"   : "arithmetic",
         "name"   : "div",
         "fn"     : "/",
         "fields" : [
           { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" },
           { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
         ]
       },
       { "type" : "constant", "name": "const", "value" : 100 }
     ]
   }]
   ...
 }


13. Components: Aggregation Granularity

The granularity field determines how data gets bucketed across the time dimension (hour, day, minute, etc.). It can be specified as a simple string, an exact duration, or an arbitrary period.

Simple Granularities

The supported simple string values are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter and year.

  • all buckets everything into a single bucket
  • none buckets data at the granularity at which it was ingested (millisecond by default)

For example, suppose the following data was ingested at millisecond granularity:

 {"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
 {"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
 {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
 {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}

After submitting a groupBy query with hour granularity,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":"hour",
   "dimensions":[ "language" ],
   "aggregations":[
     { "type":"count", "name":"count" }
   ],
   "intervals":[ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]
 }

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-31T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T23:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T03:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]


With day granularity, you will get:

 [
   { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]

With none granularity, the buckets match the ingested timestamps:

 [
   { "version" : "v1", "timestamp" : "2013-08-31T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T23:32:45.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T03:32:45.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]


With all granularity, everything falls into a single bucket:

 [
   { "version" : "v1", "timestamp" : "2000-01-01T00:00:00.000Z", "event" : { "count" : 4, "language" : "en" } }
 ]

Duration Granularities

Duration granularities are specified as an exact duration in milliseconds, with an optional origin.

This chunks up every hour, with each bucket starting on the half hour:

 {"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}

Example:

Reusing the data in the previous example, after submitting a groupBy query with 24 hours duration,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "duration", "duration": "86400000"},
   "dimensions":[ "language" ],
   "aggregations":[
     { "type":"count", "name":"count" }
   ],
   "intervals":[ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]
 }

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]

if you set the origin for the granularity to 2012-01-01T00:30:00Z,

    "granularity":{"type": "duration", "duration": "86400000", "origin":"2012-01-01T00:30:00Z"}

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-31T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]

Note that the timestamp for each bucket starts at the 30th minute.

Period Granularities

Period granularities are specified as arbitrary period combinations of years, months, weeks, hours, minutes and seconds (e.g. P2W, P3M, PT1H30M, PT0.750S) in ISO8601 format. They support specifying a time zone which determines where period boundaries start as well as the timezone of the returned timestamps. By default, years start on the first of January, months start on the first of the month and weeks start on Mondays unless an origin is specified.

Time zone is optional (defaults to UTC). Origin is optional (defaults to 1970-01-01T00:00:00 in the given time zone).

 {"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}

This will bucket by two-day chunks in the Pacific timezone.

 {"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}

This will bucket by 3-month chunks in the Pacific timezone where the three-month quarters are defined as starting from February.

Example

Reusing the data in the previous example, if you submit a groupBy query with 1 day period in Pacific timezone,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"},
   "dimensions":[ "language" ],
   "aggregations":[
     { "type":"count", "name":"count" }
   ],
   "intervals":[ "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00" ]
 }

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-30T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000-07:00", "event" : { "count" : 2, "language" : "en" } }
 ]

Note that the timestamp for each bucket has been converted to Pacific time. Rows {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are put in the same bucket because they fall on the same day in Pacific time.

Also note that the intervals in a groupBy query are not converted to the specified time zone; the time zone given in the granularity is only applied to the query results.

If you set the origin for the granularity to 1970-01-01T20:30:00-08:00,

    "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles", "origin": "1970-01-01T20:30:00-08:00"}

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-29T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-08-30T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } }
 ]

Note that the origin you specify has nothing to do with the time zone; it only serves as a starting point for locating the very first granularity bucket. In this case, rows {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are not in the same bucket.
