7-Druid-Querying


Original:

http://druid.io/docs/0.10.1/querying/querying.html

See also:

Transforming Dimension Values

The following JSON fields can be used in a query to operate on dimension values.

http://druid.io/docs/0.10.1/querying/dimensionspecs.html

Query Context

The query context is used for various query configuration parameters. The following parameters apply to all queries.

http://druid.io/docs/0.10.1/querying/query-context.html
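For example, a query can carry a context object alongside its other fields. A minimal sketch with illustrative values (timeout in milliseconds, priority as a relative weight):

 "context" : {
   "timeout"  : 60000,
   "priority" : 1
 }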

Multi-value dimensions

http://druid.io/docs/0.10.1/querying/multi-value-dimensions.html

1. Querying

Queries are made using an HTTP REST style request to queryable nodes (Broker, Historical, or Realtime). The query content is expressed as JSON:

curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>

Various third-party client libraries are also available:

http://druid.io/libraries.html

The Calcite adapter:

http://calcite.apache.org/docs/druid_adapter.html

  • Apache Calcite - SQL parser, planner and query engine whose Druid adapter can query data residing in Druid, and combine it with data in other locations; has local and remote JDBC drivers powered by Avatica
  • implydata/plyql - A command line and HTTP interface for issuing SQL queries to Druid

SQL Support for Druid

http://druid.io/docs/0.10.1/querying/sql.html


Query types

Aggregation Queries

  • Timeseries
  • TopN
  • GroupBy

Metadata Queries

  • Time Boundary
  • Segment Metadata
  • Datasource Metadata

Search Queries

  • Search

Which query type should I use?

Prefer Timeseries and TopN queries over GroupBy whenever possible. GroupBy is the most flexible query type but also has the poorest performance; for grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.

Query cancellation

 DELETE /druid/v2/{queryId}

For example, to cancel a query with ID abc123, send a DELETE request to the appropriate host and port:

 curl -X DELETE "http://host:port/druid/v2/abc123"
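Cancellation relies on the query having a known ID, which is assigned through the query context when the query is submitted. A minimal sketch (the ID value is arbitrary):

 {
   "queryType": "timeseries",
   ...
   "context": { "queryId": "abc123" }
 }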

Query Errors

If a query fails, you will get an HTTP 500 response containing a JSON object like the following:

 {
   "error" : "Query timeout",
   "errorMessage" : "Timeout waiting for task.",
   "errorClass" : "java.util.concurrent.TimeoutException",
   "host" : "druid1.example.com:8083"
 }
field | description
----- | -----------
error | A well-defined error code (see below)
errorMessage | A free-form message with more information about the error
errorClass | The class of the exception that caused the error
host | The host on which the error occurred


code | description
---- | -----------
Query timeout | The query timed out
Query interrupted | The query was interrupted, possibly due to JVM shutdown
Query cancelled | The query was cancelled through the query cancellation API
Resource limit exceeded | The query exceeded a configured resource limit (e.g. groupBy maxResults)
Unknown exception | Some other exception occurred

2. Aggregation queries: Timeseries

A timeseries query returns an array of JSON objects, where each object represents a value asked for by the query.

An example timeseries query object:

 {
   "queryType": "timeseries",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "descending": "true",
   "filter": {
     "type": "and",
     "fields": [
       { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
       { "type": "or",
         "fields": [
           { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
           { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
         ]
       }
     ]
   },
   "aggregations": [
     { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
     { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
   ],
   "postAggregations": [
     { "type": "arithmetic",
       "name": "sample_divide",
       "fn": "/",
       "fields": [
         { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
         { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
       ]
     }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
 }

There are 7 main parts to a timeseries query:

property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "timeseries" | yes
dataSource | The data source to query | yes
descending | Whether to produce a descending ordered result. Default is false (ascending) | no
intervals | ISO-8601 Intervals defining the time ranges to run the query over | yes
granularity | Defines the granularity to bucket query results (see Granularities) | yes
filter | See Filters | no
aggregations | See Aggregations | no
postAggregations | See Post Aggregations | no
context | See Context | no

The query above returns 2 data points, one for each day between 2012-01-01 and 2012-01-03. Each data point contains the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of sample_fieldName1 divided by sample_fieldName2:

 [
   {
     "timestamp": "2012-01-01T00:00:00.000Z",
     "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
   },
   {
     "timestamp": "2012-01-02T00:00:00.000Z",
     "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
   }
 ]

Zero-filling

Timeseries buckets with no values are filled with zeroes. For example, if you query the interval 2012-01-01/2012-01-04 and no data exists for 2012-01-02, you will receive:

 [
   {
     "timestamp": "2012-01-01T00:00:00.000Z",
     "result": { "sample_name1": <some_value> }
   },
   {
     "timestamp": "2012-01-02T00:00:00.000Z",
     "result": { "sample_name1": 0 }
   },
   {
     "timestamp": "2012-01-03T00:00:00.000Z",
     "result": { "sample_name1": <some_value> }
   }
 ]

Zero-filling can be disabled, so that empty buckets are omitted from the results, by setting skipEmptyBuckets = true in the query context, as in the following example:

 {
   "queryType": "timeseries",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "aggregations": [
     { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ],
   "context" : {
     "skipEmptyBuckets": "true"
   }
 }


3. Aggregation queries: TopN

TopN queries are approximate: each node ranks its top K results and returns only those top K to the broker, which merges them into the final topN. K is max(1000, threshold) and is configurable.

An example topN query object:

 {
   "queryType": "topN",
   "dataSource": "sample_data",
   "dimension": "sample_dim",
   "threshold": 5,
   "metric": "count",
   "granularity": "all",
   "filter": {
     "type": "and",
     "fields": [
       { "type": "selector", "dimension": "dim1", "value": "some_value" },
       { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
     ]
   },
   "aggregations": [
     { "type": "longSum", "name": "count", "fieldName": "count" },
     { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
   ],
   "postAggregations": [
     { "type": "arithmetic",
       "name": "average",
       "fn": "/",
       "fields": [
         { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
         { "type": "fieldAccess", "name": "count", "fieldName": "count" }
       ]
     }
   ],
   "intervals": [
     "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
   ]
 }

The parts of a topN query:

property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "topN" | yes
dataSource | The data source to query | yes
intervals | The time ranges to run the query over | yes
granularity | Defines the granularity to bucket query results | yes
filter | See Filters | no
aggregations | See Aggregations | no
postAggregations | See Post Aggregations | no
dimension | The dimension to run the topN over | yes
threshold | An integer defining the N in the topN | yes
metric | The metric the top list is computed and sorted by | yes
context | See Context | no

The results would look like so:

 [
   {
     "timestamp": "2013-08-31T00:00:00.000Z",
     "result": [
       { "dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712 },
       { "dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907 },
       { "dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143 },
       { "dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258 },
       { "dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45 }
     ]
   }
 ]

Behavior on multi-value dimensions

topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. For example, a topN on the dimension tags with filter "t1" AND "t3" would match only row1, and generate a result with three groups: t1, t2, and t3. If you only need to include values that match your filter, you can use a filtered dimensionSpec. This can also improve performance.

See Multi-value dimensions for more details.

Aliasing

The current TopN algorithm is approximate, so the returned values are not guaranteed to be exact.

The threshold at which nodes switch to the approximate algorithm can be tuned with the server parameter druid.query.topN.minTopNThreshold, but changing it requires a restart of the servers.

For exact results, use a groupBy query and sort the results yourself.

If approximate ranking is tolerable but a dimension has more than 1000 distinct values, you can still get exact aggregates for the top values by issuing two queries: the first approximately finds the top N values of the dimension, and the second queries only those N values from the first step.

Example first query:

 {
   "aggregations": [
     { "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_", "type": "longSum" }
   ],
   "dataSource": "tpch_year",
   "dimension": "l_orderkey",
   "granularity": "all",
   "intervals": [
     "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
   ],
   "metric": "L_QUANTITY_",
   "queryType": "topN",
   "threshold": 2
 }

Example second query:

 {
   "aggregations": [
     { "fieldName": "L_TAX_doubleSum", "name": "L_TAX_", "type": "doubleSum" },
     { "fieldName": "L_DISCOUNT_doubleSum", "name": "L_DISCOUNT_", "type": "doubleSum" },
     { "fieldName": "L_EXTENDEDPRICE_doubleSum", "name": "L_EXTENDEDPRICE_", "type": "doubleSum" },
     { "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_", "type": "longSum" },
     { "name": "count", "type": "count" }
   ],
   "dataSource": "tpch_year",
   "dimension": "l_orderkey",
   "filter": {
     "type": "or",
     "fields": [
       { "dimension": "l_orderkey", "type": "selector", "value": "103136" },
       { "dimension": "l_orderkey", "type": "selector", "value": "1648672" }
     ]
   },
   "granularity": "all",
   "intervals": [
     "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
   ],
   "metric": "L_QUANTITY_",
   "queryType": "topN",
   "threshold": 2
 }

4. Aggregation queries: groupBy

An example groupBy query object:

 {
   "queryType": "groupBy",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "dimensions": ["country", "device"],
   "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
   "filter": {
     "type": "and",
     "fields": [
       { "type": "selector", "dimension": "carrier", "value": "AT&T" },
       { "type": "or",
         "fields": [
           { "type": "selector", "dimension": "make", "value": "Apple" },
           { "type": "selector", "dimension": "make", "value": "Samsung" }
         ]
       }
     ]
   },
   "aggregations": [
     { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
     { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
   ],
   "postAggregations": [
     { "type": "arithmetic",
       "name": "avg_usage",
       "fn": "/",
       "fields": [
         { "type": "fieldAccess", "fieldName": "data_transfer" },
         { "type": "fieldAccess", "fieldName": "total_usage" }
       ]
     }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
   "having": {
     "type": "greaterThan",
     "aggregation": "total_usage",
     "value": 100
   }
 }
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "groupBy" | yes
dataSource | The data source to query | yes
dimensions | A JSON list of dimensions to group by | yes
limitSpec | See LimitSpec | no
having | See Having | no
granularity | Defines the granularity of the query | yes
filter | See Filters | no
aggregations | See Aggregations | no
postAggregations | See Post Aggregations | no
intervals | The time ranges to run the query over | yes
context | See Context | no


The query above groups results by day over the dimensions ["country", "device"], limits the output to 5000 rows, and keeps only groups where total_usage > 100:

 [
   {
     "version" : "v1",
     "timestamp" : "2012-01-01T00:00:00.000Z",
     "event" : {
       "country" : <some_dim_value_one>,
       "device" : <some_dim_value_two>,
       "total_usage" : <some_value_one>,
       "data_transfer" : <some_value_two>,
       "avg_usage" : <some_avg_usage_value>
     }
   },
   {
     "version" : "v1",
     "timestamp" : "2012-01-02T00:00:00.000Z",
     "event" : {
       "country" : <some_other_dim_value_one>,
       "device" : <some_other_dim_value_two>,
       "total_usage" : <some_other_value_one>,
       "data_transfer" : <some_other_value_two>,
       "avg_usage" : <some_other_avg_usage_value>
     }
   },
   ...
 ]

5. Metadata queries: Time Boundary

Time boundary queries return the earliest and latest data points of a data set:

 {
   "queryType" : "timeBoundary",
   "dataSource": "sample_datasource",
   "bound"     : < "maxTime" | "minTime" >, # optional, defaults to returning both timestamps if not set
   "filter"    : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
 }
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "timeBoundary" | yes
dataSource | The data source to query | yes
bound | Set to "maxTime" or "minTime" to return only the latest or the earliest timestamp; defaults to returning both | no
filter | See Filters | no
context | See Context | no

An example result:

 [
   {
     "timestamp" : "2013-05-09T18:24:00.000Z",
     "result" : {
       "minTime" : "2013-05-09T18:24:00.000Z",
       "maxTime" : "2013-05-09T18:37:00.000Z"
     }
   }
 ]

6. Metadata queries: Data Source Metadata

Data source metadata queries return the timestamp of the latest ingested event for the data source, not taking rollup into account:

 {
   "queryType" : "dataSourceMetadata",
   "dataSource": "sample_datasource"
 }
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "dataSourceMetadata" | yes
dataSource | The data source to query | yes
context | See Context | no

An example result:

 [
   {
     "timestamp" : "2013-05-09T18:24:00.000Z",
     "result" : {
       "maxIngestedEventTime" : "2013-05-09T18:24:09.007Z"
     }
   }
 ]

7. Search queries

A search query returns the dimension values that match the search specification.

The following query returns values of dim1 and dim2 that contain "Ke" (case-insensitive):

 {
   "queryType": "search",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "searchDimensions": [
     "dim1",
     "dim2"
   ],
   "query": {
     "type": "insensitive_contains",
     "value": "Ke"
   },
   "sort" : {
     "type": "lexicographic"
   },
   "intervals": [
     "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
   ]
 }

There are several main parts to a search query:

property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "search" | yes
dataSource | The data source to query | yes
granularity | Defines the granularity of the query | yes
filter | See Filters | no
limit | Defines the maximum number of search results to return per node | no (default to 1000)
intervals | The time ranges to run the query over | yes
searchDimensions | The dimensions to run the search over | no
query | See SearchQuerySpec | yes
sort | An object specifying how the results of the search should be sorted. Possible types are "lexicographic" (the default), "alphanumeric", "strlen", and "numeric". See Sorting Orders for more details | no
context | See Context | no


 [
   {
     "timestamp": "2012-01-01T00:00:00.000Z",
     "result": [
       { "dimension": "dim1", "value": "Ke$ha", "count": 3 },
       { "dimension": "dim2", "value": "Ke$haForPresident", "count": 1 }
     ]
   },
   {
     "timestamp": "2012-01-02T00:00:00.000Z",
     "result": [
       { "dimension": "dim1", "value": "SomethingThatContainsKe", "count": 1 },
       { "dimension": "dim2", "value": "SomethingElseThatContainsKe", "count": 2 }
     ]
   }
 ]

 

8. Select queries

Select queries return raw Druid rows and support pagination.

 
 {
   "queryType": "select",
   "dataSource": "wikipedia",
   "descending": "false",
   "dimensions": [],
   "metrics": [],
   "granularity": "all",
   "intervals": [
     "2013-01-01/2013-01-02"
   ],
   "pagingSpec": { "pagingIdentifiers": {}, "threshold": 5 }
 }
 
property | description | required?
-------- | ----------- | ---------
queryType | This String should always be "select" | yes
dataSource | The data source to query | yes
intervals | The time ranges to run the query over | yes
descending | Whether to produce a descending ordered result | no
filter | See Filters | no
dimensions | The list of dimensions to select; if left empty, all dimensions are returned | no
metrics | The list of metrics to select; if left empty, all metrics are returned | no
granularity | Defines the granularity of the query (see Granularities) | yes
pagingSpec | A JSON object indicating offsets into the scanned segments; the returned pagingIdentifiers are used for pagination in subsequent queries | yes
context | See Context | no

The format of the result is:

 
 [
   {
     "timestamp" : "2013-01-01T00:00:00.000Z",
     "result" : {
       "pagingIdentifiers" : {
         "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
       },
       "events" : [
         {
           "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 0,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "1", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "11._korpus_(NOVJ)", "language" : "sl", "newpage" : "0", "user" : "EmausBot", "count" : 1.0, "added" : 39.0, "delta" : 39.0, "variation" : 39.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 1,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "112_U.S._580", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 70.0, "delta" : 70.0, "variation" : 70.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 2,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._243", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 77.0, "delta" : 77.0, "variation" : 77.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 3,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._73", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 70.0, "delta" : 70.0, "variation" : 70.0, "deleted" : 0.0 }
         },
         {
           "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
           "offset" : 4,
           "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._756", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 68.0, "delta" : 68.0, "variation" : 68.0, "deleted" : 0.0 }
         }
       ]
     }
   }
 ]

The threshold determines how many hits are returned; each hit carries an offset that serves as its index. For descending queries, the offsets are negative.

Result Pagination

 

The pagingSpec controls pagination: threshold determines how many rows appear on each page, and the initial query must include an empty pagingIdentifiers map:

 
 "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 

The returned pagingIdentifiers map is non-empty and identifies where the results left off. To fetch the next page, set pagingIdentifiers in the next query to the pagingIdentifiers value returned by the previous result.
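A sketch of the follow-up query's pagingSpec, assuming the first page returned the pagingIdentifiers shown in the result above; the offset is advanced past the last row of that page (4 → 5):

 "pagingSpec": {
   "pagingIdentifiers": {
     "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9": 5
   },
   "threshold": 5
 }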

9. Components: Datasources

A datasource defines the data that a query runs against.

Table Data Source

A single table as the source:

 

 {
   "type": "table",
   "name": "<string_value>"
 }

 

Union Data Source

 

A union of several tables (which must have the same schema) as the source:

 

 {
   "type": "union",
   "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
 }

 

Query Data Source

 

This is used for nested groupBys and is only currently supported for groupBys.

 

 {
   "type": "query",
   "query": {
     "type": "groupBy",
     ...
   }
 }
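A minimal sketch of a nested groupBy, with hypothetical names (sample_datasource, country, device, user_count): the inner query groups by country and device, and the outer query re-groups the inner results by country alone.

 {
   "queryType": "groupBy",
   "dataSource": {
     "type": "query",
     "query": {
       "queryType": "groupBy",
       "dataSource": "sample_datasource",
       "granularity": "day",
       "dimensions": ["country", "device"],
       "aggregations": [
         { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }
       ],
       "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
     }
   },
   "granularity": "day",
   "dimensions": ["country"],
   "aggregations": [
     { "type": "longSum", "name": "usage", "fieldName": "total_usage" }
   ],
   "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
 }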

10. Components: Query Filters

Query filters are the equivalent of the WHERE clause in SQL.

Selector filter

The selector filter matches a specific dimension against a specific value. Selector filters can be combined with logical-expression filters to form more complex boolean expressions, and they support extraction functions (see Filtering with Extraction Functions).

The grammar for a selector filter is as follows:

 "filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
This is the equivalent of WHERE <dimension_string> = '<dimension_value_string>'.
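Selector filters also accept an extraction function. A minimal sketch, assuming a hypothetical last_name dimension: the substring extraction function (the same one used in the cardinality examples later in this document) reduces each value to its first character before the comparison, so this matches rows whose last_name starts with "D":

 "filter": {
   "type": "selector",
   "dimension": "last_name",
   "value": "D",
   "extractionFn": { "type": "substring", "index": 0, "length": 1 }
 }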

Column Comparison filter

Compares two dimensions to each other:

 "filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is the equivalent of WHERE <dimension_a> = <dimension_b>.

Regular expression filter

Matches a dimension against the given pattern, a Java regular expression. Regex filters support extraction functions (see Filtering with Extraction Functions):

 "filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

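For example, a sketch matching values of a hypothetical page dimension that begin with "Druid":

 "filter": { "type": "regex", "dimension": "page", "pattern": "^Druid" }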
Logical expression filters

AND

 "filter": { "type": "and", "fields": [<filter>, <filter>, ...] }

OR

 "filter": { "type": "or", "fields": [<filter>, <filter>, ...] }

NOT

The grammar for a NOT filter is as follows:

 "filter": { "type": "not", "field": <filter> }

JavaScript filter

The JavaScript filter matches a dimension against the specified JavaScript function predicate. The filter matches values for which the function returns true.

The function takes a single argument, the dimension value, and returns either true or false.

 {
   "type" : "javascript",
   "dimension" : <dimension_string>,
   "function" : "function(value) { <...> }"
 }

Example: the following matches any values of the dimension name between 'bar' and 'foo':

 {
   "type" : "javascript",
   "dimension" : "name",
   "function" : "function(x) { return(x >= 'bar' && x <= 'foo') }"
 }

The JavaScript filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

JavaScript-based functionality is disabled by default. Please refer to the Druid JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

Search filter

Matches on partial string values:

 {
   "filter": {
     "type": "search",
     "dimension": "product",
     "query": {
       "type": "insensitive_contains",
       "value": "foo"
     }
   }
 }
property | description | required?
-------- | ----------- | ---------
type | This String should always be "search" | yes
dimension | The dimension to search over | yes
query | A JSON object describing the search query spec (see below) | yes
extractionFn | Extraction function to apply to the dimension | no


Search Query Spec

Contains
property | description | required?
-------- | ----------- | ---------
type | This String should always be "contains" | yes
value | A String value to search for | yes
caseSensitive | Whether the string comparison is case-sensitive | no (default == false)
Insensitive Contains

property | description | required?
-------- | ----------- | ---------
type | This String should always be "insensitive_contains" | yes
value | A String value to search for | yes

An "insensitive_contains" search is equivalent to a "contains" search with "caseSensitive": false (or not provided).

Fragment
property | description | required?
-------- | ----------- | ---------
type | This String should always be "fragment" | yes
values | A JSON array of String values to run the search over | yes
caseSensitive | Whether strings should be compared as case-sensitive. Default: false (insensitive) | no
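A sketch of a fragment spec inside a search filter; it matches values that contain all of the listed fragments (the dimension name is hypothetical):

 {
   "filter": {
     "type": "search",
     "dimension": "product",
     "query": {
       "type": "fragment",
       "values": ["foo", "bar"]
     }
   }
 }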

In filter

The IN filter is the equivalent of the following SQL query:

  SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')

The grammar for an IN filter is as follows:

 {
   "type": "in",
   "dimension": "outlaw",
   "values": ["Good", "Bad", "Ugly"]
 }

The IN filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Like filter

Like filters can be used for basic wildcard searches, comparable to SQL LIKE. Patterns support "%" (matches any number of characters) and "_" (matches any one character).

property | type | description | required?
-------- | ---- | ----------- | ---------
type | String | This should always be "like" | yes
dimension | String | The dimension to filter on | yes
pattern | String | LIKE pattern, such as "foo%" or "___bar" | yes
escape | String | An escape character that can be used to escape special characters | no
extractionFn | Extraction function | Extraction function to apply to the dimension | no

Like filters support the use of extraction functions, see Filtering with Extraction Functions for details.

The following filter matches last_name LIKE "D%" (i.e. any last_name that starts with "D"):

 {
   "type": "like",
   "dimension": "last_name",
   "pattern": "D%"
 }
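A sketch using the escape character, assuming a hypothetical discount dimension: the escaped "%" is matched literally, so this matches values ending in "50%":

 {
   "type": "like",
   "dimension": "discount",
   "pattern": "%50\\%",
   "escape": "\\"
 }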

Bound filter

Bound filters match ranges of dimension values: greater than, less than, greater than or equal to, less than or equal to, and "between" (if both "lower" and "upper" are set).

property | type | description | required?
-------- | ---- | ----------- | ---------
type | String | This should always be "bound" | yes
dimension | String | The dimension to filter on | yes
lower | String | The lower bound for the filter | no
upper | String | The upper bound for the filter | no
lowerStrict | Boolean | Perform strict comparison on the lower bound ("<" instead of "<=") | no, default: false
upperStrict | Boolean | Perform strict comparison on the upper bound (">" instead of ">=") | no, default: false
ordering | String | Specifies the sorting order used when comparing values against the bounds (see Sorting Orders) | no, default: "lexicographic"
extractionFn | Extraction function | Extraction function to apply to the dimension | no

(1) 21 <= age <= 31

{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }

(2) foo <= name <= hoo, using the default lexicographic sorting order:

{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }

(3) 21 < age < 31:

{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(4) age < 31:

 { "type": "bound", "dimension": "age", "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(5) age >= 18:

 { "type": "bound", "dimension": "age", "lower": "18" , "ordering": "numeric" }

Interval Filter

The interval filter performs range filtering on columns that contain long millisecond values, most typically the __time column, but it also applies to long metric columns and to dimensions whose values can be parsed as long milliseconds. The filter takes ISO 8601 intervals, and matching is left-closed and right-open (i.e., start <= time < end).

property | type | description | required?
-------- | ---- | ----------- | ---------
type | String | This should always be "interval" | yes
dimension | String | The dimension to filter on | yes
intervals | Array | A JSON array containing ISO-8601 interval strings defining the time ranges to filter on | yes
extractionFn | Extraction function | Extraction function to apply to the dimension | no


The following example filters on the intervals October 1-7, 2014 and November 15-16, 2014:

 {
   "type" : "interval",
   "dimension" : "__time",
   "intervals" : [
     "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
     "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
   ]
 }

This is equivalent to the following OR of two bound filters:

 {
   "type": "or",
   "fields": [
     {
       "type": "bound",
       "dimension": "__time",
       "lower": "1412121600000",
       "lowerStrict": false,
       "upper": "1412640000000",
       "upperStrict": true,
       "ordering": "numeric"
     },
     {
       "type": "bound",
       "dimension": "__time",
       "lower": "1416009600000",
       "lowerStrict": false,
       "upper": "1416096000000",
       "upperStrict": true,
       "ordering": "numeric"
     }
   ]
 }

Column types

Druid supports filtering on timestamp, string, long, and float columns. How a filter is applied depends on the column type.

Filtering on numeric columns

Filtering for myFloatColumn = 10.1:

 "filter": {
   "type": "selector",
   "dimension": "myFloatColumn",
   "value": "10.1"
 }

Filtering for 10 <= myFloatColumn < 20:

 "filter": {
   "type": "bound",
   "dimension": "myFloatColumn",
   "ordering": "numeric",
   "lower": "10",
   "lowerStrict": false,
   "upper": "20",
   "upperStrict": true
 }

Filtering on the Timestamp Column

Filtering on a long timestamp value:

 "filter": {
   "type": "selector",
   "dimension": "__time",
   "value": "124457387532"
 }

Filtering on day of week:

 "filter": {
   "type": "selector",
   "dimension": "__time",
   "value": "Friday",
   "extractionFn": {
     "type": "timeFormat",
     "format": "EEEE",
     "timeZone": "America/New_York",
     "locale": "en"
   }
 }

Filtering on a set of ISO 8601 intervals:

 {
   "type" : "interval",
   "dimension" : "__time",
   "intervals" : [
     "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
     "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
   ]
 }

 

11. Components: Aggregations

Aggregations can be provided at ingestion time, as part of the ingestion spec, to summarize data before it enters Druid, and at query time, to compute results over the queried interval.

Count aggregator

 

 { "type" : "count", "name" : <output_name> }

Note that the count aggregator counts the number of Druid rows, which does not always equal the number of raw events ingested, because rollup can merge events at ingestion time. To count the number of ingested rows of data, include a count-type aggregator at ingestion time and a longSum aggregator over that metric at query time.
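A minimal sketch of that pattern, assuming the ingestion spec defines a count metric named "count". Ingestion-time metric:

 { "type" : "count", "name" : "count" }

Query-time aggregator over it (the output name is illustrative):

 { "type" : "longSum", "name" : "numIngestedEvents", "fieldName" : "count" }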

 

Sum aggregators

 

longSum aggregator

 

 { "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Computes the sum of values as a 64-bit, signed integer: fieldName names the metric column to sum over, and the result is emitted under output_name.

 

doubleSum aggregator

 

Computes and stores the sum of values as a 64-bit floating point value. Similar to longSum:

 

 { "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Min / Max aggregators

 

doubleMin aggregator

 

doubleMin computes the minimum of all metric values and Double.POSITIVE_INFINITY:

 

 { "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

doubleMax aggregator

 

doubleMax computes the maximum of all metric values and Double.NEGATIVE_INFINITY:

 

 { "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMin aggregator

 

longMin computes the minimum of all metric values and Long.MAX_VALUE:

 

 { "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMax aggregator

 

longMax computes the maximum of all metric values and Long.MIN_VALUE:

 

 { "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

First / Last aggregator

 

First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.

 

Note that queries with first/last aggregators on a segment created with rollup enabled will return the rolled up value, and not the last value within the raw ingested data.
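A sketch of query-time usage in a timeseries query; the datasource and the price metric are hypothetical:

 {
   "queryType": "timeseries",
   "dataSource": "sample_datasource",
   "granularity": "day",
   "aggregations": [
     { "type": "doubleLast", "name": "last_price", "fieldName": "price" }
   ],
   "intervals": [ "2013-01-01T00:00:00.000/2013-01-02T00:00:00.000" ]
 }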

 

doubleFirst aggregator

 

doubleFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
   "type" : "doubleFirst",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

doubleLast aggregator

 

doubleLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
   "type" : "doubleLast",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

longFirst aggregator

 

longFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
   "type" : "longFirst",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

longLast aggregator

 

longLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
   "type" : "longLast",
   "name" : <output_name>,
   "fieldName" : <metric_name>
 }

 

JavaScript aggregator

 

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your JavaScript functions are expected to return floating-point values.

 

 {
   "type": "javascript",
   "name": "<output_name>",
   "fieldNames"  : [ <column1>, <column2>, ... ],
   "fnAggregate" : "function(current, column1, column2, ...) { <updates partial aggregate (current) based on the current row values> return <updated partial aggregate> }",
   "fnCombine"   : "function(partialA, partialB) { return <combined partial results>; }",
   "fnReset"     : "function() { return <initial value>; }"
 }

 

Example

 

 {
   "type": "javascript",
   "name": "sum(log(x)*y) + 10",
   "fieldNames": ["x", "y"],
   "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
   "fnCombine"   : "function(partialA, partialB) { return partialA + partialB; }",
   "fnReset"     : "function() { return 10; }"
 }

 

JavaScript-based functionality is disabled by default. Please refer to the Druid JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

 

Approximate Aggregations

 

Cardinality aggregator

 

The cardinality aggregator uses HyperLogLog to estimate the cardinality of dimensions at query time. It is slower than the hyperUnique aggregator; if you only care about the cardinality of a single dimension, using the hyperUnique aggregator is recommended instead.

 

 {
   "type": "cardinality",
   "name": "<output_name>",
   "fields": [ <dimension1>, <dimension2>, ... ],
   "byRow": <false | true> # (optional, defaults to false)
 }

Cardinality by value

When byRow = false (the default), the cardinality is computed over the union of all the values of the given dimensions.

 

  • For a single dimension, this is equivalent to:

 

 SELECT COUNT(DISTINCT(dimension)) FROM <datasource>

 

  • For multiple dimensions, this is equivalent to:

 

 SELECT COUNT(DISTINCT(value)) FROM (
   SELECT dim_1 as value FROM <datasource>
   UNION
   SELECT dim_2 as value FROM <datasource>
   UNION
   SELECT dim_3 as value FROM <datasource>
 )

 

Cardinality by row

When byRow = true, the cardinality is computed over distinct combinations of the dimension values in each row, equivalent to:

 SELECT COUNT(*) FROM (
   SELECT DIM1, DIM2, DIM3 FROM <datasource>
   GROUP BY DIM1, DIM2, DIM3
 )

 

Example

Determine the number of distinct countries people are living in or have come from:

 

 {
   "type": "cardinality",
   "name": "distinct_countries",
   "fields": [ "country_of_origin", "country_of_residence" ]
 }

 

Determine the number of distinct people (i.e. combinations of first and last name):

 

 {
   "type": "cardinality",
   "name": "distinct_people",
   "fields": [ "first_name", "last_name" ],
   "byRow" : true
 }

 

Determine the number of distinct starting characters of last names:

 

 {
   "type": "cardinality",
   "name": "distinct_last_name_first_char",
   "fields": [
     {
       "type" : "extraction",
       "dimension" : "last_name",
       "outputName" : "last_name_first_char",
       "extractionFn" : { "type" : "substring", "index" : 0, "length" : 1 }
     }
   ],
   "byRow" : true
 }

 

HyperUnique aggregator

 

Uses HyperLogLog to compute the estimated cardinality (e.g. unique users) of a dimension that has been aggregated as a "hyperUnique" metric at ingestion time:

 

 {
   "type" : "hyperUnique",
   "name" : <output_name>,
   "fieldName" : <metric_name>,
   "isInputHyperUnique" : false
 }
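A minimal sketch of the full pattern, assuming a hypothetical raw column user_id. Ingestion-time metric in the ingestion spec:

 { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }

Query-time aggregator over that metric:

 { "type" : "hyperUnique", "name" : "uv", "fieldName" : "unique_users" }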

 

12. Components: Post-Aggregations

Post-aggregations are specifications of processing applied to aggregated values before the final result is returned.

Arithmetic post-aggregator

The arithmetic post-aggregator applies an arithmetic function (+, -, *, /, or quotient) to the given fields, from left to right.

Note:

  • / division always returns 0 if dividing by 0, regardless of the numerator.
  • quotient division behaves like regular floating-point division.

An arithmetic post-aggregator may also specify an ordering, which defines the order of resulting values when sorting them:

  • If no ordering (or null) is specified, the default floating point ordering is used.
  • numericFirst ordering always returns finite values first, followed by NaN, and infinite values last.

 postAggregation : {
   "type"  : "arithmetic",
   "name"  : <output_name>,
   "fn"    : <arithmetic_function>,
   "fields": [<post_aggregator>, <post_aggregator>, ...],
   "ordering" : <null (default), or "numericFirst">
 }

Field accessor post-aggregator

The field accessor post-aggregator takes the output of an aggregator as its input.

fieldName refers to the output name of the aggregator given in the aggregations portion of the query.

 { "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

Constant post-aggregator

The constant post-aggregator always returns the specified value:

 {
   "type"  : "constant",
   "name"  : <output_name>,
   "value" : <numerical_value>
 }

Greatest / Least post-aggregators

doubleGreatest and longGreatest compute the maximum of all fields and Double.NEGATIVE_INFINITY. doubleLeast and longLeast compute the minimum of all fields and Double.POSITIVE_INFINITY.

The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of all rows for one specific column while doubleGreatest returns the highest value of multiple columns in one row. 

This is analogous to the difference between SQL's MAX aggregate and GREATEST function.

 {
   "type"  : "doubleGreatest",
   "name"  : <output_name>,
   "fields": [<post_aggregator>, <post_aggregator>, ...]
 }

HyperUnique Cardinality post-aggregator

The hyperUniqueCardinality post aggregator is used to wrap a hyperUnique object such that it can be used in post aggregations.

 {
   "type"  : "hyperUniqueCardinality",
   "name": <output name>,
   "fieldName"  : <the name field value of the hyperUnique aggregator>
 }

It can be used in a sample calculation like so:

 "aggregations" : [
   { "type" : "count", "name" : "rows" },
   { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "uniques" }
 ],
 "postAggregations" : [{
   "type"   : "arithmetic",
   "name"   : "average_users_per_row",
   "fn"     : "/",
   "fields" : [
     { "type" : "hyperUniqueCardinality", "fieldName" : "unique_users" },
     { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
   ]
 }]

Example

Computing a percentage of the total:

 {
   ...
   "aggregations" : [
     { "type" : "count", "name" : "rows" },
     { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" }
   ],
   "postAggregations" : [{
     "type"   : "arithmetic",
     "name"   : "average",
     "fn"     : "*",
     "fields" : [
       { "type"   : "arithmetic",
         "name"   : "div",
         "fn"     : "/",
         "fields" : [
           { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" },
           { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
         ]
       },
       { "type" : "constant", "name": "const", "value" : 100 }
     ]
   }]
   ...
 }


13. Components: Aggregation Granularity

The granularity field determines how data gets bucketed across the time dimension (hour, day, minute, etc.). It can be specified as a simple string, an exact duration, or an arbitrary period.

Simple Granularities

The supported simple string values are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter and year.

  • all buckets everything into a single bucket
  • none buckets data at the granularity at which it was ingested (millisecond by default)

For example, suppose the following data was ingested at millisecond granularity:

 {"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
 {"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
 {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
 {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}

After submitting a groupBy query with hour granularity,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":"hour",
   "dimensions":[ "language" ],
   "aggregations":[
     { "type":"count", "name":"count" }
   ],
   "intervals":[ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]
 }

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-31T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T23:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T03:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]


With day granularity, you will get:

 [
   { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]

With none granularity, the buckets match the ingested timestamps:

 [
   { "version" : "v1", "timestamp" : "2013-08-31T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T23:32:45.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T03:32:45.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]


With all granularity, everything falls into a single bucket:

 [
   { "version" : "v1", "timestamp" : "2000-01-01T00:00:00.000Z", "event" : { "count" : 4, "language" : "en" } }
 ]

Duration Granularities

Duration granularities are specified as an exact duration in milliseconds, with an optional origin.

This chunks up every hour, with each bucket starting on the half hour:

 {"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}

Example:

Reusing the data in the previous example, after submitting a groupBy query with 24 hours duration,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "duration", "duration": "86400000"},
   "dimensions":[ "language" ],
   "aggregations":[
     { "type":"count", "name":"count" }
   ],
   "intervals":[ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]
 }

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]

if you set the origin for the granularity to 2012-01-01T00:30:00Z,

    "granularity":{"type": "duration", "duration": "86400000", "origin":"2012-01-01T00:30:00Z"}

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-31T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-03T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" } }
 ]

Note that the timestamp for each bucket starts at the 30th minute.

Period Granularities

Period granularities are specified as arbitrary period combinations of years, months, weeks, hours, minutes and seconds (e.g. P2W, P3M, PT1H30M, PT0.750S) in ISO8601 format. They support specifying a time zone which determines where period boundaries start as well as the timezone of the returned timestamps. By default, years start on the first of January, months start on the first of the month and weeks start on Mondays unless an origin is specified.

Time zone is optional (defaults to UTC). Origin is optional (defaults to 1970-01-01T00:00:00 in the given time zone).

 {"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}

This will bucket by two-day chunks in the Pacific timezone.

 {"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}

This will bucket by 3-month chunks in the Pacific timezone where the three-month quarters are defined as starting from February.

Example

Reusing the data in the previous example, if you submit a groupBy query with 1 day period in Pacific timezone,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"},
   "dimensions":[ "language" ],
   "aggregations":[
     { "type":"count", "name":"count" }
   ],
   "intervals":[ "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00" ]
 }

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-30T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000-07:00", "event" : { "count" : 2, "language" : "en" } }
 ]

Note that the timestamp for each bucket has been converted to Pacific time. Rows {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are put in the same bucket because they fall on the same day in Pacific time.

Also note that the intervals in a groupBy query are not converted to the specified time zone; the time zone given in the granularity is only applied to the query results.

If you set the origin for the granularity to 1970-01-01T20:30:00-08:00,

    "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles", "origin": "1970-01-01T20:30:00-08:00"}

you will get

 [
   { "version" : "v1", "timestamp" : "2013-08-29T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-08-30T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-01T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
   { "version" : "v1", "timestamp" : "2013-09-02T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" } }
 ]

Note that the origin you specify has nothing to do with the time zone; it only serves as a starting point for locating the very first granularity bucket. In this case, rows {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are not in the same bucket.
