7-Druid-Querying
Source: Internet | Editor: 程序博客网 | Date: 2024/06/06
Original:
http://druid.io/docs/0.10.1/querying/querying.html
Related:
Transforming Dimension Values
The following JSON fields can be used in a query to operate on dimension values.
http://druid.io/docs/0.10.1/querying/dimensionspecs.html
Query Context
The query context is used for various query configuration parameters. The following parameters apply to all queries.
http://druid.io/docs/0.10.1/querying/query-context.html
Multi-value dimensions
http://druid.io/docs/0.10.1/querying/multi-value-dimensions.html
1. Querying
Queries are issued to Druid nodes (Broker, Historical, or Realtime) as HTTP REST POST requests, with the query expressed as JSON in the request body:
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
Third-party query libraries are also available: client libraries
http://druid.io/libraries.html
The Calcite adapter: http://calcite.apache.org/docs/druid_adapter.html
- Apache Calcite - SQL parser, planner and query engine whose Druid adapter can query data residing in Druid, and combine it with data in other locations; has local and remote JDBC drivers powered by Avatica
- implydata/plyql - A command line and HTTP interface for issuing SQL queries to Druid
SQL Support for Druid
http://druid.io/docs/0.10.1/querying/sql.html
Query types
Aggregation Queries
- Timeseries
- TopN
- GroupBy
Metadata Queries
- Time Boundary
- Segment Metadata
- Datasource Metadata
Search Queries
- Search
Which query type to use
Prefer Timeseries and TopN queries over GroupBy where possible. GroupBy is the most flexible query type but also has the poorest performance. For grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.
Cancelling queries
DELETE /druid/v2/{queryId}
For example, to cancel the query with ID abc123, send a DELETE request to the appropriate host and port:
curl -X DELETE "http://host:port/druid/v2/abc123"
Query Errors
If a query fails, you get an HTTP 500 response containing a JSON object like:
{
  "error" : "Query timeout",
  "errorMessage" : "Timeout waiting for task.",
  "errorClass" : "java.util.concurrent.TimeoutException",
  "host" : "druid1.example.com:8083"
}
The possible error codes are:
- Query timeout: the query took too long to finish.
- Query interrupted: the query was interrupted, possibly due to a JVM shutdown.
- Query cancelled: the query was cancelled through the DELETE API.
- Resource limit exceeded: the query exceeded a configured resource limit (e.g. groupBy maxResults).
- Unknown exception: some other exception occurred.
2. Aggregation queries: Timeseries
A timeseries query returns a JSON array of objects, one per time bucket, where each object holds that bucket's result values.
An example timeseries query object:
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      {
        "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
The main parts of a timeseries query:
- queryType: must be "timeseries" (required)
- dataSource: the data source to query (required)
- descending: whether to make a descending ordered result; default is false (ascending) (optional)
- granularity: defines how query results are bucketed in time; see Granularities (required)
- intervals: the time ranges to run the query over (required)
- filter: see Filters (optional)
- aggregations: see Aggregations (optional)
- postAggregations: see Post Aggregations (optional)
- context: see Context (optional)
The query above would return 2 data points, each containing the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the post-aggregated (double) result of sample_fieldName1 / sample_fieldName2:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  }
]
Zero-filling of empty buckets
Timeseries queries fill buckets that have no data with zeroes. For example, querying the interval 2012-01-01/2012-01-04 when 2012-01-02 contains no data returns:
[
  { "timestamp": "2012-01-01T00:00:00.000Z", "result": { "sample_name1": <some_value> } },
  { "timestamp": "2012-01-02T00:00:00.000Z", "result": { "sample_name1": 0 } },
  { "timestamp": "2012-01-03T00:00:00.000Z", "result": { "sample_name1": <some_value> } }
]
Zero-filling can be turned off by setting skipEmptyBuckets to "true" in the query context, in which case empty buckets are omitted from the result. Such a query looks like:
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ],
  "context" : { "skipEmptyBuckets": "true" }
}
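The zero-filling behavior can be reproduced on the client side. The sketch below is illustrative only (the helper name and the day-granularity loop are assumptions, not a Druid API): it walks every day bucket in the interval and inserts zeroed aggregates for buckets Druid has no data for, unless skip_empty_buckets is set.

```python
from datetime import datetime, timedelta, timezone

def zero_fill(results, start, end, agg_names, skip_empty_buckets=False):
    """Emit one bucket per day in [start, end); missing buckets get zeroed
    aggregates (day granularity assumed for this sketch)."""
    by_ts = {r["timestamp"]: r for r in results}
    out, t = [], start
    while t < end:
        ts = t.strftime("%Y-%m-%dT%H:%M:%S.000Z")
        if ts in by_ts:
            out.append(by_ts[ts])
        elif not skip_empty_buckets:
            out.append({"timestamp": ts, "result": {n: 0 for n in agg_names}})
        t += timedelta(days=1)
    return out
```

With skip_empty_buckets=True the sparse result is passed through unchanged, mirroring the skipEmptyBuckets context flag.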
3. Aggregation queries: TopN
In a TopN query, each node ranks its local top K values for a dimension and sends only those to the broker, which merges them into the final topN list. K is max(1000, threshold) and is configurable.
An example topN query object:
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "dim1", "value": "some_value" },
      { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
        { "type": "fieldAccess", "name": "count", "fieldName": "count" }
      ]
    }
  ],
  "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" ]
}
The main parts:
- dimension: the dimension to take the top of (required)
- threshold: an integer defining N, the number of values returned (required)
- metric: the metric by which the top list is calculated and sorted (required)
- context: see Context (optional)
The results would look like so:
[
  {
    "timestamp": "2013-08-31T00:00:00.000Z",
    "result": [
      { "dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712 },
      { "dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907 },
      { "dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143 },
      { "dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258 },
      { "dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45 }
    ]
  }
]
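The per-node-then-broker ranking described above can be sketched with plain dictionaries. This is a simplification for intuition only (the function name and row shapes are assumptions): each node has already truncated to its local top K, so values that rank low on every node can be lost, which is why topN is approximate.

```python
from collections import defaultdict

def broker_merge_topn(node_results, metric, threshold):
    """Merge per-node top-K lists: sum the metric per dimension value,
    re-rank, and keep the global top `threshold` entries."""
    merged = defaultdict(float)
    dims = {}
    for rows in node_results:
        for row in rows:
            key = tuple(sorted((k, v) for k, v in row.items() if k != metric))
            merged[key] += row[metric]
            dims[key] = {k: v for k, v in row.items() if k != metric}
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [dict(dims[k], **{metric: v}) for k, v in ranked[:threshold]]
```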
Behavior on multi-value dimensions
topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. For example, a topN on the dimension tags with filter "t1" AND "t3" would match only row1, and generate a result with three groups: t1, t2, and t3. If you only need to include values that match your filter, you can use a filtered dimensionSpec. This can also improve performance.
See Multi-value dimensions for more details.
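The "more groups than rows" effect is easy to demonstrate with a small sketch (the helper and the callable filter are illustrative assumptions, not Druid code): every value of the multi-value dimension in each matching row becomes its own group.

```python
def multi_value_groups(rows, dimension, row_filter):
    """One group per value of the multi-value dimension, taken from every
    row that matches the filter (sketch of the documented topN behavior)."""
    counts = {}
    for row in rows:
        if row_filter(row):
            for value in row[dimension]:
                counts[value] = counts.get(value, 0) + 1
    return counts

rows = [{"tags": ["t1", "t2", "t3"]},   # row1
        {"tags": ["t3", "t4"]}]         # row2
# The filter "t1" AND "t3" matches only row1, yet yields three groups.
groups = multi_value_groups(rows, "tags",
                            lambda r: "t1" in r["tags"] and "t3" in r["tags"])
```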
Aliasing
The current TopN algorithm is an approximate algorithm.
The per-node threshold can be raised via the server parameter druid.query.topN.minTopNThreshold, which requires a server restart to take effect.
Fully exact results require a groupBy followed by a sort, which is expensive.
If you can tolerate approximate rank (but need exact aggregates) for a dimension with more than 1000 unique values, you can issue two queries: one to get the approximate topN dimension values, and a second restricted to exactly those dimension values from the first step.
Example First query:
{ "aggregations": [ { "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_", "type": "longSum" } ], "dataSource": "tpch_year", "dimension":"l_orderkey", "granularity": "all", "intervals": [ "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z" ], "metric": "L_QUANTITY_", "queryType": "topN", "threshold": 2}
Example second query:
{ "aggregations": [ { "fieldName": "L_TAX_doubleSum", "name": "L_TAX_", "type": "doubleSum" }, { "fieldName": "L_DISCOUNT_doubleSum", "name": "L_DISCOUNT_", "type": "doubleSum" }, { "fieldName": "L_EXTENDEDPRICE_doubleSum", "name": "L_EXTENDEDPRICE_", "type": "doubleSum" }, { "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_", "type": "longSum" }, { "name": "count", "type": "count" } ], "dataSource": "tpch_year", "dimension":"l_orderkey", "filter": { "fields": [ { "dimension": "l_orderkey", "type": "selector", "value": "103136" }, { "dimension": "l_orderkey", "type": "selector", "value": "1648672" } ], "type": "or" }, "granularity": "all", "intervals": [ "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z" ], "metric": "L_QUANTITY_", "queryType": "topN", "threshold": 2}
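The second query above is mechanical to derive from the first: take the dimension values the approximate topN returned and pin the follow-up query to exactly those values with an OR of selector filters. A sketch (the helper name is an assumption; real queries would be POSTed over HTTP):

```python
def build_exact_phase_query(first_query, first_results):
    """Build the exact second-phase topN from the approximate first-phase
    result by filtering to the returned dimension values."""
    dimension = first_query["dimension"]
    values = [row[dimension] for row in first_results[0]["result"]]
    second = dict(first_query)
    second["filter"] = {
        "type": "or",
        "fields": [{"type": "selector", "dimension": dimension, "value": v}
                   for v in values],
    }
    return second
```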
4. Aggregation queries: GroupBy
An example groupBy query object:
{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      {
        "type": "or",
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": { "type": "greaterThan", "aggregation": "total_usage", "value": 100 }
}
The main parts:
- dimensions: the dimensions to group by (required)
- limitSpec: see LimitSpec (optional)
- having: see Having (optional)
- granularity: as above (required)
- filter: see Filters (optional)
- aggregations: see Aggregations (optional)
- postAggregations: see Post Aggregations (optional)
- intervals: as above (required)
- context: as above (optional)
The query above groups by day on ["country", "device"], limits the output to 5000 rows, and keeps only groups with total_usage > 100:
[ { "version" : "v1", "timestamp" : "2012-01-01T00:00:00.000Z", "event" : { "country" : <some_dim_value_one>, "device" : <some_dim_value_two>, "total_usage" : <some_value_one>, "data_transfer" :<some_value_two>, "avg_usage" : <some_avg_usage_value> } }, { "version" : "v1", "timestamp" : "2012-01-01T00:00:12.000Z", "event" : { "dim1" : <some_other_dim_value_one>, "dim2" : <some_other_dim_value_two>, "sample_name1" : <some_other_value_one>, "sample_name2" :<some_other_value_two>, "avg_usage" : <some_other_avg_usage_value> } },...]
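The greaterThan having spec used in the query above amounts to a post-grouping row filter. A minimal client-side sketch (only the greaterThan type is handled; the helper name is an assumption):

```python
def apply_having(rows, having):
    """Keep only groupBy result rows whose named aggregation exceeds the
    'greaterThan' having spec's value (client-side sketch)."""
    if having["type"] != "greaterThan":
        raise ValueError("only greaterThan is sketched here")
    agg, value = having["aggregation"], having["value"]
    return [r for r in rows if r["event"][agg] > value]
```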
5. Metadata queries: Time Boundary
A time boundary query returns the earliest and latest data points of a data set:
{
  "queryType" : "timeBoundary",
  "dataSource": "sample_datasource",
  "bound"     : < "maxTime" | "minTime" >, # optional; defaults to returning both timestamps if not set
  "filter"    : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
}
- bound: "maxTime" or "minTime"; defaults to returning both timestamps if not set (optional)
- filter: see Filters (optional)
- context: see Context (optional)
A sample result:
[ { "timestamp" : "2013-05-09T18:24:00.000Z", "result" : { "minTime" : "2013-05-09T18:24:00.000Z", "maxTime" : "2013-05-09T18:37:00.000Z" }} ]
6. Metadata queries: Data Source Metadata
This returns the timestamp of the most recently ingested event for the data source, regardless of rollup:
{ "queryType" : "dataSourceMetadata", "dataSource": "sample_datasource"}
结果
[ { "timestamp" : "2013-05-09T18:24:00.000Z", "result" : { "maxIngestedEventTime" : "2013-05-09T18:24:09.007Z" } } ]
7. Search queries
A search query returns the dimension values that match a search spec.
The following query returns values in dim1 and dim2 that contain "Ke" (case-insensitive):
{
  "queryType": "search",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "searchDimensions": [ "dim1", "dim2" ],
  "query": { "type": "insensitive_contains", "value": "Ke" },
  "sort" : { "type": "lexicographic" },
  "intervals": [ "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000" ]
}
There are several main parts to a search query:
- limit: the maximum number of results each node returns (optional, defaults to 1000)
- intervals: the time ranges to query (required)
- searchDimensions: the dimensions to run the search over (optional)
- query: see SearchQuerySpec (required)
- sort: an object specifying how the results of the search should be sorted; possible types are "lexicographic" (the default), "alphanumeric", "strlen", and "numeric". See Sorting Orders for more details. (optional)
- context: see Context (optional)
A sample result:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [
      { "dimension": "dim1", "value": "Ke$ha", "count": 3 },
      { "dimension": "dim2", "value": "Ke$haForPresident", "count": 1 }
    ]
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": [
      { "dimension": "dim1", "value": "SomethingThatContainsKe", "count": 1 },
      { "dimension": "dim2", "value": "SomethingElseThatContainsKe", "count": 2 }
    ]
  }
]
8. Select queries
Select queries return raw Druid rows and support pagination:
{
  "queryType": "select",
  "dataSource": "wikipedia",
  "descending": "false",
  "dimensions": [],
  "metrics": [],
  "granularity": "all",
  "intervals": [ "2013-01-01/2013-01-02" ],
  "pagingSpec": { "pagingIdentifiers": {}, "threshold": 5 }
}
- dimensions: the dimensions to return; if empty, all dimensions are returned (optional)
- metrics: the metrics to return; if empty, all metrics are returned (optional)
- granularity: see Granularities (required)
- pagingSpec: offsets into the scanned segments; the pagingIdentifiers returned in the result are used for pagination in the next query (required)
The format of the result is:
[{ "timestamp" : "2013-01-01T00:00:00.000Z", "result" : { "pagingIdentifiers" : { "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4 }, "events" : [ { "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9", "offset" : 0, "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "1", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "11._korpus_(NOVJ)", "language" : "sl", "newpage" : "0", "user" : "EmausBot", "count" : 1.0, "added" : 39.0, "delta" : 39.0, "variation" : 39.0, "deleted" : 0.0 } }, { "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9", "offset" : 1, "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "112_U.S._580", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 70.0, "delta" : 70.0, "variation" : 70.0, "deleted" : 0.0 } }, { "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9", "offset" : 2, "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._243", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 77.0, "delta" : 77.0, "variation" : 77.0, "deleted" : 0.0 } }, { "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9", "offset" : 3, "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._73", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 70.0, "delta" : 70.0, "variation" : 70.0, "deleted" : 0.0 } }, { "segmentId" : 
"wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9", "offset" : 4, "event" : { "timestamp" : "2013-01-01T00:00:00.000Z", "robot" : "0", "namespace" : "article", "anonymous" : "0", "unpatrolled" : "0", "page" : "113_U.S._756", "language" : "en", "newpage" : "1", "user" : "MZMcBride", "count" : 1.0, "added" : 68.0, "delta" : 68.0, "variation" : 68.0, "deleted" : 0.0 } } ] }} ]
threshold determines how many hits are returned; each hit carries an offset that indexes it within the result set. When descending is set, offsets are negative.
Result Pagination
The pagingSpec drives pagination: threshold determines how many rows each page holds. The first query must include an empty pagingIdentifiers map:
"pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
The response carries a populated pagingIdentifiers map identifying the offsets within the result set. To fetch the next page, pass that map verbatim as the pagingIdentifiers of the follow-up query.
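Deriving the follow-up page query is a small transformation: copy the previous response's pagingIdentifiers into the next query's pagingSpec. A sketch (the helper name is an assumption; the actual round trip happens over HTTP):

```python
def next_page_query(query, last_result):
    """Build the follow-up select query by copying pagingIdentifiers from
    the previous page's result."""
    nxt = dict(query)
    nxt["pagingSpec"] = {
        "pagingIdentifiers": last_result["result"]["pagingIdentifiers"],
        "threshold": query["pagingSpec"]["threshold"],
    }
    return nxt
```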
9. Components: Datasources
A data source names what is being queried.
Table Data Source
The simplest data source, a single table:
{ "type": "table", "name": "<string_value>"}
Union Data Source
Unions several tables (which must have the same schema):
{ "type": "union", "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]}
Query Data Source
This is used for nested groupBys and is only currently supported for groupBys.
{ "type": "query", "query": { "type": "groupBy", ... }}
10. Components: Query Filters
A filter corresponds to the WHERE clause in SQL.
Selector filter
The selector filter matches a specific dimension against a specific value. Selector filters can be composed into more complex Boolean expressions, and support Filtering with Extraction Functions.
The grammar is:
"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
This is equivalent to:
WHERE <dimension_string> = '<dimension_value_string>'
Column Comparison filter
The column comparison filter compares two dimensions to each other:
"filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }
This is equivalent to WHERE <dimension_a> = <dimension_b>.
Regular expression filter
The regex filter matches a dimension against a Java regular expression pattern, and supports Filtering with Extraction Functions:
"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }
Logical expression filters
AND
"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
OR
"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
NOT
The grammar for a NOT filter is as follows:
"filter": { "type": "not", "field": <filter> }
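A tiny recursive evaluator makes the semantics of these composable filters concrete. This is a sketch for intuition only (Druid evaluates filters against bitmap indexes, not row by row like this), covering just selector, and, or, and not:

```python
def matches(row, f):
    """Recursively evaluate a Druid-style filter tree against one row."""
    t = f["type"]
    if t == "selector":
        return row.get(f["dimension"]) == f["value"]
    if t == "and":
        return all(matches(row, sub) for sub in f["fields"])
    if t == "or":
        return any(matches(row, sub) for sub in f["fields"])
    if t == "not":
        return not matches(row, f["field"])
    raise ValueError("unsupported filter type: " + t)
```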
JavaScript filter
The JavaScript filter matches a dimension against the specified JavaScript function predicate. The filter matches values for which the function returns true.
The function takes a single argument, the dimension value, and returns either true or false.
{ "type" : "javascript", "dimension" : <dimension_string>, "function" : "function(value) { <...> }"}
Example The following matches any dimension values for the dimension name
between 'bar'
and 'foo'
{ "type" : "javascript", "dimension" : "name", "function" : "function(x) { return(x >= 'bar' && x <= 'foo') }"}
The JavaScript filter supports the use of extraction functions, see Filtering with Extraction Functions for details.
Search filter
The search filter matches on partial string matches:
{ "filter": { "type": "search", "dimension": "product", "query": { "type": "insensitive_contains", "value": "foo" } }}
Search Query Spec
Contains
Insensitive Contains
An "insensitive_contains" search is equivalent to a "contains" search with "caseSensitive": false (or not provided).
Fragment
In filter
The in filter is similar to the following SQL query:
SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')
The grammar for an IN filter is as follows:
{ "type": "in", "dimension": "outlaw", "values": ["Good", "Bad", "Ugly"]}
The IN filter supports the use of extraction functions, see Filtering with Extraction Functions for details.
Like filter
The like filter behaves like SQL LIKE: "%" matches any number of characters and "_" matches exactly one character.
Like filters support the use of extraction functions, see Filtering with Extraction Functions for details.
The following filter expresses last_name LIKE "D%" (i.e. last_name starts with "D"):
{ "type": "like", "dimension": "last_name", "pattern": "D%"}
Bound filter
Bound filters express ranges over dimension values: greater than, less than, greater than or equal to, less than or equal to, and "between" (if both "lower" and "upper" are set).
The "ordering" property selects the sorting order used when comparing values against the bound; see Sorting Orders.
(1) 21 <= age <= 31:
{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31", "ordering": "numeric" }
(2) foo <= name <= hoo, using the default lexicographic sorting order:
{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }
(3) 21 < age < 31:
{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31", "upperStrict": true, "ordering": "numeric" }
(4) age < 31:
{ "type": "bound", "dimension": "age", "upper": "31", "upperStrict": true, "ordering": "numeric" }
(5) age >= 18:
{ "type": "bound", "dimension": "age", "lower": "18", "ordering": "numeric" }
Interval Filter
The interval filter enables range filtering on columns that contain long millisecond values, making it a good fit for the __time column, long metric columns, and dimensions whose values are long milliseconds. It takes ISO 8601 intervals and performs left-closed, right-open matching (i.e., start <= time < end).
The following example filters on October 1-7, 2014 and November 15-16, 2014:
{ "type" : "interval", "dimension" : "__time", "intervals" : [ "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z", "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z" ] }
This is equivalent to the following OR of two bound filters:
{ "type": "or", "fields": [ { "type": "bound", "dimension": "__time", "lower": "1412121600000", "lowerStrict": false, "upper": "1412640000000" , "upperStrict": true, "ordering": "numeric" }, { "type": "bound", "dimension": "__time", "lower": "1416009600000", "lowerStrict": false, "upper": "1416096000000" , "upperStrict": true, "ordering": "numeric" } ]}
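The equivalence can be checked by converting each ISO interval endpoint to epoch milliseconds and emitting the left-closed, right-open bound filter. A sketch (the helper name is an assumption):

```python
from datetime import datetime, timezone

def interval_to_bound(dimension, interval):
    """Rewrite one ISO-8601 interval as the equivalent numeric bound filter
    (lower inclusive, upper exclusive), as shown in the docs for __time."""
    start_s, end_s = interval.split("/")
    to_millis = lambda s: int(datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%fZ")
                              .replace(tzinfo=timezone.utc).timestamp() * 1000)
    return {"type": "bound", "dimension": dimension,
            "lower": str(to_millis(start_s)), "lowerStrict": False,
            "upper": str(to_millis(end_s)), "upperStrict": True,
            "ordering": "numeric"}
```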
Column types
Druid supports filtering on timestamp, string, long, and float columns.
Filtering on numeric columns
To select rows where myFloatColumn = 10.1:
"filter": { "type": "selector", "dimension": "myFloatColumn", "value": "10.1"}
To select rows where 10 <= myFloatColumn < 20:
"filter": { "type": "bound", "dimension": "myFloatColumn", "ordering": "numeric", "lower": "10", "lowerStrict": false, "upper": "20", "upperStrict": true }
Filtering on the Timestamp Column
To filter on a long timestamp value:
"filter": { "type": "selector", "dimension": "__time", "value": "124457387532"}
Filtering on day of week:
"filter": { "type": "selector", "dimension": "__time", "value": "Friday", "extractionFn": { "type": "timeFormat", "format": "EEEE", "timeZone": "America/New_York", "locale": "en" }}
Filtering on a set of ISO 8601 intervals:
{ "type" : "interval", "dimension" : "__time", "intervals" : [ "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z", "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z" ]}
11. Components: Aggregations
Aggregations can be applied at ingestion time, to summarize data before it enters Druid, and at query time, to summarize result data.
Count aggregator
{ "type" : "count", "name" : <output_name> }
count computes the number of Druid rows that match the filters, which is not necessarily the number of raw events ingested: with roll-up configured at ingestion, rows are pre-aggregated. To count ingested events, add a count aggregator at ingestion time and use a longSum aggregator over it at query time.
Sum aggregators
longSum aggregator
Computes the sum of values as a 64-bit signed integer:
{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
fieldName is the metric column to sum over; the result is emitted under name.
doubleSum aggregator
Computes the sum of values as a 64-bit floating point value:
{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
Min / Max aggregators
doubleMin aggregator
doubleMin computes the minimum of all metric values and Double.POSITIVE_INFINITY:
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
doubleMax aggregator
doubleMax computes the maximum of all metric values and Double.NEGATIVE_INFINITY:
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
longMin aggregator
longMin computes the minimum of all metric values and Long.MAX_VALUE:
{ "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
longMax aggregator
longMax computes the maximum of all metric values and Long.MIN_VALUE:
{ "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }
First / Last aggregators
First and last aggregators cannot be used in an ingestion spec, and should only be specified as part of queries.
Note that queries with first/last aggregators on a segment created with rollup enabled will return the rolled up value, and not the last value within the raw ingested data.
doubleFirst aggregator
doubleFirst computes the metric value with the minimum timestamp, or 0 if no row exists:
{ "type" : "doubleFirst", "name" : <output_name>, "fieldName" : <metric_name> }
doubleLast aggregator
doubleLast computes the metric value with the maximum timestamp, or 0 if no row exists:
{ "type" : "doubleLast", "name" : <output_name>, "fieldName" : <metric_name> }
longFirst aggregator
longFirst computes the metric value with the minimum timestamp, or 0 if no row exists:
{ "type" : "longFirst", "name" : <output_name>, "fieldName" : <metric_name> }
longLast aggregator
longLast computes the metric value with the maximum timestamp, or 0 if no row exists:
{ "type" : "longLast", "name" : <output_name>, "fieldName" : <metric_name> }
JavaScript aggregator
Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your JavaScript functions are expected to return floating-point values.
{ "type": "javascript", "name": "<output_name>", "fieldNames" : [ <column1>, <column2>, ... ], "fnAggregate" : "function(current, column1, column2, ...) { <updates partial aggregate (current) based on the current row values> return <updated partial aggregate> }", "fnCombine" : "function(partialA, partialB) { return <combined partial results>; }", "fnReset" : "function() { return <initial value>; }"}
Example
{ "type": "javascript", "name": "sum(log(x)*y) + 10", "fieldNames": ["x", "y"], "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }", "fnCombine" : "function(partialA, partialB) { return partialA + partialB; }", "fnReset" : "function() { return 10; }"}
Approximate Aggregations
Cardinality aggregator
Computes the cardinality of a set of Druid dimensions using HyperLogLog. It is slower than the hyperUnique aggregator; when the cardinality of a single dimension is all you need, the hyperUnique aggregator is recommended instead:
{ "type": "cardinality", "name": "<output_name>", "fields": [ <dimension1>, <dimension2>, ... ], "byRow": <false | true> # (optional, defaults to false)}
Cardinality by value
When "byRow" is false (the default), the cardinality is computed over the union of all dimension values for all the given dimensions.
- For a single dimension, this is equivalent to:
SELECT COUNT(DISTINCT(dimension)) FROM <datasource>
- For multiple dimensions, this is equivalent to:
SELECT COUNT(DISTINCT(value)) FROM ( SELECT dim_1 as value FROM <datasource> UNION SELECT dim_2 as value FROM <datasource> UNION SELECT dim_3 as value FROM <datasource>)
Cardinality by row
When "byRow" is true, the cardinality is computed over distinct dimension value combinations, equivalent to:
SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM <datasource> GROUP BY DIM1, DIM2, DIM3 )
Example
Determine the number of distinct countries people are living in or have come from:
{ "type": "cardinality", "name": "distinct_countries", "fields": [ "country_of_origin", "country_of_residence" ]}
Determine the number of distinct people (i.e. combinations of first and last name).
{ "type": "cardinality", "name": "distinct_people", "fields": [ "first_name", "last_name" ], "byRow" : true}
Determine the number of distinct starting characters of last names
{ "type": "cardinality", "name": "distinct_last_name_first_char", "fields": [ { "type" : "extraction", "dimension" : "last_name", "outputName" : "last_name_first_char", "extractionFn" : { "type" : "substring", "index" : 0, "length" : 1 } } ], "byRow" : true}
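The byRow distinction can be made concrete with an exact version of what the aggregator estimates with HyperLogLog (a sketch for intuition; Druid never materializes these sets):

```python
def cardinality(rows, fields, by_row=False):
    """byRow=False: distinct values across the union of the dimensions.
    byRow=True: distinct tuples of the dimensions' values per row."""
    if by_row:
        return len({tuple(row[f] for f in fields) for row in rows})
    return len({row[f] for row in rows for f in fields})
```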
HyperUnique aggregator
Uses HyperLogLog to estimate the unique count of a dimension:
{ "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name>, "isInputHyperUnique" : false}
12. Components: Post-Aggregations
Post-aggregations are specs that compute secondary values from aggregation results before they are output.
Arithmetic post-aggregator
The arithmetic post-aggregator applies a function to the given fields, left to right. Supported functions are +, -, *, /, and quotient.
Note:
- / returns 0 if the divisor is 0.
- quotient behaves like regular floating-point division.
Arithmetic post-aggregators may also specify an ordering that defines the order of resulting values when sorting results:
- If no ordering (or null) is specified, the default floating point ordering is used.
- numericFirst ordering always returns finite values first, followed by NaN, with infinite values last.
The grammar is:
postAggregation : { "type" : "arithmetic", "name" : <output_name>, "fn" : <arithmetic_function>, "fields": [<post_aggregator>, <post_aggregator>, ...], "ordering" : <null (default), or "numericFirst">}
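The function semantics above, including the divide-by-zero rule for "/", can be sketched as a left-to-right fold (the helper name is an assumption; this is not Druid code):

```python
def arithmetic_post_agg(fn, values):
    """Apply an arithmetic post-aggregator function left to right over
    already-computed field values."""
    acc = values[0]
    for v in values[1:]:
        if fn == "+":
            acc = acc + v
        elif fn == "-":
            acc = acc - v
        elif fn == "*":
            acc = acc * v
        elif fn == "/":
            # "/" is documented to return 0 when dividing by 0
            acc = 0.0 if v == 0 else acc / v
        elif fn == "quotient":
            # regular floating-point division; note that Python raises on a
            # zero divisor here, whereas Java float division yields Infinity
            acc = acc / v
        else:
            raise ValueError("unknown arithmetic fn: " + fn)
    return acc
```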
Field accessor post-aggregator
The field accessor returns the value produced by an aggregator. fieldName refers to the output name of the aggregator given in the aggregations portion of the query:
{ "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }
Constant post-aggregator
The constant post-aggregator always returns the specified value:
{ "type" : "constant", "name" : <output_name>, "value" : <numerical_value> }
Greatest / Least post-aggregators
doubleGreatest and longGreatest compute the maximum of all fields and Double.NEGATIVE_INFINITY. doubleLeast and longLeast compute the minimum of all fields and Double.POSITIVE_INFINITY.
The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of all rows for one specific column, while doubleGreatest returns the highest value of multiple columns in one row. They are analogous to SQL's MAX and GREATEST:
{ "type" : "doubleGreatest", "name" : <output_name>, "fields": [<post_aggregator>, <post_aggregator>, ...]}
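The MAX-versus-GREATEST distinction is worth a two-line sketch (helper names are assumptions): one reduces a column over rows, the other reduces columns within one result row.

```python
def double_max(rows, column):
    """Aggregator semantics: highest value of one column across all rows."""
    return max(row[column] for row in rows)

def double_greatest(row, columns):
    """Post-aggregator semantics: highest value among several fields of one row."""
    return max(row[c] for c in columns)
```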
HyperUnique Cardinality post-aggregator
The hyperUniqueCardinality post aggregator is used to wrap a hyperUnique object such that it can be used in post aggregations.
{ "type" : "hyperUniqueCardinality", "name": <output name>, "fieldName" : <the name field value of the hyperUnique aggregator>}
It can be used in a sample calculation as so:
"aggregations" : [
  { "type" : "count", "name" : "rows" },
  { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "uniques" }
],
"postAggregations" : [
  {
    "type" : "arithmetic",
    "name" : "average_users_per_row",
    "fn" : "/",
    "fields" : [
      { "type" : "hyperUniqueCardinality", "fieldName" : "unique_users" },
      { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
    ]
  }
]
Example
Calculating a percentage (the sum divided by the row count, times 100):
{ ... "aggregations" : [ { "type" : "count", "name" : "rows" }, { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" } ], "postAggregations" : [{ "type" : "arithmetic", "name" : "average", "fn" : "*", "fields" : [ { "type" : "arithmetic", "name" : "div", "fn" : "/", "fields" : [ { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" }, { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" } ] }, { "type" : "constant", "name": "const", "value" : 100 } ] }] ...}
13. Components: Aggregation Granularity
The granularity determines how data is bucketed across the time dimension (hour, day, minute, etc.). Granularities can be simple, duration-based, or period-based.
Simple Granularities
Simple granularities are specified as a string: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter and year.
- all buckets everything into a single bucket.
- none does not bucket data; results match the ingestion granularity.
Example
Suppose the following data is ingested at millisecond granularity:
{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
If you submit a groupBy query at hour granularity:
{ "queryType":"groupBy", "dataSource":"my_dataSource", "granularity":"hour", "dimensions":[ "language" ], "aggregations":[ { "type":"count", "name":"count" } ], "intervals":[ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]}
you will get
[ { "version" : "v1", "timestamp" : "2013-08-31T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-01T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T23:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-03T03:00:00.000Z", "event" : { "count" : 1, "language" : "en" }} ]
With day granularity, you will get:
[ { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }} ]
With none granularity (matching the ingestion granularity), you will get:
[ { "version" : "v1", "timestamp" : "2013-08-31T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-01T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T23:32:45.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-03T03:32:45.000Z", "event" : { "count" : 1, "language" : "en" }} ]
With all granularity, everything collapses into a single bucket:
[ { "version" : "v1", "timestamp" : "2000-01-01T00:00:00.000Z", "event" : { "count" : 4, "language" : "en" }} ]
Duration Granularities
Duration granularities are specified as an exact duration in milliseconds, and optionally support an origin timestamp that shifts where buckets start.
The following buckets data into one-hour chunks, starting on the half hour:
{"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}
Example:
Reusing the data in the previous example, after submitting a groupBy query with 24 hours duration,
{ "queryType":"groupBy", "dataSource":"my_dataSource", "granularity":{"type": "duration", "duration": "86400000"}, "dimensions":[ "language" ], "aggregations":[ { "type":"count", "name":"count" } ], "intervals":[ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]}
you will get
[ { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" }} ]
If you set the origin for the granularity to 2012-01-01T00:30:00Z,
"granularity":{"type": "duration", "duration": "86400000", "origin":"2012-01-01T00:30:00Z"}
you will get
[ { "version" : "v1", "timestamp" : "2013-08-31T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-01T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-03T00:30:00.000Z", "event" : { "count" : 1, "language" : "en" }} ]
Note that the timestamp for each bucket starts at the 30th minute.
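The bucketing rule behind duration granularities with an origin is a plain floor (a sketch; the function name is an assumption): a timestamp lands in the bucket starting at origin plus a whole number of durations.

```python
def bucket_start(timestamp_ms, duration_ms, origin_ms=0):
    """Floor a millisecond timestamp into its duration-granularity bucket,
    offset by an optional origin."""
    return origin_ms + ((timestamp_ms - origin_ms) // duration_ms) * duration_ms
```

With a 24-hour duration and an origin at 00:30, every bucket start lands on the half hour, matching the example output above.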
Period Granularities
Period granularities are specified as arbitrary period combinations of years, months, weeks, hours, minutes and seconds (e.g. P2W, P3M, PT1H30M, PT0.750S) in ISO8601 format. They support specifying a time zone which determines where period boundaries start as well as the timezone of the returned timestamps. By default, years start on the first of January, months start on the first of the month and weeks start on Mondays unless an origin is specified.
Time zone is optional (defaults to UTC). Origin is optional (defaults to 1970-01-01T00:00:00 in the given time zone).
{"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}
This will bucket by two-day chunks in the Pacific timezone.
{"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}
This will bucket by 3-month chunks in the Pacific timezone where the three-month quarters are defined as starting from February.
Example
Reusing the data in the previous example, if you submit a groupBy query with 1 day period in Pacific timezone,
{ "queryType":"groupBy", "dataSource":"my_dataSource", "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"}, "dimensions":[ "language" ], "aggregations":[ { "type":"count", "name":"count" } ], "intervals":[ "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00" ]}
you will get
[ { "version" : "v1", "timestamp" : "2013-08-30T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000-07:00", "event" : { "count" : 2, "language" : "en" }} ]
Note that the timestamp for each bucket has been converted to Pacific time. Row {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
are put in the same bucket because they are in the same day in Pacific time.
Also note that the intervals
in groupBy query will not be converted to the timezone specified, the timezone specified in granularity is only applied on the query results.
If you set the origin for the granularity to 1970-01-01T20:30:00-08:00,
"granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles", "origin": "1970-01-01T20:30:00-08:00"}
you will get
[ { "version" : "v1", "timestamp" : "2013-08-29T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-08-30T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-01T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" }}, { "version" : "v1", "timestamp" : "2013-09-02T20:30:00.000-07:00", "event" : { "count" : 1, "language" : "en" }} ]
Note that the origin
you specified has nothing to do with the timezone, it only serves as a starting point for locating the very first granularity bucket. In this case, Row {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
are not in the same bucket.