4 - Druid Data Ingestion - 1
Source: Internet | Editor: 程序博客网 | Date: 2024/04/27 19:22
1. Data Formats
【1】Data Formats
http://druid.io/docs/0.10.1/ingestion/data-formats.html
(1) Ingesting regularized data: JSON, CSV, TSV
(2) Custom formats: use the Regex parser or the JavaScript parser to parse the data
(3) Other formats: see
http://druid.io/docs/0.10.1/development/extensions.html
【2】Configuration
The data format is configured in the parseSpec field of dataSchema.
For details see: http://druid.io/docs/0.10.1/ingestion/data-formats.html
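For orientation, the same event could be expressed in each of the three natively supported formats roughly as follows (illustrative values modeled on the Wikipedia example dataset used later in this article, not literal sample data from it):

```
JSON: {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language": "en", "added": 57}
CSV:  2013-08-31T01:02:33Z,Gypsy Danger,en,57
TSV:  2013-08-31T01:02:33Z	Gypsy Danger	en	57
```

For CSV and TSV, the column names are not in the data itself and must be supplied via the columns field of the parseSpec.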
2. Data Schema
This section covers the ingestion spec, which defines how data is ingested. It consists of three parts:
{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}
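Only dataSchema is expanded below. For context, a minimal ioConfig and tuningConfig for a batch index task might look roughly like this (a sketch: the baseDir, filter, and size values are illustrative placeholders, not taken from this article):

```json
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "local",
    "baseDir" : "examples/indexing/",
    "filter" : "wikipedia_data.json"
  }
},
"tuningConfig" : {
  "type" : "index",
  "targetPartitionSize" : 5000000,
  "maxRowsInMemory" : 75000
}
```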
DataSchema
"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
      "dimensionsSpec" : {
        "dimensions": [
          "page", "language", "user", "unpatrolled", "newPage", "robot",
          "anonymous", "namespace", "continent", "country", "region", "city",
          { "type": "long", "name": "countryNum" },
          { "type": "float", "name": "userLatitude" },
          { "type": "float", "name": "userLongitude" }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [
    { "type" : "count", "name" : "count" },
    { "type" : "doubleSum", "name" : "added", "fieldName" : "added" },
    { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" },
    { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" }
  ],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  }
}
Parser
"parser" : {
  "type" : "string",
  "parseSpec" : {
    "format" : "json",
    "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
    "dimensionsSpec" : {
      "dimensions": [
        "page", "language", "user", "unpatrolled", "newPage", "robot",
        "anonymous", "namespace", "continent", "country", "region", "city",
        { "type": "long", "name": "countryNum" },
        { "type": "float", "name": "userLatitude" },
        { "type": "float", "name": "userLongitude" }
      ],
      "dimensionExclusions" : [],
      "spatialDimensions" : []
    }
  }
}
The type field defaults to string; for other data formats, see the extensions list.
String Parser
The type is usually string, or hadoopyString when used in a Hadoop indexing job. Its parseSpec specifies the format, timestamp, and dimensions of the data.
ParseSpec
The parseSpec serves two purposes:
- The String Parser uses the parseSpec to determine the format (JSON, CSV, TSV) of the rows it will parse
- All parsers use the parseSpec to determine the timestamp and dimensions of the rows
The format field defaults to tsv.
JSON ParseSpec
Used with the String Parser to load JSON. Fields:
- format (String): json; required: no (default)
- timestampSpec (JSON Object): the column and format of the timestamp; required: yes
- dimensionsSpec (JSON Object): the dimensions of the data; required: yes
- flattenSpec (JSON Object): configures how nested JSON is flattened, see Flattening JSON; required: no
JSON Lowercase ParseSpec
Lowercases the column names of the incoming JSON data. Fields:
- format (String): jsonLowercase; required: yes
- timestampSpec (JSON Object): the column and format of the timestamp; required: yes
- dimensionsSpec (JSON Object): the dimensions of the data; required: yes
CSV ParseSpec
Used with the String Parser to load CSV; strings are parsed using the net.sf.opencsv library. Fields:
- format (String): csv; required: yes
- timestampSpec (JSON Object): the column and format of the timestamp; required: yes
- dimensionsSpec (JSON Object): the dimensions of the data; required: yes
- listDelimiter (String): delimiter for multi-value dimensions; required: no (default == ctrl+A)
- columns (JSON array): the columns of the data; required: yes
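A minimal CSV parseSpec, assuming a hypothetical three-column file (timestamp, page, added), might look like:

```json
"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "columns" : [ "timestamp", "page", "added" ],
  "dimensionsSpec" : { "dimensions" : [ "page" ] }
}
```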
TimestampSpec
Specifies the column and format of the timestamp, e.g. { "column" : "timestamp", "format" : "auto" }.
DimensionsSpec
Fields:
- dimensions (JSON array): a list of dimension schema objects or dimension names, identifying the dimension columns; if left empty, all string columns other than the timestamp (and any exclusions) are treated as dimensions; required: yes
- dimensionExclusions (JSON String array): names of columns to exclude from ingestion as dimensions; required: no (default == [])
- spatialDimensions (JSON Object array): spatial dimensions; required: no (default == [])
Dimension Schema
A dimension schema specifies the type and name of a dimension to be ingested; when no type is given, it defaults to string.
"dimensionsSpec" : {
  "dimensions": [
    "page", "language", "user", "unpatrolled", "newPage", "robot",
    "anonymous", "namespace", "continent", "country", "region", "city",
    { "type": "long", "name": "countryNum" },
    { "type": "float", "name": "userLatitude" },
    { "type": "float", "name": "userLongitude" }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}
GranularitySpec
"granularitySpec" : {
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
The granularity spec defaults to uniform and is configured through the type field; the uniform and arbitrary types are currently supported.
Uniform Granularity Spec
Creates segments with uniform intervals. Fields:
- segmentGranularity (string): the granularity at which segments are created; required: no (default == 'DAY')
- queryGranularity (string): the minimum granularity at which results can be queried; data is rolled up to this granularity within each segment. For example, "minute" means the data is aggregated at minute granularity: when there are collisions in the (minute(timestamp), dimensions) tuple, the values are combined using the aggregators rather than stored as individual rows; required: no (default == 'NONE')
- rollup (boolean): whether to roll up or not; required: no (default == true)
- intervals (string): the list of intervals of the raw data being ingested; ignored by real-time ingestion
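Putting these fields together with an explicit type gives something like the following (a sketch: the HOUR/MINUTE granularities are illustrative choices, not values from this article):

```json
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "HOUR",
  "queryGranularity" : "MINUTE",
  "rollup" : true,
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
```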
Arbitrary Granularity Spec
Chooses segment intervals based on target segment size instead of a fixed period; not supported by real-time ingestion.
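An arbitrary granularity spec might look roughly like this (a sketch; note there is no segmentGranularity field, since segment intervals are chosen automatically):

```json
"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
```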
3. Schema Design
Druid classifies each column of regularized data into one of three kinds: a timestamp, a dimension, or a measure (or a metric/aggregator as they are known in Druid). More specifically:
- Timestamp: every row must have one; data is partitioned by time, every query carries a time filter, and query results can be bucketed by time (minutes, hours, days, and so on)
- Dimensions: columns that can be filtered on or grouped by; typically single Strings, arrays of Strings, single Longs, or single Floats
- Metrics: columns that can be aggregated, and sorted on
Production tables (datasources) typically have fewer than 100 dimension columns and fewer than 100 metrics.
Numeric dimensions
Numeric dimensions (Long or Float) must be declared in the dimensionsSpec; otherwise the column defaults to string. Numeric columns are faster to group on, but slower to filter on because they have no indexes. See Dimension Schema.
High cardinality dimensions (e.g. unique IDs)
In practice an exact count-distinct is rarely needed. Storing raw ID columns kills roll-up and hurts compression; replacing them with a sketch that is aggregated at ingestion time improves performance and reduces storage. Druid's hyperUnique aggregator, based on HyperLogLog, serves exactly this purpose.
Nested dimensions
Nested dimensions are not supported, so they must be flattened. For example, the input
{"foo":{"bar": 3}}
should be transformed before indexing into:
{"foo_bar": 3}
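Alternatively, the flattenSpec mentioned under JSON ParseSpec can perform this flattening at ingestion time. A sketch, reusing the foo.bar field from the example above (the flattenSpec syntax is from Druid's Flattening JSON documentation, not from this article; timestampSpec and dimensionsSpec are omitted):

```json
"parseSpec" : {
  "format" : "json",
  "flattenSpec" : {
    "useFieldDiscovery" : true,
    "fields" : [
      { "type" : "path", "name" : "foo_bar", "expr" : "$.foo.bar" }
    ]
  }
}
```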
Counting the number of ingested events
A count aggregator at ingestion time counts the number of ingested rows; at query time, use a longSum aggregator over it. The ratio between the two tells you the roll-up rate. In the ingestion spec:
..."metricsSpec" : [ { "type" : "count", "name" : "count" },...
The number of ingested rows can then be queried with:
..."aggregations": [ { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },...
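To estimate the roll-up ratio, both aggregators can be queried at once; a sketch (the query type, datasource, and interval here are illustrative):

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "granularity": "all",
  "intervals": [ "2013-08-31/2013-09-01" ],
  "aggregations": [
    { "type": "count", "name": "numDruidRows" },
    { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" }
  ]
}
```

numIngestedEvents / numDruidRows then approximates how many raw rows were rolled up into each stored row.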
Schema-less dimensions
If dimensions is left empty in the spec, every column other than the timestamp is ingested as a string dimension.
Including the same column as a dimension and a metric
A column may be needed both as a dimension and, for approximate distinct counts, as a hyperUnique metric. This has to be arranged during ETL by duplicating the column:
{"device_id_dim":123, "device_id_met":123}
and declaring in the metricsSpec:
{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
device_id_dim is then automatically picked up as a dimension.
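With explicit dimensions, the two copies of the column would be wired up roughly like this (a sketch combining the fragments above):

```json
"dimensionsSpec" : { "dimensions" : [ "device_id_dim" ] },
"metricsSpec" : [
  { "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
]
```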
4. Schema Changes
Datasource schemas can be changed at any time; Druid supports different schemas across the segments of a datasource.
Replacing Segments
A segment is identified by datasource, interval, version, and partition number. The partition number is only visible in the segment id when multiple segments are created for the same granularity interval: for example, with hourly segments, if one hour contains more data than a single segment can hold, several segments are created for that hour and distinguished by partition number:
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2
In the example above, dataSource = foo, interval = 2015-01-01/2015-01-02, version = v1, partitionNum = 0. If the data is later reindexed with a new schema, the newly created segments carry a higher version id:
foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2
Druid batch indexing (either Hadoop-based or IndexTask-based) guarantees atomic updates on an interval-by-interval basis: queries keep using only the v1 segments until every v2 segment for the interval 2015-01-01/2015-01-02 has been loaded into the cluster; once they are loaded, queries switch to the v2 segments and the v1 segments are unloaded from the cluster.
Updates that span multiple segments are therefore atomic only within each interval, not across the update as a whole. For example, starting from:
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2
until the v2 rebuild completes, the cluster may hold a mixture of versions:
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2
In this case, queries may hit a mixture of v1 and v2 segments.
Different Schemas Among Segments
Segments of the same datasource may have different schemas. If a string column (dimension) exists in segment A but not in segment B, queries against B treat that dimension as having only null values. Likewise, if a numeric column (metric) is missing from a segment, aggregations simply skip it there.