4 - Druid Data Ingestion - 1


1. Data Formats

[1] Data Formats
http://druid.io/docs/0.10.1/ingestion/data-formats.html
(1) Ingesting well-formatted data: JSON, CSV, TSV
(2) Custom formats: parsed with the Regex parser or the JavaScript parser (see the sketch at the end of this section)
(3) Other formats:
http://druid.io/docs/0.10.1/development/extensions.html
[2] Configuration
The data format is configured via the parseSpec field inside dataSchema.
For details, see: http://druid.io/docs/0.10.1/ingestion/data-formats.html
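As a hedged illustration of the custom-format path in item (2), a Regex parseSpec could look like the sketch below; the pattern and column names are hypothetical, not from the original article:

"parser" : {
  "type" : "string",
  "parseSpec" : {
    "format" : "regex",
    "pattern" : "^(\\S+)\\t(\\S+)\\t(\\S+)$",
    "columns" : ["timestamp", "page", "user"],
    "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
    "dimensionsSpec" : { "dimensions" : ["page", "user"] }
  }
}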

2. Data Schema

This section covers the ingestion spec. An ingestion spec consists of three parts:


{  "dataSchema" : {...},  "ioConfig" : {...},  "tuningConfig" : {...}}
| Field | Type | Description | Required |
|---|---|---|---|
| dataSchema | JSON Object | Specifies the schema of the ingested data; can be shared across different specs | yes |
| ioConfig | JSON Object | Specifies where the data comes from and where it goes; varies with the ingestion method | yes |
| tuningConfig | JSON Object | Specifies how to tune ingestion parameters; varies with the ingestion method | no |

DataSchema

 
"dataSchema" : {  "dataSource" : "wikipedia",  "parser" : {    "type" : "string",    "parseSpec" : {      "format" : "json",      "timestampSpec" : {        "column" : "timestamp",        "format" : "auto"      },      "dimensionsSpec" : {        "dimensions": [          "page",          "language",          "user",          "unpatrolled",          "newPage",          "robot",          "anonymous",          "namespace",          "continent",          "country",          "region",          "city",          {            "type": "long",            "name": "countryNum"          },          {            "type": "float",            "name": "userLatitude"          },          {            "type": "float",            "name": "userLongitude"          }        ],        "dimensionExclusions" : [],        "spatialDimensions" : []      }    }  },  "metricsSpec" : [{    "type" : "count",    "name" : "count"  }, {    "type" : "doubleSum",    "name" : "added",    "fieldName" : "added"  }, {    "type" : "doubleSum",    "name" : "deleted",    "fieldName" : "deleted"  }, {    "type" : "doubleSum",    "name" : "delta",    "fieldName" : "delta"  }],  "granularitySpec" : {    "segmentGranularity" : "DAY",    "queryGranularity" : "NONE",    "intervals" : [ "2013-08-31/2013-09-01" ]  }}
| Field | Type | Description | Required |
|---|---|---|---|
| dataSource | String | Name of the datasource to ingest into; a datasource can be thought of as a table | yes |
| parser | JSON Object | How the ingested data should be parsed | yes |
| metricsSpec | JSON Object array | List of aggregators | yes |
| granularitySpec | JSON Object | How segments are created and how data is rolled up | yes |

Parser

"parser" : {    "type" : "string",    "parseSpec" : {      "format" : "json",      "timestampSpec" : {        "column" : "timestamp",        "format" : "auto"      },      "dimensionsSpec" : {        "dimensions": [          "page",          "language",          "user",          "unpatrolled",          "newPage",          "robot",          "anonymous",          "namespace",          "continent",          "country",          "region",          "city",          {            "type": "long",            "name": "countryNum"          },          {            "type": "float",            "name": "userLatitude"          },          {            "type": "float",            "name": "userLongitude"          }        ],        "dimensionExclusions" : [],        "spatialDimensions" : []      }    }  }
type defaults to string; for other data formats, see the extensions list.
String Parser
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Generally string, or hadoopyString when used in a Hadoop indexing job | no |
| parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data | yes |
ParseSpec

The parseSpec serves two purposes:

  • The String Parser uses the parseSpec to determine the data format (JSON, CSV, TSV) of the rows it will process
  • All parsers use the parseSpec to determine the timestamp and dimensions of the rows they will process

The format field defaults to tsv.

JSON ParseSpec
| Field | Type | Description | Required |
|---|---|---|---|
| format | String | json | no |
| timestampSpec | JSON Object | Column and format of the timestamp | yes |
| dimensionsSpec | JSON Object | Dimensions of the data | yes |
| flattenSpec | JSON Object | Specifies how nested JSON is flattened; see Flattening JSON | no |
JSON Lowercase ParseSpec

Lowercases all column names in the incoming JSON data.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | jsonLowercase | yes |
| timestampSpec | JSON Object | Column and format of the timestamp | yes |
| dimensionsSpec | JSON Object | Dimensions of the data | yes |
CSV ParseSpec

Use the String Parser to load CSV; strings are parsed with the net.sf.opencsv library.

| Field | Type | Description | Required |
|---|---|---|---|
| format | String | csv | yes |
| timestampSpec | JSON Object | Column and format of the timestamp | yes |
| dimensionsSpec | JSON Object | Dimensions of the data | yes |
| listDelimiter | String | Delimiter for multi-value dimensions | no (default == ctrl+A) |
| columns | JSON array | Columns of the data | yes |
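A minimal sketch of a CSV parseSpec using the fields above; the column names are hypothetical:

"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "dimensionsSpec" : { "dimensions" : ["page", "user"] },
  "listDelimiter" : "|",
  "columns" : ["timestamp", "page", "user", "added"]
}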

TimestampSpec

| Field | Type | Description | Required |
|---|---|---|---|
| column | String | Column of the timestamp | yes |
| format | String | iso, millis, posix, auto, or any Joda time format | no (default == 'auto') |
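For example, a timestampSpec for a non-ISO timestamp would use a Joda time pattern; the column name and pattern here are hypothetical:

"timestampSpec" : {
  "column" : "ts",
  "format" : "yyyy-MM-dd HH:mm:ss"
}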

DimensionsSpec

| Field | Type | Description | Required |
|---|---|---|---|
| dimensions | JSON array | A list of dimension schema objects or dimension names identifying the dimension columns; if empty, all string columns other than the timestamp are treated as dimensions | yes |
| dimensionExclusions | JSON String array | Names of dimensions to exclude from ingestion | no (default == []) |
| spatialDimensions | JSON Object array | Spatial dimensions | no (default == []) |

Dimension Schema

A dimension schema specifies the type and name of a dimension to be ingested; if no type is given, it defaults to string.

 "dimensionsSpec" : {  "dimensions": [    "page",    "language",    "user",    "unpatrolled",    "newPage",    "robot",    "anonymous",    "namespace",    "continent",    "country",    "region",    "city",    {      "type": "long",      "name": "countryNum"    },    {      "type": "float",      "name": "userLatitude"    },    {      "type": "float",      "name": "userLongitude"    }  ],  "dimensionExclusions" : [],  "spatialDimensions" : []}
 

GranularitySpec

 "granularitySpec" : {    "segmentGranularity" : "DAY",    "queryGranularity" : "NONE",    "intervals" : [ "2013-08-31/2013-09-01" ]  }

The granularity spec defaults to uniform and can be configured via the type field; currently the uniform and arbitrary types are supported.

 
Uniform Granularity Spec

Partitions data into segments with uniform intervals.

| Field | Type | Description | Required |
|---|---|---|---|
| segmentGranularity | string | The granularity at which segments are created | no (default == 'DAY') |
| queryGranularity | string | The minimum granularity at which results can be queried; data is stored in segments at this granularity. For example, "minute" means the data is aggregated at minute granularity: on collisions of the (minute(timestamp), dimensions) tuple, values are combined with the aggregators instead of storing individual rows | no (default == 'NONE') |
| rollup | boolean | Whether to roll up or not | no (default == true) |
| intervals | string | List of intervals of raw data to ingest; ignored for real-time ingestion | yes for batch, no for real-time |
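To make the queryGranularity behavior concrete, here is a minimal sketch with hypothetical events, assuming queryGranularity "minute", a count aggregator, and a doubleSum on added. Two events that collide on the (minute(timestamp), dimensions) tuple:

{"timestamp": "2013-08-31T01:02:33Z", "page": "Justin Bieber", "added": 20}
{"timestamp": "2013-08-31T01:02:59Z", "page": "Justin Bieber", "added": 15}

are stored as a single rolled-up row:

{"timestamp": "2013-08-31T01:02:00Z", "page": "Justin Bieber", "count": 2, "added": 35}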
 
Arbitrary Granularity Spec

Chooses intervals based on segment size; not supported for real-time ingestion.

| Field | Type | Description | Required |
|---|---|---|---|
| queryGranularity | string | Same as above | no (default == 'NONE') |
| rollup | boolean | Whether to roll up or not | no (default == true) |
| intervals | string | Same as above | yes for batch, no for real-time |
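A minimal sketch of an arbitrary granularity spec, reusing the interval from the uniform example above:

"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}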

3. Schema Design

Druid divides normalized data into three categories: a timestamp, a dimension, or a measure (or a metric/aggregator as they are known in Druid).

In more detail:

  • Timestamp: required on every row; data is partitioned by time, every query has a time filter, and query results can be bucketed by time (minutes, hours, days, and so on)
  • Dimensions: can be filtered on or grouped by; generally single Strings, arrays of Strings, single Longs, or single Floats
  • Metrics: can be aggregated and sorted

Production tables (datasources) generally have fewer than 100 dimension columns and fewer than 100 metrics.

Numeric dimensions

Numeric dimensions (Long or Float) must be declared in the dimensionsSpec, otherwise they default to strings. Numeric columns are faster to group on, but slower to filter on because they are not indexed. See Dimension Schema.

High cardinality dimensions (e.g. unique IDs)

In practice, exact count-distinct is often unnecessary. Ingesting raw unique IDs as a dimension kills roll-up and hurts compression; instead, aggregate the IDs into a sketch at ingestion time, which improves performance and reduces storage. Druid's hyperUnique aggregator is based on HyperLogLog; see here.
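For example, rather than ingesting a raw user_id dimension (a hypothetical column name), the IDs can be sketched at ingestion time:

"metricsSpec" : [
  { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }
]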

Nested dimensions

Nested dimensions are not supported. Data like the following

 {"foo":{"bar": 3}}

should be transformed before indexing into:

 {"foo_bar": 3}
Counting the number of ingested events

A count aggregator at ingestion time records how many events were ingested; at query time, read it back with a longSum aggregator. This result can be used to measure the roll-up rate.

Ingestion spec:

 ..."metricsSpec" : [      {        "type" : "count",        "name" : "count"      },...:

Query the ingested event count as follows:

 ..."aggregations": [    { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },...
Schema-less dimensions

When the dimensions list is empty or missing in the spec, all non-timestamp columns are ingested as string dimensions.
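A minimal sketch of a schema-less spec simply leaves the dimensions list empty:

"dimensionsSpec" : {
  "dimensions" : [],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}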


Including the same column as a dimension and a metric

If a column needs to be both a dimension and, for distinct-count purposes, a hyperUnique metric, the column has to be duplicated during ETL.


Duplicate the column in ETL:

 {"device_id_dim":123, "device_id_met":123}
And in metricsSpec:
 { "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }

device_id_dim will automatically be picked up as a dimension.


4. Schema Changes

A datasource's schema can be changed at any time, and different schemas may coexist across its segments.

Replacing Segments

Segments are identified by datasource, interval, version, and partition number. The partition number is only visible when multiple segments are created for the same granularity: for example, with hourly segments, if one hour contains more data than a single segment can hold, several segments are created for that hour and distinguished by partition number.

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2
Here dataSource = foo, interval = 2015-01-01/2015-01-02, version = v1, partitionNum = 0. If the data is later indexed with a new schema, the newly created segments get a higher version id:
foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2
Druid batch indexing (either Hadoop-based or IndexTask-based) guarantees atomic updates on an interval-by-interval basis: queries keep using the v1 segments of the 2015-01-01/2015-01-02 interval until all v2 segments for that interval have loaded into the cluster; only then do queries switch to v2, and the v1 segments are unloaded from the cluster.

An update that spans multiple segments is atomic only within each interval, not across the whole update. For example, given:

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2
before the v2 segments have fully loaded, the cluster may hold a mixture:
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2

In this case, queries may hit a mixture of v1 and v2 segments.

Different Schemas Among Segments

Segments of the same datasource may have different schemas. If a string column (dimension) exists in segment A but not in segment B, queries touching B behave as if the dimension were null. For a missing numeric column (metric), aggregations simply skip it.
