Kylin源码解析——Cube构建过程中如何实现降维

来源：互联网发布：acl 软件编辑：程序博客网时间：2024/06/06 01:19

http://blog.csdn.net/qq_21653785/article/details/73731001

维度简述

Kylin中Cube的描述类CubeDesc有两个字段，rowkey和aggregationGroups。

@JsonProperty("rowkey")private RowKeyDesc rowkey;@JsonProperty("aggregation_groups")private List<AggregationGroup> aggregationGroups;1
2
3
4
5

其中rowkey描述的是该Cube中所有维度，在将统计结果存储到HBase中，各维度在rowkey中的排序情况，如下是rowkey的一个样例，包含6个维度。在描述一种维度组合时，是通过二进制来表示。
如这6个维度，都包含时，是 111111。
如 111001，则表示只包含INSERT_DATE、VISIT_MONTH、VISIT_QUARTER、IS_CLICK这四个维度。
二进制从左到右表示的就是rowkey_columns中各个维度的包含与否，1包含，0不包含。
这样的一个二进制组合就是一个cuboid，用long整型表示。

"rowkey": {    "rowkey_columns": [        {            "column": "DW_OLAP_CPARAM_INFO_VERSION2.INSERT_DATE",            "encoding": "dict",            "isShardBy": false        },        {            "column": "DW_OLAP_CPARAM_INFO_VERSION2.VISIT_MONTH",            "encoding": "dict",            "isShardBy": false        },        {            "column": "DW_OLAP_CPARAM_INFO_VERSION2.VISIT_QUARTER",            "encoding": "dict",            "isShardBy": false        },        {            "column": "DW_OLAP_CPARAM_INFO_VERSION2.BUSINESS_TYPE",            "encoding": "dict",            "isShardBy": false        },        {            "column": "DW_OLAP_CPARAM_INFO_VERSION2.SHOP_TYPE",            "encoding": "dict",            "isShardBy": false        },        {            "column": "DW_OLAP_CPARAM_INFO_VERSION2.IS_CLICK",            "encoding": "dict",            "isShardBy": false        },    ]}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

而aggregationGroups则描述的是这些维度的分组情况，也就是在一个Cube中的所有维度，可以分成多个分组，每个分组就是一个AggregationGroup，各AggregationGroup之间是相互独立的。

对于所有的维度为什么要做分组？

在Kylin中会预先把所有维度的各种组合下的统计结果原先计算出来，假设维度有N个，那么维度的组合就有2^N中组合，比如N=6，则总的维度组合就有2^6=64种。

如果能够根据实际查询的需求，发现某些维度之间是不会有交叉查询的，那其实把这些维度组合的统计结果计算出来，也是浪费，因为后续的查询中，压根不会用到，这样既浪费了计算资源，更浪费了存储资源，所有可以按实际的查询需求，将维度进行分组，比如6个维度，分成2组，一组4个维度，一组2个维度，则总的维度组合则是2^4+2^2=20，比64小了很多，这里的分组这是举例说明分组，可以有效的减少维度组合，从而缩减存储空间，另外各个分组之间是可以有共享维度的，比如6个维度，可以分成两组，一组4个，另一组3个，两个分组中的共享维度，在后续计算中，其对应的统计结果不会被计算两次，只会计算一次，这也是Kylin聪明的地方。

一个AggragationGroup中包含includes和selectRule两个字段，其中includes就是该分组中包含了哪些维度，是一个字符串数组。

@JsonProperty("includes")private String[] includes;@JsonProperty("select_rule")private SelectRule selectRule;1
2
3
4
5

AggregationGroup详见 https://mp.weixin.qq.com/s?__biz=MzAwODE3ODU5MA==&mid=2653077921&idx=1&sn=89ae88bc63e71098166b74df7106c7bf&chksm=80a4bf50b7d3364692903aac3e901d09a516a8ff635e690e1e22b1d96abb4b2925c98cdace82&scene=21#wechat_redirect

强制维度——在每一个维度组合中都必须出现的维度，详见 https://mp.weixin.qq.com/s?__biz=MzAwODE3ODU5MA==&mid=2653077943&idx=1&sn=007d2ba345d0e25ec12807aa47f9913d&chksm=80a4bf46b7d33650465d33e20dac7edc09a7ad9308d77de6a501685c8ae00cba661c1d612074&scene=21#wechat_redirect

层级维度——则是那些有层级关系的维度，如省、市、县，详见 https://mp.weixin.qq.com/s?__biz=MzAwODE3ODU5MA==&mid=2653077929&idx=1&sn=c76ed1fbb745945a077d9ca99f159a4d&chksm=80a4bf58b7d3364e0346ad9c433d4e32c57d45f41b361ae653c64c7fcebab21238793d2f66cb&scene=21#wechat_redirect

联合维度——则是那些要么不出现，要出现就必须一起出现的维度，详见 https://mp.weixin.qq.com/s?__biz=MzAwODE3ODU5MA==&mid=2653077926&idx=1&sn=a0037628bd102ec8e607d67204cbfa7c&chksm=80a4bf57b7d336419896c9e801a51f08ead2f7727d0d0ec0f9e3b7799ae3c302ebea54f93cc0&scene=21#wechat_redirect

如下是只有一个分组的样例。

"aggregation_groups": [    {        "includes": [                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID_SEARCH",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.BUSINESS_TYPE",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID0”,                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID1”,                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID2”,                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID3",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME0",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME1",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME2",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME3",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.INSERT_DATE",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.SHOP_TYPE”                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.USERID",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.SHOPID"                    ],        "select_rule": {            "hierarchy_dims": [                [                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID0",                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID1",                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID2",                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCID3"                ],                [                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME0",                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME1",                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME2",                    "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.FCNAME3"                ]            ],        "mandatory_dims": [                            "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.INSERT_DATE",                            "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.BUSINESS_TYPE"                ],        "joint_dims": [                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.USERID",                        "DW_OLAP_AD_NORMAL_CONTRAST_VERSION2.SHOPID"                    ]        }    }]1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

-cuboid的有效性判断

在进行降维分析之前，先简单减少一下，给定的一个cuboid的，比如 110011 ，这样一个cuboid，如何判断在一个AggregationGroup中是否是有效的？判断逻辑在Cuboid类的isValid方法中，就是用来判断给定的一个cuboidID，在一个AggregationGroup中是否是一个合法有效的cuboidID。

static boolean isValid(AggregationGroup agg, long cuboidID) {    // 前面说明，一个cuboidID就是一组维度的组合，1位包含，0为不包含，所以cuboidID必定大于0    if (cuboidID <= 0) {        return false; //cuboid must be greater than 0        }    // 一个cuboidID在一个AggregationGroup中是否有效的前提，是它包含的维度必须都要是该AggregationGroup中的维度才行    // agg.getPartialCubeFullMask()获取的就是该AggregationGroup中所有维度组成的一个掩码    if ((cuboidID & ~agg.getPartialCubeFullMask()) != 0) {        return false; //a cuboid's parent within agg is at most partialCubeFullMask        }    // 接下来则分别进行了强制维度、层级维度、联合维度的校验，都校验通过时，才能算是有效合法的    return checkMandatoryColumns(agg, cuboidID) && checkHierarchy(agg, cuboidID) && checkJoint(agg, cuboidID);}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

从上面的逻辑可以看出，判断一个cuboidID在一个AggregationGroup中是否合法有效的逻辑很清晰，首先该cuboidID要至少包含一个维度，然后包含的维度需要是该AggregationGroup中维度的子集，最后就是在进行强制维度、层级维度、联合维度的规则校验。

强制维度的校验逻辑，简单说就是cuboidID中需要包含强制维度的所有维度，另外当，cuboidID中只包含强制维度的维度时，则根据配置中是否允许这种情况，进行判断，具体逻辑如下：

private static boolean checkMandatoryColumns(AggregationGroup agg, long cuboidID) {    // agg.getMandatoryColumnMask() 获取的是所有强制维度组成的二进制    long mandatoryColumnMask = agg.getMandatoryColumnMask();    // 如果没有包含所有强制维度,则返回false     if ((cuboidID & mandatoryColumnMask) != mandatoryColumnMask) {        return false;    } else {        // 如果包含了整个cube的所有维度，则总是返回true的        if (cuboidID == getBaseCuboidId(agg.getCubeDesc())) {            return true;        }        // 如果配置中允许该cuboidID中的维度都是强制维度,则返回true         // 如果不允许全部,则cuboidID中需要包含除强制维度以为的维度                return agg.isMandatoryOnlyValid() || (cuboidID & ~mandatoryColumnMask) != 0;    }}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

层级维度的校验逻辑，校验逻辑简单明了，只要cuboidID中包含某个层级维度中的维度，则必须与该层级维度的某个具体的组合相匹配才行，否则就是无效的。

比如省、市、县这样一个层级维度，当cuboidID中包含省、市、县这三个维度中的某些维度的时候，也即是cuboidID & hierarchyMasks.fullMask 大于0的时候，则cuboidID中包含的这个层级维度的组合只能是《省》、《省、市》、《省、市、县》这三种组合，如果包含的是《省、县》或者《市、县》或者其他组合，则都是无效的。具体逻辑如下。

private static boolean checkHierarchy(AggregationGroup agg, long cuboidID) {    List<HierarchyMask> hierarchyMaskList = agg.getHierarchyMasks();    // if no hierarchy defined in metadata        if (hierarchyMaskList == null || hierarchyMaskList.size() == 0) {        return true;    }    hier: for (HierarchyMask hierarchyMasks : hierarchyMaskList) {        // 如果包含了某个层级维度组中的维度,则就需要包含该层级维度组中的某种具体组合才行            long result = cuboidID & hierarchyMasks.fullMask;        if (result > 0) {            for (long mask : hierarchyMasks.allMasks) {                if (result == mask) {                    continue hier;                }            }            return false;        }    }    return true;}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

联合维度的校验逻辑，联合维度顾名思义，就是连在一起的，要么一起出现，要么都不出现，校验逻辑如下：

private static boolean checkJoint(AggregationGroup agg, long cuboidID) {    for (long joint : agg.getJoints()) {        long common = cuboidID & joint;        // 如果包含了某个联合组中的维度,则就必须包含该联合组中的全部维度          if (!(common == 0 || common == joint)) {            return false;        }    }    return true;}1
2
3
4
5
6
7
8
9
10

上述分析了判断一个cuboidID在一个AggregationGroup中是否有效的判断，那判断一个cuboidID在一个Cube中是否有效，就是判断这个cuboidID在该Cube的所有AggregationGroup中都是有效的，逻辑如下：

public static boolean isValid(CubeDesc cube, long cuboidID) {    //base cuboid is always valid        if (cuboidID == getBaseCuboidId(cube)) {        return true;    }    // 就是这个循环，遍历了所有的AggregationGroup    for (AggregationGroup agg : cube.getAggregationGroups()) {        if (isValid(agg, cuboidID)) {            return true;        }    }    return false;}1
2
3
4
5
6
7
8
9
10
11
12
13
14

-降维逻辑

对于维度的升降操作主要在类CuboidScheduler中，对应的方法则是

public Set<Long> getPotentialChildren(long parent) {    ...}public long getParent(long child) {    ...}1
2
3
4
5
6
7

首先来看getPotentialChildren这个方法，就是给定一个cuboid，找出其所有的潜在的子cuboid，这里的子cuboid就是说parent通过减少一个或者多个维度，得到的新的cuboid。

public Set<Long> getPotentialChildren(long parent) {    // Cuboid.getBaseCuboid(cubeDesc).getId() 获取的就是该Cube的所有维度都存在的cuboid，比如6个维度，则111111    // Cuboid.isValid(cubeDesc, parent) 是判断parent这个cuboid是不是一个有效的cuboid    // 这里就是判断给的parent这个cuboid是否是一个有效的cuboid    if (parent != Cuboid.getBaseCuboid(cubeDesc).getId() && !Cuboid.isValid(cubeDesc, parent)) {        throw new IllegalStateException();    }    HashSet<Long> set = Sets.newHashSet();    if (Long.bitCount(parent) == 1) {        // 如果parent中只包含一个维度了，则就不需要在进一步降维了，再降维就是空了             return set;    }    // 如果parent包含了Cube中的所有维度    if (parent == Cuboid.getBaseCuboidId(cubeDesc)) {        //那么这个时候，parent的子cuboidID中，就应该包含Cube中的所有AggregationGroup的BaseCuboidID              for (AggregationGroup agg : cubeDesc.getAggregationGroups()) {            long partialCubeFullMask = agg.getPartialCubeFullMask();            if (partialCubeFullMask != parent && Cuboid.isValid(agg, partialCubeFullMask)) {                set.add(partialCubeFullMask);            }        }    }    // Cuboid.getValidAggGroupForCuboid(cubeDesc, parent)就是找出Cube中，parent在其中合法的AggregationGroup    // 然后依次遍历这些AggregationGroup    for (AggregationGroup agg : Cuboid.getValidAggGroupForCuboid(cubeDesc, parent)) {        // 对于普通的维度，就是除去强制维度、层级维度、联合维度之后，还剩下的维度                for (long normalDimMask : agg.getNormalDims()) {            long common = parent & normalDimMask;            long temp = parent ^ normalDimMask;            // 对于每一个普通维度                         // 如果在parent中存在,则将其从parent中移除后降维得到的temp,如果在该group中,仍然是一个有效的cuboidID,则算一个parent的child            if (common != 0 && Cuboid.isValid(agg, temp)) {                set.add(temp);            }        }        // 特别注意一下,这里为了简单理解,所以假设的parent和层级维度的取值,都是顺序的,        // dims一次为00000100、00000010、00000001,                 // 真实的情况是dims的取值可能为 00000001、10000000、00010000,这里的顺序都是反映了该维度在rowkey中的顺序         *        // 针对层级维度的降维                 // 建设parent为 11111111         // 层级维度为 fullMask 00000111 , allMasks 为 00000100、00000110、00000111, dims为 00000100、00000010、00000001        // for (int i = hierarchyMask.allMasks.length - 1; i >= 0; i--)这层循环,allMasks[i]遍历顺序为 00000111、00000110、00000100        // 比如第一次循环allMasks[i]取00000111,与parent与操作,就是判断allMasks[i]中的维度是否都包含在parent中,如果都包含在parent中,进入if条件        // 这时候取出allMasks[i]为00000111,这个组合中的最低级的维度为00000001,然后判断该维度是否是联合维度的一员,如果不是,进入if条件                 // 然后将层级维度的最末一级去掉,这里就是去掉00000001这一维度,去掉后的cuboidID为 11111111^00000001=11111110                 // 然后判断11111110是否在该group中是一个有效的cuboidID,如果是,则作为parent的child                 for (AggregationGroup.HierarchyMask hierarchyMask : agg.getHierarchyMasks()) {            for (int i = hierarchyMask.allMasks.length - 1; i >= 0; i--) {                 // 只有当层级维度中的某个组合中的维度都在parent中时,才进入if条件                if ((parent & hierarchyMask.allMasks[i]) == hierarchyMask.allMasks[i]) {                    // 所有联合维度中都不包含当前层级维度组合中的最低维度时,进入if条件                    if ((agg.getJointDimsMask() & hierarchyMask.dims[i]) == 0) {                            if (Cuboid.isValid(agg, parent ^ hierarchyMask.dims[i])) {                                //only when the hierarchy dim is not among joints                                                            set.add(parent ^ hierarchyMask.dims[i]);                            }                    }                    break;    //if hierarchyMask 111 is matched, won't check 110 or 100                }            }        }        //joint dim section                // 联合维度相对比较简单,如果包含某个联合维度,则将其全部去除,再判断其有效性,如果有效,则加入parent的child队列         for (long joint : agg.getJoints()) {            if ((parent & joint) == joint) {                if (Cuboid.isValid(agg, parent ^ joint)) {                    set.add(parent ^ joint);                }            }        }    }    return set;}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81

降维操作主要是就是针对3类维度进行降维操作，普通维度（一个AggregationGroup的所有维度除去强制维度、层级维度、联合维度之后还剩余的维度）、层级维度、联合维度。

普通维度的降维就是首先判断parent是否包含该普通维度，如果包含，则将其从parent中移除，然后判断移除后的cuboidID在该AggregationGroup中是否有效合法；

层级维度的降维，首先parent中需要包含某个层级维度的某种组合，然后再将该层级维度组合中的最末级的维度移除，得到的cuboidID再去校验合法性；

联合维度的降维最直接明了，包含就全部去除，然后校验合法性。

以上就是通过一个给定的cuboidID，获取所有可能的子cuboidID的逻辑，也就是降维的过程。

-升维逻辑

那既然进行降维操作已经有了，为什么还要有一个getParent方法呢？其实从方法名中可以一探一二，getPotentialChildren获取可能的孩子，这就是说getPotentialChildren方法的逻辑获取的所有child只是说，可能是parent的child，但未必真的是，所以在getSpanningCuboid方法中，先通过getPotentialChildren获取了所以潜在的child，然后又对每一个potential，都去获取其对应的父亲，看是否与给定的这个parent一致，如果一致，才说明父子相认，也就是父亲认了儿子，同时也需要儿子认了父亲才行。

public List<Long> getSpanningCuboid(long cuboid) {    if (cuboid > max || cuboid < 0) {        throw new IllegalArgumentException("Cuboid " + cuboid + " is out of scope 0-" + max);    }    List<Long> result = cache.get(cuboid);    if (result != null) {        return result;    }    result = Lists.newArrayList();    Set<Long> potentials = getPotentialChildren(cuboid);    for (Long potential : potentials) {        if (getParent(potential) == cuboid) {            result.add(potential);        }    }    cache.put(cuboid, result);    return result;}    接着看下getParent的逻辑，getParent方法的逻辑与getPotentialChildren的逻辑刚好反过来，是一个升维的过程。public long getParent(long child) {    List<Long> candidates = Lists.newArrayList();    long baseCuboidID = Cuboid.getBaseCuboidId(cubeDesc);    // 如果该child等于fullMask 或者 该child不是有效的cuboidID,则抛异常      // 这也好理解，fullMask是不可能存在父亲的，因为它就是所有cuboidID的老祖宗   if (child == baseCuboidID || !Cuboid.isValid(cubeDesc, child)) {        throw new IllegalStateException();    }    // 这里与getPotentialChildren一样，也是首选找出所有可能的AggregationGroup，然后开始遍历    for (AggregationGroup agg : Cuboid.getValidAggGroupForCuboid(cubeDesc, child)) {        // thisAggContributed 这个变量标识 当前该AggregationGroup是否已经贡献出了一个parent       boolean thisAggContributed = false;        // 这里也好理解，如果child就是该AggregationGroup的基cuboidID，那么它的父亲只能是Cube的基cuboidID       if (agg.getPartialCubeFullMask() == child) {                    return baseCuboidID;        }        //+1 dim        //add one normal dim (only try the lowest dim)                // 这里只会添加lowest维度,是跟最后的Collections.min有呼应的                 // 因为最后只会选择所有满足条件中的维度数最少,在相同维度数中,值最小的那个候选者,                 // 所以这里就没有必要把高位的维度添加进去,反正最后也会被过滤掉        // 这一点在后面的升维中都会有所体现       long normalDimsMask = (agg.getNormalDimsMask() & ~child);        if (normalDimsMask != 0) {            candidates.add(child | Long.lowestOneBit(normalDimsMask));            thisAggContributed = true;        }        // 开始层级维度的升维        for (AggregationGroup.HierarchyMask hierarchyMask : agg.getHierarchyMasks()) {            if ((child & hierarchyMask.fullMask) == 0) {                // 这里只加入最高级的那个维度,其他维度不继续处理的原因,也是跟最后的排序,只取维度最少有关                 candidates.add(child | hierarchyMask.dims[0]);                thisAggContributed = true;            } else {                for (int i = hierarchyMask.allMasks.length - 1; i >= 0; i--) {                    // 只有与层级维度的某个组合匹配时，才会进入if条件                    if ((child & hierarchyMask.allMasks[i]) == hierarchyMask.allMasks[i]) {                        if (i == hierarchyMask.allMasks.length - 1) {                            // 感觉这里应该用break,而不是continue,虽然这里用contine也不会有问题                            // 如果某个层级维度的所有维度都已经在child中,则child无法再添加维度来形成parent了                            // 比如省、市、县，如果child中已经包含了省、市、县，则没法再进一步添加这个层级的维度了                        continue;//match the full hierarchy                        }                        if ((agg.getJointDimsMask() & hierarchyMask.dims[i + 1]) == 0) {                            // 如果是 省、市，则可以添加一个 县 维度进来，如果是省，则可以添加一个 市 维度进来                            if ((child & hierarchyMask.dims[i + 1]) == 0) {                                //only when the hierarchy dim is not among joints                                                                candidates.add(child | hierarchyMask.dims[i + 1]);                                thisAggContributed = true;                            }                        }                        // 这里的break，就是说，如果已经有一个多维层级组合满足要求了，就无需进一步检查少维度的层级组合了                        // 比如已经 省、市，这个组合已经满足了，就没必要再去检查 省 这个维度组合了。                        break;//if hierarchyMask 111 is matched, won't check 110 or 100                    }                }            }        }        // 如果经过上面的普通维度和层级维度,添加维度操作后,已经找到了候选parent,则无需再进行联合维度的操作        // 因为联合维度至少会加2个维度进来,根据最后的Collections.min,会优先选维度数少的       if (thisAggContributed) {            //next section is going to append more than 2 dim to child                        //thisAggContributed means there's already 1 dim added to child                        //which can safely prune the 2+ dim candidates.                        continue;        }        //2+ dim candidates                // 联合维度的很简单，如果没有包含，则直接全部加入       for (long joint : agg.getJoints()) {            if ((child & joint) == 0) {                candidates.add(child | joint);            }        }    }    if (candidates.size() == 0) {        throw new IllegalStateException();    }    // 这里的Collections.min就是上述很多地方可以提前结束的原因    return Collections.min(candidates, Cuboid.cuboidSelectComparator);}1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112

这个升维的过程，在进入AggregationGroup遍历后，主要通过增加一个维度的升维，和增加2个或以上维度的升维，主要也即是联合维度了。

对于增加1个维度的升维：
对于普通维度，则从所有普通维度中，选择一个在rowkey中排在最后面的那个维度，然后添加到child中；

对于层级维度，如果是该层级维度中的维度都不包含，则取该层级维度中最高级的那个维度添加到child中；如果是child只包含了该层级维度中所有维度的部分维度，比如对于省、市、县这个层级维度，只包含了省或者省市，则可以新增一个市或者县到child中；

如果在1个维度的升维中已经找到了一个候选的parent，则联合维度就不需在进行了，因为联合维度至少会加入两个维度。

再来看一下getParent方法的最后一句代码，就明白为什么升维的过程中，很多潜在的parent可以直接忽略掉。

Cuboid.cuboidSelectComparator的实现如下。

也就是对于任何两个cuboidID，先从中选出包含维度少的那个cuboidID，如果两个cuboidID包含的维度数相同，则在进一步比较，值小的为所需要的cuboidID。

也即是getParent获取的所有候选parent的集合candidates，经过这个比较器排序后，最小的那个cuboidID，就是包含维度最少，且在相同纬度的不同cuboidID中，值是最小的那个。

//smaller is betterpublic final static Comparator<Long> cuboidSelectComparator = new Comparator<Long>() {    @Override        public int compare(Long o1, Long o2) {        return ComparisonChain.start().compare(Long.bitCount(o1), Long.bitCount(o2)).compare(o1, o2).result();    }};

阅读全文

0 0