（3）聚簇因子(CLUSTERING_FACTOR )——上

来源：互联网发布：最省电的软件下载编辑：程序博客网时间：2024/05/22 01:36

大家知道数据表中的数据都是无序的存在库中，当我们在对数据进行检索的时候，查找起来很是耗费资源，于是我们就需要为表创建索引，索引的作用就是把表中的数据按照一定的顺序排列保存起来，于是就出现了一个问题，有的表中的数据和索引想要排列的顺序很是相近，而另一些表中的数据和索引想要排列的顺序相距甚远，聚簇因子的作用就是用来标示这个的，聚簇因子越小，相似度越高，聚簇因子越大，相似度越低。

1、列顺序

当计算有效索引选择率时，在基于区间的谓词之后的所有基于列的谓词都将被忽略——但是计算有效表选择率的时候还会使用他们——这样就导致oracle计算得出该索引的成本高得离谱。这就建议我们重新调整某些索引的结构，将对应于区间谓词的列调整到索引定义的末尾。这种调整可能会改变clustering_factor的值。

SQL> create table t1  2  pctfree 90  3  pctused 10  4  as  5  select  6    trunc((rownum-1)/ 100)  clustered,  7    mod(rownum - 1, 100)  scattered,  8    lpad(rownum,10)    small_vc  9  from 10    all_objects 11  where 12    rownum <= 10000 13  ;表已创建。SQL> create index t1_i1_good on t1(clustered, scattered);索引已创建。SQL> create index t1_i2_bad  on t1(scattered, clustered);索引已创建。SQL> begin  2    dbms_stats.gather_table_stats(  3      user,  4      't1',  5      cascade => true,  6      estimate_percent => null,  7      method_opt => 'for all columns size 1'  8    );  9  end; 10  /PL/SQL 过程已成功完成。SQL> select i.index_name,i.blevel,i.leaf_blocks,i.clustering_factor from user_indexes i;INDEX_NAME                         BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR------------------------------ ---------- ----------- -----------------T1_I1_GOOD                              1          24               278T1_I2_BAD                               1          24             10000SQL> set autotrace trace exp;SQL> select count(small_vc)  2    from t1  3   where scattered = 50  4     and clustered between 1 and 5;执行计划----------------------------------------------------------Plan hash value: 2833503776-------------------------------------------------------------------------------------------| Id  | Operation                    | Name       | Rows  | Bytes | Cost (%CPU)| Time     |-------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |            |     1 |    17 |     4   (0)| 00:00:01 ||   1 |  SORT AGGREGATE              |            |     1 |    17 |            |          ||   2 |   TABLE ACCESS BY INDEX ROWID| T1         |     6 |   102 |     4   (0)| 00:00:01 ||*  3 |    INDEX RANGE SCAN          | T1_I1_GOOD |     6 |       |     3   (0)| 00:00:01 |-------------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   3 - access("CLUSTERED">=1 AND "SCATTERED"=50 AND "CLUSTERED"<=5)       filter("SCATTERED"=50)SQL> select  2  /*+ index(t1 t1_i2_bad) */  3   count(small_vc)  4    from t1  5   where scattered = 50  6     and clustered between 1 and 5;执行计划----------------------------------------------------------Plan hash value: 2223336776------------------------------------------------------------------------------------------| Id  | Operation                    | Name      | Rows  | Bytes | Cost (%CPU)| Time     |------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |           |     1 |    17 |     9   (0)| 00:00:01 ||   1 |  SORT AGGREGATE              |           |     1 |    17 |            |          ||   2 |   TABLE ACCESS BY INDEX ROWID| T1        |     6 |   102 |     9   (0)| 00:00:01 ||*  3 |    INDEX RANGE SCAN          | T1_I2_BAD |     6 |       |     2   (0)| 00:00:01 |------------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   3 - access("SCATTERED"=50 AND "CLUSTERED">=1 AND "CLUSTERED"<=5)

在第一个执行计划中，尽管存在一个索引（T1_I2_BAD）能很好的满足以上查询的要求，其第一列是scattered（使用相等谓词），第二列是clustered（使用基于区间的谓词），优化器还是选择了t1_i1_good索引。
在第二个执行计划中，添加了一个提示来强制oracle使用我们认为更能满足查询要求的那个索引（t1_i2_bad），但是由此带来的成本将是优化器的默认选择的成本的两倍还要多（由4变为9）。
这就突出了优化器利用clustering_factor来计算索引访问路劲成本的缺陷。不管这个示例中用的是哪一个索引，访问表的数目是完全相同的——但是访问顺序会存在差异，这也足以导致优化器成本的计算结果产生巨大的差异。

SQL> select s.table_name,  2         s.column_name,  3         s.num_distinct,  4         s.density,  5         s.num_nulls,  6         s.avg_col_len  7    from user_tab_col_statistics s  8   where table_name = 'T1';TABLE_NAME                     COLUMN_NAME                    NUM_DISTINCT    DENSITY  NUM_NULLS AVG_COL_LEN------------------------------ ------------------------------ ------------ ---------- ---------- -----------T1                             CLUSTERED                               100        .01          0           3T1                             SCATTERED                               100        .01          0           3T1                             SMALL_VC                              10000      .0001          0          11

scattered = 50 的选择率： 1/100
clustered between 1 and 5 的选择率：(5-1)/(99-0)+2/100
联合选择率：1/100 * (5-1)/(99-0)+2/100

SQL> --cost(t1_il_good)SQL> select           2         1 +     3         ceil（24 * ((5-1)/(99-0)+2/100)) +     4         ceil(278 * (1/100 * ((5-1)/(99-0)+2/100)))    5  from dual;  1+CEIL（24*((5-1)/(99-0)+2/100))+CEIL(278*(1/100*((5-1)/(99-0)+2/100)))-----------------------------------------------------------------------                                                                      4SQL> --cost(t1_il_bad)SQL> select           2         1 +     3         ceil（24 * (1/100 * ((5-1)/(99-0)+2/100))) +     4         ceil(10000 * (1/100 * ((5-1)/(99-0)+2/100)))    5  from dual; 1+CEIL（24*(1/100*((5-1)/(99-0)+2/100)))+CEIL(10000*(1/100*((5-1)/(99-0)+2/100)))---------------------------------------------------------------------------------                                                                                9

2、额外的列

为索引添加一个或者两个列。也会导致clustering_factor出现剧烈的变化。

SQL> create table t1  2  as  3  select  4  sysdate + trunc((rownum-1) / 500)movement_date,  5  trunc(dbms_random.value(1,60.999))product_id,  6  trunc(dbms_random.value(1,10.000))qty,  7  lpad(rownum,10)small_vc,  8  rpad('x',100)padding  9  from 10  all_objects 11  where 12  rownum <= 10000 13  ;表已创建。SQL> create index t1_i1 on t1(movement_date);begin  dbms_stats.gather_table_stats(    user,    't1',    cascade => true,    estimate_percent => null,索引已创建。    method_opt => 'for all columns size 1'  8    );  9  end; 10  /PL/SQL 过程已成功完成。SQL> create index t1_i2 on t1(movement_date, product_id);begindbms_stats.gather_table_stats(user,'t1',cascade => true,索引已创建。SQL>   2    3    4    5    6  estimate_percent => null,  7  method_opt => 'for all columns size 1'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> select i.index_name,i.blevel,i.leaf_blocks,i.clustering_factor from user_indexes i;INDEX_NAME                         BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR------------------------------ ---------- ----------- -----------------T1_I1                                   1          27               184T1_I2                                   1          31              6664

虽然索引的大小只有一点点增加（这也可以通过叶块的数目提现出来），但是，clustering_factor的值再次出现了严重的变化。

SQL> select /*+ index(t1 t1_i1) */  2   sum(qty)  3    from t1  4   where movement_date = trunc(sysdate) + 7  5     and product_id = 44;执行计划----------------------------------------------------------Plan hash value: 269862921--------------------------------------------------------------------------------------| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |       |     1 |    14 |    12   (0)| 00:00:01 ||   1 |  SORT AGGREGATE              |       |     1 |    14 |            |          ||*  2 |   TABLE ACCESS BY INDEX ROWID| T1    |     8 |   112 |    12   (0)| 00:00:01 ||*  3 |    INDEX RANGE SCAN          | T1_I1 |   500 |       |     2   (0)| 00:00:01 |--------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - filter("PRODUCT_ID"=44)   3 - access("MOVEMENT_DATE"=TRUNC(SYSDATE@!)+7)SQL> select /*+ index(t1 t1_i2) */  2   sum(qty)  3    from t1  4   where movement_date = trunc(sysdate) + 7  5     and product_id = 44;执行计划----------------------------------------------------------Plan hash value: 2003103559--------------------------------------------------------------------------------------| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |       |     1 |    14 |     7   (0)| 00:00:01 ||   1 |  SORT AGGREGATE              |       |     1 |    14 |            |          ||   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     8 |   112 |     7   (0)| 00:00:01 ||*  3 |    INDEX RANGE SCAN          | T1_I2 |     8 |       |     1   (0)| 00:00:01 |--------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   3 - access("MOVEMENT_DATE"=TRUNC(SYSDATE@!)+7 AND "PRODUCT_ID"=44)

这个查询就是那种鼓励我们为索引添加额外列的查询类型。可以看见使用复合索引查询成本确实降低了。

SQL> select /*+ index(t1 t1_i1) */  2   product_id, max(small_vc)  3    from t1  4   where movement_date = trunc(sysdate) + 7  5   group by product_id;执行计划----------------------------------------------------------Plan hash value: 2629508439--------------------------------------------------------------------------------------| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |       |    60 |  1320 |    13   (8)| 00:00:01 ||   1 |  HASH GROUP BY               |       |    60 |  1320 |    13   (8)| 00:00:01 ||   2 |   TABLE ACCESS BY INDEX ROWID| T1    |   500 | 11000 |    12   (0)| 00:00:01 ||*  3 |    INDEX RANGE SCAN          | T1_I1 |   500 |       |     2   (0)| 00:00:01 |--------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   3 - access("MOVEMENT_DATE"=TRUNC(SYSDATE@!)+7)SQL> select /*+ index(t1 t1_i2) */  2   product_id, max(small_vc)  3    from t1  4   where movement_date = trunc(sysdate) + 7  5   group by product_id;执行计划----------------------------------------------------------Plan hash value: 878233575--------------------------------------------------------------------------------------| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |       |    60 |  1320 |   337   (0)| 00:00:05 ||   1 |  SORT GROUP BY NOSORT        |       |    60 |  1320 |   337   (0)| 00:00:05 ||   2 |   TABLE ACCESS BY INDEX ROWID| T1    |   500 | 11000 |   337   (0)| 00:00:05 ||*  3 |    INDEX RANGE SCAN          | T1_I2 |   500 |       |     3   (0)| 00:00:01 |--------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   3 - access("MOVEMENT_DATE"=TRUNC(SYSDATE@!)+7)

这个查询额外列的存在改变了行访问的顺序（即clustering_factor的值所描述的内容），因此成本也发生了变化。