（4）选择率（selectivity）的相关问题

来源：互联网发布：内衣取名知乎编辑：程序博客网时间：2024/05/24 02:37

本章将讨论采用标准的选择率计算方式却产生不合适结果的一些最常见原因。

1、前导零

创建一个包含2000000行数据的表，它有一个id列，采用序号并利用0来进行填充其他位。因此典型的存储值将是A00000000000000001，系统中绝大部分使用这种策略的查询，类似于where id={string constant}的形式；但是，如果它们使用了基于区间的谓词，可能将出现一些奇怪的性能问题。

SQL> create table t1   2  nologging  3  pctfree 0  4  as  5  with generator as (  6    select  7      rownum   id  8    from  all_objects   9    where  rownum <= 2000 10  ) 11  select 12    /*+ ordered use_nl(v2) */ 13    trunc((rownum-1)/10000)    grp_id, 14    'A' || lpad(rownum, 17, '0')  id 15  from 16    generator  v1, 17    generator  v2 18  where 19    rownum <= 2000000 20  ;表已创建。SQL> begin  2  dbms_stats.gather_table_stats(  3  ownname=> user,  4  tabname=> 't1',  5  cascade=> true,  6  estimate_percent=> null,   7  method_opt=>'for all columns size 1'  8  );  9  end; 10  /PL/SQL 过程已成功完成。

SQL> set autotrace traceonly;SQL> select * from t1 where id between 'A00000000000060000' and 'A00000000000070000';已选择10001行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |    17 |   374 |  1190   (3)| 00:00:15 ||*  1 |  TABLE ACCESS FULL| T1   |    17 |   374 |  1190   (3)| 00:00:15 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("ID"<='A00000000000070000' AND "ID">='A00000000000060000')统计信息----------------------------------------------------------          1  recursive calls          0  db block gets       7520  consistent gets       6849  physical reads          0  redo size     325111  bytes sent via SQL*Net to client       7741  bytes received via SQL*Net from client        668  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)      10001  rows processed

针对该查询得到的基数是非常低的（17），但是很明显却返回了10001行。下面重新创建直方图，直方图的默认值为75个桶。

SQL> set autotrace off;SQL> begin  2  dbms_stats.gather_table_stats(  3  ownname=> user,  4  tabname=> 't1',  5  cascade=> true,  6  estimate_percent=> null,   7  method_opt=>'for all columns size 75'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace traceonly;SQL> select * from t1 where id between 'A00000000000060000' and 'A00000000000070000';已选择10001行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |  8924 |   191K|  1190   (3)| 00:00:15 ||*  1 |  TABLE ACCESS FULL| T1   |  8924 |   191K|  1190   (3)| 00:00:15 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("ID"<='A00000000000070000' AND "ID">='A00000000000060000')统计信息----------------------------------------------------------          0  recursive calls          0  db block gets       7520  consistent gets       6849  physical reads          0  redo size     325111  bytes sent via SQL*Net to client       7741  bytes received via SQL*Net from client        668  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)      10001  rows processed

直方图的默认值为75个桶，执行计划显示估计的基数为8924——这个结果至少大体上是正确的。创建直方图是一种方法，如果id列的值没有字符，是可以转换为数字的字符串（00000000000000001），那么在id列上面创建一个函数索引也是一种办法：create index t1_i1 on t1(grp_id, to_number(id));

2、致命的默认值

即时是数据库应用程序正确的使用了日期类型，也仍然需要避免null值。为了不让任何列为null，每个可空的列都有一个默认值。因此，大部分独立于数据库的开发人员会选择一个什么样的值来表示null日期呢？如果表示很久以后的日期呢？比如4000年12月31日。

SQL> create table t1   2  as  3  with generator as (  4  select  5  rownum id  6  from all_objects   7  where rownum <= 2000  8  )  9  select 10  /*+ ordered use_nl(v2) */ 11  decode( 12      mod(rownum - 1,1000), 13        0,to_date('4000-12-31','yyyy-mm-dd'), 14          to_date('2000-01-01','yyyy-mm-dd') + trunc((rownum - 1)/100)  15    )  date_closed 16  from 17    generator  v1, 18    generator  v2 19  where rownum <= 1827 * 100;表已创建。SQL> begin  2  dbms_stats.gather_table_stats(  3  ownname=> user,  4  tabname=> 't1',  5  cascade=> true,  6  estimate_percent=> null,   7  method_opt=>'for all columns size 1'    --直方图的默认值为1桶。  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace traceonly;SQL> select *  2    from t1  3   where date_closed between to_date('2003-01-01', 'yyyy-mm-dd') and  4         to_date('2003-12-31', 'yyyy-mm-dd');已选择36463行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |   291 |  2328 |    61   (4)| 00:00:01 ||*  1 |  TABLE ACCESS FULL| T1   |   291 |  2328 |    61   (4)| 00:00:01 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("DATE_CLOSED"<=TO_DATE(' 2003-12-31 00:00:00',              'syyyy-mm-dd hh24:mi:ss') AND "DATE_CLOSED">=TO_DATE(' 2003-01-01              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))统计信息----------------------------------------------------------          1  recursive calls          0  db block gets       2759  consistent gets          0  physical reads          0  redo size     494301  bytes sent via SQL*Net to client      27145  bytes received via SQL*Net from client       2432  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)      36463  rows processed     --这里实际放回了36463行记录，但是oracle却计算错误了（291）。SQL> set autotrace off;SQL> begin  2  dbms_stats.gather_table_stats(  3  ownname=> user,  4  tabname=> 't1',  5  cascade=> true,  6  estimate_percent=> null,   7  method_opt=>'for all columns size 11'    --直方图的默认值为11桶。  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace traceonly;SQL> select *  2    from t1  3   where date_closed between to_date('2003-01-01', 'yyyy-mm-dd') and  4         to_date('2003-12-31', 'yyyy-mm-dd');已选择36463行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      | 36320 |   283K|    61   (4)| 00:00:01 ||*  1 |  TABLE ACCESS FULL| T1   | 36320 |   283K|    61   (4)| 00:00:01 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("DATE_CLOSED">=TO_DATE(' 2003-01-01 00:00:00',              'syyyy-mm-dd hh24:mi:ss') AND "DATE_CLOSED"<=TO_DATE(' 2003-12-31              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))统计信息----------------------------------------------------------          0  recursive calls          0  db block gets       2759  consistent gets          0  physical reads          0  redo size     494301  bytes sent via SQL*Net to client      27145  bytes received via SQL*Net from client       2432  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)      36463  rows processed          --36463：36320这次oracle计算得比较准确了。

可以看见加大了直方图的桶数之后，CBO估算的行数就比较接近真实值了，那我们再加大直方图的桶数试一下呐！

SQL> set autotrace off;SQL> begin  2  dbms_stats.gather_table_stats(  3  ownname=> user,  4  tabname=> 't1',  5  cascade=> true,  6  estimate_percent=> null,   7  method_opt=>'for all columns size 75'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace trace;SQL> select *  2    from t1  3   where date_closed between to_date('2003-01-01', 'yyyy-mm-dd') and  4         to_date('2003-12-31', 'yyyy-mm-dd');已选择36463行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      | 36345 |   283K|    61   (4)| 00:00:01 ||*  1 |  TABLE ACCESS FULL| T1   | 36345 |   283K|    61   (4)| 00:00:01 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("DATE_CLOSED">=TO_DATE(' 2003-01-01 00:00:00',              'syyyy-mm-dd hh24:mi:ss') AND "DATE_CLOSED"<=TO_DATE(' 2003-12-31              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))统计信息----------------------------------------------------------          0  recursive calls          0  db block gets       2759  consistent gets          0  physical reads          0  redo size     494301  bytes sent via SQL*Net to client      27145  bytes received via SQL*Net from client       2432  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)      36463  rows processed    --36463：36345 加大了直方图的桶数之后，CBO估算返回的行数误差更小了。

3、离散数据的风险

考虑一个包含period列的计数系统——存储1~12月的数据，并额外再加一个月，其对应的数值为99（同时包含了第二种选择，即这个特殊的月给定的值为13）。

SQL> create table t1   2  as  3  with generator as (  4  select  5      rownum   id  6    from  all_objects   7    where  rownum <= 1000  8  )  9  select 10    /*+ ordered use_nl(v2) */ 11    mod(rownum-1,13)  period_01,  12    mod(rownum-1,13)  period_02 13  from 14    generator  v1, 15    generator  v2 16  where 17    rownum <= 13000 18  ;表已创建。SQL> update t1 set   2  period_01 = 99,  3  period_02 = 13  4  where   5  period_01 = 0;已更新1000行。SQL> commit;提交完成。SQL> begin  2  dbms_stats.gather_table_stats(  3  user,  4  't1',  5  cascade => true,  6  estimate_percent => null,  7  method_opt => 'for all columns size 1'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace on exp;SQL> select count(*) from t1 where period_01 between 4 and 6;  COUNT(*)----------      3000          --这里实际有3000行，但是oracle估算的是1663行，不准确！执行计划----------------------------------------------------------Plan hash value: 3724264953---------------------------------------------------------------------------| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |---------------------------------------------------------------------------|   0 | SELECT STATEMENT   |      |     1 |     3 |     6   (0)| 00:00:01 ||   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          ||*  2 |   TABLE ACCESS FULL| T1   |  1663 |  4989 |     6   (0)| 00:00:01 |---------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - filter("PERIOD_01"<=6 AND "PERIOD_01">=4)SQL> select count(*) from t1 where period_02 between 4 and 6;  COUNT(*)----------      3000       --这里实际有3000行，但是oracle估算的是4167行，不准确！执行计划----------------------------------------------------------Plan hash value: 3724264953---------------------------------------------------------------------------| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |---------------------------------------------------------------------------|   0 | SELECT STATEMENT   |      |     1 |     3 |     6   (0)| 00:00:01 ||   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          ||*  2 |   TABLE ACCESS FULL| T1   |  4167 | 12501 |     6   (0)| 00:00:01 |---------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - filter("PERIOD_02"<=6 AND "PERIOD_02">=4)SQL> set autotrace off;SQL> begin  2  dbms_stats.gather_table_stats(  3  user,  4  't1',  5  cascade => true,  6  estimate_percent => null,  7  method_opt => 'for all columns size 254'           --重新收集直方图。  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace on exp;SQL> select count(*) from t1 where period_01 between 4 and 6;  COUNT(*)----------      3000执行计划----------------------------------------------------------Plan hash value: 3724264953---------------------------------------------------------------------------| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |---------------------------------------------------------------------------|   0 | SELECT STATEMENT   |      |     1 |     3 |     6   (0)| 00:00:01 ||   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          ||*  2 |   TABLE ACCESS FULL| T1   |  3000 |  9000 |     6   (0)| 00:00:01 |---------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - filter("PERIOD_01"<=6 AND "PERIOD_01">=4)SQL> select count(*) from t1 where period_02 between 4 and 6;  COUNT(*)----------      3000执行计划----------------------------------------------------------Plan hash value: 3724264953---------------------------------------------------------------------------| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |---------------------------------------------------------------------------|   0 | SELECT STATEMENT   |      |     1 |     3 |     6   (0)| 00:00:01 ||   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          ||*  2 |   TABLE ACCESS FULL| T1   |  3000 |  9000 |     6   (0)| 00:00:01 |---------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   2 - filter("PERIOD_02"<=6 AND "PERIOD_02">=4)

4、函数索引

SQL> create index idx_t1_PERIOD_01 on t1(upper(PERIOD_01));索引已创建。SQL> select i.index_name,i.index_type from user_indexes i;INDEX_NAME                     INDEX_TYPE------------------------------ ---------------------------IDX_T1_PERIOD_01               FUNCTION-BASED NORMALSQL> select c.TABLE_NAME,c.COLUMN_NAME from user_tab_cols c;TABLE_NAME                     COLUMN_NAME------------------------------ ------------------------------T1                             PERIOD_01T1                             PERIOD_02T1                             SYS_NC00003$

需要牢记的是，如果创建了一个基于函数的索引，那么实际上是在虚拟列上创建了索引，当收集关于该表及其索引的统计信息时，同时也就收集了虚拟列上的统计信息。这在类似情况下，诸如upper(PERIOD_01)='xxx'的谓词将被优化为：SYS_NC00003$='xxx'。

5、相互关联的列

如果在谓词中使用相互依赖（相关）的列容易将问题复杂化，前提是这些相关联的列同时出现在where子句中。

SQL> create table t1   2  nologging  3  as  4  select  5  trunc(dbms_random.value(0,25))n1,  6  rpad('x',40)ind_pad,  7  trunc(dbms_random.value(0,20))n2,  8  lpad(rownum,10,'0')small_vc,  9  rpad('x',200)padding 10  from 11  all_objects 12  where 13  rownum  <= 10000 14  ;表已创建。SQL> update t1 set n2 = n1;已更新10000行。SQL> commit;提交完成。SQL> begin  2  dbms_stats.gather_table_stats(  3  ownname => user,  4  tabname=> 'T1',  5  cascade=> true,  6  estimate_percent => null,  7  method_opt => 'for all columns size 1'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> set autotrace traceonly;SQL> select small_vc  2    from t1  3   where n1 = 2  4     and n2 = 2;已选择420行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |    16 |   272 |    66   (0)| 00:00:01 ||*  1 |  TABLE ACCESS FULL| T1   |    16 |   272 |    66   (0)| 00:00:01 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("N1"=2 AND "N2"=2)统计信息----------------------------------------------------------          1  recursive calls          0  db block gets        401  consistent gets          0  physical reads          0  redo size       9059  bytes sent via SQL*Net to client        712  bytes received via SQL*Net from client         29  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)        420  rows processed

根据以上执行计划，CBO估算将会返回16行数据，实际确实返回了420行，如果这个表要与多表关联，基数一旦算错，必然导致整个SQL的执行计划全部出错，从而导致SQL性能下降。

接下来再次在相关列上执行查询，不过sql语句中包含了dynamic_sampling提示。

SQL> select /*+ dynamic_sampling(t1 1) */  2   small_vc  3    from t1  4   where n1 = 2  5     and n2 = 2;已选择420行。执行计划----------------------------------------------------------Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |   370 |  6290 |    66   (0)| 00:00:01 ||*  1 |  TABLE ACCESS FULL| T1   |   370 |  6290 |    66   (0)| 00:00:01 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("N1"=2 AND "N2"=2)Note-----   - dynamic sampling used for this statement (level=2)统计信息----------------------------------------------------------          0  recursive calls          0  db block gets        401  consistent gets          0  physical reads          0  redo size       9059  bytes sent via SQL*Net to client        712  bytes received via SQL*Net from client         29  SQL*Net roundtrips to/from client          0  sorts (memory)          0  sorts (disk)        420  rows processed