Important !! Clustering Factor Calculation Improvement


Believe me, this article is worth reading

I’m currently not allowed to discuss Oracle 12c Database goodies but I am allowed to discuss things perhaps initially intended for 12c that are currently available and already back-ported to 11g. This includes a wonderful improvement in the manageability of how the Clustering Factor (CF) of an index can now be calculated. Many thanks to Martin Decker for pointing this out to me.

As anyone who has attended my Index Seminars will know, the CF of an index is one of the most important statistics used by the Cost Based Optimizer (CBO) in determining the most efficient execution plan. As such, it has always been an issue for me that the manner in which the CF is calculated has been so flawed.

Basically, the CF is calculated by performing a Full Index Scan and looking at the rowid of each index entry. If the table block being referenced differs from that of the previous index entry, the CF is incremented. If the table block being referenced is the same as that of the previous index entry, the CF is not incremented. So the CF gives an indication of how well ordered the data in the table is in relation to the index entries (which are always sorted and stored in the order of the index entries). The better (lower) the CF, the more efficient it would be to use the index, as fewer table blocks would need to be accessed to retrieve the necessary data via the index.
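To make the counting rule concrete, here is a minimal Python sketch of the calculation as just described. It is only an illustration, not Oracle's internal code: the function name `clustering_factor` and the idea of passing in a plain list of table block numbers (one per index entry, in index order) are assumptions made for the example.

```python
def clustering_factor(table_blocks):
    """Count how often the referenced table block changes while
    walking the index entries in sorted key order.

    table_blocks: the table block number taken from the rowid of each
    index entry, listed in index order (a simplification for this sketch).
    """
    cf = 0
    previous = None
    for block in table_blocks:
        if block != previous:   # different block from the previous entry
            cf += 1
        previous = block
    return cf


# Perfectly ordered data: three blocks visited one after the other.
print(clustering_factor([1, 1, 1, 2, 2, 2, 3, 3, 3]))  # 3

# Same three blocks, but consecutive index entries keep switching block.
print(clustering_factor([1, 2, 3, 1, 2, 3, 1, 2, 3]))  # 9
```

In both cases only three table blocks are involved, yet the second ordering produces a CF equal to the number of rows.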

However, there’s a basic flaw here. The CF calculation doesn’t take into consideration the fact that the referenced table block, although maybe different from that of the previous index entry, might already have recently been accessed. As such, during an index scan, the table block being accessed is almost certainly still cached in the buffer cache from the previous access, thereby not reducing the effectiveness of the index in any appreciable manner. A classic example of this would be a table with a few freelists. Although the data being inserted is not ordered precisely within the same data blocks, the data might actually be very well clustered within only a few blocks of each other.

Picture a table with 100 rows being inserted by 2 sessions simultaneously, each inserting 50 rows based on an ordered sequence. With one freelist, the data is basically inserted in one block first and then once full a second table block. The data is therefore perfectly ordered/clustered and the CF will evaluate to a value of 2 on such an indexed column. But with 2 freelists, one session could insert data into one block while the other session inserts into a second block, with the ordered sequenced values being randomly distributed among the 2 blocks. The CF could now potentially evaluate to a value of 100 as the rows are jumbled or “toggled” across the two blocks. This is a much much worse value (2 vs. 100) that can adversely impact the CBO calculations, although the efficiency of such an index is really almost identical as both table blocks are certain to be cached during an index scan regardless.
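The 2 vs. 100 comparison can be reproduced with a small Python sketch. The assumptions are mine and chosen to match the worst case described above: each block holds 50 rows, and with 2 freelists the two sessions alternate strictly, so odd and even IDs land in different blocks. The helper `clustering_factor` is a hypothetical stand-in for the classic calculation.

```python
def clustering_factor(table_blocks):
    """Classic CF: count block changes between consecutive index entries."""
    cf, previous = 0, None
    for block in table_blocks:
        if block != previous:
            cf += 1
            previous = block
    return cf


ROWS = 100

# One freelist: IDs 1..50 fill block 0, then IDs 51..100 fill block 1.
one_freelist = [0 if row_id <= 50 else 1 for row_id in range(1, ROWS + 1)]

# Two freelists, worst case: the sessions alternate strictly, so
# consecutive IDs always land in different blocks.
two_freelists = [row_id % 2 for row_id in range(1, ROWS + 1)]

print(clustering_factor(one_freelist))    # 2
print(clustering_factor(two_freelists))   # 100

# Both layouts touch exactly the same number of table blocks.
print(len(set(one_freelist)), len(set(two_freelists)))  # 2 2
```

A real concurrent load would interleave less regularly than this, so the CF would typically fall somewhere between the two extremes.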

This is also a very common scenario with Automatic Segment Space Management (ASSM) tablespaces as I’ve discussed previously, which of course is now the default these days.

OK, let’s look at an example scenario. I’ll begin by creating a simple little table, an ordered sequence and a procedure that inserts 100,000 rows into the table:

SQL> create table bowie (id number, text varchar2(30));

Table created.

SQL> create sequence bowie_seq order;

Sequence created.

SQL> CREATE OR REPLACE PROCEDURE bowie_proc AS
  2  BEGIN
  3     FOR i IN 1..100000 LOOP
  4         INSERT INTO bowie VALUES (bowie_seq.NEXTVAL, 'ZIGGY STARDUST');
  5         COMMIT;
  6     END LOOP;
  7  END;
  8  /

Procedure created.

We note the table lives in an ASSM tablespace:

SQL> select table_name, i.tablespace_name, segment_space_management
from dba_tables i, dba_tablespaces t where i.tablespace_name = t.tablespace_name and table_name='BOWIE';

TABLE_NAME   TABLESPACE_NAME                SEGMEN
------------ ------------------------------ ------
BOWIE        USERS                          AUTO

We next have 3 different sessions that simultaneously run the procedure to load the table. Note that an ordered sequence is used which means the 3 sessions are randomly grabbing the next sequenced value to insert. The data though is basically being inserted in order of the ID column, it’s just that the data is being distributed across a few blocks as we go along the table, rather than strictly one block after the other.

SQL> exec bowie_proc
 
PL/SQL procedure successfully completed.

Let’s create an index on the ID (sequenced) column and collect fresh statistics:

SQL> create index bowie_id_i on bowie(id);

Index created.

SQL> EXEC dbms_stats.gather_table_stats(ownname=>user, tabname=>'BOWIE', estimate_percent=> null, cascade=> true, method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> SELECT t.table_name, i.index_name, t.blocks, t.num_rows, i.clustering_factor
  2  FROM user_tables t, user_indexes i
  3  WHERE t.table_name = i.table_name AND i.index_name='BOWIE_ID_I';

TABLE_NAME   INDEX_NAME       BLOCKS   NUM_ROWS CLUSTERING_FACTOR
------------ ------------ ---------- ---------- -----------------
BOWIE        BOWIE_ID_I         1126     300000            241465

We notice that although the data in the table is in reality quite well clustered/ordered on the ID column, the actual CF of the index is not reflecting this. At a massive 241,465 it’s an extremely high (bad) CF, much closer in value to the number of rows in the table than to the number of table blocks, as the CF calculation keeps flipping back and forth between differing blocks. With such a high CF, the CBO is therefore going to cost an index scan accordingly:

SQL> select * from bowie where id between 42 and 429;

388 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |   389 |  7780 |   310   (1)| 00:00:04 |
|*  1 |  TABLE ACCESS FULL| BOWIE |   389 |  7780 |   310   (1)| 00:00:04 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("ID"<=429 AND "ID">=42)

Statistics
----------------------------------------------------------
          0  recursive calls
          1  db block gets
       1093  consistent gets
          0  physical reads
          0  redo size
       4084  bytes sent via SQL*Net to client
        519  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        388  rows processed

Even though only approx. 0.13% of rows are being accessed and, more importantly, a similarly low percentage of table blocks, the CBO has determined that a Full Table Scan (FTS) is the cheaper alternative. This is an all too familiar scenario, all down to the fact that the CF is not accurately reflecting the true clustering of the data and the subsequent efficiency of the index.

Finally, at long last, there’s now an official fix for this !!

Bug 13262857 Enh: provide some control over DBMS_STATS index clustering factor computation INDEX describes this scenario and currently has available patches that can be applied on both Exadata databases and Oracle versions 11.1.0.7, 11.2.0.2 and 11.2.0.3. The patches (e.g. Patch ID 15830250) describe the fix as addressing “Index Clustering Factor Computation Is Pessimistic”. I couldn’t have described it better myself.

Once applied (the following demo is on a patched 11.2.0.3 database), there is a new statistics collection preference that can be defined, called TABLE_CACHED_BLOCKS. This basically sets the number of table blocks we can assume would already be cached when performing an index scan and can be ignored when incrementing the CF during statistics gathering. The default is 1 (i.e. as performed presently) but it can be set to a value between 1 and 255, meaning that during the collection of index statistics, the CF will not be incremented if the table block being referenced by the current index entry has already been referenced by any of the prior 255 index entries (if set to 255). It basically sets the appropriate parameter in the sys_op_countchg function used to calculate the CF value during statistics gathering to not increment the CF if the current table block has already been accessed “x” index entries previously.
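The effect of such a look-back window can be sketched in Python. This is a simulation under stated assumptions, not the actual sys_op_countchg implementation: the parameter name `cached_blocks`, the choice to remember the most recently referenced distinct blocks, and the LRU-style eviction are all mine. With `cached_blocks=1` the sketch reduces to the classic calculation.

```python
from collections import OrderedDict


def clustering_factor(table_blocks, cached_blocks=1):
    """CF with a look-back window of recently referenced table blocks.

    A block that is still in the window is assumed to be cached and
    does not increment the CF. Eviction is least-recently-used, which
    is an assumption of this sketch.
    """
    recent = OrderedDict()   # most recently referenced blocks, oldest first
    cf = 0
    for block in table_blocks:
        if block in recent:
            recent.move_to_end(block)        # seen recently: no increment
        else:
            cf += 1
            recent[block] = True
            if len(recent) > cached_blocks:
                recent.popitem(last=False)   # forget the oldest block
    return cf


# Hypothetical load: 3 sessions insert 3000 ordered IDs round-robin,
# each session filling its own block of 100 rows at a time.
blocks = [3 * (i // 300) + i % 3 for i in range(3000)]

print(len(set(blocks)))                            # 30 table blocks in total
print(clustering_factor(blocks, cached_blocks=1))  # 3000
print(clustering_factor(blocks, cached_blocks=3))  # 30
```

In this contrived layout a window of just 3 blocks is enough to bring the CF down from the number of rows to the number of table blocks.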

The TABLE_CACHED_BLOCKS preference can be set by either the DBMS_STATS.SET_TABLE_PREFS, DBMS_STATS.SET_SCHEMA_PREFS or DBMS_STATS.SET_DATABASE_PREFS procedures.

So let’s now change the TABLE_CACHED_BLOCKS preference for this table and re-calculate the index statistics:

SQL> exec dbms_stats.set_table_prefs(ownname=>user, tabname=>'BOWIE', pname=>'TABLE_CACHED_BLOCKS', pvalue=>42);

PL/SQL procedure successfully completed.

SQL> EXEC dbms_stats.gather_index_stats(ownname=>user, indname=>'BOWIE_ID_I', estimate_percent=> null);

PL/SQL procedure successfully completed.

SQL> SELECT t.table_name, i.index_name, t.blocks, t.num_rows, i.clustering_factor
  2  FROM user_tables t, user_indexes i
  3  WHERE t.table_name = i.table_name AND i.index_name='BOWIE_ID_I';

TABLE_NAME   INDEX_NAME       BLOCKS   NUM_ROWS CLUSTERING_FACTOR
------------ ------------ ---------- ---------- -----------------
BOWIE        BOWIE_ID_I         1126     300000              1035

We notice that the CF has now been significantly reduced (down from 241465 to just 1035), reflecting far more accurately the true clustering of the data when considering the actual effectiveness of using the index.

If we now run the same query as before:

SQL> select * from bowie where id between 42 and 429;

388 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3472402785

------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |   389 |  7780 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| BOWIE      |   389 |  7780 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | BOWIE_ID_I |   389 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID">=42 AND "ID"<=429)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          6  consistent gets
          0  physical reads
          0  redo size
       9882  bytes sent via SQL*Net to client
        519  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        388  rows processed

We notice the index is now being selected by the CBO. At a cost of 4 (previously the cost was somewhat greater than the 310 cost of the FTS), this much more accurately reflects the true cost of using the index (notice only 6 consistent gets are performed).
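As a rough cross-check of these costs, the Python sketch below uses the commonly quoted simplified I/O costing for an index range scan: the cost of the index access itself, plus the selectivity multiplied by the CF, rounded up. This is an approximation that ignores CPU costing and other adjustments, so treat the figures as estimates only. The index access cost of 2 and the 389 estimated rows are taken from the execution plan above; the function name `table_access_cost` is hypothetical.

```python
import math

NUM_ROWS = 300_000       # NUM_ROWS from the table statistics
ESTIMATED_ROWS = 389     # Rows estimate from the execution plan
INDEX_SCAN_COST = 2      # INDEX RANGE SCAN cost from the second plan


def table_access_cost(clustering_factor):
    """Simplified I/O cost of an index range scan plus table access.

    Assumes cost = index access cost + ceil(selectivity * CF), which is
    an approximation rather than the CBO's exact arithmetic.
    """
    table_part = math.ceil(ESTIMATED_ROWS * clustering_factor / NUM_ROWS)
    return INDEX_SCAN_COST + table_part


print(table_access_cost(241465))  # 316, above the FTS cost of 310
print(table_access_cost(1035))    # 4, matching the plan above
```

Under this approximation, the original CF puts the index just above the cost of the FTS, while the recalculated CF brings it down to the 4 shown in the plan.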

Being able to now set the TABLE_CACHED_BLOCKS preference during statistics collection finally gives us a fully supported and easy method to collect more accurate CF statistics. This in turn can only lead to more informed and accurate decisions by the CBO and ultimately better performing applications. Although available right now via the back-ported patches, this will no doubt all be fully documented once the 12c database is finally released.

I can’t recommend enough the use of this new capability.