Re: Why index Scan in this case ?

From: <fitzjarrell_at_cox.net>
Date: 2 Nov 2005 09:53:22 -0800
Message-ID: <1130954002.530370.220130@o13g2000cwo.googlegroups.com>

Js wrote:
> Thanks Peter,
>
> Actually I could not make my question clear. I know this is because
> histograms are not there.
>
> analyze table t compute statistics
> is equal to
> analyze table t compute statistics for table for all columns for all
> indexes.
>
> When you tried with histograms it picked the full table scan.
>
>
> My question was : I analyzed table and index.
>
> Oracle knows the number of rows. -- Table Stats
> Oracle knows the number of distinct values. -- Index Stats
>
> Oracle knows that this table has 5792 records and 1 distinct value.
>
> Anyhow, index usage for all the rows would be costly.
>
> Then why is Oracle still choosing an index scan?

As you've noted, histograms make the difference:

SQL> -- create table
SQL> create table t as select * From sys.dba_objects;

Table created.

SQL>
SQL> -- drastically change the data distribution
SQL> update t set object_type = 'TABLE';

7820 rows updated.

SQL>
SQL> -- add an index
SQL> create index objtype_t on t(object_type);

Index created.

SQL>
SQL> -- compute statistics and generate histograms
SQL> exec dbms_stats.gather_table_stats(ownname=>'SCOTT',tabname=>'T',method_opt=>'for all indexed columns size auto',cascade=>true);

PL/SQL procedure successfully completed.

SQL>
SQL> set autotrace traceonly
SQL>
SQL> -- basic select
SQL> select * from t;

7820 rows selected.

Execution Plan


   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=25 Card=7820 Bytes=649060)

   1    0   TABLE ACCESS (FULL) OF 'T' (Cost=25 Card=7820 Bytes=649060)





Statistics


          0  recursive calls
          0  db block gets
        711  consistent gets
          0  physical reads
          0  redo size
     507845  bytes sent via SQL*Net to client
       6382  bytes received via SQL*Net from client
        523  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       7820  rows processed

SQL>

SQL> -- select with predicate
SQL> select * From t where object_type = 'TABLE';

7820 rows selected.

Execution Plan


   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=26 Card=7820 Bytes=649060)

   1    0   TABLE ACCESS (FULL) OF 'T' (Cost=26 Card=7820 Bytes=649060)





Statistics


          0  recursive calls
          0  db block gets
        711  consistent gets
          0  physical reads
          0  redo size
     507845  bytes sent via SQL*Net to client
       6382  bytes received via SQL*Net from client
        523  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       7820  rows processed

SQL>

Computing histograms can help query performance by eliminating the 'guesswork' the optimizer otherwise does with respect to data distribution. In the absence of histograms, optimizer plans are based upon the assumption that data is distributed evenly across the desired keys. Such assumptions work well for many installations; however, they can wreak havoc on data distributed such that clusters and gaps appear, as the optimizer will assume, wrongly, that the values in a given range are located in close proximity to each other. Should the optimizer estimate that the data returned will exceed roughly 30% of the total data in the table, a full table scan may be chosen when, in actuality, an index scan would be more appropriate.

Histograms illustrate the relative distribution of the data keys and provide the optimizer with better information with which to formulate an access plan. Histograms are created using 'buckets', or ranges, to group key values. Such groupings can be seen in the USER_TAB_HISTOGRAMS view; the endpoint number for each group is listed, as well as the normalized value for that group.

By default DBMS_STATS computes histograms with one 'bucket' for each indexed column if the SIZE parameter is not specified; with fairly evenly distributed data such histograms are acceptable. With skewed data a better histogram is necessary, and with 9i and later releases Oracle provides the SIZE AUTO specification, allowing Oracle to determine how many 'buckets' are necessary. An example may explain more clearly how this works.

A table, TEST_TAB, is created with the following structure:

Table TEST_TAB

 Name                            Null?    Type
 ------------------------------- -------- ----
 A                                        NUMBER(6)
 B                                        NUMBER(6)

Skewed data is loaded into this table; Column A contains unique values from 1 to 10000, while Column B contains 10 distinct values: 1, 2, 3, 4, 5, 9996, 9997, 9998, 9999 and 10000. The value 5 occurs 9991 times in the data; all other values occur only once.
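
The DDL and data load are not shown in the original example; one way to build such a distribution (a sketch, using a CONNECT BY row generator) would be:

-- create the test table and load 10,000 rows of skewed data
create table test_tab (a number(6), b number(6));

insert into test_tab (a, b)
select level,                            -- column A: unique values 1 .. 10000
       case
         when level <= 4    then level   -- values 1, 2, 3, 4 occur once each
         when level >  9995 then level   -- values 9996 .. 10000 occur once each
         else 5                          -- the value 5 occurs 9991 times
       end
  from dual
connect by level <= 10000;

commit;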

Test queries run against this data are listed below:

(1) select * from test_tab where b=5;
(2) select * from test_tab where b=3;

Both of the above queries execute a full table scan as no other access method is available.

Creating an index on column B changes the execution plans, resulting in both queries executing an index range scan in place of the full table scan. However, with an index present, it would be preferable to perform an index range scan for query (2), where a small percentage of rows satisfy the condition, but perform a full table scan for query (1), where the majority of the table data would be returned.
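
The index creation itself is not shown here; based on the plan output further below, which references TEST_TAB_B, it would look something like this, with autotrace used to display the plans:

create index test_tab_b on test_tab(b);

set autotrace traceonly explain
select * from test_tab where b=5;
select * from test_tab where b=3;
set autotrace off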

The table is analyzed using the dbms_stats.gather_table_stats() procedure:

SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all columns');

The computed statistics from dba_tables are shown below:

  NUM_ROWS     BLOCKS EMPTY_BLOCKS  AVG_SPACE  CHAIN_CNT AVG_ROW_LEN
---------- ---------- ------------ ---------- ---------- -----------
     10000         64            0         86          0          10

And the statistics from dba_tab_columns:

NUM_DISTINCT LOW  HIGH    DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANALYZ SAMPLE_SIZE
------------ ---- ---- ---------- ---------- ----------- ----------- -----------
       10000 Full Full      .0001          0           1 30-JUN-1999       10000
          10 Full Full         .1          0           1 30-JUN-1999       10000
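
The query behind that listing is not shown; assuming the table lives in the current schema, something like this would reproduce it (LOW_VALUE and HIGH_VALUE are RAW columns and are left out of the sketch):

select column_name, num_distinct, density, num_nulls,
       num_buckets, last_analyzed, sample_size
  from dba_tab_columns
 where owner = user
   and table_name = 'TEST_TAB'
 order by column_name;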

From the USER_TAB_HISTOGRAMS view the following information is
available:

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----------- --------------- --------------

  TEST_TAB           A               0              1
  TEST_TAB           A               1          10000
  TEST_TAB           B               0              1
  TEST_TAB           B               1          10000

DBMS_STATS.GATHER_TABLE_STATS has created 1 bucket for each column, so all values for the column are in the same bucket. The ENDPOINT_NUMBER represents the bucket number and ENDPOINT_VALUE represents the last column value in that bucket.

Now, unfortunately, both query (1) and query (2) again execute a full table scan. Simply having statistics about the table and columns does not help the optimizer distinguish how many rows have each value. The reason both queries execute a full table scan is the one-bucket histogram; any value selected is assumed to be evenly distributed within that one bucket. This, of course, is not the case. What is now necessary is a more accurate histogram so the optimizer knows how many rows exist for each column value. Looking at the test queries again we have

Query (1): select * from test_tab where b=5;

which should execute a full table scan as the preponderance of rows in the table have b=5.
Query (2): select * from test_tab where b=3;

which should execute an index range scan, as only 1 of the 10,000 rows has b=3. To correct this situation, dbms_stats.gather_table_stats() is executed again, this time specifying ten buckets for the histogram:

SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all indexed columns size 10');
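
The query used to display the histogram is not included here; something along these lines would produce the listing below:

select table_name, column_name, endpoint_number, endpoint_value
  from user_tab_histograms
 where table_name  = 'TEST_TAB'
   and column_name = 'B'
 order by endpoint_number;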

Querying USER_TAB_HISTOGRAMS, the new histogram is:

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----------- --------------- --------------

  TEST_TAB           B               1              1
  TEST_TAB           B               2              2
  TEST_TAB           B               3              3
  TEST_TAB           B               4              4
  TEST_TAB           B            9995              5
  TEST_TAB           B            9996           9996
  TEST_TAB           B            9997           9997
  TEST_TAB           B            9998           9998
  TEST_TAB           B            9999           9999
  TEST_TAB           B           10000          10000

The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER shows the cumulative number of rows up to and including that value. For example, ENDPOINT_VALUE 2 has an ENDPOINT_NUMBER of 2; the previous ENDPOINT_NUMBER is 1, so the number of rows with value 2 is the current ENDPOINT_NUMBER minus the previous ENDPOINT_NUMBER: 2 - 1 = 1 row having b=2. For ENDPOINT_VALUE 5 the ENDPOINT_NUMBER is 9995; the previous ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows have b=5. This now accurately reflects the data distribution in the table and is far more likely to produce the correct execution plan for each of the example queries, as the plans further below prove.
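
As an aside (not part of the original example), the same subtraction can be done directly in SQL with the LAG analytic function; for a frequency histogram such as this one it returns the row count per value:

select endpoint_value                              as column_value,
       endpoint_number
         - lag(endpoint_number, 1, 0)
             over (order by endpoint_number)       as row_count
  from user_tab_histograms
 where table_name  = 'TEST_TAB'
   and column_name = 'B'
 order by endpoint_number;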

SQL> select * from test_tab where b=5
SQL> /

Execution Plan

   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=10 Card=9991 Bytes=99910)
   1    0   TABLE ACCESS (FULL) OF 'TEST_TAB' (Cost=10 Card=9991 Bytes=99910)

SQL> select * from test_tab where b=3
SQL> /

Execution Plan

   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=6 Card=500 Bytes=5000)
   1    0   TABLE ACCESS (BY ROWID) OF 'TEST_TAB' (Cost=6 Card=500 Bytes=5000)
   2    1     INDEX (RANGE SCAN) OF 'TEST_TAB_B' (NON-UNIQUE)

For low cardinality data, a histogram having one bucket for each distinct value is fine and will not create height-balanced results. There may, however, be a need to create histograms on data with a larger number of distinct values; one bucket per value in that case creates far too much overhead, so the histogram is created with fewer buckets, producing height-balanced buckets with the exception of the last bucket, which may hold fewer values than the rest. If the histogram created in the example above used eight buckets instead of ten:

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ----------- --------------- --------------
  TEST_TAB           B               0              1
  TEST_TAB           B               1              5
  TEST_TAB           B               2              5
  TEST_TAB           B               3              5
  TEST_TAB           B               4              5
  TEST_TAB           B               5              5
  TEST_TAB           B               6              5
  TEST_TAB           B               7              5
  TEST_TAB           B               8          10000

Oracle creates the requested number of buckets by placing the same number of values into each bucket. The ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is the endpoint value of the bucket determined by the column value. Bucket 0 holds the lowest value for the column, which is 1. Buckets 1 through 7 contain values up to and including 5, with Bucket 8 containing values from 5 to 10000. Buckets 1 through 7 are height-balanced, so all have the same number of values, leaving Bucket 8 to contain fewer values. For this particular data such a histogram would be a poor choice; for tables with a large number of distinct values, spread over an even larger dataset, such a histogram would prove quite useful as it would provide a better distribution map than Oracle would normally assume exists.
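
For completeness, the eight-bucket histogram illustrated above would be gathered with a call along these lines (not shown in the original example):

SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all indexed columns size 8');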

David Fitzjarrell

Received on Wed Nov 02 2005 - 11:53:22 CST
