Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
![]() |
![]() |
Home -> Community -> Usenet -> c.d.o.server -> Re: Why index Scan in this case ?
Js wrote:
> Thanks Peter,
>
> Actually I could not clear my question. I know this is because
> histograms are not there.
>
> analyze table t compute statistics
> is equal to
> analyze table t compute statistics for table for all tables for all
> indexes.
>
> Whey you tried with histograms it pick the full table scan.
>
>
> My question was : I analyzed table and index.
>
> Oracle knows the number of rows. -- Table Stats
> Oracle knows the distinct no of columns - Index Stats
>
> Oracle know that this table has 5792 records and 1 distinct value.
>
> Anyhow index usage for all the would be costly.
>
> Then y still oracle is choosing index scan ?
As you've noted histograms make the difference:
SQL> -- create table
SQL> create table t as select * From sys.dba_objects;
Table created.
SQL> SQL> -- drastically change the data distribution SQL> update t set object_type = 'TABLE';
7820 rows updated.
SQL> SQL> -- add an index SQL> create index objtype_t on t(object_type);
Index created.
SQL> SQL> -- compute statistics and generate histograms SQL> exec
PL/SQL procedure successfully completed.
SQL> SQL> set autotrace traceonly SQL> SQL> -- basic select SQL> select * from t;
7820 rows selected.
Execution Plan
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=25 Card=7820 Bytes=6 49060) 1 0 TABLE ACCESS (FULL) OF 'T' (Cost=25 Card=7820 Bytes=649060 )
Statistics
0 recursive calls 0 db block gets 711 consistent gets 0 physical reads 0 redo size 507845 bytes sent via SQL*Net to client 6382 bytes received via SQL*Net from client 523 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 7820 rows processed SQL>
7820 rows selected.
Execution Plan
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=26 Card=7820 Bytes=6 49060) 1 0 TABLE ACCESS (FULL) OF 'T' (Cost=26 Card=7820 Bytes=649060 )
Statistics
0 recursive calls 0 db block gets 711 consistent gets 0 physical reads 0 redo size 507845 bytes sent via SQL*Net to client 6382 bytes received via SQL*Net from client 523 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 7820 rows processed
SQL> Computing histograms can help query access by eliminating the 'guesswork' the optimizer is doing with respect to data distribution. Optimizer plans, in the absence of histograms, are based upon the assumption of data being distributed evenly across the desired keys. Such assumptions may work well for many installations, however these same assumptions may wreak havoc on data distributed such that clusters and gaps appear as the optimizer will assume, wrongly, that a given range of values will be located in close proximity to each other. Should the data returned exceed 30% of the total data in the table a full-table scan may be chosen, when, in actuality, an index scan would be more appropriate. Histograms will illustrate the relative distribution of the data keys, and will provide the optimizer with better information with which to formulate an access plan. Histograms are created using 'buckets', or ranges, to group key values. Such groupings can be seen in the USER_TAB_HISTOGRAMS view; the endpoint number for each group is listed, as well as the normalized value for that group. By default DBMS_STATS computes histograms with one 'bucket' for each indexed column if the SIZE parameter is not specified; with fairly evenly distributed data such histograms are acceptable. With skewed data a better histogram is necessary, and with 9i and later releases Oracle provides the SIZE AUTO specification, allowing Oracle to determine how many 'buckets' will be necessary. An example may more clearly explain how this works.
A table, TEST_TAB, is created with the following structure:
Table TEST_TAB
Name Null? Type ------------------------------- -------- ---- A NUMBER(6) B NUMBER(6)
Skewed data is loaded into this table; Column A contains unique values from 1 to 10000, while Column B contains 10 distinct values: 1, 2, 3, 4, 5, 9996, 9997, 9998, 9999 and 10000. The value 5 occurs 9991 times in the data; all other values occur only once.
Test queries run against this data are listed below:
(1) select * from test_tab where b=5;
(2) select * from test_tab where b=3;
Both of the above queries execute a full table scan as no other access method is available.
Creating an index on column B changes the execution plans, resulting in both queries executing an index range scan in place of the full table scan. However, with an index present, it would be preferable to perform an index range scan for query (2), where a small percentage of rows satisfy the condition, but perform a full table scan for query (1), where the majority of the table data would be returned.
The table is analyzed using the dbms_stats.gather_table_stats() procedure;:
SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all columns');:
The computed statistics from dba_tables are shown below:
NUM_ROWS BLOCKS EMPTY_BLOCKS AVG_SPACE CHAIN_CNT AVG_ROW_LEN ---------- ---------- ------------ ---------- ---------- -----------
10000 64 0 86 0 10
And the statistics from dba_tab_columns :
NUM_DISTINCT LOW HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZ
SAMPLE_SIZE
------------ ---- ---- --------- ---------- ----------- -----------
10000 Full Full .0001 0 1 30-JUN-1999 10000
10 Full Full .1 0 1 30-JUN-1999 10000
>From the USER_TAB_HISTOGRAMS view the following information is
available:
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ---------- ----------- --------------- --------------
TEST_TAB A 0 1 TEST_TAB A 1 10000 TEST_TAB B 0 1 TEST_TAB B 1 10000
DBMS_STATS.GATHER_TABLE_STATS has created 1 bucket for each column, so all values for the column are in the same bucket. The ENDPOINT_NUMBER represents the bucket number and ENDPOINT_VALUE represents the last column value in that bucket.
Now, unfortunately, both query (1) and (2) again execute a full table scan. Simply having statistics about the table and columns does not help the optimizer distinguish how many rows have each value. The reason both queries execute a full table scan is the 1 bucket histogram; any value selected should be in that one bucket. This, of course,, is not the case. What is now necessary is to create a more accurate histogram so the Optimizer knows how many values occur for each column. Looking at the test queries again we have
Query (1): select * from test_tab where b=5;
which should execute a full table scan as the preponderance of rows in
the table have b=5.
Query (2): select * from test_tab where b=3;
which should execute an index range scan, as only 1 of the 10,000 values has b=3. To hopefully correct this situation dbms_stats.gather_table_stats() is executed again, this time specifying ten buckets for the histogram:
SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all indexed columns size 10');
Querying USER_TAB_HISTOGRAMS the new histogram is:
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ---------- ----------- --------------- --------------
TEST_TAB B 1 1 TEST_TAB B 2 2 TEST_TAB B 3 3 TEST_TAB B 4 4 TEST_TAB B 9995 5 TEST_TAB B 9996 9996 TEST_TAB B 9997 9997 TEST_TAB B 9998 9998 TEST_TAB B 9999 9999 TEST_TAB B 10000 10000
The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER shows the cumulative number of rows up to and including that value. For example, for ENDPOINT_VALUE 2 there is an ENDPOINT_NUMBER of 2; the previous ENDPOINT_NUMBER is 1, hence the number of rows with value 2 is the current ENDPOINT_NUMBER minus the previous ENDPOINT_NUMBER: 2 - 1 = 1 row having b=2; for ENDPOINT_VALUE 5.the ENDPOINT_NUMBER is 9995. The previous ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows where b=5. This now accurately reflects the data distribution in the table, and is more likely to provide correct execution plans for the two example queries. This is proven by the execution plans for both queries, shown below:
SQL> select * from test_tab where b=5
SQL> /
Execution Plan
SQL> select * from test_tab where b=3
SQL> /
Execution Plan
TEST_TAB B 0 1 TEST_TAB B 1 5 TEST_TAB B 2 5 TEST_TAB B 3 5 TEST_TAB B 4 5 TEST_TAB B 5 5 TEST_TAB B 6 5 TEST_TAB B 7 5 TEST_TAB B 8 10000
Oracle creates the requested number of buckets by placing the same number of values into each bucket. The ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is the endpoint value of the bucket determined by the column value. Bucket 0 holds the lowest value for the column, which is 1. Buckets 1 through 7 contain values up to and including 5, with Bucket 8 containing values from 5 to 10000. Buckets 1 through 7 are height-balanced, so all have the same number of values, leaving Bucket 8 to contain fewer values. For this particular data such a histogram would be a poor choice; for tables with a large number of distinct values, spread over an even larger dataset, such a histogram would prove quite useful as it would provide a better distribution map than Oracle would normally assume exists.
David Fitzjarrell Received on Wed Nov 02 2005 - 11:53:22 CST
![]() |
![]() |