Re: CBO: Skewed Data and "NOT EQUALS"

From: DA Morgan <damorgan_at_psoug.org>
Date: Thu, 02 Aug 2007 04:49:09 -0700
Message-ID: <1186055347.36919@bubbleator.drizzle.com>

john.howard.65_at_gmail.com wrote:
> I have a table with a column whose values are very skew. A large
> majority of rows has a single value for this column. I have an index,
> a not-null constraint and a frequency histogram on this column.
>
> The CBO uses the histogram data correctly when I query with an
> "equals" condition. By which I mean if I want rows where the column is
> the majority value a full table scan is used and if I want rows for a
> low cardinality value then the index is used.
>
> However this all changes if I use a "not equals" condition. If I ask
> for rows *not equal* to the majority value it does a full table scan
> when the index would be much better.
>
> Setup:
> (I'm running on 9.2 & 10.2)
>
> CREATE TABLE skewtab AS
> SELECT 'X' col , RPAD('X',4000) rest
> FROM all_objects, all_objects
> WHERE ROWNUM <=50000;
>
> ALTER TABLE SKEWTAB
> MODIFY(COL NOT NULL);
>
> UPDATE skewtab
> SET col = 'Z'
> WHERE ROWNUM <= 20;
>
> UPDATE skewtab
> SET col = 'Y'
> WHERE ROWNUM <= 10;
>
> CREATE INDEX I1 on skewtab (col);
>
> ANALYZE TABLE skewtab COMPUTE STATISTICS;
>
> ANALYZE TABLE skewtab COMPUTE STATISTICS FOR COLUMNS col SIZE 5;
>
> ANALYZE INDEX i1 COMPUTE STATISTICS;
>
>
> SQL> select *
> 2 from skewtab
> 3 where col <> 'X'
> 4 /
>
> 20 rows selected.
>
> Execution Plan
> ----------------------------------------------------------
> 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=7636 Card=20
> Bytes=80020)
> 1 0 TABLE ACCESS (FULL) OF 'SKEWTAB' (Cost=7636 Card=20
> Bytes=80020)
>
>
> Statistics
> ----------------------------------------------------------
> ...
> 50082 consistent gets
> ...
>
>
> Forcing the index with a hint :
>
> SQL> select --+ INDEX(skewtab i1)
> 2 *
> 3 from skewtab
> 4 where col <> 'X'
> 5 /
>
> 20 rows selected.
>
> Execution Plan
> ----------------------------------------------------------
> 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=39 Card=20
> Bytes=80020)
> 1 0 TABLE ACCESS (BY INDEX ROWID) OF 'SKEWTAB' (Cost=39
> Card=20 Bytes=80020)
> 2 1 INDEX (FULL SCAN) OF 'I1' (NON-UNIQUE) (Cost=92 Card=20)
>
> Statistics
> ----------------------------------------------------------
> ...
> 116 consistent gets
> ...
>
> The optimizer can even see that this is better (Cost = 39 versus 7636)
> yet doesn't choose it.
>
> Even this is a little disappointing when, rather than a full index
> scan, the CBO could transform the query and do 2 range scans:
>
> SQL> select *
> 2 from skewtab
> 3 where col > 'X'
> 4 or col < 'X'
> 5 /
>
> 20 rows selected.
>
>
> Execution Plan
> ----------------------------------------------------------
> 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=4 Card=2 Bytes=8002)
> 1 0 CONCATENATION
> 2 1 TABLE ACCESS (BY INDEX ROWID) OF 'SKEWTAB' (Cost=2
> Card=1 Bytes=4001)
> 3 2 INDEX (RANGE SCAN) OF 'I1' (NON-UNIQUE) (Cost=2
> Card=1)
> 4 1 TABLE ACCESS (BY INDEX ROWID) OF 'SKEWTAB' (Cost=2
> Card=1 Bytes=4001)
> 5 4 INDEX (RANGE SCAN) OF 'I1' (NON-UNIQUE) (Cost=2
> Card=1)
>
> Statistics
> ----------------------------------------------------------
> ...
> 28 consistent gets
> ...
>
>
> Is this reasonable or should I ask for my money back?
>
> Thanks for your time,
> John

Consider using a function based index of a type I first saw mentioned by Tom Kyte a few years back. Go to Morgan's Library at www.psoug.org. Scroll down to Indexes and on the Indexes page search for Tom Kyte.

As an additional advantage the size of your index will be substantially smaller.

-- 
Daniel A. Morgan
University of Washington
damorgan_at_x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org

Received on Thu Aug 02 2007 - 06:49:09 CDT