CBO: Skewed Data and "NOT EQUALS"

From: <john.howard.65_at_gmail.com>
Date: Thu, 02 Aug 2007 04:21:20 -0700
Message-ID: <1186053680.990529.281200@22g2000hsm.googlegroups.com>

I have a table with a column whose values are very skew. A large majority of rows has a single value for this column. I have an index, a not-null constraint and a frequency histogram on this column.

The CBO uses the histogram data correctly when I query with an "equals" condition. By which I mean if I want rows where the column is the majority value a full table scan is used and if I want rows for a low cardinality value then the index is used.

However this all changes if I use a "not equals" condition. If I ask for rows *not equal* to the majority value it does a full table scan when the index would be much better.

Setup:
(I'm running on 9.2 & 10.2)

CREATE TABLE skewtab AS
SELECT 'X' col , RPAD('X',4000) rest
FROM all_objects, all_objects
WHERE ROWNUM <=50000;

ALTER TABLE SKEWTAB
MODIFY(COL NOT NULL); UPDATE skewtab
SET col = 'Z'
WHERE ROWNUM <= 20;

UPDATE skewtab
SET col = 'Y'
WHERE ROWNUM <= 10;

CREATE INDEX I1 on skewtab (col);

ANALYZE TABLE skewtab COMPUTE STATISTICS;

ANALYZE TABLE skewtab COMPUTE STATISTICS FOR COLUMNS col SIZE 5;

ANALYZE INDEX i1 COMPUTE STATISTICS;

SQL> select *
  2 from skewtab
  3 where col <> 'X'
  4 /

20 rows selected.

Execution Plan

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=7636 Card=20 Bytes=80020)

1 0 TABLE ACCESS (FULL) OF 'SKEWTAB' (Cost=7636 Card=20 Bytes=80020)

Statistics

...

50082 consistent gets
...

Forcing the index with a hint :

SQL> select --+ INDEX(skewtab i1)
  2 *
  3 from skewtab
  4 where col <> 'X'
  5 /

20 rows selected.

Execution Plan

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=39 Card=20 Bytes=80020)

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'SKEWTAB' (Cost=39 Card=20 Bytes=80020)

2 1 INDEX (FULL SCAN) OF 'I1' (NON-UNIQUE) (Cost=92 Card=20)

Statistics

...

116 consistent gets
...

The optimizer can even see that this is better (Cost = 39 versus 7636) yet doesn't choose it.

Even this is a little disappointing when, rather than a full index scan, the CBO could transform the query and do 2 range scans:

SQL> select *
  2 from skewtab
  3 where col > 'X'
  4 or col < 'X'
  5 /

20 rows selected.

Execution Plan

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=4 Card=2 Bytes=8002) 1 0 CONCATENATION
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'SKEWTAB' (Cost=2 Card=1 Bytes=4001)

3 2 INDEX (RANGE SCAN) OF 'I1' (NON-UNIQUE) (Cost=2 Card=1)

4 1 TABLE ACCESS (BY INDEX ROWID) OF 'SKEWTAB' (Cost=2 Card=1 Bytes=4001)

5 4 INDEX (RANGE SCAN) OF 'I1' (NON-UNIQUE) (Cost=2 Card=1)

Statistics

...

28 consistent gets
...

Is this reasonable or should I ask for my money back?

Thanks for your time,
John Received on Thu Aug 02 2007 - 06:21:20 CDT