RE: null values

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Mon, 18 Nov 2013 22:52:39 -0500
Message-ID: <04a701cee4da$cba191d0$62e4b570$_at_rsiz.com>

I think if you apply a predicate to null_col then you'll get good estimates, so *probably* this mis-estimate is limited to counting. You've got the test case handy, so give it a whirl. Something like where null_col > 0
should do it.

-----Original Message-----
From: Martijn Bos [mailto:maboc_at_maboc.nl] Sent: Monday, November 18, 2013 12:48 PM To: Mark W. Farnham
Cc: Oracle-L
Subject: Re: null values

Please find my comments inline

Mark, thanks for allready spending time on my problems.

Martijn

On Mon, Nov 18, 2013 at 06:52:21AM -0500, Mark W. Farnham wrote:
> I missed the input column that your stats included 900 num_rows.
>
> You may be on to something. If you add two rows, one null for that
> column and the other non-null, that num_rows stat should move to 901;
> does the estimate move to 1002?
>

insert into cbo_tab_1 values (1001,1,1,null,'filler'); insert into cbo_tab_1 values (1002,1,1,1,'filler'); commit;

Table Stats ***** TABLE_NAME STATUS PCT_FREE PCT_USED NUM_ROWS BLOCKS EMPTY_BLOCKS LAST_ANALYZED AVG_ROW_LEN --------------- -------- ---------- ---------- ---------- ---------- ------------ -------------------- ----------- CBO_TAB_1 VALID 99 1002 1126 0 2013-NOV-18 15:06:53 515
Table Column Stats ***** TABLE_NAME COLUMN_NAME NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZED SAMPLE_SIZE GLO USE AVG_COL_LEN HISTOGRAM --------------- --------------- ------------ ---------- ---------- ---------- ---------- ----------- -------------------- ----------- --- --- ----------- --------------- CBO_TAB_1 NULL_COL 900 C102 C20A64 .001111111 101 1 2013-NOV-18 15:06:53 901 YES NO 4 NONE
Index Stats ***** INDEX_NAME INDEX_TYPE UNIQUENES BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR NUM_ROWS LAST_ANALYZED ------------------------------ --------------------------- --------- ---------- ----------- ------------- ----------------- ----------
CBO_TAB_1_NULL_COL NORMAL NONUNIQUE 1 2 900 901 901 2013-NOV-18 15:06:54
Index Columns **** INDEX_NAME TABLE_NAME COLUMN_NAME COLUMN_POSITION ------------------------------ --------------- -------------------------
CBO_TAB_1_NULL_COL CBO_TAB_1 NULL_COL 1

SQL> explain plan for select count(null_col) from cbo_tab_1; _at_xpln

ID         OPERATION                                OBJECT
ALIAS                CARDINALITY       COST OPTIMIZER            BYTES
---------- ----------------------------------------
---------------------------------------- -------------------- -----------
---------- --------------- ----------
--- 0       SELECT STATEMENT()
1          2 ALL_ROWS                 4
--P 1        SORT(AGGREGATE)
1                                     4
--P 2         INDEX(FAST FULL SCAN)                 (INDEX)
MARTIJN.CBO_TAB_1_NULL_COL       CBO_TAB_1_at_SEL$1             1002          2
ANALYZED              4008

ID PROJECTION

     1 (#keys=0) COUNT("NULL_COL")[22]
     2 "NULL_COL"[NUMBER,22]

SQL> As you can see, the index_stas display 901 rows in the index. (which I think is correct).
However, the CBO estimates 1002 rows now. So the problem persits

> Perhaps the CBO is (incorrectly) just using the table row stat for
> unrestricted query cardinality estimates, still getting the correct
> index choice because it is smaller by blocks than the table and no
> other index contains the column.
>
> Of course for full scans the size in blocks drives the relevant cost,
> but if the CBO were considering nested loop amounts and multiplying by
> a number far off, it could produce sub optimal plans.
>

Exactly. (See also below)

> Hmm. Try your test with 900 nulls and 100 not-null (easiest test from
> your existing script by reversing logic of mod to insert null). That
> should show us if it is just off by 10% or completely using the table
> stats instead of the index stats for cardinality.
>

The insert now is done as follows:
insert into cbo_tab_1 select level,

                             mod(level,10), 
                             round(level/100), 
--                             decode(mod(level, 10), 0, null, level), 
                             decode(mod(level, 10), 0, level, null), 
                             'filler' 
                      from   dual 
                      connect by level<=1000;

***** Table Stats *****
TABLE_NAME      STATUS     PCT_FREE   PCT_USED   NUM_ROWS     BLOCKS
EMPTY_BLOCKS LAST_ANALYZED        AVG_ROW_LEN

--------------- -------- ---------- ---------- ---------- ----------

------------ -------------------- -----------
CBO_TAB_1       VALID            99                  1000       1000
0 2013-NOV-18 15:14:06         512


***** Table Column Stats *****
TABLE_NAME      COLUMN_NAME     NUM_DISTINCT LOW_VALUE  HIGH_VALUE
DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANALYZED        SAMPLE_SIZE GLO USE

AVG_COL_LEN HISTOGRAM

--------------- --------------- ------------ ---------- ----------
---------- ---------- ----------- -------------------- ----------- --- ---
----------- ---------------
CBO_TAB_1       NULL_COL                 100 C10B       C20B
.01        900           1 2013-NOV-18 15:14:06         100 YES NO

2 NONE

Index Stats ***** INDEX_NAME INDEX_TYPE UNIQUENES BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR NUM_ROWS LAST_ANALYZED ------------------------------ --------------------------- --------- ---------- ----------- ------------- ----------------- ----------
CBO_TAB_1_NULL_COL NORMAL NONUNIQUE 0 1 100 100 100 2013-NOV-18 15:14:06

The statistics show that the index has 100 rows. Again, I think that that should be enough information for the CBO, given the query.

And the plan output:
explain plan for select count(null_col) from cbo_tab_1; _at_xpln

ID         OPERATION                                OBJECT
ALIAS                CARDINALITY       COST OPTIMIZER            BYTES
---------- ----------------------------------------
---------------------------------------- -------------------- -----------
---------- --------------- ----------
--- 0       SELECT STATEMENT()
1          1 ALL_ROWS                 2
--P 1        SORT(AGGREGATE)
1                                     2
--P 2         INDEX(FULL SCAN)                      (INDEX)
MARTIJN.CBO_TAB_1_NULL_COL       CBO_TAB_1_at_SEL$1             1000          1
ANALYZED              2000

ID PROJECTION

     1 (#keys=0) COUNT("NULL_COL")[22]
     2 "NULL_COL"[NUMBER,22]

SQL> Now the cardianlity estimate is way off. So potentially creating sub-optimal exceution-paths for more complex query's.

I do notice that the excecution path changed from Fast Full Scan to Full Scan.

> "Thought" is the spelling you were wondering about.
>

Thank you. Somehow this is a very difficult word to spell for me.

> -----Original Message-----
> From: Martijn Bos [mailto:maboc_at_maboc.nl]
> Sent: Monday, November 18, 2013 1:22 AM
> To: Mark W. Farnham
> Subject: Re: null values
>
> Hi Mark,
>
> Thanks for your response.
> Please consider my remarks (inline)
>
> Best regards
> Martijn
>
> On Sun, Nov 17, 2013 at 09:57:05PM -0500, Mark W. Farnham wrote:
> > I think it is because there are 900 distinct keys, but the index is
> > non-unique, so it cannot know merely from the stats there actually
> > are no duplicates and the rest as nulls.
>
> whether there are duplicates or not is, as far as I understand, not so
> interesting.
>
> if my data is:
>
> 1
> 2
> null
> 1
> 2
> null
>
> then count(column) == 4
>
> So duplicates are not so important, in this case I think.
>
> > But it does get selected because it will be fewer total blocks for
> > the fast full scan than the table.
> >
>
> In the index_stats is a column num_rows which also says 900. I would
> have thaught (not sure how to spell the past tense of think :-)) that
> that statistics is sufficient.
>
>
> > mwf
> >
> > -----Original Message-----
> > From: oracle-l-bounce_at_freelists.org
> > [mailto:oracle-l-bounce_at_freelists.org]
> > On Behalf Of Martijn Bos
> > Sent: Sunday, November 17, 2013 4:51 PM
> > To: Oracle-L
> > Subject: null values
> >
> > Hi List,
> >
> > On a play-around sytem I do the following:
> > (This is a 11.2.0.4 RDBMS on linux x86 (32 bits))
> >
> > drop table cbo_tab_1;
> > create table cbo_tab_1 (id number, sca number(10,0), clu
> > number(10,0), null_col number(10,0), filler char(500)) pctfree 99
> > pctused 1;
> >
> > -- Following statement makes sure that ther actually are nulls in
> > column null_col insert into cbo_tab_1 select level, mod(level,10),
> > round(level/100),
> > decode(mod(level, 10), 0, null, level), 'filler' from dual connect by
> > level<=1000;
> > commit;
> >
> > create unique index CBO_TAB_1_ID on CBO_TAB_1(id);
> > -- create index cbo_tab_1_id on cbo_tab_1(id); create index
> > cbo_tab_1_sca on cbo_tab_1(sca); create index cbo_tab_1_clu on
> > cbo_tab_1(clu); create index cbo_tab_1_null_col on
> > cbo_tab_1(null_col);
> >
> > exec dbms_stats.gather_table_stats('MARTIJN','CBO_TAB_1',
> > block_sample=>false, cascade=>true, method_opt=>'for all columns
> > size 1'); alter system flush shared_pool; --Since I try it a lot I
> > need it to parse over and over again.
> >
> > explain plan for select count(null_col) from cbo_tab_1; --HERE IS
> > THE SQL OF WHICH I DON'T UNDERSTAND THE EXPLAIN PLAN
> >
> > SQL> _at_xpln -- A script I created just to get a little more
> > SQL> understanding of
> > explaining statement.
> >
> > ID OPERATION OBJECT
> > ALIAS CARDINALITY COST OPTIMIZER BYTES
> > ---------- ----------------------------------------
> > ---------------------------------------- --------------------
> > -----------
> > ---------- --------------- ----------
> > --- 0 SELECT STATEMENT()
> > 1 2 ALL_ROWS 4
> > --P 1 SORT(AGGREGATE)
> > 1 4
> > --P 2 INDEX(FAST FULL SCAN) (INDEX)
> > MARTIJN.CBO_TAB_1_NULL_COL CBO_TAB_1_at_SEL$1 1000
> 2
> > ANALYZED 4000
> >
> > ID PROJECTION
> > ------
> >
> ----------------------------------------------------------------------
> -----
> > 1 (#keys=0) COUNT("NULL_COL")[22]
> > 2 "NULL_COL"[NUMBER,22]
> > SQL>
> >
> > (if you find that suspect, I also added a more documented way to
> > display explain plans)
> >
> > SQL> select * from table(dbms_xplan.display(format=>'all'));
> >
> > PLAN_TABLE_OUTPUT
> > --------------------------------------------------------------------
> > --
> > ------
> > --------------------------------------------------------------------
> > --
> > ------
> > ------------------------------------------------
> > Plan hash value: 2862360477
> >
> > --------------------------------------------------------------------
> > --
> > ------
> > ----------------
> > | Id | Operation | Name | Rows | Bytes |
Cost
> > (%CPU)| Time |
> > --------------------------------------------------------------------
> > --
> > ------
> > ----------------
> > | 0 | SELECT STATEMENT | | 1 | 4 |
2
> > (0)| 00:02:01 |
> > | 1 | SORT AGGREGATE | | 1 | 4 |
> > | |
> > | 2 | INDEX FAST FULL SCAN| CBO_TAB_1_NULL_COL | 1000 | 4000 |
2
> > (0)| 00:02:01 |
> > --------------------------------------------------------------------
> > --
> > ------
> > ----------------
> >
> > Query Block Name / Object Alias (identified by operation id):
> > -------------------------------------------------------------
> >
> > 1 - SEL$1
> > 2 - SEL$1 / CBO_TAB_1_at_SEL$1
> >
> > Column Projection Information (identified by operation id):
> > -----------------------------------------------------------
> >
> > 1 - (#keys=0) COUNT("NULL_COL")[22]
> > 2 - "NULL_COL"[NUMBER,22]
> > SQL>
> >
> >
> >
> > Now the thing that is troubling me is the cardinality estimate at id
> > 2 (1000). I would say that the CBO could esitmate 900 rows.
> > Where does the CBO get's that idea of 1000?
> >
> >
> > SQL> select TABLE_NAME, STATUS, PCT_FREE, PCT_USED, NUM_ROWS,
> > SQL> BLOCKS, EMPTY_BLOCKS, LAST_ANALYZED, AVG_ROW_LEN
> > from user_tables
> > where table_name='CBO_TAB_1';
> >
> > TABLE_NAME STATUS PCT_FREE PCT_USED NUM_ROWS BLOCKS
> > EMPTY_BLOCKS LAST_ANALYZED AVG_ROW_LEN
> > --------------- -------- ---------- ---------- ---------- ----------
> > ------------ -------------------- -----------
> > CBO_TAB_1 VALID 99 1000 1000
> > 0 2013-NOV-17 22:27:10 515
> >
> > SQL> select index_name, index_type, uniqueness, blevel, leaf_blocks,
> > SQL> distinct_keys, clustering_factor, num_rows, last_analyzed
> > from user_indexes
> > where table_name='CBO_TAB_1';
> >
> > INDEX_NAME INDEX_TYPE UNIQUENES
> > BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR NUM_ROWS
> LAST_ANALYZED
> > ------------------------------ --------------------------- ---------
> > ---------- ----------- ------------- ----------------- ----------
> > --------------------
> > CBO_TAB_1_ID NORMAL UNIQUE
> > 1 2 1000 1000 1000 2013-NOV-17
> > 22:27:10
> > CBO_TAB_1_SCA NORMAL NONUNIQUE
> > 1 2 10 1000 1000 2013-NOV-17
> > 22:27:10
> > CBO_TAB_1_CLU NORMAL NONUNIQUE
> > 1 2 11 1000 1000 2013-NOV-17
> > 22:27:10
> > CBO_TAB_1_NULL_COL NORMAL NONUNIQUE
> > 1 2 900 900 900 2013-NOV-17
> > 22:27:10
> > SQL>
> >
> > As far as I'm concerned/understand the CBO chooses the right index
> > (CBO_TAB_1_NULL_COL). CBO_TAB_1_NULL_COL has 900 num rows, since the
> > other
> > 100 rows are null.
> > I would say that in this specific case it would be easy for the CBO
> > to estimate 900 rows since it only visits this index (all
> > information is in the index).
> >
> > Now I'm thinking that I might oversee something very simple (or that
> > i'm flat out wrong), so would anyone be so kind as to point me in
> > the right direction for some guidance, or (better yet :-)) explain
> > where my thinking is wrong.
> >
> >
> > Best Regards,
> > Martijn
> >
>

--
http://www.freelists.org/webpage/oracle-l

Received on Tue Nov 19 2013 - 04:52:39 CET

This message: [ Message body ]
Next message: Mark W. Farnham: "RE: null values"
Previous message: Jonathan Lewis: "RE: null values"
In reply to: Martijn Bos: "Re: null values"
Next in thread: Mark W. Farnham: "RE: null values"
Reply: Mark W. Farnham: "RE: null values"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message