Home » RDBMS Server » Performance Tuning » cardinality of join affected by transitivity (10201)
cardinality of join affected by transitivity [message #437114] |
Thu, 31 December 2009 02:24 |
stsy38
Messages: 14 Registered: December 2009 Location: france
|
Junior Member |
|
|
Hello
I have a little problem on the estimated cardinality of join when oracle applies transitivity.
example
i create a table a (id number);
I fill it with unique values representing AAAAMM
(from 200101 to 202501) => 289 rows unique)
i create a table b (id number)
i fill it as
insert into b select * from a; => 8192 times
The problem is for the query :
select count(*) from b,a where a.id=b.id and a.id between 200810 and 200903;
Here is the xplain plan. I dont analyze the table but even with analyze the cardinality of join is wrong
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 331 (12)| 00:00:04 |
| 1 | SORT AGGREGATE | | 1 | 26 | | |
|* 2 | HASH JOIN | | 1032 | 26832 | 331 (12)| 00:00:04 |
|* 3 | TABLE ACCESS FULL| A | 6 | 78 | 2 (0)| 00:00:01 |
|* 4 | TABLE ACCESS FULL| B | 49713 | 631K| 328 (11)| 00:00:04 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("A"."ID"="B"."ID")
3 - filter("A"."ID">=200810 AND "A"."ID"<=200903)
4 - filter("B"."ID">=200810 AND "B"."ID"<=200903)
Note
-----
- dynamic sampling used for this statement
the 2 cardinality on the table are good but this of the join is reduced. Miss i something or is it a normal behavior of the optimizer ?
Thanks for your answer
Happy new year in advance
(excuse for my average english)
|
|
|
Re: cardinality of join affected by transitivity [message #437152 is a reply to message #437114] |
Thu, 31 December 2009 05:47 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
Oracle does not know that every row in A has matching rows in B, so it assumes that there are sometimes misses.
If you created a foreign key on B that FORCED a match in A, then I dare say the cardinality would be a bit closer.
Ross Leishman
|
|
|
Re: cardinality of join affected by transitivity [message #437155 is a reply to message #437114] |
Thu, 31 December 2009 06:02 |
stsy38
Messages: 14 Registered: December 2009 Location: france
|
Junior Member |
|
|
hello
thanks for your answer
However if i make the following query there is no problem on cardinality (so for the optimizer, all rows of b has matching rows in B)
select count(*) from b,a where a.id=b.id;
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 342 (15)| 00:00:04 |
| 1 | SORT AGGREGATE | | 1 | 10 | | |
|* 2 | HASH JOIN | | 2367K| 22M| 342 (15)| 00:00:04 |
| 3 | TABLE ACCESS FULL| A | 289 | 1445 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| B | 2367K| 11M| 321 (10)| 00:00:04 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("A"."ID"="B"."ID")
What surprises me is that the
cardinality formula for a join is
Carda x Cardb /(max(ndv(id.joina),ndv(id.joinb))
the problem is that the optimizer seems to calculate as
49713 x 6 / max (6,289) = 1032
It estimates the ndv of table b filtered to 296 although there is 6 distinct values !! However the optimizer knows the distinct values of id
explain plan for
select distinct id from b where id between 200810 and 200903;
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6 | 30 | 337 (14)| 00:00:04 |
| 1 | SORT UNIQUE | | 6 | 30 | 337 (14)| 00:00:04 |
|* 2 | TABLE ACCESS FULL| B | 46604 | 227K| 333 (13)| 00:00:04 |
---------------------------------------------------------------------------
Miss i something ?
Thanks
|
|
|
|
Re: cardinality of join affected by transitivity [message #437193 is a reply to message #437158] |
Thu, 31 December 2009 20:58 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
Is transitivity the problem? What happens when you add the transitive predicate literally to the SQL?
select count(*)
from b,a
where a.id=b.id
and a.id between 200810 and 200903
and b.id between 200810 and 200903
And could you please post your SQLs and plans in [code]..[/code] tags to make them more readable.
Ross Leishman
|
|
|
Re: cardinality of join affected by transitivity [message #437218 is a reply to message #437155] |
Fri, 01 January 2010 18:33 |
narsap
Messages: 8 Registered: December 2009
|
Junior Member |
|
|
Quote:
What surprises me is that the
cardinality formula for a join is
Carda x Cardb /(max(ndv(id.joina),ndv(id.joinb))
the problem is that the optimizer seems to calculate as
49713 x 6 / max (6,289) = 1032
It estimates the ndv of table b filtered to 296 although there is 6 distinct values !! However the optimizer knows the distinct values of id
I think you got confused in the interpretation of your formula.
The NDV(id.joina) is number of ditinct values in the entire table (which is 289) , and not the filtered subset generated by the query (which is 6). Same applies for table B.
With these corrections applied, your formula for calculating cardinality of hash join becomes:
49713 x 6 / max (289,289) = 1032 which is what you can see in the plan. Hope this helps.
p.s. BTW, when I tried your test case on 10.2.0.1, I got different figures for individual table access steps. Especially, the rows from table B showed way different figure (107K) from the actual count (49152). Even the rows from table A showed way different figure (13) from the actual count (6). It might be a limitation of the optimizer which oracle has managed to fix in later versions.
|
|
|
Re: cardinality of join affected by transitivity [message #437407 is a reply to message #437114] |
Mon, 04 January 2010 05:27 |
stsy38
Messages: 14 Registered: December 2009 Location: france
|
Junior Member |
|
|
Hello
happy new year
rleishman : nothing changes with youy suggestion
narsap : are you sure that the ndv doesn t concern the filtering ?? I don t think
In metalink note (How ORACLE CBO Calculate Join Cardinality with Join Elimination Filter? [ID 728004.1]
) oracle says :
"oracle consider the max ever NDV in the result set, considering the join filter as NDV eliminator"
thanks
|
|
|
|
Re: cardinality of join affected by transitivity [message #438027 is a reply to message #437457] |
Wed, 06 January 2010 20:04 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
I agree with @narsap that the unfiltered NDV values seem to be what Oracle IS using, but that's not what they SHOULD be using.
Consider the classic EMP / DEPT tables. DEPT contains 4 rows, EMP contains 13 rows; there is a Foreign Key between the two, and EMP contains 3 distinct DEPTNO values.
SELECT *
FROM emp
JOIN dept USING (deptno)
Cardinality = 4 x 13 / greatest(4,3) = 13 - CORRECT
SELECT *
FROM emp
JOIN dept USING (deptno)
WHERE deptno = 20
Cardinality = 1 x 5 / greatest(4,3) = 1.2
However using filtered NDVs, we get:
Cardinality = 1 x 5 / greatest(1,1) = 5
Running this test case on the Oracle SCOTT Schema, I get the answer 5 in Explain Plan.
This demonstrates that @stsy38's experience is unusual, whereby CBO appears to use unfiltered NDVs in the calculation.
What happens when you perform this test case on the SCOTT schema?
Ross Leishman
|
|
|
|
Re: cardinality of join affected by transitivity [message #438174 is a reply to message #438157] |
Thu, 07 January 2010 06:10 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
@narsap,
Let's look at it another way.
Can you provide an example where using unfiltered estimates of the Number of Distinct Values in the denominator of that cardinality calculation would produce anything remotely like an accurate estimate of cardinality?
Mathematically, filtered NDVs are the only way it makes sense to me.
Ross Leishman
|
|
|
Re: cardinality of join affected by transitivity [message #438220 is a reply to message #438174] |
Thu, 07 January 2010 10:07 |
narsap
Messages: 8 Registered: December 2009
|
Junior Member |
|
|
Ross,
My sincere apologies for making a not-so-correct statement in the earlier post. Your (and OP's) interpretation of NDV is is in-line with what oracle appears to be saying (on metalink). But as you said, that appears to be true only in case of equality predicates. In case of range predicates, oracle appears to be using a figure which is close to the maximum of NDVs on entire table data. And this appears to occur irrespective of transitive closure exists or not.
|
|
|
Re: cardinality of join affected by transitivity [message #438397 is a reply to message #438220] |
Fri, 08 January 2010 11:03 |
narsap
Messages: 8 Registered: December 2009
|
Junior Member |
|
|
Ok. Jonathan Lewis & Alberto Dell'Era to rescue again.
First, it is still a fact that the join cardinality formula involves NDVs which are for a result-set and not for a table. That is for sure (but NEVER SAY SURE with Oracle see results for my test case below ).
Now, back to OP's test case, its results and their explanation.
In this case, oracle calculates the join cardinality as:
(Filtered cardinality of A) * (Filtered cardinality of B) / Graetest (NDV(A.id) in filtered results, NDV(B.id) in filtered results)
Now, the figures that we see in EXPLAIN PLAN are estimates and hence the formula is bound to use estimates, rather than actuals.
So the formula becomes:
(Estimated filtered cardinality of A) * (Estimated filtered cardinality of B) / Graetest (Adjusted NDV(A.id) in filtered results, Adjusted NDV(B.id) in filtered results)
Now, we have estimated filtered cardinality for table A (6), estimated filtered cardinality for table B (49713). Assuming (as OP has not shared those details) statistics collection has accurately populated the NUM_DISTINCT column in USER_TAB_COLUMNS for both A & B tables, it will be 289. Using this, we can calculate adjusted NDV for B = 289 * (1 - ((289 - 1)/289)^49713) = 289. Similarly adjusted NDV for A = 289 * (1 - ((289 - 1)/289)^6) = 5.94 ~ 6.
Putting these values in the cardinality formula, we get:
Join cardinality = (49713 * 6) / Greatest(6, 289) = 1032, which is what OP gets.
Now the fun part. When I executed the test case, I never got join cardinality exactly same as formula when using BETWEEN operator (but I got it when using equality or IN-LISTs). Here are my results:
SQL> select * from v$version ;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Solaris: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> drop table a purge ;
Table dropped.
SQL> drop table b purge ;
Table dropped.
SQL> create table a nologging as select to_number(to_char(add_months(to_date(200101, 'YYYYMM'), (level - 1)), 'YYYYMM')) id from dual connect by level <= 289 ;
Table created.
SQL> create table b nologging as select to_number(to_char(add_months(to_date(200101, 'YYYYMM'), mod((level - 1), 289)), 'YYYYMM')) id from dual connect by level <= 289 * 8192 ;
Table created.
SQL> REM How stats are collected on B may have significant impact on the cardinalities.
SQL> REM For the purpose of this test, I will collect stats with 100% data sampled.
SQL> exec dbms_stats.gather_table_stats(user, 'A', estimate_percent=>NULL) ;
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>NULL) ;
PL/SQL procedure successfully completed.
SQL> select num_distinct, density, num_nulls, histogram, num_rows from user_tab_columns uc, user_tables ut where uc.table_name = ut.table_name and uc.table_name in ('A', 'B') ;
NUM_DISTINCT DENSITY NUM_NULLS HISTOGRAM NUM_ROWS
------------ ---------- ---------- --------------- ----------
289 .003460208 0 NONE 289
289 .003460208 0 NONE 2367488
SQL> explain plan for select count(*) from b,a where a.id = b.id and a.id between 200810 and 200903 ;
Explained.
SQL> select * from table(dbms_xplan.display) ;
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 879 |
| 1 | SORT AGGREGATE | | 1 | 10 | |
| 2 | HASH JOIN | | 4938 | 49380 | 879 |
| 3 | TABLE ACCESS FULL| A | 13 | 65 | 3 |
| 4 | TABLE ACCESS FULL| B | 108K| 527K| 875 |
------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
14 rows selected.
SQL> REM This shows that skewness of data in B and absense of histograms results in
SQL> explain plan for select count(*) from b,a where a.id = b.id and a.id between 200810 and 200903 ;
Explained.
SQL> select * from table(dbms_xplan.display) ;
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 879 |
| 1 | SORT AGGREGATE | | 1 | 10 | |
| 2 | HASH JOIN | | 4938 | 49380 | 879 |
| 3 | TABLE ACCESS FULL| A | 13 | 65 | 3 |
| 4 | TABLE ACCESS FULL| B | 108K| 527K| 875 |
------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
14 rows selected.
SQL> REM let's see if the column usage affects the stats.
SQL> exec dbms_stats.gather_table_stats(user, 'A', estimate_percent=>null) ;
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats(user, 'B', estimate_percent=>null) ;
PL/SQL procedure successfully completed.
SQL> select num_distinct, density, num_nulls, histogram, num_rows from user_tab_columns uc, user_tables ut where uc.table_name = ut.table_name and uc.table_name in ('A', 'B') ;
NUM_DISTINCT DENSITY NUM_NULLS HISTOGRAM NUM_ROWS
------------ ---------- ---------- --------------- ----------
289 .003460208 0 HEIGHT BALANCED 289
289 .003460208 0 HEIGHT BALANCED 2367488
SQL> REM so it appears column usage does affect and now we have histograms
SQL> REM let's see if it helps the cardinality.
SQL> explain plan for select count(*) from b,a where a.id = b.id and a.id between 200810 and 200903 ;
Explained.
SQL> select * from table(dbms_xplan.display) ;
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 879 |
| 1 | SORT AGGREGATE | | 1 | 10 | |
| 2 | HASH JOIN | | 917 | 9170 | 879 |
| 3 | TABLE ACCESS FULL| A | 6 | 30 | 3 |
| 4 | TABLE ACCESS FULL| B | 46604 | 227K| 875 |
------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
14 rows selected.
SQL> REM and yes, it does affect. The cardinalities are much accurate now
SQL> REM the adjusted num_distinct for B is 289 * (1 - ((289 - 1)/289) ^ 46604) = 289
SQL> REM which makes the join cardinality as 46604 * 6 / MAX(289, 6) = 968
SQL> REM Histograms or no histograms, the hash join cardinality formula appears to give close enoughresults.
SQL> explain plan for select count(*) from b,a where a.id = b.id and a.id = 200903 ;
Explained.
SQL> select * from table(dbms_xplan.display) ;
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 873 |
| 1 | SORT AGGREGATE | | 1 | 10 | |
| 2 | HASH JOIN | | 8192 | 81920 | 873 |
| 3 | TABLE ACCESS FULL| A | 1 | 5 | 3 |
| 4 | TABLE ACCESS FULL| B | 8192 | 40960 | 869 |
------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
14 rows selected.
SQL> explain plan for select count(*) from b,a where a.id = b.id and a.id in (200810, 200811, 200812
, 200901, 200902, 200903) ;
Explained.
SQL> select * from table(dbms_xplan.display) ;
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 10 | 949 |
| 1 | SORT AGGREGATE | | 1 | 10 | |
| 2 | HASH JOIN | | 1020 | 10200 | 949 |
| 3 | TABLE ACCESS FULL| A | 6 | 30 | 3 |
| 4 | TABLE ACCESS FULL| B | 49152 | 240K| 945 |
------------------------------------------------------------
Note
-----
- 'PLAN_TABLE' is old version
14 rows selected.
SQL> spool off
So does it mean the BETWEEN clause is causing issues here or is it simply an optimizer limitation (or bug) for my version?
|
|
|
|
|
Goto Forum:
Current Time: Fri Nov 22 14:06:19 CST 2024
|