Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> RE: Statistical sampling and representative stats collection

RE: Statistical sampling and representative stats collection

From: Larry Elkins <elkinsl_at_flash.net>
Date: Wed, 29 May 2002 03:58:51 -0800
Message-ID: <F001.0046DD72.20020529035851@fatcity.com>


> -----Original Message-----
> From: root_at_fatcity.com [mailto:root_at_fatcity.com]On Behalf Of John
> Kanagaraj
> Sent: Tuesday, May 28, 2002 1:49 PM
> To: Multiple recipients of list ORACLE-L
> Subject: RE: Statistical sampling and representative stats collection
>
>
> Btw, there is an option in 9i DBMS_STATS.AUTO_SAMPLE_SIZE. Has
> anyone tried
> this out? Or know how it works?

I would be interested in that as well.

> > for these stats on non-indexed columns? If so, I disagree on
> > this point,
> > even if it is general advice and not a rule. Stats on
> > non-indexed columns
> > can play a *large* role in CBO decisions.
>
> I agree with you, although I do have to contend that the effect
> is not very
> pronounced in all databases. This was revealed in some depth in Wolfgang's
> paper at IOUG where he was able to actually determine this in a
> 10053 trace,
> and it was an eye opener for me. The issue I have with this is that the
> default number of buckets is 2 and that is nowhere near what is needed. On
> the other hand, indiscriminately increasing the bucket size would
> leave you
> with a _Large_ number of histograms and result in the 'row cache
> lock' latch
> being taken out more that it should have been (as well as add to
> the cost of
> parsing).

I would only consider gathering this info, or customizing the stats gathering process in general, on a case by case basis and only when needed. Due to the nature of the data and queries, you might find the need to gather the non-indexed column stats on a single table and 1 column, or maybe 2 tables, or maybe many tables and columns, or maybe none at all.

You already know, then, how stats on non-indexed columns can play a role. But since I received a few back-channel emails asking for examples of how stats on the non-indexed columns make a difference, I'll include the following extreme example that illustrates. But by no means am I saying you should always gather them, just on an as needed basis. And I haven't included the 10053 traces since that could get very tedious ;-)

Say we have two tables, tables A and B, each consisting of two columns, A and B. Column A is the PK on each, column B is not indexed. So:

Table A: 50,000 Rows
Table B: 50,000 Rows

Column A.A: (PK) Values 1 thru 50,000
Column B.A: (PK) Values 1 thru 50,000
Column A.B: Not indexed, values 1 thru 25000, each occurring twice – e.g.
1,1,2,2,3,3...
Column B.B: Not indexed, values 0 and 1, each occurring 25,000 times

Note that I am not really even dealing with any skewness at this point – this is an even distribution of values. Here are 4 sample queries:

Sample Queries:

Select *
>From A, B

WHERE A.A = B.A
  and A.B = 5 –- Filters down to 2 rows, I want to drive by A in an NL fashion

Select *
>From A, B

WHERE A.A = B.A
  and A.B = 5 –- Filters down to 2 rows, I still want to drive by A in an NL fashion
  and B.B = 1 –- Filters down to 25,000 rows if treated standalone

Select *
>From A, B

WHERE A.A = B.A
  and B.B = 1 –- Filters down to 25,000 rows, I want FTS's and HJ.

Select *
>From A, B

WHERE A.A = B.A
  and B.B = 27000 –- no rows meet this, I want to drive the query by table B.

With a basic "compute for table for all indexed columns", we get hash joins with all four, in the case of the first two, driving by table A, and the next two driving by table B. There were no stats on the non-indexed column. This isn't what I really wanted for all 4 queries.

Now, if I simply do an "analyze table compute statistics", in which case stats are gathered for both columns A and B, you get the default bucket and two rows for each column over in DBA_TAB_HISTOGRAMS.

In the case of the first 2 queries, I get a nested loops driven by table A using an FTS and an index lookup into table B. This is what I want because of the really restrictive filtering criteria on table A’s non-indexed column B, resulting in two rows being selected, and two indexed lookups into table B. The stats on the non-indexed column helped the CBO make this decision.

In the case of queries 3 and 4, I would get a MERGE JOIN, using a full index scan on A’s PK to get each row, for the purpose of feeding the data in sorted order (I might have preferred an FTS and actually doing the sort!). And an FTS on table B. Note that by increasing the HASH_AREA_SIZE, queries 3 and 4 simply went with FTS’s and an HJ, driving by table B. I would have hoped that in the case of query 4 that the CBO would have chosen to drive by table B, using a nested loops index approach into table A. Upon no rows being returned from table B, table A would not have been accessed at all.

So, we go back and analyze specifying a SIZE for column B. Queries 1 and 2 use a nested loops driving by table A, and the restrictive filtering criteria result in 2 indexed lookups into table B. This is what I want. In the case of query 3, and having dropped the hash area size back down, it does FTS’s and an HJ driving by table B. This is what I want.

And in the case of query 4, it uses a nested loops approach, driving by an FTS on table B, and an indexed lookup into table A. This is what we want. The criteria on the non-indexed column on table B will filter out all the rows, meaning table A will not even be accessed. Without the stats on the non-indexed column on table B, the CBO would have FTS’d both tables doing a hash join.

Ok, so this is an extreme example, and it is not a one size fits all approach that should always be applied. This was simply an example case illustrating how the stats on the non-indexed columns can play a role, and how we might even need to specify the SIZE to get the desired results (going back to your comment about Wolfgang's 10053 examples).

Larry G. Elkins
elkinsl_at_flash.net
214.954.1781

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Larry Elkins
  INET: elkinsl_at_flash.net

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).
Received on Wed May 29 2002 - 06:58:51 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US