Oracle-L: RE: Statistical sampling and representative stats collection

From: John Kanagaraj <john.kanagaraj_at_hds.com>
Date: Thu, 23 May 2002 17:13:23 -0800
Message-ID: <F001.0046A3DC.20020523171323@fatcity.com>

Jack,

I conducted some tests of ANALYZE ESTIMATE vs COMPUTE for my IOUG paper and arrived at the following: (cut-and-paste of relevant parts of the paper)

Begin Quote --- MYTH: "COMPUTE IS BETTER THAN ESTIMATE" This one generates an endless debate actually, so we will not take a firm stand either way. Rather, we will present some figures that throw some light on the issue and allow us to step back and look at the situation. The problem with COMPUTE is that it has to scan the entire table, sort it and figure out the exact data distribution. On the other hand, ESTIMATE steps through samples of the data, sorts and analyzes only a portion of the data.

In a recent test for the effectiveness of COMPUTE versus ESTIMATE on a static clone of a reasonably large Oracle Apps database, the statistics were generated and stored for both COMPUTE and ESTIMATE. The Database consisted of about 3,300 tables and 6,000 indexes and occupied approximately 120 Gb. The ESTIMATE percentage was defaulted to 10% and no activity other than ANALYZE was allowed on this clone during the entire period. Table statistics including row count, average row length and blocks occupied were analyzed. This showed that there were some differences in row count and average row length on 321 of these tables. Row count differences ranged from a value of 53 row less in the ESTIMATE of a table containing 205,743 rows (0.025%) all the way up-to a count difference of 101,704 in 13,311,090 rows (0.76%). Even assuming a difference of a maximum of 5% in these scenarios, you are not far off the goal. Further analysis showed that a smaller average row length coupled with a small table produced larger variations than was usually seen.

The differences however, were far more pronounced in Indexes - differences of upto 300% were noticed. Further analysis showed that this was related to the percentage of deleted leaf rows in the index. If this percentage is high, the possibility of ESTIMATE going wrong was also high, as the deletions are not factored in correctly. This was especially true if the deletions occurred in leaf blocks that were probably not involved in the ESTIMATE. When the deleted leaf rows was low or even nil within the index, the percentage difference was much lower, in the range of 4 to 5%.

The real myth killer is the cost of COMPUTE versus ESTIMATE - COMPUTE required 66,553,308 reads versus 38,951,158 reads for ESTIMATE - almost 70% more reads for COMPUTE. The sorting involved in determining the averages and data distribution was a clincher - COMPUTE processed 4,263,724,259 rows in sorting operations while ESTIMATE sorted just 18,025,069 - i.e. about 235% more rows were sorted for the COMPUTE operation. The last nail in the coffin was the time taken to COMPUTE statistics - about 36 hours against the time to ESTIMATE of just 12 hours.

While the figures speak for themselves, we will offer some general advice to the cautious: ESTIMATE on tables and COMPUTE on Indexes. Columns are analyzed by default, but serve no useful purpose other than showing data spread. Hence, you could ANALYZE only Tables and Indexed columns alone. An identified list of 'small' tables could also be COMPUTED rather than ANALYZED. This advice is given because ESTIMATE on a table comes close as far as row count goes, while COMPUTE on Indexes generates a more accurate picture of both data distribution as well as object size statistics. Testing the effectiveness of COMPUTE versus ANALYZE is simple and provides you with figures that you can use to decide the strategy for your situation.

Before we move to the next topic, keep in mind that an ANALYZE/ESTIMATE with a sample size greater than or equal to 50% will result in COMPUTE.

End Quote ---

The problem is that this simple mathematical model looks only at object sizes and did not look at Column spread and sensitivity. However, I believe that the combination of ESTIMATE on Tables and COMPUTE on Indexes would catch most of it.

As always, YMMV!

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com The manual for Life is here: http://www.gospelcom.net

The opinions and statements above are entirely my own and not those of my employer or clients **

> -----Original Message-----
> From: Jack Silvey [mailto:jack_silvey_at_yahoo.com]
> Sent: Tuesday, May 21, 2002 2:44 PM
> To: Multiple recipients of list ORACLE-L
> Subject: Statistical sampling and representative stats collection
>
>
> Hi all,
>
> Did some investigation about statistical sampling this
> weekend since we are going to optimize our analyze
> process soon, and would like some input from all you
> orabrains on this one.
>
> I opened a TAR with Oracle asking about the sampling
> algorithm of stats collection, and they assured me it
> was random.
>
> The goal of analyze...estimate is to collect stats
> that are representative of the data population as a
> whole using a given sample set. Since analyzing tables
> takes up resources (does sorts to order the data for
> investigation) the fewer rows you use in estimate, the
> less system resources you use and the faster the
> analyze will go.
>
> Since our goal is to get as small a sample as possible
> and still have stats that are representative, my
> contention is that we could start by finding what the
> margin of error will be for each sample size and gauge
> our tolerance for it.
>
> One standard way to calculate margin of error for a
> given sample is by using this formula:
>
> M = 1/SQRT(N)
>
> where:
> M = margin of error
> N=sample size
>
> So, if we can tolerate stats that have a 1% a margin
> of error (will deviate from representative of the
> whole population by 1%), our sample size should be
> 10,000 rows.
>
> Also, a corollary (not a toyota corollary, though) to
> this would be that the more rows you add to your
> sample, the closer to representative your sample will
> be. So, in order to test whether your sample is
> representative enough, you could analyze using either
> estimate 49% or compute, take a snapshot of the stats,
> and then compare the stats from a 10,000 row estimate
> to those. Then, add rows to your estimate until you
> are satisfied with the stats.
>
> This of course is a pie in the sky mathematical model,
> but seems like a reasonable place to start with
> testing.
>
> Input? Input? Buhler? Buhler?
>
>
> /jack silvey
>
>
> __________________________________________________
> Do You Yahoo!?
> LAUNCH - Your Yahoo! Music Experience
> http://launch.yahoo.com
> --
> Please see the official ORACLE-L FAQ: http://www.orafaq.com
> --
> Author: Jack Silvey
> INET: jack_silvey_at_yahoo.com
>
> Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
> San Diego, California -- Public Internet access / Mailing Lists
> --------------------------------------------------------------------
> To REMOVE yourself from this mailing list, send an E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
> the message BODY, include a line containing: UNSUB ORACLE-L
> (or the name of mailing list you want to be removed from). You may
> also send the HELP command for other information (like subscribing).
>

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: John Kanagaraj
  INET: john.kanagaraj_at_hds.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Thu May 23 2002 - 20:13:23 CDT