John,
Thanks for the great input. I think at this point we
are working our way toward an invalidate/load/rebuild
COMPUTE strategy for our indexes, so ESTIMATE is
probably not in the picture for them. Since the data
is already ordered in the index and does not have to
be sorted during an analyze, a COMPUTE on an index
should not be that expensive relative to a COMPUTE on
a table, and doing it as part of the rebuild adds less
overhead to the analyze step, or so I have read on
MetaLink.
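For illustration, what I have in mind is roughly the
following (the index and table names are made up, and
the 8i/9i COMPUTE STATISTICS clause is how I understand
it from the docs, so treat this as a sketch):

  -- gather index stats as part of the rebuild itself,
  -- instead of a separate ANALYZE pass afterwards
  ALTER INDEX ord_items_n1 REBUILD COMPUTE STATISTICS;

  -- or, when the index is created fresh after the load
  CREATE INDEX ord_items_n1 ON ord_items (order_id)
    COMPUTE STATISTICS;
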
One question - you mention that an index analyze
provides better data distribution. Could you discuss
what you found in more detail?
thx,
/jack
- John Kanagaraj <john.kanagaraj_at_hds.com> wrote:
> Jack,
>
> I conducted some tests of ANALYZE ESTIMATE vs
> COMPUTE for my IOUG paper and
> arrived at the following: (cut-and-paste of relevant
> parts of the paper)
>
> --- Begin Quote ---
> MYTH: "COMPUTE IS BETTER THAN ESTIMATE"
> This one actually generates endless debate, so we
> will not take a firm stand either way. Rather, we
> will present some figures that throw some light on
> the issue and allow us to step back and look at the
> situation. The problem with COMPUTE is that it has to
> scan the entire table, sort it, and figure out the
> exact data distribution. On the other hand, ESTIMATE
> steps through samples of the data, sorting and
> analyzing only a portion of it.
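>
> For reference, the two operations being compared look
> roughly like this (the table name is only an example,
> and the 10% matches the default ESTIMATE sample used
> in the test described below):
>
>   ANALYZE TABLE big_table COMPUTE STATISTICS;
>
>   ANALYZE TABLE big_table ESTIMATE STATISTICS
>     SAMPLE 10 PERCENT;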
>
> In a recent test of the effectiveness of COMPUTE
> versus ESTIMATE on a static clone of a reasonably
> large Oracle Apps database, statistics were generated
> and stored for both COMPUTE and ESTIMATE. The
> database consisted of about 3,300 tables and 6,000
> indexes and occupied approximately 120 GB. The
> ESTIMATE percentage was defaulted to 10%, and no
> activity other than ANALYZE was allowed on the clone
> during the entire period. Table statistics including
> row count, average row length and blocks occupied
> were compared. This showed differences in row count
> and average row length on 321 of the tables. Row
> count differences ranged from 53 rows fewer in the
> ESTIMATE of a table containing 205,743 rows (0.025%)
> all the way up to a difference of 101,704 rows in
> 13,311,090 rows (0.76%). Even assuming a maximum
> difference of 5% in these scenarios, you are not far
> off the goal. Further analysis showed that a small
> average row length coupled with a small table
> produced larger variations than usual.
>
>
> The differences, however, were far more pronounced
> in indexes - differences of up to 300% were noticed.
> Further analysis showed that this was related to the
> percentage of deleted leaf rows in the index. If this
> percentage was high, the possibility of ESTIMATE
> going wrong was also high, as the deletions were not
> factored in correctly. This was especially true if
> the deletions occurred in leaf blocks that were
> probably not involved in the ESTIMATE. When the
> number of deleted leaf rows in the index was low or
> even nil, the percentage difference was much lower,
> in the range of 4 to 5%.
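>
> As an aside, one way to see the deleted leaf row
> percentage for a given index (the index name here is
> only an example) is a structure validation:
>
>   ANALYZE INDEX big_table_n1 VALIDATE STRUCTURE;
>
>   SELECT lf_rows, del_lf_rows,
>          ROUND(del_lf_rows * 100 / GREATEST(lf_rows, 1), 2)
>            AS pct_deleted
>     FROM index_stats;
>
> INDEX_STATS holds a single row describing the index
> most recently validated in the session, so it must be
> queried right after the ANALYZE.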
>
> The real myth killer is the cost of COMPUTE versus
> ESTIMATE - COMPUTE required 66,553,308 reads versus
> 38,951,158 reads for ESTIMATE, almost 70% more reads
> for COMPUTE. The sorting involved in determining the
> averages and data distribution was the clincher -
> COMPUTE processed 4,263,724,259 rows in sorting
> operations while ESTIMATE sorted just 18,025,069,
> i.e. roughly 235 times as many rows were sorted for
> the COMPUTE operation. The last nail in the coffin
> was the time taken: about 36 hours to COMPUTE
> statistics against just 12 hours to ESTIMATE.
>
> While the figures speak for themselves, we will
> offer some general advice to the cautious: ESTIMATE
> on tables and COMPUTE on indexes. Columns are
> analyzed by default, but serve no useful purpose
> other than showing data spread. Hence, you could
> analyze only tables and indexed columns. An
> identified list of 'small' tables could also be
> COMPUTEd rather than ESTIMATEd. This advice is given
> because ESTIMATE on a table comes close as far as
> row count goes, while COMPUTE on indexes generates a
> more accurate picture of both data distribution and
> object size statistics. Testing the effectiveness of
> COMPUTE versus ESTIMATE is simple and provides
> figures you can use to decide the strategy for your
> situation.
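>
> As a rough sketch only (the object names are
> examples), the advice above translates into
> statements along these lines:
>
>   -- tables: estimated stats, indexed columns only
>   ANALYZE TABLE big_table ESTIMATE STATISTICS
>     FOR TABLE FOR ALL INDEXED COLUMNS
>     SAMPLE 10 PERCENT;
>
>   -- indexes: computed stats
>   ANALYZE INDEX big_table_n1 COMPUTE STATISTICS;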
>
> Before we move to the next topic, keep in mind that
> an ANALYZE/ESTIMATE with
> a sample size greater than or equal to 50% will
> result in COMPUTE.
>
> --- End Quote ---
>
> The problem is that this simple mathematical model
> looked only at object sizes and did not consider
> column spread and sensitivity. However, I believe
> that the combination of ESTIMATE on tables and
> COMPUTE on indexes would catch most of it.
>
> As always, YMMV!
>
> John Kanagaraj
> Oracle Applications DBA
> DBSoft Inc
> (W): 408-970-7002
>
> The manuals for Oracle are here:
> http://tahiti.oracle.com
> The manual for Life is here:
> http://www.gospelcom.net
>
> ** The opinions and statements above are entirely my
> own and not those of my
> employer or clients **
>
>
> > -----Original Message-----
> > From: Jack Silvey [mailto:jack_silvey_at_yahoo.com]
> > Sent: Tuesday, May 21, 2002 2:44 PM
> > To: Multiple recipients of list ORACLE-L
> > Subject: Statistical sampling and representative
> stats collection
> >
> >
> > Hi all,
> >
> > Did some investigation about statistical sampling
> this
> > weekend since we are going to optimize our analyze
> > process soon, and would like some input from all
> you
> > orabrains on this one.
> >
> > I opened a TAR with Oracle asking about the
> sampling
> > algorithm of stats collection, and they assured me
> it
> > was random.
> >
> > The goal of analyze...estimate is to collect stats
> > that are representative of the data population as
> > a whole using a given sample set. Since analyzing
> > tables takes up resources (it does sorts to order
> > the data for investigation), the fewer rows you use
> > in the estimate, the fewer system resources you use
> > and the faster the analyze will go.
> >
> > Since our goal is to get as small a sample as
> possible
> > and still have stats that are representative, my
> > contention is that we could start by finding what
> the
> > margin of error will be for each sample size and
> gauge
> > our tolerance for it.
> >
> > One standard way to calculate margin of error for
> a
> > given sample is by using this formula:
> >
> > M = 1/SQRT(N)
> >
> > where:
> > M = margin of error
> > N=sample size
> >
> > So, if we can tolerate stats that have a 1% margin
> > of error (i.e. they may deviate from representative
> > of the whole population by 1%), our sample size
> > should be 10,000 rows.
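> >
> > (Rearranging the formula, N = 1/M^2, so a 1% margin
> > (M = 0.01) gives N = 1/(0.01)^2 = 10,000 rows.)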
> >
> > Also, a corollary (not a toyota corollary, though)
> to
> > this would be that the more rows you add to your
>
=== message truncated ===