Jack,
Isn't the problem with this concept that it doesn't take into consideration
how skewed the data is? Statistically significance would be relevant to
perfectly distributed data but wouldn't you need a higher percentage of
data for significance in more highly skewed data?
Just something to consider.
Cherie Machler
Oracle DBA
Gelco Information Network
Jack Silvey
<jack_silvey_at_y To: Multiple recipients of list ORACLE-L <ORACLE-L_at_fatcity.com>
ahoo.com> cc:
Sent by: Subject: Re: Statistical sampling and representative stats
root_at_fatcity.c collection
om
05/21/02 06:19
PM
Please respond
to ORACLE-L
Hi Rafiq,
We have been using 35 percent on our warehouse, even
on our fact partitions. Now that I have thought about
it for a while, that seems like a lot given the volume
of data. If a representative sample can be gathered
with 10,000 or 50,000 or 100,000 rows, and our fact
partitions have millions of rows each, seems like we
could go 1% on our analyze and it would be within
acceptable tolerances.
/jack
- Mohammad Rafiq <rafiq9857_at_hotmail.com> wrote:
> The most of the list memeber agrees on estimate with
> 30%....
>
> Regards
> Rafiq
>
>
>
>
> Reply-To: ORACLE-L_at_fatcity.com
> To: Multiple recipients of list ORACLE-L
> <ORACLE-L_at_fatcity.com>
> Date: Tue, 21 May 2002 13:43:33 -0800
>
> Hi all,
>
> Did some investigation about statistical sampling
> this
> weekend since we are going to optimize our analyze
> process soon, and would like some input from all you
> orabrains on this one.
>
> I opened a TAR with Oracle asking about the sampling
> algorithm of stats collection, and they assured me
> it
> was random.
>
> The goal of analyze...estimate is to collect stats
> that are representative of the data population as a
> whole using a given sample set. Since analyzing
> tables
> takes up resources (does sorts to order the data for
> investigation) the fewer rows you use in estimate,
> the
> less system resources you use and the faster the
> analyze will go.
>
> Since our goal is to get as small a sample as
> possible
> and still have stats that are representative, my
> contention is that we could start by finding what
> the
> margin of error will be for each sample size and
> gauge
> our tolerance for it.
>
> One standard way to calculate margin of error for a
> given sample is by using this formula:
>
> M = 1/SQRT(N)
>
> where:
> M = margin of error
> N=sample size
>
> So, if we can tolerate stats that have a 1% a margin
> of error (will deviate from representative of the
> whole population by 1%), our sample size should be
> 10,000 rows.
>
> Also, a corollary (not a toyota corollary, though)
> to
> this would be that the more rows you add to your
> sample, the closer to representative your sample
> will
> be. So, in order to test whether your sample is
> representative enough, you could analyze using
> either
> estimate 49% or compute, take a snapshot of the
> stats,
> and then compare the stats from a 10,000 row
> estimate
> to those. Then, add rows to your estimate until you
> are satisfied with the stats.
>
> This of course is a pie in the sky mathematical
> model,
> but seems like a reasonable place to start with
> testing.
>
> Input? Input? Buhler? Buhler?
>
>
> /jack silvey
>
>
> __________________________________________________
> Do You Yahoo!?
> LAUNCH - Your Yahoo! Music Experience
> http://launch.yahoo.com
> --
> Please see the official ORACLE-L FAQ:
> http://www.orafaq.com
> --
> Author: Jack Silvey
> INET: jack_silvey_at_yahoo.com
>
> Fat City Network Services -- (858) 538-5051 FAX:
> (858) 538-5051
> San Diego, California -- Public Internet
> access / Mailing Lists
>
> To REMOVE yourself from this mailing list, send an
> E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of
> 'ListGuru') and in
> the message BODY, include a line containing: UNSUB
> ORACLE-L
> (or the name of mailing list you want to be removed
> from). You may
> also send the HELP command for other information
> (like subscribing).
>
>
>
>
> MOHAMMAD RAFIQ
>
>
>
> Join the world's largest e-mail service with MSN
> Hotmail.
> http://www.hotmail.com
>
> --
> Please see the official ORACLE-L FAQ:
> http://www.orafaq.com
> --
> Author: Mohammad Rafiq
> INET: rafiq9857_at_hotmail.com
>
> Fat City Network Services -- (858) 538-5051 FAX:
> (858) 538-5051
> San Diego, California -- Public Internet
> access / Mailing Lists
>
> To REMOVE yourself from this mailing list, send an
> E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of
> 'ListGuru') and in
> the message BODY, include a line containing: UNSUB
> ORACLE-L
> (or the name of mailing list you want to be removed
> from). You may
> also send the HELP command for other information
> (like subscribing).
Do You Yahoo!?
LAUNCH - Your Yahoo! Music Experience
http://launch.yahoo.com
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Jack Silvey
INET: jack_silvey_at_yahoo.com
Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
San Diego, California -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from). You may
also send the HELP command for other information (like subscribing).
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author:
INET: Cherie_Machler_at_gelco.com
Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
San Diego, California -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from). You may
also send the HELP command for other information (like subscribing).
Received on Wed May 22 2002 - 07:48:25 CDT