Oracle-L: RE: Statistical sampling and representative stats collection

From: DENNIS WILLIAMS <DWILLIAMS_at_LIFETOUCH.COM>
Date: Wed, 22 May 2002 07:23:29 -0800
Message-ID: <F001.00467A1A.20020522072329@fatcity.com>

Jack, Raj

I agree. The main point, I feel, is that if you follow statistical theory, which a good part of our modern technology relies on, you will sample a fixed number of rows rather than a percentage of the table's rows.

For a small table, you may have to sample the entire table to get results that work. As a real-life example, you wouldn't sample 10 US Senators and expect those results to be accurate. No, you would simply survey each Senator (and hope they don't change their mind). Similarly, don't just sample 30 percent of a 1,000-row table. Sample all 1,000 rows.

For a large table, sampling a percentage would oversample and be wasted effort. If you have a million-row table and a hundred-million-row table, the same sample size will produce results that are nearly as accurate for both. That is why you don't see nearly as many state political polls: it costs nearly as much to accurately sample the citizens of one state as it does to sample all the citizens of the US.
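
To put rough numbers on this, here is a minimal Python sketch of the textbook margin-of-error formula for an estimated proportion, including the finite population correction. It assumes a simple random sample, a 95 percent confidence level (z = 1.96) and the worst case p = 0.5. The function name and parameters are mine, for illustration only, and this is not a description of what Oracle does internally when it gathers statistics.

```python
import math

def margin_of_error(sample_size, population_size, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    if sample_size >= population_size:
        return 0.0  # every row was read, so there is no sampling error
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction: shrinks the error as the sample
    # approaches the size of the whole population.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

if __name__ == "__main__":
    # Same 5,000-row sample against two very different table sizes.
    for rows in (1_000_000, 100_000_000):
        print(rows, round(margin_of_error(5_000, rows), 4))
    # A 30 percent sample of a small table.
    print(1_000, round(margin_of_error(300, 1_000), 4))
```

With 5,000 sampled rows the margin comes out at roughly 1.4 percentage points for both the million-row and the hundred-million-row table, while a 30 percent sample (300 rows) of a 1,000-row table is only good to about 4.7 points.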

Someone asked about skewed data. Well, that is the reason you perform a RANDOM sample. That is the key point, and failing to sample randomly is what produces many real-life statistical failures. A classic example is the 1936 Literary Digest presidential poll, which drew its sample largely from telephone directories and automobile registrations and confidently predicted Landon's victory over Roosevelt. What it neglected was that wealthier people had telephones in greater proportion than poor people. So the sample was skewed, which produced bad results. (The polls that called the 1948 Truman-Dewey race for Dewey failed for a related reason: they relied on quota sampling rather than random sampling.) Here, we're betting on Oracle's statement that the sample is truly random.
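
The difference between a random sample and a convenient one is easy to simulate. The sketch below builds a made-up population in which phone owners favor candidate A more than everyone else, then compares a truly random sample with a phone-only sample of the same size. All of the numbers (40 percent phone ownership, 70 versus 35 percent support) are invented for illustration and are not historical data.

```python
import random

def simulate_poll(pop_size=100_000, sample_size=2_000, seed=42):
    """Return (true share, random-sample estimate, phone-only estimate) for candidate A."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        has_phone = rng.random() < 0.40            # assumed ownership rate
        support = 0.70 if has_phone else 0.35      # assumed preference by group
        prefers_a = rng.random() < support
        population.append((has_phone, prefers_a))

    true_share = sum(prefers for _, prefers in population) / pop_size

    # Simple random sample: every person is equally likely to be chosen.
    random_sample = rng.sample(population, sample_size)
    random_est = sum(prefers for _, prefers in random_sample) / sample_size

    # Biased sample: same size, but drawn only from phone owners.
    phone_owners = [person for person in population if person[0]]
    phone_sample = rng.sample(phone_owners, sample_size)
    phone_est = sum(prefers for _, prefers in phone_sample) / sample_size

    return true_share, random_est, phone_est

if __name__ == "__main__":
    true_share, random_est, phone_est = simulate_poll()
    print(f"true={true_share:.3f} random={random_est:.3f} phone-only={phone_est:.3f}")
```

The random sample should land within a few points of the true share, while the phone-only sample stays far off no matter how many people it includes. A bigger sample does not fix a biased one.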

Now, if you want a more accurate result, you will sample more. But you are increasing the sample size to improve the accuracy and to compensate for any other inaccuracies, not because the table is larger.
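
As a rough illustration, the usual formula for the sample size needed to estimate a proportion can be sketched in a few lines of Python. It assumes a simple random sample from a large table, a 95 percent confidence level (z = 1.96) and the worst case p = 0.5; the function name is hypothetical.

```python
import math

def required_sample_size(margin, p=0.5, z=1.96):
    """Rows needed for a given margin of error on a proportion, ignoring table size."""
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    # n = z^2 * p * (1 - p) / margin^2, rounded up to a whole row.
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

if __name__ == "__main__":
    for margin in (0.05, 0.025):
        print(margin, required_sample_size(margin))
```

A 5-point margin needs 385 rows and a 2.5-point margin needs 1,537. Halving the margin roughly quadruples the sample, and the table's row count never enters the formula.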

Just a thought: if you're responsible for a data warehouse, you may want to consider studying some basic statistics. Unfortunately, most computer science curricula don't require a class in statistics. In fact, since polls shape a lot of our political discussion, it wouldn't hurt to require all citizens to have some statistical training. It might make it harder for politicians to misconstrue statistical results. However, it is hard enough to get people just to vote, so I suppose that one isn't going to fly.

Dennis Williams
DBA
Lifetouch, Inc.
dwilliams_at_lifetouch.com

-----Original Message-----
Sent: Wednesday, May 22, 2002 9:39 AM
To: Multiple recipients of list ORACLE-L

Jack,

Nielsen Ratings (the TV ratings company) monitors about 5000 people (and their TV watching habits) to supply ratings for all the shows on most of the networks for the whole United States. So, as long as you have a working and proven statistical model, and a good sample, it works. How do I know? Have you ever seen anyone challenge the Nielsen ratings for a show? I haven't.

Raj

Rajendra Jamadagni MIS, ESPN Inc.
Rajendra dot Jamadagni at ESPN dot com
Any opinion expressed here is personal and doesn't reflect that of ESPN Inc.

QOTD: Any clod can have facts, but having an opinion is an art!
Received on Wed May 22 2002 - 10:23:29 CDT