Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
Home -> Community -> Mailing Lists -> Oracle-L -> RE: Statistical sampling and representative stats collection
John,
Great work. I do have some comments and questions, but don't take it as being critical. Your work on this is greatly appreciated. My comments/question are in-line and only asked so that the list can bounce around some more ideas and thoughts.
> --- Begin Quote ---
> MYTH: "COMPUTE IS BETTER THAN ESTIMATE"
> This one generates an endless debate actually, so we will not take a firm
> stand either way. Rather, we will present some figures that throw
> some light
> on the issue and allow us to step back and look at the situation. The
> problem with COMPUTE is that it has to scan the entire table, sort it and
> figure out the exact data distribution. On the other hand, ESTIMATE steps
> through samples of the data, sorts and analyzes only a portion of
> the data.
True, if we can get good stats that result in effective plans by using ESTIMATE, by all means go that route since a COMPUTE is much more expensive.
>
> In a recent test for the effectiveness of COMPUTE versus ESTIMATE on a
> static clone of a reasonably large Oracle Apps database, the
> statistics were
> generated and stored for both COMPUTE and ESTIMATE. The Database consisted
> of about 3,300 tables and 6,000 indexes and occupied approximately 120 Gb.
My question here is would this database and your findings be applicable to other databases and the nature of their data? I wouldn't think it would be since the characteristics can be so different, but, if I'm reading you correctly, you aren't saying that ESTIMATE is always the only way. But, the title "Myth: "Compute is better than Estimate". Well, we have all seen cases where compute worked out better (or 10% or 1% might have worked better than the 30% specified). So maybe it's just a minor quibble (no Jack Silvey, not a Tribble) and if the title included "...is ALWAYS better...", then I would totally agree with it being a myth. We have all been there, and I worked with a group recently who had the luxury of doing full computes every Sunday. Someone accidentally analyzed the schema at 30% on Monday and a lot of things went down the toilet. Going back to COMPUTE fixed things. Then again, maybe a 10% ESTIMATE would have fixed things. Jack and I both work with a guy who has talked about COMPUTE resulting in undesired plans, 10% did as well. They got the desired plans by going to 1%. So, even if one agrees that we don't necessarily have to COMPUTE, and in many (probably most?) cases we don't, there is still a lot of testing to be done to find the "best" estimate percent, and this could very well be different for various objects. And I think that's the battle we all face -- what is the best sampling percentage. And right now, it still seems to be done on a trial and error basis. I have some ideas on how one might attack this in an automated fashion, but it's still a *very* rough idea that I need to bounce off a few cohorts.
<SNIP>
> While the figures speak for themselves, we will offer some
> general advice to
> the cautious: ESTIMATE on tables and COMPUTE on Indexes. Columns are
> analyzed by default, but serve no useful purpose other than showing data
> spread. Hence, you could ANALYZE only Tables and Indexed columns alone.
If I read this correctly, are you saying we only need to gather table stats and the column stats for indexed columns, and that there is no practical use for these stats on non-indexed columns? If so, I disagree on this point, even if it is general advice and not a rule. Stats on non-indexed columns can play a *large* role in CBO decisions. I'm not going to go into details and examples here illustrating that, but those stats can still help decide the driving table, the join methods between tables, etc. I built a sample case some time back to illustrate the importance of gathering these non-indexed column stats. Now, it might not be important for all systems, but if you are ever using indexed columns, and, still specifying criteria on non-indexed columns, the gathering of stats on the non-indexed columns could be *very* important. I can send you more details back-channel if you are interested.
> An
> identified list of 'small' tables could also be COMPUTED rather than
> ANALYZED. This advice is given because ESTIMATE on a table comes close as
> far as row count goes, while COMPUTE on Indexes generates a more accurate
> picture of both data distribution as well as object size
> statistics. Testing
> the effectiveness of COMPUTE versus ANALYZE is simple and
> provides you with
> figures that you can use to decide the strategy for your situation.
Ok, it sounds like you aren't saying a one size fits all.
>
> Before we move to the next topic, keep in mind that an
> ANALYZE/ESTIMATE with
> a sample size greater than or equal to 50% will result in COMPUTE.
>
> --- End Quote ---
>
> The problem is that this simple mathematical model looks only at object
> sizes and did not look at Column spread and sensitivity. However,
> I believe
> that the combination of ESTIMATE on Tables and COMPUTE on Indexes would
> catch most of it.
>
> As always, YMMV!
And your approach very well could take care of most cases for many people. It's an interesting idea and something certainly worth playing around with.
Once again, thanks for throwing this info out there.
>
> John Kanagaraj
> Oracle Applications DBA
> DBSoft Inc
> (W): 408-970-7002
-- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Larry Elkins INET: elkinsl_at_flash.net Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051 San Diego, California -- Public Internet access / Mailing Lists -------------------------------------------------------------------- To REMOVE yourself from this mailing list, send an E-Mail message to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing).Received on Fri May 24 2002 - 18:58:20 CDT