<xhoster_at_gmail.com> wrote in message
news:20050125160438.340$4R_at_newsreader.com...
> "Richard Foote" <richard.foote_at_bigpond.nospam.com> wrote:
>> "Holger Baer" <holger.baer_at_science-computing.de> wrote in message
>> news:ct4udr$d3p$1_at_news.BelWue.DE...
>> > Oradba Linux wrote:
>> >> Why would analyze index command read table segments?
>> >>
>> >>
>> >
>> > Because it has to?
>> >
>> > Or how should it be able to determine the clustering_factor?
>> >
>>
>> Hi Holger
>>
>> Actually, it doesn't need to visit the table in order to calculate the CF
>> as it has all the info it needs within the rowids contained in the index.
>>
>> Basically, as it reads through the rowids in the leaf pages, if the block
>> id (or of course file no) changes from one index entry to the next, the
>> CF is incremented. If consecutive index entries point to the same block,
>> the CF is not incremented.
>
> Hi Richard,
>
> This reminds me of something I've been wondering for a while.
> Do you think this is the best way for them to compute cluster factors?
> If the table blocks pointed to by an index block (in index order) looked
> like this (just to carry things to the extreme):
>
> 1,2,1,2,1,2,1,2,1,2,1,2,1,2
>
> vs this,
>
> 1,1,1,1,1,1,1,2,2,2,2,2,2,2
>
> They would get very different cluster factors, but it seems like they
> are both very well clustered.
>
Hi Xho
That's a very good point you make. Some random comments I would make are:
- The purpose of the CF is to approximate how many actual table blocks would
need to be accessed for a given range scan (the CF itself corresponding to a
full range scan). The reason for this is so that the CBO can have a reasonable
stab at costing the number of PIOs required to perform the scan. However,
two factors that it doesn't consider at all well are the effects of caching
due to previous accesses of these required blocks by other statements and
the effects of caching of these blocks previously by the statement itself.
Obviously, if the time interval between reading of the same table block is
sufficiently low (as in your example), then it's probable that the block
will still be cached when next required. So I agree it's an area of possible
inaccuracy on the part of the CBO.
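To make the anomaly concrete, here is a minimal Python sketch (purely illustrative, not Oracle's actual code) of the CF calculation described earlier in the thread, applied to both of your block patterns:

```python
# Hypothetical sketch of how the clustering factor is derived from an
# index's rowids: walk the entries in key order and increment whenever
# the table block pointed to changes from one entry to the next.

def clustering_factor(blocks):
    """blocks: table block ids in index key order, one per index entry."""
    cf = 0
    prev = None
    for blk in blocks:
        if blk != prev:   # block changed since the previous index entry
            cf += 1
        prev = blk
    return cf

interleaved = [1, 2] * 7         # 1,2,1,2,... as in your first example
grouped = [1] * 7 + [2] * 7      # 1,1,...,2,2,... as in your second

print(clustering_factor(interleaved))  # 14 -- CF as high as the row count
print(clustering_factor(grouped))      # 2  -- CF equals the block count
```

Both patterns touch only two distinct table blocks, yet the computed CFs differ by a factor of seven, which is exactly the discrepancy you're describing.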
- In reality, such a pattern as you've described is unlikely and would
require a rather manufactured insertion method by the application. That
said, factors such as specific deletion/insertion patterns, the effects of
multiple freelists and the effects of ASSM could possibly lead to similar CF
patterns.
- The CF is only really an estimate, a guide so to speak that the CBO can
use to calculate the likely cost of a range scan. It's also an "average",
the average cost of performing a full range scan, and as with all averages,
it's potentially inaccurate, at least "at times". What may be "inefficient"
for one range in an index may be "efficient" for another range. Take for
example an index that in part looks like:
...
1,2,3,4,6,7,8,9,1,2,3,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,7,8,9,1,2,3,4,6,7,8,9
...
It might have quite an awful CF overall; however, there may be pockets
within the index that have a good CF (eg. the section that points to block 5,
where such a query would be relatively "efficient").
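Putting rough numbers on that fragment (again a hypothetical Python sketch; the exact length of the run of 5s is approximated, which doesn't change the result since repeats of the same block never increment the CF):

```python
# Hypothetical illustration: overall CF of the fragment above vs. the CF
# of just the pocket of entries that all point to table block 5.

def clustering_factor(blocks):
    """blocks: table block ids in index key order, one per index entry."""
    cf, prev = 0, None
    for blk in blocks:
        if blk != prev:   # block changed since the previous index entry
            cf += 1
        prev = blk
    return cf

# Roughly the fragment quoted above (run length of the 5s is approximate).
fragment = ([1, 2, 3, 4, 6, 7, 8, 9, 1, 2, 3, 4]
            + [5] * 17
            + [6, 7, 8, 9, 1, 2, 3, 4, 6, 7, 8, 9])

print(clustering_factor(fragment))   # 25 -- poor overall
print(clustering_factor([5] * 17))   # 1  -- the pocket is perfectly clustered
```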
- And of course, all queries are potentially different. Some may access
blocks that have just been cached anyway by another application, some may
access rows that are well clustered together, some may not; it depends on
all these factors. Therefore, having a CF that is interpreted by the CBO as
a global, *average* value, for a whole bunch of different queries with
different characteristics that use different portions of the index, is
probably as reasonable as one could expect considering all the factors.
- In your example, the number of rows needing to be accessed is relevant
(actually, it's highly relevant in all cases). But take a range scan that
the CBO estimates requires *two* rows. In the first example, that's *two*
distinct blocks that need to be accessed. In the second, it's just *one*
block. Therefore, for a small index scan, the CF has done its job
correctly. For a larger scan (say 10 rows), it becomes less accurate in
that the scan is really only accessing 2 distinct blocks soon after each
other rather than the 10 distinct blocks it estimates. Again, it all depends.
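Comparing the blocks actually visited for small and larger scans of the two patterns (a hypothetical sketch, reusing your example block ids) makes the point:

```python
# Hypothetical comparison: distinct table blocks actually touched by a
# range scan of the first n index entries, vs. the roughly n blocks that
# a "worst case" CF would suggest for the interleaved pattern.

def distinct_blocks(blocks, nrows):
    """Distinct table blocks visited by a scan of the first nrows entries."""
    return len(set(blocks[:nrows]))

interleaved = [1, 2] * 7        # CF 14: every entry changes block
grouped = [1] * 7 + [2] * 7     # CF 2:  one block change overall

print(distinct_blocks(interleaved, 2))   # 2 -- here the CF's implication holds
print(distinct_blocks(grouped, 2))       # 1
print(distinct_blocks(interleaved, 10))  # 2 -- far fewer than the ~10 the CF implies
```

So for the two-row scan the CF-derived costs are spot on, while for the ten-row scan of the interleaved pattern the CF overstates the distinct blocks by a factor of five.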
- Also the load on the system and the sizing of the buffer caches is
potentially relevant as it impacts the "duration" between which accesses to
the same blocks may still be cached. How long should this "interval" be so
that the CF considers accesses to the same block likely to be cached and
perhaps count it only once? Not an easy question to answer as, of course, the
system load and buffer cache sizes can be changed at any time. So, it all
depends.
- The almost "perfect" CF that the CBO would desire is one which took into
consideration the "different/distinct" number of blocks visited by a
*specific* range scan and the subset of those blocks that are not cached.
This is impossible for just *one* number to represent. Also, at the time of
parsing, the CBO doesn't know the specific range(s) of values the query will
access, the portion(s) of the index each specific query will touch, or the
caching characteristics of the blocks the query is accessing.
- As there's so much guessing going on within the CBO anyway, and considering
that such guesses are generally within the order of acceptability for the CBO
to do a reasonable job, IMHO the CF is generally as accurate and effective in
doing its job as it needs to be in the majority of cases.
Hope it all makes sense !!
Cheers
Richard
Received on Wed Jan 26 2005 - 05:28:32 CST