<xhoster_at_gmail.com> wrote in message
news:20050125160438.340$4R_at_newsreader.com...
> "Richard Foote" <richard.foote_at_bigpond.nospam.com> wrote:
>> "Holger Baer" <holger.baer_at_science-computing.de> wrote in message
>> news:ct4udr$d3p$1_at_news.BelWue.DE...
>> > Oradba Linux wrote:
>> >> Why would analyze index command read table segments?
>> >>
>> >>
>> >
>> > Because it has to?
>> >
>> > Or how should it be able to determine the clustering_factor?
>> >
>>
>> Hi Holger
>>
>> Actually, it doesn't need to visit the table in order to calculate the CF
>> as it has all the info it needs within the rowids contained in the index.
>>
>> Basically, as it reads through the rowids in the leaf pages, if the block
>> id (or of course file no) changes from one index entry to the next, the
>> CF is incremented. If consecutive index entries point to the same block,
>> the CF is not incremented.
>
> Hi Richard,
>
> This reminds me of something I've been wondering for a while.
> Do you think this is the best way for them to compute cluster factors?
> If the table blocks pointed to by an index block (in index order) looked
> like this (just to carry things to the extreme):
>
> 1,2,1,2,1,2,1,2,1,2,1,2,1,2
>
> vs this,
>
> 1,1,1,1,1,1,1,2,2,2,2,2,2,2
>
> They would get very different cluster factors, but it seems like they
> are both very well clustered.
>
Hi Xho
That's a very good point you make. Some random comments I would make are:
- The purpose of the CF is to approximate how many actual table blocks would
need to be accessed for a given range scan (the CF itself corresponding to a
full range scan). The reason for this is so that the CBO can have a reasonable
stab at costing the number of PIOs required to perform the scan. However,
two factors that it doesn't consider at all well are the effects of caching
due to previous accesses of these required blocks by other statements and
the effects of caching of these blocks previously by the statement itself.
Obviously, if the time interval between reading of the same table block is
sufficiently low (as in your example), then it's probable that the block
will still be cached when next required. So I agree it's an area of possible
inaccuracy on the part of the CBO.
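To make the anomaly concrete, here is a minimal Python sketch (purely illustrative, not Oracle's actual code) of the CF calculation described earlier in the thread, applied to both of your block patterns:

```python
# Hypothetical sketch of how the clustering factor is derived from an
# index's rowids: walk the entries in key order and increment whenever
# the table block pointed to changes from one entry to the next.

def clustering_factor(blocks):
    """blocks: table block ids in index key order, one per index entry."""
    cf = 0
    prev = None
    for blk in blocks:
        if blk != prev:   # block changed since the previous index entry
            cf += 1
        prev = blk
    return cf

interleaved = [1, 2] * 7         # 1,2,1,2,... as in your first example
grouped = [1] * 7 + [2] * 7      # 1,1,...,2,2,... as in your second

print(clustering_factor(interleaved))  # 14 -- CF as high as the row count
print(clustering_factor(grouped))      # 2  -- CF equals the block count
```

Both patterns touch only two distinct table blocks, yet the computed CFs differ by a factor of seven, which is exactly the discrepancy you're describing.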
- In reality, such a pattern as you've described is unlikely and would
require a rather manufactured insertion method by the application. That
said, factors such as specific deletion/insertion patterns, the effects of
multiple freelists and the effects of ASSM could possibly lead to similar CF
patterns.
- The CF is only really an estimate, a guide so to speak that the CBO can
use to calculate the likely cost of a range scan. It's also an "average",
the average cost of performing a full range scan, and as with all averages,
it's potentially inaccurate, at least "at times". What may be "inefficient"
for one range in an index may be "efficient" for another range. Take for
example an index that in part looks like:
...
1,2,3,4,6,7,8,9,1,2,3,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,7,8,9,1,2,3,4,6,7,8,9
...
It might have quite an awful CF overall; however, there may be pockets
within the index that have a good CF (eg. the section that points to block 5,
where such a query would be relatively "efficient").
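Putting rough numbers on that fragment (again a hypothetical Python sketch; the exact length of the run of 5s is approximated, which doesn't change the result since repeats of the same block never increment the CF):

```python
# Hypothetical illustration: overall CF of the fragment above vs. the CF
# of just the pocket of entries that all point to table block 5.

def clustering_factor(blocks):
    """blocks: table block ids in index key order, one per index entry."""
    cf, prev = 0, None
    for blk in blocks:
        if blk != prev:   # block changed since the previous index entry
            cf += 1
        prev = blk
    return cf

# Roughly the fragment quoted above (run length of the 5s is approximate).
fragment = ([1, 2, 3, 4, 6, 7, 8, 9, 1, 2, 3, 4]
            + [5] * 17
            + [6, 7, 8, 9, 1, 2, 3, 4, 6, 7, 8, 9])

print(clustering_factor(fragment))   # 25 -- poor overall
print(clustering_factor([5] * 17))   # 1  -- the pocket is perfectly clustered
```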
- And of course, all queries are potentially different. Some may access
blocks that have just been cached anyway by another application, some may
access rows that are well clustered together, some may not; it depends on
all these factors. Therefore, having a CF that is interpreted by the CBO as
a global, *average* value, for a whole bunch of different queries with
different characteristics that use different portions of the index, is
probably as reasonable as one could expect considering all the factors.
- In your example, the number of rows needing to be accessed is relevant
(actually, it's highly relevant in all cases). But take a range scan that
the CBO estimates requires *two* rows. In the first example, that's *two*
distinct blocks that need to be accessed. In the second, it's just *one*
block. Therefore, for a small index scan, the CF has done its job
correctly. For a larger scan (say 10 rows), it becomes less accurate in
that the scan is really only accessing 2 distinct blocks soon after each
other rather than the 10 distinct blocks it estimates. Again, it all depends.
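Comparing the blocks actually visited for small and larger scans of the two patterns (a hypothetical sketch, reusing your example block ids) makes the point:

```python
# Hypothetical comparison: distinct table blocks actually touched by a
# range scan of the first n index entries, vs. the roughly n blocks that
# a "worst case" CF would suggest for the interleaved pattern.

def distinct_blocks(blocks, nrows):
    """Distinct table blocks visited by a scan of the first nrows entries."""
    return len(set(blocks[:nrows]))

interleaved = [1, 2] * 7        # CF 14: every entry changes block
grouped = [1] * 7 + [2] * 7     # CF 2:  one block change overall

print(distinct_blocks(interleaved, 2))   # 2 -- here the CF's implication holds
print(distinct_blocks(grouped, 2))       # 1
print(distinct_blocks(interleaved, 10))  # 2 -- far fewer than the ~10 the CF implies
```

So for the two-row scan the CF-derived costs are spot on, while for the ten-row scan of the interleaved pattern the CF overstates the distinct blocks by a factor of five.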
- Also the load on the system and the sizing of the buffer caches is
potentially relevant as it impacts the "duration" between which accesses to
the same blocks may still be cached. How long should this "interval" be so
that the CF considers accesses to the same block likely to be cached and
perhaps count it only once? Not an easy question to answer as, of course, the
system load and buffer cache sizes can be changed at any time. So, it all
depends.
- The almost "perfect" CF that the CBO would desire is one which took into
consideration the "different/distinct" number of blocks visited by a
*specific* range scan and the subset of those blocks that are not cached.
This is impossible for just *one* number to represent. Also, at the time of
parsing, the CBO doesn't know the specific range(s) of values the query will
access, the portion(s) of the index each specific query will touch, or the
caching characteristics of the blocks the query is accessing.
- As there's so much guessing going on within the CBO anyway, and considering
that such guesses are generally within the order of acceptability for the CBO
to do a reasonable job, IMHO the CF is generally as accurate and effective in
doing its job as it needs to be in the majority of cases.
Hope it all makes sense !!
Cheers
Richard
Received on Wed Jan 26 2005 - 05:28:32 CST