precise question: use a histogram here? [message #133850] |
Mon, 22 August 2005 07:19 |
sebastianR
Messages: 33 Registered: August 2005
|
Member |
|
|
select min(cnt), max(cnt), avg(cnt), stddev(cnt)
from ( select key_val, count(*) cnt from <table> group by key_val )
MIN(CNT) | MAX(CNT) | AVG(CNT) | STDDEV(CNT)
1 | 409 | 1,35973268298468 | 0,999210428890999
I know what each column of this result represents, but I'm not sure if I can interpret the right way.
In my opinion it is not useful to make a histogram for this table/column.
By the way, there are about 800 000 distinct rows in the table.
Am I right, wrong or is this undecideable regarding this information?
In case I should make a histogram, can you recommend a bucketsize?
Thanks,
Sebastian
|
|
|
Re: precise question: use a histogram here? [message #133854 is a reply to message #133850] |
Mon, 22 August 2005 07:46 |
smartin
Messages: 1803 Registered: March 2005 Location: Jacksonville, Florida
|
Senior Member |
|
|
I welcome comments from others on this, but I don't think you want to evaluate the usefulness of a histogram based on the count of the number of values of each value. Instead, you want to base it on the column value itself.
In other words, don't solely do a count group by query on key_val. Instead, get the actual minimum and maximum values of key_val itself. For instance, your query shows a max count of 409. But you don't know what that 409 corresponds to. You don't know the range of values of key_val. All you know is that some undertermined key_val has 409 entries.
But in general, my theory, since I don't know of any harmful effects of having histograms on columns, other than the time required to gather them and the space to store them, is that the more information you can provide to the CBO the better. That means go ahead and gather them unless your statistics gathering window is experiencing major time pressure.
Then, you can use the actual gathered statistics to evaluate the data distribution, and tweak the bucket size if needed. I've never had a problem using a large size like 250, although if there are known to only be a set number of unique values that is much smaller then you could use a much smaller bucket size.
|
|
|