Re: Periodic Stats Collection -- CBO Stats Myth?

From: Sriram Kumar <k.sriramkumar_at_gmail.com>
Date: Fri, 3 Oct 2008 15:36:19 -0400
Message-ID: <c515faee0810031236m6390c9cm7622d988a5ac63ba@mail.gmail.com>


Thanks again tim. I will certainly try this when I am back to my shop next week. Currently I am at a customer site and would be limited to experiment a few things.

I found a SQL that tries to decode the exiting hex values in DBA_TAB_COL_STATISTICS to get an approximation of what the current values are but the output of this SQL for date colums are not valid.

http://www.oramoss.com/wiki/index.php/Cbo

Thanks for the pointers and Have a Great Weekend

Cheers

Sriram Kumar

On Fri, Oct 3, 2008 at 3:12 PM, Tim Gorman <tim_at_evdbt.com> wrote:

> Sriram,
>
> If I had time and wanted to find out, I would SQL trace (level 12) a run of
> a DBMS_STATS.GATHER_TABLE_STATS job with ESTIMATE_PERCENT=>10 and METHOD_OPT=>'
> FOR COLUMNS *<column-name>*', and find the underlying SELECT statement run
> by DBMS_STATS on the *column-name* in question. Then, I'd copy/paste the
> text of that SQL statement into SQL*Plus and run it, and see what data is
> returned by the query. And that would provide the same information that
> DBMS_STATS is using to calculate "high value" for the column.
>
> Although I don't have time for this right now, perhaps you do? If so,
> would you post your results back to the list?
>
> If I had to guess, then I would say that sampling will result in inaccurate
> values for column high-value. Just how inaccurate is part of the
> game-theory that we all use when deciding between COMPUTE STATISTICS and
> ESTIMATE STATISTICS. But that's just my guess...
>
> Thanks!
>
> -Tim
>
>
>
>
> Sriram Kumar wrote:
>
> Hi Tim,
>
> Thanks for your inputs. Appriciate your inputs.
>
> I had one more question. If we are performing sampling ( lets say 10
> percent), how does the stats gatherer estimate the high value especially for
> date columns?.
>
> Best Regards
>
> Sriram Kumar
>
>
>
>
> On Wed, Oct 1, 2008 at 11:20 AM, Tim Gorman <tim_at_evdbt.com> wrote:
>
>> For changing just the high-value column-level statistics, don't waste time
>> and resources re-gathering on the table, but instead set them manually using
>> DBMS_STATS.SET_COLUMN_STATS. You can just query the current SYSDATE value
>> or query the NEXTVAL of the sequence in question to know the current
>> high-value, so finding the values to set is cheap.
>>
>> You'll have to use the PREPARE_COLUMN_VALUES procedure to put the
>> histogram endpoint(s) into PL/SQL tables for the SET_COLUMN_STATS call
>> first, so there is some involved PL/SQL coding involving collections of
>> record types.
>>
>> Hope this helps...
>>
>>
>> Sriram Kumar wrote:
>>
>> Hi William/Andrew,
>>
>> Thanks for your valuable inputs.
>>
>> I totally agree with you william on the potential impact of outdated
>> low_value and high_value on date columns and columns that are generated
>> using sequences.
>>
>> I have a follow up question on how this data is populated. If I had done
>> an estimate percent say 10%, how does CBO guess the high value?.
>>
>> Best Regards
>>
>> Sriram Kumar
>>
>>
>>
>>
>> On Wed, Oct 1, 2008 at 6:54 AM, William Robertson <
>> william_at_williamrobertson.net> wrote:
>>
>>> I've found that dates are more likely to present a problem because
>>> sequentially generated keys are usually highly selective anyway. Say a
>>> financial application loads three million trades a day. If the stats are not
>>> updated the optimizer will think a given TRADE_ID has 0 occurrences (since
>>> it's after the column's high endpoint) and round that up to 1, which may
>>> well be correct anyway. However when you run a query for this week's trades
>>> on a Friday and it thinks there are none when in fact there are 15 million
>>> then you have a much bigger problem. That's my experience anyway.
>>>
>>>
>>> 2008/10/1 Niall Litchfield <niall.litchfield_at_gmail.com>
>>>
>>>> dates and sequence driven key columns both exhibit this behaviour.
>>>>
>>>> On 30/09/2008, William Robertson <william_at_williamrobertson.net> wrote:
>>>> > I'm not sure how using or not using histograms makes any fundamental
>>>> > difference to your approach.
>>>> >
>>>> > One thing that comes to mind is the use of date predicates. If the
>>>> > optimizer knows the highest sale date you have (for example) is 30
>>>> Sept
>>>> > 2008, you are fine today and maybe some of next week, but in a few
>>>> > months' time any queries for "last week's sales" may get
>>>> unrealistically
>>>> > low cardinality estimates leading to inappropriate index/nested loop
>>>> > access paths, Cartesian joins and so on.
>>>> >
>>>> > If you use partitioning and you have date columns as
>>>> > partition/subpartition keys you could have even more fun in store.
>>>> >
>>>> >
>>>> > -----Original message-----
>>>> > From: Sriram Kumar
>>>> > Date: 30/9/08 22:41
>>>> >> Hi Folks,
>>>> >>
>>>> >> We had migrated a OLTP/Batch hybrid 9i RBO application to 10g
>>>> >> and after rounds of tuning the application now is working at its
>>>> best.
>>>> >> We had experimented with various sampling rates for statistics for
>>>> >> various tables and now I feel we are in a optimal point of
>>>> performance
>>>> >> where CBO is picking up the right Indexes.
>>>> >>
>>>> >> 1) We do not use histograms
>>>> >> 2) From here on the application volumes are bound increase
>>>> proportionately
>>>> >> 3) No new tables would be added in production.
>>>> >> 4) As of now, We have locked the schema stats for the application
>>>> >> schemas. The default scheduler job collects stats for all other
>>>> schemas
>>>> >>
>>>> >> Since the increase in volume of data is quite proportional to the
>>>> >> existing data, what would be the compelling reason to periodically
>>>> >> recollect statistics for the tables that already have good execution
>>>> >> plans?
>>>> >>
>>>> >> e.g lets say that there is a table with million records and with few
>>>> >> indexes. We have collected the stats for this table and indexes with
>>>> >> million records and all the access paths to this table has been
>>>> >> validated that the CBO is picking up the correct optimal index in
>>>> >> every case.
>>>> >>
>>>> >> Now the volume of the table increases to 2 million and from
>>>> >> application point of view, the access paths that were valid for 1
>>>> >> million would be the valid for 2 million as well.
>>>> >>
>>>> >> Given this scenario, is it required to collect statistics for 2
>>>> >> million as well?. I opine that it would not be required but some of
>>>> my
>>>> >> colleagues feel that we should collect statistics periodically but I
>>>> >> am not able to get a valid reason for periodically recollecting the
>>>> >> stats?. Is this one of the myths?
>>>> >>
>>>> >> I would agree to periodically recollect stats if we are using
>>>> >> histograms but we are not using histograms. Any other reason that
>>>> >> would need a periodic statistic gathering?
>>>>
>>>
>>>
>>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Oct 03 2008 - 14:36:19 CDT

Original text of this message