Large Tables, Bad Indexes and Fake Statistics

From: Don Seiler <don_at_seiler.us>
Date: Tue, 19 Jun 2007 12:07:00 -0500
Message-ID: <716f7a630706191007w46a2a31sd2b00649c1601dcc@mail.gmail.com>

I may have mentioned this before, but the archives only showed me a tangential reference [1].

The setup:
* Oracle 10.2.0.2 on RHEL 3.

Table FOO is range partitioned by date, total of 4 monthly partitions (rolling window)
There are two indexes on FOO, the PK has 10 fields (say A-J), and a second index has 5 fields (A,C,D from PK and two others K, L).
Statistics are gathered automatically via 10g default gather_stats_job

The problem:
A partition of FOO is bulk loaded around 10 times a month. The partition is created a few days before it will first be loaded into. We have a recurring problem when, after the partition is loaded into (2-3 million records per load), queries on FOO that have A, B, C, D in the WHERE clause will start using the second index. The fact that this second index doesn't use column B means it is walking over a HUGE set of records, practically a complete partition scan.

It seems like an out-of-whack statistics issue. After I gather stats on the table (takes 2.5 hours) the queries then use the PK as desired. To fight this, we had just this month started calling dbms_stats.set_table_stats and set_index_stats to indicate a full month's worth of stats before loading. Then for the rest of the day we could fool Oracle, until the nightly stats job then set the real stats. However this didn't work yesterday. I'm not sure why, but I did notice that the dba_tab_statistics table showed a sample_size for my user-set stats, when I don't believe there was one before.

The solution:
I'm placing blame first on the PK. A 10-field primary key is absurd, especially when 82% of the queries (logged via FGA) only go against those first four fields. This starts a whole natural-vs-synthetic key argument that only dictatorial control will resolve. Anyway, my theory is that the CBO recognizes that, after the load, the data is drastically different than the statistics. Now the touchy-feely part of the theory is that the CBO then says all bets are off and looks at the two indexes and sees that the second index is much, much smaller and will be much easier to deal with. And that's why it chooses that one. Obviously this isn't very technical and my developers aren't buying into it. What traces can I do to spell out what the CBO is thinking and why it is making certain choices?

So, avoiding a change to the PK (as much as it kills me), the long term proposed solution is to load data into a staging table, build indexes and gather statistics and then use partition exchange to bring all of that online in the "real" table. However, that would required a sub-partitioning of the FOO table by load units, so we can't just do it right off.

My near term work-around is to build a non-unique index on those first four columns. In my opinion we'd have just this one and not that PK, but such things are not up to me now.

I'm up for any theories or proposals at this point. My frustration level with this design and the refusal to fault it, is at an all-time high.

[1] http://www.freelists.org/archives/oracle-l/01-2007/msg00995.html

-- 
Don Seiler
oracle blog: http://ora.seiler.us
ultimate: http://www.mufc.us
--
http://www.freelists.org/webpage/oracle-l

Received on Tue Jun 19 2007 - 12:07:00 CDT