Re: Boss is asking for 'root cause analysis'

From: Niall Litchfield <n-litchfield_at_audit-commission.gov.uk>
Date: Wed, 30 Apr 2003 15:29:20 +0100
Message-ID: <3eafddc0$0$4844$ed9e5944@reading.news.pipex.net>

Unless you have some sort of performance monitoring data, I suspect that finding the root cause will be impossible.

If you have , say, statspack or ultbstat/utlestat reports covering the relevant period, or explain plans for the problem queries before and after it will be pretty difficult to see what was going on.

I'd strongly suspect the dataload causing the access plans to become inappropriate but without seeing the access plans it is pretty well impossible to tell.

The other question to ask is wether the reporting tools/queries had been changed/upgraded in anyway.

-- 
Niall Litchfield
Oracle DBA
Audit Commission UK
"Ed Stevens" <nospam_at_noway.nohow> wrote in message
news:agivavojrihsnt07pg6d6tu0vvvphi8hu1_at_4ax.com...


> Platform -- Oracle 8.0.6 EE on NT4 -- data warehouse app

>

> Last week app users raised an issue where several report generating

> queries that had been running in 1 to 5 minutes were suddenly running

> several hours before failing with a "time out"  (their term -- I never

> saw it).

>

> Stats were current, no structural changes had been made to the db,

> init parms had not been touched in over a year.  The db is shutdown

> nightly for a cold backup.  The day after the last reported "good" run

> of these reports there had been a normal monthly data load that added

> a modest number of rows to the major tables.  By modest, I mean a few

> tens of thousands to tables with 8.5 million rows.  These tables did

> have about 2.5 million chaned rows, but it is not clear when or

> exactly how they became chained.  There are no variable length

> columns.  Most columns are 'nulls allowed' but it is believed that in

> practice none of them are ever actually null.

>

> We asked a couple of DBAs at another affiliate location on this

> account to take a look and see what they thought.  Their only response

> was 'when dealing with data warehouse apps, we always set

> STAR_TRANSFORMATION_ENABLED=TRUE.  We tried it and the problem went

> away.  When pressed for more details, they could only say "we've

> always done it that way."

>

> Now my manager is pressing for a root cause analysis and preventative

> measures.  When the question is asked regarding why did things

> suddenly fall over and break, the only response the DBA that found the

> solution can give is that perhaps we passed some sort of threshold.

> Now I'm being pressed to determine if that is the case, and if so,

> exactly what was the threshold that was passed; and if not crossing a

> threshold, then what *was* the problem.

>

> I'm not sure it *can* be determined at this time what went wrong, but

> I have to exhaust all avenues in attempting to satisfy the questions

> being asked.  So, would anyone have any insights or care to comment on

> any of the above?

Received on Wed Apr 30 2003 - 09:29:20 CDT