Very Strange and Interesting Oracle Issue [message #353425] |
Mon, 13 October 2008 17:44 |
nsteblay
Messages: 2 Registered: October 2008
|
Junior Member |
|
|
We have a relatively simple query that is being executed from a large WebSphere application (JDBC). The query has a great query plan (low cost) and normally executes quickly (sub second). Intermittently some state changes within the database environment where the query will generate 1000x the I/O to complete, yes, 1000x the I/O. If we look at the query while executing poorly it shows the same query plan and cost, yet the AWR report shows logical reads through the roof for the query.
It doesn't appear to be a data issue or an application issue - we don't see any type of data input error patterns, race conditions, excessive looping, etc. This simple query is part of a large application and will run great for days - being executed 1000s of times. When the problem starts all instances of the query generate 1000x more logical reads than normal. This one query quickly becomes +90% of all I/O being generated. With heavy loads on the system we eventually have to bring the application down, restarting the database server to fix the problem. It then goes away and may not surface again for days. We haven't been able to correlate any event to when the bug re-appears.
The odd thing is this bug only affects this one query. It doesn't seem to impact any of the other queries on the system. Unfortunately one of the most hit pages on the system executes this query.
I have never seen a bug like this in all my years working with databases. How can the same data transactions be run (we've tested this as we can simulate the bug in our test environment) and one time the query runs correctly, at other times it performs horrendously with extremely excessive I/O? How is it that what appears to be a back-end system bug only impacts one specific and simple query?
We have an open call with Oracle and will be digging deeper over the next few days. I was just wondering if anyone else has experienced a simlar problem?
|
|
|
Re: Very Strange and Interesting Oracle Issue [message #354088 is a reply to message #353425] |
Thu, 16 October 2008 06:56 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
Get your DBA to enable Oracle Trace next time it happens and gather trace files including BIND variables and wait events.
Also try running a baseline query from your SQL client whilst the problem is happening and trace that.
Check the results of the trace in TKPROF.
Ross Leishman
|
|
|
|
Re: Very Strange and Interesting Oracle Issue [message #354114 is a reply to message #353425] |
Thu, 16 October 2008 08:12 |
nsteblay
Messages: 2 Registered: October 2008
|
Junior Member |
|
|
We did find the issue with the help of Oracle. It turns out that the query was executing under a different execution plan. Apparently execution plans are cached in the Shared Pool and reused. The first time the query is executed is when the execution plan is created and cached for subsequent executions. Occasionally execution plans are invalidated in the Shared Pool cache and need to be regenerated. Oracle considers statistics as well as bind variable types and values when creating an execution plan. In our case if a particular value was passed to one of the bind variables it resulted in a very poor execution plan being chosen by Oracle. If this happened for the first execution of the query (or after invalidation) all subsequent executions of the query were slow. We tested this out and it confirmed Oracle's observation. We are looking at plan stabilization now to correct the problem. Thanks for the replies.
|
|
|