Tuning question - low cpu_time, high elapsed_time - what gives? (c.d.o.server)
Hi all,
We're running a 9.2.0.4.0 instance on a 4-processor Solaris box, and I'm up against an optimization mystery. Our application involves a variety of operations - constant insertion and querying - although it's the queries whose performance we're mainly concerned about.
Here's the deal: somewhere between about once a second and 10 times per second, we run a query that joins a main table (Analysis) to 1 or 2 dimension tables, and about a third of the time also joins against a detail table (Coefficient). Analysis has about 100K rows and Coefficient has about 30 million rows. If we join to Coefficient, we typically retrieve 1-10 rows in Coefficient for each row in Analysis. All values being returned are NUMBER columns. The constraints on Analysis can be broad or narrow, but the clients executing the queries never fetch more than the first Analysis rows of the result (we only care about recent data). All queries are done with dynamic SQL due to the nature of the constraints (there is no single query with bind placeholders that would run efficiently for all cases; hints are computed dynamically based on the query; etc.), so we always incur parse time.
Most of the time these queries run efficiently - less than 500 buffer_gets and 0 disk_reads, occasionally peaking at 4,000 gets (5% of the time or less), and maybe 10-15% of the time we get 5-20 disk reads - and the elapsed_time on these queries is under .15 seconds. Maybe a few times a day we have some queries that will ultimately need to be optimized (>20,000 gets). However, occasionally - maybe 3% of all queries run - we get elapsed times that exceed 5 seconds, even though the other statistics on the query look fine.
For example, here are some stats from some of our bad queries, as well as good ones (the last two rows):
BUFFER_GETS DISK_READS ELAPSED_TIME SHARABLE_MEM PERSISTENT_MEM RUNTIME_MEM SORTS FETCHES ROWS_PROCESSED OPTIMIZER_COST CPU_TIME CHILD_LATCH
----------- ---------- ------------ ------------ -------------- ----------- ----- ------- -------------- -------------- -------- -----------
        675          1      6782682        86744           2792      210976     4       2             33           2177   100000           5
        623          1      6734890        87947           2792      210976     4       2             33           2177   110000           1
        577          1      6507345        95338           1472      115304     4       2             33         239352   120000           3
       2094         50      4796384        93897           1288      169480     1      10            288             28   250000           4
        308          5       483892        94505           1472      115304     4       2              6         239352   110000           3
        419         53       456903        91949           1288      169480     1      10            288             28   100000           5
All other stats are the same. As you can see, for the "bad" queries, all the statistics other than elapsed_time fall within the same range as the good ones.
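In case it helps to know where these numbers come from: they're the v$sql columns, pulled with something along these lines (a sketch only - the filter on sql_text is just a stand-in for however we pick out the statements of interest):

-- Sketch: same columns as the listing above, straight out of v$sql (9.2).
-- The LIKE pattern is a placeholder, not our real predicate.
SELECT buffer_gets, disk_reads, elapsed_time,
       sharable_mem, persistent_mem, runtime_mem,
       sorts, fetches, rows_processed,
       optimizer_cost, cpu_time, child_latch
FROM v$sql
WHERE sql_text LIKE 'SELECT /*+ ORDERED */%ScalarCoefficient%'
ORDER BY elapsed_time DESC;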
During the times when these unfortunate queries are running, there are other statements executing on the box, perhaps somewhat more than at other times, but we have yet to correlate any load patterns with the periods of slowness. Additionally, we are using Resource Manager consumer groups, and the resource plan is set to favor these particular queries.
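Even though the plan favors these queries, one thing I still want to rule out is Resource Manager itself throttling CPU during the busy spells. A rough check I have in mind is to watch the consumer group stats while a slow query runs (sketch only):

-- Rough check for Resource Manager CPU throttling (9.2).
-- If cpu_waits / cpu_wait_time climb for our group while the slow
-- queries run, the plan rather than the SQL is the suspect.
SELECT name, active_sessions, cpu_waits, cpu_wait_time, consumed_cpu_time
FROM v$rsrc_consumer_group;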
For anyone's curiosity, here's an example of one of these queries:
SELECT /*+ ORDERED */
sc.IdentifierId
, sc.Coefficient
, fact.ExposureAnalysisId
, DataSet
FROM ExposureToolModel etm
, ExposureTool et
, ExposureAnalysis fact
, ExposureAnalysisData ead
, ScalarCoefficient sc
WHERE fact.ExposureToolId = 301
  AND fact.ScriptCategoryId = 100
  AND fact.RecipeId = 604
  AND fact.MaskId = 40733
  AND fact.LayerId = 230
  AND fact.RouteId = 27233
  AND fact.OperationId = 36086
  AND fact.LotId IN (SELECT l.LotId
                     FROM dual fact, Lot l
                     WHERE l.DeviceId = 6552
                       AND l.TechnologyId = 3845)
  AND fact.ExposureToolId = et.ExposureToolId(+)
  AND et.ExposureToolModelId = etm.ExposureToolModelId(+)
  AND etm.ExposureToolType = 'scanner'
  AND fact.ExposeTime + 1 - 1 >= '2004.08.12 16:36:50'
  AND fact.ExposeTime + 1 - 1 <= '2004.10.11 16:36:50'
  AND fact.ExposeHash >= 1686
  AND fact.ExposeHash <= 1746
  AND fact.FileSpec = 'Feedback'
  AND fact.ProcessStatus = 'D'
, fact.ExposureAnalysisId DESC;
Pretty typical, and I can tell you that this particular query returns very few rows - the subquery in the where clause returns a single row, and all columns are indexed (bitmap indexes, in fact, on all the columns that have numeric constraints and relatively few distinct values).
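Just to show the shape of the indexing (the names here are made up for illustration, not our real ones):

-- Hypothetical names, but this is the style of index on the fact table:
-- bitmaps on the low-cardinality NUMBER columns used in the constraints.
CREATE BITMAP INDEX ea_exposuretoolid_bix ON ExposureAnalysis (ExposureToolId);
CREATE BITMAP INDEX ea_recipeid_bix ON ExposureAnalysis (RecipeId);
CREATE BITMAP INDEX ea_layerid_bix ON ExposureAnalysis (LayerId);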
We have run statspack against the instance as well, but I'm not seeing any smoking guns in the tea leaves. The alert log for the instance doesn't show any unusual activity (log switching, etc.) at the time of these operations, and we are not swapping memory.
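One thing I haven't tried yet is extended SQL trace on a session that hits the slowdown, to see which wait event the missing seconds are charged to. From another session, something like this (assuming access to DBMS_SYSTEM; the sid and serial# below are placeholders):

-- Turn on event 10046 at level 12 (waits + binds) for a suspect session,
-- then run the trace file through tkprof to see where elapsed time goes.
-- 1234 / 5678 are placeholder SID and SERIAL# values.
EXECUTE sys.dbms_system.set_ev(1234, 5678, 10046, 12, '');
-- ...let the slow queries run, then switch the trace off:
EXECUTE sys.dbms_system.set_ev(1234, 5678, 10046, 0, '');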
My only other guess as to what might be going on: since the Analysis table has bitmap indexes on it, it may be that an insert into that table is happening just as we are querying it, causing the bitmap indexes to be updated. Could it be that the table is exclusively locked for an instant during this step and the select queries are blocked by the insert process?
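If that's what is happening, I'd expect to be able to catch a blocker at the moment of slowness with a rough check like this:

-- Rough blocking-lock check: sessions holding a lock (block = 1) that
-- some other session is waiting on (request > 0) for the same resource.
SELECT holder.sid AS blocking_sid,
       waiter.sid AS waiting_sid,
       holder.type, holder.id1, holder.id2
FROM v$lock holder, v$lock waiter
WHERE holder.block = 1
  AND waiter.request > 0
  AND holder.id1 = waiter.id1
  AND holder.id2 = waiter.id2;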
Does anyone out there have any ideas as to where I might start looking to figure out why we seem to be getting this blocking behavior?
Thanks,
-Steve

Received on Mon Oct 11 2004 - 15:56:30 CDT