Locating a bad disk on Solaris w/Veritas

From: Steve B <BigBoote66_at_hotmail.com>
Date: 25 Oct 2004 14:18:53 -0700
Message-ID: <67bcf80a.0410251318.44848790@posting.google.com>

Hi all,

I posted a few weeks ago about some intermittent/inconsistent performance problems we were having for queries that didn't make sense compared to the reported i/o usage for these queries. I've since learned how to use the extended trace facility (10046/12), and I've learned more about the problem, but I've run up against another wall.

What I've found is that our app is spending an inordinate amount of time fetching just one block from the disk. For example, in some slow running candidates, I've seen lines like this:

PARSING IN CURSOR #4 len=1125 dep=0 uid=24 oct=3 lid=24 tim=678997477414 hv=3589042826 ad='cceb3b00' SELECT /*+ ORDERED */sc.IdentifierId, sc.Coefficient, fact.ExposureAnalysisId, DataSet FROM ExposureToolModel etm , ExposureTool et , ExposureAnalysis fact, ExposureAnalysisDat a ead , ScalarCoefficient sc WHERE fact.ExposureToolId = 183 AND fact.RecipeId = 28959 AND fact.MaskId = 23514 AND fact.LayerId = 4130 AND fact.RouteId = 27327 AND fact.Operat ionId = 51086 AND fact.LotId IN (SELECT l.LotId FROM dual fact, Lot l WHERE l.DeviceId = 6332 AND l.TechnologyId = 4049 ) AND fact.ExposureToolId = et.ExposureToolId (+) AND e t.ExposureToolModelId = etm.ExposureToolModelId (+) AND etm.ExposureToolType= 'scanner' AND et.ExposureToolId = 183 AND etm.VendorMake= 'asml' AND etm.VendorModel= 'AT/750B' AN D fact.ExposeTime + 1 - 1 >= '2004.08.24 11:17:34' AND fact.ExposeTime + 1 - 1 <= '2004.10.23 11:17:34' AND fact.ExposeHash >= 1698 AND fact.ExposeHash <= 1758 AND fact.Exposur eAnalysisId = ead.ExposureAnalysisId AND fact.FileSpec= 'Feedback' AND fact.ScriptCategoryId = 100 AND fact.ProcessStatus = 'D' AND ead.DataSet IS NULL AND sc.DataSetId = ead.D ataSetId ORDER BY fact.ExposeTime desc, fact.ExposureAnalysisId desc END OF STMT
PARSE #4:c=50000,e=46837,p=0,cr=9,cu=0,mis=1,r=0,dep=0,og=4,tim=678997477399 WAIT #4: nam='SQL*Net message to client' ela= 4 p1=1413697536 p2=1 p3=0
WAIT #4: nam='SQL*Net message from client' ela= 1110 p1=1413697536 p2=1 p3=0
EXEC #4:c=0,e=1135,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=678997480425 WAIT #4: nam='SQL*Net message to client' ela= 9 p1=1413697536 p2=1 p3=0
*** 2004-10-23 11:18:10.028
****** LOOK HERE *****
WAIT #4: nam='db file sequential read' ela= 15714497 p1=4 p2=94824 p3=1
****** LOOK HERE *****

WAIT #4: nam='db file sequential read' ela= 9172 p1=3 p2=404516 p3=1
WAIT #4: nam='db file sequential read' ela= 9519 p1=4 p2=198168 p3=1
WAIT #4: nam='db file sequential read' ela= 8957 p1=3 p2=823622 p3=1
WAIT #4: nam='db file sequential read' ela= 3159 p1=4 p2=857461 p3=1
WAIT #4: nam='db file sequential read' ela= 7666 p1=3 p2=415264 p3=1
WAIT #4: nam='db file sequential read' ela= 10128 p1=4 p2=849608 p3=1
WAIT #4: nam='db file sequential read' ela= 1123429 p1=3 p2=1082960

p3=1

WAIT #4: nam='db file sequential read' ela= 7294 p1=3 p2=825174 p3=1
WAIT #4: nam='db file sequential read' ela= 13880 p1=4 p2=857419 p3=1
WAIT #4: nam='db file sequential read' ela= 9227 p1=3 p2=415334 p3=1
WAIT #4: nam='db file sequential read' ela= 21533 p1=4 p2=853244 p3=1
WAIT #4: nam='db file sequential read' ela= 7562 p1=3 p2=1084013 p3=1
WAIT #4: nam='db file sequential read' ela= 9635 p1=3 p2=785154 p3=1
WAIT #4: nam='db file sequential read' ela= 17597 p1=4 p2=198519 p3=1
WAIT #4: nam='db file sequential read' ela= 10863 p1=4 p2=857465 p3=1
WAIT #4: nam='db file sequential read' ela= 26288 p1=3 p2=415301 p3=1
WAIT #4: nam='db file sequential read' ela= 21145 p1=4 p2=852599 p3=1
WAIT #4: nam='db file sequential read' ela= 14026 p1=3 p2=1084347 p3=1
WAIT #4: nam='db file sequential read' ela= 9997 p1=3 p2=1185112 p3=1

FETCH #4:c=30000,e=17079350,p=20,cr=222,cu=0,mis=0,r=1,dep=0,og=4,tim=679014560511 WAIT #4: nam='SQL*Net message from client' ela= 1838 p1=1413697536 p2=1 p3=0
WAIT #4: nam='SQL*Net message to client' ela= 3 p1=1413697536 p2=1 p3=0
FETCH #4:c=0,e=286,p=0,cr=0,cu=0,mis=0,r=32,dep=0,og=4,tim=679014572292 WAIT #4: nam='SQL*Net message from client' ela= 8478 p1=1413697536 p2=1 p3=0
WAIT #4: nam='SQL*Net message to client' ela= 3 p1=1413697536 p2=1 p3=0
FETCH #4:c=0,e=279,p=0,cr=0,cu=0,mis=0,r=31,dep=0,og=4,tim=679014590654 WAIT #4: nam='SQL*Net message from client' ela= 10568 p1=1413697536 p2=1 p3=0
WAIT #4: nam='SQL*Net message to client' ela= 3 p1=1413697536 p2=1 p3=0
FETCH #4:c=0,e=282,p=0,cr=0,cu=0,mis=0,r=32,dep=0,og=4,tim=679014601772 WAIT #4: nam='SQL*Net message from client' ela= 10105 p1=1413697536 p2=1 p3=0
WAIT #4: nam='SQL*Net message to client' ela= 3 p1=1413697536 p2=1 p3=0
FETCH #4:c=0,e=150,p=0,cr=0,cu=0,mis=0,r=4,dep=0,og=4,tim=679014612292 WAIT #4: nam='SQL*Net message from client' ela= 37155 p1=1413697536 p2=1 p3=0
XCTEND rlbk=0, rd_only=1
STAT #4 id=1 cnt=100 pid=0 pos=1 obj=0 op='SORT ORDER BY (cr=222 r=20 w=0 time=17079583 us)'
STAT #4 id=2 cnt=100 pid=1 pos=1 obj=6633 op='TABLE ACCESS BY INDEX ROWID SCALARCOEFFICIENT (cr=222 r=20 w=0 time=17078134 us)' STAT #4 id=3 cnt=111 pid=2 pos=1 obj=0 op='NESTED LOOPS (cr=212 r=13 w=0 time=15895139 us)'
STAT #4 id=4 cnt=10 pid=3 pos=1 obj=0 op='NESTED LOOPS (cr=180 r=8 w=0 time=15812944 us)'
STAT #4 id=5 cnt=10 pid=4 pos=1 obj=0 op='HASH JOIN (cr=148 r=0 w=0 time=15006 us)'
STAT #4 id=6 cnt=10 pid=5 pos=1 obj=9661 op='TABLE ACCESS BY INDEX ROWID EXPOSUREANALYSIS (cr=117 r=0 w=0 time=7222 us)' STAT #4 id=7 cnt=73 pid=6 pos=1 obj=0 op='BITMAP CONVERSION TO ROWIDS
(cr=44 r=0 w=0 time=5674 us)'

STAT #4 id=8 cnt=1 pid=7 pos=1 obj=0 op='BITMAP AND (cr=44 r=0 w=0 time=5542 us)'
STAT #4 id=9 cnt=1 pid=8 pos=1 obj=30418 op='BITMAP INDEX SINGLE VALUE DN9EXPOSUREANALYSIS (cr=4 r=0 w=0 time=104 us)' STAT #4 id=11 cnt=24 pid=5 pos=2 obj=0 op='HASH JOIN (cr=31 r=0 w=0 time=5220 us)'

STAT #4 id=12 cnt=1 pid=11 pos=1 obj=6592 op='TABLE ACCESS BY INDEX
ROWID EXPOSURETOOL (cr=2 r=0 w=0 time=52 us)'
STAT #4 id=13 cnt=1 pid=12 pos=1 obj=7493 op='INDEX UNIQUE SCAN

PKEXPOSURETOOL (cr=1 r=0 w=0 time=29 us)' STAT #4 id=14 cnt=24 pid=11 pos=2 obj=0 op='MERGE JOIN CARTESIAN
(cr=29 r=0 w=0 time=1884 us)'

STAT #4 id=15 cnt=24 pid=14 pos=1 obj=0 op='VIEW (cr=27 r=0 w=0 time=1378 us)'
STAT #4 id=16 cnt=24 pid=15 pos=1 obj=0 op='SORT UNIQUE (cr=27 r=0 w=0 time=1334 us)'
STAT #4 id=17 cnt=24 pid=16 pos=1 obj=0 op='MERGE JOIN CARTESIAN
(cr=27 r=0 w=0 time=1099 us)'

STAT #4 id=18 cnt=24 pid=17 pos=1 obj=6600 op='TABLE ACCESS BY INDEX ROWID LOT (cr=24 r=0 w=0 time=692 us)'

STAT #4 id=19 cnt=24 pid=18 pos=1 obj=0 op='BITMAP CONVERSION TO
ROWIDS (cr=4 r=0 w=0 time=320 us)'
STAT #4 id=20 cnt=1 pid=19 pos=1 obj=0 op='BITMAP AND  (cr=4 r=0 w=0

time=283 us)'

STAT #4 id=21 cnt=1 pid=20 pos=1 obj=0 op='BITMAP CONVERSION FROM
ROWIDS (cr=2 r=0 w=0 time=125 us)'
STAT #4 id=23 cnt=1 pid=20 pos=2 obj=0 op='BITMAP CONVERSION FROM
ROWIDS (cr=2 r=0 w=0 time=120 us)'
STAT #4 id=25 cnt=24 pid=17 pos=2 obj=0 op='BUFFER SORT (cr=3 r=0 w=0

time=264 us)'
STAT #4 id=26 cnt=1 pid=25 pos=1 obj=222 op='TABLE ACCESS FULL DUAL
(cr=3 r=0 w=0 time=94 us)'

STAT #4 id=27 cnt=24 pid=14 pos=2 obj=0 op='BUFFER SORT (cr=2 r=0 w=0 time=305 us)'
STAT #4 id=28 cnt=1 pid=27 pos=1 obj=6594 op='TABLE ACCESS BY INDEX ROWID EXPOSURETOOLMODEL (cr=2 r=0 w=0 time=37 us)'

STAT #4 id=29 cnt=1 pid=28 pos=1 obj=7500 op='INDEX UNIQUE SCAN
AKEXPOSURETOOLMODEL (cr=1 r=0 w=0 time=15 us)'
STAT #4 id=30 cnt=10 pid=4 pos=2 obj=6584 op='TABLE ACCESS BY INDEX

ROWID EXPOSUREANALYSISDATA (cr=32 r=8 w=0 time=15797744 us)'

STAT #4 id=31 cnt=10 pid=30 pos=1 obj=7471 op='INDEX RANGE SCAN
PKEXPOSUREANALYSISDATA (cr=22 r=4 w=0 time=15744043 us)'
STAT #4 id=32 cnt=100 pid=3 pos=2 obj=7590 op='INDEX RANGE SCAN

PKSCALARCOEFFICIENT (cr=32 r=5 w=0 time=81727 us)'

Note that we're paying almost 16 seconds to fetch a single block from the disk.

The problem is, I'm struggling to figure out why. At the time of this call, there is nothing else going on in our application that is accessing the db - certainly nothing that could block the system for over 10 seconds at a time. I need to find what it is that's standing in between Oracle and the bits on the platters of our disks. Is it a bad disk? Some Veritas weirdness? Some job that's starting up that we've overlooked?

We're running Oracle 9.2.0.4.0 on Solaris 8 and our RAID is managed by Veritas. Is there some sort of low-level tools we can run that would determine if we have some bum disks or if there is suspicious access of the disks at certain times? I've used sar, but I don't know how to get it to give information that is substantially different from iostat
(i.e. just aggregated statistics & averages).

Thanks,
-Steve Received on Mon Oct 25 2004 - 16:18:53 CDT