Re: Locating a bad disk on Solaris w/Veritas
rhugga_at_yahoo.com (Keg) wrote in message news:<6c795a35.0410261720.4bb79c5d_at_posting.google.com>...
>
> 'vxdisk list' will show disks that Veritas has failed. If nothing
> shows up there, then look at 'iostat -En', which will show hard/soft
> errors. If there aren't any errors listed for the device there, then
> the problem is likely somewhere else. (Note that iostat will register
> errors from faulty cables, the SCSI bus, fibre, etc.)
>
> Also, have you looked at mpstat and iostat while that query is
> running? That query is quite large and might be thrashing away at the
> disk. (Are index and data separated onto different devices, and all
> that?) As you can see from your trace, there is quite a bit going on
> with that query: multiple I/Os and sorts. Without seeing more info, I
> would guess that Oracle is churning away on the query and then
> returning the result, which is why it seems it is 'taking 16 seconds
> to read a block'.
>
> If iostat doesn't display any soft/hard errors, it is almost certainly
> safe to assume there is no hardware problem with the controller, disk,
> or bus. (At least according to Veritas support, and in my experience I
> have not seen anything to contradict what they have told me.)
>
> My suggested course of action would be to look at 'iostat -En' first.
> If all is well there, start logging mpstat and iostat (man iostat &&
> man mpstat) while that query is running. Watch your data disk, index
> disk, and temp disk during the query's execution.
>
> Here is a sample of what iostat -En will return:
> sd1185 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
> Vendor: IBM Product: 1742 Revision: 0520 Serial No:
> Size: 54.47GB <54473523200 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0
>
> Note that a few soft errors are to be expected; the hard errors are
> the ones you need to worry about.
>
> -rhugga
Thanks for the advice. Last things first: the query itself is not particularly costly, even though the SQL is big. You can see from the first FETCH #4 line that it generates only 222 buffer gets and 20 disk reads (the cr= and p= stats, respectively). The query uses bitmap indexes, which always makes the plans look more complicated. Running the query at other times, even with more disk reads (for example, after instance startup), results in great performance. This problem is an intermittent one.
As for the other commands, vxdisk list shows only this:
DEVICE       TYPE      DISK          GROUP       STATUS
c0t0d0s2     sliced    rootdisk_1    rootdg      online
c0t1d0s2     sliced    mirboot_disk  rootdg      online
c1t0d0s2     sliced    orac1t0       oracledg    online
c1t1d0s2     sliced    orac1t1       oracledg    online
c1t2d0s2     sliced    orac1t2       oracledg    online
c1t3d0s2     sliced    orac1t3       oracledg    online
c1t4d0s2     sliced    orac1t4       oracledg    online
c1t5d0s2     sliced    orac1t5       oracledg    online
c1t8d0s2     sliced    orac1t8       oracledg    online
c1t9d0s2     sliced    orac1t9       oracledg    online
c1t10d0s2    sliced    orac1t10      oracledg    online
c1t11d0s2    sliced    orac1t11      oracledg    online
c1t12d0s2    sliced    orac1t12      oracledg    online
c1t13d0s2    sliced    orac1t13      oracledg    online
c2t0d0s2     sliced    orac2t0       oracledg    online
c2t1d0s2     sliced    orac2t1       oracledg    online
c2t2d0s2     sliced    orac2t2       oracledg    online
c2t3d0s2     sliced    orac2t3       oracledg    online
c2t4d0s2     sliced    orac2t4       oracledg    online
c2t5d0s2     sliced    orac2t5       oracledg    online
c2t8d0s2     sliced    orac2t8       oracledg    online
c2t9d0s2     sliced    orac2t9       oracledg    online
c2t10d0s2    sliced    orac2t10      oracledg    online
c2t11d0s2    sliced    orac2t11      oracledg    online
c2t12d0s2    sliced    orac2t12      oracledg    online
c2t13d0s2    sliced    orac2t13      oracledg    online
It looks okay to me.
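If it turns out I need to dig deeper on the VxVM side, I'd presumably look at per-disk detail and the plex/subdisk states with something along these lines (the disk and group names are taken from the output above):

vxdisk list c1t2d0s2        # verbose detail for one of the suspect disks
vxprint -ht -g oracledg     # volume/plex/subdisk states for the Oracle disk group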
iostat -En shows a few hard errors:
iostat -En | grep 'Hard Err'
c0t0d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c0t1d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c0t6d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t0d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 1
c1t1d0    Soft Errors: 0 Hard Errors: 2 Transport Errors: 1
c1t2d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
c1t3d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
c1t4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t5d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
c1t8d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t9d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t10d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t11d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t12d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t13d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t0d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t1d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t2d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t3d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t5d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t8d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t9d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t10d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t11d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t12d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t13d0   Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
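To pull out just the devices with non-zero hard or transport error counts, something like this one-liner should work (the awk field numbers assume the exact line format shown above):

iostat -En | grep 'Hard Errors:' | awk '$7 > 0 || $10 > 0'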
In the trace file above, the two files being read from (files 3 & 4) live on some of the disks that are reporting hard errors. File 3 is on a RAID 10 volume composed of disks c1t2d0, c1t3d0, c2t2d0, and c2t3d0. File 4 is on a RAID 10 volume composed of c1t4d0, c1t5d0, c2t4d0, and c2t5d0.
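For reference, the file-number-to-datafile mapping can be double-checked from the database side, and the volume-to-disk layout from VxVM, with something like the following (the volume name below is only a placeholder for whichever volume the datafile actually sits on):

select file#, name from v$datafile where file# in (3, 4);

vxprint -ht -g oracledg <volume_name>    # placeholder volume name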
I saw these errors earlier, but I assumed they were not the source of this problem: there are only 1 or 2 per disk, the box has been up for weeks, and we're seeing these slowdowns at least once every hour or two. Am I wrong in that assumption?
In any case, I'm turning on iostat & vmstat logging with timestamps and resuming my 10046 traces on my critical sessions, and I'll see if I can capture more information on what's going on during the slow periods.
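For the OS-level logging I'm planning something simple along these lines (the interval and log locations are arbitrary choices):

#!/bin/sh
# rough sketch: timestamped iostat/vmstat samples once a minute
while true
do
    echo "==== `date` ====" >> /var/tmp/iostat.log
    iostat -xn 5 2 >> /var/tmp/iostat.log     # 2nd report covers the last 5 seconds
    echo "==== `date` ====" >> /var/tmp/vmstat.log
    vmstat 5 2 >> /var/tmp/vmstat.log         # likewise, 2nd sample is the live one
    sleep 60
done

The 10046 traces are just the usual alter session set events '10046 trace name context forever, level 8' in the sessions I care about.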
-S