Re: Locating a bad disk on Solaris w/Veritas
rhugga_at_yahoo.com (Keg) wrote in message news:<6c795a35.0410261720.4bb79c5d_at_posting.google.com>...
>
> 'vxdisk list' will show disks that Veritas has failed. If nothing
> shows up there, then look at 'iostat -En', which will show hard/soft
> errors. If there aren't any errors listed for the device there, then
> the problem is likely somewhere else. (Note that iostat will register
> errors from faulty cables, the SCSI bus, fibre, etc.)
>
> Also, have you looked at mpstat and iostat while that query is
> running? That query is quite large and might be thrashing away at the
> disk. (Are index and data separated onto different devices, and all
> that?) As you can see from your trace, there is quite a bit going on
> with that query: multiple I/Os and sorts. Without seeing more info, I
> would guess that Oracle is churning away on the query and then
> returning the result, which is why it seems it is 'taking 16 seconds
> to read a block'.
>
> If iostat doesn't display any soft/hard errors, it is almost certainly
> safe to assume there is no hardware problem with the controller, disk,
> or bus. (At least according to Veritas support, and in my experience I
> have not seen anything to contradict what they have told me.)
>
> My suggested course of action would be to look at 'iostat -En' first.
> If all is well there, start logging mpstat and iostat (man iostat &&
> man mpstat) while that query is running. Watch your data disk, index
> disk, and temp disk during the query's execution.
>
> Here is a sample of what iostat -En will return:
> sd1185 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
> Vendor: IBM Product: 1742 Revision: 0520 Serial No:
> Size: 54.47GB <54473523200 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0
>
> Note that a few soft errors are to be expected; the hard errors are
> the ones you need to worry about.
>
> -rhugga
Thanks for the advice. Last things first: the query itself is not particularly costly, even though the SQL is big. You can see from the first FETCH #4 line that it generates only 222 buffer gets and 20 disk reads (the cr= and p= stats, respectively). The query uses bitmap indexes, which always makes the plans look more complicated. Running the query at other times, even with more disk reads (for example, after instance startup), results in great performance. This problem is an intermittent one.
As for the other commands, vxdisk list shows only this:
DEVICE       TYPE      DISK          GROUP       STATUS
c0t0d0s2     sliced    rootdisk_1    rootdg      online
c0t1d0s2     sliced    mirboot_disk  rootdg      online
c1t0d0s2     sliced    orac1t0       oracledg    online
c1t1d0s2     sliced    orac1t1       oracledg    online
c1t2d0s2     sliced    orac1t2       oracledg    online
c1t3d0s2     sliced    orac1t3       oracledg    online
c1t4d0s2     sliced    orac1t4       oracledg    online
c1t5d0s2     sliced    orac1t5       oracledg    online
c1t8d0s2     sliced    orac1t8       oracledg    online
c1t9d0s2     sliced    orac1t9       oracledg    online
c1t10d0s2    sliced    orac1t10      oracledg    online
c1t11d0s2    sliced    orac1t11      oracledg    online
c1t12d0s2    sliced    orac1t12      oracledg    online
c1t13d0s2    sliced    orac1t13      oracledg    online
c2t0d0s2     sliced    orac2t0       oracledg    online
c2t1d0s2     sliced    orac2t1       oracledg    online
c2t2d0s2     sliced    orac2t2       oracledg    online
c2t3d0s2     sliced    orac2t3       oracledg    online
c2t4d0s2     sliced    orac2t4       oracledg    online
c2t5d0s2     sliced    orac2t5       oracledg    online
c2t8d0s2     sliced    orac2t8       oracledg    online
c2t9d0s2     sliced    orac2t9       oracledg    online
c2t10d0s2    sliced    orac2t10      oracledg    online
c2t11d0s2    sliced    orac2t11      oracledg    online
c2t12d0s2    sliced    orac2t12      oracledg    online
c2t13d0s2    sliced    orac2t13      oracledg    online
It looks okay to me.
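If it turns out I need to dig deeper on the VxVM side, I'd presumably look at per-disk detail and the plex/subdisk states with something along these lines (the disk and group names are taken from the output above):

vxdisk list c1t2d0s2        # verbose detail for one of the suspect disks
vxprint -ht -g oracledg     # volume/plex/subdisk states for the Oracle disk group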
iostat -En shows a few hard errors:
iostat -En | grep 'Hard Err'
c0t0d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c0t1d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c0t6d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t0d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 1
c1t1d0    Soft Errors: 0 Hard Errors: 2 Transport Errors: 1
c1t2d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
c1t3d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
c1t4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t5d0    Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
c1t8d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t9d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t10d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t11d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t12d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c1t13d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t0d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t1d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t2d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t3d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t4d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t5d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t8d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t9d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t10d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t11d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t12d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c2t13d0   Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
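To pull out just the devices with non-zero hard or transport error counts, something like this one-liner should work (the awk field numbers assume the exact line format shown above):

iostat -En | grep 'Hard Errors:' | awk '$7 > 0 || $10 > 0'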
In the trace file above, the two files being read from (files 3 & 4) live on some of the disks that are reporting hard errors. File 3 is on a RAID 10 volume composed of disks c1t2d0, c1t3d0, c2t2d0, and c2t3d0. File 4 is on a RAID 10 volume composed of c1t4d0, c1t5d0, c2t4d0, and c2t5d0.
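For reference, the file-number-to-datafile mapping can be double-checked from the database side, and the volume-to-disk layout from VxVM, with something like the following (the volume name below is only a placeholder for whichever volume the datafile actually sits on):

select file#, name from v$datafile where file# in (3, 4);

vxprint -ht -g oracledg <volume_name>    # placeholder volume name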
I saw these errors earlier, but I assumed they were not the source of this problem: there are only 1 or 2 per disk, the box has been up for weeks, and we're seeing these slowdowns at least once every hour or two. Am I wrong in that assumption?
In any case, I'm turning on iostat & vmstat logging with timestamps and resuming my 10046 traces on my critical sessions, and I'll see if I can capture more information on what's going on during the slow periods.
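For the OS-level logging I'm planning something simple along these lines (the interval and log locations are arbitrary choices):

#!/bin/sh
# rough sketch: timestamped iostat/vmstat samples once a minute
while true
do
    echo "==== `date` ====" >> /var/tmp/iostat.log
    iostat -xn 5 2 >> /var/tmp/iostat.log     # 2nd report covers the last 5 seconds
    echo "==== `date` ====" >> /var/tmp/vmstat.log
    vmstat 5 2 >> /var/tmp/vmstat.log         # likewise, 2nd sample is the live one
    sleep 60
done

The 10046 traces are just the usual alter session set events '10046 trace name context forever, level 8' in the sessions I care about.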
-S