IO Contention on the Redo Logs

From: Pat <pat.casey_at_service-now.com>
Date: Wed, 14 Oct 2009 13:58:16 -0700 (PDT)
Message-ID: <7014718d-528c-4ca1-b1b9-54e64838dc2f_at_u36g2000prn.googlegroups.com>



I've been trying to troubleshoot a troublesome (sic) performance issue on one of our busiest Oracle servers and I was hoping somebody here might have some insight into what I'm seeing.

Every "now and then" (it's not predictable), when the box is under a whole lot of IO load (lots of read/write activity), the whole box bogs down incredibly and I get about a 3 minute "hang". If you look at the wait tree, everybody is waiting on the "log file parallel write" event.

Problem is, if I look at my sar reports (or vmstat) during one of these 3 minute hangs, the IOs on the box drop to the floor, i.e. we do far fewer IOs during a hang than before and after. We're seeing maybe 25k blocks/sec in/out before and after, and that drops down to 25-50 blocks/sec in/out during the hang.
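To pin down exactly when a hang starts and ends, the per-interval block counts from vmstat or sar can be scanned for runs where throughput collapses relative to the normal level. What follows is only a minimal Python 3 sketch under stated assumptions: the function name `find_stalls`, the 1% threshold, and the sample values are hypothetical (the samples merely mimic the 25k versus 25-50 blocks/sec pattern described above and are not real measurements).

```python
from statistics import median


def find_stalls(blocks_per_sec, ratio=0.01):
    """Return (start, end) index pairs of consecutive samples whose
    throughput is below `ratio` times the median of all samples.

    `blocks_per_sec` is a list of per-interval block counts, for example
    the sum of the bi and bo columns from vmstat. `end` is inclusive.
    """
    if not blocks_per_sec:
        return []
    threshold = median(blocks_per_sec) * ratio
    stalls = []
    start = None
    for i, value in enumerate(blocks_per_sec):
        if value < threshold:
            if start is None:
                start = i                  # a stall begins at this sample
        elif start is not None:
            stalls.append((start, i - 1))  # stall ended on the previous sample
            start = None
    if start is not None:                  # stall runs to the end of the data
        stalls.append((start, len(blocks_per_sec) - 1))
    return stalls


# Illustrative samples only (not real measurements): busy, stalled, busy.
samples = [25000, 24800, 25100, 40, 25, 50, 30, 25200, 24900, 25050, 25300]
print(find_stalls(samples))  # [(3, 6)]
```

Using the median as the baseline only works while stalled intervals are a minority of the samples; with longer captures a fixed threshold may be the safer choice.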

We've engaged Oracle support on this and, while they're not certain they know what's going on, they have pointed to generally poor performance of the IO subsystem when writing REDO logs.

Right now, the entire database has everything mounted on a single Fibre Channel LUN on /u01. Even the REDO logs are on that same LUN.

One of the things the storage guys have been pointing to is that there's a single IO queue per LUN on our QLogic cards, so my REDO traffic is, in fact, fighting its way down the same scheduler queue as my normal data blocks. They've suggested provisioning a new pair of smaller LUNs, one for each half of the REDO log group.
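The storage team's argument can be put as back-of-envelope arithmetic. The Python sketch below is a toy model only: it assumes a strictly FIFO queue and a fixed service time per request, which real schedulers, HBAs and SAN controllers do not guarantee, and the function name `redo_wait_ms` and all numbers are hypothetical.

```python
def redo_wait_ms(queued_data_ios, service_ms):
    """Time a redo write waits before being serviced when it sits behind
    `queued_data_ios` requests in one strictly FIFO queue, each taking
    `service_ms` milliseconds. A dedicated redo LUN with an otherwise
    empty queue corresponds to queued_data_ios=0.
    """
    return queued_data_ios * service_ms


# Hypothetical queue depths and service time, chosen only to show the shape.
for depth in (0, 32, 256):
    print(depth, redo_wait_ms(depth, service_ms=0.5))
```

In this model the redo write's delay grows linearly with the number of data-block requests queued ahead of it, which is the effect separate LUNs are meant to avoid.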

Another thing that's been suggested is that I switch the RedHat IO scheduler from CFQ to NOOP and just let the HBA and SAN handle the block reordering. I'm dubious about this one, though, since I'm not seeing a bottleneck on the host scheduler and I have to assume there's some benefit to the block reordering going on here.
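Before changing anything, it helps to confirm which scheduler each device is using right now. On Linux the per-device setting is exposed in /sys/block/<device>/queue/scheduler, with the active scheduler shown in square brackets. The following is a minimal Python 3 sketch for reading it; the function names and the device name "sda" are hypothetical placeholders, and the sample string is just the typical shape of that file, not output captured from this host.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split the contents of a queue/scheduler file into (active, available).

    The kernel marks the active scheduler with [brackets], for example
    "noop anticipatory deadline [cfq]".
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets around the active one
            active = token
        available.append(token)
    return active, available


def read_scheduler(device):
    """Read the scheduler setting for a block device such as "sda"
    (hypothetical name; substitute the device backing the LUN)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())


# Parsing a literal string, so the example runs without touching /sys.
print(parse_scheduler("noop anticipatory deadline [cfq]"))
# ('cfq', ['noop', 'anticipatory', 'deadline', 'cfq'])
```

Writing a scheduler name such as noop to the same file as root switches that one device at runtime without a reboot, so the change is easy to try and revert; making it the default for all devices is done with the elevator= kernel boot parameter.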

So I suppose my questions to the group are:

  1. Has anybody else seen similar "hangups" with the characteristic lack of IO throughput I identified above?
  2. If you're deploying Oracle on a SAN, what, in your experience, is the optimal layout of files on LUNs? I know how I lay things out on DASD, but the rules in the SAN world look to be subtly different.
  3. Does anybody have any experience tweaking the RedHat IO schedulers? What are folks' experiences with the different options?

Particulars:
Oracle: 10.2.0.4
Host: 8 cores (Intel), 32G
OS: RedHat EL 5
Storage: NetApp 3040
HBA: QLogic

Received on Wed Oct 14 2009 - 15:58:16 CDT
