Re: really slow RMAN backups
Steve,
Here are a few thoughts -- for what they're worth. I'm sure others on this list can offer much better feedback.
Okay, so, silly questions out of the way, here are the observations I promised earlier...
You said:
> I don't have any experience with netapp and want to see if there are
> some known issues with it.
One comes to mind. With (redhat) Linux, it is not possible to do asynchronous I/O against NetApp storage. Not if you're using NFS, anyway. This can have huge implications for I/O performance, especially if you happen to be assuming that you are (capable of) doing async I/O...
You said:
> I don't know why they chose directio (1 dbwr) instead of async. they
> may not have anything to do with it, but it's the first time I saw
> them set on a RAC database.
Lack of async I/O could be a major factor here. Here's the bad news: "they" chose not to use Async I/O because it is not available (i.e. not possible) with NFS-on-redhat-linux. Not much of a choice, really...
All of your I/O is being done synchronously. And this can lead to serious bottlenecks. (Mostly on writes, though.)
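By the way, if you want to put a number on what synchronous writes to that NFS mount actually cost, a quick probe along these lines might help. This is only a rough sketch -- the mount point path is made up, and it times O_SYNC buffered writes rather than true direct I/O -- but it should give you a ballpark per-write latency:

import os, time

PATH = "/mnt/netapp/probe.dat"   # made-up path -- point it at the NFS mount
BLOCK = b"\0" * 8192             # 8KB, roughly one database block
COUNT = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_SYNC)
start = time.time()
for _ in range(COUNT):
    os.write(fd, BLOCK)
elapsed = time.time() - start
os.close(fd)
os.remove(PATH)

print("avg synchronous write latency: %.1f ms" % (elapsed * 1000.0 / COUNT))
print("effective throughput: %.2f MB/s" % (COUNT * 8192 / elapsed / 1048576.0))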
You said:
> I ran an awr report and "RMAN backup & recovery I/O" was the top
> waiter with an avg wait of 134 ms.
Average wait of 134 ms? That's about 7 (synchronous) I/Os per second. At 8KB per I/O (you didn't tell us DB_BLOCK_SIZE) that's about 56KB/s, or around 200MB/hr. Obviously, you're not bottlenecked (completely) on this all of the time -- your backups would take 2,000+ hours, not 20+ hours.
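For what it's worth, here's the back-of-the-envelope arithmetic, so you can plug in your real DB_BLOCK_SIZE (the 8KB is an assumption on my part):

io_per_sec = 1000.0 / 134                  # ~7.5 synchronous I/Os per second
mb_per_hr = io_per_sec * 8 * 3600 / 1024   # ~210 MB/hr at 8KB per I/O
print("fully serialized throughput: ~%d MB/hr" % mb_per_hr)
print("hours to move 500GB at that rate: ~%d" % (500 * 1024 / mb_per_hr))   # 2,000+ hours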
I don't know much about this particular wait (obviously). I would want to understand what it means a lot better before really running with this, but that 134ms average wait does not sound (at all) promising.
So, you're backing up a 500GB database. To do it in 10 hours (that's a lot) you need to sustain 50GB/hr -- end to end -- just for the backups. That's around 15MB per second. That could mean (something vaguely like) reading from the NetApp at 15MB/s, writing to the flash recovery area (also on the NetApp?) at 15MB/s, reading again from the flash recovery area at 15MB/s, transmitting the backup over the network to the media manager at 15MB/s, staging the backup data to disk at 15MB/s, destaging the backup data from disk at 15MB/s, and (finally!) writing to tape at 15MB/s. All concurrently!
So, depending on the answers to the "silly" questions above, I count somewhere up to 6 or 7 traversals of your IP network, for a total of around 100MB/s (roughly a full gigabit link), and total NetApp throughput (just for backups) of maybe 90MB/s. How much (sustained) I/O can it do?
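Again, rough numbers only -- the traversal count is a guess pending answers to those questions -- but the sizing arithmetic is simple enough:

db_mb = 500 * 1024            # 500GB database
window_sec = 10 * 3600        # 10-hour backup window
traversals = 7                # disk reads/writes plus network hops counted above (a guess)

per_stream = db_mb / float(window_sec)
print("per-stream rate needed: ~%.0f MB/s" % per_stream)                     # ~15 MB/s
print("aggregate I/O and network: ~%.0f MB/s" % (per_stream * traversals))   # ~100 MB/s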
You may want to consider DBWR_IO_SLAVES for your database. This is probably not (directly) related to backups, but you didn't tell us what else your database has been waiting on. In any event, environments where ASYNC I/O is unavailable (yours is one) are the rare cases where DBWR_IO_SLAVES can be warranted.
And if you haven't already, you may want to look into BACKUP_TAPE_IO_SLAVES, too...
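Something along these lines, perhaps -- just a sketch, and the slave count is a placeholder to test with, not a recommendation:

disk_asynch_io = false         # already forced on you by NFS-on-redhat-linux
dbwr_io_slaves = 2             # placeholder value -- benchmark before settling on one
backup_tape_io_slaves = true   # lets RMAN simulate asynchronous I/O to the tape channels

One thing to watch: the I/O slaves take their buffers from the large pool, so that 52M LARGE_POOL_SIZE is worth keeping an eye on if you go this route.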
On 8/21/06, Steve Perry <sperry_at_sprynet.com> wrote:
>
> This was just passed to me, but I thought I'd check with the group to
> see if anyone else has experienced this slowness.
>
>
>
> RMAN backups (2 tape channels) take forever on this system. forever
> means 20+ hours.
>
> the view v$backup_sync_io shows the effective bytes per second at 2
> or 3 MB per second. nothing above 5MB per second.
> v$backup_async_io doesn't show anything.
>
> Setup.
> 500GB database on a netapp filer (40+ disks, don't know the model)
> with ASM
> 32-bit 10.2.0.1
> 2 - node RAC EE cluster
> rhel3
> 2 cpu
> 1 GB swap
> 4GB ram
> 600 MB SGA (small and uses the automatic memory management)
> flash recovery area is on
> DG is set up for 2 different databases
> mtu sizes of all NICs are set to 1500 (since it's netapp, they might
> prefer something else)
> legato is the media manager
>
> I looked at the init.ora settings and besides the small sga,
> disk_asynch_io = false
> filesystemio_options = directIO
> large_pool_size = 52M
>
> I don't know why they chose directio (1 dbwr) instead of async. they
> may not have anything to do with it, but it's the first time I saw
> them set on a RAC database.
>
> I ran an awr report and "RMAN backup & recovery I/O" was the top
> waiter with an avg wait of 134 ms. the class is "system io".
> other things are an index with 19 million buffer gets during the 2-hour
> snapshot.
> I see a few slow access times 300ms avg. read time, but there are
> only 200 or so reads against it. Most of the access times are less
> than 20ms.
> I don't know if the problem is contention with other jobs, config
> parameter or hardware.
>
> I checked a similar system (db ver, 2 node rac, asm) that gets
> 80-90MB per second for its backup.
> it's on the SAN and uses async.
> I haven't looked at the awr report from it.
>
> any suggestions?
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>
--
Cheers,
-- Mark Brinsmead
   Staff DBA, The Pythian Group
   http://www.pythian.com/blogs

--
http://www.freelists.org/webpage/oracle-l

Received on Mon Aug 21 2006 - 21:45:44 CDT