Application of archived logs on physical standby very slow

From: Martin Bach <development_at_the-playground.de>
Date: Wed, 27 Jan 2010 17:13:28 +0000
Message-ID: <4B607438.3040702_at_the-playground.de>



Dear listers,

I'd like to add another "interesting" problem to today's interesting RMAN problem. I have quite a bit of output from diagnostic tools; hopefully it doesn't get too badly mangled...

Our alerting system fired when my physical standby database went out of sync with production. DR and production share the same storage (an entry-level MSA) and are multipathed. This is what I could observe on the standby host (both RHEL 5.2 32bit, don't ask):

Wed Jan 27 10:42:42 2010
RFS[2]: Archived Log: '/u03/oradata/dr/arch/1_49264_596779339.dbf'
Primary database is in MAXIMUM PERFORMANCE mode
Wed Jan 27 10:47:55 2010
Media Recovery Log /u03/oradata/dr/arch/1_49252_596779339.dbf
Wed Jan 27 10:55:27 2010
Media Recovery Log /u03/oradata/dr/arch/1_49253_596779339.dbf
Wed Jan 27 11:01:46 2010

Online redo logs are 500M in size, as are most archived logs, and as you can see from the timestamps it consistently takes 5-10 minutes to apply a single log.
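To put a number on it, the recovery views should show the apply rate directly; a quick sketch (assuming the standard 10.2 views v$managed_standby and v$recovery_progress are populated on the standby):

-- where is MRP0, and which sequence is it working on?
select process, status, thread#, sequence#, block#
  from v$managed_standby
 where process like 'MRP%';

-- media recovery throughput; 'Average Apply Rate' and
-- 'Apply Time per Log' are the items of interest here
select item, units, sofar, total
  from v$recovery_progress
 where type = 'Media Recovery';

In the meantime, here is what top shows on the standby host: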

top - 11:25:37 up 57 days, 17:34, 13 users,  load average: 0.55, 1.05, 1.19
Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.8%us,  1.9%sy,  0.0%ni, 51.9%id, 43.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.9%us,  0.9%sy,  0.0%ni, 42.6%id, 54.6%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.9%us,  0.0%sy,  0.0%ni, 98.1%id,  0.9%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.9%sy,  0.0%ni, 99.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8310064k total,  8019968k used,   290096k free,    36456k buffers
Swap:  8152948k total,    70124k used,  8082824k free,  7568912k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21647 oracle    15   0 2178m  48m  31m S    3  0.6   0:00.41 ora_mrp0_dr
 7496 oracle    15   0 2145m 1.5g 1.5g S    2 18.4 190:47.10 ora_p000_dr
 7500 oracle    15   0 2145m 1.5g 1.5g S    2 18.4 199:38.12 ora_p002_dr
 2580 root      10  -5     0    0    0 S    1  0.0  17:42.07 [kjournald]
 7498 oracle    16   0 2145m 1.5g 1.5g D    1 18.4 189:40.17 ora_p001_dr

The only processes on CPU are MRP0 and its parallel recovery slaves. Funnily enough, iostat doesn't seem to see a problem:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25   24.75    0.00   75.00

Device:     rrqm/s  wrqm/s    r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0    0.00    4.00   0.00    8.00     0.00    48.00    12.00     0.01    0.88   0.50   0.40
cciss/c0d1    0.00    5.00   0.00   18.00     0.00    92.00    10.22     0.02    0.89   0.06   0.10
sda           0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb           0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc           0.00  211.00   0.00  189.00     0.00  1600.00    16.93     0.49    2.57   2.57  48.60
sdd           0.00  203.00   0.00  183.00     0.00  1544.00    16.87     0.50    2.71   2.74  50.10
sde           0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf           0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0          0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1          0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2          0.00    0.00   0.00  786.00     0.00  3144.00     8.00     2.09    2.64   1.26  98.70
dm-3          0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4          0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5          0.00    0.00   0.00  786.00     0.00  3144.00     8.00     2.09    2.65   1.26  98.80
dm-6          0.00    0.00   0.00   23.00     0.00    92.00     8.00     0.02    0.83   0.04   0.10
dm-7          0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-8          0.00    0.00   0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Service time is pretty decent, I'd say, and the SAN administrators tell me that everything on the array is fine. I also checked that asynchronous I/O (AIO) is in use:

[oracle]$ cat /proc/slabinfo | grep ^k

kioctx    25    60    192   20    1 : tunables  120   60    8 : slabdata      3      3      0
kiocb     64    90    128   30    1 : tunables  120   60    8 : slabdata      3      3      0
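For completeness, the database side can be cross-checked too; a quick SQL*Plus sanity check (these are standard parameters in 10.2):

SQL> show parameter disk_asynch_io
SQL> show parameter filesystemio_options

disk_asynch_io defaults to TRUE, and filesystemio_options governs whether async and/or direct I/O is requested on file systems, if I read the docs right.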

Data files are scattered over mount points /u02-/u06, and there are no multipathing (or other I/O-related) errors reported in dmesg or /var/log/messages. Archived logs arrive on the DR box in time, and there are no gaps. Our sysadmin tells me that device-mapper multipath is happily distributing I/O round robin. Oh yes, and it's Oracle 10.2.0.4 32bit.
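The next thing on my list is to catch what MRP0 and the recovery slaves are actually waiting on while a log is being applied; a sketch along these lines (assuming the usual v$session/v$session_wait views, and that the slaves show up as P000-P002 as in the top output above):

select s.sid, s.program, w.event, w.state, w.seconds_in_wait
  from v$session s, v$session_wait w
 where w.sid = s.sid
   and (s.program like '%(MRP0)%' or s.program like '%(P00%)');

If that mostly shows 'checkpoint completed' rather than redo read I/O, the bottleneck would be checkpointing on the standby rather than the storage.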

Now, does anyone have an idea how to proceed?

Regards,

Martin

-- 
Martin Bach
OCM 10g
http://martincarstenbach.wordpress.com
http://www.linkedin.com/in/martincarstenbach