RE: Application of archived logs on physical standby very slow
Date: Mon, 1 Feb 2010 17:07:42 +0200
Message-ID: <OF538CE2CB.8D3DF0CD-ONC22576BD.0052F3B4-C22576BD.0053211C_at_seb.lt>
A followup:
Event "PX Deq: Test for msg" is actually a top most event but I suspect it is kind of idle event actually. Looks like it is polling stuff: are you done, are you done,....
Please consider the environment before printing this e-mail
Laimutis.Nedzinsk as_at_seb.lt Sent by: To oracle-l-bounce_at_f ORACLE-L <oracle-l_at_freelists.org> reelists.org cc Subject 2010.02.01 17:03 RE: Application of archived logs on physical standby very slow Please respond to Laimutis.Nedzinsk as_at_seb.lt
Thank you for your input very much
>but on the subject of MRP scan/read of logs they say "This part of the
recovery is easily handled by the single MRP process".
I tried to generate some redo, checked again in controled environment and
yes, it may be so. Top events are "checkpoint completed" and "db file
parallel write"
Looks like it's time to search for hot disks, etc.
Brgds, Laimis
Please consider the environment before printing this e-mail
Luca Canali <Luca.Canali_at_cern .ch> To Sent by: ORACLE-L <oracle-l_at_freelists.org> oracle-l-bounce_at_f cc reelists.org Subject RE: Application of archived logs on 2010.01.29 12:39 physical standby very slow Please respond to Luca.Canali_at_cern. ch
Hi,
What do you see in V$RECOVERY_PROGRESS as "Active Apply Rate"? (see also
discussion in
http://jarneil.wordpress.com/2008/07/15/monitoring-your-standby-with-vrecovery_progress/
)
I have 35655 KB/s on one of my systems for example.
I had a look at the discussion on redo apply tuning in Larry Carpenter's (et al.) book at page 119 [one can find it on google books too in case paper in not available :)] they mention several diagnostic and tuning techniques, but on the subject of MRP scan/read of logs they say "This part of the recovery is easily handled by the single MRP process".
Cheers,
L.
PS: in terms of further investigations
http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gRecoveryBestPractices.pdf
seems to suggest setting Log_Archive_Trace=8192 for some 'advance troubleshooting'
-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of Laimutis.Nedzinskas_at_seb.lt
Sent: 29 January 2010 10:15
To: ORACLE-L
Subject: Re: Application of archived logs on physical standby very slow
I have the same issue.
vmstat shows low CPU, iostat is low but standby starts lagging by 50GB of redo when primary generates redo at some 15-20MB/second rate.
redo validation (alter system dump logfile validate...) shows 150-200 MB/sec read rate for redo.
My primary suspect is redo reader process. As far as I know it is only one process which reads and naturally it is a bottleneck.
vmstat may show next to nothing because our OS sees 64 CPUs (which is not but todays multicore, multithreaded, multi-whatever architectures translates into that many "CPU's" )
If one and the only reader takes 100% CPU then vmstat must show ~100/64 percent which is exactly what I get.
Is the reader tunable ? I doubt it....
Please consider the environment before printing this e-mail
Niall Litchfield <niall.litchfield _at_gmail.com> To Sent by: Martin Bach oracle-l-bounce_at_f <development_at_the-playground.de> reelists.org cc ORACLE-L <oracle-l_at_freelists.org> Subject 2010.01.28 15:00 Re: Application of archived logs on physical standby very slow Please respond to niall.litchfield_at_ gmail.com
What does the 10046 trace for the MRP process show up? You may well also endup doing os level tracing with strace or similar.
On Thu, Jan 28, 2010 at 9:15 AM, Martin Bach <development_at_the-playground.de
> wrote:
Hi Niall!
couldn't agree more. I seem to have the dilemma where my storage array seems to be in good working order and my OS doesn't report any unusually high IO service times. I'm wondering where else I could look for clues as to why, trying to follow a methodical approach to troubleshooting. The load is clearly caused by Oracle processes, shutting the standby down reduces the iowaits to 0 (but that doesn't really solve my problem :) Regards, Martin! On 01/27/2010 07:06 PM, Niall Litchfield wrote: I'd have thought this problem was eminently suited to method-r. On 1/27/10, Johnson, William L (TEIS)< WLJohnson_at_tycoelectronics.com> wrote: What do you have your parallel_execution_message_size set to on the standby database? -----Original Message----- From: oracle-l-bounce_at_freelists.org [mailto: oracle-l-bounce_at_freelists.org] On Behalf Of Martin Bach Sent: Wednesday, January 27, 2010 12:13 PM To: ORACLE-L Subject: Application of archived logs on physical standby very slow Dear listers, I'd like to add another "interesting" problem to today's interesting RMAN problem. I have quite a bit of output from diagnostic tools, hopefully this doesn't got too badly mangled... Our alerting system fired when my physical standby database went out of sync with production. DR and production share the same storage (an entry level MSA) and are multipathed. I could observe this on the hosts (both RHEL 5.2 32bit, don't ask): Wed Jan 27 10:42:42 2010 RFS[2]: Archived Log: '/u03/oradata/dr/arch/1_49264_596779339.dbf' Primary database is in MAXIMUM PERFORMANCE mode Wed Jan 27 10:47:55 2010 Media Recovery Log /u03/oradata/dr/arch/1_49252_596779339.dbf Wed Jan 27 10:55:27 2010 Media Recovery Log /u03/oradata/dr/arch/1_49253_596779339.dbf Wed Jan 27 11:01:46 2010 Online redo logs are 500M in size and so are most archived logs, and as you can see it takes 5-10 minutes to apply just one, which is consistent. top - 11:25:37 up 57 days, 17:34, 13 users, load average: 0.55, 1.05, 1.19 Tasks: 209 total, 1 running, 208 sleeping, 0 stopped, 0 zombie Cpu0 : 2.8%us, 1.9%sy, 0.0%ni, 51.9%id, 43.5%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 1.9%us, 0.9%sy, 0.0%ni, 42.6%id, 54.6%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.9%us, 0.0%sy, 0.0%ni, 98.1%id, 0.9%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.9%sy, 0.0%ni, 99.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 8310064k total, 8019968k used, 290096k free, 36456k buffers Swap: 8152948k total, 70124k used, 8082824k free, 7568912k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 21647 oracle 15 0 2178m 48m 31m S 3 0.6 0:00.41 ora_mrp0_dr 7496 oracle 15 0 2145m 1.5g 1.5g S 2 18.4 190:47.10 ora_p000_dr 7500 oracle 15 0 2145m 1.5g 1.5g S 2 18.4 199:38.12 ora_p002_dr 2580 root 10 -5 0 0 0 S 1 0.0 17:42.07 [kjournald] 7498 oracle 16 0 2145m 1.5g 1.5g D 1 18.4 189:40.17 ora_p001_dr The only processes on cpu are MRP0 and its parallel slaves. Funny enough, iostat doesn't seem to see a problem: avg-cpu: %user %nice %system %iowait %steal %idle 0.00 0.00 0.25 24.75 0.00 75.00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util cciss/c0d0 0.00 4.00 0.00 8.00 0.00 48.00 12.00 0.01 0.88 0.50 0.40 cciss/c0d1 0.00 5.00 0.00 18.00 0.00 92.00 10.22 0.02 0.89 0.06 0.10 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 211.00 0.00 189.00 0.00 1600.00 16.93 0.49 2.57 2.57 48.60 sdd 0.00 203.00 0.00 183.00 0.00 1544.00 16.87 0.50 2.71 2.74 50.10 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 786.00 0.00 3144.00 8.00 2.09 2.64 1.26 98.70 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 786.00 0.00 3144.00 8.00 2.09 2.65 1.26 98.80 dm-6 0.00 0.00 0.00 23.00 0.00 92.00 8.00 0.02 0.83 0.04 0.10 dm-7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Service time is pretty decent I'd say and the SAN administrators tell me that everything on the array is fine. I checked for AIO and that's also used: [oracle]$ cat /proc/slabinfo | grep ^k kioctx 25 60 192 20 1 : tunables 120 60 8 : slabdata 3 3 0 kiocb 64 90 128 30 1 : tunables 120 60 8 : slabdata 3 3 0 Data files are scattered over mount points /u02-/u06, and there are no multipathing (or other I/O related errors) reported in dmesg and /var/log/messages. Archive logs are received on the DR box in time, and there are no gaps. Our sys admin tells me that device mapper multipath is happily distributing IO round robin. Oh yes, and it's Oracle 10.2.0.4 32bit. Now does anyone have an idea on how to proceed? Regards, Martin -- Martin Bach OCM 10g http://martincarstenbach.wordpress.com. http://www.linkedin.com/in/martincarstenbach -- http://www.freelists.org/webpage/oracle-l -- http://www.freelists.org/webpage/oracle-l -- Martin Bach OCM 10g http://martincarstenbach.wordpress.com. http://www.linkedin.com/in/martincarstenbach
-- Niall Litchfield Oracle DBA http://www.orawin.info -- http://www.freelists.org/webpage/oracle-l -- http://www.freelists.org/webpage/oracle-l -- http://www.freelists.org/webpage/oracle-l -- http://www.freelists.org/webpage/oracle-lReceived on Mon Feb 01 2010 - 09:07:42 CST