RE: Application performance hit when performing archived log backups
Date: Tue, 17 May 2016 06:12:57 -0400
Message-ID: <025e01d1b024$b0a61280$11f23780$_at_rsiz.com>
- Are your applications chatty? If you trace an application, do you see an increase in the sum of the durations of sqlnet waits while the backup is running?
- Do the data persistence stacks (memory to i/o controller or network to disk) collide amongst archived log location, archive log destination, database files, and temp locations? If you trace an application, do you see an increase in the sum of durations of database reads?
I would rule those two out first, but please notice that both start with tracing an application that runs differently (faster when not doing backup, slower when doing backup).
So the trace will show what is different. When you have the luxury of ops staff having reported slowdowns of specific things at specific times, please enjoy the benefit that you can quite easily measure instead of guessing.
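As a rough illustration of that comparison, here is a minimal Python sketch that sums wait durations per event from two extended SQL trace (10046) files, one taken while the backup is idle and one while it runs. It assumes the usual raw trace layout of `WAIT #<cursor>: nam='<event>' ela= <microseconds> ...`; the function names `sum_waits` and `compare` and the file names in the usage note are made up for the example, not part of any Oracle tool.

```python
import re
from collections import defaultdict

# Matches raw 10046 WAIT lines, e.g.
# WAIT #140: nam='db file sequential read' ela= 175 file#=1 block#=2 ...
WAIT_RE = re.compile(r"^WAIT #\d+: nam='([^']+)' ela=\s*(\d+)")


def sum_waits(lines):
    """Return {event name: total elapsed microseconds} for the given trace lines."""
    totals = defaultdict(int)
    for line in lines:
        m = WAIT_RE.match(line)
        if m:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)


def compare(baseline, during_backup):
    """Return (event, baseline_us, backup_us) rows, largest increase first."""
    events = set(baseline) | set(during_backup)
    rows = [(e, baseline.get(e, 0), during_backup.get(e, 0)) for e in events]
    return sorted(rows, key=lambda r: r[2] - r[1], reverse=True)
```

Usage would look something like `compare(sum_waits(open("no_backup.trc")), sum_waits(open("backup.trc")))` (hypothetical file names). If the SQL*Net events top the list, look at the network; if the database read events do, look at the I/O path.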
mwf
-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of Ryan January
Sent: Monday, May 16, 2016 1:47 PM
To: Listserv Oracle
Subject: Application performance hit when performing archived log backups
I've got a strange one that, as of yet, I've not been able to resolve. 1/8 rack Exadata v4 (x4-2), 11.2.0.4, OEL 6. High capacity disks, flash configured as cache rather than independent disks.
The DB is backing an in-house java/tomcat OLTP-ish application. Ops staff reported slowdowns at specific times. App slowdown verified via StatsD-collected performance metrics. Watching those metrics, we identified the issue to be common across the vast majority of DB calls and roughly 30 separate application instances (separate app/web servers per instance, all pointing to differing sets of application schemas within the same DB).
Poking through ASH data the only commonality I found was an increase in
system IO, which ultimately ended up being RMAN. We were performing very
simple archived log backups via rman with a parallelism of 4. Backup sets
are being pushed uncompressed across a 1Gb link to DataDomain mounted via
NFS.
App call times went from 1s normally to over 30s at their worst. Backing
that parallelism off to 1 resulted in a smaller spike with an increased
duration. Application performance falls off very sharply as the backup
begins, with a slower 2-3 minute exponential increase in performance as the
backup completes. The same execution plan has been verified for frequent
SQL statements during both good and bad times.
There is nothing specific to individual SQL statements which appears to be
problematic, all application calls during the time suffer. I've not yet
found any spikes in metrics that explain such a profound performance impact.
AWR reports during the time of slowdown seem similar to those when the
backup is not being performed. We've not been gathering data long enough to
determine if this is entirely new behavior, or if it's been getting
progressively worse over time.
Next steps for me: integrating AWR snapshots before and after the backup completes to narrow the window AWR is focused on. I'm also working on automating 10046 trace for further analysis and comparison.
Short of that, has anyone seen similar behavior? I would have never anticipated archived log backups to have this impact, but the timing matches up perfectly with 100% repeatability. If not the root cause, it's at least a contributor that we've identified as a fact. I would appreciate any suggestions of where we should consider continuing our troubleshooting, or guidance on what metrics we should gather.
Thank you,
Ryan
--
http://www.freelists.org/webpage/oracle-l
Received on Tue May 17 2016 - 12:12:57 CEST