Application performance hit when performing archived log backups
Date: Mon, 16 May 2016 12:46:31 -0500
Message-Id: <4A7DF8BD-384F-4D98-9A73-86D3120BF72C_at_gmail.com>
I've got a strange one that, as of yet, I've not been able to resolve. 1/8 rack Exadata v4 (x4-2) , 11.2.0.4, OEL 6. High capacity disks, flash configured as cache, rather than independent disks.
The DB is backing an in-house java/tomcat OLTP-ish application. Ops staff reported slowdowns at specific times. App slow down verified via StatsD collected performance metrics. Watching those metrics we identified the issue to be common across the vast majority of DB calls, and roughly 30 separate application instances. (separate app/web servers per instance, all pointing to differing sets of application schemas within the same DB)
Poking through ASH data the only commonality I found was an increase in system IO, which ultimately ended up being RMAN. We were performing very simple archived log backups via rman with a parallelism of 4. Backup sets are being pushed uncompressed across a 1Gb link to DataDomain mounted via NFS. App call times went from 1s normally to over 30s at their worst. Backing that parallelism off to 1 resulted in a smaller spike with an increased duration. Application performance falls off very sharply as the backup begins, with a slower 2-3 minute exponential increase in performance as the backup completes. The same execution plan has been verified for frequent SQL statements during both good and bad times. There is nothing specific to individual SQL statements which appears to be problematic, all application calls during the time suffer. I've not yet found any spikes in metrics that explain such a profound performance impact. AWR reports during the time of slowdown seem similar to those when the backup is not being performed. We've not been gathering data long enough to determine if this is entirely new behavior, or if it's been getting progressively worse over time.
Next steps for me: integrating AWR snapshots before and after the backup completes to narrow the window AWR is focused on. I'm also working on automating 10046 trace for further analysis and comparison.
Short of that, has anyone seen similar behavior? I would have never anticipated archived log backups to have this impact, but the timing matches up perfectly with 100% repeatability. If not the root cause, it's at least a contributor that we've identified as a fact. I would appreciate any suggestions of where we should consider continuing our troubleshooting, or guidance on what metrics we should gather.
Thank you,
Ryan--
http://www.freelists.org/webpage/oracle-l
Received on Mon May 16 2016 - 19:46:31 CEST