Fwd: Re: Why I don't like RMAN repositories

From: Tim Gorman <tim_at_evdbt.com>
Date: Wed, 11 Dec 2013 11:06:39 -0700
Message-ID: <52A8A9AF.2030702_at_evdbt.com>



...oh yes, and I forgot to mention a big "+1" to Andy Klock for mentioning the analytic aspect of an RMAN recovery catalog.

One good practical example of that approach is the "estimate_mttr.sql" script that I have posted on my website at "http://www.evdbt.com/scripts", which uses a single mongo-licious SQL statement that takes the time-to-backup of each component needed for recovery and uses it to estimate the time-to-restore for a database. Of course, the biggest weakness of the script is the assumption that time-to-backup equals time-to-restore, but it was fun to pull together and an eye-opening view into the ways that the data in the recovery catalog database can be turned into useful analytic information.
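The flavor of the idea (a rough sketch only, not Tim's actual script, which is far more involved) can be expressed against the standard recovery catalog views; the 'D' backup-type filter and the minutes conversion are my assumptions here:

```sql
-- Sketch: use time-to-backup of datafile backup sets as a proxy for
-- time-to-restore, summed per database, from the recovery catalog views
-- RC_BACKUP_SET and RC_BACKUP_DATAFILE.
SELECT df.db_name,
       SUM(bs.completion_time - bs.start_time) * 24 * 60
           AS est_restore_minutes
FROM   rc_backup_set      bs
JOIN   rc_backup_datafile df ON df.bs_key = bs.bs_key
WHERE  bs.backup_type = 'D'        -- datafile (full) backup sets
GROUP  BY df.db_name;
```

None of this is visible through the stock RMAN `LIST`/`REPORT` commands in this aggregated form, which is the analytic point being made.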

Just like the AWR repository, the recovery catalog repository contains lots of great information not revealed by the standard functionality of the product(s) that use it.

You can't improve that which you don't measure.

Thanks Andy!

-------- Original Message --------
Subject: Re: Why I don't like RMAN repositories
Date: Wed, 11 Dec 2013 10:59:56 -0700
From: Tim Gorman <tim_at_evdbt.com>
Reply-To: tim_at_evdbt.com
Organization: Evergreen Database Technologies, Inc.
To: oracle-l_at_freelists.org

Please pardon the lengthy response...

Part of my perspective comes from being an independent contractor and only working short-term engagements with various customers. In that capacity, I get panic calls from folks who are in a nasty recovery situation, and more on that later in this long-winded response...

The other part of my perspective is being the U.S. rep for the DUDE utility from Kurt van Meerbeeck in Belgium. Again, in this capacity, I get panicky calls and demands for immediate resolution from folks who got nailed one way or another, and are now up against the last resort for saving their data. Kurt does the real work, and I just handle the business end, but one of the final parts of any DUDE engagement is an attempt to equip the victim with the realization that they need help in better securing their data. Often it falls on deaf ears as they move on to the next crisis; sometimes they pay attention. C'est la vie.

OK. So about 10 years ago I get a call from a friend who is a DBA at a company at which I've previously worked. On the phone, his voice is faint and hoarse as he has been working continuously round-the-clock for over a week on the recovery of a production Oracle9i (or 8i? I forget) database supporting E-Biz 11.0.3. Yes, the database crashed eight days ago, he has been working continuously that whole time, and right now he is defeated and exhausted (not necessarily in that order). The CIO (a notorious screamer) has been snarling and screaming in the background the whole time; he is hoarse too. Everyone has that thousand-yard stare. They called because they feel they are on the verge of losing everything, and they're right.

So, working slowly, knowing that anything I did could snap the slender thread holding everything together, I started assessing the situation. In the end, it turned out that the problem was a datafile that had been off-lined and on-lined a month prior; I don't recall the exact details, since this happened about 10 years ago. The parameter CONTROL_FILE_RECORD_KEEP_TIME was still at the default of "7" (days), so the poor DBA kept banging into the same wall: he needed to recover archived logs and datafiles going back over a month, but the control file only had 7 days (or less) of that information. We had no recovery catalog database. The sysadmins for tape management said, "Yeah, we've got everything going back 3 months, but you gotta tell me the file-name so I can find it and retrieve it". No, of course we had no way of knowing what the RMAN-generated names were, and we didn't feel like wasting time figuring it out.
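For anyone wanting to avoid that wall: the parameter is dynamic and easy to check and raise (the value 45 below is just an illustration; pick something longer than your worst-case restore window):

```sql
-- In SQL*Plus, check the current setting (the default is 7 days):
SHOW PARAMETER control_file_record_keep_time

-- Raise it so the control file retains RMAN records long enough
-- to cover your longest realistic restore scenario (value in days):
ALTER SYSTEM SET control_file_record_keep_time = 45 SCOPE = BOTH;
```

Note this only sizes the circular record sections in the control file; it is not a substitute for a recovery catalog, just a bigger belt.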

So, in the end, what I did was first make an image copy of everything in case I screwed up, which is always the best first step of any recovery process. Once that was done, the current control file was pretty much useless, because it had been 8 days since the crash and the retention was defaulting to 7 days per CONTROL_FILE_RECORD_KEEP_TIME. But I could still find control file backups recorded in it, so I restored the oldest control file available, which was about 6-7 days old. Then I switched to using that just-restored control file, and used the more complete set of RMAN records within it to restore as many datafiles and archived log files as possible. Then I restored the oldest control file available from there (i.e. 12-13 days old), and repeated the same process, over and over, going back 30+ days beyond the point-in-time of the messed-up datafile, restoring archived logs and the oldest available control file each time.

Eventually, we had restored enough and were able to start a recovery, and we brought the database back online. An added complication was limited disk space for the restored files, so it was a tight fit. Moreover, the entire process was slowed because I was paranoid about over-writing *anything*, so I made sure I had copies of everything before I took any action whatsoever. Further, the screaming CIO had to be mollified repeatedly with frequent status reports. It was one of those days much longer than 24 hours, but it got done.
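One iteration of that walk-back looks roughly like this in RMAN syntax (a sketch only; the piece name and the date windows are hypothetical, and in the real event the piece names came from the tape admins):

```sql
-- One step of the control-file walk-back:
RMAN> STARTUP NOMOUNT;
RMAN> RESTORE CONTROLFILE FROM '/restore_area/ctl_backup_piece';
RMAN> ALTER DATABASE MOUNT;   -- now using the older control file's records

-- Restore whatever this control file knows about that we don't yet have:
RMAN> RESTORE ARCHIVELOG FROM TIME 'SYSDATE-13' UNTIL TIME 'SYSDATE-6';

-- Then locate the oldest control file backup recorded in *this* control
-- file, and repeat, stepping another ~6-7 days further back each time.
```

Each restored control file only "sees" its own CONTROL_FILE_RECORD_KEEP_TIME window, which is why the process has to be repeated in hops.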

Now, two things are proven in this story. The first is that you don't need to set CONTROL_FILE_RECORD_KEEP_TIME to a higher value, and that you don't need a recovery catalog database, even if you're trying to effect a recovery using stuff from over a month ago.

But it also proves that it would have been easier and faster if we had had recovery catalog information going back 30+ days *somewhere*.
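For the record, getting that "somewhere" is not much work; a minimal setup sketch, with all names and passwords hypothetical:

```sql
-- In the catalog database, as a DBA, create the catalog owner:
CREATE USER rcat_owner IDENTIFIED BY some_password
  DEFAULT TABLESPACE rcat_ts QUOTA UNLIMITED ON rcat_ts;
GRANT recovery_catalog_owner TO rcat_owner;

-- Then, from RMAN:
--   $ rman CATALOG rcat_owner/some_password@rcatdb TARGET /
--   RMAN> CREATE CATALOG;
--   RMAN> REGISTER DATABASE;
-- From then on, backup metadata survives in the catalog regardless of
-- CONTROL_FILE_RECORD_KEEP_TIME in the target's control file.
```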

So, as I said in an earlier response, the issue under discussion really is not technical at all. Rather, it is a matter of personal and corporate choice how one uses (or doesn't use) those technical options.

Augment the belt with suspenders.

Or don't. It's a choice.

On 12/11/2013 2:20 AM, Nuno Souto wrote:
> On 11/12/2013 7:25 AM, Rich Jesse wrote:
>
>
> > I was just thinking the same thing. My RMAN backups are to the FRA
> > on disk, which are then picked up by "tape" (Tivoli) nightly. For
> > the new production box, Tivoli will be picking up the FRA additions
> > (arch logs) a few times during the day as well.
>
>
> Same here. Arch logs into FRA every 2nd hour for large change-volume
> dbs, otherwise twice a day.
> And no catalog anywhere.
>
> >
> > RMAN is configured with a recovery window of 1 day. Our Tivoli tapes
> > get shipped offsite daily, leaving the disk cache for short-term
> > recovery.
>
> Again, same. Except for a db where I use the SAN-based asynch
> replication of the FRA to get it quickly into our DR site, where I can
> mount a snapshot of it in development and do refreshes of a 4TB DW
> acceptance environment in less than 4 hours.
>
>
> > "The Plan" has been to manually restore from Tivoli back to the FRA
> > on disk, catalog the restored files, then restore/recover from RMAN.
> > Perhaps a little more manual work, but it works well for us.
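[For reference, the re-cataloging step described here is typically a single RMAN command; the path below is hypothetical:]

```sql
-- Make RMAN aware of files restored from tape into the FRA:
RMAN> CATALOG START WITH '/fra/ORCL/backupset/' NOPROMPT;
```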
>
> For us restoring the backups of controlfiles is all that's needed in
> most cases.
> Only every once in a while do I need additional re-cataloging.
>
> --
> Cheers
> Nuno Souto
> dbvision_at_iinet.net.au
>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Dec 11 2013 - 19:06:39 CET
