Re: A few questions regarding Dataguard Faststart Failover

From: Craig Hagan <hagan_at_cih.com>
Date: Thu, 30 Sep 2010 10:39:04 -0400
Message-ID: <AANLkTi=RxWPRG1bp2aGJ0yZCC3BMKidD__eu+SFd4z9M_at_mail.gmail.com>



2010/9/30 Zhu,Chao <zhuchao_at_gmail.com>

>
> So we have a few questions regarding this:
> 1. We already have Data Guard configured for most of our databases
> (10.2.0.3/4). Now we want to use Data Guard FSFO. Is this part of the
> Data Guard license, and do we need to pay extra for that?
>
>

I'm not sure how the licensing works; that would be a question for your Oracle sales rep.

> 2. Is the product mature already (it came out in 10.2, I believe)? We plan
> to use it on 11g databases only (11.2 and 11.1.0.7). Clustering is something
> the typical DBA is not familiar with (compared with VCS-type HA for Unix folks).
>
>

I've been using fast-start failover in production at a big-name site with large volumes of traffic since 10.2.0.2. As long as you configure it correctly and have the latest DG megapatch, you should be fine.

> 3. How does it work in real-life production? Any company widely using it?
> I saw notes from an Amazon DBA at
> http://www.nocoug.org/download/2009-05/DBA%27s_Guide_to_Physical_Dataguard_II.pptx
> talking about FSFO; not sure about their real-life experience running that
> kind of solution.
>
>

I know Ahbid, and run systems similar to his.

First off, some background on how I've seen it run:

  1. The primary and standby are physically distant (different datacenters, but close enough geographically that speed of light/network latency/bandwidth isn't a concern).
  2. The primary and standby do not share storage with each other.
  3. The observer is deliberately run in a third site/datacenter, explicitly not the same datacenter as either the primary or the standby.

Given that, the single largest issue I've seen with fast-start failover (10.2, 11.1) is misconfiguration. Even subtle errors that still let the primary/standby be configured and FSF enabled can cause reinstatement to fail after an event. I ended up building a tool to emit configurations we were happy with in production, to eliminate this class of error; a minimal sketch of the shape of such a configuration follows.
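
For what it's worth, the broker setup the tool emits is roughly along these lines (a minimal sketch only; the database names, connect identifiers, and protection mode here are hypothetical examples, and flashback database has to be enabled on both sides before FSF will turn on):

  DGMGRL> CREATE CONFIGURATION 'prod_dg' AS
            PRIMARY DATABASE IS 'prod' CONNECT IDENTIFIER IS prod_dgmgrl;
  DGMGRL> ADD DATABASE 'prodsb' AS
            CONNECT IDENTIFIER IS prodsb_dgmgrl MAINTAINED AS PHYSICAL;
  DGMGRL> EDIT DATABASE 'prod'   SET PROPERTY LogXptMode = 'SYNC';
  DGMGRL> EDIT DATABASE 'prodsb' SET PROPERTY LogXptMode = 'SYNC';
  DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MAXAVAILABILITY;
  DGMGRL> ENABLE CONFIGURATION;
  DGMGRL> ENABLE FAST_START FAILOVER;

and then, from the third-site observer host:

  DGMGRL> START OBSERVER;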

A few odds and ends from several years of use. NB: don't be scared off by some of these, as a lot of them have been patched/fixed by Oracle.

  • If your system generates a lot of redo, you're going to want to pay attention to things like the number of log archive processes (log_archive_max_processes) and the MAX_CONNECTIONS attribute on the standby destination (the default of 1 is a bit low); a sketch of these knobs follows the list.
  • After a failover/reinstatement I've occasionally had to re-register log sequence 1 of the new thread on the "new" standby and/or bystanders. Make sure you do this at the right time, i.e. when the standby is asking for the nonexistent/next sequence from the old resetlogs id (a sketch follows the list).
  • In 10.2.0.2 (there is a patch; I believe it is also in the DG megapatch), I've seen quirks with flashback where it would claim to be on but not actually be generating many (or any) flashback logs. It's pretty obvious if you run into this: if your flashback logs should be taking roughly 10 GB and you see two files of a few kilobytes after the DB has been up for a few months, it's probably a concern (a quick sanity check follows the list).
  • For an unplanned flip, FSF will only fail over if the standby and the observer both lose contact with the primary, the standby is synchronized, and the standby can still talk to the observer. This means that if your primary hits an event (memory pressure, certain types of hardware/OS faults) that freezes/messes up the DB but leaves it just alive enough that the standby thinks it is up, it won't fail over. The same can also result in desynchronization.
  • I've seen issues where very odd/freak network events or hardware faults on the standby resulted in LGWR terminating the primary. This was mostly in 10.2.0.2.
  • For 11.x, be careful of user sessions on the standby if you're also running Active Data Guard, as they can delay the transition from standby to primary while Oracle terminates those sessions.
  • DO NOT use MTS (shared server) sessions for Data Guard, and be careful with live MTS implementations on a system using DG; you can really piss off the broker, fast-start failover, and DG. OTOH, it is pretty easy to fix this on the fly, too. It's much easier to explicitly specify dedicated sessions in the tnsnames entries used for your broker sessions to prevent this sort of silliness (a sample entry follows the list).
  • If you run into odd things, you may want to seriously consider rebuilding your broker configuration; do make sure that all standby systems have been reinstated before doing this.
  • Don't play games with standby DBs -- by that, I mean rebuilding a broker config and tossing in a new controlfile to work around a failed reinstatement. Either rebuild the standby from backup, or work with support to make sure that your actions truly are safe and won't result in an ORA-03020 or worse later on.
  • If you have a complicated network, make sure that FastStartFailoverThreshold is a bit longer than the time it takes spanning tree to recompute (work with your network engineers on this). You probably don't want a switch reconfiguration that will resolve itself in 5-45 seconds to trip a failover that takes that long plus the additional time for the other side to finish the failover (adjusting the threshold is a one-liner; see after the list).
  • failed/aborted failovers can be annoying to clean up :)
  • User-initiated failovers in 11.x are cool; just remember to restart and reinstate the old primary (the command sequence follows the list).
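
On the redo-volume bullet above, the knobs I mean look something like this (hypothetical values and a hypothetical standby service/db_unique_name; on a broker-managed configuration you would normally adjust transport settings through broker properties, so treat the ALTER SYSTEM form as illustrative):

  SQL> ALTER SYSTEM SET log_archive_max_processes = 8 SCOPE=BOTH;
  SQL> ALTER SYSTEM SET log_archive_dest_2 =
         'SERVICE=prodsb_tns LGWR SYNC AFFIRM MAX_CONNECTIONS=4
          VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=prodsb'
         SCOPE=BOTH;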
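
On the re-registration bullet, the statement I have in mind is the standard register command; the file name below is made up, and you'd take the real one from whatever sequence the standby's alert log says it is waiting for. Run it on the new standby (and any bystanders) once it starts asking for the first sequence of the new incarnation:

  SQL> ALTER DATABASE REGISTER LOGFILE '/arch/prod/1_1_723456789.arc';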
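
On the flashback bullet, the sanity check is just the standard v$ views; if flashback_on says YES but the flashback log footprint stays at a couple of files of a few kilobytes after months of uptime, you've probably hit it:

  SQL> SELECT flashback_on FROM v$database;
  SQL> SELECT retention_target, flashback_size, oldest_flashback_time
         FROM v$flashback_database_log;
  SQL> SELECT file_type, percent_space_used, number_of_files
         FROM v$flash_recovery_area_usage;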
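
On the MTS bullet, the tnsnames entries I use for broker connections look roughly like this (host, port, and service name are hypothetical); the SERVER = DEDICATED clause is the part that matters:

  PROD_DGMGRL =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = prod-host1)(PORT = 1521))
      (CONNECT_DATA =
        (SERVER = DEDICATED)
        (SERVICE_NAME = prod_DGMGRL)
      )
    )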
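
On the threshold bullet, changing it really is a one-liner; the 60 seconds here is only an example, so pick something comfortably above your worst-case spanning tree reconvergence:

  DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 60;
  DGMGRL> SHOW CONFIGURATION VERBOSE;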
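
And on the last bullet, the 11.x sequence is roughly the following (database names hypothetical, matching the earlier sketch):

  DGMGRL> FAILOVER TO 'prodsb';

then, once the old primary has been restarted to the mount state:

  DGMGRL> REINSTATE DATABASE 'prod';
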
    craig .- ... . -.-. .-. . - -- . ... ... .- --. .
                            Craig I. Hagan
                           hagan(at)cih.com

    "Tout ce qui est exagéré est insignifiant.": ("All that is exaggerated is insignificant.")

                            Talleyrand

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Sep 30 2010 - 09:39:04 CDT
