RE: SAC NORAD .... how to break it?

From: Peter Barnett <pnbarne_at_shep.bcbso.com>
Date: Tue, 17 Oct 2000 09:26:37 -0700 (PDT)
Message-Id: <10652.119495@fatcity.com>

We had a computer crash last week because of excessive paging. When the sysadmins brought it back up, both disks in a mirror were corrupt. They thought that they had successfully recovered them on Friday. On Sunday, the on-call DBA spent most of the day putting one of the databases on this computer back together again.

How or why excessive paging can cause a mirror to fail is beyond my understanding. I also do not know how to test for it in advance.

Pete Barnett
Oracle Database Administrator
Regence BlueCross BlueShield
pnbarne_at_regence.com

On Mon, 16 Oct 2000, Steve Orr wrote:

> Apart from the obvious tests, it's going to vary, so you'll need to create
> your own checklist of things to test for your particular environment.
> You'll also want to test that you have good monitoring in place. I once
> thought I was protected with mirrored redo logs, only to find out that one
> drive had failed a month before and the sysadmin wasn't monitoring the
> mirror. You should probably start by taking your sysadmin out to lunch.
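One way to act on the monitoring point: if the mirror happens to be Linux software RAID (an assumption; the thread never says which volume manager is in use), the kernel reports each array's state in /proc/mdstat, where a healthy two-way mirror shows `[2/2] [UU]` and a degraded one shows something like `[2/1] [U_]`. The sketch below parses that text. The function name `degraded_arrays` is hypothetical, and other platforms (Veritas, Solstice DiskSuite, hardware RAID controllers) need their own status command instead.

```python
import re

# The "[configured/active] [U_]" status field that Linux md prints for each
# array in /proc/mdstat; "_" marks a missing or failed member.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# An array header line such as "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active members than configured."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            configured, active = int(status.group(1)), int(status.group(2))
            if active < configured or "_" in status.group(3):
                degraded.append(current)
            current = None  # only the first status line belongs to this array
    return degraded
```

Run it from cron, e.g. `degraded_arrays(open("/proc/mdstat").read())`, and page someone whenever the list is non-empty. On Linux, `mdadm --monitor` does a similar job and is worth checking before rolling your own.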
>
> Steve
>
>
> -----Original Message-----
> From: Seley, Linda
> Sent: Monday, October 16, 2000 1:50 PM
> To: Multiple recipients of list ORACLE-L
>
>
> Steve/Anyone -
>
> We're going to be doing HA in the near future. I'm sure I can come up with
> obvious tests, like pulling a disk or turning off a machine... But do you
> have a 'break it' checklist? (Sorry if this has been asked before.) I'm
> especially interested in how failures affect Oracle performance. I want
> this thing to run smoothly.
>
> Unless of course we have a 'real' test of NORAD, in which case I'm close
> enough that I don't care!
>
> Linda
>
> -----Original Message-----
> Sent: Monday, October 16, 2000 11:36 AM
> To: Multiple recipients of list ORACLE-L
>
>
> Actually, NORAD was designed to survive a direct hit from the weapons
> available at the time it was built. However, with today's more accurate
> delivery systems, it is conceivable that a missile could travel partway
> into the entrance tunnel and render the facility inoperable. Then there
> are multiple direct hits...
>
> But of course, none of this has been tested, and sadly the same is often
> true of HA 24x7 systems. You need sufficient pre-production quiet time to
> test your HA solution. I call it the "pseudo sledge hammer" testing period.
> Have you ever taken a drive out of your RAID and replaced it to see how long
> resilvering takes and what happens to I/O performance? How much time does
> it take to test the entire HA implementation, and how much time will you be
> given? The trouble is that once you get all this expensive equipment into
> the data center and install Oracle, damagement is anxious to get the entire
> application up and running ASAP and asks you to take shortcuts or just
> trust that everything will work. But really, you haven't finished the job
> until you've reasonably tested everything end to end.
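For the "how long does resilvering take and what happens to I/O performance" question, one rough approach is to time small synchronous writes on the mirrored volume once at baseline and again while a pulled drive is resilvering. The sketch below assumes a POSIX host with Python 3 (it uses `os.pwrite` and `os.fsync`). The function name, probe path, block size, and sample count are all hypothetical, and it measures only write-plus-fsync latency, not a full Oracle workload.

```python
import os
import statistics
import time


def fsync_latency_ms(path, samples=200, block_size=8192):
    """Overwrite one block and fsync it `samples` times; return latencies in ms."""
    block = os.urandom(block_size)
    timings = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.pwrite(fd, block, 0)  # rewrite the same block at offset 0
            os.fsync(fd)             # wait until it reaches stable storage
            timings.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    timings.sort()
    return {
        "median": statistics.median(timings),
        # nearest-rank approximation of the 95th percentile
        "p95": timings[int(0.95 * (len(timings) - 1))],
        "max": timings[-1],
    }
```

Run something like `fsync_latency_ms("/u01/oradata/probe.dat")` (a hypothetical path on the mirrored volume) before pulling the drive and again during the resilver, compare the medians and p95s, and delete the probe file afterwards. Timing the resilver itself from start to finish is a job for the volume manager's own status output.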
>
> IMHO,
> Steve Orr
>
>
> -----Original Message-----
> Sent: Monday, October 16, 2000 7:06 AM
> To: Multiple recipients of list ORACLE-L
>
>
> That's why they say that SAC/NORAD (Strategic
> Air Command HQ, North American Aerospace Defense
> Command), buried deep inside a mountain in Colorado,
> is a "single point of failure" for US national
> defense:
>
> All it takes is a direct hit by one nuclear
> bomb to bring down the whole facility! :-)
>
> In the words of the Marathon Man's tormentor:
>
> "Is it safe?"
>
> <evil laughter>
>
>
> -----Original Message-----
> Sent: Friday, October 13, 2000 7:45 PM
> To: Multiple recipients of list ORACLE-L
>
>
> Sorry, Ross. Yes, I am familiar with
> enterprise-class storage systems.
>
> It still isn't HA.
>
> It only takes one bumbling SA (or DBA) to bring
> the system down, or one neanderthal techie in the
> computer room to push the 'OFF' switch.
>
> Simultaneous failure of both controllers for an
> array, or of enough disks to bring the array down,
> is not unheard of.
>
> Jared
>
> On Fri, 13 Oct 2000, Mohan, Ross wrote:
>
> > I have to say this "disk is a single point of failure"
> > is jangling to the cognitive logic subsystem.
> >
> > Why?
> >
> > Well, the disk farms I have seen have redundant controllers,
> > with redundant channels, TRIPLE power supplies, and at least a
> > single mirror with dual porting. There's your "single" disk
> > point of failure for you.
> >
> > Now, try this: Take your two "redundant" nodes....put them
> > in a really really big rack and then inside ONE big box. <G>
> >
> > Are the two nodes ( which now have at least redundant CPUs,
> > power supplies, etc. ) a "single point of failure"?
> >
> > Come on, guys, if you've worked with this stuff a bunch you know:
> >
> > (a) properly configured disk farms have a great MTBF, better
> > than that of the other hardware, and
> > (b) to REALLY answer Mary's class of questions, you need to
> > calculate MTBFs and MTTRs.
> >
> > The rest is armchair clustering!
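Ross's point (b) can be made concrete. A common steady-state approximation is availability = MTBF / (MTBF + MTTR); components that must all be up multiply their availabilities, while a redundant pair is down only when both halves are down. The sketch below encodes those three rules. It assumes failures are independent, which is exactly the assumption that Jared's bumbling SA and simultaneous controller failures break. The function names are hypothetical, and any MTBF or MTTR figures you feed it must be your own vendor or field numbers, not values from this thread.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*avail):
    """Every component must be up (e.g. node AND controller AND disk)."""
    return prod(avail)


def parallel(*avail):
    """At least one redundant copy must be up; assumes independent failures."""
    return 1 - prod(1 - a for a in avail)


def downtime_hours_per_year(a):
    """Expected hours per year spent unavailable at availability a."""
    return (1 - a) * HOURS_PER_YEAR
```

For a two-way mirror, unavailability grows roughly with the square of MTTR, which is why Steve's unnoticed month-old drive failure matters so much: the repair time includes the time during which nobody noticed.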
> >
> > hope this pertains,
> >
> > Ross Mohan
> >
> > P.S. HA is the latest marketspeak for "failover" or "redundant" or
> > whatever... Please try to browse a copy of "In Search of Clusters" by
> > Gregory Pfister from IBM. It's a cult classic, a helluva fun read, and
> > one of the best thought-out technical books I have ever seen, period.
> >
> >
> > -----Original Message-----
> > Sent: Thursday, October 12, 2000 2:00 PM
> > To: Multiple recipients of list ORACLE-L
> >
> >
> >
> > Mary,
> >
> > OPS is not an HA solution. While you may still have
> > an instance running if a node goes down, the storage
> > medium is still a single point of failure.
> >
> > Jared
> >
> > On Thu, 12 Oct 2000, Ruiz, Mary A (CAP, CDI) wrote:
> >
> > > I need a little advice. We have a fairly new (< 1 year) 8.1.5 instance
> > > to support my company's internet business. We recently changed our
> > > network solutions provider, and now my management wants a higher level
> > > of redundancy than we currently have with mirrored disks. My sysadmin
> > > is proposing Oracle Parallel Server. Some background is in order here:
> > > we have always shut our databases down at night for backups. I am not
> > > highly skilled in backup and recovery, although I tried some of the hot
> > > backup techniques from this list and was able to recover successfully
> > > to another server. I noticed that the course Oracle offers in OPS has
> > > backup and recovery as well as performance tuning as prerequisites,
> > > which indicates to me that OPS could be extremely challenging. Also, I
> > > have read mainly unfavorable comments about OPS on this list, but most
> > > of those comments were based on the Oracle 7 implementations (high
> > > administrative costs, difficult to implement, etc.).
> > >
> > > Have things improved with Oracle 8i? Is OPS worth pursuing? Or should I
> > > convince my management that the extra $$ spent on, say, a hot standby
> > > database is well worth it? Is there any other solution that would not
> > > involve a second set of disks, but rather a second database on the same
> > > set of disks?
> > >
> > > Thanks in advance,
> > > Mary Ruiz / Atlanta
> > >
> > > --
> > > Please see the official ORACLE-L FAQ: http://www.orafaq.com
>
> --
> Please see the official ORACLE-L FAQ: http://www.orafaq.com
> --
> Author: Steve Orr
> INET: sorr_at_arzoo.com
>
> Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
> San Diego, California -- Public Internet access / Mailing Lists
> --------------------------------------------------------------------
> To REMOVE yourself from this mailing list, send an E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
> the message BODY, include a line containing: UNSUB ORACLE-L
> (or the name of mailing list you want to be removed from). You may
> also send the HELP command for other information (like subscribing).
> --
> Please see the official ORACLE-L FAQ: http://www.orafaq.com
> --
> Author: Seley, Linda
> INET: LSeley_at_IQNavigator.com
>
> Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
> San Diego, California -- Public Internet access / Mailing Lists
> --------------------------------------------------------------------
> To REMOVE yourself from this mailing list, send an E-Mail message
> to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
> the message BODY, include a line containing: UNSUB ORACLE-L
> (or the name of mailing list you want to be removed from). You may
> also send the HELP command for other information (like subscribing).
Received on Tue Oct 17 2000 - 11:26:37 CDT