Re: Recommendation for "cheap" HA solution

From: Tanel Poder <tanel.poder.003_at_mail.ee>
Date: Thu, 24 Jul 2003 17:56:02 +0300
Message-Id: <26007.339444@fatcity.com>

Hi!

When you don't have a cluster manager, there are three main issues you have to deal with:

1) heartbeat - verifying whether the primary node (or other nodes) is alive
2) storage - making the failed node's storage accessible to the backup node
3) connectivity - allowing clients to transparently connect to another node (usually by IP address transfer)

1) If you have at most 15 minutes for switchover, the heartbeat can be done with a script on the secondary or a monitoring node that actually verifies whether your Oracle database is running (the easiest way is to connect and run select * from dual from a script every minute). If either the connect or the select fails, you take the appropriate steps described below.

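A minimal sketch of such a heartbeat script, in Python. The SQL*Plus command line, the monitor/secret login and the PROD alias are assumptions for illustration only; the probe counts as healthy only if the connect works and the select returns DUAL's single 'X' row. The max_failures threshold is also my own addition, with 1 matching the "act on the first failure" behaviour described above.

```python
import subprocess

# Assumptions (hypothetical, for illustration): SQL*Plus is installed on the
# monitoring node, and "monitor/secret@PROD" is a valid login and net alias.
SQLPLUS_CMD = ["sqlplus", "-S", "-L", "monitor/secret@PROD"]
PROBE_SQL = "select * from dual;\nexit\n"


def database_alive(cmd=SQLPLUS_CMD, sql=PROBE_SQL, timeout=30):
    """Return True only if both the connect and the select succeed."""
    try:
        result = subprocess.run(cmd, input=sql, capture_output=True,
                                text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False  # client not found, or the probe hung
    if result.returncode != 0:
        return False
    # DUAL holds exactly one row whose only column contains 'X'.
    return any(line.strip() == "X" for line in result.stdout.splitlines())


def watch(check, on_failure, sleep, interval=60, max_failures=1):
    """Call check() every `interval` seconds; after `max_failures`
    consecutive failed probes, call on_failure() once and stop."""
    failures = 0
    while True:
        failures = 0 if check() else failures + 1
        if failures >= max_failures:
            on_failure()
            return
        sleep(interval)
```

It could be started as watch(database_alive, start_failover, time.sleep), where start_failover stands for whatever takeover script you end up writing.
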
2) Storage - both of your servers have to see the disks, of course. If you have a SAN over fibre, it's no problem to share disks between several servers. With some external SCSI arrays it shouldn't be a problem to share them between two servers either.

When your heartbeat mechanism detects that your primary Oracle service isn't running, it first tries to kill Oracle and unmount the file systems on the primary server, because two nodes writing to the same data without coordination will cause a mess. The kill and unmount can be done by the secondary or monitoring server using rsh, ssh or whatever remote exec mechanism you have. If remote exec doesn't work, we rely on ping to see whether the primary host is alive. If it isn't, we are free to mount the file systems on the secondary node and start the instance there (of course the primary node should not have automatic instance startup scripts in its rc.d). The problematic case is when ping shows the primary host as alive but the remote exec of the shutdown and unmount fails. This is where cluster managers should be better than home-made high-availability solutions.

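The decision sequence above can be written down as a small function. This is only a sketch: the names are hypothetical, and escalating to a human in the problematic case is my assumption, since that case has no safe automatic answer in a home-made setup.

```python
from enum import Enum


class Action(Enum):
    NONE = "primary database is healthy, do nothing"
    TAKE_OVER = "mount the file systems and start the instance on the secondary"
    ESCALATE = "primary answers ping but could not be stopped, call a human"


def decide(db_alive, stop_primary, ping_primary):
    """Choose the next step for the secondary or monitoring node.

    db_alive     -- result of the heartbeat probe (bool)
    stop_primary -- callable: kill Oracle and unmount on the primary via
                    remote exec, returning True on success
    ping_primary -- callable: True if the primary host answers ping
    """
    if db_alive:
        return Action.NONE
    if stop_primary():
        # Primary is cleanly stopped, so only one node will touch the disks.
        return Action.TAKE_OVER
    if not ping_primary():
        # Remote exec failed and the host is silent: treat it as dead.
        return Action.TAKE_OVER
    # Host is up but cannot be stopped: mounting now risks two writers.
    return Action.ESCALATE
```

The two callables are only invoked when needed, so the ping is tried only after remote exec has failed, in the same order as described above.
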
3) To direct connections to the backup node, you can either transfer the IP address in a script (I don't know the Solaris commands by heart; you could just keep two sets of network config files as well), or play around with tnsnames.ora entries that always have the primary host IP first in the address list and the secondary host second. You also have to set fail_over or some parameter like that in tnsnames.ora.
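
For the tnsnames.ora variant, here is a small sketch that renders such an entry, assuming the parameter meant above is Oracle Net's FAILOVER setting in the address list. The alias, host addresses and service name in the usage line are made up; LOAD_BALANCE = OFF is added so that the addresses are tried in the order listed, primary first.

```python
def tns_entry(alias, hosts, service_name, port=1521):
    """Render a tnsnames.ora entry that tries `hosts` in the given order.

    All argument values are supplied by the caller; nothing here is taken
    from a real configuration.
    """
    addresses = "\n".join(
        f"      (ADDRESS = (PROTOCOL = TCP)(HOST = {host})(PORT = {port}))"
        for host in hosts
    )
    return (
        f"{alias} =\n"
        "  (DESCRIPTION =\n"
        "    (ADDRESS_LIST =\n"
        "      (LOAD_BALANCE = OFF)\n"   # keep the listed order: primary first
        "      (FAILOVER = ON)\n"        # try the next address if one fails
        f"{addresses}\n"
        "    )\n"
        f"    (CONNECT_DATA = (SERVICE_NAME = {service_name}))\n"
        "  )\n"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(tns_entry("PROD", ["10.0.0.1", "10.0.0.2"], "prod"))
```

Note that this only affects new connections; sessions that were connected to the failed node still have to reconnect.
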

If you have lots of storage space or do not have sharable storage, then go with standby databases (this can be done with Standard Edition too) and forget about the issues in point 2.

So you have to do some planning and scripting for this, but cheap HA for simple systems is very possible. I personally prefer these simple solutions to expensive software packages, guards and agents, which often add another layer of complexity to sysadmins' jobs and don't always work as expected themselves. But of course, these home-made one-weekend solutions aren't appropriate everywhere...

Tanel.

----- Original Message -----
From: Daiminger, Helmut
To: Multiple recipients of list ORACLE-L
Sent: Thursday, July 24, 2003 5:49 PM
Subject: Recommendation for "cheap" HA solution

Hi!

We are looking into establishing some sort of high availability solution here. We are running 9.2.0 on Sun Fire 280 (2 processors).

Since we are on a tight budget, we are looking into various solutions for HA.

One option would be to use Sun Cluster Server or Veritas Cluster Server. If one box fails, the db just fails over to the other node. The problem is that we don't have a cluster guy here...

The other option would be to use RAC, but this is the most expensive solution, I guess...

Does anybody use any other HA solution that is affordable? Failover time should be less than 15 minutes, although "frequent" outages (i.e. once a month or so) are tolerable.

Don't blame me for these requirements; it was not my idea...

This is 9.2.0 on Sun Solaris.

Thanks,
Helmut

Received on Thu Jul 24 2003 - 09:56:02 CDT