Re: Split-brain among HACMP cluster and Oracle9RAC
Arne S wrote:
> Background:
> Part of our production environment is based on RS/6000 technology, with
> HACMP and Oracle9RAC as products on top. We have 4 p570's (4-ways),
> running AIX 5.3ML03, HACMP version 5.2 and OracleRAC version 9.2.0.7.
> These machines are spread across 2 server rooms (about 300 meters
> apart). HACMP is configured with concurrent disk access for Oracle
> db-files on raw devices. We have also configured HACMP with both IP and
> NON-IP heartbeat (NON-IP heartbeat over SAN disks). Oracle's
> interconnect is configured as part of the HACMP configuration. The total
> number of databases/instances is about 20/80.
>
> My problem:
> During a test failover (the network in one server room goes down) I
> observed that all Oracle databases went into a "frozen" condition. As far
> as I know, this is not correct. I am having trouble finding out why, but my
> guess is that Oracle is waiting for some "network down" or "node down"
> event from HACMP before Oracle takes any action. This will not happen,
> because HACMP is still talking to all 4 nodes over the NON-IP network over
> the SAN disks in such a situation. When I shut down the 2 "isolated"
> machines, all Oracle databases went down (lmon died). I had to start all
> databases manually on the 2 "surviving" nodes. After startup I could
> access the databases as normal.
From the HACMP redbook:
...
The non-IP networks are direct connections (point-to-point) between nodes, and
do not use IP for heartbeat messages exchange, and are therefore less prone to
IP network elements failures. If these network types are used, in case of IP
network failure, nodes will still be able to exchange messages, so the decision is
to consider the network down and no resource group activity will take place.
...
So the non-IP network is designed to prevent a split-brain situation.
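
To make the redbook paragraph concrete, here is a rough sketch of the decision it describes. This is only my illustration, not HACMP's actual implementation: the names `Verdict`, `classify_peer` and `takeover_allowed` are made up, and the real cluster manager looks at far more state than two booleans.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of a heartbeat check against one peer node."""
    PEER_ALIVE = "peer reachable over IP, nothing to do"
    NETWORK_DOWN = "IP path lost, peer still answers over non-IP heartbeat"
    NODE_DOWN = "no heartbeat on any path, peer presumed dead"


def classify_peer(ip_heartbeat_ok: bool, non_ip_heartbeat_ok: bool) -> Verdict:
    """Classify a peer from the state of the IP and non-IP heartbeat paths."""
    if ip_heartbeat_ok:
        return Verdict.PEER_ALIVE
    if non_ip_heartbeat_ok:
        # The peer is alive but the IP network is broken: treat this as a
        # network failure, not a node failure.
        return Verdict.NETWORK_DOWN
    return Verdict.NODE_DOWN


def takeover_allowed(verdict: Verdict) -> bool:
    """Resource groups may only move when the peer is presumed dead."""
    return verdict is Verdict.NODE_DOWN


if __name__ == "__main__":
    # The situation from the test: IP is gone, SAN disk heartbeat still works.
    verdict = classify_peer(ip_heartbeat_ok=False, non_ip_heartbeat_ok=True)
    print(verdict.name, "takeover allowed:", takeover_allowed(verdict))
```

In the failover test described above, the middle branch is the one that applies: the disk heartbeat keeps reporting the peers as alive, so no "node down" is ever raised.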
You say:
> the network in one server room goes down
The question is: what do you mean by that sentence?
Did you take all network devices connected to the HACMP cluster
offline? In that case you would have a network down event and the
cluster should go down. Or did you interrupt the connection between the
two sites?
In the latter case you have a site failure from each cluster's point of
view.
That means HACMP sees that it still has a connection to its
switches (so the physical layer is okay), but every IP communication path
to the other site is lost.
So the question arises: how shall HACMP behave? It does not know whether the other site still has a connection to the (user) network or not. So it's up to you to determine which site shall stay up.
Just from my very rusty HACMP knowledge
Hajo
Received on Thu Sep 21 2006 - 15:12:26 CDT