Re: Split-brain among HACMP cluster and Oracle9RAC

From: Hajo Ehlers <service_at_metamodul.com>
Date: 21 Sep 2006 13:12:26 -0700
Message-ID: <1158869546.004646.320670@k70g2000cwa.googlegroups.com>

Arne S wrote:
> Background:
> Part of our production environment is based on RS/6000 technology, with
> HACMP and Oracle9RAC as products on top. We have 4 p570's (4-ways),
> running AIX 5.3ML03, HACMP version 5.2 and OracleRAC version 9.2.0.7.
> These machines are spread across 2 server rooms (about 300 meters
> apart). HACMP is configured with concurrent disk access for Oracle
> db-files on raw devices. We have also configured HACMP with both IP and
> NON-IP heartbeat (NON-IP heartbeat over SAN disks). Oracle's
> interconnect is configured as part of the HACMP configuration. The total
> number of databases/instances is about 20/80.
>
> My problem:
> During a test failover (the network in one server room goes down) I
> observed that all Oracle databases went into a "frozen" condition. As far
> as I know, this is not correct. I am having trouble finding out why, but my
> guess is that Oracle is waiting for some "network down" or "node down"
> event from HACMP before Oracle takes any action. This will not happen, because
> HACMP is still talking to all 4 nodes over the NON-IP network over the SAN
> disks in such a situation. When I shut down these 2 "isolated" machines, all
> Oracle databases went down (lmon died). I had to start all databases
> manually on the 2 "surviving" nodes. After startup I could access the
> databases as normal.

From the HACMP Redbook:

...
The non-IP networks are direct connections (point-to-point) between nodes, and
do not use IP for heartbeat messages exchange, and are therefore less prone to
IP network elements failures. If these network types are used, in case of IP
network failure, nodes will still be able to exchange messages, so the decision is
to consider the network down and no resource group activity will take place.
...

So the non-IP network is designed to prevent a split-brain situation.
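
To make the Redbook paragraph concrete, here is a rough sketch in Python of the decision it describes. This is only a toy model of the reasoning, not how HACMP is actually implemented; the names Verdict, classify_peer and resource_groups_move are made up for illustration.

```python
from enum import Enum


class Verdict(Enum):
    PEER_ALIVE = "IP heartbeat is fine"
    NETWORK_DOWN = "IP network failed, peer still reachable over non-IP path"
    PEER_DOWN = "no heartbeat on any path, peer considered failed"


def classify_peer(ip_heartbeat_ok: bool, non_ip_heartbeat_ok: bool) -> Verdict:
    """Classify a peer node from the state of the two heartbeat paths.

    Hypothetical helper: it only mirrors the rule quoted from the Redbook.
    """
    if ip_heartbeat_ok:
        return Verdict.PEER_ALIVE
    if non_ip_heartbeat_ok:
        # IP is gone but the peer still answers over the disk heartbeat,
        # so the failure is attributed to the network, not to the node.
        return Verdict.NETWORK_DOWN
    return Verdict.PEER_DOWN


def resource_groups_move(verdict: Verdict) -> bool:
    """Only a failed peer triggers resource group activity in this model."""
    return verdict is Verdict.PEER_DOWN


if __name__ == "__main__":
    # The situation from the test: IP down, SAN disk heartbeat still up.
    verdict = classify_peer(ip_heartbeat_ok=False, non_ip_heartbeat_ok=True)
    print(verdict.name, resource_groups_move(verdict))
```

In this model the test scenario (IP lost, disk heartbeat alive) lands in the "network down" branch, so nothing is taken over, which fits the observation that no "node down" ever reaches Oracle.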

You say:
> the network in one server room goes down
The question: what do you mean by that sentence? Did you take all network devices connected to the HACMP cluster offline? In that case you would have a network down event and the cluster should go down. Or did you interrupt the connection between the two sites?

In the latter case you have a site failure from each site's point of view.
That is, HACMP sees that it still has a connection to its switches (so the physical layer is okay), but every IP communication path to the other site is lost.

So the question arises: how shall HACMP behave? It does not know whether the other site still has a connection to the (user) network or not. So it's up to you to determine which site shall stay up.

Just from my very rusty HACMP knowledge
Hajo

Received on Thu Sep 21 2006 - 15:12:26 CDT