Re: RAC unexpected reboot of nodes

From: DA Morgan <damorgan_at_psoug.org>
Date: Sun, 12 Mar 2006 13:26:08 -0800
Message-ID: <1142198750.781361@yasure.drizzle.com>

alek wrote:
> Hi,
>
> I'm quite new to the RAC field and I want to know if the following
> behavior is normal for such a configuration:
>
> A few weeks ago we succeeded in configuring an Oracle 10.2.0.1 cluster.
> The configuration comprised 2 nodes and the underlying OS was
> Redhat AS4. The installation went well, following all the installation
> steps given in the official Oracle documentation. The OCR and the
> voting disks were configured using NFS. At that time we noticed that
> from time to time one of the nodes (not always the same) was
> unexpectedly rebooted. Neither the system logs nor the Oracle logs
> offered any clues, so we concluded that NFS might be causing the problem.
> To check this, we decided to configure RAC on a single node
> just for testing purposes. The OCR, voting disks and the Oracle
> software were installed on OCFS2 partitions, so no NFS was
> involved. On this node we configured 2 Oracle instances, which worked
> fine for a while, but from time to time, or when the server is stressed
> with intensive SQL, the entire server reboots. After some searching
> on Metalink we found Bug 4741921/4556989 (36) INSTANCE
> RESTARTED AFTER SHUTDOWN ABORT IN RAC ENVIRONMENT which is fixed in
> the 10.2.0.2 patch. We downloaded and installed the patch but it seems that
> the strange behavior is still there. The server does reboot less
> often now, but we have no explanation
> for what actually causes the reboots.
> Has anyone noticed the same behavior on a 10.2.0.x RAC configuration?
> Are there any workarounds for this?
>
> Many thanks.

I have never seen the node ejection and rebooting behaviour reported by hpuxrac and others, but then I do all of my work on NetApps with the connection string supplied by NetApp.

I would suggest you monitor the network for outages, as the behaviour you describe is expected if, for some reason, the Oracle clusterware believes it can no longer see a resource. That resource might be the public network, the memory interconnect, or the storage device.
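
As a rough illustration of what monitoring for outages could look like, here is a minimal Python sketch. It is not an Oracle tool. The host name node2-priv, the port 22, and the 5-second threshold are all hypothetical placeholders, and a TCP probe is only a stand-in for whatever check suits your interconnect or storage path. It probes a peer once per second and collapses runs of failed probes into outage windows whose timestamps you can compare with the reboot times.

```python
import socket
import time
from typing import Iterable, List, Tuple


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_outages(samples: Iterable[Tuple[float, bool]],
                 min_gap: float) -> List[Tuple[float, float]]:
    """Collapse runs of failed probes into (start, end) outage windows.

    An outage starts at the first failed probe and ends at the next
    successful one (or at the last failed probe if the run never
    recovers). Windows shorter than min_gap seconds are dropped.
    """
    outages = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            if ts - start >= min_gap:
                outages.append((start, ts))
            start = None
    # Handle a run of failures that is still open at the end of the data.
    if start is not None and last_fail - start >= min_gap:
        outages.append((start, last_fail))
    return outages


def monitor(host: str, port: int, interval: float = 1.0,
            count: int = 60) -> List[Tuple[float, bool]]:
    """Probe host:port `count` times, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Hypothetical peer on the private interconnect; adjust for your cluster.
    data = monitor("node2-priv", 22, interval=1.0, count=60)
    for begin, end in find_outages(data, min_gap=5.0):
        print("outage from", time.ctime(begin), "to", time.ctime(end))
```

Run one copy against each path the clusterware depends on (public network, interconnect, storage host) and see whether any reported window lines up with a reboot.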

-- 
Daniel A. Morgan
http://www.psoug.org
damorgan_at_x.washington.edu
(replace x with u to respond)

Received on Sun Mar 12 2006 - 15:26:08 CST