RE: Oracle RAC nodes eviction question
Date: Wed, 13 Aug 2014 21:36:24 +0000
Message-ID: <AF02C941134B1A4AB5F61A726D08DCED0E09CDD4_at_USA7109MB012.na.xerox.net>
Thanks Riyaj and Martin.
So, based on your responses, it seems that if either the Grid binaries or the Grid log files become inaccessible, that node will be evicted. This does not match the testing I have done in the past, but it does match the recent event where, after the NAS head failed, all RAC nodes were rebooted. Here is how we tested it in the past and saw no impact:
Prior to going live last year, we conducted destructive tests on the same hardware, and with the same storage NAS head, that production was going to use. We took a copy of production, which was single-instance at the time, and RAC’d it across four nodes on the new hardware. At that point we had a copy of production, RAC’d across four nodes and looking exactly like what production was going to look like. We then conducted a number of destructive tests, including the following:
- To test the resilience of Oracle RAC against NAS head failure, we ran the following tests while the Grid and the database were up and running: (a) we failed over the NAS head to its standby counterpart via a controlled failover, and RAC stayed up; (b) we induced a panic to force the NAS head failover, and the environment stayed up.
- On one of the RAC nodes, we pulled both cables of the LACP/bonded NIC that was used to mount the storage for binaries, voting disks, etc., and left it that way for over 15 minutes. I was expecting an eviction, primarily because the voting disks were not available on that node, but nothing happened.
This is why I am a bit confused and trying to figure out why I am seeing different results.
Thanks
From: Riyaj Shamsudeen [mailto:riyaj.shamsudeen_at_gmail.com]
Sent: Wednesday, August 13, 2014 4:32 PM
To: Hameed, Amir
Cc: oracle-l_at_freelists.org
Subject: Re: Oracle RAC nodes eviction question
Hello Amir
Losing the binaries can, and most probably will, lead to node eviction. When there is a fault for an executable page that is not in the page cache, that page needs to be paged in from the binary. If the binary is not available, the GI processes will be killed. The death of the GI processes leads to events such as missed heartbeats and, finally, to node eviction. From 11gR2 onwards, a GI restart is attempted before the node is restarted; possibly that file system was still unavailable during the GI restart attempt, so it would have led to an eventual node restart.
This is analogous to removing the oracle binary while the database is up (in that case, too, the database will eventually crash).
I guess an option to avoid node eviction due to the loss of binaries mounted through NFS is to keep the GI and RDBMS homes local; still, that has its own risks. Of course, in big cluster environments, it is easier said than done.
Cheers
Riyaj Shamsudeen
Principal DBA,
Ora!nternals - http://www.orainternals.com - Specialists in Performance, RAC and EBS
Blog: http://orainternals.wordpress.com/
Oracle ACE Director and OakTable member (http://www.oaktable.com/)
Co-author of the books: Expert Oracle Practices (http://tinyurl.com/book-expert-oracle-practices), Pro Oracle SQL (http://tinyurl.com/ahpvms8), Expert RAC Practices 12c (http://tinyurl.com/expert-rac-12c), Expert PL/SQL Practices (http://tinyurl.com/book-expert-plsql-practices)
On Wed, Aug 13, 2014 at 12:57 PM, Hameed, Amir <Amir.Hameed_at_xerox.com> wrote:
Folks,
I am trying to understand the behavior of an Oracle RAC cluster if the Grid and RAC binary homes become unavailable while the cluster and Oracle RAC are running. The Grid version is 11.2.0.3 and the platform is Solaris 10. The Oracle Grid and Oracle RAC environments are on NAS, with the database configured with dNFS. The storage for the Grid and RAC binaries comes from one NAS head, whereas the OCR and voting disks (three of each) are spread over three NAS heads so that, in the event that one NAS head becomes unavailable, the cluster can still access two voting disks. The recommendation for this configuration came from the storage vendor and Oracle.
What we observed last weekend was that when the NAS head from which the Grid and RAC binaries were mounted went down for a few minutes, all RAC nodes were rebooted even though two voting disks were still accessible. In my destructive testing about a year ago, one of the tests run was to pull all cables of the NICs that were used for kernel NFS on one of the RAC nodes, but the cluster did not evict that node. Any feedback will be appreciated.
Thanks,
Amir
--
http://www.freelists.org/webpage/oracle-l
Received on Wed Aug 13 2014 - 23:36:24 CEST