Re: 10.1.0.3 RAC on Solaris 8-9

From: Alexey Sergeyev <saefido7_at_devexperts.com>
Date: Mon, 4 Oct 2004 13:40:30 +0400
Message-ID: <cjr5o6$170d$1@news.rtcomm.ru>

Hi Daniel!

I don't think the NIC being used matters, because without CRS on our hosts Sun Cluster resynchronizes in a few seconds... The problem doesn't exist even when we use 10.1.0.2 CRS - reconfiguration takes 10-20 seconds. But when we apply patch 10.1.0.3, some weird things happen, without any parameters having been changed. See our logs:

We are resetting the first node...

After about 30 seconds the second node recognizes that:

2004-09-28 13:21:41.648 [8] >WARNING: clssnmPollingThread: node(0) missed(29) checkin(s)
2004-09-28 13:21:42.658 [8] >WARNING: clssnmPollingThread: node(0) missed(30) checkin(s)
2004-09-28 13:21:43.659 [8] >WARNING: clssnmPollingThread: node(0) missed(31) checkin(s)
2004-09-28 13:21:44.668 [8] >WARNING: clssnmPollingThread: node(0) missed(32) checkin(s)
2004-09-28 13:21:45.678 [8] >WARNING: clssnmPollingThread: Eviction started for node 0, flags 0x0001, state 3, wt4c 0

So far so good - CRS is going to evict the problem node. Now it has to synchronize the cluster:

2004-09-28 13:21:50.729 [8] >TRACE: clssnmDoSyncUpdate: Initiating sync 3
2004-09-28 13:21:50.729 [4] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] seq[10] sync[3]
2004-09-28 13:21:51.208 [1] >USER: NMEVENT_SUSPEND [00][00][00][02]

Here Oracle hangs completely. What is going on during these 5 minutes?

2004-09-28 13:26:55.749 [8] >WARNING: clssnmWaitOnEvictions: Unconfirmed dead node count 1
2004-09-28 13:26:55.750 [4] >USER: clssnmHandleUpdate: SYNC(3) from node(1) completed
2004-09-28 13:26:55.750 [4] >USER: clssnmHandleUpdate: NODE(1) IS ACTIVE MEMBER OF CLUSTER
2004-09-28 13:26:56.330 [14] >USER: NMEVENT_RECONFIG [00][00][00][02]

Oracle continues to work.

2004-09-28 13:26:56.331 [7] >TRACE: clssgmPeerListener: connects done (1/1)
CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes
CLSS-3001: local node number 1, master node number 1

It looks like a new 300-second timeout was introduced in 10.1.0.3, but where can I change it?
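To put a number on the stall, here is a minimal Python sketch that measures the gaps between the log entries quoted above. It is not part of any CRS tooling; the helper names `parse_ts` and `gaps` are made up for illustration, and the only assumption about the log format is that every line starts with a 23-character `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the excerpts.

```python
from datetime import datetime

# Log lines copied from the excerpts in this post (second node's CSS log).
LOG_LINES = [
    "2004-09-28 13:21:45.678 [8] >WARNING: clssnmPollingThread: Eviction started for node 0, flags 0x0001, state 3, wt4c 0",
    "2004-09-28 13:21:50.729 [8] >TRACE: clssnmDoSyncUpdate: Initiating sync 3",
    "2004-09-28 13:21:51.208 [1] >USER: NMEVENT_SUSPEND [00][00][00][02]",
    "2004-09-28 13:26:55.749 [8] >WARNING: clssnmWaitOnEvictions: Unconfirmed dead node count 1",
    "2004-09-28 13:26:56.330 [14] >USER: NMEVENT_RECONFIG [00][00][00][02]",
]


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def gaps(lines):
    """Return (seconds, earlier_line, later_line) for each consecutive pair."""
    stamps = [parse_ts(line) for line in lines]
    return [
        ((later - earlier).total_seconds(), line_a, line_b)
        for earlier, later, line_a, line_b in zip(stamps, stamps[1:], lines, lines[1:])
    ]


if __name__ == "__main__":
    for seconds, line_a, line_b in gaps(LOG_LINES):
        # Show only the message part after the timestamp to keep output short.
        print(f"{seconds:8.3f} s  {line_a[24:60]!r} -> {line_b[24:60]!r}")
```

The largest gap is between NMEVENT_SUSPEND and the clssnmWaitOnEvictions warning, about 304.5 seconds, which is what makes me suspect a 300-second wait rather than a network problem.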

-- 
Alexey Sergeyev

"Daniel Morgan" <damorgan_at_x.washington.edu> wrote in message
news:1096519035.377922_at_yasure...


> Alexey Sergeyev wrote:
>
> > Hi
> >
> > Has anyone dealt with 10.1.0.3 RAC on Solaris 8 or 9? How long does a
> > cluster re-synchronize after a failure of one of nodes? We got an absolutely
> > unexpected result - about 6 minutes...
> >
>
> Outrageous if properly configured. But lets look at the obvious question
> first. Who selected the NIC cards and are they certified for 10g RAC?
> The reason I ask is the "good" NIC cards have a keep-alive and try to
> reconnect. This is the worst possible thing to do with RAC. With RAC you
> want the cheapest dumbest cards you can find because you want a failure
> to kill the connection instantly. O/S may also be configured with a
> keep-alive so check that too.
>
> I routinely get sub-second fail-overs with RedHat Linux.
> -- 
> Daniel A. Morgan
> University of Washington
> damorgan_at_x.washington.edu
> (replace 'x' with 'u' to respond)
>

Received on Mon Oct 04 2004 - 04:40:30 CDT