RAC Cluster - 100% cpu on all nodes

From: Steve Perry <sperry_at_sprynet.com>
Date: Mon, 12 Jun 2006 19:11:48 -0500
Message-Id: <3FD17F1C-D5D6-468F-A187-FF83815A4C7B@sprynet.com>

I got called today about one of our RAC clusters (RHEL 4, 2 CPUs, 8GB RAM, 10.2.0.1, 32-bit, ASM 2, EMC CLARiiON CX700 storage, dual QLogic HBAs) that was locked up.
The CPU on both nodes was at 100%. It took several minutes to log in, and I could never get into sqlplus (10-15 minutes of waiting). I also tried to shut it down with srvctl, but that didn't respond either.
IO was near zero, which makes sense: the CPU starved all other resources. No errors in the alert.logs (both nodes) for either ASM or the instances, just a gap in the entries from 10am to 2pm (reboot). No new trace files.
Nothing significant in the ka-zillion logs in the clusterware home. No errors in /var/log/messages.

While running ps -ef, I saw 20+ processes like this: /opt/app/oracle/product/crs10.2.0/bin/racgmain check

Some were owned by root and some by oracle, and each one took about 5% CPU.
They didn't want to wait for a diagnosis, so they said to reboot both nodes.
Everything came up fine, but after the reboot there was only one of the processes mentioned above.
I ran cluvfy and it passed all the tests. I also ran the AWR reports for 10am to 2pm afterward but haven't analyzed them yet.
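
For anyone who wants to tally the racgmain check processes by owner instead of eyeballing ps -ef, here is a minimal Python sketch. The function name count_by_owner, the sample rows, and the instance name in the last row are made up for illustration and were not captured from the cluster. It assumes the standard ps -ef column layout (UID first, CMD as the eighth column).

```python
from collections import Counter


def count_by_owner(ps_output, pattern="racgmain check"):
    """Count processes whose command line contains `pattern`, grouped by owner.

    Assumes `ps -ef` style text: UID is the first column and the command
    line (CMD) starts at the eighth column.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) == 8 and pattern in fields[7]:
            counts[fields[0]] += 1
    return dict(counts)


# Hypothetical sample rows in `ps -ef` layout, for illustration only.
sample = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root      4101     1  5 10:02 ?        00:12:01 /opt/app/oracle/product/crs10.2.0/bin/racgmain check
oracle    4102     1  5 10:02 ?        00:11:58 /opt/app/oracle/product/crs10.2.0/bin/racgmain check
oracle    4103     1  5 10:03 ?        00:11:40 /opt/app/oracle/product/crs10.2.0/bin/racgmain check
oracle    4200     1  0 09:00 ?        00:00:03 ora_pmon_hypothetical1
"""

print(count_by_owner(sample))
```

On a live node the same function could be fed the real listing, for example subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout, assuming the box is responsive enough to run it.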

Has anyone else experienced this with RAC? Is there a quick hit list of things you check when things go south? I'm pretty methodical and started checking the standard things, but that wasn't fast enough for these folks. What do you check when all nodes of a RAC cluster are locked up like that?

I contacted support, but I don't have much hope based on my recent experience.

Thanks,
Steve

P.S. I forgot to grab the sar data to see what it shows. I'll do that tomorrow.

--
http://www.freelists.org/webpage/oracle-l

Received on Mon Jun 12 2006 - 19:11:48 CDT