RE: Question re racgmain processes running amok

From: William Wagman <wjwagman_at_ucdavis.edu>
Date: Fri, 21 Mar 2008 16:17:27 -0700
Message-ID: <FE043305B38A0F448F3924429D650C2A06EBE31C@VEXBE2.ex.ad3.ucdavis.edu>


Jeremy,  

Thanks for the feedback. As I mentioned, a reboot of the system resolved the issue. I am keeping a watch on the system to see if it happens again, but in the meantime I am also looking through some logs to see if I can figure anything else out. At the time the problem was occurring, crs_stat appeared to be hanging. I couldn't tell whether it was in fact hanging or was simply unable to get enough memory to execute, since all of the memory was in use.

Bill Wagman
Univ. of California at Davis
IET Campus Data Center
wjwagman_at_ucdavis.edu
(530) 754-6208

From: Jeremy Schneider [mailto:jeremy.schneider_at_ardentperf.com]
Sent: Friday, March 21, 2008 3:09 PM
To: William Wagman
Cc: oracle-l_at_freelists.org
Subject: Re: Question re racgmain processes running amok  

That's the workhorse script called by CRS to start/stop/stat resources. Find out what the parameter is (start, stop or stat) with something like this:

cat /proc/[pid####]/cmdline|tr '\000' '\n'

That'll tell us whether CRS is continually restarting ONS or just trying to "stat" it. (crs_stat can also tell you if there were failed restarts.) Then you might try to figure out what racgmain is waiting for. To start, I'd look at the process status (is it 'D'? what's WCHAN from ps -l?) and the network connections (does netstat show any connections stuck in a state such as CLOSE_WAIT or TIME_WAIT?). You might also get a stack trace with gdb -p and then "backtrace".
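
For anyone who wants to run those checks over many processes at once, here is a rough sketch of a helper. It assumes a Linux box with /proc mounted and the procps tools (pgrep, ps) plus netstat installed. The function names show_args and inspect_procs are made up for illustration, and netstat -p generally needs root to see sockets owned by other users.

```shell
#!/bin/sh
# Print the argument vector of a process, one argument per line.
# /proc/<pid>/cmdline separates the arguments with NUL bytes.
show_args() {
    tr '\000' '\n' < "/proc/$1/cmdline"
}

# For every process whose name matches (default: racgmain), show its
# arguments, its state and wait channel, and any TCP sockets it owns.
inspect_procs() {
    name=${1:-racgmain}
    for pid in $(pgrep "$name"); do
        echo "== PID $pid =="
        show_args "$pid"
        # A STAT of 'D' means uninterruptible sleep; WCHAN names the
        # kernel function the process is sleeping in.
        ps -o pid,stat,wchan,args -p "$pid"
        # The last netstat column is "PID/Program name".
        netstat -tanp 2>/dev/null | grep " $pid/"
    done
}
```

Source the file and run inspect_procs with no arguments to walk every racgmain process, or pass another process name to look at something else. The output only shows what each process was asked to do and where it is sleeping; it will not say why CRS keeps spawning them.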

Just a few ideas... I'm really interested to hear what you turn up. :)

-Jeremy

On Fri, Mar 21, 2008 at 3:21 PM, William Wagman <wjwagman_at_ucdavis.edu> wrote:

Greetings,

The question pertains to a two node RAC cluster running Oracle 10.2.0.3.0 SE on 32-bit Linux 2.6.9-67.ELsmp. CRS, ASM & RDBMS are each in a separate home. Yesterday on node 1 I started seeing messages in the /var/log/messages file of the form...

Mar 20 07:5:34 spenser init: Id "h3" respawning too fast: disabled for 5 minutes

We did some looking around to try to determine the cause of this but didn't come up with anything immediately. There was a core dump generated in the $CRS_HOME/log/<node_name>/crsd directory at about the time we noticed this beginning. Error messages indicating various failures (I can provide a segment) also appeared in crsd.log at this time. At this point I didn't know what was occurring, so I opened an SR with Oracle.

This morning, while gathering some additional information, I found that on node2 in this cluster there were a large number of racgmain processes running and that the number was increasing. All of the swap space and virtually all of the memory on this node were in use. Some of the processes were running out of the CRS home and some out of the ASM home. I did some investigating to see if it would be possible to stop these processes gracefully but was unable to find any information. Ultimately we rebooted node2 of the cluster, and everything appears to be functioning as expected at this point.

My question is: what would cause the racgmain process to run amok this way? Currently ps -ef|grep racgmain shows none running on either node. I'm puzzled by this, and other than information indicating that this process is part of ONS, I have not been able to find any further details. Any suggestions would be greatly appreciated.

Thanks.

Bill Wagman
Univ. of California at Davis
IET Campus Data Center
wjwagman_at_ucdavis.edu
(530) 754-6208

--
http://www.freelists.org/webpage/oracle-l

-- 
Jeremy Schneider
Chicago, IL
http://www.ardentperf.com/category/technical


Received on Fri Mar 21 2008 - 18:17:27 CDT
