RE: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms
From: D'Hooge Freek <Freek.DHooge_at_uptime.be>
Date: Fri, 26 Aug 2011 17:07:48 +0200
Message-ID: <4814386347E41145AAE79139EAA39898150E4F4926_at_ws03-exch07.iconos.be>
Marko,
Your system is complaining about lost disks and I/O paths. I also see that your multipathing is configured to queue I/O indefinitely when all paths to a disk are lost (queue_if_no_path). When that happens, the clusterware will start reporting inaccessible voting disks, while the other processes will (appear to) hang.
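For illustration, a minimal multipath.conf sketch of the two behaviours; the vendor/product values and the retry count below are placeholders, not taken from your system:

  defaults {
      # queue_if_no_path: queue I/O forever when all paths are lost;
      # blocked processes just hang, which CSS reports as unreachable
      # voting disks until the node is evicted
      features "1 queue_if_no_path"
  }

  devices {
      device {
          # placeholder vendor/product - match your storage array
          vendor  "EXAMPLE"
          product "EXAMPLE-LUN"
          # alternative: retry for a bounded time, then return I/O
          # errors so the clusterware can react
          # (12 retries x polling_interval seconds)
          no_path_retry 12
      }
  }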
Can you check if you still get errors like "tur checker reports path is down" or "kernel: end_request: I/O error,..."? If not, check whether they start to appear when you put some load on the I/O subsystem.
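To be concrete, the checks I have in mind are along these lines (standard device-mapper-multipath tooling; adjust the log file path to your distro):

  # show each multipath map and the state of its paths
  # (look for paths marked "failed" or "faulty")
  multipath -ll

  # search the system log for path checker and block layer errors
  grep -i "tur checker reports path is down" /var/log/messages
  grep -i "end_request: I/O error" /var/log/messages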
Regards,
Freek D'Hooge
Uptime
Oracle Database Administrator
email: freek.dhooge_at_uptime.be
tel +32(0)3 451 23 82
http://www.uptime.be
disclaimer: www.uptime.be/disclaimer
---
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Marko Sutic
Sent: Friday, August 26, 2011 10:34
To: David Barbour
Cc: oracle-l_at_freelists.org
Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

Hi David,

/var/log/messages is stuffed with various messages and I cannot identify what is important to look for. I will attach an excerpt of the log file from the period during the import, when the failure occurred. If you notice something odd, please let me know.

Regards,
Marko

On Fri, Aug 26, 2011 at 12:14 AM, David Barbour <david.barbour1_at_gmail.com> wrote:

Anything in /var/log/messages?

On Thu, Aug 25, 2011 at 5:42 AM, Marko Sutic <marko.sutic_at_gmail.com> wrote:

Freek,

you are correct - the heartbeat fatal messages are there due to the missing voting disk.

I have another database up and running on the second node, and that database uses the same OCFS2 volume for Oracle database files as the first one. It is running without any errors, so I suppose the other OCFS2 volumes were accessible at the time of the failure.

In this configuration there are 3 voting disk files, located on 3 different LUNs and on separate OCFS2 volumes. When the failure occurs, two of the three voting devices hang. It is also worth mentioning that nothing else is running on that node except the import.

I simply can't figure out why two of the three voting disks hang.

Regards,
Marko

On Thu, Aug 25, 2011 at 11:08 AM, D'Hooge Freek <Freek.DHooge_at_uptime.be> wrote:

Marko,

I don't know the error timings for the other node, but I think the heartbeat fatal messages came after the first node terminated due to the missing voting disk. This would indicate that there is no general problem with the voting disk itself, but that the problem is specific to the first node. Either the connection itself, the load, or an OCFS2 bug would then be the cause of the error.

Do you know whether the other OCFS2 volumes were still accessible at the time of the failure? Are your voting disks placed on the same LUNs as your database files, or are they on a separate OCFS2 volume?

Regards,

Freek D'Hooge
Uptime
Oracle Database Administrator
email: freek.dhooge_at_uptime.be
tel +32(0)3 451 23 82
http://www.uptime.be
disclaimer: www.uptime.be/disclaimer

---
From: Marko Sutic [mailto:marko.sutic_at_gmail.com]
Sent: Thursday, August 25, 2011 10:51
To: D'Hooge Freek
Cc: oracle-l_at_freelists.org
Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

Error messages from the other node:

2011-08-25 10:38:33.563
[cssd(18117)]CRS-1612:node l01ora3 (1) at 50% heartbeat fatal, eviction in 14.000 seconds
2011-08-25 10:38:40.558
[cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 7.010 seconds
2011-08-25 10:38:41.560
[cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 6.010 seconds
2011-08-25 10:38:45.558
[cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 2.010 seconds
2011-08-25 10:38:46.560
[cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 1.010 seconds
2011-08-25 10:38:47.562
[cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 0.010 seconds
2011-08-25 10:38:47.574
[cssd(18117)]CRS-1607:CSSD evicting node l01ora3. Details in /u01/app/crs/log/l01ora4/cssd/ocssd.log.
2011-08-25 10:39:01.579
[cssd(18117)]CRS-1601:CSSD Reconfiguration complete. Active nodes are l01ora4 .
Regards,
Marko

--
http://www.freelists.org/webpage/oracle-l
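The voting disk layout discussed above (three voting files on separate LUNs and OCFS2 volumes) can be cross-checked with the standard clusterware and ocfs2-tools commands; a sketch, run as root, with output format varying by release:

  # list the configured voting disks and their paths
  crsctl query css votedisk

  # detect OCFS2 volumes and show which cluster nodes have them mounted
  mounted.ocfs2 -f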