RE: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms
From: D'Hooge Freek <Freek.DHooge_at_uptime.be>
Date: Fri, 26 Aug 2011 17:07:48 +0200
Message-ID: <4814386347E41145AAE79139EAA39898150E4F4926_at_ws03-exch07.iconos.be>
Marko,
Your system is complaining about lost disks and I/O paths. I also see that your multipathing is configured to queue I/O indefinitely when all paths to a disk are lost (queue_if_no_path). When that happens, the clusterware will start reporting inaccessible voting disks, while the other processes will (appear to) hang.
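For illustration, a minimal multipath.conf sketch of the two behaviours; the vendor/product values and the retry count below are placeholders, not taken from your system:

  defaults {
      # queue_if_no_path: queue I/O forever when all paths are lost;
      # blocked processes just hang, which CSS reports as unreachable
      # voting disks until the node is evicted
      features "1 queue_if_no_path"
  }

  devices {
      device {
          # placeholder vendor/product - match your storage array
          vendor  "EXAMPLE"
          product "EXAMPLE-LUN"
          # alternative: retry for a bounded time, then return I/O
          # errors so the clusterware can react
          # (12 retries x polling_interval seconds)
          no_path_retry 12
      }
  }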
Can you check if you still get errors like "tur checker reports path is down" or "kernel: end_request: I/O error,..."? If not, check whether they start to appear when you put some load on the I/O subsystem.
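To be concrete, the checks I have in mind are along these lines (standard device-mapper-multipath tooling; adjust the log file path to your distro):

  # show each multipath map and the state of its paths
  # (look for paths marked "failed" or "faulty")
  multipath -ll

  # search the system log for path checker and block layer errors
  grep -i "tur checker reports path is down" /var/log/messages
  grep -i "end_request: I/O error" /var/log/messages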
Regards,
Freek D'Hooge
Uptime
Oracle Database Administrator
email: freek.dhooge_at_uptime.be
tel +32(0)3 451 23 82
http://www.uptime.be
disclaimer: www.uptime.be/disclaimer
---
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Marko Sutic
Sent: Friday, August 26, 2011 10:34
To: David Barbour
Cc: oracle-l_at_freelists.org
Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

Hi David,

/var/log/messages is stuffed with various messages and I cannot identify what is important to look for. I will attach an excerpt of the log file from the period during the import, when the failure occurred. If you notice something odd, please let me know.

Regards,
Marko

On Fri, Aug 26, 2011 at 12:14 AM, David Barbour <david.barbour1_at_gmail.com> wrote:

Anything in /var/log/messages?

On Thu, Aug 25, 2011 at 5:42 AM, Marko Sutic <marko.sutic_at_gmail.com> wrote:

Freek,

you are correct - the heartbeat fatal messages are there due to the missing voting disk.

I have another database up and running on the second node, and that database uses the same OCFS2 volume for Oracle database files as the first one. It is running without any errors, so I suppose the other OCFS2 volumes were accessible at the time of the failure.

In this configuration there are 3 voting disk files, located on 3 different LUNs and on separate OCFS2 volumes. When the failure occurs, two of the three voting devices hang. It is also worth mentioning that nothing else is running on that node except the import.

I simply can't figure out why two of the three voting disks hang.

Regards,
Marko

On Thu, Aug 25, 2011 at 11:08 AM, D'Hooge Freek <Freek.DHooge_at_uptime.be> wrote:

Marko,

I don't know the error timings for the other node, but I think the heartbeat fatal messages came after the first node terminated due to the missing voting disk. This would indicate that there is no general problem with the voting disk itself, but that the problem is specific to the first node. Either the connection itself, the load, or an OCFS2 bug would then be the cause of the error.

Do you know whether the other OCFS2 volumes were still accessible at the time of the failure? Are your voting disks placed on the same LUNs as your database files, or are they on a separate OCFS2 volume?

Regards,

Freek D'Hooge
Uptime
Oracle Database Administrator
email: freek.dhooge_at_uptime.be
tel +32(0)3 451 23 82
http://www.uptime.be
disclaimer: www.uptime.be/disclaimer

---
From: Marko Sutic [mailto:marko.sutic_at_gmail.com]
Sent: Thursday, August 25, 2011 10:51
To: D'Hooge Freek
Cc: oracle-l_at_freelists.org
Subject: Re: CRS-1615:voting device hang at 50% fatal, termination in 99620 ms

Error messages from the other node:

2011-08-25 10:38:33.563
[cssd(18117)]CRS-1612:node l01ora3 (1) at 50% heartbeat fatal, eviction in 14.000 seconds
2011-08-25 10:38:40.558
[cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 7.010 seconds
2011-08-25 10:38:41.560
[cssd(18117)]CRS-1611:node l01ora3 (1) at 75% heartbeat fatal, eviction in 6.010 seconds
2011-08-25 10:38:45.558
[cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 2.010 seconds
2011-08-25 10:38:46.560
[cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 1.010 seconds
2011-08-25 10:38:47.562
[cssd(18117)]CRS-1610:node l01ora3 (1) at 90% heartbeat fatal, eviction in 0.010 seconds
2011-08-25 10:38:47.574
[cssd(18117)]CRS-1607:CSSD evicting node l01ora3. Details in /u01/app/crs/log/l01ora4/cssd/ocssd.log.
2011-08-25 10:39:01.579
[cssd(18117)]CRS-1601:CSSD Reconfiguration complete. Active nodes are l01ora4 .
Regards,
Marko

--
http://www.freelists.org/webpage/oracle-l
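The voting disk layout discussed above (three voting files on separate LUNs and OCFS2 volumes) can be cross-checked with the standard clusterware and ocfs2-tools commands; a sketch, run as root, with output format varying by release:

  # list the configured voting disks and their paths
  crsctl query css votedisk

  # detect OCFS2 volumes and show which cluster nodes have them mounted
  mounted.ocfs2 -f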