Re: RAC node "has a disk HB, but no network HB" but traceroute

From: Gus Spier <gus.spier_at_gmail.com>
Date: Thu, 5 Jan 2017 19:49:47 -0500
Message-ID: <CAG8xnid-5s8YcappybV-O8j7-WaFJNALxdert_fPYs0R0mM71g_at_mail.gmail.com>



http://berxblog.blogspot.com/

Is this the Martin Berger blog you referred to?

Quite a coincidence that it popped up in my Google feed right after I read about your trials and tribulations.

Regards,
Gus

On Thu, Jan 5, 2017 at 5:08 PM, Yong Huang <dmarc-noreply_at_freelists.org> wrote:

> Thanks, Justin, Jure and Martin. Martin's article is great. Interpreting
> "no network HB" as "there are 2 or more processes which missed to
> communicate" instead of a network problem is the key. That's exactly what I
> meant in the SR I opened by saying "We begin to doubt about the meaning of
> the "no network HB" message". So far the SR hasn't gone anywhere, even
> after I uploaded various types of logs.
>
> Our log does show a fast increase in IP packets that need reassembly, and
> all of these reassemblies failed:
> $ egrep '^zzz|reassembl' <OSWatcher netstat log>
> ...
> zzz Sun Dec 18 02:01:58 CST 2016
> 555539624 reassemblies required
> 100653307 packets reassembled ok
> 60026 packet reassembles failed
> zzz Sun Dec 18 02:02:28 CST 2016
> 555545702 reassemblies required
> 100653307 packets reassembled ok
> 66103 packet reassembles failed
> zzz Sun Dec 18 02:02:58 CST 2016
> 555551748 reassemblies required
> 100653307 packets reassembled ok
> 72149 packet reassembles failed
>
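
For what it's worth, here is a rough sketch of the arithmetic on those three snapshots. It is plain Python; the function names and the hard-coded sample are only for illustration and are not part of OSWatcher. It parses the egrep output and prints how much each counter moved between consecutive snapshots.

```python
import re

# Sample copied from the three OSWatcher snapshots quoted above.
SAMPLE = """\
zzz Sun Dec 18 02:01:58 CST 2016
555539624 reassemblies required
100653307 packets reassembled ok
60026 packet reassembles failed
zzz Sun Dec 18 02:02:28 CST 2016
555545702 reassemblies required
100653307 packets reassembled ok
66103 packet reassembles failed
zzz Sun Dec 18 02:02:58 CST 2016
555551748 reassemblies required
100653307 packets reassembled ok
72149 packet reassembles failed
"""

COUNTER = re.compile(
    r"^\s*(\d+) (reassemblies required|packets reassembled ok|"
    r"packet reassembles failed)$"
)


def parse_snapshots(text):
    """Return a list of (timestamp, counters) tuples, one per 'zzz' line."""
    snapshots = []
    for line in text.splitlines():
        if line.startswith("zzz "):
            snapshots.append((line[4:].strip(), {}))
        else:
            match = COUNTER.match(line)
            if match and snapshots:
                snapshots[-1][1][match.group(2)] = int(match.group(1))
    return snapshots


def deltas(snapshots):
    """Change of each counter between consecutive snapshots."""
    result = []
    for (_, prev), (stamp, cur) in zip(snapshots, snapshots[1:]):
        result.append((stamp, {name: cur[name] - prev[name] for name in cur}))
    return result


if __name__ == "__main__":
    for stamp, change in deltas(parse_snapshots(SAMPLE)):
        print(stamp, change)
```

On the quoted sample, each 30-second interval adds roughly 6,000 to "reassemblies required" and almost exactly the same number to "packet reassembles failed", while "packets reassembled ok" does not move at all.
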
> Of all the documents I found, Red Hat "IP fragmentation fails and
> fragmented packets get dropped" at
> https://access.redhat.com/solutions/1498603
> is a good one, but you have to log in to read it. In short, if I understand
> the confusing Root Cause section correctly, kernel-2.6.32-477.el6 or
> RHEL6.6 has a bug that incorrectly calculates IP fragmentation memory,
> which causes false evictions (i.e. drops) of IP fragments on systems with
> many CPUs. (Our problem server has 80 CPUs. Other servers have far fewer.)
> Upgrading the kernel or the Red Hat release is the solution. An easy
> workaround is to increase the fragmentation buffer size. The article says
> doubling the fragmentation thresholds is enough, i.e. from the default 4M
> to 8M. We'll set the IP fragmentation buffer low and high values to 15 and
> 16 MB per Oracle note 2008933.1. I think the counter "fragments dropped
> after timeout" in `netstat -s' is related to /proc/sys/net/ipv4/ipfrag_time.
> Ours seems to have been fairly stable even before the crash, so I'll leave
> that parameter alone for now.
>
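
A small sketch of what those 15 and 16 MB settings would look like as sysctl lines. The assumptions are mine: that "MB" here means binary megabytes (1024 * 1024 bytes), and that the knobs in question are net.ipv4.ipfrag_low_thresh and net.ipv4.ipfrag_high_thresh. I have not checked the exact byte values against the Oracle note, so treat the output as illustrative.

```python
MIB = 1024 * 1024  # assumption: "MB" in the message means binary megabytes


def ipfrag_sysctl_lines(low_mb=15, high_mb=16):
    """Render sysctl.conf-style lines for the IP fragment buffer thresholds."""
    if low_mb >= high_mb:
        raise ValueError("low threshold must be below the high threshold")
    return [
        f"net.ipv4.ipfrag_low_thresh = {low_mb * MIB}",
        f"net.ipv4.ipfrag_high_thresh = {high_mb * MIB}",
    ]


if __name__ == "__main__":
    print("\n".join(ipfrag_sysctl_lines()))
```

The helper only formats the two lines; it does not touch the running kernel. Whether they go into /etc/sysctl.conf or get applied some other way is up to you.
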
> Now I think I know why our OSWatcher did not report a traceroute problem
> at the last crash: the default packet size used by traceroute is only 60
> bytes. To detect the problem, we should append a packet length parameter to
> the traceroute command with a value greater than 1500, the Ethernet MTU.
>
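
To illustrate why a 60-byte probe never exercises reassembly, here is a rough sketch that counts the IPv4 fragments for a given total packet length and builds a traceroute argument list with an explicit length. Assumptions on my part: a 20-byte IPv4 header with no options, a 1500-byte MTU, and a traceroute that takes the packet length as a trailing positional argument (the Linux one I know does, but check your man page). The host name in the usage part is hypothetical.

```python
import math

IP_HEADER = 20  # bytes; assumes an IPv4 header without options


def fragment_count(packet_len, mtu=1500):
    """Number of IPv4 fragments for a packet of packet_len total bytes."""
    if packet_len <= mtu:
        return 1
    # Each fragment carries its own IP header, and the payload of every
    # fragment except the last must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((packet_len - IP_HEADER) / per_fragment)


def traceroute_command(host, packet_len):
    """Argument list for traceroute with an explicit packet length."""
    return ["traceroute", host, str(packet_len)]


if __name__ == "__main__":
    for size in (60, 1500, 1501, 3000):
        print(size, "bytes ->", fragment_count(size), "fragment(s)")
    # "racnode2-priv" is a made-up interconnect host name.
    print(" ".join(traceroute_command("racnode2-priv", 3000)))
```

Anything up to 1500 bytes travels as a single packet, so only a probe larger than the MTU has to be fragmented and reassembled on the other side.
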
> Yong Huang
> --
> http://www.freelists.org/webpage/oracle-l

Received on Fri Jan 06 2017 - 01:49:47 CET