Re: RAC node "has a disk HB, but no network HB" but traceroute reports no problem
Date: Wed, 4 Jan 2017 09:10:14 +0100
Message-ID: <CAC08BHJAEd=4XuzR+vOpimY4Bs0AzQoQYLWTVMNGxZ3Hfq6UKQ_at_mail.gmail.com>
Hi,
Martin Berger recently posted a blog entry which discusses an issue with
similar symptoms and its resolution:
http://berxblog.blogspot.si/2016/12/interconnect-fragmentation-kills-cluster.html
Regards,
Jure Bratina
On Wed, Jan 4, 2017 at 8:13 AM, Justin Mungal <justin_at_n0de.ws> wrote:
> "no network HB" means that the Network Heartbeat is failing for some
> reason.
>
> This is rather anecdotal, but a RAC that my co-worker is responsible for
> was evicting nodes in a similar manner (similar environment as well)
> without any evident network problems. His theory was that the heartbeats
> were failing because the interconnect was not responding fast enough, due
> to all of the existing database activity. He enabled jumbo frames and the
> problem went away. So in other words the TCP/IP stack was busy
> disassembling and reassembling frames and this caused heartbeat responses
> to get slower, and enabling jumbo frames reduced that overhead.
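
To put rough numbers on the fragmentation overhead described above, here is a minimal sketch that counts how many IPv4 packets one interconnect message turns into at a standard MTU versus a jumbo-frame MTU. It rests on assumptions the thread does not confirm: that the message is a UDP datagram carrying one 8192-byte database block, that the MTUs are 1500 and 9000, and that the IPv4 header has no options. The function name is hypothetical.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def ipv4_fragments(udp_payload, mtu):
    """Return how many IPv4 packets are needed to carry one UDP datagram."""
    datagram = udp_payload + UDP_HEADER  # what the IP layer has to carry
    room = mtu - IPV4_HEADER             # IP payload space in one packet
    if datagram <= room:
        return 1                         # fits, no fragmentation
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = room - (room % 8)
    return math.ceil(datagram / per_fragment)


# Assumed 8192-byte block; the MTU values are assumptions as well.
for mtu in (1500, 9000):
    print(mtu, ipv4_fragments(8192, mtu))
```

Under these assumptions the block needs 6 packets at MTU 1500 and a single packet at MTU 9000, which is the reassembly work jumbo frames would avoid.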
>
> Recommendation for the Real Application Cluster Interconnect and Jumbo
> Frames (Doc ID 341788.1)
>
> You can also investigate your timeout settings and adjust them, but Oracle
> generally doesn't recommend this and will probably just tell you to install
> 11.2.0.4 instead.
>
> CSS Timeout Computation in Oracle Clusterware (Doc ID 294430.1)
>
> On Tue, Jan 3, 2017 at 4:20 PM, Yong Huang <dmarc-noreply_at_freelists.org>
> wrote:
>
>> Oracle and GI (grid infrastructure) 11.2.0.3 on 64-bit Red Hat Linux 6.6.
>> Cisco UCS.
>>
>> Node 2 of a 2-node RAC crashed. Log ocssd.log shows:
>>
>> 2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node
>> d1prpcrndb1a (1) at 50% heartbeat fatal, removal in 14.760 seconds
>> 2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node
>> d1prpcrndb1a (1) is impending reconfig, flag 2493454, misstime 15240
>> 2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: local
>> diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending
>> reconfig status(1)
>> 2016-12-18 02:03:06.307: [ CSSD][510686976]clssnmvDHBValidateNcopy: node
>> 1, d1prpcrndb1a, has a disk HB, but no network HB, DHB has rcfg 306434975,
>> wrtcnt, 197140394, LATS 4040636964, lastSeqNo 185041690, uniqueness
>> 1468029747, timestamp 1482048185/1112586906
>> ...[some lines snipped here]...
>> 2016-12-18 02:03:28.094: [ CSSD][510686976]clssnmvDHBValidateNcopy: node
>> 1, d1prpcrndb1a, has a disk HB, but no network HB, DHB has rcfg 306434975,
>> wrtcnt, 197140475, LATS 4040658754, lastSeqNo 197140472, uniqueness
>> 1468029747, timestamp 1482048207/1112608986
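
The two clssnmPollingThread lines in the excerpt above can be combined into the timeout the cluster appears to be counting against. The sketch below parses them and adds the time already missed to the time remaining. It assumes that misstime is reported in milliseconds, which the "50%" wording suggests but the thread does not state. The function name is hypothetical, and the log lines are the ones quoted above, rejoined onto single lines.

```python
import re

# Polling-thread lines from the ocssd.log excerpt, rejoined onto one line each.
LINES = [
    "2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node "
    "d1prpcrndb1a (1) at 50% heartbeat fatal, removal in 14.760 seconds",
    "2016-12-18 02:03:06.307: [ CSSD][499648256]clssnmPollingThread: node "
    "d1prpcrndb1a (1) is impending reconfig, flag 2493454, misstime 15240",
]

REMOVAL_RE = re.compile(r"removal in (\d+\.\d+) seconds")
MISSTIME_RE = re.compile(r"misstime (\d+)")


def implied_timeout_ms(lines):
    """Add the time already missed to the time left before removal."""
    removal_ms = None
    misstime_ms = None
    for line in lines:
        m = REMOVAL_RE.search(line)
        if m:
            removal_ms = round(float(m.group(1)) * 1000)
        m = MISSTIME_RE.search(line)
        if m:
            misstime_ms = int(m.group(1))  # assumed to be milliseconds
    if removal_ms is None or misstime_ms is None:
        raise ValueError("need both a 'removal in' and a 'misstime' line")
    return misstime_ms + removal_ms


print(implied_timeout_ms(LINES))
```

For these two lines the sum is 30000 ms, so the node had gone about 15 seconds without a network heartbeat against what looks like a 30-second limit. The configured value can be checked with `crsctl get css misscount`.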
>>
>> We installed Oracle's OSWatcher and enabled traceroute for the private
>> network, which shows no error during the time:
>>
>> zzz ***Sun Dec 18 02:03:28 CST 2016
>> traceroute to dcprpcrndb1bic1 (10.114.21.3), 30 hops max, 60 byte packets
>> 1 dcprpcrndb1bic1 (10.114.21.3) 0.020 ms 0.008 ms 0.004 ms
>> traceroute to dcprpcrndb1bic2 (10.114.21.67), 30 hops max, 60 byte packets
>> 1 dcprpcrndb1bic2 (10.114.21.67) 0.020 ms 0.006 ms 0.004 ms
>> traceroute to d1prpcrndb1aic1 (10.114.21.2), 30 hops max, 60 byte packets
>> 1 d1prpcrndb1aic1 (10.114.21.2) 0.262 ms 0.259 ms 0.255 ms
>> traceroute to d1prpcrndb1aic2 (10.114.21.66), 30 hops max, 60 byte packets
>> 1 d1prpcrndb1aic2 (10.114.21.66) 0.135 ms 0.123 ms 0.110 ms
>>
>> If traceroute never reports a problem, what does "no network HB" in
>> ocssd.log mean? At 02:03:28, we see both "no network HB" and successful
>> traceroute pings. This is not the first time we have had this problem. The
>> network team never finds any issue, consistent with the traceroute report.
>>
>> OSWatcher traceroute has only basic options:
>> traceroute -r -F <private network IP>
>> where -r means "Bypass the normal routing tables and send directly to a
>> host on an attached network". -F means "Do not fragment probe packets".
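
One thing the traceroute runs above do not exercise is packet size: every probe is 60 bytes, far below any Ethernet MTU. Whether packet size matters in this case is not established in the thread, but a full-size probe with the don't-fragment bit set is a cheap additional check. The sketch below only builds the command line and does not run it. It assumes Linux iputils `ping` (options `-M do`, `-s`, `-c`) and MTUs of 1500 and 9000, since the thread does not give the interconnect MTU. The helper names are hypothetical; the address is one of the private IPs from the traceroute output.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, count=3):
    """Build (but do not run) a don't-fragment ping sized to the given MTU."""
    return ["ping", "-M", "do", "-s", str(max_icmp_payload(mtu)),
            "-c", str(count), host]


# 10.114.21.2 is d1prpcrndb1aic1 in the traceroute output; the MTUs are assumed.
for mtu in (1500, 9000):
    print(" ".join(ping_command("10.114.21.2", mtu)))
```

A probe sized for MTU 9000 that fails while the one sized for 1500 succeeds would point to a hop that does not pass jumbo frames.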
>>
>> /var/log/messages reports no problem at the time. It only starts to show
>> problems after the cluster has already decided on eviction.
>>
>> Yong Huang
>> --
>> http://www.freelists.org/webpage/oracle-l
--
http://www.freelists.org/webpage/oracle-l
Received on Wed Jan 04 2017 - 09:10:14 CET