Re: ALTER SYSTEM commands result in 'reliable message' wait - RAC
Date: Thu, 24 Nov 2016 21:52:14 +0000
Message-ID: <CAP0kZ-3psZ+OKKk_=-OiM55=y8AMefV8OBtk-W+X++s8GoCjvw_at_mail.gmail.com>
We are still actively pursuing the root cause of our encounter with this wait event, with a crit-1 SR open for over a month. Multiple patches have been attempted and still no solution has been found.
Regards,
Ruan
On Mon, Nov 21, 2016 at 7:57 PM, Deepak Sharma <sharmakdeep_oracle_at_yahoo.com> wrote:
> We've started seeing these 'reliable message' waits a lot recently, and
> they cause serious blocking on our DB.
>
> The channels in particular are "obj broadcast channel" and "RBR channel"
>
> Any clues about what they actually mean?
>
> As per "WAITEVENT: "reliable message" Reference Note (Doc ID 69088.1)"
>
> "When a process sends a message using the 'KSR' *intra-instance*
> broadcast service, the message publisher waits on this wait-event until all
> subscribers have consumed the 'reliable message' just sent."
>
> What does "intra-instance" mean here? We're not on RAC, btw. DB version is
> 11.2.0.4
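>
> In case it helps anyone point us in the right direction, this is how we
> are watching the waiters (a minimal sketch; per the reference note, P1/P2
> of this wait are the channel context and channel handle):
>
> select inst_id, sid, p1text, p1, p2text, p2, seconds_in_wait
> from gv$session
> where event = 'reliable message';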
>
>
> On Tuesday, October 25, 2016 8:12 PM, Ruan Linehan <ruandav_at_gmail.com>
> wrote:
>
>
> Hi Martin,
>
> Thanks for your reply and apologies for the delay in response.
>
> Yes, I had come across that very blog entry a few days ago, and there are
> one or two similar indications in metalink notes (e.g. Troubleshooting
> High Waits for 'Reliable Message' (Doc ID 2017390.1)). Attempts to
> isolate a predominant messaging channel causing (or displaying) the issue
> didn't work for me or for Oracle support, unfortunately.
>
> At the moment, we are looking at exporting the metadata via transportable
> tablespaces and plugging the datafiles into a freshly created database.
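>
> Roughly, that would look like the following (a sketch only; the
> tablespace name, directory object and datafile path below are
> placeholders):
>
> -- on the source: make the application tablespace(s) read only
> alter tablespace app_data read only;
>
> -- export just the tablespace metadata with Data Pump
> $ expdp system directory=DATA_PUMP_DIR dumpfile=tts_meta.dmp \
>     transport_tablespaces=APP_DATA transport_full_check=y
>
> -- copy the datafiles across, then plug them into the new database
> $ impdp system directory=DATA_PUMP_DIR dumpfile=tts_meta.dmp \
>     transport_datafiles='/u01/oradata/newdb/app_data01.dbf'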
>
> I am also creating a clone copy of the DB in an isolated cluster so as to
> facilitate all manner of troubleshooting possibilities suggested by Oracle.
> Even if we solve the issue by essentially recreating the dictionary via
> transportable tablespaces, I am averse to letting them off the hook on RCA
> for this one...
>
> Regards,
> Ruan
>
> On Fri, Oct 21, 2016 at 8:27 PM, Martin Berger <martin.a.berger_at_gmail.com>
> wrote:
>
> Ruan,
>
> have you tried to identify the channel according to
> https://perfchron.com/2015/12/30/diagnosing-oracle-reliable-message-waits/ ?
>
> This might help with digging deeper.
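>
> In that spirit, grouping ASH samples by the wait parameters narrows it
> down (a rough sketch; for 'reliable message', P1/P2 are the channel
> context and channel handle, and the post shows how to tie these back to
> a named channel as SYS):
>
> select p1text, p1, p2text, p2, count(*) samples
> from gv$active_session_history
> where event = 'reliable message'
> group by p1text, p1, p2text, p2
> order by samples desc;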
>
> Martin
>
> 2016-10-20 23:53 GMT+02:00 Ruan Linehan <ruandav_at_gmail.com>:
>
> Hi all,
>
> Long time lurker (first time poster)...
>
> I am looking for some assistance... (or brainstorming at least).
>
> I have cloned, via RMAN active duplicate, an ASM-based (3 x Linux node)
> RAC 11.2.0.4.5 database from a shared cluster to a new (similar HW) shared
> cluster (3 x node) environment.
> * I have actually performed this a few times and reproduced the issue each
> time.
>
> The RMAN duplicate works successfully, no problems. I am able to
> start/stop the instance on the destination side post-duplicate.
> Initially the destination instance is single instance, and then I convert
> this to a 3 x instance RAC DB.
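>
> For reference, the duplicate itself is nothing unusual, along these lines
> (connection details and names below are placeholders):
>
> duplicate target database to dupdb
>   from active database
>   spfile
>   set cluster_database 'false';
>
> ...followed by the usual conversion back to RAC:
>
> alter system set cluster_database=true scope=spfile sid='*';
> -- plus the srvctl registrations for the database and instances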
>
> However, I am observing, and able to reproduce, "hanging" symptoms on this
> cloned copy of the database (which, it now transpires, are also
> reproducible on the source DB). We were not aware of this problem before;
> the issue does not always appear consistently, so attempted recreation is
> somewhat hit and miss.
>
> For example, see below and note the awful timings for the ALTER SYSTEM...
> commands. The initialisation parameter chosen is arbitrary; I have
> reproduced the issue with different parameters. It appears to occur only
> when the command incorporates the "memory"/"both" scope clause.
>
> oracle$ sqlplus / as sysdba
> SQL*Plus: Release 11.2.0.4.0 Production on Wed Oct 19 15:35:17 2016
> Copyright (c) 1982, 2013, Oracle. All rights reserved.
> Connected to:
> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit
> Production
> With the Partitioning, Real Application Clusters, Automatic Storage
> Management, OLAP,
> Data Mining and Real Application Testing options
>
> SQL> alter system set dispatchers='' scope=spfile;
> System altered.
> SQL> set timing on
> SQL> alter system set dispatchers='' scope=spfile;
> System altered.
> Elapsed: 00:00:00.01
> SQL> alter system set dispatchers='' scope=memory;
> System altered.
> Elapsed: 00:03:22.69
> SQL> alter system set dispatchers='' scope=both;
> System altered.
> Elapsed: 00:56:52.44
> SQL>
>
> Other general operations on the database are reportedly affected also (it's
> just very easy to demonstrate the above to show the extent of the "hang").
>
> I can shut down all but one RAC instance and the problem persists, albeit
> to a lesser extent; the timings are shorter, but still enormous in
> contrast to the expected command completion time.
>
> I have opened an SR and nothing has been uncovered as yet; Oracle support
> are struggling, to say the least, to ascertain the cause. There is nothing
> in the RDBMS alert logs for each instance and nothing indicative in the
> ASM alert logs...
> An ASH report for the user performing the command indicates 100% 'reliable
> message' wait.
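>
> (Easy to confirm from SQL as well; a sketch against ASH, where the SID of
> the session issuing the ALTER SYSTEM is a placeholder:)
>
> select event, count(*) samples
> from gv$active_session_history
> where session_id = :sid
> and sample_time > sysdate - 1/24
> group by event
> order by samples desc;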
>
> Extract from the RDBMS instance alert log where the above example was
> demonstrated...
>
> oracle$ tail -10 alert_DPOTHPRD1.log
> Wed Oct 19 15:34:07 2016
> CJQ0 started with pid=112, OS id=50645
> Wed Oct 19 15:35:40 2016
> ALTER SYSTEM SET dispatchers='' SCOPE=SPFILE;
> Wed Oct 19 15:35:51 2016
> ALTER SYSTEM SET dispatchers='' SCOPE=SPFILE;
> Wed Oct 19 15:39:20 2016
> ALTER SYSTEM SET dispatchers='' SCOPE=MEMORY;
> Wed Oct 19 16:36:51 2016
> ALTER SYSTEM SET dispatchers='' SCOPE=BOTH;
> oracle$
>
>
> The below is also taken from a hang analyze TRC file:
>
> ===============================================================================
> HANG ANALYSIS:
> instances (db_name.oracle_sid): dbprodrl.dbprodrl1, dbprodrl.dbprodrl2,
> dbprodrl.dbprodrl3
> no oradebug node dumps
> os thread scheduling delay history: (sampling every 1.000000 secs)
> 0.000000 secs at [ 15:37:49 ]
> NOTE: scheduling delay has not been sampled for 0.539347 secs
> 0.000000 secs from [ 15:37:45 - 15:37:50 ], 5 sec avg
> 0.000000 secs from [ 15:36:50 - 15:37:50 ], 1 min avg
> 0.000000 secs from [ 15:33:28 - 15:37:50 ], 5 min avg
> vktm time drift history
> ===============================================================================
> Chains most likely to have caused the hang:
> [a] Chain 1 Signature: 'reliable message'
> Chain 1 Signature Hash: 0x55ce7f30
> [b] Chain 2 Signature: 'EMON slave idle wait'
> Chain 2 Signature Hash: 0x9fbbc886
> [c] Chain 3 Signature: 'EMON slave idle wait'
> Chain 3 Signature Hash: 0x9fbbc886
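>
> (For anyone wanting to reproduce this: hang analysis dumps like the above
> come from the standard cluster-wide oradebug sequence, level 3 being the
> usual starting point:)
>
> SQL> oradebug setmypid
> SQL> oradebug setinst all
> SQL> oradebug -g all hanganalyze 3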
>
> And a screen-grab from an AWR report...
> [image: Inline image 1]
>
> The (cloned) database is running on a shared cluster with three other
> active databases at present. None of the other databases display the same
> symptom. I have examined a multitude of 'reliable message' wait documents
> on metalink, but none fit our scenario. Does anyone have prior experience
> diagnosing this particular wait event?
>
> Regards,
> Ruan Linehan
>
>
>
>
>
> --
> Martin Berger martin.a.berger_at_gmail.com
> +43 660 2978929
> _at_martinberx <https://twitter.com/martinberx>
> http://berxblog.blogspot.com
>
>
>
>
>
--
http://www.freelists.org/webpage/oracle-l

Received on Thu Nov 24 2016 - 22:52:14 CET