Re: Odd wait event freezing database
Date: Sat, 17 Apr 2021 01:06:33 +0530
Message-ID: <CAKna9Va4WDt+jTnpsqhfDsEWJhFqXK6xq=iZS=kBSqVxzFEwWw_at_mail.gmail.com>
On Sun, Apr 11, 2021 at 12:53 AM Lok P <loknath.73_at_gmail.com> wrote:
> As I read it , when a checkpoint occurs the change tracking writer i.e.
> CTWR will write the data from the CTWR DBA buffer to the BCT file, so we
> are seeing this BCT Buffer space wait + reliable message waits means the
> check point is not finishing up and getting stuck. So does it mean that
> when we issue COMMIT during that point itself the changed data
> is struggling to get written to CTWR buffer because of its so small space?
> But again if it's because of the small size of CTWR buffer being small then
> how come initiating BCT file fixing such an issue?
>
>
>
> On Sun, Apr 11, 2021 at 12:25 AM Lok P <loknath.73_at_gmail.com> wrote:
>
>> Below is the bug that seems to be exactly what we are suffering from as
>> it shows how "check point not completing" and "reliable message" are
>> associated with this BCT issue. In past we had suffered mainly from lot of
>> "BCT Buffer space waits" and then we found BCT file size was 22GB and as
>> suggested by Oracle, we endup increasing the size of
>> _bct_public_dba_buffer_size from ~128MB to ~256MB and re-initialized the
>> BCT i.e. disbaled+enabled and that fixed the issue then. The size of the
>> BCT file after reinitialization was down to ~9 GB.
>>
>> But it seems now within ~2months the BCT file has again grown back to
>> same ~22GB and this time we are seeing combination of waits like "reliable
>> message" +"BCT buffer space waits", even with that
>> _bct_public_dba_buffer_size being set as ~256MB, so it seems this doubling
>> of buffer size has not helped then rather re-initialization of BCT helped.
>> So this time we are going to do the same reinitializing of BCT to have a
>> temporary fix.But now wondering what will be a permanent fix for this issue?
>>
>> High Waits On 'block change tracking buffer space' - Checkpoint
>> Contention With BLOCK CHANGE TRACKING or RMAN Incremental Backup (Doc ID
>> 2094946.1)
>>
>> On Sat, Apr 10, 2021 at 12:53 AM Lok P <loknath.73_at_gmail.com> wrote:
>>
>>> We have verified many occurrences and it looks like each time the
>>> "reliable message" is followed by many "block change tracking buffer
>>> space" wait events. We are thinking of initiating BCT by disabling +
>>> enabling the BCT. but still unable to relate how come the BCT Buffer space
>>> can cause "reliable message" wait events making the database freeze?
>>>
>>> On Fri, Apr 9, 2021 at 5:27 PM Lok P <loknath.73_at_gmail.com> wrote:
>>>
>>>> Resending, as it seems bounced back from the list server.
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Lok P <loknath.73_at_gmail.com>
>>>> Date: Fri, Apr 9, 2021 at 5:08 PM
>>>> Subject: Re: Odd wait event freezing database
>>>> To: Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>> Cc: Oracle L <oracle-l_at_freelists.org>
>>>>
>>>>
>>>> Thank you Sayan, Jonathan.
>>>>
>>>> I checked one occurrence of the issue from
>>>> dba_hist_active_sess_history. The first occurrence of "reliable message" is
>>>> logged at 4/8/2021 9:50:39.957 AM on NODE-4 against that sql along with a
>>>> few other SELECT queries. And then I fetch all the ASH data starting 9:49AM
>>>> and attached(dash_dump.xlsx) is the results from
>>>> dba_hist_active_sess_history starting 9:49 on 8th April for all the four
>>>> nodes. What I see is mainly a BCT related wait i.e. "block change tracking
>>>> buffer space" event and I am not sure how this can be related to the actual
>>>> issue. I am not seeing any such event related to DBWR or CKPT though just
>>>> before 9:50.
>>>>
>>>> I have attached the sql plan(sql_queryplan.txt) with the predicate
>>>> section for the one which i mentioned in my initial post and it seems that
>>>> is the one suffering heavily. And I am not sure if it's the cause or the
>>>> victim. I do see that BCT waits were logged against some key inserts to
>>>> transaction tables.
>>>>
>>>> Regards
>>>> Lok
>>>>
>>>> On Fri, Apr 9, 2021 at 1:01 AM Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>> wrote:
>>>>
>>>>> I nearly agree with Sayan about the "long checkpoint".but I think it
>>>>> might be indirect.
>>>>> You're on 11g, so I THINK the sequence of events is something like the
>>>>> following (but Sayan may well correct me):
>>>>>
>>>>> Session wants to do direct path read (cell smart scan for Exadata).
>>>>> Session sends message to CKPT to do object checkpoint and waits for
>>>>> CKPT to return a reliable message
>>>>> CKPT sends DBWR an object checkpoint request and waits for DBWR to
>>>>> acknowledge
>>>>> Once DBWR has acknowledged, CKPT takes a KO enqueue in exclusive mode
>>>>> and sends a message to the session
>>>>> The session tries to acquire a KO enqueue in share mode (mode 4) and
>>>>> ends up waiting on CKPT.
>>>>> When CKPT finds the checkpoint it complete it releases its exclusive
>>>>> lock - which allows the session to get its lock, release it, and continue.
>>>>>
>>>>> If that's correct then it looks like CKPT has a problem getting a
>>>>> message to DBWR, or DBWR had a problem acknowledging CKPT.
>>>>> If might be revealing to check v$active_session_history for that 5
>>>>> minutes (or dba_hist_active_sess_history) to see what waits appear for CKPT
>>>>> and/or DBW% as that may give some clues. (Of course with RAC the object
>>>>> checkpoint has to propagate across all instances, so that may complicate
>>>>> the contents of ASH)
>>>>>
>>>>> Regards
>>>>> Jonathan Lewis
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 8 Apr 2021 at 19:30, Lok P <loknath.73_at_gmail.com> wrote:
>>>>>
>>>>>> Hi All, Its version 11.2.0.4 of Oracle exadata is a 4 node RAC
>>>>>> database. We are seeing one of the query runs normally finish in a few
>>>>>> seconds but sometimes it runs for 3-4 minutes with the wait event being
>>>>>> noted as "reliable message" and during that time period things seem to
>>>>>> freeze in the database almost all the nodes getting stuck. So I am not sure
>>>>>> if this query is the cause of the slowness or the victim, but it seems
>>>>>> whenever such an issue occurred this query was getting executed from
>>>>>> multiple sessions and was running longer than expected time. No change in
>>>>>> plan happened for this query and with the same plan it used to finish in
>>>>>> seconds during other times. So wanted to understand if we are hitting any
>>>>>> bug around this wait event as this looks a bit unusual? It seems happening
>>>>>> while scanning mostly table TSFS in FULL , want to understand what's wrong
>>>>>> with scanning table TSFS?
>>>>>> Below attached is the sql monitor for the same query which is showing
>>>>>> all time(~200+ seconds) being spent on event "reliable message" only.
>>>>>>
>>>>>>
>>>>>>
-- http://www.freelists.org/webpage/oracle-lReceived on Fri Apr 16 2021 - 21:36:33 CEST