Re: High Buffer busy waits

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Tue, 1 Aug 2023 21:28:22 -0400
Message-ID: <CAMHX9JLwkoVruVPLVo4Bv-w_hO66yr5FBu4+XHfoaKbMFR6mDQ_at_mail.gmail.com>



Also, check how busy your DBWRs, CKPT and LGWR processes really are during the problem time.

You can use this syntax, if you want to measure these processes' activity over 30 seconds:

@snapper all 30 1 lgwr
@snapper all 30 1 ckpt
@snapper all 30 1 dbwr

(That's for a single node; for RAC you could use the following, but I'm not sure how well it performs - complex dynamic queries over GV$ can get tricky):

@snapper all 30 1 lgwr@*
@snapper all 30 1 ckpt@*
@snapper all 30 1 dbwr@*

If the bottom three queries never return or take too long, just run the single-node ones above in each instance separately.
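
If snapper isn't handy, a rough alternative sketch for getting a feel of how busy these processes were recently is to count their ASH samples (adjust the time window to cover your problem period):

select program, session_state, event, count(*) as ash_samples
from v$active_session_history
where (program like '%(DBW%' or program like '%(LGWR)' or program like '%(CKPT)')
  and sample_time > sysdate - interval '30' minute
group by program, session_state, event
order by count(*) desc;

Keep in mind that ASH only captures active sessions, so a mostly idle DBWR will barely show up there - snapper's V$SESSTAT-based output gives a more complete picture of what the process is actually doing.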

On Tue, Aug 1, 2023 at 9:26 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:

> Could be that bug, but could also be that the other database (in another
> data center, using another storage array and SAN network) just has
> different I/O response times. So in one data center DBWR falls behind and
> causes this chain of events, while in the other one it doesn't, so
> everything looks ok there.
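>
> As a quick, rough way to compare DBWR's write I/O latency between the two
> environments (just a sketch; the counters are cumulative since instance
> startup), you can look at the wait histogram of DBWR's write event:
>
> select event, wait_time_milli, wait_count
> from v$event_histogram
> where event = 'db file parallel write'
> order by wait_time_milli;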
>
> Use @ash/dash_wait_chains2.sql instead of the regular script. In RAC,
> unfortunately the ASH sample_ids aren't synced (different instances
> start at different times), so in the dash_wait_chains2.sql script I'm
> doing the CONNECT BY wait chain traversal by truncating the ASH sample
> timestamp to the nearest 10-second interval. Not perfect, but assuming
> that the RAC cluster nodes are in pretty precise time-sync, looking up what
> the cross-node blocker was doing will be correct (this only affects RAC,
> not single instance).
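>
> Not the actual script logic, but the bucketing idea is roughly along these
> lines (simplified sketch):
>
> select inst_id, session_id, blocking_inst_id, blocking_session, event,
>        trunc(cast(sample_time as date))
>          + floor((cast(sample_time as date)
>                   - trunc(cast(sample_time as date))) * 8640) / 8640 as sample_bucket
> from gv$active_session_history
> where sample_time > sysdate - 1/24;
>
> The cross-instance blocker lookup then joins on that 10-second bucket plus
> (blocking_inst_id, blocking_session) instead of relying on matching
> sample_ids.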
>
>
> But yes, it is possible for background processes to be "waiting" for
> something to happen without getting stuck/hung on a single wait event;
> instead they get into some sort of a (busy) loop, regularly checking
> for that required condition, while doing other useful work in between (or
> just being idle for a bit). Here's a quite related example where LGWR
> itself was mostly idle as far as wait events were concerned, but it was
> actually in a loop, regularly checking checkpoint progress by reading
> controlfile contents every now and then:
>
>
> https://tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
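>
> If you want to check for a similar pattern yourself, one rough way (besides
> snapper or tracing) is to look at LGWR's cumulative controlfile read waits
> and run the query a couple of times to see whether they keep growing:
>
> select se.event, se.total_waits, round(se.time_waited_micro / 1e6, 1) as seconds_waited
> from v$session_event se, v$session s
> where se.sid = s.sid
>   and s.program like '%(LGWR)'
>   and se.event like 'control file%';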
>
>
> --
> Tanel Poder
> https://tanelpoder.com
>
>
> On Tue, Aug 1, 2023 at 4:56 PM yudhi s <learnerdatabase99_at_gmail.com>
> wrote:
>
>> Thank you so much.
>>
>> Actually, for this database there is an active-active HA setup
>> (bi-directional sync is happening using GoldenGate replication). And we
>> saw that when the application runs the same workload on the
>> other/secondary database, we don't see such high buffer busy waits
>> during the peak period.
>>
>> We have the fast_start_mttr_target value set to zero in both databases.
>>
>> And further, checking the latest entry in dba_registry_sqlpatch: in the
>> database in which we are getting buffer busy waits it is "MERGE ON
>> DATABASE RU 19.15.0.0.0 OF 29213893 30664385",
>> whereas in the other/secondary database, which takes that workload without
>> any issue, it is "MERGE ON DATABASE RU 19.19.0.0.0 OF 35233847 33526630". So
>> it's likely to be related to the buffer cache bug
>> which Priit mentioned: DBWR not picking up writes for some time (Doc ID
>> 33973908.8)
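>>
>> (For reference, we pulled that with something along these lines:)
>>
>> select action_time, patch_id, patch_type, status, description
>> from dba_registry_sqlpatch
>> order by action_time desc
>> fetch first 5 rows only;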
>>
>> But is there any way we can be more certain that it's really the same bug
>> that is impacting us?
>>
>> Actually, we are now live on the secondary database, where the issue is
>> not occurring, but it will most likely come back when we move back to the
>> primary database. However, we have AWR data, and the event trend from
>> dba_hist_system_event shows a higher spike in "log file switch
>> (checkpoint incomplete)" exactly during the same issue period. So I'm
>> wondering whether there are any other waits whose trend could be checked
>> for that period to prove the point about the mentioned bug? Also, the bug
>> note says "*the DB writer not picking up write for some time*", so does
>> that mean the wait chain we are seeing is abnormal, since it shows the
>> ultimate blocker as "*idle *blocker 1,2395,34375
>> (oracle_at_XXXXXXXXXXX (DBW0)", with the *idle* keyword itself pointing to
>> the bug/idle DBW0 here?
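>>
>> (Something along these lines is how we are looking at the trend from AWR -
>> deltas between snapshots - also including a couple of other DBWR-related
>> events such as "free buffer waits" and "write complete waits", just on the
>> assumption that they might correlate too:)
>>
>> select sn.begin_interval_time, e.instance_number, e.event_name,
>>        e.total_waits - lag(e.total_waits) over
>>          (partition by e.dbid, e.instance_number, e.event_name
>>           order by e.snap_id) as waits_delta
>> from dba_hist_system_event e, dba_hist_snapshot sn
>> where e.snap_id = sn.snap_id
>>   and e.dbid = sn.dbid
>>   and e.instance_number = sn.instance_number
>>   and e.event_name in ('log file switch (checkpoint incomplete)',
>>                        'free buffer waits', 'write complete waits')
>> order by sn.begin_interval_time, e.instance_number, e.event_name;
>>
>> (The counters reset on instance restart, so a negative delta just means the
>> instance was bounced in between.)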
>>
>>
>> On Tue, Aug 1, 2023 at 5:19 AM Tanel Poder <tanel_at_tanelpoder.com> wrote:
>>
>>> Reading through other responses - it could of course be a bug too (when
>>> you hit a bug, all bets are off).
>>>
>>>
>>> On Mon, Jul 31, 2023 at 7:45 PM Tanel Poder <tanel_at_tanelpoder.com>
>>> wrote:
>>>
>>>> You need to make your checkpoints faster, so that they don't get in
>>>> the way of log file switches. You have a DBWR write I/O problem, not a LGWR
>>>> or buffer busy wait problem (DBWR is the ultimate blocker in the wait
>>>> chain). Increasing/adding redologs may help to reduce these log switch
>>>> waits/symptoms, but ultimately DBWR is not keeping up with syncing enough
>>>> dirty buffers to disk (so that older redologs could be overwritten).
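>>>>
>>>> (As a quick sanity check of how fast the redo logs are cycling, something
>>>> like this shows the log switches per hour - just a rough sketch:)
>>>>
>>>> select to_char(first_time, 'YYYY-MM-DD HH24') as hour,
>>>>        count(*) as log_switches
>>>> from v$log_history
>>>> where first_time > sysdate - 3
>>>> group by to_char(first_time, 'YYYY-MM-DD HH24')
>>>> order by 1;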
>>>>
>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Aug 02 2023 - 03:28:22 CEST
