Re: intermittent long "log file sync" waits
Date: Wed, 29 Jan 2020 23:32:41 -0500
Message-ID: <CAMHX9J+FooJWWr896aFx7oqeMk=PKA__GaHjxjU_gsOyE1Tzjg_at_mail.gmail.com>
Comments:
- First, looks like you're using all the right tools for systematic troubleshooting ;-)
- As "log file sync" is not directly an I/O wait event (IO happens
indirectly under the hood), but "wait for LGWR to ACK/increment redo
persisting position" then there may be multiple reasons for delays. I'm
repeating some of what others have said here too:
- LGWR could be waiting for remote ACK (synchronous Data Guard) - that should show up as an LNS* wait or something like that though
- Process slowness due to CPU overload/scheduling latency or swapping (in both cases LGWR may do its work fast, but it takes a while for the foreground process to wake up and end the wait event - you could also check whether the alert.log reports any LGWR mode switching around that time; see the sketch below)
- Bug/missed post due to switching between LGWR progress polling vs. waiting for LGWR post
- ASM instance communication issues (ASM communication can be on the critical path when creating/changing files on ASM and even for reading, when the ASM metadata cache in your local instance doesn't have what it needs)
- So, the physical redo log file disk write is not necessarily the biggest contributor to the log file sync wait
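Regarding the alert.log check above - a minimal sketch for scanning it from SQL, assuming the ADR-backed V$DIAG_ALERT_EXT view is available and populated in your environment (adjust the message filter; the time window below is just the window from this thread):

  -- Look for LGWR-related alert.log messages around the problem window
  -- (ORIGINATING_TIMESTAMP is a TIMESTAMP WITH TIME ZONE, so mind your time zone)
  select originating_timestamp, message_text
    from v$diag_alert_ext
   where originating_timestamp between timestamp '2020-01-27 15:45:00'
                                   and timestamp '2020-01-27 16:00:00'
     and upper(message_text) like '%LGWR%'
   order by originating_timestamp;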
- You'll need to measure what LGWR itself was doing during the problem time
- Since you're on 19c (12c or later), LGWR has slaves too and they may be doing most of the work
- You'll need to look into both the LGWR process and any other LG* ones too
- Since you have (Oracle) Snapper running already and writing to trace you should have the data already
- The *lgwr* in the @snapper ash,stats,trace 10 999999 lgwr syntax includes the LGWR slaves too
- As the LGWR slaves have their own sessions (and SIDs), you should run ashtop etc. and report whether there's any BLOCKING_SESSION value listed for LGWR and its slaves (see the example below)
- If LGWR and the slaves were listed as *idle* during the problem time
(while others were waiting on the log file sync event at the same time), *and if* there
was no huge OS-level CPU/memory shortage at the time, this looks like a
background <-> foreground communication problem.
- This (and any potential ASM issues) could probably be troubleshot via KST tracing, KSR channels and X$TRACE - some of this is enabled by default, like I showed in a previous hacking session: https://www.youtube.com/watch?v=mkmvZv58W6w
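For example, something like this would list the LGWR + LG* slave sessions and then drill down into just those SIDs with ashtop (the PROGRAM pattern is just my assumption about how the LG* slaves show up in your V$SESSION; the SID list in the ashtop call is a placeholder you'd fill in from the first query, and the time window is the one from your earlier output):

  -- List LGWR and its LG* slave sessions (verify the pattern against your
  -- V$SESSION.PROGRAM values, e.g. "oracle@host (LGWR)", "... (LG00)" etc.)
  select sid, serial#, program
    from v$session
   where program like '%(LG%';

  -- Then drill down into just those sessions with ashtop (same syntax as the
  -- calls earlier in this thread) - replace the SID list with your own values:
  @ashtop session_id,blocking_session,event2 "session_id in (<lgwr_sid>,<lg00_sid>,<lg01_sid>)" "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"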
- Paste the full *Snapper* output (of one time snapshot from the problem
time) here if you can - we'll be able to see what the slaves were doing
- You'd need to include not only the LGWR SID but also any LG* slave SIDs
- Also we'll see better LGWR delay metrics from V$SESSTAT that go beyond just wait events
- Thinking of metrics like these (snapper already handles them - look at the
rightmost column; see also the example query below):
- *redo synch time overhead (usec)* - "FG wakeup overhead per log file sync"
- *redo write time* - accounts for the end-to-end redo write time (not just the log file parallel write syscall), "per redo write"
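If you want to eyeball those counters without snapper, here's a minimal sketch against V$SYSSTAT (these are cumulative counters, so take two samples a few seconds apart and look at the deltas - which is exactly what snapper automates for you):

  -- Cumulative redo write / sync statistics (units differ per stat: the
  -- "(usec)" one is in microseconds, the others are counts or centiseconds).
  -- Sample twice and diff the values over your problem window.
  select name, value
    from v$sysstat
   where name in ('redo synch time overhead (usec)',
                  'redo write time',
                  'redo synch time',
                  'redo synch writes',
                  'redo writes');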
- Regarding the *ASH wait chains* error on 19c - I'm aware of this but forgot to look into it
- It (probably) has something to do with GV$ views (or the PX slaves used for PDB V$ queries) and possibly the CONNECT BY I'm using
- Does it work in the CDB (if multitenant)?
- Does it work with a lower optimizer_features_enable setting? (example below)
- As we are already manually looking into what LGWR and its slaves are doing, there's no need for the wait chains script here
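For the optimizer features test, something like this before re-running the wait chains script (the value is just an example):

  -- Temporarily lower optimizer features for this session only, then re-run
  -- the ash_wait_chains script to see if the error goes away
  alter session set optimizer_features_enable = '11.2.0.4';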
- Linux Process Snapper - pSnapper
- Were LGWR/LG* processes mostly idle during the problem time?
- As I kept reading the thread, it looks like ASH reported LGWR being mostly idle during the problem time
- This is not exactly the same pattern, but I've seen cases in the past where everyone's waiting for LGWR, yet LGWR doesn't seem to be busy and doesn't show any significant waits
- It turned out that LGWR was actually waiting for CKPT to make progress, but it wasn't waiting synchronously - it was just going to sleep and "checking again later" in a loop
- I wrote about this here: https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
- So it may still make sense to check what the LGWR/LG* processes are doing even if they don't seem to be very active (e.g. do you see an occasional controlfile read wait by LGWR?)
- SQL*Trace would capture all LGWR activity, but it may slow things down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (which keeps a log of the last 10 waits of every session in memory) - see the sketch below
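A minimal sketch of the V$SESSION_WAIT_HISTORY sampling idea (run it repeatedly and spool the output during the problem time; the PROGRAM pattern for catching LGWR and the LG* slaves is again just my assumption - verify it against your V$SESSION):

  -- Last 10 waits of LGWR and its LG* slaves (in-memory ring buffer, so
  -- sample this in a loop and spool it while the problem is happening)
  select systimestamp as sample_time,
         h.sid, h.seq#, h.event, h.wait_time, h.p1, h.p2, h.p3
    from v$session_wait_history h
    join v$session s on s.sid = h.sid
   where s.program like '%(LG%'
   order by h.sid, h.seq#;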
Incidentally, I'm doing another hacking session tomorrow (Thursday 30 Jan at
12pm ET) about complex Oracle performance problems that span multiple
layers and have multiple moving parts. I can add some LGWR troubleshooting
magic at the end of that (there goes my plan to fit the session into 1.5h
again!). I'll post the details on my blog in a few minutes.
--
Tanel Poder
https://blog.tanelpoder.com/seminar
On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com>
wrote:
> doesn't look like any ash data for lgwr:
>
> SQL> @ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> TOTALSECONDS    AAS    %This    SESSION_ID    BLOCKING_SESSION    EVENT2                FIRST_SEEN             LAST_SEEN              DIST_SQLEXEC_SEEN
> ____________    ___    _____    __________    ________________    __________________    ___________________    ___________________    _________________
>           81    0.7    82% |          1713                1710    log file sync         2020-01-27 15:50:54    2020-01-27 15:52:14                    1
>            5      0     5% |          1221                        ON CPU                2020-01-27 15:50:59    2020-01-27 15:52:14                    1
>            3      0     3% |           246                        ON CPU                2020-01-27 15:51:00    2020-01-27 15:52:00                    1
>            3      0     3% |          1713                        ON CPU                2020-01-27 15:50:51    2020-01-27 15:52:27                    2
>            1      0     1% |           123                        ON CPU                2020-01-27 15:51:35    2020-01-27 15:51:35                    1
>            1      0     1% |           127                        ON CPU                2020-01-27 15:51:50    2020-01-27 15:51:50                    1
>            1      0     1% |           252                2321    latch free            2020-01-27 15:52:28    2020-01-27 15:52:28                    1
>            1      0     1% |           978                        ges remote message    2020-01-27 15:50:48    2020-01-27 15:50:48                    1
>            1      0     1% |           983                        latch free            2020-01-27 15:52:28    2020-01-27 15:52:28                    1
>            1      0     1% |          1713                        library cache lock    2020-01-27 15:50:48    2020-01-27 15:50:48                    1
>            1      0     1% |          1831                        ON CPU                2020-01-27 15:50:49    2020-01-27 15:50:49                    1
>
> 11 rows selected.
>
>
> SQL> @ashtop session_id,blocking_session,event2 session_id=1710 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> no rows selected
>
> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <
> nenad.noveljic_at_vontobel.com> wrote:
>
>> You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t
>> show idle blockers.
>>
>> I looked at the ashtop output once again. Strange that neither the lgwr
>> process nor any other bg processes show up significantly there. The question is what
>> lgwr was doing while the fg processes were waiting on it.
>>
>>