Re: intermittent long "log file sync" waits
Date: Wed, 29 Jan 2020 23:32:41 -0500
Message-ID: <CAMHX9J+FooJWWr896aFx7oqeMk=PKA__GaHjxjU_gsOyE1Tzjg_at_mail.gmail.com>
Comments:
- First, looks like you're using all the right tools for systematic troubleshooting ;-)
- As "log file sync" is not directly an I/O wait event (IO happens
indirectly under the hood), but "wait for LGWR to ACK/increment redo
persisting position" then there may be multiple reasons for delays. I'm
repeating some of what others have said here too:
- LGWR could be waiting for remote ACK (synchronous Data Guard) - that should show up as an LNS* wait or something like that though
- Process slowness due to CPU overload/scheduling latency or swapping (in both cases LGWR may do its work fast, but it takes a while for the foreground process to wake up and end the wait event - you could also check whether the alert.log reports any LGWR mode switching around that time; see the sketch below)
- Bug/missed post due to switching between LGWR progress polling vs. waiting for LGWR post
- ASM instance communication issues (ASM communication can be on the critical path when creating/changing files on ASM and even for reading, when the ASM metadata cache in your local instance doesn't have what it needs)
- So, the physical redo log file disk write is not necessarily the biggest contributor to the log file sync wait
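Regarding the alert.log check above - a minimal sketch for scanning it from SQL, assuming the ADR-backed V$DIAG_ALERT_EXT view is available and populated in your environment (adjust the message filter; the time window below is just the window from this thread):

  -- Look for LGWR-related alert.log messages around the problem window
  -- (ORIGINATING_TIMESTAMP is a TIMESTAMP WITH TIME ZONE, so mind your time zone)
  select originating_timestamp, message_text
    from v$diag_alert_ext
   where originating_timestamp between timestamp '2020-01-27 15:45:00'
                                   and timestamp '2020-01-27 16:00:00'
     and upper(message_text) like '%LGWR%'
   order by originating_timestamp;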
- You'll need to measure what LGWR itself was doing during the problem time
- Since you're on 19c (12c or later), LGWR has slaves too and they may be doing most of the work
- You'll need to look into both the LGWR process and any other LG* ones too
- Since you have (Oracle) Snapper running already and writing to trace you should have the data already
- The *lgwr* in the @snapper ash,stats,trace 10 999999 lgwr syntax includes the LGWR slaves too
- As the LGWR slaves have their own sessions (and SIDs), you should run ashtop etc. and report whether there's any BLOCKING_SESSION value listed for LGWR and its slaves (see the example below)
- If LGWR and the slaves were listed as *idle* during the problem time
(while others were waiting on the log file sync event at the same time), *and if* there
was no huge OS-level CPU/memory shortage at the time, this looks like a
background <-> foreground communication problem.
- This (and any potential ASM issues) could probably be troubleshot via KST tracing, KSR channels and X$TRACE - some of this is enabled by default, like I showed in a previous hacking session: https://www.youtube.com/watch?v=mkmvZv58W6w
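For example, something like this would list the LGWR + LG* slave sessions and then drill down into just those SIDs with ashtop (the PROGRAM pattern is just my assumption about how the LG* slaves show up in your V$SESSION; the SID list in the ashtop call is a placeholder you'd fill in from the first query, and the time window is the one from your earlier output):

  -- List LGWR and its LG* slave sessions (verify the pattern against your
  -- V$SESSION.PROGRAM values, e.g. "oracle@host (LGWR)", "... (LG00)" etc.)
  select sid, serial#, program
    from v$session
   where program like '%(LG%';

  -- Then drill down into just those sessions with ashtop (same syntax as the
  -- calls earlier in this thread) - replace the SID list with your own values:
  @ashtop session_id,blocking_session,event2 "session_id in (<lgwr_sid>,<lg00_sid>,<lg01_sid>)" "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"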
- Paste the full *Snapper* output (of one time snapshot from the problem
time) here if you can - we'll be able to see what the slaves were doing
- You'd need to include not only the LGWR SID but also any LG* slave SIDs
- Also we'll see better LGWR delay metrics from V$SESSTAT that go beyond just wait events
- Thinking of metrics like these (snapper already handles them - look at the
rightmost column; see also the example query below):
- *redo synch time overhead (usec)* - "FG wakeup overhead per log file sync"
- *redo write time* - accounts for the end-to-end redo write time (not just the log file parallel write syscall), "per redo write"
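If you want to eyeball those counters without snapper, here's a minimal sketch against V$SYSSTAT (these are cumulative counters, so take two samples a few seconds apart and look at the deltas - which is exactly what snapper automates for you):

  -- Cumulative redo write / sync statistics (units differ per stat: the
  -- "(usec)" one is in microseconds, the others are counts or centiseconds).
  -- Sample twice and diff the values over your problem window.
  select name, value
    from v$sysstat
   where name in ('redo synch time overhead (usec)',
                  'redo write time',
                  'redo synch time',
                  'redo synch writes',
                  'redo writes');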
- Regarding the *ASH wait chains* error on 19c - I'm aware of this but forgot to look into it
- It (probably) has something to do with GV$ views (or the PX slaves used for PDB V$ queries) and possibly the CONNECT BY I'm using
- Does it work in the CDB (if multitenant)?
- Does it work with a lower optimizer_features_enable setting? (example below)
- As we are already manually looking into what LGWR and its slaves are doing, there's no need for the wait chains script here
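For the optimizer features test, something like this before re-running the wait chains script (the value is just an example):

  -- Temporarily lower optimizer features for this session only, then re-run
  -- the ash_wait_chains script to see if the error goes away
  alter session set optimizer_features_enable = '11.2.0.4';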
- Linux Process Snapper - pSnapper
- Were LGWR/LG* processes mostly idle during the problem time?
- As I kept reading the thread, it looks like ASH reported LGWR being mostly idle during the problem time
- This is not exactly the same pattern, but I've seen cases in the past where everyone's waiting for LGWR, yet LGWR doesn't seem to be busy and doesn't show any significant waits
- It turned out that LGWR was actually waiting for CKPT to make progress, but it wasn't waiting synchronously - it was just going to sleep and "checking again later" in a loop
- I wrote about this here: https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
- So it may still make sense to check what the LGWR/LG* processes are doing even if they don't seem to be very active (e.g. do you see an occasional controlfile read wait by LGWR?)
- SQL*Trace would capture all LGWR activity, but it may slow things down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (which keeps a log of the last 10 waits of every session in memory) - see the sketch below
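A minimal sketch of the V$SESSION_WAIT_HISTORY sampling idea (run it repeatedly and spool the output during the problem time; the PROGRAM pattern for catching LGWR and the LG* slaves is again just my assumption - verify it against your V$SESSION):

  -- Last 10 waits of LGWR and its LG* slaves (in-memory ring buffer, so
  -- sample this in a loop and spool it while the problem is happening)
  select systimestamp as sample_time,
         h.sid, h.seq#, h.event, h.wait_time, h.p1, h.p2, h.p3
    from v$session_wait_history h
    join v$session s on s.sid = h.sid
   where s.program like '%(LG%'
   order by h.sid, h.seq#;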
Incidentally, I'm doing another hacking session tomorrow (Thursday 30 Jan at
12pm ET) about complex Oracle performance problems that span multiple
layers and have multiple moving parts. I can add some LGWR troubleshooting
magic at the end of that (there goes my plan to fit the session into 1.5h
again!). I'll post the details on my blog in a few minutes.
--
Tanel Poder
https://blog.tanelpoder.com/seminar
On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com>
wrote:
> doesn't look like any ash data for lgwr:
>
> SQL> @ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> TOTALSECONDS    AAS    %This    SESSION_ID    BLOCKING_SESSION    EVENT2                FIRST_SEEN             LAST_SEEN              DIST_SQLEXEC_SEEN
> ____________    ___    _____    __________    ________________    __________________    ___________________    ___________________    _________________
>           81    0.7    82% |          1713                1710    log file sync         2020-01-27 15:50:54    2020-01-27 15:52:14                    1
>            5      0     5% |          1221                        ON CPU                2020-01-27 15:50:59    2020-01-27 15:52:14                    1
>            3      0     3% |           246                        ON CPU                2020-01-27 15:51:00    2020-01-27 15:52:00                    1
>            3      0     3% |          1713                        ON CPU                2020-01-27 15:50:51    2020-01-27 15:52:27                    2
>            1      0     1% |           123                        ON CPU                2020-01-27 15:51:35    2020-01-27 15:51:35                    1
>            1      0     1% |           127                        ON CPU                2020-01-27 15:51:50    2020-01-27 15:51:50                    1
>            1      0     1% |           252                2321    latch free            2020-01-27 15:52:28    2020-01-27 15:52:28                    1
>            1      0     1% |           978                        ges remote message    2020-01-27 15:50:48    2020-01-27 15:50:48                    1
>            1      0     1% |           983                        latch free            2020-01-27 15:52:28    2020-01-27 15:52:28                    1
>            1      0     1% |          1713                        library cache lock    2020-01-27 15:50:48    2020-01-27 15:50:48                    1
>            1      0     1% |          1831                        ON CPU                2020-01-27 15:50:49    2020-01-27 15:50:49                    1
>
> 11 rows selected.
>
>
> SQL> @ashtop session_id,blocking_session,event2 session_id=1710 "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>
> no rows selected
>
> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <
> nenad.noveljic_at_vontobel.com> wrote:
>
>> You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t
>> show idle blockers.
>>
>> I looked at the ashtop output once again. Strange that neither the lgwr
>> process nor any other bg processes show up significantly there. The question is what
>> lgwr was doing while the fg processes were waiting on it.
>>
>>