Re: intermittent long "log file sync" waits

From: Chris Stephens <cstephens16_at_gmail.com>
Date: Thu, 30 Jan 2020 10:27:50 -0600
Message-ID: <CAEFL0sx7fAF89Qike6PS2Jy9-N5S-w-r31XQ7kj8AEGP5FonTg_at_mail.gmail.com>



Hey Tanel,

Thanks for taking the time to write this up! Will definitely be attending that webinar.

comments inline

On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:

> Comments:
>
> 1) First, looks like you're using all the right tools for systematic
> troubleshooting ;-)
>

***i try!

>
> 2) As "log file sync" is not directly an I/O wait event (IO happens
> indirectly under the hood), but "wait for LGWR to ACK/increment redo
> persisting position" then there may be multiple reasons for delays. I'm
> repeating some of what others have said here too:
> - LGWR could be waiting for remote ACK (synchronous data guard) - that
> should show up as a LNS* wait or something like that though
> - Process slowness due to CPU overload/scheduling latency or swapping
> (in both cases, LGWR may do its work fast, but it takes a while for the
> foreground process to wake up and end the wait event - you could check if
> there are any reports regarding switching LGWR mode in alert.log around
> that time)
> - Bug/missed post due to switching between LGWR progress polling vs.
> waiting for LGWR post
> - ASM instance communication issues (ASM communication can be on the
> critical path when creating/changing files on ASM and even for reading,
> when the ASM metadata cache in your local instance doesn't have what it
> needs)
> - So, the physical redo log file disk write is not necessarily the
> biggest contributor to the log file sync wait
>
*** no data guard in the env.
*** we have excess CPU (20 cores/instance and usually only 1-5 active foreground processes, but i will confirm from TFA collections).
*** yesterday i was ready to blame a bug. today, not so sure. i checked log file switches with a script google found me a few days ago. now thinking that was inaccurate, as a script from metastink shows tons of switching, so that could very well be related.
*** asm communication is something i need to look into. TFA has reported some ASM issues but there are no details and i never saw anything in the asm alert logs indicating an issue.
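
*** for the switch frequency, the sanity check i'm running now is just counting rows in v$log_history per hour, roughly like this (quick sketch; assuming the view still covers the window i care about):

SQL> select trunc(first_time, 'HH24') hour_start, count(*) log_switches
       from v$log_history
      where first_time > sysdate - 1
      group by trunc(first_time, 'HH24')
      order by 1;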

> 3) You'll need to measure what LGWR itself was doing during the problem
> time
> - Since you're on 19c (12c or later) the LGWR has slaves too and they
> may be doing most of the work
> - You'll need to look into both the LGWR process and any other LG* ones
> too
> - Since you have (Oracle) Snapper running already and writing to trace
> you should have the data already
>    - The *lgwr* in the following @snapper ash,stats,trace 10 999999 lgwr
> syntax includes LGWR slaves too
> - As LGWR slaves have their own sessions (and SIDs), you should run
> ashtop etc and report if there's any BLOCKING_SESSION value listed for LGWR
> and slaves
> - If LGWR and the slaves were listed *idle* during the problem time
> (while others waited for log file sync event at the same time) *and if *there
> was no huge OS level CPU/memory shortage at the time, this looks like a
> background <-> foreground communication problem.
>    - This (and any potential ASM issues) could probably be
> troubleshot via KST tracing, KSR channels and X$TRACE, some of which is
> enabled by default (like I showed in a previous hacking session:
> https://www.youtube.com/watch?v=mkmvZv58W6w )
>
>
here is all the output i have corresponding to a 6 sec "log file sync" wait:

Experience duration: R = t1 − t0 = 41.336137 seconds
Experience begin time: t0 = 2020-01-29T13:04:07.560478
Experience end time:   t1 = 2020-01-29T13:04:48.896615

CALL-NAME                        DURATION       %   CALLS      MEAN       MIN       MAX
------------------------------  ---------  ------  ------  --------  --------  --------
SQL*Net message from client     26.017105   66.5%   9,886  0.002632  0.000209  7.858051
log file sync                   10.582409   27.1%     150  0.070549  0.000295  6.909711
EXEC                             2.088757    5.3%   9,702  0.000215  0.000000  0.209199
DLM cross inst call completion   0.109617    0.3%     212  0.000517  0.000005  0.080575
row cache lock                   0.070118    0.2%     125  0.000561  0.000208  0.000972
FETCH                            0.047586    0.1%   1,029  0.000046  0.000000  0.003056
PARSE                            0.045293    0.1%   4,778  0.000009  0.000000  0.002692
CLOSE                            0.033453    0.1%   4,784  0.000007  0.000000  0.000237
library cache lock               0.023614    0.1%      55  0.000429  0.000152  0.000901
library cache pin                0.022311    0.1%      55  0.000406  0.000153  0.000689
17 others                        0.071430    0.2%  10,772  0.000007  0.000000  0.001487
------------------------------  ---------  ------  ------  --------  --------  --------
TOTAL (27)                      39.111693  100.0%  41,548  0.000941  0.000000  7.858051

6 sec wait from raw data:

*** 2020-01-29T13:04:20.724666-06:00
WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync scn=976089279 p3=0 obj#=-1 tim=1025228531958
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228532118
WAIT #0: nam='SQL*Net message from client' ela= 1711 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228533852
BINDS #140204587734864:
i enabled 10046 on lgwr:
*** 2020-01-29T13:04:20.725994-06:00
WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228533313
WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17 requests=2 obj#=-1 tim=1025228533777
WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3 requests=2 obj#=-1 tim=1025228534104
WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0 obj#=-1 tim=1025228534182
WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534437
WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2 requests=2 obj#=-1 tim=1025228534721
WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228534786
WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538038
WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2 requests=1 obj#=-1 tim=1025228538297
WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228538358
WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1 tim=1025228550204
WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228550294
WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7 requests=1 obj#=-1 tim=1025228550624
WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228550699
WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1 tim=1025228567467
WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9 requests=1 obj#=-1 tim=1025228567762
WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228567825
WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1 tim=1025228584091
WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9 requests=1 obj#=-1 tim=1025228584400
WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228584502
WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1 tim=1025228600993
WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9 requests=1 obj#=-1 tim=1025228601393
WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228601482
WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1 tim=1025228611053
WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228618574
WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025228618636
WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9 requests=1 obj#=-1 tim=1025228618960
WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228619066
WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0 obj#=-1 tim=1025228635045
WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9 requests=1 obj#=-1 tim=1025228635427
WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228635525
WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0 obj#=-1 tim=1025228651537
WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9 requests=1 obj#=-1 tim=1025228651912
WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228652034
WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0 obj#=-1 tim=1025228668405
WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9 requests=1 obj#=-1 tim=1025228668804
WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228668911
WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0 obj#=-1 tim=1025228685907
WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9 requests=1 obj#=-1 tim=1025228686327
WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228686453
WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1 tim=1025228693390
WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2 requests=1 obj#=-1 tim=1025228693768
WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1 tim=1025228693862
...more of the same...
WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231090742
WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0 obj#=-1 tim=1025231110227
WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9 requests=1 obj#=-1 tim=1025231110633
WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231110748
WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0 obj#=-1 tim=1025231126261
WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4 requests=1 obj#=-1 tim=1025231126641
WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231126753
WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0 obj#=-1 tim=1025231143514
WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8 p2=0 p3=0 obj#=-1 tim=1025231143603
WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9 requests=1 obj#=-1 tim=1025231143976
WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231144093
WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149515
WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4 requests=1 obj#=-1 tim=1025231149889
WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0 obj#=-1 tim=1025231149989

*** 2020-01-29T13:04:23.802774-06:00
WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0 p3=0 obj#=-1 tim=1025231610086

*** 2020-01-29T13:04:26.797925-06:00
WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0 p3=0 obj#=-1 tim=1025234605238
WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0 obj#=-1 tim=1025234616054
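
*** fwiw, the lgwr 10046 was enabled via oradebug, roughly like this (sketch from memory; as sysdba, level 8 = waits):

SQL> oradebug setorapname lgwr
SQL> oradebug event 10046 trace name context forever, level 8
SQL> oradebug tracefile_name
...and off again afterwards:
SQL> oradebug event 10046 trace name context off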

i will get you the snapper lgwr output associated w/ this interval as soon as i can find it (can't remember where i stashed it and i lost the trace file name .. fml)
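
*** for reference, the commands i plan to (re)run against lgwr + the LG* slaves (rough sketch, assuming the usual snapper/ashtop argument syntax and that program like '%(LG%' catches the 19c log writer slaves):

SQL> select sid, serial#, program from v$session where program like '%(LG%';

SQL> @snapper ash,stats,trace 10 999999 lgwr

SQL> @ashtop session_id,blocking_session,event2 "program like '%(LG%'" "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29 13:05:00'"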

> 4) Paste the full *Snapper* output (of one time snapshot from the problem
> time) here if you can - we'll be able to see what the slaves were doing
> - You'd need to include not only LGWR SID but also any LG* slave SIDs too
> - Also we'll see better LGWR delay metrics from V$SESSTAT that go beyond
> just wait events
> - Thinking of metrics like these - snapper already handles them (if you
> look into the rightmost column)
> - *redo synch time overhead (usec)* - "FG wakeup overhead per log
> file sync"
> - *redo write time* - accounting end to end redo write time (not
> just the log file parallel write syscall) "per redo write"
>
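*** will grab those as well. quick sketch of the v$sesstat pull i have in mind (stat names taken from above; assuming they exist under those exact names in 19c):

SQL> select st.sid, sn.name, st.value
       from v$sesstat st, v$statname sn
      where sn.statistic# = st.statistic#
        and sn.name in ('redo write time', 'redo synch time overhead (usec)')
        and st.value > 0
      order by sn.name, st.value desc;
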
> 5) Regarding the *ASH wait chains* error on 19c - I'm aware of this but
> forgot to look into it
> - It has (probably) something to do with GV$ views (or PX slaves used
> for PDB V$ queries) and possibly the connect by I'm using
> - Does it work in the CDB (if multitenant)?
> - Does it work with lower optimizer features enabled?
> - As we are already manually looking into what LGWR and its slaves are
> doing, there's no need for the wait chains script here
>
*** oracle has filed a bug. altering optimizer_features_enabled doesn't help.

> 6) Linux Process Snapper - pSnapper
> - If you want to see the full "wait" states of Linux processes/threads,
> use the *-a* option.
> - It's somewhat like always showing "idle" threads too
>    - But for Linux apps it's harder to determine what's idle and what's
> some network RPC wait, so I currently just do what the Linux "load" metric does
>    - It just shows threads in Runnable and Disk/Uninterruptible sleep
> states by default (-a shows all states, including Sleep and kernel-Idle
> threads etc)
> - So the Linux pSnapper "idle" means a somewhat different thing than
> Oracle "idle" - say LGWR was waiting for a Data Guard ACK, that should not
> be shown as Idle in Oracle, but at the Linux process level it's "just some
> network socket wait" and you'd need to use *-a* option to show all of them
>

*** will add that option to future collections (and watch the video / read the help)

>
> 7) Were LGWR/LG* processes mostly idle during the problem time?
> - As I kept reading the thread, it looks like ASH reported LGWR being
> mostly idle during the problem time
> - This is not exactly the same pattern, but I've seen cases in the past
> where everyone's waiting for LGWR and LGWR doesn't seem to be busy and
> doesn't show any significant waits
> - It turned out that LGWR was actually waiting for CKPT to make
> progress, but it wasn't synchronously waiting, but just going to sleep and
> "checking again later" in a loop
> - I wrote about this here:
> https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
> - So it may still make sense to check what LGWR/LG* processes are doing
> even if they don't seem to be active much (like do you see an occasional
> controlfile read wait by LGWR etc)
> - SQL*Trace would capture all activity of LGWR, but may slow things
> down further, so you could also sample & spool V$SESSION_WAIT_HISTORY (that
> keeps a log of last 10 waits of every session in memory)
>
*** i think i've addressed most of this.
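
*** if i catch it live again i'll also spool the last-10-waits view for lgwr and the LG* slaves, something like this (rough sketch; column list from memory):

SQL> select w.sid, w.seq#, w.event, w.wait_time, w.p1, w.p2, w.p3
       from v$session_wait_history w
      where w.sid in (select sid from v$session where program like '%(LG%')
      order by w.sid, w.seq#;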

> Incidentally I'm doing another hacking session tomorrow (Thursday 30 Jan @
> 12pm ET) about complex Oracle performance problems that span multiple
> layers and have multiple moving parts. I can add some LGWR troubleshooting
> magic in the end of that (here goes my plan to fit the session in 1.5h
> again!). I'll post the details in my blog in a few minutes.
>
*** can't wait!!

> --
> Tanel Poder
> https://blog.tanelpoder.com/seminar
>
>
> On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com>
> wrote:
>
>> doesn't look like any ash data for lgwr:
>>
>> SQL> @ashtop session_id,blocking_session,event2 1=1 "TIMESTAMP'2020-01-27
>> 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>> TOTALSECONDS   AAS   %This   SESSION_ID   BLOCKING_SESSION   EVENT2               FIRST_SEEN            LAST_SEEN             DIST_SQLEXEC_SEEN
>> ____________   ___   _____   __________   ________________   __________________   ___________________   ___________________   _________________
>>           81   0.7   82% |         1713               1710   log file sync        2020-01-27 15:50:54   2020-01-27 15:52:14                   1
>>            5     0    5% |         1221                      ON CPU               2020-01-27 15:50:59   2020-01-27 15:52:14                   1
>>            3     0    3% |          246                      ON CPU               2020-01-27 15:51:00   2020-01-27 15:52:00                   1
>>            3     0    3% |         1713                      ON CPU               2020-01-27 15:50:51   2020-01-27 15:52:27                   2
>>            1     0    1% |          123                      ON CPU               2020-01-27 15:51:35   2020-01-27 15:51:35                   1
>>            1     0    1% |          127                      ON CPU               2020-01-27 15:51:50   2020-01-27 15:51:50                   1
>>            1     0    1% |          252               2321   latch free           2020-01-27 15:52:28   2020-01-27 15:52:28                   1
>>            1     0    1% |          978                      ges remote message   2020-01-27 15:50:48   2020-01-27 15:50:48                   1
>>            1     0    1% |          983                      latch free           2020-01-27 15:52:28   2020-01-27 15:52:28                   1
>>            1     0    1% |         1713                      library cache lock   2020-01-27 15:50:48   2020-01-27 15:50:48                   1
>>            1     0    1% |         1831                      ON CPU               2020-01-27 15:50:49   2020-01-27 15:50:49                   1
>>
>>
>> 11 rows selected.
>>
>>
>> SQL> @ashtop session_id,blocking_session,event2 session_id=1710
>> "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>>
>> no rows selected
>>
>> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <
>> nenad.noveljic_at_vontobel.com> wrote:
>>
>>> You’re using v0.2, but v0.6 should be the latest version. 0.2 wouldn’t
>>> show idle blockers.
>>>
>>> I looked at the ashtop output once again. Strange that neither lgwr
>>> nor any other bg processes show up significantly there. The question is,
>>> what was lgwr doing while the fg processes were waiting on him.
>>>
>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jan 30 2020 - 17:27:50 CET
