Re: intermittent long "log file sync" waits
Date: Sun, 2 Feb 2020 09:20:07 -0600
Message-ID: <CAEFL0szoZkns_vDHZBmX22BVCznDwG3LN+bLUQY3EuqrvoQqhw_at_mail.gmail.com>
I pulled over all the tables/views referenced in ashtop.sql and
ash_wait_chains.sql to a single-instance database, hence "ashtopLFS.sql":

SQL> _at_ashtopLFS username,event2,program2,blocking_session,blocking_inst_id
"program like '%(LG%)'" "TIMESTAMP'2020-01-29 13:04:00'"
"TIMESTAMP'2020-01-29 13:05:00'"

no rows selected

I'd already turned off adaptive log file sync:

SQL> _at_pd "_use_adaptive"
Show all parameters and session values from x$ksppi/x$ksppcv...

  NUM  N_HEX NAME                          VALUE  DESCRIPTION
----- ------ ----------------------------- ------ ------------------------------------------------
 2049    801 _use_adaptive_log_file_sync   FALSE  Adaptively switch between post/wait and polling

From the beginning we noticed slow writes to local disk (>40 ms at times).
On Friday, as I was running out of ideas, the other DBA on the project
suggested we create a ramdisk for /tmp because we had noticed lots of writes
to /tmp/.oracle. After doing that and rebooting all nodes, we are now unable
to reproduce the issue.

On Monday (or Tuesday, if I end up taking tomorrow off), we plan to remove
the ramdisk and see whether the issue reproduces.
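
To keep an eye on whether the symptom comes back once the ramdisk is removed,
something as simple as the query below, spooled every few minutes, should show
the log file sync latency distribution (just a rough sketch against the
standard gv$event_histogram view, nothing specific to this system):

select inst_id, wait_time_milli, wait_count
  from gv$event_histogram
 where event = 'log file sync'
 order by inst_id, wait_time_milli;
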
On Sat, Feb 1, 2020 at 9:34 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:
> Hi Chris,
>
> As 12c+ has log writer slaves and Oracle can adaptively switch from a single
> LGWR to LGWR + LGxx slaves (_use_single_log_writer=adaptive), can you check
> ASH data for the LGxx slaves at this time too? Not all process-to-process
> interactions/waits are instrumented with a blocking_session in Oracle...
>
> Should be this command:
>
> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id
> "program like '%(LG%)'" "TIMESTAMP'2020-01-29 13:04:00'"
> "TIMESTAMP'2020-01-29 13:05:00'"
>
> --
> Tanel Poder
> https://blog.tanelpoder.com/seminar
>
>
> On Thu, Jan 30, 2020 at 3:44 PM Chris Stephens <cstephens16_at_gmail.com>
> wrote:
>
>> Here's some additional ASH output from the trace data posted in the last
>> message, including event_hist.sql:
>>
>> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id
>> 1=1 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29 13:04:27'"
>> TOTALSECONDS  AAS  %This  USERNAME       EVENT2         PROGRAM2   BLOCKING_SESSION  BLOCKING_INST_ID  FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
>> ------------  ---  -----  -------------  -------------  ---------  ----------------  ----------------  -------------------  -------------------  -----------------
>>            2  0.3  50% |  SYS            ON CPU         (sqlplus)                                      2020-01-29 13:04:24  2020-01-29 13:04:25                  1
>>            1  0.1  25% |  GEN3_MGOWER_3  ON CPU         (python)                                       2020-01-29 13:04:22  2020-01-29 13:04:22                  1
>>            1  0.1  25% |  GEN3_MGOWER_3  log file sync  (python)               1710                 2  2020-01-29 13:04:20  2020-01-29 13:04:20                  1
>>
>>
>> SQL> _at_bg lgwr
>> NAME   DESCRIPTION   SID    OPID  SPID   PADDR             SADDR
>> -----  ------------  -----  ----  -----  ----------------  ----------------
>> LGWR   Redo etc.     1710   34    26552  00000001E8718860  00000001D8BFF4A0
>>
>> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id
>> session_id=1710 "TIMESTAMP'2020-01-29 13:04:20'" "TIMESTAMP'2020-01-29
>> 13:04:27'"
>>
>> no rows selected
>>
>> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id
>> session_id=1710 "TIMESTAMP'2020-01-29 13:04:00'" "TIMESTAMP'2020-01-29
>> 13:05:00'"
>>
>> no rows selected
>>
>> SQL> _at_ash/event_hist "log file" 1=1 "TIMESTAMP'2020-01-29 13:04:00'"
>> "TIMESTAMP'2020-01-29 13:05:00'"
>> EVH_EVENT      EVH_MILLISEC  EVH_SAMPLE_COUNT  EVH_EST_EVENT_COUNT  EVH_EST_TOTAL_SEC  PCT_EVT_TIME  EVH_GRAPH     FIRST_SEEN           LAST_SEEN
>> -------------  ------------  ----------------  -------------------  -----------------  ------------  ------------  -------------------  -------------------
>> log file sync  < 1024                       1                  1.8                1.8          12.8  |#         |  2020-01-29 13:04:13  2020-01-29 13:04:13
>> log file sync  < 4096                       1                    1                4.1          29.1  |###       |  2020-01-29 13:04:12  2020-01-29 13:04:12
>> log file sync  < 8192                       1                    1                8.2          58.2  |######    |  2020-01-29 13:04:20  2020-01-29 13:04:20
>>
>> SQL> _at_ash/event_hist "write" 1=1 "TIMESTAMP'2020-01-29 13:04:00'"
>> "TIMESTAMP'2020-01-29 13:05:00'"
>>
>> no rows selected
>>
>> SQL> _at_ash/event_hist "*" 1=1 "TIMESTAMP'2020-01-29 13:04:00'"
>> "TIMESTAMP'2020-01-29 13:05:00'"
>> EVH_EVENT       EVH_MILLISEC  EVH_SAMPLE_COUNT  EVH_EST_EVENT_COUNT  EVH_EST_TOTAL_SEC  PCT_EVT_TIME  EVH_GRAPH     FIRST_SEEN           LAST_SEEN
>> --------------  ------------  ----------------  -------------------  -----------------  ------------  ------------  -------------------  -------------------
>> log file sync   < 1024                       1                  1.8                1.8          12.8  |#         |  2020-01-29 13:04:13  2020-01-29 13:04:13
>> log file sync   < 4096                       1                    1                4.1          29.1  |###       |  2020-01-29 13:04:12  2020-01-29 13:04:12
>> log file sync   < 8192                       1                    1                8.2          58.2  |######    |  2020-01-29 13:04:20  2020-01-29 13:04:20
>> row cache lock  < 1                          1               2469.1                2.5           100  |##########|  2020-01-29 13:04:01  2020-01-29 13:04:01
>>
>> SQL> _at_ashtop username,event2,program2,blocking_session,blocking_inst_id
>> "lower(program) like '%lg%'" "TIMESTAMP'2020-01-29 13:04:00'"
>> "TIMESTAMP'2020-01-29 13:05:00'"
>>
>> no rows selected
>>
>> I'm going to copy gv$active_session_history over to a single-instance
>> database and take a look at ash_wait_chains, but I don't expect that to
>> reveal anything either.
>>
>> I need to confirm whether adding redo resolved the issue (I'll have a
>> developer fire up the workload), but I think the next step is to
>> investigate x$trace and possible process communication issues?
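>>
>> For the ASH copy, a quick-and-dirty snapshot could be as simple as the
>> sketch below (the table name is made up, and it would still need to be
>> moved over via datapump or a database link):
>>
>> create table ash_snap_20200129 as
>> select * from gv$active_session_history
>>  where sample_time between timestamp'2020-01-29 13:00:00'
>>                        and timestamp'2020-01-29 13:10:00';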
>>
>> On Thu, Jan 30, 2020 at 10:27 AM Chris Stephens <cstephens16_at_gmail.com>
>> wrote:
>>
>>> Hey Tanel,
>>>
>>> Thanks for taking the time to write this up! I will definitely be
>>> attending that webinar.
>>>
>>> Comments inline.
>>>
>>> On Wed, Jan 29, 2020 at 10:32 PM Tanel Poder <tanel_at_tanelpoder.com>
>>> wrote:
>>>
>>>> Comments:
>>>>
>>>> 1) First, looks like you're using all the right tools for systematic
>>>> troubleshooting ;-)
>>>>
>>>
>>> ***i try!
>>>
>>>>
>>>> 2) As "log file sync" is not directly an I/O wait event (the I/O happens
>>>> indirectly under the hood) but rather a "wait for LGWR to ACK/increment the
>>>> redo persisting position", there may be multiple reasons for delays. I'm
>>>> repeating some of what others have said here too:
>>>> - LGWR could be waiting for a remote ACK (synchronous Data Guard) -
>>>> that should show up as an LNS* wait or something like that though
>>>> - Process slowness due to CPU overload/scheduling latency or swapping
>>>> (in both cases, LGWR may do its work fast, but it takes a while for the
>>>> foreground process to wake up and end the wait event) - you could check if
>>>> there are any reports about switching the LGWR mode in the alert.log around
>>>> that time (see the query sketch after this list)
>>>> - Bug/missed post due to switching between LGWR progress polling vs.
>>>> waiting for LGWR post
>>>> - ASM instance communication issues (ASM communication can be on the
>>>> critical path when creating/changing files on ASM and even for reading,
>>>> when the ASM metadata cache in your local instance doesn't have what it
>>>> needs)
>>>> - So, the physical redo log file disk write is not necessarily the
>>>> biggest contributor to the log file sync wait
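>>>>
>>>> For the alert.log check mentioned above: if V$DIAG_ALERT_EXT is available
>>>> on your version, something roughly like this sketch could be used (the
>>>> message-text filter is only an illustration; the exact wording of the
>>>> adaptive log file sync switch messages varies):
>>>>
>>>> select originating_timestamp, message_text
>>>>   from v$diag_alert_ext
>>>>  where originating_timestamp > systimestamp - interval '1' day
>>>>    and lower(message_text) like '%log file sync%';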
>>>>
>>> *** No Data Guard in the env.
>>> *** We have excess CPU (20 cores/instance and usually only 1-5 active
>>> foreground processes, but I will confirm from the TFA collections).
>>> *** Yesterday I was ready to blame a bug; today, not so sure. I checked
>>> log file switches with a script Google found me a few days ago. Now I'm
>>> thinking that was inaccurate, as a script from metastink shows tons of
>>> switching, so that could very well be related.
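>>>
>>> A simple way to sanity-check the switch rate is something like this rough
>>> query against v$log_history (a sketch, not the script mentioned above):
>>>
>>> select to_char(first_time, 'YYYY-MM-DD HH24') hr, count(*) switches
>>>   from v$log_history
>>>  group by to_char(first_time, 'YYYY-MM-DD HH24')
>>>  order by 1;
>>>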
>>> *** ASM communication is something I need to look into. TFA has reported
>>> some ASM issues, but there are no details and I never saw anything in the
>>> ASM alert logs indicating an issue.
>>>
>>>
>>>> 3) You'll need to measure what LGWR itself was doing during the problem
>>>> time
>>>> - Since you're on 19c (12c or later) the LGWR has slaves too and they
>>>> may be doing most of the work
>>>> - You'll need to look into both the LGWR process and any other LG*
>>>> ones too
>>>> - Since you have (Oracle) Snapper running already and writing to
>>>> trace you should have the data already
>>>> - The *lgwr* in the following _at_snapper ash,stats,trace 10 999999 lgwr
>>>> syntax includes LGWR slaves too
>>>> - As LGWR slaves have their own sessions (and SIDs), you should run
>>>> ashtop etc and report if there's any BLOCKING_SESSION value listed for LGWR
>>>> and slaves
>>>> - If LGWR and the slaves were listed *idle* during the problem time
>>>> (while others waited for log file sync event at the same time) *and
>>>> if *there was no huge OS level CPU/memory shortage at the time, this
>>>> looks like a background <-> foreground communication problem.
>>>> - This (and any potential ASM issues) could probably be
>>>> troubleshot via KST tracing, KSR channels and X$TRACE (some of it is
>>>> enabled by default), like I showed in a previous hacking session:
>>>> https://www.youtube.com/watch?v=mkmvZv58W6w
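>>>>
>>>> A quick way to list the LGWR process and any LG* slave sessions (so their
>>>> SIDs can be fed to ashtop/snapper) might be something like this sketch:
>>>>
>>>> select inst_id, sid, serial#, program
>>>>   from gv$session
>>>>  where program like '%(LG%)'
>>>>  order by inst_id, program;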
>>>>
>>>>
>>> Here is all the output I have corresponding to a 6-second "log file sync"
>>> wait:
>>>
>>> Experience duration:   R = t1 − t0 = 41.336137 seconds
>>> Experience begin time: t0 = 2020-01-29T13:04:07.560478
>>> Experience end time:   t1 = 2020-01-29T13:04:48.896615
>>> CALL-NAME                        DURATION       %   CALLS      MEAN       MIN       MAX
>>> ------------------------------  ---------  ------  ------  --------  --------  --------
>>> SQL*Net message from client     26.017105   66.5%   9,886  0.002632  0.000209  7.858051
>>> log file sync                   10.582409   27.1%     150  0.070549  0.000295  6.909711
>>> EXEC                             2.088757    5.3%   9,702  0.000215  0.000000  0.209199
>>> DLM cross inst call completion   0.109617    0.3%     212  0.000517  0.000005  0.080575
>>> row cache lock                   0.070118    0.2%     125  0.000561  0.000208  0.000972
>>> FETCH                            0.047586    0.1%   1,029  0.000046  0.000000  0.003056
>>> PARSE                            0.045293    0.1%   4,778  0.000009  0.000000  0.002692
>>> CLOSE                            0.033453    0.1%   4,784  0.000007  0.000000  0.000237
>>> library cache lock               0.023614    0.1%      55  0.000429  0.000152  0.000901
>>> library cache pin                0.022311    0.1%      55  0.000406  0.000153  0.000689
>>> 17 others                        0.071430    0.2%  10,772  0.000007  0.000000  0.001487
>>> ------------------------------  ---------  ------  ------  --------  --------  --------
>>> TOTAL (27)                      39.111693  100.0%  41,548  0.000941  0.000000  7.858051
>>>
>>> The 6-second wait from the raw trace data:
>>>
>>> *** 2020-01-29T13:04:20.724666-06:00
>>> WAIT #0: nam='log file sync' ela= 6909711 buffer#=58996 sync
>>> scn=976089279 p3=0 obj#=-1 tim=1025228531958
>>> WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536
>>> #bytes=1 p3=0 obj#=-1 tim=1025228532118
>>> WAIT #0: nam='SQL*Net message from client' ela= 1711 driver
>>> id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1025228533852
>>> BINDS #140204587734864:
>>>
>>> I enabled 10046 tracing on LGWR:
>>>
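>>> (Background-process tracing like this is typically enabled from a SYSDBA
>>> session via oradebug - e.g., using the LGWR OS PID from the _at_bg output
>>> above:
>>>
>>> oradebug setospid 26552
>>> oradebug event 10046 trace name context forever, level 8
>>>
>>> and the same event with "context off" to stop it again.)
>>>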
>>> *** 2020-01-29T13:04:20.725994-06:00
>>> WAIT #0: nam='rdbms ipc message' ela= 2918443 timeout=300 p2=0 p3=0
>>> obj#=-1 tim=1025228533313
>>> WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=17
>>> requests=2 obj#=-1 tim=1025228533777
>>> WAIT #0: nam='log file parallel write' ela= 171 files=1 blocks=3
>>> requests=2 obj#=-1 tim=1025228534104
>>> WAIT #0: nam='LGWR wait for redo copy' ela= 22 copy latch #=28 p2=0 p3=0
>>> obj#=-1 tim=1025228534182
>>> WAIT #0: nam='log file parallel write' ela= 183 files=1 blocks=2
>>> requests=2 obj#=-1 tim=1025228534437
>>> WAIT #0: nam='log file parallel write' ela= 174 files=1 blocks=2
>>> requests=2 obj#=-1 tim=1025228534721
>>> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=8 p2=0 p3=0 obj#=-1
>>> tim=1025228534786
>>> WAIT #0: nam='rdbms ipc message' ela= 3233 timeout=8 p2=0 p3=0 obj#=-1
>>> tim=1025228538038
>>> WAIT #0: nam='log file parallel write' ela= 170 files=1 blocks=2
>>> requests=1 obj#=-1 tim=1025228538297
>>> WAIT #0: nam='rdbms ipc message' ela= 25 timeout=8 p2=0 p3=0 obj#=-1
>>> tim=1025228538358
>>> WAIT #0: nam='rdbms ipc message' ela= 11829 timeout=8 p2=0 p3=0 obj#=-1
>>> tim=1025228550204
>>> WAIT #0: nam='LGWR wait for redo copy' ela= 52 copy latch #=8 p2=0 p3=0
>>> obj#=-1 tim=1025228550294
>>> WAIT #0: nam='log file parallel write' ela= 239 files=1 blocks=7
>>> requests=1 obj#=-1 tim=1025228550624
>>> WAIT #0: nam='rdbms ipc message' ela= 23 timeout=6 p2=0 p3=0 obj#=-1
>>> tim=1025228550699
>>> WAIT #0: nam='rdbms ipc message' ela= 16750 timeout=6 p2=0 p3=0 obj#=-1
>>> tim=1025228567467
>>> WAIT #0: nam='log file parallel write' ela= 207 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228567762
>>> WAIT #0: nam='rdbms ipc message' ela= 26 timeout=5 p2=0 p3=0 obj#=-1
>>> tim=1025228567825
>>> WAIT #0: nam='rdbms ipc message' ela= 16249 timeout=5 p2=0 p3=0 obj#=-1
>>> tim=1025228584091
>>> WAIT #0: nam='log file parallel write' ela= 213 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228584400
>>> WAIT #0: nam='rdbms ipc message' ela= 48 timeout=3 p2=0 p3=0 obj#=-1
>>> tim=1025228584502
>>> WAIT #0: nam='rdbms ipc message' ela= 16466 timeout=3 p2=0 p3=0 obj#=-1
>>> tim=1025228600993
>>> WAIT #0: nam='log file parallel write' ela= 228 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228601393
>>> WAIT #0: nam='rdbms ipc message' ela= 29 timeout=1 p2=0 p3=0 obj#=-1
>>> tim=1025228601482
>>> WAIT #0: nam='rdbms ipc message' ela= 9548 timeout=1 p2=0 p3=0 obj#=-1
>>> tim=1025228611053
>>> WAIT #0: nam='rdbms ipc message' ela= 7467 timeout=300 p2=0 p3=0 obj#=-1
>>> tim=1025228618574
>>> WAIT #0: nam='LGWR wait for redo copy' ela= 13 copy latch #=8 p2=0 p3=0
>>> obj#=-1 tim=1025228618636
>>> WAIT #0: nam='log file parallel write' ela= 222 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228618960
>>> WAIT #0: nam='rdbms ipc message' ela= 31 timeout=300 p2=0 p3=0 obj#=-1
>>> tim=1025228619066
>>> WAIT #0: nam='rdbms ipc message' ela= 15946 timeout=300 p2=0 p3=0
>>> obj#=-1 tim=1025228635045
>>> WAIT #0: nam='log file parallel write' ela= 240 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228635427
>>> WAIT #0: nam='rdbms ipc message' ela= 33 timeout=298 p2=0 p3=0 obj#=-1
>>> tim=1025228635525
>>> WAIT #0: nam='rdbms ipc message' ela= 15987 timeout=298 p2=0 p3=0
>>> obj#=-1 tim=1025228651537
>>> WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228651912
>>> WAIT #0: nam='rdbms ipc message' ela= 51 timeout=296 p2=0 p3=0 obj#=-1
>>> tim=1025228652034
>>> WAIT #0: nam='rdbms ipc message' ela= 16332 timeout=296 p2=0 p3=0
>>> obj#=-1 tim=1025228668405
>>> WAIT #0: nam='log file parallel write' ela= 256 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228668804
>>> WAIT #0: nam='rdbms ipc message' ela= 37 timeout=295 p2=0 p3=0 obj#=-1
>>> tim=1025228668911
>>> WAIT #0: nam='rdbms ipc message' ela= 16969 timeout=295 p2=0 p3=0
>>> obj#=-1 tim=1025228685907
>>> WAIT #0: nam='log file parallel write' ela= 273 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025228686327
>>> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=293 p2=0 p3=0 obj#=-1
>>> tim=1025228686453
>>> WAIT #0: nam='rdbms ipc message' ela= 6908 timeout=293 p2=0 p3=0 obj#=-1
>>> tim=1025228693390
>>> WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=2
>>> requests=1 obj#=-1 tim=1025228693768
>>> WAIT #0: nam='rdbms ipc message' ela= 38 timeout=292 p2=0 p3=0 obj#=-1
>>> tim=1025228693862j
>>> ...more of the same
>>> 264396 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=52 p2=0 p3=0
>>> obj#=-1 tim=1025231090742
>>> 264397 WAIT #0: nam='rdbms ipc message' ela= 19460 timeout=52 p2=0 p3=0
>>> obj#=-1 tim=1025231110227
>>> 264398 WAIT #0: nam='log file parallel write' ela= 272 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025231110633
>>> 264399 WAIT #0: nam='rdbms ipc message' ela= 37 timeout=50 p2=0 p3=0
>>> obj#=-1 tim=1025231110748
>>> 264400 WAIT #0: nam='rdbms ipc message' ela= 15484 timeout=50 p2=0 p3=0
>>> obj#=-1 tim=1025231126261
>>> 264401 WAIT #0: nam='log file parallel write' ela= 247 files=1 blocks=4
>>> requests=1 obj#=-1 tim=1025231126641
>>> 264402 WAIT #0: nam='rdbms ipc message' ela= 36 timeout=49 p2=0 p3=0
>>> obj#=-1 tim=1025231126753
>>> 264403 WAIT #0: nam='rdbms ipc message' ela= 16732 timeout=49 p2=0 p3=0
>>> obj#=-1 tim=1025231143514
>>> 264404 WAIT #0: nam='LGWR wait for redo copy' ela= 31 copy latch #=8
>>> p2=0 p3=0 obj#=-1 tim=1025231143603
>>> 264405 WAIT #0: nam='log file parallel write' ela= 279 files=1 blocks=9
>>> requests=1 obj#=-1 tim=1025231143976
>>> 264406 WAIT #0: nam='rdbms ipc message' ela= 35 timeout=47 p2=0 p3=0
>>> obj#=-1 tim=1025231144093
>>> 264407 WAIT #0: nam='rdbms ipc message' ela= 5384 timeout=47 p2=0 p3=0
>>> obj#=-1 tim=1025231149515
>>> 264408 WAIT #0: nam='log file parallel write' ela= 255 files=1 blocks=4
>>> requests=1 obj#=-1 tim=1025231149889
>>> 264409 WAIT #0: nam='rdbms ipc message' ela= 32 timeout=47 p2=0 p3=0
>>> obj#=-1 tim=1025231149989
>>> 264410
>>> 264411 *** 2020-01-29T13:04:23.802774-06:00
>>> 264412 WAIT #0: nam='rdbms ipc message' ela= 460056 timeout=46 p2=0
>>> p3=0 obj#=-1 tim=1025231610086
>>> 264413
>>> 264414 *** 2020-01-29T13:04:26.797925-06:00
>>> 264415 WAIT #0: nam='rdbms ipc message' ela= 2995027 timeout=300 p2=0
>>> p3=0 obj#=-1 tim=1025234605238
>>> 264416 WAIT #0: nam='rdbms ipc message' ela= 10649 timeout=1 p2=0 p3=0
>>> obj#=-1 tim=1025234616054
>>> 264417
>>>
>>> I will get you the snapper LGWR output for this interval as soon as I can
>>> find it (can't remember where I stashed it and I lost the trace file
>>> name... fml).
>>>
>>> 4) Paste the full *Snapper* output (of one time snapshot from the
>>>> problem time) here if you can - we'll be able to see what the slaves were
>>>> doing
>>>> - You'd need to include not only LGWR SID but also any LG* slave SIDs
>>>> too
>>>> - Also we'll see better LGWR delay metrics from V$SESSTAT that go
>>>> beyond just wait events
>>>> - Thinking of metrics like these - snapper already handles them (if
>>>> you look into the rightmost column)
>>>> - *redo synch time overhead (usec)* - "FG wakeup overhead per log
>>>> file sync"
>>>> - *redo write time* - accounting end to end redo write time (not
>>>> just the log file parallel write syscall) "per redo write"
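>>>>
>>>> Those two statistics can also be pulled straight from V$SESSTAT - a rough
>>>> sketch (statistic names as spelled in V$STATNAME on recent versions):
>>>>
>>>> select s.sid, s.program, sn.name, ss.value
>>>>   from v$sesstat ss
>>>>   join v$statname sn on sn.statistic# = ss.statistic#
>>>>   join v$session  s  on s.sid = ss.sid
>>>>  where sn.name in ('redo write time', 'redo synch time overhead (usec)')
>>>>    and ss.value > 0
>>>>  order by sn.name, ss.value desc;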
>>>>
>>>> 5) Regarding the *ASH wait chains* error on 19c - I'm aware of this
>>>> but forgot to look into it
>>>> - It has (probably) something to do with GV$ views (or PX slaves
>>>> used for PDB V$ queries) and possibly the connect by I'm using
>>>> - Does it work in the CDB (if multitenant)?
>>>> - Does it work with lower optimizer features enabled?
>>>> - As we are already manually looking into what LGWR and its slaves
>>>> are doing, there's no need for the wait chains script here
>>>>
>>> *** Oracle has filed a bug. Altering optimizer_features_enabled doesn't
>>> help.
>>>
>>>
>>>> 6) Linux Process Snapper - pSnapper
>>>> - If you want to see the full "wait" states of Linux
>>>> processes/threads, use the* -a *option.
>>>> - It's somewhat like always showing "idle" threads too
>>>> - But for Linux apps it's harder to determine what's idle and what's
>>>> some network RPC wait, so I currently just do what the Linux "load" metric
>>>> does
>>>> - It just shows threads in Runnable and Disk/Uninterruptible sleep
>>>> states by default (-a shows all states, including Sleep and kernel-idle
>>>> threads etc.)
>>>> - So the Linux pSnapper "idle" means a somewhat different thing than
>>>> Oracle "idle" - say LGWR was waiting for a Data Guard ACK, that should not
>>>> be shown as Idle in Oracle, but at the Linux process level it's "just some
>>>> network socket wait" and you'd need to use *-a* option to show all of
>>>> them
>>>>
>>>
>>> *** will add that option to future collections (and watch the video/read
>>> help)
>>>
>>>>
>>>> 7) Were LGWR/LG* processes mostly idle during the problem time?
>>>> - As I kept reading the thread, it looks like ASH reported LGWR
>>>> being mostly idle during the problem time
>>>> - This is not exactly the same pattern, but I've seen cases in the
>>>> past where everyone's waiting for LGWR and LGWR doesn't seem to be busy and
>>>> doesn't show any significant waits
>>>> - It turned out that LGWR was actually waiting for CKPT to make
>>>> progress, but it wasn't waiting synchronously - it was just going to sleep
>>>> and "checking again later" in a loop
>>>> - I wrote about this here:
>>>> https://blog.tanelpoder.com/posts/log-file-switch-checkpoint-incomplete-and-lgwr-waiting-for-checkpoint-progress/
>>>> - So it may still make sense to check what LGWR/LG* processes are
>>>> doing even if they don't seem to be active much (like do you see an
>>>> occasional controlfile read wait by LGWR etc)
>>>> - SQL*Trace would capture all activity of LGWR, but may slow things
>>>> down further, so you could also sample & spool V$SESSION_WAIT_HISTORY
>>>> (which keeps the last 10 waits of every session in memory); see the
>>>> sketch below
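>>>>
>>>> A sampling sketch for that view (run repeatedly / spooled; the SID here is
>>>> LGWR's from the _at_bg output, adjust as needed):
>>>>
>>>> select sid, seq#, event, wait_time, p1, p2, p3
>>>>   from v$session_wait_history
>>>>  where sid = 1710
>>>>  order by seq#;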
>>>>
>>> *** I think I've addressed most of this.
>>>
>>>
>>>> Incidentally, I'm doing another hacking session tomorrow (Thursday 30
>>>> Jan _at_ 12pm ET) about complex Oracle performance problems that span multiple
>>>> layers and have multiple moving parts. I can add some LGWR troubleshooting
>>>> magic at the end of that (there goes my plan to fit the session in 1.5h
>>>> again!). I'll post the details on my blog in a few minutes.
>>>>
>>>> *** can't wait!!
>>>
>>>> --
>>>> Tanel Poder
>>>> https://blog.tanelpoder.com/seminar
>>>>
>>>>
>>>> On Tue, Jan 28, 2020 at 2:22 PM Chris Stephens <cstephens16_at_gmail.com>
>>>> wrote:
>>>>
>>>>> Doesn't look like there's any ASH data for LGWR:
>>>>>
>>>>> SQL> _at_ashtop session_id,blocking_session,event2 1=1
>>>>> "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>>>>> TOTALSECONDS  AAS  %This  SESSION_ID  BLOCKING_SESSION  EVENT2              FIRST_SEEN           LAST_SEEN            DIST_SQLEXEC_SEEN
>>>>> ------------  ---  -----  ----------  ----------------  ------------------  -------------------  -------------------  -----------------
>>>>>           81  0.7  82% |        1713              1710  log file sync       2020-01-27 15:50:54  2020-01-27 15:52:14                  1
>>>>>            5    0   5% |        1221                    ON CPU              2020-01-27 15:50:59  2020-01-27 15:52:14                  1
>>>>>            3    0   3% |         246                    ON CPU              2020-01-27 15:51:00  2020-01-27 15:52:00                  1
>>>>>            3    0   3% |        1713                    ON CPU              2020-01-27 15:50:51  2020-01-27 15:52:27                  2
>>>>>            1    0   1% |         123                    ON CPU              2020-01-27 15:51:35  2020-01-27 15:51:35                  1
>>>>>            1    0   1% |         127                    ON CPU              2020-01-27 15:51:50  2020-01-27 15:51:50                  1
>>>>>            1    0   1% |         252              2321  latch free          2020-01-27 15:52:28  2020-01-27 15:52:28                  1
>>>>>            1    0   1% |         978                    ges remote message  2020-01-27 15:50:48  2020-01-27 15:50:48                  1
>>>>>            1    0   1% |         983                    latch free          2020-01-27 15:52:28  2020-01-27 15:52:28                  1
>>>>>            1    0   1% |        1713                    library cache lock  2020-01-27 15:50:48  2020-01-27 15:50:48                  1
>>>>>            1    0   1% |        1831                    ON CPU              2020-01-27 15:50:49  2020-01-27 15:50:49                  1
>>>>>
>>>>>
>>>>> 11 rows selected.
>>>>>
>>>>>
>>>>> SQL> _at_ashtop session_id,blocking_session,event2 session_id=1710
>>>>> "TIMESTAMP'2020-01-27 15:50:48'" "TIMESTAMP'2020-01-27 15:52:40'"
>>>>>
>>>>> no rows selected
>>>>>
>>>>> On Tue, Jan 28, 2020 at 1:09 PM Noveljic Nenad <
>>>>> nenad.noveljic_at_vontobel.com> wrote:
>>>>>
>>>>>> You’re using v0.2, but v0.6 should be the latest version. 0.2
>>>>>> wouldn’t show idle blockers.
>>>>>>
>>>>>> I looked at the ashtop output once again. Strange that neither the lgwr
>>>>>> process nor any other bg processes show up significantly there. The
>>>>>> question is what lgwr was doing while the fg processes were waiting on it.
>>>>>>
>>>>>>
--
http://www.freelists.org/webpage/oracle-l

Received on Sun Feb 02 2020 - 16:20:07 CET