KTSJ / Wnnn
Date: Thu, 15 Jul 2021 11:02:24 +0000
Message-ID: <DU2PR04MB8726ECE36EFBBC397103F462A1129_at_DU2PR04MB8726.eurprd04.prod.outlook.com>
I was observing a foreground process yesterday which was running a series of batched updates from Java in a single thread, and it was running very slowly. Each element in the batch was updating a single row via a unique scan. The execution time of this feed was reported as having tripled since moving to 19.6.
Performance was atrocious. For example, from AWR over a period of 15-16 hours, an average-sized batch of a couple of hundred elements was averaging anywhere between 8 and over 100 seconds per execution, hour by hour, with the vast majority of the time in cluster-related waits. Averages hide a whole bunch of detail of course, but they're a useful indicator.
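For anyone wanting to pull the same per-execution numbers, something along these lines is what I mean (a minimal sketch only – :sql_id stands in for the update statement's SQL_ID, which I haven't included here): average elapsed seconds per execution per AWR snapshot from DBA_HIST_SQLSTAT.

  select sn.begin_interval_time,
         st.executions_delta,
         round(st.elapsed_time_delta / nullif(st.executions_delta, 0) / 1e6, 1) avg_ela_secs
  from   dba_hist_snapshot sn,
         dba_hist_sqlstat  st
  where  st.sql_id          = :sql_id
  and    sn.snap_id         = st.snap_id
  and    sn.dbid            = st.dbid
  and    sn.instance_number = st.instance_number
  order  by sn.begin_interval_time;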
I was observing from GV$SESSION and GV$ASH, and the source of the cluster waits seems to be related to KTSJ slave activity. There was a strong correlation of the “two” (the Java update plus multiple active KTSJ slaves) working on the same datafile/blocks – a series of the two doing gc buffer busy release, gc buffer busy acquire and gc current block busy, with the occasional cell single block physical read. The blocking session information on some of the gc waits occasionally pointed at the other party (the update blocked by KTSJ, or vice versa).
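The sort of ASH cross-check I mean is roughly the following (a sketch, not the exact query – the time window and filters are placeholders): for the gc waits, group by file/block, program and blocker, which is where the Java session and the Wnnn slaves show up against the same blocks.

  select program,
         event,
         p1 file#,
         p2 block#,
         blocking_session,
         count(*) samples
  from   gv$active_session_history
  where  sample_time > sysdate - 1/24     -- placeholder window
  and    event like 'gc%'
  group  by program, event, p1, p2, blocking_session
  order  by samples desc;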
Reading all the responses, the oracle-l thread from May 2020 on KTSJ was the best source of information I could find: https://www.freelists.org/post/oracle-l/reads-by-KTSJ,17 and, from there, a couple of bug references, not all of which are relevant to my version (19.6) but which give indications of what might be going on:
- blocks are not marked as free in assm after delete - 12.2 and later (Doc ID 30265523.8)
- performance degradation by w00 processes after applying july 2020 dbru (Doc ID 32075777.8) superseded by
- force full repair enabled by fix control and populate repair list even if _assm_segment_repair_bg=false (Doc ID 32234161.8), which mentions the parameter _assm_segment_repair_bg (see the sketch just below this list for checking it)
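For anyone checking their own system, the current value of that underscore parameter can be seen via the X$ views as SYS – a minimal sketch below (and, as always, underscore parameters shouldn't be touched without Oracle Support's blessing):

  select i.ksppinm  name,
         v.ksppstvl value,
         v.ksppstdf is_default
  from   x$ksppi  i,
         x$ksppcv v
  where  i.indx    = v.indx
  and    i.ksppinm = '_assm_segment_repair_bg';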
Per the explanations in the oracle-l thread, it seems to be the foreground session doing something which then prompts the background sessions to check/fix the ASSM information. But in my case, this fixing is causing significant contention back on the foreground session.
I ran snapper on some of the KTSJ slaves and, of the ASSM fix-related statistics, “ASSM bg: slave fix state” was consistently around 5000 in a 5 second period. That is not a statistic I have any context for judging the value of.
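If snapper isn't to hand, the same counters can be sampled directly from GV$SESSTAT for the Wnnn sessions – a minimal sketch (the program filter is rough); sample it twice and diff the values to get a rate, which is effectively what snapper's 5-second deltas were giving me:

  select s.inst_id, s.sid, st.name, ss.value
  from   gv$session  s,
         gv$sesstat  ss,
         gv$statname st
  where  ss.inst_id    = s.inst_id
  and    ss.sid        = s.sid
  and    st.inst_id    = ss.inst_id
  and    st.statistic# = ss.statistic#
  and    st.name      like 'ASSM bg%'
  and    s.program    like '%(W0%'        -- space management slaves
  order  by ss.value desc;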
This is a monthly feed so it doesn't happen every day, but when it does it sits on a critical path. It has finished now, so there's not a lot left for me to look at beyond what's in ASH. Obviously a next step is to try to reproduce this in a test environment.
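My rough starting point for that (a skeleton only, not verified – table, volumes and tablespace are made up) is to mimic the pattern suggested by Doc ID 30265523.8: delete heavily from an ASSM segment and then drive single-row updates against it, watching the Wnnn slaves and the ASSM bg statistics as above.

  create table repro_assm
  tablespace users                              -- any ASSM tablespace
  as
  select level id, rpad('x', 200, 'x') pad
  from   dual
  connect by level <= 1e6;

  create unique index repro_assm_pk on repro_assm (id);

  delete from repro_assm where mod(id, 10) > 0; -- leave the segment sparsely populated
  commit;

  -- then, from a separate session, run the batched single-row updates, e.g.
  --   update repro_assm set pad = rpad('y', 200, 'y') where id = :id;
  -- while monitoring GV$SESSION / snapper for W00n activity.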
I just wondered whether anyone had done any further investigation into this behaviour.
Cheers,
Dominic