KTSJ / Wnnn

From: Dominic Brooks <dombrooks_at_hotmail.com>
Date: Thu, 15 Jul 2021 11:02:24 +0000
Message-ID: <DU2PR04MB8726ECE36EFBBC397103F462A1129_at_DU2PR04MB8726.eurprd04.prod.outlook.com>



I was observing a foreground process yesterday that was executing a series of batched updates from Java in a single thread, and it was running very slowly. Each element in the batch was updating a single row via a unique scan. The execution time of this feed was reported as having tripled since moving to 19.6.
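
For illustration only, each batch element boils down to something like the following (the table and column names here are made up; the real statement is bound and executed as a JDBC batch):

   -- Hypothetical shape of one batch element: a single-row update
   -- located via a unique index on the key column.
   UPDATE feed_detail
      SET amount     = :new_amount,
          updated_at = SYSTIMESTAMP
    WHERE feed_key   = :feed_key;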

Performance was atrocious. For example, from AWR over a period of 15-16 hours, an average-sized batch of a couple of hundred elements was averaging anywhere from 8 to over 100 seconds per execution, depending on the hour, with the vast majority of the time spent in cluster-related waits. Averages hide a whole bunch of detail, of course, but they are a useful indicator.
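
Those per-hour averages came out of AWR; a query along these lines is one way to get that per-execution view (a sketch, assuming hourly snapshots and that the update's SQL_ID is known; the *_DELTA time columns are in microseconds):

   SELECT sn.begin_interval_time,
          st.executions_delta,
          ROUND(st.elapsed_time_delta
                / NULLIF(st.executions_delta, 0) / 1e6, 1)   AS avg_secs_per_exec,
          ROUND(st.cluster_wait_time_delta
                / NULLIF(st.elapsed_time_delta, 0) * 100, 1) AS cluster_wait_pct
   FROM   dba_hist_sqlstat st
          JOIN dba_hist_snapshot sn
            ON  sn.snap_id         = st.snap_id
            AND sn.dbid            = st.dbid
            AND sn.instance_number = st.instance_number
   WHERE  st.sql_id = '&update_sql_id'   -- the batched update's SQL_ID
   ORDER  BY sn.begin_interval_time;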

I was observing from GV$SESSION and GV$ASH, and the source of the cluster waits seemed to be related to KTSJ slave activity. There was a strong correlation between the "two" (the Java update plus multiple active KTSJ slaves) working on the same datafile/blocks: a series of both sides waiting on gc buffer busy release, gc buffer busy acquire and gc current block busy, with the occasional cell single block physical read. Blocking-session information on some of the gc waits occasionally pointed at the other party (the update blocked by KTSJ, or vice versa).
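
Something along these lines against GV$ASH shows that correlation (a sketch only; the program filters for the Wnnn space management slaves and for the Java session are assumptions about this environment):

   SELECT sample_time, inst_id, session_id, program, event,
          current_file#, current_block#,
          blocking_inst_id, blocking_session
   FROM   gv$active_session_history
   WHERE  sample_time > SYSTIMESTAMP - INTERVAL '15' MINUTE
   AND    (   program LIKE '%(W0%)'              -- KTSJ / Wnnn slaves (assumed pattern)
           OR program LIKE 'JDBC Thin Client%')  -- the Java feed session (assumed program)
   AND    event LIKE 'gc%'
   ORDER  BY current_file#, current_block#, sample_time;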

Reading all the responses, the oracle-l thread from May 2020 on KTSJ was the best source of information I could find: https://www.freelists.org/post/oracle-l/reads-by-KTSJ,17 There are also a couple of bug references leading off from there, not all of which are relevant to my version (19.6), but which give some indication of what might be going on.

Per the explanations in the oracle-l thread, it seems to be the foreground session doing something which then prompts the background session to check/fix the ASSM information. But in my case, this fixing was causing significant contention back on the foreground session.

I ran snapper on some of the KTSJ slaves and, of the ASSM fix-related stats, "ASSM bg: slave fix state" was consistently around 5000 in a 5-second period. That is not a statistic I have any context for judging the value of.
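
Snapper reports deltas over V$SESSTAT (among other things), so the raw counters behind that figure can be seen with something like this (the Wnnn program filter is again an assumption):

   SELECT s.inst_id, s.sid, s.program, n.name, st.value
   FROM   gv$session   s
          JOIN gv$sesstat  st ON st.inst_id = s.inst_id  AND st.sid        = s.sid
          JOIN gv$statname n  ON n.inst_id  = st.inst_id AND n.statistic#  = st.statistic#
   WHERE  s.program LIKE '%(W0%)'     -- KTSJ / Wnnn slaves (assumed pattern)
   AND    n.name LIKE 'ASSM bg%'      -- includes 'ASSM bg: slave fix state'
   ORDER  BY s.inst_id, s.sid, n.name;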

This is a monthly feed, so it doesn't happen every day, but when it does it sits on a critical path. It has finished now, so there's not a lot I can look at if it's not in ASH. Obviously a next step is to try to reproduce this in a test environment.

I just wondered whether anyone had done any further investigation into this behaviour.

Cheers,
Dominic

