RE: latch: cache buffers chains
Date: Wed, 6 Oct 2021 10:52:23 -0400
Message-ID: <2dbd01d7bac1$c5c92ff0$515b8fd0$_at_rsiz.com>
I think the parallel force-local setting (parallel_force_local) is not required to be identical on primary and standby. If it differs in the obvious way between primary and standby, that could explain much hilarity.
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Jonathan Lewis
Sent: Saturday, October 02, 2021 11:38 AM
To: ORACLE-L (oracle-l_at_freelists.org)
Subject: Re: latch: cache buffers chains
The most interesting difference between the two plans is that the primary plan does a serial scan of the GTT (Execs = 1), while the standby plan shows 16 PX slaves attempting to do a parallel scan: Execs = 16 for the PX Block Iterator at operation 10 and Execs = 396 for the table access full at operation 11. That suggests something is broken for parallel scans of GTTs.
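As a cross-check, here is a sketch of pulling the rowsource execution stats with DBMS_XPLAN (&sql_id and &child_no are placeholders; with PX you need the 'ALLSTATS ALL' format rather than 'ALLSTATS LAST' so the slave executions appear, and Starts/A-Rows are only populated with statistics_level = all or a gather_plan_statistics hint):

-- &sql_id and &child_no are placeholders for the statement under investigation
select * from table(dbms_xplan.display_cursor('&sql_id', &child_no, 'ALLSTATS ALL'));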
There are more oddities when looking at the stats.
For example,
egrep -e "index scans kdiixs1" -e"table scan rows" -e"callback" snapper_stats.sql
29 of the 32 sessions report matching results like:
4297 _at_4, STAT, table scan rows gotten , 25811, 403.64, , , , , ~ per execution
4297 _at_4, STAT, index scans kdiixs1 , 25811, 403.64, , , , , ~ per execution
4508 _at_4, STAT, table scan rows gotten , 25801, 403.48, , , , , ~ per execution
4508 _at_4, STAT, index scans kdiixs1 , 25801, 403.48, , , , , ~ per execution
4614 _at_4, STAT, table scan rows gotten , 77421, 1.21k, , , , , ~ per execution
4614 _at_4, STAT, index scans kdiixs1 , 77421, 1.21k, , , , , ~ per execution
Why does every row gotten have a matching index range scan? (Maybe something has gone wrong with the "adaptive" features and feeding the statistics collector operation.) And why do 3 of the PX slaves not report either statistic in that minute - is something jammed? And why have (at least) 29 slaves been involved in tablescanning the GTT when you're running parallel 16?
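One way to see how many PX servers have actually attached to the coordinator while the query is running is a sketch against v$px_session (&qcsid is a placeholder for the coordinator's SID; use gv$px_session on RAC):

select qcsid, server_group, server_set, sid, degree, req_degree
from   v$px_session
where  qcsid = &qcsid
order by server_group, server_set, sid;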
Another example (there are many):
4 _at_4, STAT, physical read total multi block requests , 678, 10.6, , , , , 342 total single block read requests
4 _at_4, STAT, table scan blocks gotten , 768, 12.01, , , , , ~ per execution
111 _at_4, STAT, physical read total multi block requests , 678, 10.6, , , , , 319 total single block read requests
111 _at_4, STAT, table scan blocks gotten , 768, 12.01, , , , , ~ per execution
217 _at_4, STAT, physical read total multi block requests , 673, 10.52, , , , , 314 total single block read requests
217 _at_4, STAT, table scan blocks gotten , 762, 11.92, , , , , ~ per execution
Why are multiblock read requests (usually an indicator of tablescans) getting just over one block at a time while tablescanning, when every extent in a temporary segment will be 1MB (by default, though you could have declared it to be something else)?
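For scale, assuming the default 8KB block size, the first slave above averages barely more than one block per multiblock request:

768 table scan blocks / 678 multiblock requests ~= 1.13 blocks per request
1MB extent / 8KB block (assumed)                 = 128 blocks per extent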
One thing the last set of results reminds me of is a bug where a trace file would show:
WAIT "db file scattered read" file 5, block 256, block count 128
WAIT "db file scattered read" file 5, block 257, block count 127
WAIT "db file scattered read" file 5, block 258, block count 126
WAIT "db file scattered read" file 5, block 259, block count 125
I think you need to look at the ASH data to see if that gives you any further clues
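A minimal ASH sketch (diagnostic pack licence required; &sql_id is a placeholder) to see which plan lines and wait events the time is going to:

select sql_plan_line_id, nvl(event, 'ON CPU') event, count(*) samples
from   v$active_session_history
where  sql_id = '&sql_id'
group by sql_plan_line_id, nvl(event, 'ON CPU')
order by samples desc;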
Then you could enable extended SQL tracing
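For example, a sketch using dbms_monitor from another session (&sid / &serial are placeholders; in most versions a 10046-style trace enabled on the coordinator's session is inherited by its PX slaves, but verify that on 12.1):

exec dbms_monitor.session_trace_enable(session_id => &sid, serial_num => &serial, waits => true, binds => false)
-- ... reproduce the query, then:
exec dbms_monitor.session_trace_disable(session_id => &sid, serial_num => &serial)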
And you could enable PX tracing to see what is going on between PX processes
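PX tracing is controlled by a hidden parameter, so treat this as a sketch to be confirmed against the MOS note on _px_trace before using it:

alter session set "_px_trace" = "high", "all";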
You could run the Monitor report (as a text output) several times in rapid succession immediately after starting the query to see if the scan gets to 10M rows in the first couple of seconds and then keeps on working while getting no further.
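A sketch of pulling the text Monitor report on demand (&sql_id is a placeholder; set long high enough in SQL*Plus to see the whole report):

set long 1000000 longchunksize 1000000 linesize 250
select dbms_sqltune.report_sql_monitor(sql_id => '&sql_id', type => 'TEXT', report_level => 'ALL') from dual;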
You didn't post the PX summary section of the Monitor report - that might have given a couple more little clues about what's going on.
Something, possibly several somethings, is clearly broken in the mix of GTT, parallel execution, and standby database.
The one thing I didn't see in the stats was any indication of a read-consistency problem, though the amount of cross-chat for global cache requests was a little surprising.
Regards
Jonathan Lewis
On Sat, 2 Oct 2021 at 13:35, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:
Exactly, the code is using temp_undo_enabled = true at session level. I usually use Tanel's snapper; I capture stats for 1 minute for all the slaves involved.
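For anyone following the thread, a minimal sketch of that setting (the GTT name and column are hypothetical; the GTT itself would have been created on the primary, since DDL can't run on the read-only standby):

alter session set temp_undo_enabled = true;
-- my_gtt is a hypothetical "on commit delete rows" GTT created on the primary
insert into my_gtt (id) select level from dual connect by level <= 100000;
-- query it within the same transaction: a commit would purge the rows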
On Sat, 2 Oct 2021 at 11:55, Jonathan Lewis <jlewisoracle_at_gmail.com> wrote:
I forgot to ask for clarification about how you are populating a GTT on a standby database.
Is this taking advantage of the 12c "temp_undo_enabled" parameter?
The primary example suggests you're looking at a table of 385MB, with read requests of 1MB each.
Is the standby GTT likely to be about the same size? It's suffered 22GB of blocks read with an average read size of about 320KB. I'm now wondering if this is a problem of read-consistency: are the slaves doing 1MB reads of the GTT and then going crazy checking temp undo segments using single block reads?
Can you run a test on the standby that runs in isolation so that v$sysstat will be (mostly) the sum of the PX session stats; or run the test and check v$px_sesstat for the query so that you can capture what the PX processes for the query coordinator are doing. Also a couple of (diffs of) snapshots of v$filestat might be a pointer to where the action is happening.
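A sketch of the v$px_sesstat capture while the query is running (&qcsid is a placeholder for the coordinator's SID):

select s.sid, n.name, s.value
from   v$px_sesstat s
join   v$statname  n on n.statistic# = s.statistic#
where  s.qcsid = &qcsid
and    n.name in ('table scan rows gotten', 'index scans kdiixs1',
                  'physical read total multi block requests', 'physical reads direct')
order by s.sid, n.name;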
Regards
Jonathan Lewis
On Fri, 1 Oct 2021 at 11:17, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:
Hello Jonathan,
Regarding point 1, I was now thinking that this might be one reason: on the primary, stats are collected for the GTTs after the insert, while on the standby they are not. Still, the estimates look correct (maybe dynamic sampling).
The original flow and query are very complicated, so I tried to reduce it to a simple count(*) select with a join that replicates the behavior (SQL Monitor report in the attached screenshot).
On Fri, 1 Oct 2021 at 12:34, Jonathan Lewis <jlewisoracle_at_gmail.com> wrote:
First guess -
Is there any reason why the primary would know that the parent session had written a very large amount of data to the GTT while the standby thinks the GTT is very small? In the latter case you may find that the PX sessions are using buffered reads when you would normally expect parallel tablescans to use direct path.
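A quick way to test that first guess is to see whether the slaves are waiting on direct path reads or on buffered-read events; a sketch against v$session_event (&qcsid is a placeholder for the coordinator's SID):

select se.sid, se.event, se.total_waits, round(se.time_waited_micro/1e6, 1) seconds
from   v$session_event se
where  se.sid in (select sid from v$px_session where qcsid = &qcsid)
and    se.event in ('direct path read', 'db file scattered read',
                    'db file sequential read', 'read by other session',
                    'latch: cache buffers chains')
order by se.time_waited_micro desc;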
Second guess -
Is it possible that the execution plan has changed, so that the primary is doing a tablescan of the GTT but the standby is doing a parallel tablescan of something else and using that to drive an indexed access path into the GTT?
Regards
Jonathan Lewis
On Fri, 1 Oct 2021 at 10:06, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:
Hello everyone.
Version 12.1
It looks like, for a SQL statement executed in a standby database, with parallel, that selects data from a GTT (with on commit delete rows), the PX slaves spend a significant amount of time on "latch: cache buffers chains" and "read by other session". The same query executed in the primary database has no issues. This issue is observed just for GTTs.
Anyone have any ideas on how to mitigate this issue?
Thank you.
--
http://www.freelists.org/webpage/oracle-l
Received on Wed Oct 06 2021 - 16:52:23 CEST