Re: latch: cache buffers chains
Date: Sat, 2 Oct 2021 16:38:19 +0100
Message-ID: <CAGtsp8m7n58-sPdz2WL6rQgvSyuYOym4oZ6=VfuLNg2sOLPW9w_at_mail.gmail.com>
On Sat, 2 Oct 2021 at 13:35, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:
> Exactly, the code sets temp_undo_enabled to true at session level. I
> usually use Tanel's snapper; I capture stats for 1 minute for all the
> slaves involved.
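
A capture of that kind might look like the following sketch, assuming
snapper.sql is in the SQL*Plus working directory and using a made-up query
coordinator SID of 1234 (snapper v4's qc= SID syntax, which pulls in all
the PX slaves belonging to that coordinator):

    -- Capture stats and wait events for 60 seconds, in one snapshot,
    -- for the query coordinator with SID 1234 and all of its PX slaves.
    SQL> @snapper all 60 1 qc=1234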
>
>
> On Sat, 2 Oct 2021 at 11:55, Jonathan Lewis <jlewisoracle_at_gmail.com>
> wrote:
>
>>
>> I forgot to ask for clarification about how you are populating a GTT on a
>> standby database.
>> Is this taking advantage of the 12c "temp_undo_enabled" parameter?
>> The primary example suggests you're looking at a table of 385MB, with
>> read requests of 1MB each.
>> Is the standby GTT likely to be about the same size? It has suffered 22GB
>> of blocks read with an average read size of about 320KB. I'm now wondering
>> if this is a problem of read-consistency: are the slaves doing 1MB reads of
>> the GTT and then going crazy checking temp undo segments using single block
>> reads?
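
As a sketch of the mechanism being asked about (the GTT and source table
names here are hypothetical): with temporary undo enabled, undo for GTT
changes goes to the temp tablespace instead of the undo tablespace, which
is what allows the insert to run on a read-only Active Data Guard standby.
Temporary undo activity can then be checked in v$tempundostat:

    -- On the primary this must be set explicitly; on an Active Data
    -- Guard standby temporary undo is used for GTT DML in any case.
    alter session set temp_undo_enabled = true;

    -- my_gtt is a hypothetical GTT created with ON COMMIT DELETE ROWS.
    insert into my_gtt select * from some_source_table;

    -- Check temporary undo generation (12c view).
    select begin_time, txncount, undoblkcnt
    from   v$tempundostat
    order  by begin_time;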
>>
>> Can you run a test on the standby that runs in isolation, so that
>> v$sysstat will be (mostly) the sum of the PX session stats; or run the test
>> and check v$px_sesstat for the query, so that you can capture what the PX
>> processes for the query coordinator are doing? Also, a couple of (diffs of)
>> snapshots of v$filestat might point to where the action is happening.
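
A sketch of those two checks, using a made-up QC SID of 1234 (take the
v$filestat snapshot before and after the test and diff the two; note that
tempfile activity appears in v$tempstat rather than v$filestat):

    -- Sum each statistic across the PX sessions of the coordinator.
    select sn.name, sum(st.value) total
    from   v$px_sesstat st, v$statname sn
    where  st.statistic# = sn.statistic#
    and    st.qcsid      = 1234
    group  by sn.name
    having sum(st.value) != 0
    order  by total desc;

    -- Snapshot of datafile read activity.
    select file#, phyrds, phyblkrd, singleblkrds, readtim
    from   v$filestat
    order  by file#;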
>>
>>
>> Regards
>> Jonathan Lewis
>>
>>
>>
>> On Fri, 1 Oct 2021 at 11:17, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com>
>> wrote:
>>
>>> Hello Jonathan,
>>>
>>> Regarding point 1, I now think that might be one reason: in primary,
>>> stats are collected for the GTTs after the insert, while in standby they
>>> are not. Still, the estimates look correct (maybe dynamic sampling).
>>>
>>> The original flow and query are very complicated, so I tried to reduce
>>> them to a simple count(*) select with a join to replicate the behavior
>>> (SQL Monitor report in the attached screenshot).
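
The cut-down test was presumably of roughly this shape (a reconstruction
with hypothetical table and column names, since the actual statement is
only visible in the screenshot):

    -- Populate the GTT, then run a parallel count(*) with a join.
    insert into my_gtt select * from driving_table;

    select /*+ parallel(4) monitor */ count(*)
    from   my_gtt g, lookup_table l
    where  g.join_col = l.join_col;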
>>>
>>>
>>>
>>>
>>> On Fri, 1 Oct 2021 at 12:34, Jonathan Lewis <jlewisoracle_at_gmail.com>
>>> wrote:
>>>
>>>> First guess -
>>>> Is there any reason why the primary would know that the parent session
>>>> had written a very large amount of data to the GTT while the standby
>>>> thinks the GTT is very small? In the latter case you may find that PX
>>>> sessions are using buffered reads when you normally expect parallel
>>>> tablescans to use direct path.
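
One way to test that first guess is to compare the slaves' direct path
read statistics with their total physical reads; a sketch, again with a
made-up QC SID:

    -- If "physical reads direct" stays near zero while "physical reads"
    -- is large, the slaves are doing buffered reads, not direct path.
    select sn.name, sum(st.value)
    from   v$px_sesstat st, v$statname sn
    where  st.statistic# = sn.statistic#
    and    st.qcsid      = 1234
    and    sn.name in ('physical reads',
                       'physical reads direct',
                       'physical reads direct temporary tablespace')
    group  by sn.name;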
>>>>
>>>> Second guess -
>>>> Is it possible that the execution plan has changed, so that the primary
>>>> is doing a tablescan of the GTT but the standby is doing a parallel
>>>> tablescan of something else and using that to drive an indexed access
>>>> path into the GTT?
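
Comparing the actual runtime plans from both sides would settle the
second guess; a sketch (the sql_id is a placeholder, and the Note section
will also show whether dynamic sampling was used):

    -- Run on both primary and standby and compare the two plans.
    select * from table(
        dbms_xplan.display_cursor('&sql_id', null, 'ALLSTATS LAST +NOTE')
    );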
>>>>
>>>> Regards
>>>> Jonathan Lewis
>>>>
>>>>
>>>> On Fri, 1 Oct 2021 at 10:06, Laurentiu Oprea <
>>>> laurentiu.oprea06_at_gmail.com> wrote:
>>>>
>>>>> Hello everyone.
>>>>>
>>>>> Version 12.1
>>>>>
>>>>> It looks like, for a SQL statement executed in parallel in the standby
>>>>> database, selecting data from a GTT (with ON COMMIT DELETE ROWS), the PX
>>>>> slaves spend a significant amount of time on "latch: cache buffers
>>>>> chains" and "read by other session". The same query executed in the
>>>>> primary database has no issues. This issue is observed just for GTTs.
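
The split between those two waits can be quantified from ASH with
something like this sketch (Diagnostics Pack licence required; the sql_id
is a placeholder):

    -- Count ASH samples per wait event for the problem statement.
    select event, count(*) samples
    from   v$active_session_history
    where  sql_id = '&sql_id'
    and    event in ('latch: cache buffers chains',
                     'read by other session')
    group  by event
    order  by samples desc;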
>>>>>
>>>>> Anyone have any ideas on how to mitigate this issue?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
--
http://www.freelists.org/webpage/oracle-l

Received on Sat Oct 02 2021 - 17:38:19 CEST