Re: Same query with no plan change or volume but runs for hours vs minutes

From: Jonathan Lewis <jlewisoracle_at_gmail.com>
Date: Tue, 5 Dec 2023 14:22:43 +0000
Message-ID: <CAGtsp8kfvES-udTH+Nwq+-5oz_702w0yTnDrApYeeu+zepVeeQ_at_mail.gmail.com>



The ksq event is about locks, and you're not spending any significant time on lock waits - it's "all" CPU - so don't bother with that trace.

If you do the ptrace/short_stack trace I think you'll probably find that the most commonly occurring function calls are ktspscan_bmb and ktspfsrch which are searching for space and checking bitmap blocks - but we can already see that in the session activity stats. If something

You say that you have a delete and then an insert of around 30 million rows. Is this "delete / commit / insert" or "delete, insert" without commit - it makes an enormous difference.

Gathering stats on another node at the same time (is this a call that will gather stats on the table AND the index, or a table-only call)? This will do at least a tablescan on the table, but may will then do an index fast full scan on the index if cascade=>true. Different behaviour could appear if the gather starts

  1. before the delete starts
  2. during the delete
  3. after the delete end but before the commit
  4. after the commit but before the insert starts
  5. after the insert starts
  6. after the inserts ends

I haven't checked the details of how the latest versions of Oracle mark bitmap blocks (L1 and L2) as having space available in data blocks and don't want to spend a few hours running some carefully controlled tests to find out, but (FOR EXAMPLE) if the stats collection happens while the delete is going on the stats gathering program has to use read-consistency to produce its result and the mix of blocks going empty because of the delete and CR blocks being made full MIGHT manage to break some detail of how Oracle walks bitmap blocks and the data blocks they point to. This is all hypothetical - but something is almost certainly going wrong in the search for free space and this type of collision has caused problems in earlier versions of Oracle.

Regards
Jonathan Lewis

On Tue, 5 Dec 2023 at 11:23, Lok P <loknath.73_at_gmail.com> wrote:

> Thank you so much.
>
> Need some guidance on the tracing part. Should we do something like below
> while the query is running long? And also for how much time period should
> we take the trace? . And I believe we need to wait for the next long
> execution to happen.
>
> alter system set events 'trace[ksq][SQL:XXXXX] ';
> alter system set events 'trace[ksq][SQL:XXXXX] off';
>
> Another thing as we discovered , the stats gathering happens from another
> node-2 during the same time and also runs longer for ~30minutes+ (vs
> ~10minutes in normal days) on the same partition multiple times on which
> the data load is happening from this node-1. So not sure if that can be the
> cause or effect of this INSERT query slowness?
>
> Basically one node is executing the delete + insert conventional
> statement at the same time another node is executing stats gathered on the
> same partition. Can this cause such an anomaly in flagging the L1 bitmaps
> for the index blocks or bumping up the "reject db - optimizer index blocks
> reject" stats and thus end up consuming a lot more CPU ?
>
>
>
> On Tue, Dec 5, 2023 at 3:34 PM Timur Akhmadeev <timur.akhmadeev_at_gmail.com>
> wrote:
>
>> https://gist.github.com/oraclelearner/6720861b373814614fe3b0669aee6901
>>
>> Thanks, I was thinking about possible statement restart due to undo
>> segment extension as there are such events and "transaction rollbacks"=1 in
>> the bad run (see
>> https://jonathanlewis.wordpress.com/2019/09/10/update-restarts/ for an
>> example) but it appears it is not the case, or it is just not reported by
>> SQL Monitor.
>>
>> You need to get stack traces as it will immediately point you in the
>> right direction.
>>
>> On Tue, Dec 5, 2023 at 11:40 AM Lok P <loknath.73_at_gmail.com> wrote:
>>
>>> The sql monitor for the complete execution with DONE status of the bad
>>> run is as below.
>>>
>>> https://gist.github.com/oraclelearner/6720861b373814614fe3b0669aee6901
>>>
>>> One thing I was wrong about and just found that the same set of rows
>>> (almost ~30million) gets inserted as part of another process a day before
>>> into this same partition. And then as part of the current day's run , first
>>> there runs a delete query which deletes almost all the rows(~30million)
>>> from that partition and then this insert query runs and inserts all of the
>>> rows again with few additional rows (~32million). But again this pattern
>>> happens daily ,so unsure of the reason why it causes slowness on certain
>>> days only. Yet to investigate , if it's the correct business logic or we
>>> can avoid the delete somehow.
>>>
>>> And also as "Jonathan" pointed on the "gc current grant busy" stats, i
>>> do see stats gather was kicking on the same object from another nopde while
>>> this INSERT was executing and that stats gather finishes in ~10-15minutes.
>>> But hopefully that won't cause such an anomaly in flagging "L1 bitmaps" for
>>> the index blocks.
>>>
>>> Another thing i observed is, the value of the object_id and
>>> data_object_ids of table partition are all same but the for the PK index
>>> i.e dba_ind_partition is pointing to different object_id and
>>> data_object_ids in dba_objects. So I was thinking of any such structural
>>> change occurring on the index partition causing the insert to run longer.
>>> But I see similar differences between object_id and data_object_ids for all
>>> the index partitions, so that is not adding up to the theory. and also i
>>> didn't find any job or related query in the dba_hist_sqltext, which must be
>>> doing index rebuild/shrink/coalesce etc. Also did not find any such INSERT
>>> which must be happening at same time on the same object.
>>>
>>> On Thu, Nov 30, 2023 at 9:24 PM Timur Akhmadeev <
>>> timur.akhmadeev_at_gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Since it is reproducible I suggest you take a few stacks of the
>>>> database process with oradebug short_stack and then check them at
>>>> orafun.info/stack
>>>> If you can't do short stacks then can you post SQL Monitor of the bad
>>>> run after it is fully completed and its status is DONE.
>>>>
>>>>
>>>>
>>>> On Mon, Nov 27, 2023 at 11:15 PM Lok P <loknath.73_at_gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> The insert query normally runs for <15minutes somedays runs for
>>>>> ~6-7hrs with exactly the same plan and approx same volume. And all the
>>>>> resources seem to be spent on step "LOAD TABLE CONVENTIONAL" and its "ON
>>>>> CPU". Wondering where it's spending time during the long execution?
>>>>>
>>>>> While the long run was happening , we tried to fetch the stats from
>>>>> v$sesstat for a ~5minutes duration and collected the non zero stats. Not
>>>>> able to find out anything odd. Could you please guide me here, what must be
>>>>> the cause of this sudden long execution on some days for this same query
>>>>> with almost the same volume?
>>>>>
>>>>> I have posted the sql monitors from both the runs below and also in
>>>>> the below github linc. And also published the nonzero output from the
>>>>> v$sesstat for the long execution for a ~5 minutes delta duration.We do see
>>>>> some stats like "KTFB alloc**" , is that normal or some issue is
>>>>> causing these stats to pop up?
>>>>>
>>>>> https://gist.github.com/oraclelearner/722a77ef5ebce23396d82e2835958303
>>>>>
>>>>> ******** Sql monitor from long run ********
>>>>> Global Information
>>>>> ------------------------------
>>>>> Status : EXECUTING
>>>>>
>>>>> Instance ID : 1
>>>>>
>>>>> SQL Execution ID : 16777216
>>>>>
>>>>> Execution Started : 11/27/2023 03:30:16
>>>>>
>>>>> First Refresh Time : 11/27/2023 03:30:19
>>>>>
>>>>> Last Refresh Time : 11/27/2023 10:06:13
>>>>>
>>>>> Duration : 23759s
>>>>>
>>>>> PLSQL Entry Ids (Object/Subprogram) : 161298,9
>>>>>
>>>>> PLSQL Current Ids (Object/Subprogram) : 161298,15
>>>>>
>>>>>
>>>>> Global Stats
>>>>>
>>>>> ===========================================================================================================================================================================
>>>>> | Elapsed | Cpu | IO | Application | Concurrency | Cluster
>>>>> | PL/SQL | Other | Buffer | Read | Read | Uncompressed | Offload |
>>>>> Offload | Cell |
>>>>> | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Waits(s)
>>>>> | Time(s) | Waits(s) | Gets | Reqs | Bytes | Bytes | Elig Bytes |
>>>>> Returned Bytes | Offload |
>>>>>
>>>>> ===========================================================================================================================================================================
>>>>> | 23758 | 23570 | 4.88 | 1.28 | 4.78 | 58
>>>>> | 41 | 119 | 18G | 1M | 719GB | 359GB | 715GB |
>>>>> 4GB | 99.42% |
>>>>>
>>>>> ===========================================================================================================================================================================
>>>>>
>>>>> SQL Plan Monitoring Details (Plan Hash Value=557779869)
>>>>>
>>>>> ==================================================================================================================================================================================================================================
>>>>> | Id | Operation | Name
>>>>> | Rows | Cost | Time | Start | Execs | Rows | Read | Read
>>>>> | Mem | Activity | Activity Detail |
>>>>> Progress |
>>>>> | | |
>>>>> | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes |
>>>>> | (%) | (# samples) |
>>>>> |
>>>>>
>>>>> ==================================================================================================================================================================================================================================
>>>>> | -> 0 | INSERT STATEMENT |
>>>>> | | | 23761 | +3 | 1 | 0 | | |
>>>>> . | 0.16 | Cpu (35) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | cell smart table scan (2) |
>>>>> |
>>>>> | -> 1 | LOAD TABLE CONVENTIONAL | PBDRA_DTL
>>>>> | | | 23762 | +2 | 1 | 0 | 499K | 4GB |
>>>>> . | 98.42 | gc current block 2-way (4) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | gc current grant 2-way (23) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | gc current grant busy (18) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | latch: cache buffers chains (5) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | log file switch completion (1) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | undo segment extension (2) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | Cpu (22695) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | cell single block physical read: RDMA (8) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | cell single block physical read: xrmem cache (34) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | enq: FB - contention (1) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | ges message buffer allocation (2) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | cell single block physical read: flash cache (3) |
>>>>> |
>>>>> | -> 2 | SEQUENCE | PBDRA_SEQ
>>>>> | | | 23761 | +3 | 1 | 18M | | |
>>>>> . | 0.50 | Cpu (109) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | reliable message (6) |
>>>>> |
>>>>> | -> 3 | HASH JOIN |
>>>>> | 68259 | 1M | 23761 | +3 | 1 | 18M | | |
>>>>> 5MB | 0.02 | Cpu (5) |
>>>>> |
>>>>> | 4 | TABLE ACCESS STORAGE FULL | PRFS
>>>>> | 78 | 2 | 1 | +3 | 1 | 63 | 1 | 40960
>>>>> | . | | |
>>>>> |
>>>>> | -> 5 | PARTITION RANGE ITERATOR |
>>>>> | 152K | 1M | 23761 | +3 | 1 | 18M | | |
>>>>> . | | |
>>>>> |
>>>>> | -> 6 | TABLE ACCESS STORAGE FULL | PBABR_DTL
>>>>> | 152K | 1M | 23761 | +3 | 215 | 18M | 740K | 715GB |
>>>>> 7MB | 0.33 | Cpu (73) |
>>>>> 100% |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | gcs drm freeze in enter server mode (1) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | reliable message (1) |
>>>>> |
>>>>> | | |
>>>>> | | | | | | | | |
>>>>> | | cell smart table scan (1) |
>>>>> |
>>>>>
>>>>> ==================================================================================================================================================================================================================================
>>>>>
>>>>>
>>>>> ********* Quick Run ****************
>>>>>
>>>>> Global Information
>>>>> ------------------------------
>>>>> Status : DONE
>>>>>
>>>>> Instance ID : 1
>>>>>
>>>>> SQL Execution ID : 16777216
>>>>>
>>>>> Execution Started : 11/24/2023 04:39:04
>>>>>
>>>>> First Refresh Time : 11/24/2023 04:39:07
>>>>>
>>>>> Last Refresh Time : 11/24/2023 04:55:37
>>>>>
>>>>> Duration : 993s
>>>>>
>>>>> PLSQL Entry Ids (Object/Subprogram) : 161298,9
>>>>>
>>>>> PLSQL Current Ids (Object/Subprogram) : 161298,15
>>>>>
>>>>>
>>>>> Global Stats
>>>>>
>>>>> ===========================================================================================================================================================================
>>>>> | Elapsed | Cpu | IO | Application | Concurrency | Cluster
>>>>> | PL/SQL | Other | Buffer | Read | Read | Uncompressed | Offload |
>>>>> Offload | Cell |
>>>>> | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Waits(s)
>>>>> | Time(s) | Waits(s) | Gets | Reqs | Bytes | Bytes | Elig Bytes |
>>>>> Returned Bytes | Offload |
>>>>>
>>>>> ===========================================================================================================================================================================
>>>>> | 993 | 942 | 5.27 | 0.05 | 0.01 | 0.02
>>>>> | 68 | 46 | 316M | 2M | 867GB | 443GB | 861GB |
>>>>> 7GB | 99.21% |
>>>>>
>>>>> ===========================================================================================================================================================================
>>>>>
>>>>> SQL Plan Monitoring Details (Plan Hash Value=557779869)
>>>>>
>>>>> =====================================================================================================================================================================================
>>>>> | Id | Operation | Name
>>>>> | Rows | Cost | Time | Start | Execs | Rows | Read | Read |
>>>>> Mem | Activity | Activity Detail |
>>>>> | | |
>>>>> | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes |
>>>>> (Max) | (%) | (# samples) |
>>>>>
>>>>> =====================================================================================================================================================================================
>>>>> | 0 | INSERT STATEMENT |
>>>>> | | | 991 | +3 | 1 | 0 | | |
>>>>> . | | |
>>>>> | 1 | LOAD TABLE CONVENTIONAL | PBDRA_DTL
>>>>> | | | 991 | +3 | 1 | 0 | 874K | 7GB |
>>>>> . | | |
>>>>> | 2 | SEQUENCE | PBDRA_SEQ
>>>>> | | | 992 | +2 | 1 | 30M | | |
>>>>> . | | |
>>>>> | 3 | HASH JOIN |
>>>>> | 66851 | 1M | 991 | +3 | 1 | 30M | | |
>>>>> 5MB | | |
>>>>> | 4 | TABLE ACCESS STORAGE FULL | PRFS
>>>>> | 78 | 2 | 1 | +3 | 1 | 63 | 2 | 49152 |
>>>>> 1MB | | |
>>>>> | 5 | PARTITION RANGE ITERATOR |
>>>>> | 149K | 1M | 991 | +3 | 1 | 32M | | |
>>>>> . | | |
>>>>> | 6 | TABLE ACCESS STORAGE FULL | PBABR_DTL
>>>>> | 149K | 1M | 991 | +3 | 286 | 32M | 892K | 861GB |
>>>>> 7MB | | |
>>>>>
>>>>> =====================================================================================================================================================================================
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>> Timur Akhmadeev
>>>>
>>>
>>
>> --
>> Regards
>> Timur Akhmadeev
>>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Dec 05 2023 - 15:22:43 CET

Original text of this message