Re: high temp space usage for same analytic function on latest version

From: Jonathan Lewis <jlewisoracle_at_gmail.com>
Date: Thu, 17 Nov 2022 22:21:24 +0000
Message-ID: <CAGtsp8m-yWLzVboNoCPz_1vV3giBPvH+VS=jqo5852=X=rjxJg_at_mail.gmail.com>



I thought I'd take a look on MOS for bugs that report performance problems related to "pushed rank" and found one.
Bug 30786641 : QUERY PERFORMANCE REGRESSION AFTER UPGRADING FROM 11.2.0.4 TO 19.5
It might be related; but maybe not, and you may have the patch built into your version anyway - the patch is in19.15.2.0

Regards
Jonathan Lewis

On Thu, 17 Nov 2022 at 20:30, Jonathan Lewis <jlewisoracle_at_gmail.com> wrote:

>
> I've now examined the three different Monitor report with a little care.
> THere's more information I would like to collect if I were on site and
> able to re-run the queries, but I think I may have identified the critical
> issue.
>
> I thought you kept saying the problem was in the outer hash join - maybe
> you didn't and I just picked that up from the bits and pieces.
>
> The massive I/O overhead (temp space usage) is in the WINDOW CHILD PUSHED
> RANK operation (the one above the outer join) -It's a problem whether or
> not you use the profile from 11g. Critical observation: the input to the
> window sort is 7 billion rows in 11g, the sort space required is 10GB, with
> 19GB of reads and writes (so a multipass sort operation). How do you sort
> 7 B rows with only 11GB of space ? Answer: you can't.
>
> In 19c the number of rows is 7B in one case and 10B in the other, and the
> space requirement (before failing) is a much more realistic 1TB. The
> inference we draw from this is that the PUSHED RANK is working in 11g, but
> not working correctly in 19c, so the entire data set has to be captured
> before the SORT in 19c but 11g can keep saying "I've got one of those
> already, I only need to keep one" as each row arrives. The output that
> succeeds in roughly 1 row in 100, so it's not too surprising that the temp
> space used if pushed rank doesn't work is roughly 100 times the size.)
>
> The next really important detail to check is the actual runtime predicate
> used in the pushed rank, just in case it's not what is claimed.
> Another really important check is that 11g has been getting the right
> results - maybe the mechanism it has been using was disabled in 19c because
> it could produce the wrong results.
>
> Regards
> Jonathan Lewis
>
>
>
>
>
>
>
>
> On Wed, 16 Nov 2022 at 18:44, yudhi s <learnerdatabase99_at_gmail.com> wrote:
>
>> Yes I tried that and it does increase the estimates for the cte from 38k
>> to 860k but that problematic section of the plan remains same consuming
>> terabytes of temp space.
>>
>> On Thu, 17 Nov, 2022, 12:10 am Lok P, <loknath.73_at_gmail.com> wrote:
>>
>>> Did you try forcing the opt_estimate hint to influence the cte
>>> estimation to increase 18 times as Jonathan suggested? Does that change
>>> anything in the execution path?
>>>
>>> On Wed, 16 Nov, 2022, 11:24 pm yudhi s, <learnerdatabase99_at_gmail.com>
>>> wrote:
>>>
>>>> Extremely sorry, as it looks like I have messed it up. Trying to
>>>> correct it.
>>>>
>>>> Basically we had a sql profile attached in past while we were in
>>>> version 11.2.0.4 and was working fine. also when we moved to 19.11 too that
>>>> same profile was there making the query finish without any issue. but that
>>>> same plan is now making the query to go for consuming terabytes of temp on
>>>> version 19.15. I dont have a sql monitor from 19.11 plan which was
>>>> succeeding fine, but its the same plan phv as the 19.15 plan which i posted
>>>> below in the 2nd sql monitor.
>>>>
>>>> However, the point i want to make here is , all of these plans are
>>>> almost same with minor difference. And If you see the specific section of
>>>> the plan which does the "HASH JOIN OUTER" and passing the resulting output
>>>> to the analytical function is exactly same in all these cases but still the
>>>> 19.15 version is going for ~1TB+ temp. So its definitely odd.
>>>>
>>>> Below is the three comparison of that exact section of the plan which
>>>> is playing key role here in consuming all tempspace. In the first case even
>>>> if the ~7Billion rows in getting sorted using just ~10GB of tempspace,
>>>> whereas in second and third cases its using in TBs. Though the data volume
>>>> is higher but this excessive use of tempspace is not justified. Definitely
>>>> seems buggy.
>>>>
>>>>
>>>> https://gist.github.com/databasetech0073/55c5722da8105dfcc647ab3bd1b24a48
>>>>
>>>> And also as you have highlighted the bug related to bloom filter i
>>>> tried forcing the (no_px_join_filter hints and even setting up
>>>> "_bloom_filter_enabled" and "_bloom_pruning_enabled" in session level to
>>>> stop the bloom pruning and tried running SELECT part of the query, but
>>>> still its using same excessive amount of tempspace.
>>>>
>>>>
>>>> --1st sqlmonitor
>>>>
>>>> Below is the sql monitor and outline in version 11.2.0.4, which was
>>>> finishing by consuming <20GB tempspace.
>>>>
>>>>
>>>> https://gist.github.com/databasetech0073/630dad9fc43d6ab6c8863251e29f1cf2
>>>>
>>>> --2nd sql monitor
>>>>
>>>> Below is the sql monitor and outline in version 19.15 (by forcing
>>>> exactly the same plan through a sql profile as it used to take in 11.2).
>>>>
>>>>
>>>> https://gist.github.com/databasetech0073/37ec33e73a23bd0055148ba43a0a49ca
>>>>
>>>> --3rd sql monitor
>>>>
>>>> Below is the sql monitor and outline in current version ( i.e 19.15
>>>> without any profile) which is also consuming high tempspace.
>>>>
>>>>
>>>> https://gist.github.com/databasetech0073/4b895f5790a993ed326d389b54295f9b
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 15 Nov, 2022, 6:39 pm Jonathan Lewis, <jlewisoracle_at_gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Since I've noticed Bloom filters as an anomaly, it might be worth
>>>>> checking if you're patched for:
>>>>> "Bug 30618406 - Huge PGA memory consumption by bloom filters
>>>>> (qesblfilter_seg) in px slaves in full partition-wise join"
>>>>>
>>>>> I know this plan isn't using full PW joins - but (read the details) -
>>>>> your parallel operation uses a lot of memory, and then a lot of temp space.
>>>>> Sometimes the bug goes a lot further than the initial description.
>>>>>
>>>>> Marked as fixed in 23.1, but I can see patches for 19.16 in the RU.
>>>>>
>>>>> Regards
>>>>> Jonathan Lewis
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 15 Nov 2022 at 07:15, Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> I've taken a few minutes to look through the github material. The
>>>>>> headings don't agree with the contents. Here's a list showing each heading
>>>>>> plus two lines from the Outline Data under the heading:
>>>>>>
>>>>>> Display cursor plan from version 11.2.0.4
>>>>>> OPTIMIZER_FEATURES_ENABLE('19.1.0')
>>>>>> DB_VERSION('19.1.0')
>>>>>>
>>>>>>
>>>>>> Execution which is failing in 19.15 version with high tempspace
>>>>>> consumption
>>>>>> OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
>>>>>> DB_VERSION('19.1.0')
>>>>>>
>>>>>>
>>>>>> Execution which was running fine in 19.11 version
>>>>>> OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
>>>>>> DB_VERSION('11.2.0.4')
>>>>>>
>>>>>>
>>>>>> There doesn't seem to be any point in spending time looking at detail
>>>>>> when I've got no confidence that the bits of information supplied belong
>>>>>> together.
>>>>>>
>>>>>>
>>>>>> A couple of high visibility points, though:
>>>>>> a) the version labelled "Execution which was running fine in 19.11
>>>>>> version" reported OPT_PARAM('optimizer_dynamic_sampling' 3) in its outline
>>>>>> data, and that could have helped Oracle get better estimates on some of the
>>>>>> predicates that included functions of columns.
>>>>>> b) the version labelled "Display cursor plan from version 11.2.0.4"
>>>>>> included Bloom Filtering sent to storage on the massive tablescan where I
>>>>>> pointed out the massive CPU anomaly in an earlier post.
>>>>>> 76 - storage((COALESCE("FPT"."DEL_FLG",'N')='N' AND
>>>>>> TRUNC(INTERNAL_FUNCTION("FPT"."H_DT_TIM"))=:B1 AND
>>>>>> SYS_OP_BLOOM_FILTER(:BF0000,"FPT"."A_FK")))
>>>>>> filter((COALESCE("FPT"."DEL_FLG",'N')='N' AND
>>>>>> TRUNC(INTERNAL_FUNCTION("FPT"."H_DT_TIM"))=:B1 AND
>>>>>> SYS_OP_BLOOM_FILTER(:BF0000,"FPT"."A_FK")))
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Jonathan Lewis
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 14 Nov 2022 at 17:18, yudhi s <learnerdatabase99_at_gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank You Sayan and Jonathan.
>>>>>>>
>>>>>>> As you suggested, as a quick thing , in the select query, I was
>>>>>>> trying this hint in the main block to influence optimizer to increase the
>>>>>>> resulting rows ~18 times from the CTE "C" , but its not getting reflected
>>>>>>> in the plan, which means optimizer is not considering this, but then it
>>>>>>> should appear in the hint report section of the plan as error or unused
>>>>>>> etc.. but it is not getting published there too, so not sure why is this
>>>>>>> happening. Is there anything wrong in the hint below which i am using?
>>>>>>>
>>>>>>> I am using below hint in the main select query which performs outer
>>>>>>> join with the "cte" Table C
>>>>>>>
>>>>>>> /*+OPT_ESTIMATE( TABLE, C, scale_rows=18)*/
>>>>>>>
>>>>>>> Also , I tried as below but seeing no change to the tempspace
>>>>>>> consumption though.
>>>>>>>
>>>>>>> cast(RPAD (SUBSTR (CP.CP_NBR, 1, CP.CPCL_NBR - X.RN),CASE
>>>>>>> CP.PANL_CNT WHEN 0 THEN 16 ELSE CP.PANL_CNT END, '9') as
>>>>>>> varchar2(22)) AS HIGH_VALUE
>>>>>>>
>>>>>>> cast(RPAD (SUBSTR (CP.CP_NBR, 1, CP.CPCL_NBR - X.RN), CASE
>>>>>>> CP.PANL_CNT WHEN 0 THEN 16 ELSE CP.PANL_CNT END, '0') as varchar2(22)
>>>>>>> ) AS LOW_VALUE,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 14 Nov, 2022, 4:22 pm Jonathan Lewis, <
>>>>>>> jlewisoracle_at_gmail.com> wrote:
>>>>>>>
>>>>>>>> Still avoiding spending a lot of time on the github text.
>>>>>>>>
>>>>>>>> a) You materialize a CTE that invovles a connect by query that
>>>>>>>> generates 18 rows - use the opt_estimate() hint to tell Oracle that the
>>>>>>>> materialized view holds 18 times the rows it expects. (If you examine the
>>>>>>>> plan the optimizer allows for ONE row coming from dual, not 18).
>>>>>>>>
>>>>>>>> b) You have several places where you use expresssions like
>>>>>>>> substr(c_nbr,1,4) in predicates . Create (optionally invisible) virtual
>>>>>>>> columns for these expressions and gather stats on them, this will give the
>>>>>>>> optimizer a better chance of getting the right arithmetic, especially for
>>>>>>>> cases where the expression needs a frequency (or top-frequency) histogram.
>>>>>>>> Invisible is safest for existing production code, but you may need to test
>>>>>>>> carefully to see if the optimizer manages to use the stats from the virtual
>>>>>>>> columns when it sees the equivalent expressions.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Jonathan Lewis
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, 13 Nov 2022 at 14:28, yudhi s <learnerdatabase99_at_gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you Lok. Actually the join is not exactly on the columns but
>>>>>>>>> with some functions so that might be the case why it seems like Cartesian.
>>>>>>>>>
>>>>>>>>> However the point is , this query has not been changed and I agree
>>>>>>>>> that it may be because the data has been increased a lil in one of the
>>>>>>>>> table 'C' or 'T' , so the join output has been increased from 7billion to
>>>>>>>>> 10billion but how come that justifies the temp space requirement from 20gb
>>>>>>>>> to 2.5terabyte it not twice or thrice but multiple times.
>>>>>>>>>
>>>>>>>>> As this query is running in parallel-16 degree because of a
>>>>>>>>> underlying table degree , and considering no other option at hand, so I was
>>>>>>>>> thinking if by increasing or decreasing the parallelism with help of hints
>>>>>>>>> will help in completing the query at least with lesser temp space(we have
>>>>>>>>> 2.5TB of temp space with us right now currently) ? Or any other possible
>>>>>>>>> option to make this query succeed?
>>>>>>>>>
>>>>>>>>> Just to note , we had initially 200gb of temp space but just with
>>>>>>>>> the hope that this query may finish, we already increased it to 2.5TB now,
>>>>>>>>> but still no luck and the query is failing and it's happening after this
>>>>>>>>> 19.15 patch.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, 13 Nov, 2022, 6:06 pm Lok P, <loknath.73_at_gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The below section you posted shows the input to 'Hash join outer'
>>>>>>>>>> were ~1Million and 37milion respectively, but the result came from the
>>>>>>>>>> outer join is ~7billion. So is there any oddity in the data in table 'T'
>>>>>>>>>> and 'C' which is playing a role which is increasing the rows input to your
>>>>>>>>>> analytical function recently?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> SQL Plan Monitoring Details (Plan Hash Value=2056239125)
>>>>>>>>>>
>>>>>>>>>> =================================================================================================================================================================================================================================================================================
>>>>>>>>>> | Id | Operation
>>>>>>>>>> | Name | Rows | Cost | Time |
>>>>>>>>>> Start | Execs | Rows | Read | Read | Write | Write | Cell | Mem
>>>>>>>>>> | Temp | Activity | Activity Detail |
>>>>>>>>>> | |
>>>>>>>>>> | | (Estim) | | Active(s) |
>>>>>>>>>> Active | | (Actual) | Reqs | Bytes | Reqs | Bytes | Offload | (Max)
>>>>>>>>>> | (Max) | (%) | (# samples) |
>>>>>>>>>>
>>>>>>>>>> =================================================================================================================================================================================================================================================================================
>>>>>>>>>> |
>>>>>>>>>> | | | | |
>>>>>>>>>> | | | | | | | | |
>>>>>>>>>> | | direct path read temp (553) |
>>>>>>>>>> | |
>>>>>>>>>> | | | | |
>>>>>>>>>> | | | | | | | | |
>>>>>>>>>> | | direct path write temp (47) |
>>>>>>>>>> | 31 | HASH JOIN RIGHT OUTER
>>>>>>>>>> | | 9M | 2M | 1676 |
>>>>>>>>>> +218 | 16 | 7G | | | | | | 217M |
>>>>>>>>>> | 23.08 | Cpu (4041) |
>>>>>>>>>> | 32 | BUFFER SORT
>>>>>>>>>> | | | | 4 |
>>>>>>>>>> +216 | 16 | 1M | | | | | | 97M |
>>>>>>>>>> | | |
>>>>>>>>>> | 33 | PX RECEIVE
>>>>>>>>>> | | 10647 | 24 | 4 |
>>>>>>>>>> +216 | 16 | 1M | | | | | | |
>>>>>>>>>> | | |
>>>>>>>>>> | 34 | PX SEND HASH
>>>>>>>>>> | :TQ10005 | 10647 | 24 | 1633 |
>>>>>>>>>> +216 | 1 | 1M | | | | | | |
>>>>>>>>>> | 0.01 | Cpu (2) |
>>>>>>>>>> | 35 | VIEW
>>>>>>>>>> | | 10647 | 24 | 1632 |
>>>>>>>>>> +217 | 1 | 1M | | | | | | |
>>>>>>>>>> | | |
>>>>>>>>>> | 36 | TABLE ACCESS STORAGE FULL
>>>>>>>>>> | SYS_TEMP_0FDA1E71E_D71F1BDE | 10647 | 24 | 1632 |
>>>>>>>>>> +217 | 1 | 1M | 106 | 104MB | | | | |
>>>>>>>>>> | 0.01 | cell multiblock physical read (1) |
>>>>>>>>>> | 37 | PX RECEIVE
>>>>>>>>>> | | 9M | 2M | 1676 |
>>>>>>>>>> +218 | 16 | 37M | | | | | | |
>>>>>>>>>> | 0.05 | Cpu (9) |
>>>>>>>>>> | 38 | PX SEND HASH
>>>>>>>>>> | :TQ10009 | 9M | 2M | 1677 |
>>>>>>>>>> +217 | 16 | 37M | | | | | | |
>>>>>>>>>> | 0.13 | Cpu (23) |
>>>>>>>>>> | 39 | VIEW
>>>>>>>>>> | | 9M | 2M | 1677 |
>>>>>>>>>> +217 | 16 | 37M | | | | | | |
>>>>>>>>>> | | |
>>>>>>>>>>
>>>>>>>>>> On Sun, Nov 13, 2022 at 2:34 AM yudhi s <
>>>>>>>>>> learnerdatabase99_at_gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Jonathan.
>>>>>>>>>>> Actually the original query was a long one so i was trying to
>>>>>>>>>>> reduce the complexity by just posting the particular section of the query
>>>>>>>>>>> which was causing the temp space spike. But I agree that just a partial
>>>>>>>>>>> query doesn't make much sense. Also my apology as the plan format was
>>>>>>>>>>> distorted.
>>>>>>>>>>>
>>>>>>>>>>> I have again posted the exact query below with the actual object
>>>>>>>>>>> names being replaced with dummy names. I have posted the plan in the github
>>>>>>>>>>> in the link below, so the sql execution plan format will be intact. And I
>>>>>>>>>>> am not having the "display cursor" plan for the current runs/failures for
>>>>>>>>>>> this INSERT query, so i have posted the sql monitors for them and i had the
>>>>>>>>>>> display cursor plan when it was running fine in 11.2.0.4 so i posted that
>>>>>>>>>>> just for reference. Also I have copied and pasted the outline section from
>>>>>>>>>>> the display_awr below each of the sqlmonitor plan.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://gist.github.com/databasetech0073/714263bce477f624763f757e457cb861
>>>>>>>>>>>
>>>>>>>>>>> As mentioned earlier, the section of the plan which is doing the
>>>>>>>>>>> outer join is below and it results in ~10billion rows. Which then is passed
>>>>>>>>>>> to the analytical function. Was trying to understand, Is there any possible
>>>>>>>>>>> way to minimize the temp space consumption in this scenario? because we
>>>>>>>>>>> kept on increasing the temp space assuming it would succeed somehow, but
>>>>>>>>>>> it just kept consuming 2.5TB+ space and then also it failed with "Ora-01652
>>>>>>>>>>> unable to extend temp segment .." as it saturates all allocated temp
>>>>>>>>>>> space?
>>>>>>>>>>> As far as I know the encryption/TDE was already there in this
>>>>>>>>>>> database before we moved from 19.11 to 19.15. I will double check with the
>>>>>>>>>>> infra team on this.
>>>>>>>>>>>
>>>>>>>>>>> "LEFT OUTER JOIN C
>>>>>>>>>>> ON SUBSTR (T.C_NBR, 1, 4) = C.BNK_ID_NB_4
>>>>>>>>>>> AND LENGTH (T.C_NBR) = C.PANL_CNT
>>>>>>>>>>> AND T.C_NBR BETWEEN C.LOW_VALUE AND C.HIGH_VALUE) S"
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Nov 12, 2022 at 12:50 PM Jonathan Lewis <
>>>>>>>>>>> jlewisoracle_at_gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> >> We reran the query and it took a different plan hash value
>>>>>>>>>>>> , however the path is almost similar wrt the line which does "Outer join"
>>>>>>>>>>>> and the "analytical function" evaluation. I have posted the specific
>>>>>>>>>>>> section of sql monitor below with predicate and column projection for that
>>>>>>>>>>>> run.
>>>>>>>>>>>>
>>>>>>>>>>>> The supplied outputs are virtually unreadable, for different
>>>>>>>>>>>> execution plans, for different versions (and youve changed from 19.11 for
>>>>>>>>>>>> the "slow" query to 11.2.0.4!) , and partial. I'm not going to look at them.
>>>>>>>>>>>>
>>>>>>>>>>>> >> However, the sql monitor shows status as 'error' out after
>>>>>>>>>>>> sometime(~30minutes)) but the underlying sessions(~33 parallel sessions)
>>>>>>>>>>>> kept on running for 2hrs+
>>>>>>>>>>>>
>>>>>>>>>>>> That looks like the behaviour that Mohamed Houri has written
>>>>>>>>>>>> about in the past:
>>>>>>>>>>>> https://hourim.wordpress.com/2020/03/07/sql-monitor-flaw/
>>>>>>>>>>>>
>>>>>>>>>>>> I'd forgotten it would be harder to collect the correct
>>>>>>>>>>>> statistics from all the processes when running a parallel query - finding
>>>>>>>>>>>> sessions by SQL_ID could work, but using views like v$px_sesstat etc. is
>>>>>>>>>>>> safer. Nothing stands out from the information you supplied about wait
>>>>>>>>>>>> events, and the only "unusual" thing in the session activity is that you
>>>>>>>>>>>> are writing encrypted blocks to the temporary tablespace - were you doing
>>>>>>>>>>>> that in the earlier version? It shouldn't explain a massive increase in
>>>>>>>>>>>> space requirements, but it might be relevant - it would help to explain an
>>>>>>>>>>>> increase in CPU.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Jonathan Lewis
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 11 Nov 2022 at 17:07, yudhi s <
>>>>>>>>>>>> learnerdatabase99_at_gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank You Sayan, Jonathan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We reran the query and it took a different plan hash value ,
>>>>>>>>>>>>> however the path is almost similar wrt the line which does "Outer join" and
>>>>>>>>>>>>> the "analytical function" evaluation. I have posted the specific section of
>>>>>>>>>>>>> sql monitor below with predicate and column projection for that run.
>>>>>>>>>>>>> However, the sql monitor shows status as 'error' out after
>>>>>>>>>>>>> sometime(~30minutes)) but the underlying sessions(~33 parallel sessions)
>>>>>>>>>>>>> kept on running for 2hrs+, until then I saw ~2TB of temp space already
>>>>>>>>>>>>> consumed and later it failed with insufficient temp space error.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Nov 17 2022 - 23:21:24 CET

Original text of this message