Re: FW: Error related to hash memory
Date: Fri, 10 Dec 2021 20:37:06 +0000
Message-ID: <CAGtsp8=Cj_SrbZTUVgXC8aqURd_TR6=JJmw=4SxUUr4eOT-SQw_at_mail.gmail.com>
You might want to test the following on a backup system, but the "new" tracing/debug mechanism in 11g allows all sorts of interesting variations. It is possible to set all sorts of events at the system level but restrict them in many ways - in particular you can set events to fire ONLY for a given SQL_ID, no matter who executes it. I've shown examples on my blog of enabling 10046 (as a named event) system wide whenever a particular statement runs, but in your case I think the following should work:
alter system set events 'kg_event[32699][SQL:0hj564hmb98qc] level 32768';
You'll have to substitute the SQL_ID of the interesting statement in the
[SQL:xxxxxxxxxxxxx] clause.
For this event number level 32768 looks as if it might be the "everything"
level. To disable the event after you've got a couple of traces I think
you'll have to run the command again but using level 0.
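
Putting the whole cycle together as a minimal sketch (the SQL_ID is still just
the placeholder from above, and the v$diag_info query is simply the usual way
of finding the trace directory):

-- enable the event only for the interesting SQL_ID
alter system set events 'kg_event[32699][SQL:0hj564hmb98qc] level 32768';

-- ... reproduce the problem and collect a couple of traces ...

-- then (as above, I think) switch the event off again with level 0
alter system set events 'kg_event[32699][SQL:0hj564hmb98qc] level 0';

-- the trace files should appear in the diagnostic trace directory
select value from v$diag_info where name = 'Diag Trace';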
Regards
Jonathan Lewis
On Fri, 10 Dec 2021 at 09:46, Lok P <loknath.73_at_gmail.com> wrote:
> Thank You So Much Jonathan.
>
> This event has to be set at the session level rather than the system level,
> and the query is inside a procedure (using a global temporary table) that is
> called by the application layer, so we may not be able to hit the same
> process from the application multiple times to reproduce the error. That
> leaves us facing a bit of a challenge. We don't see other options, but maybe
> we will have to set the session-level event trace (32699) and somehow
> execute the procedure manually multiple times to hit the error and capture
> the trace.
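>
> As a minimal sketch of what we have in mind (assuming the kg_event form of
> the event syntax is also accepted by ALTER SESSION):
>
> alter session set events 'kg_event[32699] level 32768';
> -- execute the procedure manually here, then
> alter session set events 'kg_event[32699] level 0';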
>
> On Thu, Dec 9, 2021 at 3:52 PM Jonathan Lewis <jlewisoracle_at_gmail.com>
> wrote:
>
>>
>> >> Also Jonathan, I was trying to test the job in a less busy hour and I
>> am seeing the same plan, with the "hash group by" operation using ~4GB of
>> memory. So it contradicts the thought of a ~2GB memory limitation on the
>> "hash group by".
>>
>> Good. That eliminates one possibility. (In fact, looking back at the
>> original Monitor report you sent, we can see that one of the other hash
>> operations reported 4GB max memory - so that would have been a reasonable
>> indication that the 2GB wasn't a limit).
>>
>> >> And also, when I ran the same with 'sort group by' but at parallel 8,
>> it took almost the same time, though with less memory consumption and temp
>> spill. Attached is the sql monitor for both.
>>
>> I think the only significant thing you can take from this test (since you
>> changed the degree of parallelism) is the workload of the two lines
>> relating to the serial group by, which tell us:
>> Hash group by: 4GB max memory, 23GB max temp, 234 seconds CPU, 50 seconds I/O
>> Sort group by: 119M max memory, 9GB max temp, 276 seconds CPU, 113 seconds I/O.
>>
>> If the problem is something to do with memory consumption, the extra CPU
>> and I/O time is a relatively small price to pay for "protection" - and that
>> was with 3% extra data; but that doesn't tell you much about what the
>> difference would be if the group by and data load were running parallel at
>> the same DOP.
>>
>>
>> >> However, as we had also seen the query/job rerun without failure, does
>> it mean it's really a memory crunch (even though 40% of the memory in the
>> host was left unutilized during the issue period)? And would bumping
>> pga_aggregate_target be the correct approach here?
>>
>> At present we have no information that tells us it's the correct
>> approach, and nothing that tells us it's the wrong approach.
>> We do have Jared Still's comment that he has seen something of the sort
>> when the hash aggregation took up "all the temporary space".
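>>
>> As a side note - a minimal sketch of the standard v$pgastat figures that
>> show how close the instance actually gets to the target:
>>
>> select name, round(value/1024/1024) as mb
>> from   v$pgastat
>> where  name in ('aggregate PGA target parameter',
>>                 'total PGA allocated',
>>                 'maximum PGA allocated');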
>>
>>
>> I should point out, by the way, that I misread the original plan and said
>> the volume of data going into the hash aggregation was more than 1,000
>> times the estimated volume.
>> The number 1,000 came from comparing 168M with 127K - because I misread
>> the Cost column as the Rows column. The actual was really 168M times the
>> estimate.
>>
>> Have you tried running with the tracing event set? You need to do this
>> at least once simply to see how large the trace file is.
>> Whatever else you do (if you stick with the hash join) you MIGHT then
>> decide that the event should ALWAYS be set for this statement, so that if it
>> does crash in the same way in the future you will be able to see from the
>> trace file where it had got to before crashing.
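>>
>> If the event does end up permanently set, a cap on trace file size might be
>> worth considering as well - a minimal sketch using the standard parameter
>> (the 500M figure is just an arbitrary example):
>>
>> alter system set max_dump_file_size = '500M';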
>>
>> Regards
>> Jonathan Lewis
>>
>>
>>
>>
-- http://www.freelists.org/webpage/oracle-l
Received on Fri Dec 10 2021 - 21:37:06 CET