Fw: Instrumenting Poor performance - SOLVED

From: Upendra nerilla <nupendra_at_hotmail.com>
Date: Fri, 12 Oct 2018 02:38:27 +0000
Message-ID: <SN1PR10MB0368DB4DC61F1FB55C3B8942D8E20_at_SN1PR10MB0368.namprd10.prod.outlook.com>

Didn't see it come through.. resending.. Apologies if it is duplicated..

From: Upendra nerilla <nupendra_at_hotmail.com> Sent: Tuesday, October 2, 2018 11:31 AM
To: Oracle-L
Subject: Re: Instrumenting Poor performance - SOLVED

All,
After much struggle, Oracle has come back pinpointing to a bug..

Here is the issue description:
Occasionally one of the custom EBS process was running slow. After trying a couple of fixes we thought we fixed the issue (because it occurred once or twice a month). However with no explanation the issue randomly showed up.

Here is how a good and bad tkprof output looked..

GOOD EXECUTION:
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse        0      0.00       0.00          0          0          0           0
Execute      9      0.00       0.00          0          0          0           0
Fetch        9    119.20     119.95        135   40086840          0        2111


------- ------  -------- ---------- ---------- ---------- ----------  ----------


total       18    119.20     119.96        135   40086840          0        2111


POOR EXECUTION:
call     count       cpu    elapsed       disk      query    current        rows


------- ------  -------- ---------- ---------- ---------- ----------  ----------


Parse        0      0.00       0.00          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        3    160.44    1289.07    3673632   13274897          0         368


------- ------  -------- ---------- ---------- ---------- ----------  ----------


total        6    160.44    1289.07    3673632   13274897          0         368

Both the good and bad executions were using the same plan hash (which was forced through a baseline). Other than the occasional heavy "db file sequential read" wait events, there wasn't anything we could find. All the data access was similar between good and bad execution.

Upon closer look of 10046 trace, we observed that the same blocks were being read twice.. (see the screenshot attached)

Oracle narrowed down this to bugs - Bug 20352562<https://support.oracle.com/epmos/faces/BugDisplay?id=20352562&parent=SrDetailText&sourceId=3-18254389511>, bug 19145675<https://support.oracle.com/epmos/faces/BugDisplay?id=19145675&parent=SrDetailText&sourceId=3-18254389511> superceded by bug 19787571<https://support.oracle.com/epmos/faces/BugDisplay?id=19787571&parent=SrDetailText&sourceId=3-18254389511> (not open for public viewing)

Here are the excerpts from the bug:
"Single synchronous reads are due to the chaining. Maybe the duplicated read is because of intra-block chaining. The table has 264 columns, therefore the rows will be split into pieces of <=255 rows, these may be in the same block or separate blocks. I took a look at 0x01403d00 (5,15616) which is the first culprit, the rows are intra-block chained thus:

row 0 last piece of row 1
row 1 head piece
row 2 last piece of row 3
row 3 head piece
row 4 last piece of row 5
row 5 head piece
row 6 last piece of row 7
row 7 head piece
row 8 last piece of row 9
row 9 head piece
row 10 last piece of row 11
row 11 head piece
row 12 last piece of row 13
row 13 head piece
row 14 last piece of row 15
row 15 head piece
kcbldrget() looks like the start point."

In the meanwhile, we rewrote the offending query to eliminate a chunk of processing which fixed the issue for now. Oracle development has created an emergency patch for us which we are testing it.

Many thanks to everyone who assisted at the right time.

-Upendra

From: niall.litchfield_at_gmail.com <niall.litchfield_at_gmail.com> Sent: Tuesday, September 18, 2018 3:24 AM To: nupendra_at_hotmail.com
Cc: dombrooks_at_hotmail.com; Martin Berger; ORACLE-L Subject: Re: Instrumenting Poor performance

Hi Upendra

I see that you say that this occurs around the start of each month and disappears after a couple of days. Have you perhaps partitioned your Order Entry tables and (assuming the plan really is the same) you are starting to access new partitions for the first time? Usually month change funniness would involve getting a plan change from/to nested loop access to the data as a result of misleading stats on the new partition, but I see no reason in principle why there wouldn't be a caching effect - though I'd be surprised if it lasted days.

On Mon, Sep 17, 2018 at 10:27 PM Upendra nerilla <nupendra_at_hotmail.com<mailto:nupendra_at_hotmail.com>> wrote: Dominic/Jonathan/Stefan and others,
Aplogies for the delay... I was trying to make sure I go back and review all the data that we already have gathered..

A couple of quick information:
This is running from EE 12.1.0.2 RAC.
We do have Tuning/Diag licenses.

It turns out, we do have a bit of useful information gathered during our last poor execution..

We have the 10046 trace
SQLT report

I performed a quick profiler comparison for both the bad and good execution, plan hash is the same. Here is the summary of the tkprof out how it appears..

Bad run:
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse        0      0.00       0.00          0          0          0           0
Execute      3      0.00       0.00          0          0          0           0
Fetch        3    160.44    1289.07    3673632   13274897          0         368


------- ------  -------- ---------- ---------- ---------- ----------  ----------


total        6    160.44    1289.07    3673632   13274897          0         368


Good run:
call     count       cpu    elapsed       disk      query    current        rows


------- ------  -------- ---------- ---------- ---------- ----------  ----------


Parse        0      0.00       0.00          0          0          0           0
Execute      9      0.00       0.00          0          0          0           0
Fetch        9    119.20     119.95        135   40086840          0        2111


------- ------  -------- ---------- ---------- ---------- ----------  ----------


total       18    119.20     119.96        135   40086840          0        2111

Upon looking at the details, "OE_ORDER_LINES_ALL" seems to be the object which was read from disk..

db file sequential read 92.7% 1,258.1501s 3,692,162 0.0003s 0.0001s 0.5071s

I am now leaning towards Jonathan's suggestion of a potential large scale update on this table.. I will check with the team and see if there is any such update which happens.. With the AWR baseline created around that time, I am going to go through and see if I could find it..

Knowing that I have all this information, is there anything else that I could be looking at?

Thank you so much everyone

-Upendra

From: Dominic Brooks <dombrooks_at_hotmail.com<mailto:dombrooks_at_hotmail.com>> Sent: Friday, September 14, 2018 2:52 AM To: nupendra_at_hotmail.com<mailto:nupendra_at_hotmail.com> Cc: Martin Berger; Oracle-L oracle-l
Subject: Re: Instrumenting Poor performance

Sql plan hash does not include the predicates so that�s one niche consideration to look out for. AWR doesn�t hold predicates so you will need to wait until you can pull plans from memory via dbms_xplan.display_cursor to see that. It�s possible but unusual.

Real time sql monitoring can you tell you where the time was spent but if you no longer have that available for bad execution then you can get a very rough approximation from ASH raw data.

Full sql trace is best to see what you were doing but ASH is usually good enough.

A combination of wait events (e.g runtime decisions about switching from buffered to direct io) and the detail of what you are waiting on exactly (e.g reading lots of undo) might be sufficient to give pointers.

Otherwise you might need the combination of ash/tracing (what) and snapping session statistics (why) to give a better picture.

Regards
Dominic

Sent from my iPhone

On 13 Sep 2018, at 21:47, Upendra nerilla <nupendra_at_hotmail.com<mailto:nupendra_at_hotmail.com>> wrote:

is it always the same SQL_ID?
You an enable sql_trace for this particular SQL: alter system set events 'sql_trace[sql: <SQL_ID>] level=12';

It is always the same sql_id. We have a 10046 and 10053 trace for the good execution.. Waiting for the next poor execution to capture the traces again..

I assume EBS can enable tracing for specific activities, but I'm not close to EBS, but maybe there are better solutions than simple sql_trace.

I am not an EBS guy, so I am at a loss here. :(

Do you have proper facility to create a profile based on these sql_trace files?

There is already a sql_profile created for the sql_id. Is that what you mean?

From: Martin Berger <martin.a.berger_at_gmail.com<mailto:martin.a.berger_at_gmail.com>> Sent: Thursday, September 13, 2018 3:52 PM To: nupendra_at_hotmail.com<mailto:nupendra_at_hotmail.com> Cc: Oracle-L oracle-l
Subject: Re: Instrumenting Poor performance

Hi Upendra,

is it always the same SQL_ID?
You an enable sql_trace for this particular SQL: alter system set events 'sql_trace[sql: <SQL_ID>] level=12';

I assume EBS can enable tracing for specific activities, but I'm not close to EBS, but maybe there are better solutions than simple sql_trace.

With the traces of different "good" and "bad" and see where the time is spent.

Do you have proper facility to create a profile based on these sql_trace files?

br,
berx

Am Do., 13. Sep. 2018 um 21:11 Uhr schrieb Upendra nerilla <nupendra_at_hotmail.com<mailto:nupendra_at_hotmail.com>>: Hello Team -
We have an EBS application which is running EBS and several other modules (OTC). From time to time I see a particular query giving poor response time. Normal response is <4 mins and during the poor execution, it goes over an hour.

We have a SQL baseline created for that SQL_ID and forcing the optimal plan. We are able to see that optimizer is using the same hash. There are no host resource (cpu/memory/io) constraints.. This job runs only a couple of times a day, very time sensitive for the business. We are unable to reproduce this at lower environments even when we performed refreshes..

We see a pattern that this issue shows up during the first week of the month and disappears after a couple of days (not the same exact day). Here is what I am thinking of gathering - high level info..

Gather daily row count of all the tables involved in the query..
Review the jobs and see if there is anything unique..

What else could I gather to troubleshoot this further? Appreciate any insights..

Thank you
-Upendra

--

Niall Litchfield
Oracle DBA
http://www.orawin.info

--

http://www.freelists.org/webpage/oracle-l Received on Fri Oct 12 2018 - 04:38:27 CEST

This message: [ Message body ]
Next message: Paul Drake: "Oct 2018 CPU pre-release posted"
Previous message: Mladen Gogala: "Re: 18c client Net Manager is missing Directory"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message