RE: Finding cause and fix for Idle wait event

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Sun, 26 Feb 2023 09:18:04 -0500
Message-ID: <0c5d01d949ed$25469600$6fd3c200$_at_rsiz.com>



An interesting way to test realized network throughput to an individual client is with sqlplus.

Create a simple one-table scan that produces a known-size output file that is resistant to compression (big buffer size, no pause on a page full, formatting off, long numbers with no repeating digits, etc.).

Turn screen output off and redirect output to /dev/null (on the client), fire the query from the client, and time until you get the next prompt.

(You do need to make the output large enough that the mechanics of running your stopwatch and the transmission of the SQL string and parsing are insignificant.)

Divide report size by elapsed time.  

That’s down.  
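
Putting that together, a rough sketch of the "down" test in shell (the connect string, table, column list and row count below are illustrative assumptions, not part of the recipe above):

#!/bin/sh
# Rough "down" (server-to-client) throughput check.
# scott/tiger@PROD, big_table and the 1,000,000-row limit are placeholders;
# use a query whose output size you know and that compresses poorly.
time sqlplus -s scott/tiger@PROD > /dev/null <<'EOF'
set pagesize 0 heading off feedback off
set arraysize 500 linesize 200
select id, val1, val2, val3
from   big_table
where  rownum <= 1000000;
exit
EOF
# down throughput ~= (known output size in bytes) / (elapsed time from "time")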

For up, create a table no one else is using with no indexes, no integrity constraints, no nothin’, enough preallocated space so there is no space management, and turn autocommit off. Create an insert file of a known size on the client and run it with the @ command.

Time until you get feedback.

Compare to the speed of running the same input file with @ on the server.

The difference is approximately the network time.  
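
A corresponding sketch of the "up" test (the scratch table, extent size, connect string and insert-script name are again placeholders; the insert script is assumed to be a known-size file of INSERT statements that starts with "set autocommit off" and ends with a single commit and an exit):

#!/bin/sh
# One-off setup: a scratch table nobody else touches, no indexes, no
# constraints, and enough preallocated space that no extent allocation
# happens during the timed run.
sqlplus -s scott/tiger@PROD <<'EOF'
create table net_up_test (id number, pad varchar2(200));
alter table net_up_test allocate extent (size 500m);
exit
EOF

# Run the known-size insert script from the client, then run the same file
# from a sqlplus session on the database server itself; the difference in
# elapsed time is approximately the network time.
time sqlplus -s scott/tiger@PROD @ins_100m.sql > /dev/null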

This is a very pedestrian sanity check and it is not exact.  

If it shows about 44 kbaud, I’ll tell a humorous-to-me story about a customer who was sucking down big reports of data to edit and then send back as a batch, and who insisted the database was overloaded even though I couldn’t snap a single wait with my old snapstat package. Using a remote terminal to the same client computer, sitting in the computer room and attached by an Ethernet cable, solved the problem without upgrading the network to the office across the quad (because 20 people typing with think time can’t generate 44 kbaud back and forth looking at one page of text data at a time, but they can try to pull or shove huge amounts of data back and forth as files).

Okay, I said it was humorous-to-me and I already told it.

The point is that IF you suspect an application or something else on the client is the root cause of slowness, do something that is known to be fast on the client that generates a known, reasonably accurate amount of data transmission between the client and the server. IF that goes fast, then the problem is clearly the application. IF that is slow, then you’re not getting the fraction of some damn VLAN that QoS is supposed to guarantee (or some other network problem). And if a simple insert is slow on the server itself, you have a problem on the server.

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Jonathan Lewis
Sent: Saturday, February 25, 2023 8:01 AM
To: contact_at_soocs.de
Cc: loknath.73_at_gmail.com; Oracle L
Subject: Re: Finding cause and fix for Idle wait event    

Stefan,  

The reason you shouldn't ignore the "more data" waits completely is that they're telling you that the problem isn't the network.  

If you had random variation in both the "message" AND the "more data" waits then you could point the finger at the network - but since the 3 "more data" waits are always very fast and the "message" waits are always slow, the gap is the application.

Regards

Jonathan Lewis            

On Sat, 25 Feb 2023 at 11:58, Stefan Koehler <contact_at_soocs.de> wrote:

Hello Lok,
please let me just add my 2 cents to this discussion.

I am ignoring the "SQL*Net more data from client" wait for now as the total wait time is negligible. However, "SQL*Net message from client" can be the application OR the network - you can't tell by just looking at this wait.

I am quoting Tanel Poder for a more detailed explanation ( https://tanelpoder.com/2008/02/10/sqlnet-message-to-client-vs-sqlnet-more-data-to-client/ ): "Thanks to that, all of the time a TCP packet spent “flying” towards the client is actually accounted in SQL*Net message from client wait statistic. The problem here is though, that we don’t know how much of this time was spent on the wire and how much of it was application think time."

You can approach this uncertainty by looking at the wait times themselves (e.g. every single "SQL*Net message from client" wait that takes longer than 1 second is probably application think/work time, and everything below 1 second is probably the network) - tools like the Method R Profiler by Cary Millsap do this work for you (for every single wait) and show you the final result in an HTML report.
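
For example, a rough way to apply that 1-second rule of thumb yourself to a raw 10046 trace file is a one-liner like the following (the trace file name is a placeholder, and it assumes ela= is reported in microseconds, as it is in current releases):

# Bucket "SQL*Net message from client" waits by a 1-second threshold:
# short waits are likely network round-trips, long waits are likely
# application think/work time.
awk -F'ela= ' '/SQL\*Net message from client/ {
  split($2, a, " ")               # a[1] = elapsed wait time in microseconds
  if (a[1] + 0 >= 1000000) { app_us += a[1]; n_app++ } else { net_us += a[1]; n_net++ }
} END {
  printf "waits <  1s: %d, total %.1f s (probably network)\n", n_net, net_us / 1e6
  printf "waits >= 1s: %d, total %.1f s (probably application)\n", n_app, app_us / 1e6
}' prod_ora_12345.trc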

Anyway, if you want to be absolutely sure whether it is the application or the network (that causes this high "SQL*Net message from client" wait time), you can tcpdump the network traffic (on the application side) and then process this tcpdump file with the free tool STADO ( http://blog.ora-600.pl/2020/01/22/when-awr-is-not-enough-a-k-a-why-your-app-is-not-cloud-ready-smartdb/ ) by Kamil Stawiarski.
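
A typical capture on the application host might look something like this (the interface name, the db-host alias and the 1521 listener port are assumptions to adjust for your environment); the resulting file is what you would feed to STADO:

# Capture the SQL*Net conversation between this application host and the
# database server; eth0, db-host and port 1521 are placeholders.
tcpdump -i eth0 -s 0 -w sqlnet.pcap host db-host and port 1521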

Best Regards
Stefan Koehler

Independent Oracle performance consultant and researcher
Website: http://www.soocs.de
Twitter: _at_OracleSK

> Lok P <loknath.73_at_gmail.com> hat am 25.02.2023 07:58 CET geschrieben:
>
>
> Thank you so much Pap and Kyle.
>
> Exactly what I wanted to know. Thank you for the clarification, and apologies for my confusion. I also shared the trace file in the post earlier just to be sure we were going in the right direction, because I had seen in some blogs in the past that these waits can be network related.
>
> However, now that you have made it clear that the issue is not in the network but in the app layer, I will follow up with the app/Informatica admin team. Thank you again for all the guidance.

--
http://www.freelists.org/webpage/oracle-l