Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
Home -> Community -> Mailing Lists -> Oracle-L -> Re: Wait event problems
I think Rachel is referring following email from Carry Millsap.. It is a
really nice explanation of this issue....
Regards
Rafiq
I have an example for you (Anjo, I hope you won't mind). A prospect we visited once upon a time had been fighting a performance problem with an Oracle Payroll program. They "knew" what their problem was: very clearly, v$system_event was telling them that their overwhelmingly dominant system performance problem was waits for "latch free." (Well, no it wasn't. It was waits for the Oracle Payroll program. ;) )
Actually, the v$system_event view indicated clearly that the top wait event was "SQL*Net message from client" (and some other "SQL*Net %" stuff), but of course every good DBA knows that you have to discard all the "SQL*Net" events because they are "idle" events.
Three months of trying and failing to fix the problem led finally to a last-resort, desperation hardware upgrade. Twelve 700MHz CPUs became twelve new 1,024MHz CPUs. (Don't ask why "waits for latch free" was translated as "need to upgrade CPUs.") And unfortunately, the Payroll program got *slower*. Not just "seemed slower"--*was* slower.
After the upgrade failed to help, these guys finally let us come see them. We traced the Payroll process (10046, level 8), very carefully initiating the trace immediately after the initiation of the Payroll run, and very carefully terminating the trace immediately after the completion of the Payroll run. The key here is that every second of data in the trace file was a second of somebody's response time.
The result was a really fun story. The thing ran in about 33 minutes. Here's a summary of what was in the session's trace file:
Oracle Kernel Event Duration Calls Avg ------------------------------ ------------------ -------- ---------- SQL*Net message from client 984.01s 49.6% 95,161 0.010340s SQL*Net more data from client 418.82s 21.1% 3,345 0.125208s db file sequential read 279.54s 14.1% 45,085 0.006200s CPU service 248.69s 12.5% 222,760 0.001116s unaccounted-for 27.73s 1.4% latch free 23.69s 1.2% 34,695 0.000683s log file sync 1.09s 0.1% 506 0.002154s SQL*Net more data to client 0.83s 0.0% 15,982 0.000052s log file switch completion 0.28s 0.0% 3 0.093333s enqueue 0.25s 0.0% 106 0.002358s SQL*Net message to client 0.24s 0.0% 95,161 0.000003s buffer busy waits 0.22s 0.0% 67 0.003284s db file scattered read 0.01s 0.0% 2 0.005000s SQL*Net break/reset to client 0.00s 0.0% 2 0.000000s ------------------------------ ------------------ -------- ---------- Total 1,985.40s 100.0%
Well, waits for "latch free" accounted for only 1.2% of the total response time. If the DBAs had been completely successful in eliminating "latch free" waits from their system, it would have made only a 1.2% difference in the runtime of this program. Point 1: <emphasis>This program's bottleneck is not the same as its system's system-wide average bottleneck.</emphasis> This happens all the time. You can NOT see this type of problem when you look only at system-wide average statistics.
You probably didn't pay much heed to the first thing you should have noticed: slightly more than 70% of this program's runtime was consumed by so-called "server idle" time. ??! You can't ignore this time, even if people do call it "idle." Idle or not, this time was part of someone's response time, so we need to deal with it. Now if we hadn't collected our statistics so carefully (with proper time scope and proper program scope), then we would have seen probably much more "SQL*Net message from client" time in our data. If you make that particular collection error, then you *have* to disregard the so-called "idle" wait time.
What was the Payroll program's problem? A couple of things. First, we knew that the Payroll program was running on the same host as Oracle. But the latency per call on the "SQL*Net message from client" event looked suspiciously LAN-like (order of 10ms), not IPC-like (order of 1ms or less). So we checked the tnsnames.ora file. Turns out that to conserve administration effort, the DBAs had decided to use a single tnsnames.ora file system-wide. So the Payroll program was using the same TCP/IP protocol that the system's PC clients were using. Just by switching the protocol to BEQ instead of TCP in the db server's tnsnames.ora file (a change that took about 15 minutes to test, 20 seconds to implement, and about a week to navigate through change control), we eliminated about 50% of the program's runtime.
Next, what's with the "SQL*Net more data from client"? The developers of this particular Payroll program did a pretty bad thing in their code: they passed several really long (10KB) SQL texts from the client to the server in their parse calls (oh yeh, and with no bind variables either). Anyway, each time the Oracle kernel received a packet containing only part of a SQL statement, it tallied to "SQL*Net more data from client" while it waited for the next piece. Of course, communicating via the network interface card instead of IPC didn't help, but passing huge SQL statements from the app to the db is bad form to begin with (see Ixora for more detail on just how bad this is).
Finally, realize that Oracle tallies to "SQL*Net message from client" pretty much every time it's "between" working on db call fulfillment. Therefore, there were on the order of 95,161 db calls during this 33-minute period. Looking at the SQL, we discovered that of course, since the app developers didn't use bind variables, there was a parse call for almost every execute. This accounted for thousands of unnecessary parse calls. Next, each execute call manipulated only zero or one row, which added up to dozens of execute calls for each single SQL statement with a multi-row result source. The program could have performed the business function with a fraction of the db calls.
Oh yes... Why did the Payroll program get *slower* after the upgrade? Little Payroll program time was spent using CPU, so the upgrade had very little direct positive effect upon the program. Most of the program's time was spent waiting for network latency. Other programs ran at the same time as this Payroll job. The CPU upgrade made *them* faster, which intensified *their* constant number of network calls into a smaller time window. The result was increased competition for the network during the Payroll run. Therefore, every network I/O call the Payroll call made was a little slower than before the CPU upgrade. The increase in network response time overwhelmed the tiny direct improvement of the CPU time reduction, resulting in a net degradation of Payroll performance. ...Not a good thing, because this Payroll program had a higher business priority than everything else on the system at the same time.
The End.
So, some wrap-up:
...But the hit ratio stuff lives on, I think for two reasons: (1) Not enough people understand the response-time based diagnostic methods yet (but we're working on that!); and (2) Ratio-based techniques, when combined with time & materials-based billing contracts, are extremely effective at maximizing consulting revenue.
Cary Millsap
Hotsos Enterprises, Ltd.
cary.millsap_at_hotsos.com
http://www.hotsos.com
Reply-To: ORACLE-L_at_fatcity.com
To: Multiple recipients of list ORACLE-L <ORACLE-L_at_fatcity.com>
Date: Thu, 13 Jun 2002 08:44:04 -0800
You may want to search the archives of the list as well, I seem to recall Cary Millsap posting something not too long ago where they found that this "idle" wait event was a real problem.
If you can't find it, I can try to dig it out of my files
Rachel
-- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Rachel Carmichael INET: wisernet100_at_yahoo.com Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051 San Diego, California -- Public Internet access / Mailing Lists -------------------------------------------------------------------- To REMOVE yourself from this mailing list, send an E-Mail message to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing). _________________________________________________________________ Chat with friends online, try MSN Messenger: http://messenger.msn.com -- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Mohammad Rafiq INET: rafiq9857_at_hotmail.com Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051 San Diego, California -- Public Internet access / Mailing Lists -------------------------------------------------------------------- To REMOVE yourself from this mailing list, send an E-Mail message to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing).Received on Thu Jun 13 2002 - 12:03:36 CDT