yep, that's the one
- Mohammad Rafiq <rafiq9857_at_hotmail.com> wrote:
> I think Rachel is referring to the following email from Cary
> Millsap. It is a really nice explanation of this issue....
> Regards
> Rafiq
>
>
>
> I have an example for you (Anjo, I hope you won't mind). A prospect
> we visited once upon a time had been fighting a performance problem
> with an Oracle Payroll program. They "knew" what their problem was:
> very clearly, v$system_event was telling them that their
> overwhelmingly dominant system performance problem was waits for
> "latch free." (Well, no it wasn't. It was waits for the Oracle
> Payroll program. ;) )
>
> Actually, the v$system_event view indicated clearly that the top
> wait event was "SQL*Net message from client" (and some other
> "SQL*Net %" stuff), but of course every good DBA knows that you have
> to discard all the "SQL*Net" events because they are "idle" events.
>
> Three months of trying and failing to fix the problem led finally to
> a last-resort, desperation hardware upgrade. Twelve 700MHz CPUs
> became twelve new 1,024MHz CPUs. (Don't ask why "waits for latch
> free" was translated as "need to upgrade CPUs.") And unfortunately,
> the Payroll program got *slower*. Not just "seemed slower"--*was*
> slower.
>
> After the upgrade failed to help, these guys finally let us come see
> them. We traced the Payroll process (10046, level 8), very carefully
> initiating the trace immediately after the initiation of the Payroll
> run, and very carefully terminating the trace immediately after the
> completion of the Payroll run. The key here is that every second of
> data in the trace file was a second of somebody's response time.
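> For reference, a level-8 10046 trace (the level that includes wait
> events) can be switched on for your own session with statements like
> the ones below. Tracing *another* session typically goes through
> unsupported interfaces such as DBMS_SYSTEM.SET_EV, so treat this as
> a sketch of the mechanism, not a turnkey recipe:

```sql
-- Trace the current session, including wait events (level 8).
ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

-- ... run the workload you want to profile (here, the Payroll job) ...

-- Switch tracing off again; the trace file appears in user_dump_dest.
ALTER SESSION SET EVENTS '10046 trace name context off';
```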
>
> The result was a really fun story. The thing ran in about 33 minutes.
> Here's a summary of what was in the session's trace file:
>
> Oracle Kernel Event                      Duration    Calls        Avg
> ------------------------------ ------------------ -------- ----------
> SQL*Net message from client       984.01s  49.6%    95,161  0.010340s
> SQL*Net more data from client     418.82s  21.1%     3,345  0.125208s
> db file sequential read           279.54s  14.1%    45,085  0.006200s
> CPU service                       248.69s  12.5%   222,760  0.001116s
> unaccounted-for                    27.73s   1.4%
> latch free                         23.69s   1.2%    34,695  0.000683s
> log file sync                       1.09s   0.1%       506  0.002154s
> SQL*Net more data to client         0.83s   0.0%    15,982  0.000052s
> log file switch completion          0.28s   0.0%         3  0.093333s
> enqueue                             0.25s   0.0%       106  0.002358s
> SQL*Net message to client           0.24s   0.0%    95,161  0.000003s
> buffer busy waits                   0.22s   0.0%        67  0.003284s
> db file scattered read              0.01s   0.0%         2  0.005000s
> SQL*Net break/reset to client       0.00s   0.0%         2  0.000000s
> ------------------------------ ------------------ -------- ----------
> Total                           1,985.40s 100.0%
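> The headline figures discussed below fall straight out of this
> table; a quick check of the arithmetic, using only numbers that
> appear above:

```python
# Re-deriving the headline percentages from the profile table above.
total = 1985.40  # total response time, seconds

latch_free = 23.69
sqlnet_from_client = 984.01 + 418.82  # "message" + "more data" from client

latch_pct = 100 * latch_free / total
sqlnet_pct = 100 * sqlnet_from_client / total

print(f"latch free:          {latch_pct:.1f}% of response time")   # 1.2%
print(f"SQL*Net from client: {sqlnet_pct:.1f}% of response time")  # 70.7%
```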
>
> Well, waits for "latch free" accounted for only 1.2% of the total
> response time. If the DBAs had been completely successful in
> eliminating "latch free" waits from their system, it would have made
> only a 1.2% difference in the runtime of this program. Point 1:
> <emphasis>This program's bottleneck is not the same as its system's
> system-wide average bottleneck.</emphasis> This happens all the
> time. You can NOT see this type of problem when you look only at
> system-wide average statistics.
>
> You probably didn't pay much heed to the first thing you should have
> noticed: slightly more than 70% of this program's runtime was
> consumed by so-called "server idle" time. ??! You can't ignore this
> time, even if people do call it "idle." Idle or not, this time was
> part of someone's response time, so we need to deal with it. Now if
> we hadn't collected our statistics so carefully (with proper time
> scope and proper program scope), then we would probably have seen
> much more "SQL*Net message from client" time in our data. If you
> make that particular collection error, then you *have* to disregard
> the so-called "idle" wait time.
>
> What was the Payroll program's problem? A couple of things. First,
> we knew that the Payroll program was running on the same host as
> Oracle. But the latency per call on the "SQL*Net message from
> client" event looked suspiciously LAN-like (order of 10ms), not
> IPC-like (order of 1ms or less). So we checked the tnsnames.ora
> file. Turns out that to conserve administration effort, the DBAs had
> decided to use a single tnsnames.ora file system-wide. So the
> Payroll program was using the same TCP/IP protocol that the system's
> PC clients were using. Just by switching the protocol to BEQ instead
> of TCP in the db server's tnsnames.ora file (a change that took
> about 15 minutes to test, 20 seconds to implement, and about a week
> to navigate through change control), we eliminated about 50% of the
> program's runtime.
>
> Next, what's with the "SQL*Net more data from client"? The
> developers of this particular Payroll program did a pretty bad thing
> in their code: they passed several really long (10KB) SQL texts from
> the client to the server in their parse calls (oh yeah, and with no
> bind variables either). Anyway, each time the Oracle kernel received
> a packet containing only part of a SQL statement, it tallied to
> "SQL*Net more data from client" while it waited for the next piece.
> Of course, communicating via the network interface card instead of
> IPC didn't help, but passing huge SQL statements from the app to the
> db is bad form to begin with (see Ixora for more detail on just how
> bad this is).
>
> Finally, realize that Oracle tallies to "SQL*Net message from
> client" pretty much every time it's "between" working on db call
> fulfillment. Therefore, there were on the order of 95,161 db calls
> during this 33-minute period. Looking at the SQL, we discovered that
> of course, since the app developers didn't use bind variables, there
> was a parse call for almost every execute. This accounted for
> thousands of unnecessary parse calls. Next, each execute call
> manipulated only zero or one row, which added up to dozens of
> execute calls for each single SQL statement with a multi-row result
> source. The program could have performed the business function with
> a fraction of the db calls.
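> The call-count point generalizes beyond Oracle. A small sketch in
> Python, using the standard library's sqlite3 as a stand-in database
> (table and row counts invented): the first loop mimics the Payroll
> code's pattern -- a distinct SQL text, hence a parse, per row --
> while the second parses one statement with bind variables and
> executes it in bulk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pay_runs (emp_id INTEGER, amount REAL)")

rows = [(i, 100.0 + i) for i in range(1000)]

# Anti-pattern: a fresh SQL text per row forces a parse and an
# execute call for every single row (1,000 parses, 1,000 executes).
for emp_id, amount in rows:
    conn.execute(f"INSERT INTO pay_runs VALUES ({emp_id}, {amount})")

# Better: one parsed statement with bind variables, executed in bulk.
conn.executemany("INSERT INTO pay_runs VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM pay_runs").fetchone()[0])  # 2000
```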
>
> Oh yes... Why did the Payroll program get *slower* after the
> upgrade? Little Payroll program time was spent using CPU, so the
> upgrade had very little direct positive effect upon the program.
> Most of the program's time was spent waiting for network latency.
> Other programs ran at the same time as this Payroll job. The CPU
> upgrade made *them* faster, which intensified *their* constant
> number of network calls into a smaller time window. The result was
> increased competition for the network during the Payroll run.
> Therefore, every network I/O call the Payroll program made was a
> little slower than before the CPU upgrade. The increase in network
> response time overwhelmed the tiny direct improvement of the CPU
> time reduction, resulting in a net degradation of Payroll
> performance. ...Not a good thing, because this Payroll program had a
> higher business priority than everything else on the system at the
> same time.
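> The mechanism can be sketched with the simplest queueing model
> (M/M/1). The rates below are invented purely to illustrate the
> shape of the effect: hold the network's service rate fixed and raise
> the arrival rate -- as the faster competing programs effectively did
> -- and the latency of every call climbs sharply.

```python
# M/M/1 mean response time per call: W = 1 / (mu - lambda).
# All rates below are hypothetical illustration, not measured data.
mu = 1000.0                         # network calls/sec the link can serve

for lam in (800.0, 950.0):          # arrival rate before vs. after upgrade
    w_ms = 1000.0 / (mu - lam)      # mean milliseconds per network call
    print(f"arrivals {lam:5.0f}/s -> {w_ms:5.1f} ms per call")
# A modest rise in arrival rate (800 -> 950/s) quadruples per-call
# latency (5 ms -> 20 ms), even though the link is never saturated.
```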
>
> The End.
>
> * * *
>
> So, some wrap-up:
>
> 1. Looking at v$session_event is better than looking at
> v$system_event for these things. When you collect system-wide data
> to solve a single program's performance problem, your irrelevant
> data can bury your relevant data. We see this all the time. Any time
> you query v$system_event and ask about the top-line events you see,
> you're vulnerable to this problem.
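> A minimal query of that shape might look like the following (the
> column names are the standard v$session_event ones; the :sid bind is
> whatever session you're investigating):

```sql
-- Per-session wait profile for one session. On Oracle releases of
-- this era, TIME_WAITED is reported in centiseconds.
SELECT event, total_waits, time_waited
  FROM v$session_event
 WHERE sid = :sid
 ORDER BY time_waited DESC;
```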
>
> 2. Furthermore, looking even at v$session_event tells you NOTHING
> about the user's experience. When a v$ query tells you that there's
> "SQL*Net message from client" time, you can't know without more data
> whether this stuff showed up in someone's response time, or if it
> corresponded to time when the USER was actually idle.
>
> 3. Another important reason that looking at v$session_event data can
> mislead you is that there's no record of CPU time in there. To find
> that, you have to go to something like v$sesstat. The "CPU used by
> this session" statistic is supposed to give you the CPU time you
> need, but various bugs prevent you from getting what you need in
> many cases.
>
> 4. Response time = CPU from sesstat + waits from session_event + an
> unaccounted-for gap in time. Lots of people believe that the "gap"
> is a deficiency of 10046 data: it's not. The irony is that the same
> deficiency is even worse in v$ table methods; you just can't tell
> that there even *is* a gap: you can't compute gap=R-(cpu+waits),
> because with v$-based methods you don't know what R is. This "gap"
> is one of those things that we've discovered how to interpret
> reliably after some work. For more information, see some of our
> Hotsos Profiler documentation at www.hotsos.com/dnloads/1/kevents
> and click "unaccounted-for" (the document is free, but you have to
> register). (Bear with us, most of the event documentation on our
> kevents page is pre-beta, but the "unaccounted-for" document is
> pretty much complete.)
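> Point 4's identity can be checked against the session profile
> earlier in this message; the residual lands exactly on the
> "unaccounted-for" line of the table:

```python
# R = CPU + sum(wait events) + unaccounted-for gap, using the figures
# from the Payroll session profile earlier in this message.
R = 1985.40
cpu = 248.69
waits = [984.01, 418.82, 279.54, 23.69, 1.09, 0.83, 0.28,
         0.25, 0.24, 0.22, 0.01, 0.00]

gap = R - (cpu + sum(waits))
print(f"unaccounted-for: {gap:.2f}s")  # matches the 27.73s in the profile
```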
>
=== message truncated ===
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Rachel Carmichael
INET: wisernet100_at_yahoo.com
Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
San Diego, California -- Public Internet access / Mailing Lists
Received on Thu Jun 13 2002 - 19:28:12 CDT