Oracle-L: RE: I/O EVENTS

From: Jack Silvey <jack_silvey_at_YAHOO.COM>
Date: Tue, 21 May 2002 21:28:19 -0800
Message-ID: <F001.004671B4.20020521212819@fatcity.com>

Cary,

This is greatness.

Jack

Cary Millsap <cary.millsap_at_hotsos.com> wrote:
> I have an example for you (Anjo, I hope you won't
> mind). A prospect we
> visited once upon a time had been fighting a
> performance problem with an
> Oracle Payroll program. They "knew" what their
> problem was: very
> clearly, v$system_event was telling them that their
> overwhelmingly
> dominant system performance problem was waits for
> "latch free." (Well,
> no it wasn't. It was waits for the Oracle Payroll
> program. ;) )
>
> Actually, the v$system_event view indicated clearly
> that the top wait
> event was "SQL*Net message from client" (and some
> other "SQL*Net %"
> stuff), but of course every good DBA knows that you
> have to discard all
> the "SQL*Net" events because they are "idle" events.
>
> Three months of trying and failing to fix the
> problem led finally to a
> last-resort, desperation hardware upgrade. Twelve
> 700MHz CPUs became
> twelve new 1,024MHz CPUs. (Don't ask why "waits for
> latch free" was
> translated as "need to upgrade CPUs.") And
> unfortunately, the Payroll
> program got *slower*. Not just "seemed
> slower"--*was* slower.
>
> After the upgrade failed to help, these guys finally
> let us come see
> them. We traced the Payroll process (10046, level
> 8), very carefully
> initiating the trace immediately after the
> initiation of the Payroll
> run, and very carefully terminating the trace
> immediately after the
> completion of the Payroll run. The key here is that
> every second of data
> in the trace file was a second of somebody's
> response time.
>
> The result was a really fun story. The thing ran in
> about 33 minutes.
> Here's a summary of what was in the session's trace
> file:
>
> Oracle Kernel Event Duration
> Calls Avg
>
> ------------------------------ ------------------
> -------- ----------
> SQL*Net message from client 984.01s 49.6%
> 95,161 0.010340s
> SQL*Net more data from client 418.82s 21.1%
> 3,345 0.125208s
> db file sequential read 279.54s 14.1%
> 45,085 0.006200s
> CPU service 248.69s 12.5%
> 222,760 0.001116s
> unaccounted-for 27.73s 1.4%
>
> latch free 23.69s 1.2%
> 34,695 0.000683s
> log file sync 1.09s 0.1%
> 506 0.002154s
> SQL*Net more data to client 0.83s 0.0%
> 15,982 0.000052s
> log file switch completion 0.28s 0.0%
> 3 0.093333s
> enqueue 0.25s 0.0%
> 106 0.002358s
> SQL*Net message to client 0.24s 0.0%
> 95,161 0.000003s
> buffer busy waits 0.22s 0.0%
> 67 0.003284s
> db file scattered read 0.01s 0.0%
> 2 0.005000s
> SQL*Net break/reset to client 0.00s 0.0%
> 2 0.000000s
> ------------------------------ ------------------
> -------- ----------
> Total 1,985.40s 100.0%
>
> Well, waits for "latch free" accounted for only 1.2%
> of the total
> response time. If the DBAs had been completely
> successful in eliminating
> "latch free" waits from their system, it would have
> made only a 1.2%
> difference in the runtime of this program. Point 1:
> <emphasis>This
> program's bottleneck is not the same as its system's
> system-wide average
> bottleneck.</emphasis> This happens all the time.
> You can NOT see this
> type of problem when you look only at system-wide
> average statistics.
>
> You probably didn't pay much heed to the first thing
> you should have
> noticed: slightly more than 70% of this program's
> runtime was consumed
> by so-called "server idle" time. ??! You can't
> ignore this time, even if
> people do call it "idle." Idle or not, this time was
> part of someone's
> response time, so we need to deal with it. Now if we
> hadn't collected
> our statistics so carefully (with proper time scope
> and proper program
> scope), then we would have seen probably much more
> "SQL*Net message from
> client" time in our data. If you make that
> particular collection error,
> then you *have* to disregard the so-called "idle"
> wait time.
>
> What was the Payroll program's problem? A couple of
> things. First, we
> knew that the Payroll program was running on the
> same host as Oracle.
> But the latency per call on the "SQL*Net message
> from client" event
> looked suspiciously LAN-like (order of 10ms), not
> IPC-like (order of 1ms
> or less). So we checked the tnsnames.ora file. Turns
> out that to
> conserve administration effort, the DBAs had decided
> to use a single
> tnsnames.ora file system-wide. So the Payroll
> program was using the same
> TCP/IP protocol that the system's PC clients were
> using. Just by
> switching the protocol to BEQ instead of TCP in the
> db server's
> tnsnames.ora file (a change that took about 15
> minutes to test, 20
> seconds to implement, and about a week to navigate
> through change
> control), we eliminated about 50% of the program's
> runtime.
>
> Next, what's with the "SQL*Net more data from
> client"? The developers of
> this particular Payroll program did a pretty bad
> thing in their code:
> they passed several really long (10KB) SQL texts
> from the client to the
> server in their parse calls (oh yeh, and with no
> bind variables either).
> Anyway, each time the Oracle kernel received a
> packet containing only
> part of a SQL statement, it tallied to "SQL*Net more
> data from client"
> while it waited for the next piece. Of course,
> communicating via the
> network interface card instead of IPC didn't help,
> but passing huge SQL
> statements from the app to the db is bad form to
> begin with (see Ixora
> for more detail on just how bad this is).
>
> Finally, realize that Oracle tallies to "SQL*Net
> message from client"
> pretty much every time it's "between" working on db
> call fulfillment.
> Therefore, there were on the order of 95,161 db
> calls during this
> 33-minute period. Looking at the SQL, we discovered
> that of course,
> since the app developers didn't use bind variables,
> there was a parse
> call for almost every execute. This accounted for
> thousands of
> unnecessary parse calls. Next, each execute call
> manipulated only zero
> or one row, which added up to dozens of execute
> calls for each single
> SQL statement with a multi-row result source. The
> program could have
> performed the business function with a fraction of
> the db calls.
>
> Oh yes... Why did the Payroll program get *slower*
> after the upgrade?
> Little Payroll program time was spent using CPU, so
> the upgrade had very
> little direct positive effect upon the program. Most
> of the program's
> time was spent waiting for network latency. Other
> programs ran at the
> same time as this Payroll job. The CPU upgrade made
> *them* faster, which
> intensified *their* constant number of network calls
> into a smaller time
> window. The result was increased competition for the
> network during the
>
=== message truncated ===

Do You Yahoo!?
LAUNCH - Your Yahoo! Music Experience
http://launch.yahoo.com

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Jack Silvey
  INET: jack_silvey_at_yahoo.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Wed May 22 2002 - 00:28:19 CDT