Oracle-L: HIGH CPU WITH MULTIPLE CONCURRENT USERS (long)

From: Richard Eastham <eastham_at_flash.net>
Date: Tue, 16 Apr 2002 21:58:21 -0800
Message-ID: <F001.00446101.20020416215821@fatcity.com>

A co-worker is having a fairly serious issue with performance tuning of a system. The system is in the stress testing phase prior to rolling out into production. I have not included all the information as so far they have exceeded three TARs and are working on the fourth one right now. Oracle has become fairly heavily involved and is sending in the Advanced services team is now involved. He has identified that the main issue is a wait after the parsing of the SQL and during the fetch portion of the execution. The short version is running the same SQL statement ( basically nothing more than a simple query against a single table) the machine starts bogging down with a simulated 20+ users sessions and the system starts to choke at 100+ user sessions. We are talking a fairly decent midrange system. The query is a select with 5 columns extracted and a where clause that uses the in clause to select the same rows for each query. The question is has anyone seen this type of behavior before? If you have seen this before what was the root cause? Did you find a solution?

Oracle acknowledges that the scenario is reproducible within their test environment, but the core team is stating that it is working as designed. Oracle is working with us, but why not check with other sources.

A summary of where we are at: (4th TAR)

  1.. We tried to simulate the same performance degradation on an entirely different environment. We have been able to do the same. 
  2.. We had requested Oracle to simulate the test case in their environment. They have been able simulate the performance degradation. Their analysis is also provided in this attachment. 
  3.. To summarize, they have simulated where 1 user query runs in 2 seconds and 10-user query takes 7 seconds on a 4-processor server. 
  4.. The development team of Oracle has answered to this degradation as normal and as designed. However, the degradation is very high and is in contrast with their alleged benchmark results (67000 transactions per minute on a 8 processor hardware). For us the degradation is so high that we are not able to run 150 transactions per minute on a 4-processor server. The simulation within oracle also supports this degradation



--------------------------------------------------------------------------------

15-APR-02 22:09:08 GMT Pasting information into the tar on bug:2321553 <Abstract: HIGH CPU WITH MULTIPLE CONCURRENT USERS> since currently unavailable on MetaLink:

"PROBLEM:

Customer has a production database that was installed on a Sun Solaris 2.8. The Solaris was a fresh install. The database was a fresh install. Customer is having the following problems:
.

1. Performance problems with multiple users - more users more performance problems
2. The query runs fine, explain plan runs fine, query just takes moer time with more users - same query
3. Customer tested multi-user connection from the box via sqlplus ( no network ) - same issue
4. Customer removed the application from the env and ran multi-user test - same problem.
5. Customer loaded data in another 8.1.7 database on Win 2000 - same performance problem with more users.
.
.

DIAGNOSTIC ANALYSIS:

Analysis by NSAWYER.US

I have looked over the RDA and have found nothing that would cause the CPU to run high. - the more I review the issue the more it seems to be less a tuning issue and more a memory leak problem. But cannot find any report of it in the trace.
.

Looking for any bugs that could be related to this type of behavior. - I have found none.
.

Analysis

Reviewed the statpack and noticed that a query parses and retrieves for a split second, but the data is not return for 10 to 20 seconds. What is going during that extra time. This might be the place to concentrate on.
.

WORKAROUND:

None
.

INFORMATION:
SERVER: E4500, 4 CPU 400MHZ, 4GB RAM (Oracle SGA 1.6GB).
.

PROBLEM STATEMENT: Per customer,
There is a definite overload on the CPU when we run 10, 20, 100 user loads. In
.

fact during this test of 20 simultaneous users executing this simple one table
.

query, the processes waiting went way high. I can see the CPU trashing and 0% idle, 99% user, 1% kernel almost all the way.
.

IMPORTANT: Customer observation is that when 100 connection started exactly at the same time, response varies, but recursive connections show same resp.

Environment

The server is E4500, 4CPU of 450MHZ, 4GB memory with Hitachi SAN Storage array 9500 and 2GB cache. Sun Solaris 2.8 32 bit although this has also been tried on Intel Windows 2000 Oracle 8.1.7.3.0 although this has been tried on several other versions ranging from Oracle 8.1.7 - 9.0.1

11-APR-02 12:27:03 GMT Can you easily recover from, bypass or work around the problem? = NO Does your system or application continue normally after the problem occurs? = NO
Are the standard features of the system or application still available; is the loss of service minor? = NO

### Platform and O/S version, including patchset or service pack level? ### Sun Solaris 2.8 32 bit

### What version and patchset level of the database are you running? ### Oracle 8.1.7.3.0

### Please describe your problem: ###
We have a database that has been developed and configured. We are in the stage of performing stress and volume testing. The database is an OLTP database with at least 30 transactions (inserts, updates and deletes) coming from the various application servers. These are very discreet transactions. We also have a web server that accesses this database for reporting. The reporting are huge select statements that run against these live tables. When we run around 5 users, everything is fine and no problems. As we increase the users to 50 and 100 and 300, we are seeing an appreciable degradation in performance. The timings are like for 5 users the web select queries take under 5 seconds, while at 100 users it takes 80 seconds and at 300 users it takes 240 seconds.
I have stressed the database both in dedicated and MTS mode and the effect is the same.
We have to complete stress testing at 1000 users and at this rate we will not be able to make it. I am reading all the documentation in Metalink and I see similar tars, but no answers.
I would appreciate it if you can provide us with some guidance as this will make or break the project. I am attaching files of sar, vmstat, top, init and statspack report for idle and load activity. As you will be able to see, the CPU utlization is tremendous and even unrealistic.

### Did the error generate a trace file? ### Does not apply

### Please list all files that you plan to upload: ### initedbcip1.ora - Init file

idle_top.txt - top at idle time.
idle_vmstat.txt - vmstat at idle time.
idle_saru.txt - sar - at idle time.
idle_sarb.txt - sar -b at idle time.
idle_sarw.txt - sar -w at idle time.

stats_edbcip1.2002-04-14.0500 - statspack report during load.

load_vmstat.txt - vmstat at load
load_top.txt - top on load
load_saru.txt - sar on load.
load_sarb.txt - sar on load.

### What was being done at time of error? Any changes since this last worked? ## We were simulating and increasing the number of users connecting and accessing the database. This was part of the stress testing. The server is E4500, 4CPU of 450MHZ, 4GB memory with Hitachi SAN Storage array 9500 and 2GB cache.

### What is the frequency of the error? ### Consistently

### What is the impact to your business because of this problem? ### This will make or break the project. I am sure there are some configurations that I have set, that might need to be tweaked. We can also buy additional CPU and memory, only after we tweak the configuration. This seems to be a common issue and no clear cut solution. Please help us debug as I am running out of ideas.

11-APR-02 16:44:10 GMT Update:

Init.ora file shows:

db_files = 1024
open_cursors = 100
max_enabled_roles = 30
db_block_buffers = 76800

shared_pool_size = 1073741824
shared_pool_reserved_size = 107374182
cursor_sharing = EXACT

session_cached_cursors = 0
_sqlexec_progression_cost = 0
large_pool_size = 209715200
java_pool_size = 20971520
buffer_pool_keep = (buffers:9600, lru_latches:32) log_checkpoint_interval = 10000
log_checkpoint_timeout = 7200
processes = 1000
log_buffer = 2097152

audit_trail = db
audit_file_dest = /edbcip1/db14/security
timed_statistics = true # if you want timed statistics

max_dump_file_size = 100000 # limit trace file size to 5M each

oracle_trace_enable = true
oracle_trace_collection_path=/db/app/oracle/product/8.1.7/otrace/admin/cdf
oracle_trace_facility_path=/db/app/oracle/product/8.1.7/otrace/admin/fdf

db_block_size = 16384
job_queue_processes = 20
job_queue_interval = 60
distributed_transactions = 10
open_links = 4
compatible = "8.1.7"

sort_area_size = 65536
sort_area_retained_size = 65536
disk_asynch_io = FALSE

db_writer_processes = 4 #Set to the number of CPU and less than db_block_lru db_block_lru_latches = 288 # 2 * no of cpu as per oracle. did not work with 8 _spin_count = 40 # Used in latch contention, multiple of cpu.

mts_servers = 300 
mts_max_servers = 800 
mts_max_dispatchers = 300
mts_dispatchers = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=cipedb1)
(QUEUESIZE=32)))
(DISPATCHERS=100)
(POOL=ON)
(TICK=1)

(CONNECTIONS=10)
(SESSIONS=10)"
mts_listener_address = "(ADDRESS=(PROTOCOL=TCP)(HOST=cipedb1)(PORT=1521))"

11-APR-02 16:48:16 GMT Update:

STATSPACK snapshot shows the following:

Top 5 Wait Events

~~~~~~~~~~~~~~~~~ Wait % Total

Event Waits Time (cs) Wt Time

-------------------------------------------- ------------ ------------ -------

latch free 49 286 26.05
log file parallel write 12 252 22.95
db file sequential read 122 181 16.48
sort segment request 1 109 9.93
refresh controlfile command 10 75 6.83

and second snapshot shows:

Top 5 Wait Events

~~~~~~~~~~~~~~~~~ Wait % Total

Event Waits Time (cs) Wt Time

-------------------------------------------- ------------ ------------ -------

latch free 60 379 29.06
log file parallel write 21 280 21.47
db file sequential read 123 183 14.03
sort segment request 1 109 8.36
control file parallel write 95 95 7.29

11-APR-02 19:49:47 GMT Update:

latch free A wait for a latch free occurs when a server requests a latch and is unable to immediately acquire that latch. Review Note # 34576.1 titled: WAITEVENT: "latch free" Reference Note
log file parallel write Writing redo records to the redo log files from the log buffer.

Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.

3) db file sequential read
This indicates a single block read is occurring The session waits while a sequential read from the database is performed. This event is also used for rebuilding the controlfile, dumping datafile headers, and getting the database file headers.

Wait Time: The wait time is the actual time it takes to do the I/O.

11-APR-02 19:54:44 GMT ACTION PLAN:

To rule out the possibility that MTS is causing a problem, please comment out ALL of the mts parameters in the init.ora file. We do not want to use MTS.

Then run your stress tests.

Generate a new statspack snapshot if performance times are bad.

2) Are you using connection pooling on these web servers?

3) What type of application/web servers are you using?

4) If you run these queries via sqlplus locally on the database server, what is performance like?

12-APR-02 20:05:49 GMT Ct called in live.

They are now at the point where this issue has caused them to stop all forward progress. Ct has requested that the severity be raised to 1. Ct needs a resolution as soon as possible.

SUMMARY:

They are using a third party app that uses oci to access oracle. This app makes a sqlplus connection to the oracle database. They are firing a simple select statement, select * from <tablename>. They notice that the response time increases tremondously, after running this select statement 50 times consecutively.

They will be using 1000+ connections, so this increase in response time is going to hurt them.

They are, at the moment, NOT using MTS.

12-APR-02 20:18:14 GMT Ct was running 10 threads on the db server using sqlplus He saw 8 processes waiting, and the response time was 2.94 sec.

When he ran the query individually, it is taking 1.01seconds.

Ct also wanted to mention that TOP shows: Memory: 4096M real, 63M free, 3198M swap in use, 3895M swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND 11640 root 8 59 -20 7608K 5816K cpu/5 56:06 19.01% caiLogA2

10586 oracle 1 48 0 2512M 2489M run 0:22 0.94% oracle
10453 oracle 1 49 0 2513M 2493M run 0:25 0.87% oracle
10480 oracle 1 58 0 2513M 2493M run 0:23 0.87% oracle
10406 oracle 1 58 0 2512M 2490M run 0:26 0.87% oracle
10571 oracle 1 58 0 2513M 2493M run 0:21 0.87% oracle
10458 oracle 1 49 0 2513M 2492M run 0:24 0.86% oracle
10413 oracle 1 54 0 2513M 2492M run 0:26 0.82% oracle
10465 oracle 1 54 0 2513M 2492M run 0:24 0.79% oracle
10441 oracle 1 39 0 2512M 2489M run 0:24 0.78% oracle
10485 oracle 1 58 0 2513M 2493M run 0:24 0.77% oracle
10473 oracle 1 58 0 2513M 2492M run 0:22 0.77% oracle
10550 oracle 1 58 0 2513M 2492M run 0:23 0.76% oracle
10455 oracle 1 58 0 2512M 2490M run 0:24 0.75% oracle

10468 oracle 1 490 2513M 2492M run 0:23 0.74% oracle

12-APR-02 21:08:16 GMT New info : Please find attached the statspack report for 1 user and 50 user. We also executed 10 simultaneous user requests from sql plus on the database server and the response time jumped from 1 second for 1 sql plus session to 3 seconds for 10 simultaneous sessions.

12-APR-02 23:58:15 GMT For starters, the 50 users report was run for 6.35 minutes while the other report was only run for 3.67 minutes. So ratio numbers and percentage should still compare but raw numbers may not.

Per transaction 50 1
Logical reads 322,555.50 48,387
block changes 514 593

Interesting to note that the above is per transaction, many more reads with more users but same number of block changes.

Wait event info:
Buffer busy waits 5,264 (66%) 440

The client also mentioned cpu usage. The cpu user per session is substantially higher in the 50 user test.

There is so much information in the trace files it is impossible to summarize here.

The client has simplified this to a question:

If you run the same select statement in 50 different sessions at the exact same time, should the first one be substantially faster then the last?

Considering the last one does not have to parse the statement and the answer is all in memory at this point, the client believes that it should not be slower, or at least not much.

Is there some other resource being waited on when 50 sessions are running identical statements at the same time?

We should attempt to reduce the buffer busy waits and go from there I think.

They also want to know why running this 50 user tests causes one of their cpus to peg at 100% usage.

Problem

By reviewing the statspack report (stats_50user.txt), buffer busy waits, was reported as the culprit (highest) wait time.

Diagnostic Analysis

Buffer busy waits for data blocks are often due to several processes repeatedly reading the same blocks (eg: if lots of people scan the same index) - the first session processes the blocks that are in the buffer cache quickly but then a block has to be read from disk - the other sessions (scanning the same index) quickly 'catch up' and want the block which is currently being read from disk - they wait for the buffer as someone is already reading the block in.

A large DB_BLOCK_SIZE with default initrans can be a problem.

Please review the following document, which has a section on buffer busy waits, and potential solutions. This should provide some assistance.

Doc ID: 62161.1

Next plan of action

Determine why there is high cpu utilization on the machine that is caused by Oracle. Also, provide recommendations of what values to set to reduce the number of buffer busy waits, when the 50 users execute the same select statement.

@WIP 13-APR-02 04:31:39 GMT Having Vivek provide sql_trace for the batch job. One for one user, one for multiple users (20).

Export table and send to us with sql so we may attempt a test on our side. This may be useful so we may see if we can reproduce the problem here.

13-APR-02 05:08:24 GMT Spoke with RHACHEM on the issue and he mentioned that the db_block_buffers: 28400 (is puny)

Recommend increasing this value to at least 40k, if not doubling it. This is what is probably causing the:
buffer is not pinned count: 884,397

By increasing the number of db_block_buffers, we should see a significant performance increase.

13-APR-02 05:44:47 GMT Already discussed the above traces with Vivek.

Bottom line is change the value of the DB_BLOCK_BUFFERS to a value much higher than the current setting of 28400. I recommend doubling it.

13-APR-02 05:54:33 GMT Will begin putting together a test case, but truly, the number of DB_BLOCK_BUFFERS is much too low at this point.

Please increase the value to at least 50000.

13-APR-02 06:29:55 GMT Total buffer size was DB_BLOCK_BUFFERS = 76800 SORT_AREA_SIZE was set to 1000000000 (1G)

The values have subsequentlly been decreased to 28400 and 500(meg) respectively.

Having Vivek alter the values to 76800 and 300(meg) respectively.

He is bouncing the DB and will attempt the execution again.

13-APR-02 07:29:01 GMT New info : Please note that SORT_AREA_SIZE was set to 1M and not 1G as mentioned.

13-APR-02 08:04:01 GMT New info : We have provided you with table exports (oracle.dmp). Please use these table and do some testing on your side. I have the same database structures configured on an NT server and I ran 1 thread and 20 threads. I saw the same performance degradation. 1 thread came back in 11 sec and 20 simultaneous threads came back in 83 secs.

13-APR-02 08:10:47 GMT If you monitor the session level waits, then you are focussing on what the problem is. Statspack is too general. I take it that the ct runs 1 piece of sql and it takes X time, but when he runs several sessions (with the same SQL??) , it takes ages. In that case, we need to get 10046 level 8 on the session(s). That way we will see what each one is waiting for. Get timed_statistics set on at DB level so that we can see the wait times. I think the ct mentioned something like 20 sessions, but really he needs to minimize that until he thinks there is a problem. What we will see , is waits appearing in the 'many' sessions which do not appear when we run a single session. Whatever those waits are will lead us to the resolution.

13-APR-02 15:04:01 GMT New info : Have I been switchd analyst again. Just top keep count, following are the analysts I have been bounced to:

1: Gus
2: Peter
3: Taylor
4: Michelle.

Have I been bounced again. Looks like it is taking each analyst to review the tar and by the time any suggestions are due, it is time for the next analyst. Is there some analysts with performance experience who can address this issue. The scenario is as follows:
We have a E4550, 4CPU and 4GB memory database server. We have not been able to process 20 concurrent requests without degradation in performance. It looks like that oracle is serializing the requests even though they are the same requests issued multiple times and they are simple select queries. Oracle does not seem to be reading from data buffer cache in memory or is locking even the memory segment. This server was supposed to run 1000 users. After various stats pack reports and trace files, none of the suggestions have helped us so far. Suggestions include

1: Increase SORT_AREA_SIZE.
2: INCREASE INITRANS on the table and index.
3: Reduce the number of rows in the block.
4: Increase the data_block_Buffers.
5: Pin the objects (table and index) into memory (buffer pool keep).

The last suggestion was to set events 10046 for 1 user and 20 users. The trace files was supposed to provide clues of what wait events are there in addition to the ones for the 1 user. Not heard any feedback on the same. I have been able simulate the same situation in a Windows environment. I had uploaded the structures of the tables involved. Have you been able to simulate the same scenario in your lap.

Richard Eastham

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Richard Eastham
  INET: eastham_at_flash.net

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Wed Apr 17 2002 - 00:58:21 CDT