Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
![]() |
![]() |
Home -> Community -> Mailing Lists -> Oracle-L -> HIGH CPU WITH MULTIPLE CONCURRENT USERS (long)
A co-worker is having a fairly serious issue with performance tuning of a system. The system is in the stress testing phase prior to rolling out into production. I have not included all the information as so far they have exceeded three TARs and are working on the fourth one right now. Oracle has become fairly heavily involved and is sending in the Advanced services team is now involved. He has identified that the main issue is a wait after the parsing of the SQL and during the fetch portion of the execution. The short version is running the same SQL statement ( basically nothing more than a simple query against a single table) the machine starts bogging down with a simulated 20+ users sessions and the system starts to choke at 100+ user sessions. We are talking a fairly decent midrange system. The query is a select with 5 columns extracted and a where clause that uses the in clause to select the same rows for each query. The question is has anyone seen this type of behavior before? If you have seen this before what was the root cause? Did you find a solution?
Oracle acknowledges that the scenario is reproducible within their test environment, but the core team is stating that it is working as designed. Oracle is working with us, but why not check with other sources.
A summary of where we are at: (4th TAR)
1.. We tried to simulate the same performance degradation on an entirely different environment. We have been able to do the same. 2.. We had requested Oracle to simulate the test case in their environment. They have been able simulate the performance degradation. Their analysis is also provided in this attachment. 3.. To summarize, they have simulated where 1 user query runs in 2 seconds and 10-user query takes 7 seconds on a 4-processor server. 4.. The development team of Oracle has answered to this degradation as normal and as designed. However, the degradation is very high and is in contrast with their alleged benchmark results (67000 transactions per minute on a 8 processor hardware). For us the degradation is so high that we are not able to run 150 transactions per minute on a 4-processor server. The simulation within oracle also supports this degradation --------------------------------------------------------------------------------
15-APR-02 22:09:08 GMT Pasting information into the tar on bug:2321553 <Abstract: HIGH CPU WITH MULTIPLE CONCURRENT USERS> since currently unavailable on MetaLink:
"PROBLEM:
Environment
The server is E4500, 4CPU of 450MHZ, 4GB memory with Hitachi SAN Storage array 9500 and 2GB cache. Sun Solaris 2.8 32 bit although this has also been tried on Intel Windows 2000 Oracle 8.1.7.3.0 although this has been tried on several other versions ranging from Oracle 8.1.7 - 9.0.1
11-APR-02 12:27:03 GMT
Can you easily recover from, bypass or work around the problem? = NO
Does your system or application continue normally after the problem occurs? =
NO
Are the standard features of the system or application still available; is the
loss of service minor? = NO
### Platform and O/S version, including patchset or service pack level? ### Sun Solaris 2.8 32 bit
### What version and patchset level of the database are you running? ### Oracle 8.1.7.3.0
### Please describe your problem: ###
We have a database that has been developed and configured. We are in the stage
of performing stress and volume testing. The database is an OLTP database with
at least 30 transactions (inserts, updates and deletes) coming from the various
application servers. These are very discreet transactions.
We also have a web server that accesses this database for reporting. The
reporting are huge select statements that run against these live tables.
When we run around 5 users, everything is fine and no problems. As we increase
the users to 50 and 100 and 300, we are seeing an appreciable degradation in
performance. The timings are like for 5 users the web select queries take under
5 seconds, while at 100 users it takes 80 seconds and at 300 users it takes 240
seconds.
I have stressed the database both in dedicated and MTS mode and the effect is
the same.
We have to complete stress testing at 1000 users and at this rate we will not
be able to make it. I am reading all the documentation in Metalink and I see
similar tars, but no answers.
I would appreciate it if you can provide us with some guidance as this will make
or break the project. I am attaching files of sar, vmstat, top, init and
statspack report for idle and load activity.
As you will be able to see, the CPU utlization is tremendous and even
unrealistic.
### Did the error generate a trace file? ### Does not apply
### Please list all files that you plan to upload: ### initedbcip1.ora - Init file
idle_top.txt - top at idle time. idle_vmstat.txt - vmstat at idle time. idle_saru.txt - sar - at idle time. idle_sarb.txt - sar -b at idle time. idle_sarw.txt - sar -w at idle time.
load_vmstat.txt - vmstat at load load_top.txt - top on load load_saru.txt - sar on load. load_sarb.txt - sar on load.
### What was being done at time of error? Any changes since this last worked? ## We were simulating and increasing the number of users connecting and accessing the database. This was part of the stress testing. The server is E4500, 4CPU of 450MHZ, 4GB memory with Hitachi SAN Storage array 9500 and 2GB cache.
### What is the frequency of the error? ### Consistently
### What is the impact to your business because of this problem? ### This will make or break the project. I am sure there are some configurations that I have set, that might need to be tweaked. We can also buy additional CPU and memory, only after we tweak the configuration. This seems to be a common issue and no clear cut solution. Please help us debug as I am running out of ideas.
11-APR-02 16:44:10 GMT Update:
Init.ora file shows:
db_files = 1024
open_cursors = 100
max_enabled_roles = 30
db_block_buffers = 76800
shared_pool_size = 1073741824 shared_pool_reserved_size = 107374182 cursor_sharing = EXACT
audit_trail = db audit_file_dest = /edbcip1/db14/security timed_statistics = true # if you want timed statisticsmax_dump_file_size = 100000 # limit trace file size to 5M each
oracle_trace_enable = true oracle_trace_collection_path=/db/app/oracle/product/8.1.7/otrace/admin/cdf oracle_trace_facility_path=/db/app/oracle/product/8.1.7/otrace/admin/fdfdb_block_size = 16384
sort_area_size = 65536 sort_area_retained_size = 65536 disk_asynch_io = FALSE
mts_servers = 300 mts_max_servers = 800 mts_max_dispatchers = 300 mts_dispatchers = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=cipedb1) (QUEUESIZE=32))) (DISPATCHERS=100) (POOL=ON) (TICK=1)
11-APR-02 16:48:16 GMT Update:
STATSPACK snapshot shows the following:
Top 5 Wait Events
~~~~~~~~~~~~~~~~~ Wait % Total
-------------------------------------------- ------------ ------------ -------latch free 49 286 26.05
and second snapshot shows:
Top 5 Wait Events
~~~~~~~~~~~~~~~~~ Wait % Total
-------------------------------------------- ------------ ------------ -------latch free 60 379 29.06
11-APR-02 19:49:47 GMT Update:
Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.
3) db file sequential read
This indicates a single block read is occurring
The session waits while a sequential read from the database is performed. This event is also used for rebuilding the controlfile, dumping datafile headers, and getting the database file headers.
Wait Time: The wait time is the actual time it takes to do the I/O.
11-APR-02 19:54:44 GMT ACTION PLAN:
Then run your stress tests.
Generate a new statspack snapshot if performance times are bad.
2) Are you using connection pooling on these web servers?
3) What type of application/web servers are you using?
4) If you run these queries via sqlplus locally on the database server, what is performance like?
12-APR-02 20:05:49 GMT Ct called in live.
They are now at the point where this issue has caused them to stop all forward progress. Ct has requested that the severity be raised to 1. Ct needs a resolution as soon as possible.
SUMMARY:
They are using a third party app that uses oci to access oracle. This app makes a sqlplus connection to the oracle database. They are firing a simple select statement, select * from <tablename>. They notice that the response time increases tremondously, after running this select statement 50 times consecutively.
They will be using 1000+ connections, so this increase in response time is going to hurt them.
They are, at the moment, NOT using MTS.
12-APR-02 20:18:14 GMT Ct was running 10 threads on the db server using sqlplus He saw 8 processes waiting, and the response time was 2.94 sec.
When he ran the query individually, it is taking 1.01seconds.
Ct also wanted to mention that TOP shows: Memory: 4096M real, 63M free, 3198M swap in use, 3895M swap free
PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND 11640 root 8 59 -20 7608K 5816K cpu/5 56:06 19.01% caiLogA2
10586 oracle 1 48 0 2512M 2489M run 0:22 0.94% oracle 10453 oracle 1 49 0 2513M 2493M run 0:25 0.87% oracle 10480 oracle 1 58 0 2513M 2493M run 0:23 0.87% oracle 10406 oracle 1 58 0 2512M 2490M run 0:26 0.87% oracle 10571 oracle 1 58 0 2513M 2493M run 0:21 0.87% oracle 10458 oracle 1 49 0 2513M 2492M run 0:24 0.86% oracle 10413 oracle 1 54 0 2513M 2492M run 0:26 0.82% oracle 10465 oracle 1 54 0 2513M 2492M run 0:24 0.79% oracle 10441 oracle 1 39 0 2512M 2489M run 0:24 0.78% oracle 10485 oracle 1 58 0 2513M 2493M run 0:24 0.77% oracle 10473 oracle 1 58 0 2513M 2492M run 0:22 0.77% oracle 10550 oracle 1 58 0 2513M 2492M run 0:23 0.76% oracle 10455 oracle 1 58 0 2512M 2490M run 0:24 0.75% oracle10468 oracle 1 490 2513M 2492M run 0:23 0.74% oracle
12-APR-02 21:08:16 GMT New info : Please find attached the statspack report for 1 user and 50 user. We also executed 10 simultaneous user requests from sql plus on the database server and the response time jumped from 1 second for 1 sql plus session to 3 seconds for 10 simultaneous sessions.
12-APR-02 23:58:15 GMT For starters, the 50 users report was run for 6.35 minutes while the other report was only run for 3.67 minutes. So ratio numbers and percentage should still compare but raw numbers may not.
Per transaction 50 1
Logical reads 322,555.50 48,387
block changes 514 593
Interesting to note that the above is per transaction, many more reads with more users but same number of block changes.
Wait event info:
Buffer busy waits 5,264 (66%) 440
The client also mentioned cpu usage. The cpu user per session is substantially higher in the 50 user test.
There is so much information in the trace files it is impossible to summarize here.
The client has simplified this to a question:
If you run the same select statement in 50 different sessions at the exact same time, should the first one be substantially faster then the last?
Considering the last one does not have to parse the statement and the answer is all in memory at this point, the client believes that it should not be slower, or at least not much.
Is there some other resource being waited on when 50 sessions are running identical statements at the same time?
We should attempt to reduce the buffer busy waits and go from there I think.
They also want to know why running this 50 user tests causes one of their cpus to peg at 100% usage.
Problem
Diagnostic Analysis
A large DB_BLOCK_SIZE with default initrans can be a problem.
Please review the following document, which has a section on buffer busy waits, and potential solutions. This should provide some assistance.
Doc ID: 62161.1
Next plan of action
@WIP 13-APR-02 04:31:39 GMT Having Vivek provide sql_trace for the batch job. One for one user, one for multiple users (20).
Export table and send to us with sql so we may attempt a test on our side. This may be useful so we may see if we can reproduce the problem here.
13-APR-02 05:08:24 GMT Spoke with RHACHEM on the issue and he mentioned that the db_block_buffers: 28400 (is puny)
Recommend increasing this value to at least 40k, if not doubling it.
This is what is probably causing the:
buffer is not pinned count: 884,397
By increasing the number of db_block_buffers, we should see a significant performance increase.
13-APR-02 05:44:47 GMT Already discussed the above traces with Vivek.
Bottom line is change the value of the DB_BLOCK_BUFFERS to a value much higher than the current setting of 28400. I recommend doubling it.
13-APR-02 05:54:33 GMT Will begin putting together a test case, but truly, the number of DB_BLOCK_BUFFERS is much too low at this point.
Please increase the value to at least 50000.
13-APR-02 06:29:55 GMT Total buffer size was DB_BLOCK_BUFFERS = 76800 SORT_AREA_SIZE was set to 1000000000 (1G)
The values have subsequentlly been decreased to 28400 and 500(meg) respectively.
Having Vivek alter the values to 76800 and 300(meg) respectively.
He is bouncing the DB and will attempt the execution again.
13-APR-02 07:29:01 GMT New info : Please note that SORT_AREA_SIZE was set to 1M and not 1G as mentioned.
13-APR-02 08:04:01 GMT New info : We have provided you with table exports (oracle.dmp). Please use these table and do some testing on your side. I have the same database structures configured on an NT server and I ran 1 thread and 20 threads. I saw the same performance degradation. 1 thread came back in 11 sec and 20 simultaneous threads came back in 83 secs.
13-APR-02 08:10:47 GMT If you monitor the session level waits, then you are focussing on what the problem is. Statspack is too general. I take it that the ct runs 1 piece of sql and it takes X time, but when he runs several sessions (with the same SQL??) , it takes ages. In that case, we need to get 10046 level 8 on the session(s). That way we will see what each one is waiting for. Get timed_statistics set on at DB level so that we can see the wait times. I think the ct mentioned something like 20 sessions, but really he needs to minimize that until he thinks there is a problem. What we will see , is waits appearing in the 'many' sessions which do not appear when we run a single session. Whatever those waits are will lead us to the resolution.
13-APR-02 15:04:01 GMT New info : Have I been switchd analyst again. Just top keep count, following are the analysts I have been bounced to:
1: Gus 2: Peter 3: Taylor 4: Michelle.
1: Increase SORT_AREA_SIZE. 2: INCREASE INITRANS on the table and index. 3: Reduce the number of rows in the block. 4: Increase the data_block_Buffers. 5: Pin the objects (table and index) into memory (buffer pool keep).The last suggestion was to set events 10046 for 1 user and 20 users. The trace files was supposed to provide clues of what wait events are there in addition to the ones for the 1 user. Not heard any feedback on the same. I have been able simulate the same situation in a Windows environment. I had uploaded the structures of the tables involved. Have you been able to simulate the same scenario in your lap.
Richard Eastham
-- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Richard Eastham INET: eastham_at_flash.net Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051 San Diego, California -- Public Internet access / Mailing Lists -------------------------------------------------------------------- To REMOVE yourself from this mailing list, send an E-Mail message to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing).Received on Wed Apr 17 2002 - 00:58:21 CDT
![]() |
![]() |