RE: Solaris T5220 server problem

From: Wolfson Larry - lwolfs <lawrence.wolfson_at_acxiom.com>
Date: Fri, 27 May 2011 23:31:33 +0000
Message-ID: <EDA437CAA8612C418E013CDA4B4A75513AAA675C_at_CWYIGMBCRP02.Corp.Acxiom.net>



An update on this. The problem is the way Solaris tries to build larger pages for new processes. When the server has been up for a long time, memory is severely fragmented. When a new process is started, the OS tries to provide large pages, and since it can't, it keeps trying and trying until it finally coalesces a bunch of small ones. And then it does the same for the next process and the next. We found another client with their server up over 2 years and experiencing the same problem. Fortunately they needed some patching, and after the reboot they were stunned at the improved performance. We've talked both of them into quarterly reboots, and the first one has page coalescing turned off.

It isn't in any Solaris doc that we could see. If anyone else has more information, let me know. Although it's not documented, you can see the setting with the mdb command.

As for the change being dynamic: we tried it on the fly twice and both times the servers crashed. Fortunately everything had been shut down beforehand for a reboot, so we weren't impacted. So just watch out for that.

It's easy to spot, especially if you have more than one server. Just do a truss -c sqlplus on both with a quick script and you'll see the difference.
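For anyone who wants to automate that comparison, here is a minimal sketch in Python. It is not something we actually ran. It assumes the truss -c summary from each server has been saved as text and has the usual layout: a syscall/seconds/calls/errors table followed by "sys totals:", "usr time:" and "elapsed:" lines. The function names and the sample figures are made up for illustration.

```python
def parse_truss_summary(text):
    """Parse `truss -c` summary text into (per-syscall dict, totals dict)."""
    syscalls, totals = {}, {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the column header and the dashed separator.
        if not fields or fields[0] == "syscall" or fields[0].startswith("-"):
            continue
        if ":" in line:
            # Trailer lines such as "sys totals:", "usr time:", "elapsed:".
            label, _, rest = line.partition(":")
            totals[label.strip()] = float(rest.split()[0])
            continue
        # Regular row: name, seconds, calls, optional errors column.
        syscalls[fields[0]] = (float(fields[1]), int(fields[2]))
    return syscalls, totals


def compare(slow, fast):
    """Return (name, slow_seconds, fast_seconds, ratio) for syscalls seen on both servers."""
    rows = []
    for name in sorted(set(slow) & set(fast)):
        s, f = slow[name][0], fast[name][0]
        rows.append((name, s, f, s / f if f > 0 else None))
    return rows


# Made-up sample summaries, only to show the expected shape of the input.
SLOW = """\
syscall               seconds   calls  errors
read                     .250      40
mmap                    2.000      60       1
                     --------  ------   ----
sys totals:             2.250     100      1
usr time:                .100
elapsed:                8.000
"""

FAST = """\
syscall               seconds   calls  errors
read                     .125      40
mmap                     .500      60       1
                     --------  ------   ----
sys totals:              .625     100      1
usr time:                .100
elapsed:                1.000
"""

if __name__ == "__main__":
    slow_calls, slow_totals = parse_truss_summary(SLOW)
    fast_calls, fast_totals = parse_truss_summary(FAST)
    for name, s, f, ratio in compare(slow_calls, fast_calls):
        print(name, s, f, ratio)
    print("elapsed:", slow_totals["elapsed"], "vs", fast_totals["elapsed"])
```

Sorting the rows by ratio instead of by name would put the syscalls that differ most between the two servers at the top.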

  Larry

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Wolfson Larry - lwolfs
Sent: Wednesday, April 27, 2011 7:15 PM
To: oracle-l_at_freelists.org
Subject: Solaris T5220 server problem

Hello!

Finally convinced the client that the long-running code wasn't a database, application, or network problem.

I noticed that one of my queries, which usually runs in a tenth of a second elapsed time, was taking about 8 seconds on the production server (8G, 32 CPUs), which has both 10.2.0.4 prod and test (separate ORACLE_HOMEs) on the same server.

I wanted the Unix admin to run some type of DTrace. I had already run truss a number of times. I didn't get that, but the SA found echo was running about 30-60 times longer on this server than on dozens of others we manage (most not T5220s). They ran GUDS, which didn't help, and then the support person came up with this from a buddy he reached out to.
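(An aside on the echo check: a rough way to repeat it on several servers is sketched below in Python. This is only an illustration, not what the SA ran. It assumes Python 3 is available on each box and that echo lives at /bin/echo; the function names and the run count of 20 are arbitrary. It measures the wall-clock time to start a trivial process and wait for it to exit, which is the part that was slow for us.)

```python
import statistics
import subprocess
import time


def time_spawn(cmd, runs=20):
    """Return wall-clock seconds for each of `runs` launches of cmd."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output; only the launch-and-exit time is of interest.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce a list of timings to min, median and max."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # /bin/echo is an assumed path; adjust for the server being tested.
    stats = summarize(time_spawn(["/bin/echo", "hello"]))
    print("min=%.4fs median=%.4fs max=%.4fs"
          % (stats["min"], stats["median"], stats["max"]))
```

Running the same script on a healthy server and on the suspect one gives two medians to compare directly.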

He suggested turning page coalescing off, which we found to be beneficial in many performance escalations. This is something you can do on the fly and if it's found to have a desirable effect, it can be permanently set in /etc/system. There are no known downsides to doing this in the real world.

Once this is enabled, could your DBAs run some test jobs which can be compared against timings for the same jobs when the test DB is down?

Here are the dirty details from previous communications on the topic:

quote --->

Large pages are not a problem. It is finding or coalescing them when none is available that needs improvement. The LPOOB feature is designed to improve application out-of-box performance. A number of LPOOB fixes have already been integrated in Sol10 U4 and more are planned for U5 and U6.

It is wiser to disable coalescing than to disable LPOOB. If you don't want page coalescing then set the following tunables dynamically or in the /etc/system file.

And
What I didn't mention before is that the page coalescing issue is specifically mentioned with the Niagara family of CPUs, which is what this T5220 is running, on systems running Java applications and Oracle databases (the Oracle part being pertinent here). Still not saying that it's definitively going to resolve the problems, but it's worth trying based on the system type, Oracle, and symptoms.

This is a dynamic change. The support person says we can easily toggle this back with no service interruption. The client is not buying that, and I was just wondering what experience anyone else has had with T5220s?

Support said they did this mostly for SAP, and while we run a number of SAPs, none are on this server, which I would categorize as relatively lightly loaded. Prod is far busier during the nightly batch window. Scheduled stats run well prior to that, for 3-13 minutes.

The server and database have been up close to 2 years, and they just noticed these processes running longer about 6 weeks ago. They put a new release in TEST but claim the problem started just prior to that. Not refuting that.

Thanks for any ideas, suggestions, experiences.

  Larry





--
http://www.freelists.org/webpage/oracle-l
Received on Fri May 27 2011 - 18:31:33 CDT
