RE: System stats

From: <dimensional.dba_at_comcast.net>
Date: Sat, 13 Apr 2019 08:41:23 -0700
Message-ID: <074601d4f20f$5f7e9730$1e7bc590$_at_comcast.net>



Pretty much every SAN vendor has monitoring tools at the SAN level. The real question is whether or not you will get access to them in your company.

There is a variety of information you can collect from the OS side that will tell you that you have an I/O problem. Whether the storage itself is the problem, however, is a longer discussion.
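
As a rough sketch of that OS-side collection (assuming a Linux host with the sysstat package installed; the devices, intervals and paths are only illustrative), per-device latency is usually more telling than throughput:

  #!/bin/sh
  # Sample extended device statistics every 5 seconds for 10 minutes.
  # The await/r_await/w_await columns (ms per I/O) are what usually expose a
  # struggling SAN, even when throughput (rkB/s, wkB/s) looks modest.
  iostat -dxt 5 120 | tee /tmp/iostat_$(hostname)_$(date +%Y%m%d_%H%M).log

  # If sysstat history collection is enabled, sar lets you compare today's
  # latency against the same window on previous days.
  sar -d -p -s 08:00:00 -e 10:00:00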

For example, at one client I was doing database performance tuning and reported to the VP of Storage. We had Mladen's NetApps, and the VP refused to give us access to the storage monitoring tools. As is normally the case, the storage guys told us there was no problem on their side, so we had to go through a lot of alternate testing and information gathering to prove it was the storage, such as:

  1. Check the settings on the iSCSI adapters and test variations, not only on the server you are running on but on other servers whose storage does not point to the specific NetApps you are having trouble with, and test across all time periods of the day to show the patterns and wide variations in storage performance.
  2. Write code to measure and record I/O performance from inside the database and outside the database (because they will blame it on the database software). Test outside the database with a variety of test tools using shell, Java and C++, as at every level they will try to blame it on some perf tuning not performed at the app level. This way you can normally show there is minimal or no deviation in performance across the different pathways, and certainly nothing that matches the deviations in storage performance you are seeing (see the first sketch after this list).
  3. Try to obtain information from the network team on the pathways and switches that the storage traffic follows (this includes fiber pathways if you are using them for storage). Network personnel are normally worse than storage personnel: they usually just say the path shows no errors, or that you are only at 35%-50% utilization so you are good.
  4. Write code to test the network along the same pathways, or the response times to the switches along the route. Again, you have to use multifaceted code (shell, Java, C++) to eliminate the challenge that it must be something in the app (see the second sketch after this list).
  5. At least get a dump from the storage team of every server that is using the same SAN/NetApp as your database server, so you can run tests and collect OS metrics from each of those servers over time.
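
For point 2, here is a minimal sketch of the "inside vs. outside the database" comparison. It assumes sqlplus access as SYSDBA and a scratch directory on the same storage; the paths, sizes and event list are illustrative, not the exact harness I used:

  #!/bin/sh
  # Inside the database: average I/O wait latency as Oracle sees it.
  sqlplus -s / as sysdba <<'EOF'
  set lines 120 pages 100
  select event,
         total_waits,
         round(time_waited_micro/nullif(total_waits,0)/1000, 2) as avg_ms
  from   v$system_event
  where  event in ('db file sequential read', 'db file scattered read',
                   'log file parallel write');
  EOF

  # Outside the database: direct (unbuffered) I/O against the same storage,
  # bypassing the filesystem cache so the SAN is actually hit.
  dd if=/dev/zero of=/u01/scratch/ddtest.dat bs=1M count=1024 oflag=direct
  dd if=/u01/scratch/ddtest.dat of=/dev/null bs=1M iflag=direct
  rm -f /u01/scratch/ddtest.dat

Run the same pair at several times of day, on the affected server and on a server whose LUNs come from a different array, and keep the numbers; the comparison across pathways and time is what makes the case.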
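
For point 4, a sketch of the kind of repeated path measurement that is hard to argue with, assuming ICMP is allowed to the hops in question (the hostnames are placeholders):

  #!/bin/sh
  # Sample round-trip time to each hop on the storage path once a minute and
  # log it with a timestamp, so latency spikes can be lined up against the
  # database and iostat numbers later.
  HOPS="switch-a.example.com switch-b.example.com netapp-filer.example.com"
  while true; do
      for h in $HOPS; do
          rtt=$(ping -c 3 -q "$h" | awk -F'/' '/^rtt|^round-trip/ {print $5}')
          echo "$(date '+%Y-%m-%d %H:%M:%S') $h avg_rtt_ms=${rtt:-timeout}"
      done >> /tmp/storage_path_rtt.log
      sleep 60
  done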

After going through all that, the VP then requested that each of the vendors come in (blade server, virtualization, network, database, storage) to examine the setup for best practices. After 2 months of vendor swirl, with best-practice changes having nothing to do with storage, the storage vendor decided that we of course needed to go to their all-flash storage. We of course purchased a new unit, and the only thing I pressed for in that scenario was that if the problem really wasn't the storage, then all the servers on the original storage had to move to the new storage for an apples-to-apples vindication.

The storage of course didn't get its vindication: even with SSDs, performance was within 5% of what it was before, i.e. just as bad.

In the end the "bad server", as the storage guys put it, was moved to its own SAN, and everything else was happy at that point.

If each side would simply provide all of its monitoring information at the detailed level, not as hourly averages, you could resolve issues quickly. With SANs you always have to look at all connections to the storage when performing your analysis.
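
Failing that cooperation, the closest you can get from your own side is fine-grained collection on every server that shares the array (point 5 above). A rough sketch, assuming sysstat is installed and you can run a small script (e.g. via nohup at boot) on each host; paths and retention are arbitrary:

  #!/bin/sh
  # One-minute extended device stats, one file per host per day, so windows of
  # bad latency can be correlated across all servers attached to the same SAN.
  OUTDIR=/var/log/san_io
  mkdir -p "$OUTDIR"
  while true; do
      # 60-second interval, 1440 samples = roughly one day per file
      # (note: the first sample of each run is the since-boot average)
      iostat -dxt 60 1440 >> "$OUTDIR/$(hostname)_$(date +%Y%m%d).log"
  done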

From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> On Behalf Of Mladen Gogala
Sent: Friday, April 12, 2019 4:18 PM
To: oracle-l_at_freelists.org
Subject: Re: System stats  

You can monitor your SAN with the Nagios agent for NetApp. If your SAN is not NetApp, throw it away and purchase NetApp. Not knowing what SAN equipment you have, that's the best advice I can offer. It is general advice: if you are not using a NetApp SAN, throw it away and buy NetApp. NetApp has your data on tap, like beer.

Regards

Disclosure:

I don't work for NetApp, but I really, really like their equipment.  

On 4/12/19 10:23 AM, (Redacted sender Jay.Miller for DMARC) wrote:

Can you recommend any sort of monitoring to identify when a SAN is getting overloaded? In our case it only became apparent when an app started experiencing latency at the same time for 5-10 minutes every day and we tracked it down to a batch job which was running on an entirely different cluster but which shared the same storage unit. Storage denied it was their problem right up until the point we proved it was.  

It would have been nice to have known that before the problems started showing up. Getting a new storage unit is a slow process.      

Jay Miller

Sr. Oracle DBA

201.369.8355  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Neil Chandler
Sent: Tuesday, March 26, 2019 10:35 AM
To: Chris Taylor
Cc: gogala.mladen_at_gmail.com; ORACLE-L
Subject: Re: System stats

In the majority of places I have worked - 5 clients last year - the SANs were overloaded in 4 of them. They are too frequently sized for capacity rather than throughput/response time. The response time was inevitably variable, and system stats would not have been helpful on the systems they have. At one of the clients, some of the critical DBs have dedicated storage, but changing the system stats would have had little to no effect on those systems due to other measures already in place (including a low optimizer_index_cost_adj on one system, meaning lots of index use - just not necessarily the right indexes).

The optimizer tries to be all things to all people, and there are lots of parameters to twist it into the shape that you want. The problem is frequently the abuse of those parameters - especially the global ones - via googling a problem, believing a silver-bullet blog, and not having the time to prove the solution, so we just throw the fix into the system. It can be enlightening to strip the more extreme parameters back to their defaults and see how the system copes.
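
If you want to see how far a system has already drifted from the defaults before stripping anything back, a quick (and hedged) starting point is to list the explicitly set parameters; run as a DBA, and note this only shows what was explicitly specified, which may occasionally still match the default value:

  #!/bin/sh
  # List parameters that have been explicitly set, which is usually where the
  # "silver bullet" settings are hiding.
  sqlplus -s / as sysdba <<'EOF'
  set lines 140 pages 200
  col name  format a40
  col value format a60
  select name, value, isses_modifiable, issys_modifiable
  from   v$parameter
  where  isdefault = 'FALSE'
  order  by name;
  EOF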

As an aside, did you run your systems with the default parameters, discover notable problems and then use the 2 sets of system stats to correct those problems, or did you put them in from the start and everything was good?  

There's a case to be made for using system stats, but I just don't think that is something that should be used frequently.    

Neil.  


From: Chris Taylor <christopherdtaylor1994_at_gmail.com>
Sent: 26 March 2019 12:59
To: Neil Chandler
Cc: gogala.mladen_at_gmail.com; ORACLE-L
Subject: Re: System stats

As far as the workload goes, I used 2 sets of workload stats and swapped between them - one for business hours and one for off-hours, since each period had its own personality (for lack of a better word).
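
For anyone wanting to try the same approach, a rough sketch of the swap using DBMS_STATS export/import of system stats from a staging table; the table name, statids and scheduling below are assumptions for illustration, not necessarily Chris's actual setup:

  #!/bin/sh
  # swap_system_stats.sh DAY|NIGHT  (hypothetical wrapper, run from cron or the
  # scheduler at the window boundaries, e.g. 07:00 and 19:00)
  #
  # One-time setup as a DBA, once per window: gather workload system stats during
  # a representative period, then export them under a statid, e.g.:
  #   exec dbms_stats.gather_system_stats('INTERVAL', interval => 120);
  #   exec dbms_stats.create_stat_table(ownname => 'PERFSTAT', stattab => 'SYSTEM_STATS_TAB');
  #   exec dbms_stats.export_system_stats(stattab => 'SYSTEM_STATS_TAB', statid => 'DAY', statown => 'PERFSTAT');
  # (repeat the gather/export with statid => 'NIGHT' for the off-hours window)
  STATID=${1:?usage: swap_system_stats.sh DAY|NIGHT}
  sqlplus -s / as sysdba <<EOF
  exec dbms_stats.import_system_stats(stattab => 'SYSTEM_STATS_TAB', statid => '$STATID', statown => 'PERFSTAT');
  exit
  EOF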

As far as the SAN goes, if enough systems are hitting the SAN to affect the I/O rate/throughput, then it's *probably* time for a new SAN.

Chris    

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217


--
http://www.freelists.org/webpage/oracle-l
Received on Sat Apr 13 2019 - 17:41:23 CEST
