Re: 10g RAC: max performance & min cost with miSCSI?

From: DA Morgan <damorgan_at_psoug.org>
Date: Wed, 16 Nov 2005 06:45:46 -0800
Message-ID: <1132152348.417427@yasure>

Heikki Siltala wrote:
>
> Hello all,
>
> we are currently choosing which path to take in our Oracle database
> environment renewal process. What we want is 1) reliability 2) minimum
> downtime 3) maximum performance 4) minimum cost. Sounds easy? :-) Due to
> reliability and minimum downtime requirements we are planning to build a
> 2 node Linux Oracle 10g RAC. To get maximum performance we have to focus
> on the disk system performance. To minimize the cost we are planning to
> run the nodes with 1 CPU each since Oracle licenses are per CPU. 2 CPUs per
> node would require us to purchase Enterprise Edition licenses, RAC
> licenses, OLAP and partitioning licenses etc. 1 CPU per node requires
> only a RAC license for 2 CPUs since we already have everything else for 2
> CPUs. A detailed calculation showed that 1 CPU per node costs in our case
> only 25 percent of the Oracle licensing costs of the 2 CPU per
> node alternative.
>
> I have started the design from the IO performance. I have never seen an
> Oracle database environment where CPU is the bottleneck. The bottleneck
> almost always seems to be the disk system, so starting the design from the
> disk system seems the right way to go. I have set a goal to build a system
> that offers enough IO bandwidth to saturate the CPUs (1 CPU per node, 2
> nodes). Some materials suggest that a modern CPU can drive 200 MB/s (3
> GHz Xeon). Since we have 2 CPUs on RAC the disk system should be able to
> deliver 400 MB/s, let's say 500 MB/s to be safe. Assuming that the
> physical disks are typically the weakest point in IO performance I
> started to build it up from there. Using SAME (stripe and mirror
> everything) the IO load on disk will be random, not sequential. I assume
> a 15k disk can deliver 25 MB/s of random IOs, so 500 MB/s system
> requires at least 20 disks.
>
> I have browsed through the storage solutions of different vendors (Sun,
> Dell/EMC, Adaptec, HP, IBM). To limit the possibilities I focused first on
> HP's offerings. What we need is a disk system that can hold 20 disks (and
> maybe some additional 300 GB 10k disks for disk-to-disk-to-tape backups
> etc) and deliver 500 MB/s. HP's MSA500 does not have enough disk slots and
> the MSA1000 can deliver only 200 MB/s. If we shift to more advanced storage
> systems, the price tag rises significantly. But wait, there is still an
> alternative for 2 node RAC. HP MSA30 MI is a multi-initiator U320 SCSI
> array with dual SCSI busses and ability to hold 14 drives (7 drive slots
> per bus). The list price seems to be about 4000 euros. The issue with
> MSA30 MI is that it doesn't support Xeon (HP Proliant) servers, only
> Itanium (HP Integrity) servers. I can't understand why on earth they
> have come up with this limitation! If using MSA30 MI and two Itanium 1
> CPU nodes, the disk configuration would be two MSA30 MI arrays, two
> busses each, so both nodes would need U320 SCSI HBAs for four SCSI
> channels. The disks could be put so that each MSA30 bus holds five 15k
> disks.
>
> Now if we start to calculate the performance from the ground up, we have
> 5 disks per array bus, each disk having a 25 MB/s random access transfer
> rate, so this makes 125 MB/s per bus. We have four busses on the disk
> arrays, so the disk arrays can deliver 500 MB/s if the IO is evenly
> distributed. Each array bus is accessed by both nodes using a
> multi-initiator U320 SCSI channel, and since the channel can theoretically
> run 320 MB/s, 125 MB/s can easily be transported over it. A node has 4 U320
> channels to the arrays and each channel has 125 MB/s of disk. So one node
> could theoretically get 500 MB/s of IO out of the disks (assuming that the
> HBAs are PCI-X 64 bit) and can easily saturate the CPU. And if both nodes
> are accessing the disks the IO speed drops to 250 MB/s per node, which is
> still enough to saturate the CPU. So if this calculation is correct, we
> need to buy two MSA30 MI units (total 8000 euros), four 2 channel PCI-X
> U320 SCSI HBAs and 20 15k disks to get an IO performance of 500 MB/s on a
> two node RAC system.
>
> The question you might now ask is how to distribute the load
> evenly on the disks and how to handle striping and mirroring since MSA30
> MI is a JBOD array. Now Oracle 10g ASM comes to the rescue. If we configure
> all the disks as one ASM diskgroup using normal redundancy (2 copies
> kept) the ASM will do it all: fully automatic IO load balancing,
> striping and mirroring. No need for LVMs, LUNs, disk array RAIDs etc.
>
> The point of posting all this to the newsgroup is that I would like to
> know what you think about this idea of using multi-initiator SCSI for
> RAC shared disks. This is of course a 2 node system and cannot be scaled
> to n nodes, since shared SCSI with MSA30 MI is only for two nodes (and I
> think that RAC on m-i SCSI has the same limitation). The other open
> issues/questions that still remain are
>
> 1. Can we fully rely on 10g ASM? In this solution the data redundancy is
> managed only by ASM, not by the storage system. What are the risks if we
> build a huge 20 disk diskgroup (2 failure groups, 10 disks each) to
> store ALL the database data (tablespaces, redo logs, archive logs,
> control files etc)?
>
> 2. What would be the other options to get 500 MB/s of disk system
> performance at a reasonable price? If we put two MSA1000 units, each
> delivering 200 MB/s, we get 400 MB/s which is quite close. How about the
> offerings from other vendors?
>
> 3. Can you think of any clues as to why on earth HP has decided not to
> support MSA30 MI on Xeon (Proliant) servers? Why is it only for Linux 64bit
> and HP-UX Itanium (Integrity)? Are there similar m-i SCSI storage systems
> from other vendors that are supported for Xeon servers? It seems more than a
> little bit pointless to build a cheap disk system and then be forced to
> move from Xeon to Itanium.
>
> --
> Heikki
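
For reference, the sizing arithmetic in the quoted message can be reproduced with a short script. This is only a sketch of the poster's own calculation: every figure (200 MB/s per CPU, 25 MB/s of random IO per 15k disk, 320 MB/s per U320 channel) is his stated assumption, not a measurement, and the function names are made up for illustration.

```python
import math

# All figures below are the assumptions stated in the quoted message.
TARGET_MBPS = 500          # 2 CPUs x 200 MB/s, rounded up "to be safe"
DISK_RANDOM_MBPS = 25      # assumed random IO throughput of one 15k disk
BUSES = 4                  # two MSA30 MI arrays, two SCSI busses each
DISKS_PER_BUS = 5
U320_LIMIT_MBPS = 320      # theoretical rate of one U320 channel
NODES = 2


def disks_needed(target_mbps, per_disk_mbps):
    """Smallest number of disks whose combined throughput meets the target."""
    return math.ceil(target_mbps / per_disk_mbps)


def aggregate_mbps(buses, disks_per_bus, per_disk_mbps, bus_limit_mbps):
    """Total throughput, with each bus capped at the channel's limit."""
    per_bus = min(disks_per_bus * per_disk_mbps, bus_limit_mbps)
    return per_bus * buses


total = aggregate_mbps(BUSES, DISKS_PER_BUS, DISK_RANDOM_MBPS, U320_LIMIT_MBPS)
print("disks needed:", disks_needed(TARGET_MBPS, DISK_RANDOM_MBPS))
print("aggregate MB/s:", total)
print("MB/s per node when both nodes are busy:", total / NODES)
```

With the poster's figures this gives 20 disks, 500 MB/s in aggregate, and 250 MB/s per node when both nodes are active. The result depends entirely on the 25 MB/s per-disk assumption and on perfectly even IO distribution.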

I am a bit concerned with some of what you wrote ... and some of what you didn't write.

First: My understanding with Standard Edition is that it covers up to 4 CPUs. If that is still correct you might want to go for a 3 node or 4 node cluster. You never want to be in a position where taking a CPU off-line cuts your processing power by 50% and removes all failover capability.
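
To put rough numbers on that point, here is a minimal sketch. It assumes identical single-CPU nodes and perfectly even load sharing, which is a simplification, and the function name is hypothetical.

```python
def after_node_loss(nodes, failed=1):
    """Return (fraction of capacity left, whether failover is still possible).

    Assumes identical nodes sharing the load evenly. Failover is counted as
    still possible only if at least two nodes survive.
    """
    if nodes < 1 or not 0 <= failed <= nodes:
        raise ValueError("need nodes >= 1 and 0 <= failed <= nodes")
    surviving = nodes - failed
    return surviving / nodes, surviving >= 2


for n in (2, 3, 4):
    capacity, can_fail_over = after_node_loss(n)
    print(f"{n} nodes: {capacity:.0%} capacity left, failover left: {can_fail_over}")
```

Under these assumptions a 2 node cluster keeps half its capacity and has nothing left to fail over to, while a 4 node cluster keeps three quarters and still has surviving peers.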

Second: The last place I would look for a storage subsystem for RAC is "(Sun, Dell/EMC, Adaptec, HP, IBM)". Contact NetApp and specifically ask about the FAS250 and FAS270 series. They will save you a huge amount of grief as they eliminate the need for a cluster file system and/or ASM.

-- 
Daniel A. Morgan
http://www.psoug.org
damorgan_at_x.washington.edu
(replace x with u to respond)

Received on Wed Nov 16 2005 - 08:45:46 CST