Spool raid.log
Set pages 50000
Set lines 132
Set trimspool on
Set echo on

/*
RAID simulator
==============

Introduction
------------
The purpose of the script is to have fun while trying to get a better
understanding of RAID-4 and RAID-5.

James Morle writes in "Scaling Oracle8i" that one can play with XOR
operations on a scientific calculator to see that the redundancy in RAID-5
is sufficient to rebuild a crashed disk. After a few minutes with the
Windows Calculator, Jesper decided to automate the calculations by starting
to write this script. Michael got hooked on the idea and made several
improvements.

The script simulates a RAID-4 system using an Oracle database. We are going to
- Deal with physical data disks, a parity disk and one large logical disk.
- Set up automatic maintenance of redundancy on the parity disk.
- Format the disks and put some data on the logical volume.
- See the consequences of a disk crash followed by a successful disk rebuild.
- Do some updates and check that the integrity is maintained.
- Make a binary dump of everything to make it clear what RAID is about.

How to get started
------------------
If you just want a quick tour, then
- Run this script with SQL*Plus against an Oracle9i database.
- Open the logfile raid.log with your favourite editor.
- Locate the test section (search for "TESTING") and start reading.
  It is very simple plain SQL simulating a disk crash and a disk rebuild.

If you want more, then
- Continue reading until you are working at the bit level and/or
- See how the simulation was set up in the first section.

If you just can't get enough
- Solve the exercises and write your own code.

Difference Between RAID-4 and RAID-5
------------------------------------
A RAID-4 system has a dedicated parity disk, which is maintained at block
level like this:

RAID-4:
disk0     disk1     disk2     disk3     parity disk
-----     -----     -----     -----     -----------
block 0   block 1   block 2   block 3   parity 1
block 4   block 5   block 6   block 7   parity 2
block 8   block 9   block 10  block 11  parity 3
block 12  block 13  block 14  block 15  parity 4
Etc.

Considering the 4 physical data disks as one large logical volume, the
Logical Volume Blocks can be mapped to the Physical Disk blocks like this:

  Logical Volume Address = 4 * Physical Disk Address + Disk Number

From that statement we can derive:

  Disk Number           = Mod(Logical Volume Address, 4)
  Physical Disk Address = Trunc(Logical Volume Address / 4)

When doing a write to the logical volume we have to
- Read the old data block
- Read the old parity block
- Write the new data block
- Write the new parity block

This cost of four physical I/Os for one logical write is known as the
"write penalty". If you are doing many random writes, then the parity disk
becomes very hot.

To avoid the bottleneck of one single very hot parity disk, RAID-5 stripes
the parity blocks across all disks. Thus the write penalty is load balanced
over more disks, like this:

RAID-5:
disk0     disk1     disk2     disk3     disk4
-----     -----     -----     -----     -----
block 0   block 1   block 2   block 3   parity 1
block 4   block 5   block 6   parity 2  block 7
block 8   block 9   parity 3  block 10  block 11
block 12  parity 4  block 13  block 14  block 15
Etc.

NOTE: The write penalty still exists. In RAID-5 it is just load balanced.

In order to keep things simple, this script simulates RAID-4. This
simplifies the calculations of where to physically find the data and parity
blocks. The difference between RAID-4 and RAID-5 is only where the blocks
are stored. There are no differences in the contents of the blocks.
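(A small worked example of the RAID-4 mapping above - the same block is
revisited in step 2.4: logical block 10 always lives on disk
Mod(10, 4) = 2, at physical address Trunc(10 / 4) = 2. In RAID-5 the
physical location of the same block would also depend on how the parity
is rotated across the disks.)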
Thus we can learn how RAID-4 and RAID-5 work, without making it too
difficult to deal with. If you really like complexity, then try to
establish the 1:1 relationship between logical and physical blocks in
RAID-5, and then migrate this script to support RAID-5. (Easier exercises
can be found elsewhere in the script.)

Oracle versions and required privileges
---------------------------------------
The script has been tested on Oracle 9.0.1 and 9.2.0.
The user running the script needs the following privileges
(the CONNECT and RESOURCE roles are sufficient):
- CREATE SESSION
- CREATE TABLE
- CREATE VIEW
- CREATE PROCEDURE
- CREATE TRIGGER
- Quota on the default tablespace.

The structure of the script
---------------------------
SECTION 1: SETUP.
1.1. Oracle does not have an XOR function, so we have to define our own.
1.2. Create a table, which contains all physical disks in the RAID system.
1.3. A trigger is created. The trigger maintains the parity disk using the
     XOR function.
1.4. A view and a trigger map the 4 physical data disks into one large
     logical volume.
1.5. The RAID system is initialised (formatted).

SECTION 2: TEST THAT IT ALL WORKS.
2.1. Write and read some data on the logical volume (not using all blocks).
2.2. Dump the physical disks and see the parity.
2.3. Simulate a disk crash of disk 2.
2.4. Show how a block is recreated on the fly.
2.5. Continue from step 2.4 and rebuild all of disk 2.
2.6. Check that the rebuild was successful.
2.7. Do some random writes on the logical disk.
2.8. Check that the consistency of the RAID was maintained during the writes.

SECTION 3: A BINARY VIEW.
3.1. Create a binary formatting routine.
3.2. Use the formatting routine to make a binary dump of the physical disks.
     Here we can do the XOR calculations manually without a calculator.
     (Easy, but very time consuming.)

XOR truth table
---------------
We need to know the XOR truth table when doing XOR calculations.
It looks like this:

+-----+---------+
| XOR |  0   1  |
+-----+---------+
|  0  |  0   1  |
|  1  |  1   0  |
+-----+---------+

Concluding Remarks
------------------
The simulation shows how RAID-4 works. But working is not the same as
performing. It is fascinating that one single parity disk can contain so
much redundancy that it can protect n data disks in a RAID. But if you need
to do many random writes, then you had better go for a RAID-10 solution
(first mirror the disks and then stripe the data). For more details read
James Morle: "Scaling Oracle8i". He explains it much better than we can do
here.

Remember to have fun while studying and preparing your future storage
system. Failing to do so will ultimately make the fun stop when you have to
run a real Oracle database on a real RAID-4 or RAID-5 system.

Good Luck!

Original idea, concept and design:      RAID_Volume add-on:
Jesper Haure Nørrevang                  Michael Möller
jhn.aida@cbs.dk                         m2@miracleas.dk

December 2003.
*/

REM ============= --------  =================
REM ============= 1. SETUP  =================
REM ============= --------  =================

REM Step 1.1: Create our own BITXOR function
REM -----------------------------------------.
REM This is a very simple XOR function like the one found on the Windows
REM Scientific Calculator. We need it for parity calculations.

create or replace function bitxor(n1 in number, n2 in number)
return number
is
-- This function assumes:
--   n1 and n2 are integers and
--   0 <= n1 <= 65535 and
--   0 <= n2 <= 65535
-- i.e. the datatype is a two byte unsigned integer (ub2).
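--
-- A worked example, using the two numbers shown in step 3.1 below:
--   bitxor(6261, 7622) = 1459
-- because 00011000 01110101 XOR 00011101 11000110 = 00000101 10110011.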
value  number(5);
l1     number(5);
l2     number(5);
b1     boolean;
b2     boolean;
result number(5);
begin
  l1 := n1;
  l2 := n2;
  result := 0;
  for i in reverse 0 .. 15       -- i=15 in the first loop, then 14, 13, ..., 1, 0.
  loop                           -- I.e. for each bit in the data block.
    value := power(2, i);        -- Calculate the value of the corresponding
                                 -- bit, i.e. 32768, 16384, ..., 2, 1.
    b1 := (l1 >= value);         -- Is bit i set in n1?
    if b1                        -- If yes,
    then l1 := l1 - value;       -- then calculate the rest.
    end if;
    b2 := (l2 >= value);         -- Same calculation with l2 and b2.
    if b2
    then l2 := l2 - value;
    end if;
    if (b1 != b2)                -- If and only if the bit differs in n1 and n2
    then                         -- (cf. the XOR truth table),
      result := result + value;  -- then the bit is set in the XOR result.
    end if;
  end loop;
  return result;
end;
/
show errors

REM Step 1.2: Create the table (the physical disks)
REM ------------------------------------------------.
REM The PAddr column simulates the Physical Address on the disks.
REM The DISKn columns simulate the physical disks.
REM The Parity_Disk column simulates the dedicated Parity Disk.
REM The rows correspond to data on the disks.
REM
REM The stripe size is 16 bits (2 bytes), i.e. numbers with at most 5 decimal digits.

Drop Table RAID_Raw
/
Create Table RAID_Raw
       (PAddr       number(2) constraint RAID_Raw_PK primary key,
        DISK0       number(5),
        DISK1       number(5),
        DISK2       number(5),
        DISK3       number(5),
        Parity_Disk number(5))
/

REM Step 1.3: Create a trigger to maintain the parity disk
REM --------------------------------------------------------.
REM The trigger fires when data on the physical disks is written, i.e. when
REM the table is updated.

Create or replace trigger RAID_Parity
Before Update On RAID_Raw
For Each Row
begin
  if updating ('DISK0') and
         -- Updating TO null simulates a disk crash.
         -- Updating FROM null simulates a block rebuild after a disk crash.
     :new.DISK0 is not null and :old.DISK0 is not null
         -- In either case we don't touch the parity.
         -- If it is a normal update, then do the parity maintenance:
  then
    :new.Parity_Disk := bitxor(bitxor(:old.DISK0, :new.DISK0), :old.Parity_Disk);
    -- use the current data --------------------------^
    -- READ the old data block ------------^
    -- READ the old parity block ------------------------------------^
    -- calculate the new parity block
    -- WRITE the new data block   (the :new.DISK0 is updated)
    -- WRITE the new parity block (the :new.Parity_Disk is updated)
    -- I.e. four I/O operations.
    --
    -- See James' book, page 109.
  end if;
  if updating ('DISK1') and          -- Same principles as above.
     :new.DISK1 is not null and :old.DISK1 is not null
  then
    :new.Parity_Disk := bitxor(bitxor(:old.DISK1, :new.DISK1), :old.Parity_Disk);
  end if;
  if updating ('DISK2') and          -- Same principles as above.
     :new.DISK2 is not null and :old.DISK2 is not null
  then
    :new.Parity_Disk := bitxor(bitxor(:old.DISK2, :new.DISK2), :old.Parity_Disk);
  end if;
  if updating ('DISK3') and          -- Same principles as above.
     :new.DISK3 is not null and :old.DISK3 is not null
  then
    :new.Parity_Disk := bitxor(bitxor(:old.DISK3, :new.DISK3), :old.Parity_Disk);
  end if;
  -- If updating all 4 disks at once, the parity maintenance becomes simpler:
  -- - Calculate the parity of the 4 blocks you have in hand.
  -- - Write to all 5 disks (i.e. only one extra write).
  -- :new.Parity_Disk :=
  --   bitxor(bitxor(bitxor(:new.DISK0, :new.DISK1), :new.DISK2), :new.DISK3);
  -- This feature is not implemented.
end;
/
show errors

REM Step 1.4: Make the four physical disks look like one large logical volume
REM ---------------------------------------------------------------------------.
REM
REM This is the other half of the work a raid controller does.
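REM (The address arithmetic was shown in the introduction:
REM  Disk Number = Mod(Logical Volume Address, 4) and
REM  Physical Disk Address = Trunc(Logical Volume Address / 4);
REM  the view and the trigger below simply apply those two formulas.)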
REM We will have a logical "RAID_Volume", which has a sequential logical block
REM address (LAddr) and one block with numeric data (two bytes' worth).
REM This is implemented as a view to read from the correct physical disks,
REM and an INSTEAD OF trigger to write to the correct physical disk.
REM
REM (This view/trigger could be merged with the parity code above.)
REM (The "DiskSet" inline view is just a pseudo-table needed to count to 3.)

Create or Replace View RAID_Volume as
Select PAddr * 4 + DiskSet.DiskNo as LAddr,
       Case DiskSet.DiskNo
         When 0 then DISK0
         When 1 then DISK1
         When 2 then DISK2
         When 3 then DISK3
       end as BLOCK
From RAID_Raw,                                                       -- intentional Cartesian join
     (select PAddr as DiskNo from RAID_Raw where PAddr < 4) DiskSet  -- 0,1,2,3
Order by LAddr
/

Create or Replace Trigger RAID_VolumeWrite
Instead Of Update On RAID_Volume
Begin
  Case mod(:Old.LAddr, 4)
    When 0 then Update RAID_Raw Set DISK0 = :new.Block Where PAddr = Trunc(:Old.LAddr/4);
    When 1 then Update RAID_Raw Set DISK1 = :new.Block Where PAddr = Trunc(:Old.LAddr/4);
    When 2 then Update RAID_Raw Set DISK2 = :new.Block Where PAddr = Trunc(:Old.LAddr/4);
    When 3 then Update RAID_Raw Set DISK3 = :new.Block Where PAddr = Trunc(:Old.LAddr/4);
  end case;
  -- For simplicity we do not raise errors on Insert or Delete - these are not
  -- permitted. (You can't remove a block from a disk either! - just zero it.)
End;
/
show errors

REM Step 1.5: Initialise the array
REM ------------------------------.
REM As with a real array, we need to initialise the data (and thus give the
REM parity a valid value) for all disks. In short: we "format" the array.

Begin
  For i In 0 .. 10
  Loop
    Insert Into RAID_Raw(PAddr, DISK0, DISK1, DISK2, DISK3, Parity_Disk)
    Values (i, 0, 0, 0, 0, /* all zeros gives parity 0 */ 0);
  End Loop;
  commit;
End;
/

REM ============= ------------ =================
REM ============= End of Setup =================
REM ============= ------------ =================
REM
REM We now have a RAID disk system.
REM Blocks are read/written by Select/Update on the logical RAID_Volume.
REM
REM You can dump the physical disk blocks by selecting from the RAID_Raw table.
REM

REM =========== ----------  =================
REM =========== 2. TESTING  =================
REM =========== ----------  =================

REM Step 2.1: Write some data to the logical volume and read it
REM ------------------------------------------------------------.
REM Let's write some data ...

Begin
  For i In 7 .. 20
  Loop
    Update RAID_Volume
    Set Block = i * 1111
    Where LAddr = i;
  End Loop;
End;
/
Commit
;

REM ... and then read the logical volume as a normal disk.

Select * from RAID_Volume
;

REM Step 2.2: Dump the physical disks and check the integrity of the parity disk
REM ------------------------------------------------------------------------------.
REM Let's have a look at the physical level.
REM The checksum column is yet another XOR calculation over all data blocks
REM and the parity block. If everything works, the checksum column contains
REM only 0-values. The reason will become clearer in the last section:
REM Binary dump of the RAID.

Select PAddr, DISK0, DISK1, DISK2, DISK3, Parity_Disk,
       bitxor(bitxor(bitxor(bitxor(DISK0, DISK1), DISK2), DISK3), Parity_Disk) checksum
from RAID_Raw
order by PAddr
;

REM Step 2.3: Simulate a disk crash
REM -------------------------------.
REM Disk 2 is now crashing.

Update RAID_Raw set DISK2 = NULL
;
commit
;

REM Dump the physical disks again to see the crashed disk.
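REM (Expected: DISK2 now shows only NULL values, while the other data disks
REM  and the Parity_Disk column are untouched.)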
Select DISK0, DISK1, DISK2, DISK3, Parity_Disk
from RAID_Raw
order by PAddr
;

REM See the effects on a simple read from the logical volume.

Select * from RAID_Volume
where LAddr between 8 and 11
order by LAddr
;

REM Step 2.4: Show how a block is recreated on the fly
REM ---------------------------------------------------.
/*
This is what is so great about RAID - we can happily rebuild data without
reading anything from a backup. Sadly, the cost is that we have to read 4
blocks on 4 different disks to reconstruct one data block.
(We SELECT (= read) from DISK0, DISK1, DISK3 and Parity_Disk.)

It's great, but RAID is certainly not an excuse for not making backups.

This shows access to LAddr=10, which maps to PAddr=2, DISK2 (which is NULL).
To avoid making the RAID_Volume view too complicated, we do this as a
separate SELECT below. (Improving RAID_Volume is left as an exercise for YOU.)
*/

REM The expected output of this select should be 11110.

Select Bitxor(Bitxor(Bitxor(DISK0, DISK1), DISK3), Parity_Disk) Disk2_Re_Gen
From RAID_Raw
Where PAddr = 2
;

REM Step 2.5: Rebuild all of disk 2
REM -------------------------------.
REM A RAID controller would do this as a background job when you replace the
REM crashed disk with a new blank disk.

Update RAID_Raw
set DISK2 = Bitxor(Bitxor(Bitxor(DISK0, DISK1), DISK3), Parity_Disk)
;
Commit
;

REM Step 2.6: Look at the reconstructed data
REM ----------------------------------------.
REM
REM Your exercise is to check that the disk contains the same data as before
REM the crash.

Select DISK0, DISK1, DISK2, DISK3, Parity_Disk
From RAID_Raw
Order by PAddr
;

REM Step 2.7: Random writes of data
REM -------------------------------.
REM Let's perform random writes on the logical disk in order to see if the
REM parity disk is maintained correctly.
REM Simultaneous writes to all disks in the same stripe are not implemented.
REM (See the comment at the end of RAID_Parity. Why don't YOU implement this
REM in the above view/triggers? :-)

Begin
  For i In 1 .. 100                                    -- We are making 100 loops.
  Loop                                                 -- In each: update a random block with a random value.
    Update RAID_Volume
    Set BLOCK   = Trunc(dbms_random.Value(0, 65536))   -- Random 2 byte value.
    Where LAddr = Trunc(dbms_random.Value(0, 44));     -- Random block.
  End Loop;
  Commit;
End;
/

REM NOTE: For each update, the RAID_Parity trigger maintains the parity disk,
REM with all the I/O operations this requires. You could add code to count this.

REM Step 2.8: Check the consistency of the physical disks after the writes
REM ------------------------------------------------------------------------.

Select DISK0, DISK1, DISK2, DISK3, Parity_Disk,
       Bitxor(Bitxor(Bitxor(Bitxor(DISK0, DISK1), DISK2), DISK3), Parity_Disk) Checksum
From RAID_Raw
Order By PAddr
;

REM =============== ---------------- =================
REM =============== 3. A BINARY VIEW =================
REM =============== ---------------- =================

REM Step 3.1: Create the NUM_TO_BIN function
REM -----------------------------------------.
/*
It can be difficult to see that 6261 XOR 7622 = 1459 (decimal).
It is easier to see that

  00011000 01110101 XOR 00011101 11000110 = 00000101 10110011

(the same numbers in binary). See the XOR truth table.

Oracle has a built-in BIN_TO_NUM function. We need the opposite, NUM_TO_BIN,
but it does not exist, so we write our own NUM_TO_BIN.
*/

Create Or Replace Function Num_To_Bin(N In Number)
Return Varchar2
is
-- This function assumes:
--   N is an integer and
--   0 <= N <= 65535
-- i.e. the datatype is a two byte unsigned integer (ub2).
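--
-- A worked example, using the first number from the comment above:
--   Num_To_Bin(6261) returns '00011000 01110101'.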
value  number(5);
l      number(5);
result varchar2(20);
begin
  l := N;
  result := '';
  for i in reverse 0 .. 15        -- i=15 in the first loop, then 14, 13, 12, ...,
  loop                            -- 1, 0, i.e. for each bit.
    value := power(2, i);
    if (l >= value)               -- Is bit i set in N?
    then
      l := l - value;             -- Calculate the rest.
      result := result || '1';    -- This is a 1 bit.
    else
      result := result || '0';    -- This is a 0 bit.
    end if;
    if i = 8                      -- Space between the bytes for readability.
    then
      result := result || ' ';
    end if;
  end loop;
  return result;
end;
/
show errors

REM Step 3.2: Binary dump of the contents of the disks
REM ---------------------------------------------------.

column bin_disk0       format A18
column bin_disk1       format A18
column bin_disk2       format A18
column bin_disk3       format A18
column bin_parity_disk format A18
column checksum        format 99990

/*
When looking at the whole RAID-4 in binary format, we can see that the
parity bits work very much like the parity bit in good old serial
communication: some data bits plus an even parity bit.

Your exercise is to choose all bits at the same position (e.g. 42) on each
of the five disks and count the number of 1-bits. It is always an even
number.

When looking at the RAID in binary format, it becomes very easy to realize
how we rebuilt disk 2 after the crash.

Earlier we mentioned that the checksum column should contain only 0's.
It should be clear by now why it has to be so. If not: do the exercise again.

Sometimes one hears/sees/reads something about very complicated and advanced
mathematics used to construct very fault tolerant and very redundant storage
systems. It is so complicated that the architecture cannot be explained.
In the future: just think of a simple serial communication line with even
parity! The good old famous Digital VT100 Terminal (R) supports even parity.
*/

Select num_to_bin(DISK0)       bin_disk0,
       num_to_bin(DISK1)       bin_disk1,
       num_to_bin(DISK2)       bin_disk2,
       num_to_bin(DISK3)       bin_disk3,
       num_to_bin(Parity_Disk) bin_parity_disk,
       bitxor(bitxor(bitxor(bitxor(DISK0, DISK1), DISK2), DISK3), Parity_Disk) checksum
From RAID_Raw
order by PAddr
;

REM =============== -------- =================
REM =============== EPILOGUE =================
REM =============== -------- =================
REM
REM Parity party is over.
REM
REM WARNING: SIMULATION ENDS. REALITY BEGINS.
REM
REM Beyond this point: Parity is Pain!
REM

Spool OFF
EXIT

REM Optional cleanup code
Drop FUNCTION BITXOR;
Drop FUNCTION NUM_TO_BIN;
Drop TABLE RAID_Raw;
Drop VIEW RAID_Volume;