Geert De Paep

An extract of my Oracle related opinions

ERR_TOO_MANY_REDIRECTS in Apex login for http and not for https

Sun, 2019-02-10 07:02

I have had this mysterious error before and I never knew where it came from or how to solve it. Sometimes it was there, sometimes not. And I was very scared, because if you don’t know where it comes from and suddenly it occurs in production, what then…?

I have now spent a lot of time investigating this, and I think I have found the explanation.

I was setting up a new apex environment and the behaviour was that accessing the login page via https was ok, while accessing it via http gave the error ERR_TOO_MANY_REDIRECTS.

Of course, you can argue that it is not wise to mix http and https, but there may be reasons to do so, and the topic of this discussion is to give insight into the error above.

When accessing my site via https, these are the requests, captured with the Chrome developer tools (F12); only the relevant parts of the requests sent to the server and the responses received are shown:

Request URL (https): 

https://myserver.mydomain.com/myalias/f?p=450

Response:

Location: 

https://myserver.mydomain.com/myalias/f?p=450:1::::::

–> So we get a redirect to page 1, which is the login page.

The browser automatically continues with:

Request URL: 

https://myserver.mydomain.com/myalias/f?p=450:1::::::

Response:

Location: 

https://myserver.mydomain.com/myalias/f?p=450:LOGIN_DESKTOP:14782492558009:::::

Set-Cookie: 

ORA_WWV_APP_450=ORA_WWV-ca0VJbR5JTZgDZ16HZYWS7Hu; secure; HttpOnly

–> A session id is generated by Apex (14782492558009)

–> We get a redirect to LOGIN_DESKTOP page again, but with a url containing our session id

–> But note that a cookie with name “ORA_WWV_APP_450” is being sent to the browser (Set-Cookie).

–> Also note that the cookie has the “secure” attribute set. So it will only be sent when the url contains https, which is ok because we are using https

The browser automatically continues with:

Request URL: 

https://myserver.mydomain.com/myalias/f?p=450:LOGIN_DESKTOP:14782492558009:::::

Cookie sent

ORA_WWV_APP_450=ORA_WWV-ca0VJbR5JTZgDZ16HZYWS7Hu

–> We go again to the login page with a url containing our session id

–> Note that the cookie that was set in the browser in the previous request is now sent to the server and Apex is happy, i.e. it shows the login page.

In other words, the fact that the url contains a valid session id and the matching cookie is sent means this is a valid session.

Response:

The apex login page. All ok.

 

Now in the same browser, we repeat this process using http instead of https and we get:

Request URL (no https):

http://myserver.mydomain.com/myalias/f?p=450

Location:

http://myserver.mydomain.com/myalias/f?p=450:1::::::

–> Same as above

The browser automatically continues with:

Request URL:

http://myserver.mydomain.com/myalias/f?p=450:1::::::

Location:

http://myserver.mydomain.com/myalias/f?p=450:LOGIN_DESKTOP:14897208542039:::::

Set-Cookie:

ORA_WWV_APP_450=ORA_WWV-PO71knA50OiuT9n5MY3zEayQ; HttpOnly

–> Basically the same as above

–> Note however that a cookie is also set here, but it does not have the “secure” attribute because this is not an https connection

Now the fun begins, the browser continues and the log shows:

Request URL:

http://myserver.mydomain.com/myalias/f?p=450:LOGIN_DESKTOP:14897208542039:::::

Response:

Location:

http://myserver.mydomain.com/myalias/f?p=450:1:9970034434813:::::

Set-Cookie:

ORA_WWV_APP_450=ORA_WWV-JP6Aq5b-MH74eD-FiXO4fBBr; HttpOnly

 

Immediately followed by:

 

Request URL:

http://myserver.mydomain.com/myalias/f?p=450:1:9970034434813:::::

Response:

Location:

http://myserver.mydomain.com/myalias/f?p=450:LOGIN_DESKTOP:5362634758337:::::

Set-Cookie:

ORA_WWV_APP_450=ORA_WWV-qVBvmxuGxWUgGXqFZgVYb1b2; HttpOnly

 

And these 2 requests keep repeating after each other very fast until the browser complains about too many redirects.

 

What is happening? Why are we stuck in a redirect loop until we get a browser error?

Did you notice that in the above 2 requests, no cookie is sent to the server anymore (no Cookie request header, unlike in the https flow above)? As a consequence, Apex does not trust this as a valid session, because the cookie is not received. The session id in the url and the cookie must match, otherwise someone could hijack your session simply by copying your url. As a consequence, Apex creates new sessions over and over again, and as the cookie is never sent, none of these sessions ever becomes a valid one. So this is in fact intended behavior.

The question is of course: why is the cookie not sent? If you look at the Set-Cookie header, there is no reason for the browser NOT to send the cookie back:

  • No domain is given, so it is sent for all domains
  • No path is given, so it is sent for all paths
  • HttpOnly has nothing to do with this; it only prevents (evil) javascript from reading your cookie

It took me a while to figure this out, but when looking at the cookies stored in the browser and their attributes, it became clear (via Chrome -> settings -> advanced -> content settings -> cookies). There is only 1 cookie in my browser with the name ORA_WWV_APP_450. And surprise surprise, it contains the “secure” attribute, which means it is only sent on https connections. And hence not on my http session, resulting in the ERR_TOO_MANY_REDIRECTS.

So this is what I think happens:

  • The initial apex-login using https has set the cookie in the browser with the secure attribute
  • The second apex-login using http (no https) sets the same cookie to a different value. But it does not ‘clear’ the secure flag (I don’t know if that is possible anyway). The only thing that the Set-Cookie header says, is: set the value to ORA_WWV-qVBvmxusomethingKhdshkh and set HttpOnly on. It does not say “clear the secure flag”, or “remove the old cookie and create a new one without the secure flag”.
  • So the secure flag is not cleared and the cookie is never sent over http, only over https (see the curl sketch below for a way to observe both Set-Cookie variants)
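If you want to observe the two Set-Cookie variants without a browser, here is a minimal curl sketch (using the same hypothetical host and application id as above; adjust them to your own environment):

# https: the response sets the cookie with the "secure" attribute
curl -sk -D - -o /dev/null "https://myserver.mydomain.com/myalias/f?p=450:1::::::" | grep -i '^Set-Cookie'

# plain http: the cookie is set without "secure", but a secure cookie already
# stored by the browser is not cleared by this – which is exactly the problem
curl -s -D - -o /dev/null "http://myserver.mydomain.com/myalias/f?p=450:1::::::" | grep -i '^Set-Cookie'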

Summary:

  • Once an https (secure) cookie has been set, it is not possible anymore to use a non-https session
  • The only solution is to manually remove the cookie from the browser. Then it works again, until you use https again followed by http.
  • That also explains why this error may seem to occur randomly. As long as the secure cookie is not present, all works, and then suddenly after 1 https request, the problem is there.

 

Is this a bug? You can argue about this, but according to me: yes. An application should never end up in such a redirect loop, no matter how you use it. And yes, you can say that http and https should not be mixed, but there can be reasons to do so (fallback if you have issues with your ssl config – or an internal-only connection that will always work, even if ssl is temporarily unavailable).

Using a different cookie name for a https session vs an http session might solve this problem, but I don’t know if the apex developers will do so, because this is in the core of the authentication logic of Apex.

Questions, if someone can help:

  • Do you know of a (supported) way to avoid this, I mean any setting in Apex or so that I overlooked?
  • What would be the easiest way to clear the secure flag in the cookie, if you needed to do so? Manually removing the cookie from the browser is somewhat complex and I think many people (end users) would not know how to do this.
  • Do you know of a way to better handle this error? E.g. do you know if you can have Apache redirect to some error page in case of ERR_TOO_MANY_REDIRECTS (because this is not an http error code like 404 – Not Found), or even better, to a page where you have a button to clear the cookie?

I hope you now have better insight into what happens when you encounter this error.

Real ODA “one button patching”, 12.1.2.6 to 12.1.2.10

Mon, 2017-06-19 13:46

Today I had to patch a virtualized Oracle Database Appliance X5-2 HA at a customer site. Current version was 12.1.2.6, installed in June 2016, and it had to be upgraded to 12.1.2.10. I’ve heard a lot of scary stories of colleagues and friends about possible issues during patching, so I was a little reluctant to do this. Anyway, no choice, what needs to be done, needs to be done.

I had one big advantage: this ODA was installed by myself and I was 100% sure that nobody had ever done anything ‘unsupported’ with it. During the installation in June 2016, I took great care to follow the books accurately and to ensure that nothing went wrong. This was a really clean install without any issues, now in production for 1 year.

Now one year later I have to say that I am very happy with “my ODA”. The patching today went smoothly. Not a single issue was encountered during the patching. Actually I was a little surprised, because as I said, I have heard about many issues of others during patching.

More specifically I did the following:

I followed the MOS README for patch 25499210.

Luckily version 12.1.2.6 can be directly upgraded to 12.1.2.10 (lower versions need to go in 2 or more steps).

Server

I did the Server part “locally”, i.e. one node at a time and hence 2 patching runs. Although I had downtime for the full ODA, I preferred to do it node by node. If something went completely wrong on node 1, I would still have a valid node 2.

Warnings and tips:

  • Ensure each node (oda_base) has more than 15 GB free on /u01
  • Verify that rpm orclovn-user-6.0.r7494-1.el5 exists in Dom0 on each node (cf. one of those issues – someone I know had this rpm missing in Dom0…) – see the pre-check sketch after this list
  • Run the patching one node at a time, not both together
  • It is best to stop any VMs manually before starting the patching (but at the end you will have to start them manually again as well)
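To verify the first two points, a minimal sketch (run as root; the rpm name is the one from the tip above, /u01 is the standard oda_base mount point):

# free space on /u01 inside oda_base – should be well above 15 GB
df -h /u01

# check that the orclovn-user rpm is present – run this one in Dom0, not in oda_base
rpm -q orclovn-user-6.0.r7494-1.el5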

This command performs the patching:

/opt/oracle/oak/bin/oakcli update -patch 12.1.2.10.0 --server --local

You may be interested in the timings, if you need to do this as well and have to plan downtime:

These are the current and proposed versions (I think the number of components to be patched will influence the time it takes):

Component Name            Installed Version         Proposed Patch Version  
---------------           ------------------        -----------------       
Controller_INT            4.230.40-3739             Up-to-date               
Controller_EXT            06.00.02.00               10.00.00.00             
Expander                  0018                      Up-to-date              
SSD_SHARED {                                                                
[ c1d20,c1d21,c1d22,      A29A                      Up-to-date              
c1d23 ]                                                                     
[ c1d16,c1d17,c1d18,      A29A                      Up-to-date              
c1d19 ]}                                                               
HDD_LOCAL                 A720                      A72A                    
HDD_SHARED                P901                      PAG1                    
ILOM                      3.2.4.52 r101649          3.2.8.24 r114580        
BIOS                      30050100                  30100400                
IPMI                      1.8.12.4                  Up-to-date              
HMP                       2.3.4.0.1                 2.3.5.2.8               
OAK                       12.1.2.6.0                12.1.2.10.0             
OL                        6.7                       6.8                     
OVM                       3.2.9                     3.2.11                  
GI_HOME                   12.1.0.2.160119(2194      12.1.0.2.170117(2473    
                          8354,21948344)            2082,24828633)          
DB_HOME                   12.1.0.2.160119(2194      12.1.0.2.170117(2473    
                          8354,21948344)            2082,24828633)

Each node took about 75 minutes to do the patching. However, at the end a reboot is triggered and you should calculate about 15 minutes extra before everything is up again and you can verify that all went well. But good to know: while node 1 is being patched, node 2 remains fully available and vice versa.
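After the reboot it is worth double-checking that every component really ended up on the proposed version. A minimal sketch (oakcli show version is the standard way to list component versions on ODA; verify the exact options on your release):

# list installed component versions and compare with the 12.1.2.10.0 stack
/opt/oracle/oak/bin/oakcli show version -detail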

Shared storage

After this, the shared storage part needs to be patched using

/opt/oracle/oak/bin/oakcli update -patch 12.1.2.10.0 --storage

You run this command only on node 1.

This took 24 minutes, but at the end both nodes reboot (both at the same time, so this is not rolling and you have db unavailability, even with RAC). So calculate some extra time here.

Database homes

Finally the database homes (I have 4 of them) need to be patched. In fact only 3 of the homes needed to be patched in my case, because we first want to validate the patch before patching production (the 4th home). On node 1 you can do:

/opt/oracle/oak/bin/oakcli update -patch 12.1.2.10.0 --database

Note, I didn’t use the “--local” option in this case, so the oracle_homes on both nodes are patched in 1 command.

Tip:

  • I don’t know if it is required, but it is safe to set first (again one of those issues of others):
    • export NLS_LANG=American_America.US7ASCII
  • You don’t have to stop the databases in advance. This will be done, on one node at a time, while patching.
  • “datapatch” will be run automatically during the patching. So after the patching all databases are open again and no additional steps are required.

After starting the command, a list of homes to be patched is displayed and you can select whether to patch them all together or not. In my case I answered N and then I could choose which homes NOT(!) to patch. Yes, it is somewhat strange: instead of asking which homes I want to patch, I had to enter the opposite:

…
Would you like to patch all the above homes: Y | N ? : N
Please enter the comma separated list of Database Home names that you do NOT want to patch:OraDb12102_home4,OraDb12102_home1,OraDb12102_home2
…

In the meantime I filed an enhancement request at Oracle Support for adding a confirmation after entering these homes. Because at the moment, after pressing RETURN, the script immediately starts patching. I don’t want to know what will happen if you have a typo in your list of home names. It might be that the script will patch the wrong home in that case… So be careful.

I noticed that datapatch was automatically run for each database and I could verify this in:

SELECT patch_id, version, status, bundle_id, bundle_series
       FROM dba_registry_sqlpatch;

patch_id version  status  bundle_id bundle_series
21948354 12.1.0.2 SUCCESS    160119 PSU         
24732082 12.1.0.2 SUCCESS    170117 PSU

Regarding timings, in my case it took about 20 minutes for each home. However, I had 4 databases in each home. If you have more or fewer databases, timings may be somewhat different, but at least you have an idea.

Summary

In summary, all patch steps went through without a single error or issue. So this is the real ODA “one button patching”. Very nice.

However I don’t know how many patchings go like this. It looks to me that you can easily do (unsupported) things on your ODA that may cause issues during patching, such as installing custom rpms, changing OS settings, not creating databases using the provided oakcli command or manually adding other Oracle products to the oraInventory. Yes I know, it is hard to treat an appliance like an appliance if all backdoors are open…

So if you install or manage an ODA, understand the importance of keeping it clean and get rewarded during patching. You really don’t want issues during patching, especially if your downtime is e.g. in the middle of the night.


Why DB_BIG_TABLE_CACHE_PERCENT_TARGET was a good solution in my case

Fri, 2017-05-12 14:07

During the preparation of my Oracle DBA 12c Certified Master exam, I learnt about the DB_BIG_TABLE_CACHE_PERCENT_TARGET parameter in 12c. This parameter allows you to reserve a certain percentage of the buffer cache for storing large tables. You know that full table scans on large tables are normally NOT stored in the buffer cache (they also use direct reads, bypassing the buffer cache), because it should not be the intention to keep scanning large tables over and over again.

My customer had one of those queries that are hard to tune. It was a join of 4 large tables (8 GB, and 3 x 2 GB) with variations of where conditions (depending on the search criteria entered in the application). SQL Plan Management etc. is out of scope because each query was built dynamically and differed from the others (so no fixed sql_id). The problem was that each of these tables returned a lot of rows, and only after joining with the other table(s) and applying all the where conditions there as well were most rows eliminated.

This makes HASH JOIN the best solution to join these tables together, but this means full table scans on all 4 tables (nested loops are not good when your first table returns a lot of rows for which you have to visit the other tables). Another solution could be to build a materialized view on the join of these tables, however in my case the queries require real-time data and the base tables change a lot during the day. So a materialized view with refresh on commit was no option; the permanent update of the materialized view would slow down the changes too much.

Oh yes, the problem query I was talking about was a central piece of the application, executed very often, and the customer wanted instant response. Hard to achieve on that 8+2+2+2 GB dataset (bad idea, developers should think about this in advance…).

After discussion with the local DBAs it turned out that the buffer cache could be increased from 20 GB to 35 GB. So we did, and I configured DB_BIG_TABLE_CACHE_PERCENT_TARGET to 50 (%). So half of the buffer cache was available for large tables and the rest behaves like before. So the ‘other’ activity in the database should still have about 17.5 GB of buffer cache, as before, and should not see any negative impact.

My problem query, after running it once, indeed stored my 4 large tables completely in the reserved part of the buffer cache. The first execution (disk reads) took 58 seconds, the second execution (no disk reads, all data came from cache) took 18 seconds (yes, there were still millions of rows to join together, so some cpu was expected).

Considering that these tables are constantly queried by similar but varying queries, it was a good solution to keep all this data in memory. Also the view V$BT_SCAN_OBJ_TEMPS is useful to monitor this behaviour. Each table has a ‘temperature’, which is a measure of how often it is accessed by queries. This is a different algorithm than the one the traditional buffer cache uses for deciding which tables to keep in memory (i.e. when you have more large tables than reserved space in the buffer cache). This is what it looks like:

[Screenshot: V$BT_SCAN_OBJ_TEMPS output showing the cached big tables and their temperature]
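If you want to try this yourself, a minimal sketch (assuming a 12c database where you can afford to reserve half of the buffer cache; the column list of V$BT_SCAN_OBJ_TEMPS is from memory, so verify it against your version):

sqlplus -s / as sysdba <<'SQL'
-- reserve 50% of the buffer cache for full scans of large tables
ALTER SYSTEM SET db_big_table_cache_percent_target = 50;

-- which big tables are currently cached, and how 'hot' are they
SELECT dataobj#, size_in_blks, temperature, policy, cached_in_mem
  FROM v$bt_scan_obj_temps
 ORDER BY temperature DESC;
SQL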

Final thoughts: usually it is a bad idea to solve performance problems by just ‘buying more memory’. You should first analyze the data model, rewrite the query or try to use other features to attack the root cause of the bad performance. In my case however it is a third-party application with auto-generated queries, so I didn’t have many options. Also, the tables are queried very often, so it makes sense to cache them in memory.


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Conclusion

Thu, 2009-10-15 09:00

This is a follow-up of chapter 5.
The most important thing in this story is the fact that it is perfectly possible to configure your Oracle RAC cluster with 2 storage boxes in a safe way. You just need an independent location for the 3rd voting disk, but if you have that, you can be sure that your cluster will remain running when one of those storage boxes fails. You will even be able to repair it without downtime after e.g. buying a new storage box (call Uptime for good prices… :)

So were all these tests then really needed? Yes, I do think so, for the following reasons:


  • Seeing that it works gives much more confidence than just reading it in the Oracle doc, or on Metalink
  • The story of the vote count is really interesting. There is (almost) nothing to find about this in the Oracle doc or metalink. With the information in this blog, you will be able to better understand and interpret the error messages in the log files. You will also know better when to (not) update the vote count manually.
  • The concept of the OCR master is nice to know. Again, it gives you more insight into the messages in the logfiles.

But apart from these straightforward conclusions, there is one thing I find most interesting. The different scenarios have produced different output, and in one case (scenario 5) even real error messages, although they all did the same thing: removing the ocrmirror. With the different scenarios above, you know why the output can be different. Because if you ever have to handle a RAC case with Oracle support, and you get as reply “we are unable to reproduce your case”, you may now be able to give them more info about what parameter can make a difference (who is the ocr master, where is crs stopped, …). Otherwise it can be so frustrating that something fails in your situation and works in somebody else’s situation.
But now I may be getting too philosophical (which I tend to have after another good “Trappist“)…

Good luck with it!

P.S. And yes, I do have them all in my cellar…


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Chapter 5

Mon, 2009-10-12 09:00
Scenario 5: Loss of ocrmirror from non-ocr-master – reloaded

This is a follow-up of chapter 4.
In this final scenario, we do the same thing as in scenario 4. I.e. while crs is running on both nodes, we hide the ocrmirror from the non-ocr-master node, which is node 2 now.
So node 1 is the master, we hide ocrmirror from node 2 and we verify on node 2:
(nodeb01 /app/oracle/crs/log/nodeb01) $ dd if=/dev/oracle/ocrmirror of=/dev/null bs=64k count=1
dd: /dev/oracle/ocrmirror: open: I/O error

What happens?

As we know from scenario 4, ocrcheck on node 2 now fails with:
(nodeb01 /app/oracle/crs/log/nodeb01) $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry

On node 1 all is ok. This is still the same as scenario 4, but in scenario 4 we then stopped crs on the ocr master, which could see both luns. In this scenario we will now stop crs on the non-master node (node 2), which can see only the ocr.

And now it gets interesting….
-bash-3.00# crsctl stop crs
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage

Did I say “really interesting”? We don’t seem to be able to stop crs anymore on the non-ocr-master node. Maybe it is worth referring to the RAC FAQ on Metalink that says “If the corruption happens while the Oracle Clusterware stack is up and running, then the corruption will be tolerated and the Oracle Clusterware will continue to function without interruptions”. That’s true, but they don’t seem to mention stopping crs. Anyway, the real “playing” continues:

Let’s try to tell Oracle CRS that the ocr is the correct version to continue with, and kindly ask it to increase its vote count to 2. We do this on node 2 and get:
ocrconfig -overwrite
PROT-19: Cannot proceed while clusterware is running. Shutdown clusterware first

Deadlock on node 2! We can’t stop crs, but in order to correct the problem, crs has to be down…

Moreover, at this time, it is not possible anymore to modify the OCR. Both nodes now give:
(nodea01 /app/oracle/crs/log/nodea01/client) $ srvctl remove service -d ARES -s aressrv
PRKR-1007 : getting of cluster database ARES configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
PRKO-2005 : Application error: Failure in getting Cluster Database Configuration for: ARES

And doing the above command on either node always gives, in the alert logfile of node 1 (which is the master):
[  OCRAPI][29]a_check_permission_int: Other doesn’t have permission

Note: “srvctl add service” doesn’t work either.

Now it seems like things are really messed up. We have never seen permission errors before. Please be aware that the steps below are the steps I took to try to get things right again. There may be other options, but I only did this scenario once, with exactly these steps:

As the original root cause of the problem was making the ocrmirror unavailable, let’s try to tell the cluster to forget about this ocrmirror, and continue only with ocr, which is still visible on both nodes.

So in order to remove ocrmirror from the configuration, we do as root on node 2:
-bash-3.00# ocrconfig -replace ocrmirror ""

Note: specifying an empty string ("") is used to remove the raw device from the configuration.

At that time in the crs logfile of node 1:
2008-07-23 11:11:18.136: [  OCRRAW][29]proprioo: for disk 0 (/dev/oracle/ocr), id match (0), my id set (1385758746,1028247821) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 11:11:18.136: [  OCRRAW][29]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]
2008-07-23 11:11:18.195: [  OCRMAS][25]th_master: Deleted ver keys from cache (master)
2008-07-23 11:11:18.195: [  OCRMAS][25]th_master: Deleted ver keys from cache (master)

That looks ok. We will be left with one ocr device having 2 votes. This is intended behaviour.

In the alert file of node 1, we see:
2008-07-23 11:11:18.125
[crsd(26268)]CRS-1010:The OCR mirror location /dev/oracle/ocrmirror was removed.

and in the crs logfile of node 2:
2008-07-23 11:11:18.155: [  OCRRAW][34]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1028247821) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1028247821) my votes (2), total votes (2)
2008-07-23 11:11:18.223: [  OCRMAS][25]th_master: Deleted ver keys from cache (non master)
2008-07-23 11:11:18.223: [  OCRMAS][25]th_master: Deleted ver keys from cache (non master)

(node 2 updates its local cache) and in the alert file of node 2:
2008-07-23 11:11:18.150
[crsd(10831)]CRS-1010:The OCR mirror location /dev/oracle/ocrmirror was removed.

Now we do an ocrcheck on node 2:

(nodeb01 /app/oracle/crs/log/nodeb01) $ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     295452
         Used space (kbytes)      :       5600
         Available space (kbytes) :     289852
         ID                       : 1930338735
         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded

Now the configuration looks ok again, but the error remains on node 2 (we do this as user oracle):
(nodeb01 /app/oracle/crs/log/nodeb01) $ srvctl remove service -d ARES -s aressrv
PRKR-1007 : getting of cluster database ARES configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
PRKO-2005 : Application error: Failure in getting Cluster Database Configuration for: ARES

However doing the same command as root on node 2 succeeds:
-bash-3.00# srvctl remove service -d ARES -s aressrv
aressrv PREF: ARES1 AVAIL: ARES2
Service aressrv is disabled.
Remove service aressrv from the database ARES? (y/[n]) y

After this, managing the resources as user oracle succeeds again:
(nodeb01 /app/oracle/crs/log/nodeb01) $ srvctl add service -d ARES -s aressrv2 -r ARES1
(nodeb01 /app/oracle/crs/log/nodeb01) $ srvctl remove service -d ARES -s aressrv2
aressrv2 PREF: ARES1 AVAIL:
Remove service aressrv2 from the database ARES? (y/[n]) y

At this point, unfortunately, the internals end. At the moment of my testing, I had no time to investigate this further, and since then I have had no time to build and test a similar setup (that’s why this blog posting took so long, I would have loved to do more research on this). However, I remember I have done some more testing at some customer site (but I have no transcript of that, so no details to write here) and I can still tell the following:

For some reason, the ownership of the ARES resource in OCR seems to be changed from oracle to root. A way to get out of this as well is using the following commands:
 crs_getperm <resource_name>
 crs_setperm <resource_name> -o oracle | -g dba

This allows you to change the ownership back to oracle, and then all will be ok again.
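For example (a sketch using the resource name from the crsstat output above; double-check the exact crs_getperm/crs_setperm options for your CRS version), as root:

# show the current ACL of the database resource
crs_getperm ora.ARES.db

# set the owner back to the oracle user
crs_setperm ora.ARES.db -o oracle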

I can’t say where it went wrong. Maybe I did something as root, instead of oracle, without knowing it (however, I double checked my transcripts). I think it went wrong at the moment where I first tried to stop crs as root on node 2 and then did an “ocrconfig -overwrite” as root on node 2. I wonder if something was then sent to node 1 (the ocr master), i.e. as root, that may have changed some permission in the ocr…? If anyone has the time and resources to investigate this further, please don’t hesitate to do so, and inform me about the results. In this way, you may gain perpetual honour in my personal in-memory list of great Oracle guys.

Conclusion

Although crs is very robust and 2 storage boxes are ok, there may be situations where you get unexpected error messages. Hopefully this chapter will help you get out of them without problems, and strengthen your confidence in Oracle RAC.

Let’s make a final conclusion in the next chapter…


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Chapter 4

Wed, 2009-10-07 09:00
Scenario 4: Loss of ocrmirror from the non-OCR MASTER

This is a follow-up of chapter 3.
Let’s try to do the same thing as scenario 3, however now hiding the lun from a node NOT being the OCR MASTER, while crs is running on both nodes.

What happens?

So we hide the ocrmirror lun from node 1, because after the previous test node 2 is still the master. At the moment of hiding the lun, nothing appears in any logfile on any node. This is an interesting fact, because when we removed the ocrmirror from the ocr master in scenario 3, we got the messages “problem writing the buffer …” and “Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]” immediately in the crsd logfile. So this indicates that only the ocr master constantly reads/writes(?) the ocr/ocrmirror and hence detects IO errors immediately. The non-ocr-master doesn’t do anything.

To prove that the ocrmirror is really invisible to the non-ocr-master, we do on node 1:

(nodea01 /app/oracle/bin) $ dd if=/dev/oracle/ocrmirror of=/dev/null bs=64k count=1
dd: /dev/oracle/ocrmirror: open: I/O error

ocrcheck on node 2 (the master) has no problem, as it still sees both devices:

         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded

But doing ocrcheck on node 1 gives:

PROT-602: Failed to retrieve data from the cluster registry

and its alert file says:

2008-07-23 09:33:43.221
[client(14867)]CRS-1011:OCR cannot determine that the OCR content contains the latest updates. Details in /app/oracle/crs/log/nodea01/client/ocrcheck_14867.log.

and the associated client logfile shows:

Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle. All rights reserved.
2008-07-23 09:33:43.210: [OCRCHECK][1]ocrcheck starts…
2008-07-23 09:33:43.220: [ OCRRAW][1]proprioini: disk 0 (/dev/oracle/ocr) doesn’t have enough votes (1,2)
2008-07-23 09:33:43.221: [ OCRRAW][1]proprinit: Could not open raw device
2008-07-23 09:33:43.221: [ default][1]a_init:7!: Backend init unsuccessful : [26]
2008-07-23 09:33:43.221: [OCRCHECK][1]Failed to access OCR repository: [PROC-26: Error while accessing the physical storage]
2008-07-23 09:33:43.221: [OCRCHECK][1]Failed to initialize ocrchek2
2008-07-23 09:33:43.221: [OCRCHECK][1]Exiting [status=failed]…

Now let’s see whether node 1 really has a problem with this or not.

(nodea01 /app/oracle/bin) $ crsstat
HA Resource                                   Target     State
-----------                                   ------     -----
ora.ARES.ARES1.inst                           ONLINE     ONLINE on nodea01
ora.ARES.ARES2.inst                           ONLINE     ONLINE on nodeb01
ora.ARES.db                                   ONLINE     ONLINE on nodeb01
ora.AMIGO.AMIGO1.inst                         ONLINE     ONLINE on nodea01
ora.AMIGO.AMIGO2.inst                         ONLINE     ONLINE on nodeb01
ora.AMIGO.db                                  ONLINE     ONLINE on nodeb01
ora.nodea01.ASM1.asm                          ONLINE     ONLINE on nodea01
ora.nodea01.LSNRARES_NODEA01.lsnr             ONLINE     ONLINE on nodea01
ora.nodea01.LSNRAMIGO_NODEA01.lsnr            ONLINE     ONLINE on nodea01
ora.nodea01.gsd                               ONLINE     ONLINE on nodea01
ora.nodea01.ons                               ONLINE     ONLINE on nodea01
ora.nodea01.vip                               ONLINE     ONLINE on nodea01
ora.nodeb01.ASM2.asm                          ONLINE     ONLINE on nodeb01
ora.nodeb01.LSNRARES_NODEB01.lsnr             ONLINE     OFFLINE
ora.nodeb01.LSNRAMIGO_NODEB01.lsnr            ONLINE     OFFLINE
ora.nodeb01.gsd                               ONLINE     ONLINE on nodeb01
ora.nodeb01.ons                               ONLINE     ONLINE on nodeb01
ora.nodeb01.vip                               ONLINE     ONLINE on nodea01

Obviously not. I seem to be able to query the ocr without any problem. Even modifying the OCR succeeds from node 1:

srvctl add service -d ARES -s aressrv -r ARES1 -a ARES2

gives in the crs logfile:

2008-07-23 09:41:04.656: [ CRSRES][46723] Resource Registered: ora.ARES.aressrv.cs
2008-07-23 09:41:05.786: [ CRSRES][46724] Resource Registered: ora.ARES.aressrv.ARES1.srv

This seems to indicate again that all OCR manipulation (read + write) goes through the ocr master (node 2, which still sees both ocr and ocrmirror). Still, during all these actions nothing has appeared in any logfile of node 2 (the master, which still sees both luns).

Now let’s do an interesting test and stop the master. We may assume that node 1 will then become the new master, however node 1 at this time sees only one device with one vote. So it cannot run CRS like that…

Let’s see what happens. After stopping crs on node 2, the crs logfile on node 1 shows:

2008-07-23 09:44:08.406: [ OCRMAS][25]th_master:13: I AM THE NEW OCR MASTER at incar 2. Node Number 1
2008-07-23 09:44:08.415: [ OCRRAW][25]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 09:44:08.418: [ OCRRAW][25]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]

This makes sense. Node 1 becomes the master because node 2 is leaving. However it evaluates its configuration and sees an ocr with one vote and no ocrmirror. This violates rule 3 and hence it updates the vote count (he can do that, he is the new master), and luckily he does NOT decide to crash…

And indeed all seems as expected now on node 1:

         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File unavailable

The situation is now that the unavailable ocrmirror still has one vote (because node 1 could not update its vote count) and the ocr has just received 2 votes from node 1.

Now we restart crs again on node 2 and we see in its logfile:

2008-07-23 09:47:19.583: [ OCRRAW][1]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (1), 1st set (1385758746,1866209186), 2nd set (0,0) my votes (2), total votes (2)
2008-07-23 09:47:19.583: [ OCRRAW][1]proprioo: for disk 1 (/dev/oracle/ocrmirror), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)

Node 2 may be a little confused, because when both ocr and ocrmirror are available, he would expect each of them to have one vote…

Now let’s do an ocrcheck again on node 2 and get:

         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File needs to be synchronized with the other device

Aha, this is the right output when ocrmirror has 2 votes and ocr has 1 vote.

So let’s do what Oracle tells us in the output above, let’s synchronize the ocrmirror again with the other device:

So we do an “ocrconfig -replace ocrmirror /dev/oracle/ocrmirror” on node 2 (which can see both ocr and ocrmirror). Bad luck, this fails, because node 2 is not the master. Node 1 has become the master (see above) and node 1 cannot see the ocrmirror. So node 1 cannot verify or correct the vote count on the ocrmirror. Hence this last command gives in the alert file of node 1:

2008-07-23 09:57:26.251: [ OCROSD][35]utdvch:0:failed to open OCR file/disk /dev/oracle/ocrmirror, errno=5, os err string=I/O error
2008-07-23 09:57:26.251: [ OCRRAW][35]dev_replace: master could not verify the new disk (8)
[ OCRSRV][35]proas_replace_disk: Failed in changing configurations in the Master 8

After making the lun visible again on node 1 and repeating the last command, all succeeds again without error: the crs logfile of node 1 then shows:

2008-07-23 10:04:48.419: [ OCRRAW][33]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (1), 1st set (1385758746,1866209186), 2nd set (0,0) my votes (2), total votes (2)
2008-07-23 10:04:48.419: [ OCRRAW][33]propriogid:1: INVALID FORMAT
2008-07-23 10:04:48.484: [ OCRRAW][33]propriowv_bootbuf: Vote information on disk 1 [/dev/oracle/ocrmirror] is adjusted from [0/0] to [1/2]
2008-07-23 10:04:48.485: [ OCRRAW][33]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [2/2] to [1/2]
2008-07-23 10:04:48.557: [ OCRMAS][25]th_master: Deleted ver keys from cache (master)
2008-07-23 10:04:48.557: [ OCRMAS][25]th_master: Deleted ver keys from cache (master)

and in the same file on node 2 we see:

2008-07-23 10:04:48.492: [ OCRRAW][40]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 10:04:48.493: [ OCRRAW][40]proprioo: for disk 1 (/dev/oracle/ocrmirror), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 10:04:48.504: [ OCRMAS][25]th_master: Deleted ver keys from cache (non master)
2008-07-23 10:04:48.504: [ OCRMAS][25]th_master: Deleted ver keys from cache (non master)

The fact that the logfile on node 2 shows these messages indicates that the vote update done by the master (node 1) is propagated in some way to the other nodes, who in turn update some kind of local cache (I think).

Conclusion of Scenario 4

Hiding the lun from the non-ocr-master still can’t confuse CRS. However it takes longer for the cluster to detect that there is a problem with the storage, as only the master is able to detect io errors on the ocr/ocrmirror. But in the end it can be recovered again without downtime.

So you should be convinced now that we can’t confuse CRS, right?… Then you don’t know me yet, I still have scenario 5. Read on in the next chapter.


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Chapter 3

Fri, 2009-10-02 10:36
Scenario 3: Loss of OCRmirror from the OCR MASTER only

This is a follow-up of chapter 2.

As we have seen in scenario 1, the OCR MASTER will update the votecount. Now let’s hide the ocrmirror from only 1 node: the node being the OCR MASTER, while the other node continues to see the ocrmirror. Will CRS get confused about this?

Note: while doing this test, crs is running on both nodes.

In this scenario, node 2 is the OCR MASTER. In fact I haven’t found any command to query which node is the master. The only way to find out is to compare the crsd logfiles on all nodes and look for the most recent message “I AM THE NEW OCR MASTER”. If anyone knows a better way to determine this, please let me know.
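Something like the following, run on every node, gets you there (a sketch assuming the 10.2 layout used in this post, $ORA_CRS_HOME/log/<nodename>/crsd/crsd.log; compare the timestamps of the last match on each node):

# adjust the node name in the path if $(hostname) returns a fully qualified name
grep "I AM THE NEW OCR MASTER" $ORA_CRS_HOME/log/$(hostname)/crsd/crsd.log | tail -1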

So when hiding the ocrmirror from node 2, we see in its alert file, as expected:

2008-07-23 09:14:53.921
[crsd(8215)]CRS-1006:The OCR location /dev/oracle/ocrmirror is inaccessible. Details in /app/oracle/crs/log/nodeb01/crsd/crsd.log.

and in its logfile:

2008-07-23 09:14:53.920: [ OCROSD][14]utwrite:3: problem writing the buffer 1a33000 buflen 4096 retval -1 phy_offset 143360 retry 0
2008-07-23 09:14:53.920: [ OCROSD][14]utwrite:4: problem writing the buffer errno 5 errstring I/O error
2008-07-23 09:14:53.922: [ OCRRAW][34]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]

Nothing appears in the logfiles of the non-ocr-master, i.e. node 1. So until now this situation is still identical to scenario 1: it is the ocr master who updates the vote count after losing the other ocr.

The ocrcheck on node 2 (master) now gives:

         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File unavailable
         Cluster registry integrity check succeeded

But the output on node 1 (non-master) is different:

         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File needs to be synchronized with the other device
         Cluster registry integrity check succeeded

This makes sense, because node 2 cannot see the device (device/file unavailable) while node 1 sees both devices with different vote count (2 votes for ocr and 1 vote for ocrmirror, so it asks to resync just as in scenario 1 after the ocrmirror was visible again).

So now I want to try to confuse CRS. I will try to resync the ocrmirror again from node 1. This will update the vote count of each device to 1. Technically this is possible because node 1 can see both devices, but if it succeeds, node 2 will be left with one ocr device having one vote, and we know from rule 3 that CRS cannot run in that case. Will crs then crash on node 2?…

So we do on node 1:

-bash-3.00# ocrconfig -replace ocrmirror /dev/oracle/ocrmirror

Bad luck, it fails with

PROT-21: Invalid parameter

Very clear, right… The interesting part however appears in the crsd logfile of node 2:

2008-07-23 09:19:34.712: [ OCROSD][32]utdvch:0:failed to open OCR file/disk /dev/oracle/ocrmirror, errno=5, os err string=I/O error
2008-07-23 09:19:34.712: [ OCRRAW][32]dev_replace: master could not verify the new disk (8)
[ OCRSRV][32]proas_replace_disk: Failed in changing configurations in the Master 8

So this teaches us that, when the ocrconfig command is executed on node 1, which is not the master, it is sent to node 2, the master, and node 2 executes it. What does NOT happen is that crs crashes, nor does node 1 take over the mastership from node 2. Nice to know.
Very illogical, however, is that when doing the last command above, the crs alert file of node 2 shows

2008-07-23 09:19:34.711
[crsd(8215)]CRS-1007:The OCR/OCR mirror location was replaced by /dev/oracle/ocrmirror.

This is WRONG. The ocrmirror was not replaced. The message should be: “Trying to replace the OCR/OCR mirror location by /dev/oracle/ocrmirror”. Just so you know.
The logs on node 1 are correct. Find the latest log in the “client” directory and read:

Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle. All rights reserved.
2008-07-23 09:19:34.694: [ OCRCONF][1]ocrconfig starts…
2008-07-23 09:19:34.716: [ OCRCLI][1]proac_replace_dev:[/dev/oracle/ocrmirror]: Failed. Retval [8]
2008-07-23 09:19:34.716: [ OCRCONF][1]The input OCR device either is identical to the other device or cannot be opened
2008-07-23 09:19:34.716: [ OCRCONF][1]Exiting [status=failed]…

Conclusion: we cannot confuse the crs!

After making the ocrmirror visible again and reissuing the replace command on node 1, we get in the crs logfile on node 2 (master):

2008-07-23 09:27:15.384: [ OCRRAW][32]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (2), total votes (2)
2008-07-23 09:27:15.384: [ OCRRAW][32]propriogid:1: INVALID FORMAT
2008-07-23 09:27:15.516: [ OCRRAW][32]propriowv_bootbuf: Vote information on disk 1 [/dev/oracle/ocrmirror] is adjusted from [0/0] to [1/2]
2008-07-23 09:27:15.517: [ OCRRAW][32]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [2/2] to [1/2]
2008-07-23 09:27:15.518: [ OCRMAS][25]th_master: Deleted ver keys from cache (master)
2008-07-23 09:27:15.628: [ OCRMAS][25]th_master: Deleted ver keys from cache (master)

and the crs logfile of node 1 (non-master):

2008-07-23 09:27:15.543: [ OCRRAW][36]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 09:27:15.543: [ OCRRAW][36]proprioo: for disk 1 (/dev/oracle/ocrmirror), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 09:27:15.571: [ OCRMAS][25]th_master: Deleted ver keys from cache (non master)
2008-07-23 09:27:15.572: [ OCRMAS][25]th_master: Deleted ver keys from cache (non master)

and the client logfile of node 1:

Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle. All rights reserved.
2008-07-23 09:27:15.346: [ OCRCONF][1]ocrconfig starts…
2008-07-23 09:27:15.572: [ OCRCONF][1]Successfully replaced OCR and set block 0
2008-07-23 09:27:15.572: [ OCRCONF][1]Exiting [status=success]…

and all is ok again. Each device has one vote again, and we are back in the ‘normal’ situation.

Conclusion

We cannot confuse the CRS when ocr or ocrmirror disappears from the ocr master node only.

But what if it disappears from the non-master node…? That’s stuff for the next chapter.


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Chapter 2

Sat, 2009-09-19 14:29
Scenario 2: loss of ocrmirror, both nodes down

(This is the follow-up of chapter 1)

Let’s investigate the vote count a little further by doing the following test:

  • First stop crs on both nodes
  • Then make the lun with ocrmirror unavailable to both nodes

What happens?

Let’s check the ocr status before starting crs on any node:

bash-3.00# ocrcheck
PROT-602: Failed to retrieve data from the cluster registry

The crs alert file shows:

2008-07-18 15:57:36.438
[client(24204)]CRS-1011:OCR cannot determine that the OCR content contains the latest updates. Details in /app/oracle/crs/log/nodea01/client/ocrcheck_24204.log.

and the mentioned ocrcheck_24204.log file:

Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle.
All rights reserved.
2008-07-18 15:57:36.405: [OCRCHECK][1]ocrcheck starts…
2008-07-18 15:57:36.437: [ OCRRAW][1]proprioini: disk 0 (/dev/oracle/ocr) doesn’t
have enough votes (1,2)

2008-07-18 15:57:36.438: [ OCRRAW][1]proprinit: Could not open raw device
2008-07-18 15:57:36.438: [ default][1]a_init:7!: Backend init unsuccessful : [26]
2008-07-18 15:57:36.439: [OCRCHECK][1]Failed to access OCR repository: [PROC-26: Error while accessing the physical storage]
2008-07-18 15:57:36.439: [OCRCHECK][1]Failed to initialize ocrchek2
2008-07-18 15:57:36.439: [OCRCHECK][1]Exiting [status=failed]…

I didn’t try to start the CRS at this time, however I am sure it would result in the same error messages. Note the two key messages in the log above. The second one explains what the real problem is: one of the ocr devices is unavailable: error while accessing the physical storage. This is exactly the information you need to troubleshoot a failing crs start. The other message tells us more about the internals: the remaining ocr has only 1 vote, which isn’t enough. So that’s rule 3 in the world of CRS. So read and remember once and for all:

  1. Rule 1: CRS can start if it finds 2 ocr devices each having one vote (the normal case)
  2. Rule 2: CRS can start if it finds 1 ocr having 2 votes (the case after losing the ocrmirror).
  3. Rule 3: CRS CANNOT start if it finds only one ocr device having only 1 vote

Now if this is a production environment and we really want to get the cluster + databases up, how do we proceed? Well, we can do so by manually telling the cluster that the remaining ocr is valid and up-to-date. Note however that this is an important decision. It is up to you to know that the remaining ocr is valid. If you have been playing too much with missing luns, adding services, missing the other lun etc., it may be that the contents of the ‘invisible’ ocrmirror are more recent than those of the visible ocr. If in that case you tell crs that the ocr is valid, you may lose important information from your ocrmirror. Anyway, in most cases you will know very well what to do, and issue as root:

ocrconfig -overwrite

Now find the most recent file in $ORA_CRS_HOME/log/nodename/client and see that it contains:

Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle.
All rights reserved.
2008-07-18 15:59:56.828: [ OCRCONF][1]ocrconfig starts…
2008-07-18 15:59:58.644: [ OCRRAW][1]propriowv_bootbuf: Vote information on disk
0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]

2008-07-18 15:59:58.644: [ OCRCONF][1]Successfully overwrote OCR configuration on
disk
2008-07-18 15:59:58.644: [ OCRCONF][1]Exiting [status=success]…

So now we are in the situation of scenario 1: one ocr device available having 2 votes. This gives:

Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     295452
         Used space (kbytes)      :       5112
         Available space (kbytes) :     290340
         ID                       : 1930338735
         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File unavailable

         Cluster registry integrity check succeeded

And the crs startup happens without problem:

-bash-3.00# crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly

Note however that you still have to recover from this as in scenario 1 using “ocrconfig -replace ocrmirror /dev/…” once the storage box containing the ocrmirror is available again.

Conclusion of scenario 2

When losing an ocr or ocrmirror while crs is down on both nodes, Oracle is not able to update the vote count of the remaining ocr (no crs processes are running to do this). As a consequence it is up to you to do that, by using the “-overwrite” option of ocrconfig. After this, CRS can start as normal, and later on you can recover from this when the ocrmirror becomes available again or when you can use another new device for the ocrmirror.
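Putting the whole recovery of scenario 2 together (the same commands as used above; run as root, and only when you are sure the surviving ocr holds the most recent configuration):

ocrcheck                                            # fails with PROT-602 while the remaining ocr has only 1 of 2 votes
ocrconfig -overwrite                                # promote the surviving ocr to 2/2 votes
crsctl start crs                                    # crs now starts normally again
# later, once the lun with the ocrmirror is reachable again:
ocrconfig -replace ocrmirror /dev/oracle/ocrmirror  # re-create the mirror as a copy of the ocr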

So this looks great, let’s buy that additional storage box now.

But still I am not satisfied yet. Until now we had ‘clean errors’. I.e. both nodes were up or down, and the storage disappeared from both nodes at the same time. Let’s play a little more in the next chapters…


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Chapter 1

Tue, 2009-09-01 14:44
Scenario 1: loss of ocrmirror, both nodes up

(This is the follow-up of the article “Introduction”)

Facts
  • CRS is running on all nodes
  • The storage box containing the OCRmirror is made unavailable to both hosts (simulating a crash of one storage box).

What happens?

The crs alertfile ($ORA_CRS_HOME/log/hostname/alert.log) of node 1 shows:

2008-07-18 15:30:23.176
[crsd(6563)]CRS-1006:The OCR location /dev/oracle/ocrmirror is inaccessible. Details in /app/oracle/crs/log/nodea01/crsd/crsd.log.

And the CRS logfile of node 1 shows:

2008-07-18 15:30:23.176: [ OCROSD][14]utwrite:3: problem writing the buffer 1c03000 buflen 4096 retval -1 phy_offset 102400 retry 0
2008-07-18 15:30:23.176: [ OCROSD][14]utwrite:4: problem writing the buffer errno 5 errstring I/O error
2008-07-18 15:30:23.177: [ OCRRAW][768]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]

There is nothing in the crs alertfile or crsd logfile of node 2 (although node 2 can’t see the lun either).
On both nodes we have:

(/app/oracle) $ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     295452
         Used space (kbytes)      :       5112
         Available space (kbytes) :     290340
         ID                       : 1930338735
         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File unavailable
         Cluster registry integrity check succeeded

CRS continues to work normally on both nodes

Discussion

This test indicates that the loss of the ocrmirror leaves the cluster running normally. In other words, a crash of a storage box would allow us to continue our production normally. Great!

However I’m not easily satisfied and hence I still have a lot of questions: how to recover from this, what happens internally, can we now change/update the ocr, …? Let’s investigate.

The most interesting parts of the output above: the fact that the ocrmirror device file is unavailable makes sense. Remember however the other message: the vote count was adjusted from 1 to 2.
Let’s see what happens if we now stop and start CRS on node 1 (while ocrmirror is still unavailable):
Stopping CRS on node 1 happens as usual, no error messages. However at the time of stopping CRS on node 1, we see a very interesting message in the crsd logfile of node 2:

2008-07-18 15:34:38.504: [ OCRMAS][23]th_master:13: I AM THE NEW OCR MASTER at incar 2. Node Number 2
2008-07-18 15:34:38.511: [ OCRRAW][23]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (1), 1st set (138575874 6,1866209186), 2nd set (0,0) my votes (2), total votes (2)
2008-07-18 15:34:38.514: [ OCROSD][23]utread:3: problem reading buffer 162e000 buflen 4096 retval -1 phy_offset 106496 retry 0
2008-07-18 15:34:38.514: [ OCROSD][23]utread:4: problem reading the buffer errno 5 errstring I/O error
2008-07-18 15:34:38.559: [ OCRMAS][23]th_master: Deleted ver keys from cache (master)

I am the new master??? So it looks as if node 1 was the master until we stopped CRS there. This ties in with the fact that, when the lun became unavailable, only node 1 wrote messages in its logfiles. At that time, nothing was written into the logfile of node 2, because node 2 was not the master! A very interesting concept: in a RAC cluster, one node is the crs master and is responsible for updating the vote count in the OCR. I never read that in the doc…. Also note that the new master identifies that the ocr now has 2 votes: “my votes (2)”.

Also, at the time of stopping CRS on node 1, the crs alert file of node 2 showed:

2008-07-18 15:34:38.446
[evmd(18282)]CRS-1006:The OCR location /dev/oracle/ocrmirror is inaccessible. Details in /app/oracle/crs/log/nodeb01/evmd/evmd.log.
2008-07-18 15:34:38.514
[crsd(18594)]CRS-1006:The OCR location /dev/oracle/ocrmirror is inaccessible. Details in /app/oracle/crs/log/nodeb01/crsd/crsd.log.
2008-07-18 15:34:38.558
[crsd(18594)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /app/oracle/crs/log/nodeb01/crsd/crsd.log.
2008-07-18 15:34:55.153

So it looks as if node 2 checks the availability of the ocrmirror again and sees that it is not available.

Now let's start crs on node 1 again, maybe it becomes master again?… Not really. The only thing we see in the crsd logfile is:

2008-07-18 15:39:19.603: [ CLSVER][1] Active Version from OCR:10.2.0.4.0
2008-07-18 15:39:19.603: [ CLSVER][1] Active Version and Software Version are same
2008-07-18 15:39:19.603: [ CRSMAIN][1] Initializing OCR
2008-07-18 15:39:19.619: [ OCRRAW][1]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (1), 1st set (1385758746,1866209186), 2nd set (0,0) my votes (2), total votes (2)
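
For completeness, the stop and start of CRS on node 1 were done with the standard clusterware commands; a minimal sketch, run as root, with the CRS home /app/oracle/crs taken from the log paths above:

   # stop the clusterware stack on node 1 (with the ocrmirror still unavailable)
   /app/oracle/crs/bin/crsctl stop crs
   # ... and later start it again
   /app/oracle/crs/bin/crsctl start crs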

Recovery

Now how do we get things back to normal? Let's first make the lun visible again on the san switch. At that time nothing happens in any logfile, so CRS doesn't seem to poll to see if the ocrmirror is back. However when we now execute an ocrcheck, we get:

Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     295452
         Used space (kbytes)      :       5112
         Available space (kbytes) :     290340
         ID                       : 1930338735
         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File needs to be synchronized with the other device

         Cluster registry integrity check succeeded

Again, this makes sense. While the ocrmirror was unavailable, you may have added services, instances or whatever, so the contents of the (old) ocrmirror may be different from those of the current ocr. In our case however, nothing was changed on cluster level, so theoretically the contents of ocr and ocrmirror should still be the same. Still we get the message above. Anyway, the way to synchronize the ocrmirror again is to issue as root:

ocrconfig -replace ocrmirror /dev/oracle/ocrmirror

This will copy the contents of the ocr over the ocrmirror located at /dev/oracle/ocrmirror. In other words, it creates a new ocrmirror at /dev/oracle/ocrmirror as a copy of the existing ocr. Be careful with the syntax; do not use "-replace ocr" when it is the ocrmirror that is corrupt.
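
To make the distinction explicit, here is a minimal sketch (run as root; device paths as in this setup):

   ocrconfig -replace ocrmirror /dev/oracle/ocrmirror   # the mirror is bad: rebuild it from the intact ocr
   ocrconfig -replace ocr /dev/oracle/ocr               # only the other way around, when the primary ocr is the bad one
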
At that time, we see in the crs logfile on both nodes:

2008-07-18 15:51:06.254: [ OCRMAS][25]th_master: Deleted ver keys from cache (non master)

2008-07-18 15:51:06.263: [ OCRRAW][30]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)

2008-07-18 15:51:06.263: [ OCRRAW][30]proprioo: for disk 1 (/dev/oracle/ocrmirror), id match (1), my id set (1385758746,1866209186) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)

2008-07-18 15:51:06.364: [ OCRMAS][25]th_master: Deleted ver keys from cache (non master)

and in the alert file:

2008-07-18 15:51:06.246
[crsd(13848)]CRS-1007:The OCR/OCR mirror location was replaced by /dev/oracle/ocrmirror.

Note again the "my votes (1), total votes (2)" parts of the messages above: each ocr device has 1 vote again. And all is ok again:

Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     295452
         Used space (kbytes)      :       5112
         Available space (kbytes) :     290340
         ID                       : 1930338735
         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File integrity check succeeded
         Cluster registry integrity check succeeded

Conclusion of scenario 1

Losing the storage box containing the ocrmirror is no problem (the same is true for losing the ocr while the ocrmirror remains available). Moreover it can be recovered without having to stop the cluster (the restart of crs on node 1 above was for educational purposes only). This corresponds with what is stated in the RAC FAQ on Metalink Note 220970.1: "If the corruption happens while the Oracle Clusterware stack is up and running, then the corruption will be tolerated and the Oracle Clusterware will continue to function without interruptions" (however I think that the logfiles above give you much more insight in what really happens).

However another important concept is the story of the vote count. The test above shows that CRS is able to start if it finds 2 ocr devices each having one vote (the normal case) or if it finds 1 ocr having 2 votes (the case after losing the ocrmirror). Note that at the moment of the failure, the vote count of the ocr could be increased by oracle from 1 to 2, because CRS was running.

In the next chapter, we will do this over again, but with both nodes down…


The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Introduction

Fri, 2009-08-28 13:57

Some time ago I wrote a blog about stretched clusters and the OCR. The final conclusion at that time was that there was no easy way to keep your OCR safe on both storage boxes, and hence I advised against clusters with 2 storage boxes. However, after some more investigation I may have to change my mind. I did extensive testing on the OCR and in this blog I want to share my experiences.

This is the setup:

  • 2-node RAC cluster (10.2.0.4 on Solaris), located in 2 server rooms
  • 2 storage boxes, one in each server room
  • ASM mirroring of all data (diskgroups with normal redundancy)
  • One voting disk on one storage box, a 2nd voting disk on the other box, and a 3rd voting disk on nfs on a server in a 3rd location (outside the 2 server rooms); the layout can be verified as sketched below
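
A quick way to verify that voting disk layout is a crsctl query; a minimal sketch (run from the bin directory of the CRS home, the 10.2-specific output is not shown here):

   crsctl query css votedisk
   # should list the three configured voting disks: one per storage box and one on nfs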

For the components above, this setup is safe against server room failure:

  • The data is mirrored in ASM and will remain available on the other box.
  • The cluster can continue because it still sees 2 voting disks (one in the surviving server room and one on nfs).

But what about the OCR?

We did as what looks logical: OCR on storage box 1 and OCRmirror on storage box 2, resulting in:

         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/oracle/ocrmirror
                                    Device/File integrity check succeeded

Now we can start playing. For the uninitiated reader, "playing" means: closing ports on the fibre switches in such a way that a storage box becomes totally unavailable to the servers. This simulates a storage box failure.

The result is a story of 5 chapters and a conclusion. Please stand by for the upcoming blog posts.


Apex beats Access

Sun, 2008-09-21 14:18

This year I became part of the parents' council of my children's school. And like for every small organisation in Belgium (football clubs, gymnastics, schools, …) it is the time of year for the annual mussel event for fund raising (mmm, delicious). My school happened to use an Access-IIS-ASP application to do the billing when people leave the event (calculating the total for their food and drink consumptions). But o fortuna, the evening before the event, the application turned out to crash systematically when entering data, and murphy oh murphy, the author of it had just become a father and could not be reached. Panic of course.

Luckily I happened to have an old laptop with Oracle XE (and hence Apex) installed, and I said: give me two hours. The data model was very simple: one table with all food, drinks and prices, a second table for the history of all orders and payments, and then some screens calculating amounts multiplied by prices and summing them up.

And then you see how great Apex is. In two hours indeed, I could make a fully functional, nice looking and robust application with an operational and an admin part, including various reports. It worked like a charm and the school team was astonished that such a core part of their administration could be built in so little time. Considering the alternative they would otherwise have fallen back on (Excel, without keeping any history of consumptions, which would have prevented us from planning our inventory for next year), this was a great solution. I hope it may open the way for more Apex applications in the school, because Apex is really a great tool.

P.S. I won't use my blog to announce that I won't be going to OOW, that I have no flight number to share and that I won't be staying in any hotel, as I think it is quite a waste of bandwidth to spend complete blog posts on the practical arrangements of OOW trips. I am really waiting for the real technical posts and experiences of all those people there.


A really recommended ASM patch – failing lun

Wed, 2008-08-06 10:22

The following is a real life experience about failing disks/luns and how ASM reacts to this. We used 10.2.0.4 on Solaris with a HDS storage box and MPXio. We made an ASM diskgroup of 2 mirrored disks. Then we made the luns unavailable to the hosts (hide lun). The result was not really what we expected. After hiding the lun, the ASM alert file shows the following 'normal' messages:

15:26:57+: Errors in file /app/oracle/admin/+ASM/bdump/+asm1_gmon_10611.trc:
15:26:57+: ORA-27091: unable to queue I/O
15:26:57+: ORA-27072: File I/O error
15:26:57+: SVR4 Error: 5: I/O error
15:26:57+: Additional information: 4
15:26:57+: Additional information: 2048
15:26:57+: Additional information: -1

and some time later

15:32:23+: WARNING: offlining disk 0.4042332304 (ARESDATAA) with mask 0x3
15:32:23+: WARNING: offlining disk 0.4042332304 (ARESDATAA) with mask 0x3

and

15:32:33+: WARNING: kfk failed to open a disk[/dev/oracle/asm/aresdata-a]
15:32:33+: Errors in file /app/oracle/admin/+ASM/udump/+asm1_ora_18313.trc:
15:32:33+: ORA-15025: could not open disk ‘/dev/oracle/asm/aresdata-a’
15:32:33+: ORA-27041: unable to open file
15:32:33+: SVR4 Error: 5: I/O error
15:32:33+: Additional information: 3
15:32:33+: NOTE: PST update: grp = 1, dsk = 0, mode = 0x6
15:32:35+: NOTE: PST update: grp = 1, dsk = 0, mode = 0x4
15:32:35+: NOTE: group ARESDATA: relocated PST to: disk 0001 (PST copy 0)
15:32:35+: NOTE: PST update: grp = 1, dsk = 0, mode = 0x4
15:32:35+: NOTE: cache closing disk 0 of grp 1: ARESDATAA
15:32:35+: NOTE: cache closing disk 0 of grp 1: ARESDATAA
15:33:50+: WARNING: PST-initiated drop disk 1(780265568).0(4042332304) (ARESDATAA)
15:33:50+: NOTE: PST update: grp = 1
15:33:50+: NOTE: group ARESDATA: relocated PST to: disk 0001 (PST copy 0)
15:33:50+: NOTE: requesting all-instance membership refresh for group=1
15:33:50+: NOTE: membership refresh pending for group 1/0x2e81e860 (ARESDATA)
15:33:50+: SUCCESS: refreshed membership for 1/0x2e81e860 (ARESDATA)
15:33:53+: SUCCESS: PST-initiated disk drop completed
15:33:53+: SUCCESS: PST-initiated disk drop completed
15:33:56+: NOTE: starting rebalance of group 1/0x2e81e860 (ARESDATA) at power 1
15:33:56+: Starting background process ARB0
15:33:56+: ARB0 started with pid=21, OS id=19285
15:33:56+: NOTE: assigning ARB0 to group 1/0x2e81e860 (ARESDATA)
15:33:56+: NOTE: F1X0 copy 1 relocating from 0:2 to 1:2
15:33:56+: NOTE: F1X0 copy 2 relocating from 1:2 to 0:2
15:33:56+: NOTE: F1X0 copy 3 relocating from 65534:4294967294 to 65534:4294967294
15:33:56+: NOTE: X->S down convert bast on F1B3 bastCount=2

15:34:14+: NOTE: group ARESDATA: relocated PST to: disk 0001 (PST copy 0)
15:34:14+: WARNING: offline disk number 0 has references (1394 AUs)
15:34:14+: NOTE: PST update: grp = 1
15:34:14+: NOTE: group ARESDATA: relocated PST to: disk 0001 (PST copy 0)

However, every time we do a query on the v$asm_disk view, we experience hangs of 30..90 seconds. Same when adding a tablespace in a database. It looks as if ASM, whenever trying to access the failed disk, waits for OS timeouts. I assume that every operation that needs access to ASM disks (e.g. data file autoextend that needs to allocate space in the asm disk, creation of an archivelog, …) suffers from these timeouts. Not really acceptable for a production environment. I do want to mention that dd, and even orion for asynchronous IO, detect the error immediately without waiting for any timeout.
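
The hang itself is easy to demonstrate from sqlplus; a minimal sketch (the timings obviously depend on the number of failed disks and the OS timeouts):

   SQL> set timing on
   SQL> select path, mode_status, state from v$asm_disk;
   -- with the failed lun present, this simple query took 30..90 seconds instead of returning immediately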

You can clearly see the ASM waits when you truss the server process of the ASM sqlplus session while doing a select on v$asm_disk. For each failed disk you get:

9377/1: 17.3106 open(“/dev/oracle/asm/aresdata-a”, O_RDWR|O_DSYNC) Err#5 EIO
9377/1: 18.3195 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 18.3198 open(“/dev/oracle/asm/aresdata-a”, O_RDWR|O_DSYNC) Err#5 EIO
9377/1: 19.3295 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 19.3298 open(“/dev/oracle/asm/aresdata-a”, O_RDWR|O_DSYNC) Err#5 EIO
9377/1: 20.3395 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 20.3398 open(“/dev/oracle/asm/aresdata-a”, O_RDWR|O_DSYNC) Err#5 EIO
9377/1: 21.3495 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 21.3497 open(“/dev/oracle/asm/aresdata-a”, O_RDWR|O_DSYNC) Err#5 EIO
9377/1: 22.3595 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 22.3598 open(“/dev/oracle/asm/aresdata-a”, O_RDWR|O_DSYNC) Err#5 EIO
9377/1: 22.3605 open(“/dev/oracle/asm/aresdata-a”, O_RDONLY|O_DSYNC) Err#5 EIO
9377/1: 23.3695 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 23.3697 open(“/dev/oracle/asm/aresdata-a”, O_RDONLY|O_DSYNC) Err#5 EIO
9377/1: 24.3795 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 24.3798 open(“/dev/oracle/asm/aresdata-a”, O_RDONLY|O_DSYNC) Err#5 EIO
9377/1: 25.3895 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 25.3897 open(“/dev/oracle/asm/aresdata-a”, O_RDONLY|O_DSYNC) Err#5 EIO
9377/1: 26.3995 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 26.3998 open(“/dev/oracle/asm/aresdata-a”, O_RDONLY|O_DSYNC) Err#5 EIO
9377/1: 27.4095 nanosleep(0xFFFFFFFF7FFF8FE0, 0xFFFFFFFF7FFF8FD0) = 0
9377/1: 27.4097 open(“/dev/oracle/asm/aresdata-a”, O_RDONLY|O_DSYNC) Err#5 EIO
9377/1: 27.4105 write(5, ” * * * 2 0 0 8 – 0 7 -“.., 27) = 27
9377/1: 27.4106 write(5, “\n”, 1) = 1
9377/1: 27.4109 write(5, ” W A R N I N G : k f k”.., 62) = 62
9377/1: 27.4109 write(5, “\n”, 1) = 1
9377/1: 27.4111 close(6) = 0
9377/1: 27.4111 open(“/app/oracle/admin/+ASM/bdump/alert_+ASM2.log”, O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE, 0660) = 6
9377/1: 27.4118 time() = 1217417698
9377/1: 27.4119 writev(6, 0xFFFFFFFF7FFF8080, 3) = 88

So it tries to access each disk 6 times in read-write mode and 6 times again in read-only mode. A loss of 12 valuable seconds….
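
For reference, the trace above was taken with Solaris truss attached to the dedicated server process of the sqlplus session; a minimal sketch (9377 is the pid shown in the trace, use whatever ps reports in your case):

   # -d prefixes every system call with a timestamp, which is what makes the waits visible
   truss -d -p 9377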

At the same time, the os messages file generates the following messages every second:

Jul 30 11:01:05 node1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e800562f400000062f4000000d5 (ssd63):
Jul 30 11:01:05 node1 offline or reservation conflict

Jul 30 11:01:06 node1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e800562f400000062f4000000d5 (ssd
63):
Jul 30 11:01:06 node1 offline or reservation conflict

We would expect ASM to be intelligent enough to detect that the disk failed, but obviously it keeps trying to access it, with the resulting waits and timeouts.
FYI, the state of the disks after the failure has become:

DISKGROUP  PATH                            Total   Used  %Usd ST  Header    FAILGROUP   STATE      DiskName
---------- ------------------------------ ------ ------ ----- --- --------- ----------- ---------- ---------------
ARESDATA   /dev/oracle/asm/aresdata-b      46068   1396     3 ONL MEMBER    B           MOUNTED    ARESDATAB

ARESDATA                                   46068   1396     3 OFF UNKNOWN   A           MOUNTED    ARESDATAA
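
The overview above comes from a query on the ASM dictionary views; this is a minimal reconstruction of such a query (10.2 column names; the derived %Usd column and the sqlplus formatting are left out, and this is not necessarily the exact script used for the report):

   select dg.name diskgroup, d.path, d.total_mb total, d.total_mb - d.free_mb used,
          substr(d.mode_status, 1, 3) st, d.header_status header, d.failgroup,
          d.state, d.name diskname, d.mount_status mount_s
   from   v$asm_disk d, v$asm_diskgroup dg
   where  d.group_number = dg.group_number(+);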

I opened an SR on Metalink and I uploaded all possible traces I could generate (*). And guess what, for some reason (maybe (*)), I immediately got an excellent engineer who identified the problem at once as a known bug, and asked development to provide a patch for 10.2.0.4 (which did not exist yet at that time). It took only 5 days for the patch to become available, and that patch solves the problem completely. After applying it, every select on v$asm_disk returns immediately.

This is it:

Patch: 6278034
Description: WHEN SWITCHING OFF ONE ARRAY CONTAINING ONE FAILGROUP, THE PERFORMANCE TURNS BAD
Product: RDBMS Server
Release: 10.2.0.3, 10.2.0.4
Platform: Sun Solaris SPARC (64-bit)
Last Updated: 04-AUG-2008
Size: 97K (99336 bytes)
Classification: General

The patch exists as well for 10.2.0.3. I would recommend installing it in the ORACLE_HOME where ASM runs. However I have no idea whether the problem also applies to non-Solaris environments.
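
Applying it is a standard one-off patch installation; a minimal sketch (run as the oracle software owner, with ASM and the databases using that home shut down, and assuming the patch was unzipped into a directory named after the patch number):

   cd 6278034
   $ORACLE_HOME/OPatch/opatch apply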

Note: To resync the disks after the lun is available again, use the ALTER DISKGROUP ADD FAILGROUP x DISK '/dev/path/somedevice' NAME some_new_name [FORCE] command. Not so straightforward: it turns out that trying to offline or drop the disk will not work. I.e.:

==================== OVERVIEW OF ASM DISKS ======================================

DISKGROUP  PATH                            Total   Used  %Usd ST  Header    FAILGROUP   STATE      DiskName        MOUNT_S
---------- ------------------------------ ------ ------ ----- --- --------- ----------- ---------- --------------- -------
ARESARCHA  /dev/oracle/asm/aresarch-a      30708  11226    36 ONL MEMBER    AREAARCHA   NORMAL     AREAARCHA       CACHED
ARESARCHB  /dev/oracle/asm/aresarch-b      30708  11202    36 ONL MEMBER    ARESARCHB   NORMAL     ARESARCHB       CACHED
ARESDATA   /dev/oracle/asm/aresdata-a      46068   1412     3 ONL MEMBER    A           NORMAL     ARESDATAA       CACHED

ARESDATA                                       0      0       OFF UNKNOWN   B           HUNG       ARESDATAB       MISSING

Trying to add it with the same name as before:
SQL> alter diskgroup ARESDATA add failgroup B disk '/dev/oracle/asm/aresdata-b' name ARESDATAB force;
alter diskgroup ARESDATA add failgroup B disk '/dev/oracle/asm/aresdata-b' name ARESDATAB force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15010: name is already used by an existing ASM disk

Adding it using a new name:

SQL> alter diskgroup ARESDATA add failgroup B disk '/dev/oracle/asm/aresdata-b' name ARESDATAB2 force;
alter diskgroup ARESDATA add failgroup B disk '/dev/oracle/asm/aresdata-b' name ARESDATAB2 force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15034: disk '/dev/oracle/asm/aresdata-b' does not require the FORCE option

SQL> alter diskgroup ARESDATA add failgroup B disk '/dev/oracle/asm/aresdata-b' name ARESDATAB2;

Diskgroup altered.

I assume that the need to use the ‘force’ option depends on the kind of error you got.

Sometimes I see that the rebalance does not start automatically. Then you get the following status:

DISKGROUP  PATH                            Total   Used  %Usd ST  Header    FAILGROUP   STATE      DiskName        MOUNT_S
---------- ------------------------------ ------ ------ ----- --- --------- ----------- ---------- --------------- -------
ARESARCHA  /dev/oracle/asm/aresarch-a      30708  11322    36 ONL MEMBER    AREAARCHA   NORMAL     AREAARCHA       CACHED
ARESARCHB  /dev/oracle/asm/aresarch-b      30708  11298    36 ONL MEMBER    ARESARCHB   NORMAL     ARESARCHB       CACHED
ARESDATA   /dev/oracle/asm/aresdata-a      46068   1412     3 ONL MEMBER    A           NORMAL     ARESDATAA       CACHED
           /dev/oracle/asm/aresdata-b      46068      2     0 ONL MEMBER    B           NORMAL     ARESDATAB2      CACHED

ARESDATA                                       0      0       OFF UNKNOWN   B           FORCING    ARESDATAB       MISSING

In that case, start it manually using:

SQL> alter diskgroup ARESDATA rebalance;

Diskgroup altered.


Oracle VM and multiple local disks

Mon, 2008-06-09 15:06

For my Oracle VM test environment I have a server available with multiple internal disks of different sizes and speeds. So I was wondering if it is possible to have all these disks used together for my virtual machines in Oracle VM.

If all disks had been the same size and speed, I could easily have used the internal raid controller to put them in a mirror, stripe or raid5 and end up with one large volume, alias disk, for my Oracle VM. However, due to the different characteristics of the disks (speed/size) this is not a good idea. So I started to look in Oracle VM Manager (the java console) to see what is possible.

It soon turned out that Oracle VM is designed for a different architecture: the desired setup is to have a (large) SAN box with shared storage that is available to multiple servers. All these servers can then be put in a server pool, sharing the same storage. This setup allows live migration of running machines to another physical server. Of course this makes sense because it fits nicely in the concept of grid computing: if any physical server fails, just restart your virtual machine on another one, and add machines according to your performance needs. But it doesn't help me: I don't have one storage box with multiple servers, I have one server with multiple disks.

So I started to browse a little through the executables of the OVM installation, and under /usr/lib/ovs I found the ovs-makerepo script. As I understand it, the architecture is as follows (based on what I could find on the internet, because there is not much clear documentation on this): when installing OVM, you have a /boot, a / and a swap partition (just as in traditional linux) and OVM requires one large partition to be used for virtual machines, which will be mounted under /OVS. In this partition you find a subdirectory "running_pool", which contains all the virtual machines that you have created and that you can start, and a subdirectory "seed_pool", which contains templates you can start from when creating new machines. There is also "local", "remote" and "publish_pool", but they were irrelevant for me at the moment and I didn't try to figure out what they are used for.
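
Translated into a directory listing, this is roughly what you see on a default installation (a sketch based on the description above; the exact entries may differ per version):

   [root@nithog ~]# ls /OVS
   local  publish_pool  remote  running_pool  seed_pool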

With this in mind I can install Oracle VM on my first disk and end up with 4 partitions on /dev/sda:

   Filesystem 1K-blocks     Used Available Use% Mounted on
   /dev/sda1     248895    25284    210761  11% /boot
   (sda2 is swap)
   /dev/sda3    4061572   743240   3108684  20% /
   /dev/sda4   24948864 22068864   2880000  89% /OVS

Now I want to add the space on my second disk (/dev/sdb) to this setup. So first I create one large partition on the disk using fdisk. Then I create an OCFS2 file system on it as follows:

[root@nithog ovs]# mkfs.ocfs2 /dev/sdb1
mkfs.ocfs2 1.2.7
Filesystem label=
Block size=4096 (bits=12)
Cluster size=4096 (bits=12)
Volume size=72793694208 (17771898 clusters) (17771898 blocks)
551 cluster groups (tail covers 31098 clusters, rest cover 32256 clusters)
Journal size=268435456
Initial number of node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 4 block(s)
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful

Initially I created the file system as ext3, which worked well. However there was one strange thing. This is what you get:

  • Create a new (paravirtualized) (linux) virtual machine in this new (ext3-based) repository (see later how exactly)
  • Specify a disk of e.g. 2Gb
  • Complete the wizard
  • This prepares a machine where you can start using the linux installer on the console to install the machine (do not start to install yet)
  • Now look in …/running_pool/machine_name and see a file of 2Gb
  • Now do du -sk on …/running_pool/machine_name and see that only 20Kb is used (see the sketch after this list)
  • From the moment you start to partition your disk inside the virtual machine, the output of "du -sk" grows by the same amount as the data you really put in it. So it behaves a bit like 'dynamic provisioning'.
  • Note however that ls -l shows a file of 2Gb at any time
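
A minimal sketch of that check (the machine name and image file name are illustrative only):

   # ls -l reports the full logical size of the virtual disk file...
   ls -l /OVS/running_pool/machine_name/System.img
   # ...while du -sk shows how many blocks are really allocated, and grows as you fill the disk inside the VM
   du -sk /OVS/running_pool/machine_name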

I don’t know for the moment if this behaviour is caused by the fact that the file system is ext3, but anyway, I leave it up to you to judge if this is an advantage or a disadvantage.

Now when trying to add my new sdb1 partition as an extra repository, I got:

Usage:

[root@nithog ~]# /usr/lib/ovs/ovs-makerepo
 usage: /usr/lib/ovs/ovs-makerepo <source> <shared> <description>
        source: block device or nfs path to filesystem
        shared: filesystem shared between hosts?  1 or 0
        description: descriptive text to be displayed in manager

Execution:

   [root@nithog ovs]# /usr/lib/ovs/ovs-makerepo /dev/sdb1 0 "Repo on disk 2" 
   ocfs2_hb_ctl: Unable to access cluster service while starting heartbeat mount.ocfs2: 
   Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted" 
   Error mounting /dev/sdb1

Seems like the script expects something like a cluster, but I just have a standalone node… I think that this script is intended to add a shared repository to a cluster of nodes. No problem, let’s try to convert our standalone machine to a one-node cluster: create the file /etc/ocfs2/cluster.conf:

cluster:
        node_count = 1
        name = ocfs2
node:
        ip_port = 7777
        ip_address = 10.7.64.160
        number = 1
        name = nithog
        cluster = ocfs2

Note that the indented lines MUST start with a <TAB> and then the parameter with its value. After creating this file I could do:

   [root@nithog ovs]# /etc/init.d/o2cb online ocfs2
   Starting O2CB cluster ocfs2: OK

and then
   [root@nithog ovs]# /usr/lib/ovs/ovs-makerepo /dev/sdb1 0 "Repo on disk 2"
   Initializing NEW repository /dev/sdb1
   SUCCESS: Mounted /OVS/877DECC5B658433D9E0836AFC8843F1B
   Updating local repository list.
   ovs-makerepo complete

As you can see, an extra subdirectory is created in the /OVS file system, with a strange UUID as its name. Under this directory my new file system /dev/sdb1 is mounted. This file system is a real new repository, because under /OVS/877DECC5B658433D9E0836AFC8843F1B you also find the running_pool and seed_pool directories. It is also listed in /etc/ovs/repositories (but it is NOT recommended to edit this file manually).

Then I looked in the Oracle VM Manager (the java based web gui) but I didn't find any trace of this new repository. It looks as if this gui is not (yet) designed to handle multiple repositories. However I started to figure out whether my new disk could really be used for virtual machines, and these are my findings:

  • When creating a new virtual machine, you have no way of specifying in which repository it should end up
  • It seems to end up in the repository with the most free space (but I should do more testing to be 100% certain)
  • When adding a new disk to an existing virtual machine (an extra file on oracle-vm level) the file will end up in the same repository, even in the same directory where the initial files of your virtual machine are located. If there is NOT enough free space on that disk, Oracle VM will NOT put your file in another repository on another disk.
  • You can move the datafiles of your virtual machine to any other location while the machine is not running, provided you change the reference to the file in /etc/xen/<machine_name> (see the sketch after this list)
  • So actually it looks like on xen level you can put your vm datafiles in any directory; the concept of repositories seems to be oracle-vm specific.
  • So if you create a new virtual machine and Oracle puts it in the wrong repository, it is not difficult at all to move it afterwards to another filesystem/repository. It just requires a little manual intervention. However it seems recommended to always keep your machines in an oracle-vm repository, in the running_pool, because only then can they be managed by the Oracle VM gui.
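
For reference, the file reference that has to be changed is the disk entry in the Xen configuration file; a minimal sketch (repository UUID, machine name and image name are illustrative, and the device name xvda applies to a paravirtualized guest):

   # in /etc/xen/<machine_name>, point the disk entry to the new location of the image file
   disk = ['file:/OVS/877DECC5B658433D9E0836AFC8843F1B/running_pool/machine_name/System.img,xvda,w']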

I am sure that there are many things that have an obvious explanation, but I have to admit that I didn't read the manuals of ocfs and oracle vm completely from start to end. Also I think that Oracle

Conclusion: Oracle VM seems to be capable of having multiple repositories on different disks, but the GUI is not ready to handle them. With a minimum of manual intervention, however, it is easy to do all desired tasks from the command line.