Re: Problem with RAC on Solaris

From: Subodh Deshpande <deshpande.subodh_at_gmail.com>
Date: Sun, 10 Jul 2011 12:49:01 +0530
Message-ID: <CAJsOtB5LixjvdK005dBOv2eCp-fSBpquejp4YonYOcyHH12FOQ_at_mail.gmail.com>



So Mars rebooting Venus is one thing, but does Venus also reboot Mars?
Which MOS note did you follow to install this RAC? And when both nodes are running, did you check the connectivity between the two nodes, from one to the other?
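For example, a minimal reachability check you could run from each node against the other node's private interconnect address (the address below is a placeholder; take the real ones from `oifcfg getif` or the alert logs, and note the ping flags differ between Linux and Solaris):

```shell
# Minimal sketch: verify this node can reach the peer's private
# interconnect address. The example address is a placeholder taken
# from the alert log in this thread; substitute your real peer IPs.
check_peer() {
  host=$1
  # -c 2 is the Linux flag; on Solaris 10 the equivalent is: ping "$host" 5
  if ping -c 2 "$host" >/dev/null 2>&1; then
    echo "OK: $host reachable"
  else
    echo "FAIL: $host unreachable"
  fi
}
# Example: check_peer 169.254.220.65   # HAIP address seen in Mars's alert log
```

Running `cluvfy comp nodecon -n mars,venus -verbose` from the Grid home also exercises the interconnect end to end.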

On 10 Jul 2011 07:33, "Grzesiak, Robert (LNG-POL)" <Robert.Grzesiak_at_lexisnexis.pl> wrote:

Welcome,

This is my first time on the oracle-l freelists... and sorry for my poor English :)

I have an Oracle 11.2.0.2 RAC system on Solaris 10, and I have some problems with its stability. After a few days one node restarts the other one, then it works for the next few days, and then it repeats... I have 2 nodes, named Mars and Venus. For example, today the Mars node restarted the second one (Venus), but I don't know why... In the DB alert log there is information that there was some problem with communication - a timeout - and that's all...

It happened at about 02:30...

DB alert log on Mars:

Thread 2 advanced to log sequence 5461 (LGWR switch)
  Current log# 10 seq# 5461 mem# 0: +DATA/lnred/onlinelog/group_10.308.743814917
Thu Jul 07 01:02:23 2011
Archived Log entry 3047 added for thread 2 sequence 5460 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:31:28 2011
minact-scn master exiting with err:12751
Thu Jul 07 02:31:43 2011
IPC Send timeout detected. Sender: ospid 1940 [oracle_at_Mars (MMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 26
IPC Send timeout: Terminating pid 26 osid 1940
Thu Jul 07 02:31:46 2011
IPC Send timeout detected. Sender: ospid 18305 [oracle_at_Mars (M000)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 75
IPC Send timeout: Terminating pid 75 osid 18305
Thu Jul 07 02:32:07 2011
IPC Send timeout detected. Sender: ospid 1908 [oracle_at_Mars (LMD0)]
Receiver: inst 1 binc 36502 ospid 1949
Thu Jul 07 02:32:10 2011
Restarting dead background process MMON
Thu Jul 07 02:32:11 2011
MMON started with pid=26, OS id=19687
Thu Jul 07 02:32:15 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle_at_Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1951
IPC Send timeout to 1.1 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:32:19 2011
Communications reconfiguration: instance_number 1
Thu Jul 07 02:32:24 2011
IPC Send timeout to 1.0 inc 32 for msg type 65521 from opid 12
Thu Jul 07 02:32:29 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle_at_Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1955
IPC Send timeout to 1.2 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:33:18 2011
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Thu Jul 07 02:33:38 2011
Remote instance kill is issued with system inc 34
Remote instance kill map (size 1) : 1
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1
Thu Jul 07 02:33:48 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:18 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:40 2011
Reconfiguration started (old inc 32, new inc 36)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Jul 07 02:34:40 2011
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:34:40 2011
 LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Thu Jul 07 02:34:40 2011
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
 read 3662 KB redo, 135 data blocks need recovery
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Started redo application at
 Thread 1: logseq 10409, block 19647
Recovery of Online Redo Log: Thread 1 Group 5 Seq 10409 Reading mem 0
  Mem# 0: +DATA/lnred/onlinelog/group_5.303.743814897
Completed redo application of 0.02MB
Completed instance recovery at
 Thread 1: logseq 10409, block 26971, scn 2612073761
 62 data blocks read, 121 data blocks written, 3662 redo k-bytes read
Thread 1 advanced to log sequence 10410 (thread recovery)
Redo thread 1 internally disabled at seq 10410 (SMON)
Thu Jul 07 02:34:41 2011
Archived Log entry 3048 added for thread 1 sequence 10409 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:42 2011
ALTER SYSTEM SET service_names='lnwyd_service','lnred_service','backup' SCOPE=MEMORY SID='lnred2';
Thu Jul 07 02:34:42 2011
ARC3: Archiving disabled thread 1 sequence 10410
Archived Log entry 3049 added for thread 1 sequence 10410 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:53 2011
minact-scn: Inst 2 is now the master inc#:36 mmon proc-id:19687 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.9bb067c4 gcalc-scn:0x0000.9bb06ebf
Thu Jul 07 02:35:51 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 no longer running (check cable)
Thu Jul 07 02:36:52 2011
SKGXP: ospid 1908: network interface with IP addressTs is now running
Thu Jul 07 02:37:28 2011
IPC Send timeout detected. Sender: ospid 1886 [oracle_at_Mars (PMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 2
Thu Jul 07 02:41:17 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
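(As a side note on that FRA warning: a quick sanity check of the reported figure, using only the numbers from the alert log. This is just shell arithmetic; inside the database you would normally query V$RECOVERY_FILE_DEST or V$FLASH_RECOVERY_AREA_USAGE instead.)

```shell
# Back-of-the-envelope check of the alert-log figure: 80.48% of the
# 512000 MB db_recovery_file_dest_size limit.
limit_mb=512000
pct_used=80.48
used_mb=$(awk -v l="$limit_mb" -v p="$pct_used" 'BEGIN { printf "%.0f", l * p / 100 }')
echo "approx ${used_mb} MB of ${limit_mb} MB FRA in use"
```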

Thu Jul 07 02:42:28 2011
Reconfiguration started (old inc 36, new inc 38)
List of instances:
 1 2 (myinst: 2)
 Global Resource Directory frozen
 Communication channels reestablished
Thu Jul 07 02:42:28 2011
 * domain 0 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Jul 07 02:42:28 2011
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:42:28 2011
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Thu Jul 07 02:42:29 2011
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:38 errcnt:0
Reconfiguration complete
Thu Jul 07 02:56:18 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 04:09:37 2011
Thread 2 advanced to log sequence 5462 (LGWR switch)
  Current log# 8 seq# 5462 mem# 0: +DATA/lnred/onlinelog/group_8.306.743814913
Thu Jul 07 04:09:39 2011

DB alert log on Venus:

Wed Jul 06 22:09:06 2011
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_1_seq_10394.415.755808529
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_2_seq_5418.421.755808553
Archived Log entry 3044 added for thread 1 sequence 10408 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:42:16 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Private Interface 'aggr2:1' configured from GPnP for use as a private interconnect.
  [name='aggr2:1', type=1, ip=169.254.16.251, mac=00-1b-21-49-f4-34, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'aggr1' configured from GPnP for use as a public interface.
  [name='aggr1', type=1, ip=10.122.10.150, mac=00-1b-21-49-f4-30, net=10.122.10.0/23, mask=255.255.254.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters option.
Using parameter settings in server-side pfile /md1/app/oracle/product/11.2.0/db1/dbs/initlnred1.ora
System parameters with non-default values:
  processes                = 300
  sessions                 = 480
  timed_statistics         = TRUE
  spfile                   = "+DATA/lnred/spfilelnred.ora"
  nls_language             = "POLISH"
  nls_territory            = "POLAND"
  nls_length_semantics     = "CHAR"
  resource_manager_plan    = ""
  sga_target               = 8G
  control_files            = "+DATA/lnred/controlfile/current.256.741561271"
  db_block_size            = 8192
  compatible               = "11.2.0.0.0"
  cluster_database         = TRUE
  db_create_file_dest      = "+DATA"
  db_recovery_file_dest    = "+FRA"
  db_recovery_file_dest_size= 500G
  thread                   = 1
  undo_tablespace          = "UNDOTBS1"
  instance_number          = 1
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain                = ""
  dispatchers              = "(PROTOCOL=TCP) (SERVICE=lnredXDB)"
  remote_listener          = "lnscan.adln.local:1521"
  audit_file_dest          = "/md1/app/oracle/admin/lnred/adump"
  optimizer_features_enable= "11.1.0.7"
  audit_trail              = "DB"
  db_name                  = "lnred"
  open_cursors             = 300
  pga_aggregate_target     = 8G
  aq_tm_processes          = 1
  diagnostic_dest          = "/md1/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.16.251
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Thu Jul 07 02:42:25 2011
PMON started with pid=2, OS id=1914

OS messages on Venus:

Jul 7 02:34:38 Venus unix: [ID 836849 kern.notice]
Jul 7 02:34:38 Venus ^Mpanic[cpu6]/thread=ffffffff9bec7e60:
Jul 7 02:34:38 Venus genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017ceeb0 genunix:kadmin+517 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef00 genunix:uadmin+c7 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef10 unix:brand_sys_syscall+21d ()
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 672855 kern.notice] syncing file systems...
Jul 7 02:34:39 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:39 Venus genunix: [ID 733762 kern.notice] 220
Jul 7 02:34:40 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:40 Venus genunix: [ID 733762 kern.notice] 205
Jul 7 02:34:42 Venus genunix: [ID 433738 kern.notice] [6]
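(That "forced crash dump initiated at user request" panic is the signature of a reboot requested by software - for example a Clusterware member kill - rather than a hardware fault. A minimal sketch for scanning a Solaris messages-style file for it; the helper name and path are just illustrative:)

```shell
# Hypothetical helper: look for the member-kill panic signature in a
# Solaris /var/adm/messages-style file passed as $1.
scan_messages() {
  grep -n 'forced crash dump initiated at user request' "$1" \
    || echo "no forced-crash-dump signature found"
}
# Example: scan_messages /var/adm/messages
```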

So... it isn't much information... but in the Venus Grid log grid/log/venus/crsd/crsd.log I found:

2011-07-07 02:02:10.052: [ CRSPE][47] {1:3804:25345} Processing PE command id=25249. Description: [Stat Resource : 3703910]
2011-07-07 02:02:10.054: [UiServer][48] {1:3804:25345} Done for ctx=32ca350

Here there is some garbage, maybe from the same communication problem: [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL], and then:

[ clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=venusDBG_CRSD))
2011-07-07 02:41:48.808: [ clsdmt][3]PID for the Process [1619], connkey 1

But in the Mars Grid log grid/log/mars/crsd/crsd.log:

2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-07-07 02:20:48.482: [ CRSCOMM][39] Ipc: Client disconnected.
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] IpcL: Listener got clsc error 11 for memNum. 8
2011-07-07 02:20:48.482: [ CRSCOMM][39] IpcL: connection to member 8 has been removed
2011-07-07 02:20:48.483: [CLSFRAME][39] Removing IPC Member:{Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [CLSFRAME][39] Disconnected from AGENT process: {Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agfw Proxy Server received process disconnected notification, count=1
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} /md1/app/11.2.0/grid/bin/oraagent_oracle disconnected.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agent /md1/app/11.2.0/grid/bin/oraagent_oracle[18546] stopped!
2011-07-07 02:20:48.483: [ CRSCOMM][42] {2:44971:2690} IpcL: removeConnection: Member 8 does not exist.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Created alert : (:CRSAGF00129:) : The agent: /md1/app/11.2.0/grid/bin/oraagent has been disabled because it exceeded the maximum failures (5) over a period of 600 seconds.

At this time the Venus OS was rebooted (I think via IPMI, from Mars):

2011-07-07 02:34:39.963: [ CRSSE][32] Forwarding Node Leave to PE for: venus
2011-07-07 02:34:39.965: [ CRSCCL][34]Reconfig event received by clssgsgrpstat
2011-07-07 02:34:39.965: [ CRSCCL][34]cclGetMemberData called
2011-07-07 02:34:39.979: [ CRSCCL][34]Member (2, 307107, mars:11.2.0.2.0) @ found.
2011-07-07 02:34:39.999: [CLSFRAME][34] CCL MEMBER LEFT:1:1:CRSD:venus
2011-07-07 02:34:39.999: [CLSFRAME][34] Disconnected from CRSD:venus process:

And... what should I do?

Do you have any ideas?

Regards,
----------------------------------------------------
Robert Grzesiak
Senior Developer, Oracle DBA

LexisNexis Polska Sp. z o.o.
tel office: +48 12 426 00 03
email: Robert.Grzesiak_at_LexisNexis.pl
www.lexisnexis.pl

Biurowiec Galileo, ul. Armii Krajowej 18, 30-150 Kraków
XII Commercial Division of the National Court Register, KRS 0000035401
Owner: Reed Elsevier Overseas B.V. Share capital: PLN 8,200,000
REGON: 010860143, NIP: 526-10-13-257
----------------------------------------------------
Save a tree! Print this message only if it's absolutely necessary
--
http://www.freelists.org/webpage/oracle-l
Received on Sun Jul 10 2011 - 02:19:01 CDT
