Problem with RAC on Solaris

From: Grzesiak, Robert (LNG-POL)
Date: Sun, 10 Jul 2011 04:00:25 +0200
Message-ID: <908C0DD49CA1CD4795E2F2C3C26CB6555F89A17053_at_exchangesvr>



Hello,

This is my first post to the oracle-l freelists... and sorry in advance for my English :)

I have an Oracle 11.2.0.2 RAC system on Solaris 10, and I have problems with the stability of this system. After a few days one node restarts the other, then it works for the next few days, and then it repeats... I have 2 nodes, named Mars and Venus. For example, today the Mars node restarted the second one (Venus), but I don't know why... In the db alert log there is only information about a communication problem, an IPC send timeout, and that's all...

It happened at about 02:30...

Db alert log on Mars:

Thread 2 advanced to log sequence 5461 (LGWR switch)
  Current log# 10 seq# 5461 mem# 0: +DATA/lnred/onlinelog/group_10.308.743814917
Thu Jul 07 01:02:23 2011
Archived Log entry 3047 added for thread 2 sequence 5460 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:31:28 2011
minact-scn master exiting with err:12751
Thu Jul 07 02:31:43 2011
IPC Send timeout detected. Sender: ospid 1940 [oracle_at_Mars (MMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 26
IPC Send timeout: Terminating pid 26 osid 1940
Thu Jul 07 02:31:46 2011
IPC Send timeout detected. Sender: ospid 18305 [oracle_at_Mars (M000)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 75
IPC Send timeout: Terminating pid 75 osid 18305
Thu Jul 07 02:32:07 2011
IPC Send timeout detected. Sender: ospid 1908 [oracle_at_Mars (LMD0)]
Receiver: inst 1 binc 36502 ospid 1949
Thu Jul 07 02:32:10 2011
Restarting dead background process MMON
Thu Jul 07 02:32:11 2011
MMON started with pid=26, OS id=19687
Thu Jul 07 02:32:15 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle_at_Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1951
IPC Send timeout to 1.1 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:32:19 2011
Communications reconfiguration: instance_number 1
Thu Jul 07 02:32:24 2011
IPC Send timeout to 1.0 inc 32 for msg type 65521 from opid 12
Thu Jul 07 02:32:29 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle_at_Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1955
IPC Send timeout to 1.2 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:33:18 2011
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Thu Jul 07 02:33:38 2011
Remote instance kill is issued with system inc 34
Remote instance kill map (size 1) : 1
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1
Thu Jul 07 02:33:48 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:18 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:40 2011
Reconfiguration started (old inc 32, new inc 36)
List of instances:
 2 (myinst: 2)
 Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Jul 07 02:34:40 2011
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:34:40 2011
 LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Thu Jul 07 02:34:40 2011
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 Started redo scan
 Completed redo scan
 read 3662 KB redo, 135 data blocks need recovery
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
 Started redo application at
 Thread 1: logseq 10409, block 19647
Recovery of Online Redo Log: Thread 1 Group 5 Seq 10409 Reading mem 0
  Mem# 0: +DATA/lnred/onlinelog/group_5.303.743814897
 Completed redo application of 0.02MB
 Completed instance recovery at
 Thread 1: logseq 10409, block 26971, scn 2612073761
 62 data blocks read, 121 data blocks written, 3662 redo k-bytes read
Thread 1 advanced to log sequence 10410 (thread recovery)
Redo thread 1 internally disabled at seq 10410 (SMON)
Thu Jul 07 02:34:41 2011
Archived Log entry 3048 added for thread 1 sequence 10409 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:42 2011
ALTER SYSTEM SET service_names='lnwyd_service','lnred_service','backup' SCOPE=MEMORY SID='lnred2';
Thu Jul 07 02:34:42 2011
ARC3: Archiving disabled thread 1 sequence 10410
Archived Log entry 3049 added for thread 1 sequence 10410 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:53 2011
minact-scn: Inst 2 is now the master inc#:36 mmon proc-id:19687 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.9bb067c4 gcalc-scn:0x0000.9bb06ebf
Thu Jul 07 02:35:51 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 no longer running (check cable)
Thu Jul 07 02:36:52 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 is now running
Thu Jul 07 02:37:28 2011
IPC Send timeout detected. Sender: ospid 1886 [oracle_at_Mars (PMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 2
Thu Jul 07 02:41:17 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 02:42:28 2011
Reconfiguration started (old inc 36, new inc 38)
List of instances:
 1 2 (myinst: 2)
 Global Resource Directory frozen
 Communication channels reestablished
Thu Jul 07 02:42:28 2011
* domain 0 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Jul 07 02:42:28 2011
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:42:28 2011
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Thu Jul 07 02:42:29 2011
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:38 errcnt:0
Reconfiguration complete
Thu Jul 07 02:56:18 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 04:09:37 2011
Thread 2 advanced to log sequence 5462 (LGWR switch)
  Current log# 8 seq# 5462 mem# 0: +DATA/lnred/onlinelog/group_8.306.743814913
Thu Jul 07 04:09:39 2011
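To see the failure sequence in a long alert log more clearly, it can help to pull out just the timestamped eviction-related entries. A minimal sketch (my own, not part of the original post; it assumes the 11.2 alert-log format shown above, where a `Thu Jul 07 02:31:43 2011` timestamp line precedes each entry, and the keyword list is only a guess at what matters here):

```python
import re
from datetime import datetime

# Entries we consider eviction-related (an assumption for this case).
KEYWORDS = ("IPC Send timeout", "Evicting instance", "Reconfiguration",
            "Remote instance kill", "instance eviction")

# Alert-log timestamp lines look like: Thu Jul 07 02:31:43 2011
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")

def eviction_timeline(lines):
    """Return (timestamp, entry) pairs for eviction-related alert-log entries.

    Each entry is tagged with the most recent timestamp line seen before it.
    """
    last_ts = None
    events = []
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            last_ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
        elif any(k in line for k in KEYWORDS):
            events.append((last_ts, line))
    return events
```

Run over the Mars alert log, this should list the IPC timeouts starting at 02:31:43 and the "Evicting instance 1 from cluster" entry at 02:33:18 on one compact timeline.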

Db alert log on Venus:

Wed Jul 06 22:09:06 2011
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_1_seq_10394.415.755808529
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_2_seq_5418.421.755808553
Archived Log entry 3044 added for thread 1 sequence 10408 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:42:16 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Private Interface 'aggr2:1' configured from GPnP for use as a private interconnect.
  [name='aggr2:1', type=1, ip=169.254.16.251, mac=00-1b-21-49-f4-34, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'aggr1' configured from GPnP for use as a public interface.
  [name='aggr1', type=1, ip=10.122.10.150, mac=00-1b-21-49-f4-30, net=10.122.10.0/23, mask=255.255.254.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters option.
Using parameter settings in server-side pfile /md1/app/oracle/product/11.2.0/db1/dbs/initlnred1.ora
System parameters with non-default values:

  processes                = 300
  sessions                 = 480
  timed_statistics         = TRUE
  spfile                   = "+DATA/lnred/spfilelnred.ora"
  nls_language             = "POLISH"
  nls_territory            = "POLAND"
  nls_length_semantics     = "CHAR"
  resource_manager_plan    = ""
  sga_target               = 8G
  control_files            = "+DATA/lnred/controlfile/current.256.741561271"
  db_block_size            = 8192
  compatible               = "11.2.0.0.0"
  cluster_database         = TRUE
  db_create_file_dest      = "+DATA"

  db_recovery_file_dest    = "+FRA"
  db_recovery_file_dest_size= 500G
  thread                   = 1
  undo_tablespace          = "UNDOTBS1"
  instance_number          = 1
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain                = ""
  dispatchers              = "(PROTOCOL=TCP) (SERVICE=lnredXDB)"
  remote_listener          = "lnscan.adln.local:1521"
  audit_file_dest          = "/md1/app/oracle/admin/lnred/adump"
  optimizer_features_enable= "11.1.0.7"
  audit_trail              = "DB"
  db_name                  = "lnred"
  open_cursors             = 300
  pga_aggregate_target     = 8G
  aq_tm_processes          = 1
  diagnostic_dest          = "/md1/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.16.251
cluster interconnect IPC version: Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Thu Jul 07 02:42:25 2011
PMON started with pid=2, OS id=1914

OS messages on Venus:

Jul  7 02:34:38 Venus unix: [ID 836849 kern.notice]
Jul  7 02:34:38 Venus ^Mpanic[cpu6]/thread=ffffffff9bec7e60:
Jul  7 02:34:38 Venus genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul  7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul  7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017ceeb0 genunix:kadmin+517 ()
Jul  7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef00 genunix:uadmin+c7 ()
Jul  7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef10 unix:brand_sys_syscall+21d ()
Jul  7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul  7 02:34:38 Venus genunix: [ID 672855 kern.notice] syncing file systems...
Jul  7 02:34:39 Venus genunix: [ID 433738 kern.notice]  [6]
Jul  7 02:34:39 Venus genunix: [ID 733762 kern.notice]  220
Jul  7 02:34:40 Venus genunix: [ID 433738 kern.notice]  [6]
Jul  7 02:34:40 Venus genunix: [ID 733762 kern.notice]  205
Jul  7 02:34:42 Venus genunix: [ID 433738 kern.notice]  [6]

So... there isn't much information there... but in the grid CRSD log on Venus (/md1/app/11.2.0/grid/log/venus/crsd/crsd.log) I found:

2011-07-07 02:02:10.052: [ CRSPE][47] {1:3804:25345} Processing PE command id=25249. Description: [Stat Resource : 3703910]
2011-07-07 02:02:10.054: [UiServer][48] {1:3804:25345} Done for ctx=32ca350

and then there is some garbage like this (maybe another symptom of the communication problem?):

[NULL][NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL]
[ clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=venusDBG_CRSD))
2011-07-07 02:41:48.808: [ clsdmt][3]PID for the Process [1619], connkey 1

But in the grid CRSD log on Mars (/md1/app/11.2.0/grid/log/mars/crsd/crsd.log):

2011-07-07 02:19:42.393: [    AGFW][42] {0:8:6} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:19:42.393: [    AGFW][42] {0:8:6} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-07-07 02:20:48.482: [ CRSCOMM][39] Ipc: Client disconnected.
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] IpcL: Listener got clsc error 11 for memNum. 8
2011-07-07 02:20:48.482: [ CRSCOMM][39] IpcL: connection to member 8 has been removed
2011-07-07 02:20:48.483: [CLSFRAME][39] Removing IPC Member:{Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [CLSFRAME][39] Disconnected from AGENT process: {Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [    AGFW][42] {2:44971:2690} Agfw Proxy Server received process disconnected notification, count=1
2011-07-07 02:20:48.483: [    AGFW][42] {2:44971:2690} /md1/app/11.2.0/grid/bin/oraagent_oracle disconnected.
2011-07-07 02:20:48.483: [    AGFW][42] {2:44971:2690} Agent /md1/app/11.2.0/grid/bin/oraagent_oracle[18546] stopped!
2011-07-07 02:20:48.483: [ CRSCOMM][42] {2:44971:2690} IpcL: removeConnection: Member 8 does not exist.
2011-07-07 02:20:48.483: [    AGFW][42] {2:44971:2690} Created alert : (:CRSAGF00129:) :  The agent: /md1/app/11.2.0/grid/bin/oraagent has been disabled because it exceeded the maximum failures (5) over a period of 600 seconds.

At this point the Venus OS was rebooted (I think via IPMI, initiated from Mars):

2011-07-07 02:34:39.963: [ CRSSE][32] Forwarding Node Leave to PE for: venus
2011-07-07 02:34:39.965: [ CRSCCL][34]Reconfig event received by clssgsgrpstat
2011-07-07 02:34:39.965: [ CRSCCL][34]cclGetMemberData called
2011-07-07 02:34:39.979: [ CRSCCL][34]Member (2, 307107, mars:11.2.0.2.0) _at_ found.
2011-07-07 02:34:39.999: [CLSFRAME][34] CCL MEMBER LEFT:1:1:CRSD:venus
2011-07-07 02:34:39.999: [CLSFRAME][34] Disconnected from CRSD:venus process:
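The evidence here is spread across three logs with three different timestamp formats: the db alert log ("Thu Jul 07 02:34:40 2011"), the Solaris syslog ("Jul  7 02:34:38"), and crsd.log ("2011-07-07 02:34:39.963"). Merging them into one timeline makes it easier to see that the panic on Venus (02:34:38) lines up with the member kill and node-leave handling on Mars. A rough normalizer sketch (mine, not from the original post; the syslog year is an assumption, since syslog lines omit it):

```python
from datetime import datetime

def parse_ts(line, syslog_year=2011):
    """Try the three timestamp formats seen in the logs above.

    Returns a datetime, or None for lines that carry no leading timestamp
    (e.g. alert-log entry bodies). The syslog year is assumed.
    """
    line = line.strip()
    # crsd.log: 2011-07-07 02:34:39.963: [ CRSSE][32] ...
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        pass
    # db alert log timestamp line: Thu Jul 07 02:34:40 2011
    try:
        return datetime.strptime(line[:24], "%a %b %d %H:%M:%S %Y")
    except ValueError:
        pass
    # Solaris syslog: Jul  7 02:34:38 Venus ...
    try:
        return datetime.strptime(f"{syslog_year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def merged_timeline(*tagged_logs):
    """tagged_logs: (source_name, lines) pairs -> timeline sorted by time."""
    events = []
    for source, lines in tagged_logs:
        for line in lines:
            ts = parse_ts(line)
            if ts:
                events.append((ts, source, line.strip()))
    return sorted(events)
```

Feeding it the three excerpts should interleave the Venus panic messages between the Mars "Remote instance kill" and the CRSD "Node Leave" entries, which is the ordering one would expect if Mars killed Venus.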

And... what should I do now?

Do you have any ideas?

Regards,



Robert Grzesiak

Senior Developer, Oracle DBA

LexisNexis Polska Sp. z o.o.
tel office: + 48 12 426 00 03
email: Robert.Grzesiak_at_LexisNexis.pl
www.lexisnexis.pl




--
http://www.freelists.org/webpage/oracle-l
Received on Sat Jul 09 2011 - 21:00:25 CDT