Problem with RAC on Solaris
Date: Sun, 10 Jul 2011 04:00:25 +0200
Message-ID: <908C0DD49CA1CD4795E2F2C3C26CB6555F89A17053_at_exchangesvr>
Welcome,
This is my first time posting to the oracle-l freelists... and sorry for my poor English :)
I have an Oracle 11.2.0.2 RAC system running on Solaris 10, and I have some problems with its stability. Every few days one node restarts the other, then everything works for the next few days and it repeats... I have 2 nodes, named Mars and Venus. For example, today the Mars node restarted the other one (Venus), but I don't know why... The db alert log only shows that there was some problem with communication - a timeout - and that's all...
It happened at about 02:30...
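For reference, a minimal sketch of a query I can run to confirm which address each instance has registered for the private interconnect (with GPnP/HAIP in use, both nodes should show the 169.254.x addresses):

-- Minimal sketch: which interconnect address is each instance actually using?
-- With HAIP in use this should list the 169.254.x addresses for both instances.
SELECT inst_id, name, ip_address, is_public, source
  FROM gv$cluster_interconnects
 ORDER BY inst_id;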
Db alert log on Mars:
Thread 2 advanced to log sequence 5461 (LGWR switch)
Current log# 10 seq# 5461 mem# 0: +DATA/lnred/onlinelog/group_10.308.743814917
Thu Jul 07 01:02:23 2011
Archived Log entry 3047 added for thread 2 sequence 5460 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:31:28 2011
minact-scn master exiting with err:12751
Thu Jul 07 02:31:43 2011
IPC Send timeout detected. Sender: ospid 1940 [oracle_at_Mars (MMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 26
IPC Send timeout: Terminating pid 26 osid 1940
Thu Jul 07 02:31:46 2011
IPC Send timeout detected. Sender: ospid 18305 [oracle_at_Mars (M000)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 75
IPC Send timeout: Terminating pid 75 osid 18305
Thu Jul 07 02:32:07 2011
IPC Send timeout detected. Sender: ospid 1908 [oracle_at_Mars (LMD0)]
Receiver: inst 1 binc 36502 ospid 1949
Thu Jul 07 02:32:10 2011
Restarting dead background process MMON
Thu Jul 07 02:32:11 2011
MMON started with pid=26, OS id=19687
Thu Jul 07 02:32:15 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle_at_Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1951
IPC Send timeout to 1.1 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:32:19 2011
Communications reconfiguration: instance_number 1
Thu Jul 07 02:32:24 2011
IPC Send timeout to 1.0 inc 32 for msg type 65521 from opid 12
Thu Jul 07 02:32:29 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle_at_Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1955
IPC Send timeout to 1.2 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:33:18 2011
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Thu Jul 07 02:33:38 2011
Remote instance kill is issued with system inc 34
Remote instance kill map (size 1) : 1
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1
Thu Jul 07 02:33:48 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:18 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:40 2011
Reconfiguration started (old inc 32, new inc 36)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
- dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:34:40 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:34:40 2011
LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Thu Jul 07 02:34:40 2011
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
 read 3662 KB redo, 135 data blocks need recovery
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Started redo application at
 Thread 1: logseq 10409, block 19647
Recovery of Online Redo Log: Thread 1 Group 5 Seq 10409 Reading mem 0
  Mem# 0: +DATA/lnred/onlinelog/group_5.303.743814897
Completed redo application of 0.02MB
Completed instance recovery at
 Thread 1: logseq 10409, block 26971, scn 2612073761
 62 data blocks read, 121 data blocks written, 3662 redo k-bytes read
Thread 1 advanced to log sequence 10410 (thread recovery)
Redo thread 1 internally disabled at seq 10410 (SMON)
Thu Jul 07 02:34:41 2011
Archived Log entry 3048 added for thread 1 sequence 10409 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:42 2011
ALTER SYSTEM SET service_names='lnwyd_service','lnred_service','backup' SCOPE=MEMORY SID='lnred2';
Thu Jul 07 02:34:42 2011
ARC3: Archiving disabled thread 1 sequence 10410
Archived Log entry 3049 added for thread 1 sequence 10410 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:53 2011
minact-scn: Inst 2 is now the master inc#:36 mmon proc-id:19687 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.9bb067c4 gcalc-scn:0x0000.9bb06ebf
Thu Jul 07 02:35:51 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 no longer running (check cable)
Thu Jul 07 02:36:52 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 is now running
Thu Jul 07 02:37:28 2011
IPC Send timeout detected. Sender: ospid 1886 [oracle_at_Mars (PMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 2
Thu Jul 07 02:41:17 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 02:42:28 2011
Reconfiguration started (old inc 36, new inc 38)
List of instances:
 1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Thu Jul 07 02:42:28 2011
- domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:42:28 2011
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:42:28 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Thu Jul 07 02:42:29 2011
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:38 errcnt:0
Reconfiguration complete
Thu Jul 07 02:56:18 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 04:09:37 2011
Thread 2 advanced to log sequence 5462 (LGWR switch)
Current log# 8 seq# 5462 mem# 0: +DATA/lnred/onlinelog/group_8.306.743814913
Thu Jul 07 04:09:39 2011
Db alert log on Venus:
Wed Jul 06 22:09:06 2011
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_1_seq_10394.415.755808529
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_2_seq_5418.421.755808553
Archived Log entry 3044 added for thread 1 sequence 10408 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:42:16 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Private Interface 'aggr2:1' configured from GPnP for use as a private interconnect.
[name='aggr2:1', type=1, ip=169.254.16.251, mac=00-1b-21-49-f4-34, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'aggr1' configured from GPnP for use as a public interface.
[name='aggr1', type=1, ip=10.122.10.150, mac=00-1b-21-49-f4-30, net=10.122.10.0/23, mask=255.255.254.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters option.
Using parameter settings in server-side pfile /md1/app/oracle/product/11.2.0/db1/dbs/initlnred1.ora
System parameters with non-default values:
processes = 300
sessions = 480
timed_statistics = TRUE
spfile = "+DATA/lnred/spfilelnred.ora"
nls_language = "POLISH"
nls_territory = "POLAND"
nls_length_semantics = "CHAR"
resource_manager_plan = ""
sga_target = 8G
control_files = "+DATA/lnred/controlfile/current.256.741561271"
db_block_size = 8192
compatible = "11.2.0.0.0"
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = "+FRA"
db_recovery_file_dest_size= 500G
thread = 1
undo_tablespace = "UNDOTBS1"
instance_number = 1
remote_login_passwordfile = "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=lnredXDB)"
remote_listener = "lnscan.adln.local:1521"
audit_file_dest = "/md1/app/oracle/admin/lnred/adump"
optimizer_features_enable = "11.1.0.7"
audit_trail = "DB"
db_name = "lnred"
open_cursors = 300
pga_aggregate_target = 8G
aq_tm_processes = 1
diagnostic_dest = "/md1/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
 169.254.16.251
cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2
Thu Jul 07 02:42:25 2011
PMON started with pid=2, OS id=1914
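For completeness, a minimal sketch of a query to check whether cluster_interconnects is set explicitly on either instance or left at the GPnP/HAIP default shown in the startup log above:

-- Minimal sketch: is cluster_interconnects set explicitly on either instance?
-- An empty value means the instance falls back to the GPnP/HAIP 169.254.x addresses.
SELECT inst_id, name, value
  FROM gv$parameter
 WHERE name = 'cluster_interconnects';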
OS messages on Venus:
Jul 7 02:34:38 Venus unix: [ID 836849 kern.notice]
Jul 7 02:34:38 Venus ^Mpanic[cpu6]/thread=ffffffff9bec7e60:
Jul 7 02:34:38 Venus genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017ceeb0 genunix:kadmin+517 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef00 genunix:uadmin+c7 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef10 unix:brand_sys_syscall+21d ()
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 672855 kern.notice] syncing file systems...
Jul 7 02:34:39 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:39 Venus genunix: [ID 733762 kern.notice] 220
Jul 7 02:34:40 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:40 Venus genunix: [ID 733762 kern.notice] 205
Jul 7 02:34:42 Venus genunix: [ID 433738 kern.notice] [6]
So... it isn't much information... but in the Venus grid log (/md1/app/11.2.0/grid/log/venus/crsd/crsd.log) I found:
2011-07-07 02:02:10.052: [ CRSPE][47] {1:3804:25345} Processing PE command id=25249. Description: [Stat Resource : 3703910]
2011-07-07 02:02:10.054: [UiServer][48] {1:3804:25345} Done for ctx=32ca350
Then there is some garbage like this (maybe it is the same communication problem):
[NULL][NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL]
[ clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=venusDBG_CRSD))
2011-07-07 02:41:48.808: [ clsdmt][3]PID for the Process [1619], connkey 1
But in the Mars grid log (/md1/app/11.2.0/grid/log/mars/crsd/crsd.log):
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-07-07 02:20:48.482: [ CRSCOMM][39] Ipc: Client disconnected.
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] IpcL: Listener got clsc error 11 for memNum. 8
2011-07-07 02:20:48.482: [ CRSCOMM][39] IpcL: connection to member 8 has been removed
2011-07-07 02:20:48.483: [CLSFRAME][39] Removing IPC Member:{Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [CLSFRAME][39] Disconnected from AGENT process: {Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agfw Proxy Server received process disconnected notification, count=1
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} /md1/app/11.2.0/grid/bin/oraagent_oracle disconnected.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agent /md1/app/11.2.0/grid/bin/oraagent_oracle[18546] stopped!
2011-07-07 02:20:48.483: [ CRSCOMM][42] {2:44971:2690} IpcL: removeConnection: Member 8 does not exist.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Created alert : (:CRSAGF00129:) : The agent: /md1/app/11.2.0/grid/bin/oraagent has been disabled because it exceeded the maximum failures (5) over a period of 600 seconds.
Around this time the Venus OS was rebooted (I think via IPMI from Mars):
2011-07-07 02:34:39.963: [ CRSSE][32] Forwarding Node Leave to PE for: venus
2011-07-07 02:34:39.965: [ CRSCCL][34]Reconfig event received by clssgsgrpstat
2011-07-07 02:34:39.965: [ CRSCCL][34]cclGetMemberData called
2011-07-07 02:34:39.979: [ CRSCCL][34]Member (2, 307107, mars:11.2.0.2.0) _at_ found.
2011-07-07 02:34:39.999: [CLSFRAME][34] CCL MEMBER LEFT:1:1:CRSD:venus
2011-07-07 02:34:39.999: [CLSFRAME][34] Disconnected from CRSD:venus process:
And... what should I do...?
Do you have any ideas?
Regards,
Robert Grzesiak
Senior Developer, Oracle DBA
LexisNexis Polska Sp. z o.o.
tel office: + 48 12 426 00 03
email: Robert.Grzesiak_at_LexisNexis.pl<mailto:Robert.Grzesiak_at_LexisNexis.pl>
www.lexisnexis.pl<http://www.lexisnexis.pl/>
--
http://www.freelists.org/webpage/oracle-l
Received on Sat Jul 09 2011 - 21:00:25 CDT