DB Instance Failure Issue
Date: Wed, 14 Sep 2022 01:07:23 +0530
Message-ID: <CAKna9VZfEecq9SRkDk6KyaDcva=jcuqbaktDbSbtf3ppx4Zfqw@mail.gmail.com>
Hello Listers,
The Oracle version is 19.15.0.0.0, on a 4-node RAC running on Exadata X8M (image version 19.3.7). It was upgraded from 11.2 a few days back. All the nodes crashed suddenly, and the alert logs point mainly to node 2 and one specific cell server. Below are snippets of the alert logs from the database and the cell server.
We suspect we have hit a bug and are checking with Oracle Support, but we would also like to hear from the experts here: has anybody encountered anything similar?
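For anyone comparing symptoms, here is a minimal triage sketch in plain SQL (an assumption on my part, not something from the logs: it presumes the standard GV$/V$ views and is run from a database instance with the disk group mounted; the disk group name and the ~140-second LGWR hang come from the logs below):

-- Confirm the current state of the RECO_XGMB disks that were offlined
-- (MODE_STATUS should return to ONLINE once the replacement completes).
SELECT g.name AS group_name, d.name AS disk_name, d.mode_status, d.state
FROM   v$asm_diskgroup g
JOIN   v$asm_disk d ON d.group_number = g.group_number
WHERE  g.name = 'RECO_XGMB';

-- Check whether any LGWR is still waiting on 'log file parallel write'
-- across the RAC instances (the ORA-29770 below fired at ~140 seconds).
SELECT inst_id, sid, event, seconds_in_wait
FROM   gv$session
WHERE  program LIKE '%LGWR%';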
*Database log:-*
2022-09-12T15:01:53.278393-04:00
requested by (instance=2, osid=97030 (DIA0)), summary=[process state dump request (kjznHA)]
requested by (instance=2, osid=97030 (DIA0)), summary=[process state dump request (kjznHA)]
2022-09-12T15:02:54.901871-04:00
NOTE: process _lgwr_DB12 (97118) initiating optional offline of disk
22.4042277007 (RECO_XGMB_CD_06_XXGMCL13) with mask 0x7e in group 2
(RECO_XGMB) with client assisting
2022-09-12T15:02:54.946461-04:00
NOTE: updating disk modes to 0x5 from 0x7 for disk 12
(RECO_XGMB_CD_05_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
NOTE: updating disk modes to 0x5 from 0x7 for disk 13
(RECO_XGMB_CD_02_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
2022-09-12T15:02:54.961121-04:00
Errors in file
/u01/app/oracle/diag/rdbms/DB1_xXGMb/DB12/trace/DB12_lgwr_97118.trc:
ORA-15080: synchronous I/O operation failed to write block 32665540 of disk 22 in disk group RECO_XGMB
2022-09-12T15:02:54.961169-04:00
WARNING: failed to write mirror side 2 of virtual extent 3987 logical extent 1 of file 625 in group 2 on disk 22 allocation unit 1294
NOTE: process _lgwr_DB12 (97118) initiating offline of disk 22.4042277007
(RECO_XGMB_CD_06_XGMCL13) with mask 0x7e in group 2 (RECO_XGMB) with client
assisting
2022-09-12T15:03:09.809095-04:00
NOTE: ASMB process initiating disk discovery for grp 2
(reqid:5125151858335298229)
2022-09-12T15:03:11.517788-04:00
NOTE: Found o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13 for disk RECO_XGMB_CD_05_XGMCL13
SUCCESS: disk RECO_XGMB_CD_05_XGMCL13 (12.4042276997) replaced in diskgroup RECO_XGMB path: o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13
SUCCESS: completed reconfiguration of group 2 (RECO_XGMB)
2022-09-12T15:05:13.914953-04:00
LGWR (ospid: 97118) waits for event 'log file parallel write' for 138 secs.
2022-09-12T15:05:24.010945-04:00
LGWR (ospid: 97118) waits for event 'log file parallel write' for 148 secs.
Errors in file
/u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/trace/DB12_lmhb_97063.trc
(incident=319107):
ORA-29770: global enqueue process LGWR (OSID 97118) is hung for more than 140 seconds
Incident details in:
/u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/incident/incdir_319107/DB12_lmhb_97063_i319107.trc
2022-09-12T15:05:24.744688-04:00
NOTE: ASMB process initiating disk discovery for grp 2
(reqid:5125151858335314113)
2022-09-12T15:05:26.701077-04:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_DB1_XGMB by memno 1
2022-09-12T15:05:27.038295-04:00
Process termination requested for pid 97118 [source = rdbms], [info = 0] [request issued by pid: 97063, uid: 260]
2022-09-12T15:05:27.038338-04:00
Process termination requested for pid 97041_97049 [source = rdbms], [info = 0] [request issued by pid: 97063, uid: 260]
2022-09-12T15:05:27.038441-04:00
Process termination requested for pid 97039_97050 [source = rdbms], [info = 0] [request issued by pid: 97063, uid: 260]
2022-09-12T15:05:27.038471-04:00
2022-09-12T15:05:27.054641-04:00
Dumping diagnostic data in directory=[cdmp_20220912150527], requested by
(instance=2, osid=97063 (LMHB)), summary=[abnormal instance termination].
LMHB (ospid: 97063): terminating the instance due to ORA error 29770
2022-09-12T15:05:27.064406-04:00
Cause - 'ERROR: Some process(s) is not making progress.
LMHB (ospid: 97063) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s'
2022-09-12T15:05:29.118991-04:00
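(To pull the LMHB incident and trace files referenced above from each node, a quick sketch using the standard V$DIAG_INFO view to locate the ADR directories; the names used are documented V$DIAG_INFO keys:)

-- Locate the ADR trace and incident directories the alert log points at.
SELECT inst_id, name, value
FROM   gv$diag_info
WHERE  name IN ('ADR Home', 'Diag Trace', 'Diag Incident');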
*Cell server log:-*
sendMessage completed with error 2, clientHost:
XGMdb06.XX.XXXXX.com[pid:148143]
destinationID: 1243439106 openCnt 1 requestCnt 4
Implicit fencing of this port will be performed
slos passed to ossmisc_dump_slos_alert routine is not in failure status
2022-09-11T22:10:01.317939-04:00
I/O Resource Manager: Setting intra-database plan for database DB1_XGMB: DEFAULT_MAINTENANCE_PLAN
2022-09-12T15:03:16.978494-04:00
OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid: 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL timed out due to pending I/O's)
2022-09-12T15:04:21.539419-04:00
OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid: 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL timed out due to pending I/O's)
WARNING: Cellsrv UserThread: LWPID: 59568 userId: 108 kernelId: 72 pthreadID: 13766447306496 is stuck for more than 10025222 micro sec.
Please take a look at svtrc_57983_108.trc and System State dump for more info
Errors in file
/opt/oracle/cell/log/diag/asm/cell/x073cl13/trace/svtrc_57983_16.trc
(incident=17):
ORA-00600: internal error code, arguments: [Scheduler::One_Or_More_Cellsrv_Threads_Stuck], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/XGMcl13/incident/incdir_17/svtrc_57983_16_i17.trc
2022-09-12T15:48:06.190103-04:00
[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid: 57980, srvc_pid: 57983) returned with error: 128
CELLSRV error - ORA-600 internal error
*Logwriter trace:-*
2022-09-12 15:02:54.947 :C343DCE7:KFNC:kfnc.c@12127:kfncSlaveMsgFree():1373:13291: [kfncDiskOffline] 0x247eeee5d8 alloc=0x1 msgtyp=0x1
2022-09-12 15:02:54.947 :C343DCE8:KFNC:kfnc.c@4048:kfncDiskOffline():1373:13291: 12 disks were taken offline for gnum [2.1193292329]
2022-09-12 15:02:54.961 :C343DE61:KFD:kfd.c@12756:kfdOfflinePriv():1373:13291: kfdOffline: mask=0x7e check=0xc flg=0x2 caller=1 dskCnt=1 client=1 clientAssisting=1 at kfio.c:8245
2022-09-12 15:02:54.961 :C343DE62:KFNC:kfnc.c@3929:kfncDiskOffline():1373:13291: gnum [2.1193292329], count 1 mask 126 ocheck 12
2022-09-12 15:02:54.961 :C343DE63:KFNC:kfn.c@19336:kfnSetupASMBIdx():1373:13291: Known group number 2, owner asmb index 0; inrenew = 0
2022-09-12 15:02:54.961 :C343DE64:KFNC:kfnc.c@7525:kfncGetCurrentPoolInt():1373:13291: asmb index = 0
2022-09-12 15:02:54.961 :C343DE65:KFNC:kfnc.c@12024:kfncSlaveMsgAllocInt():1373:13291: [kfncDiskOffline] usepga=0 workmsg=1 noslave=0 uga=0x7fabf6d2bc30 at kfnc.c:3952, asmb#=0
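(Side note for cross-referencing the trace with the alert log: the mask the KFNC trace prints in decimal, 126, is the hex mask from the LGWR offline message, 0x7e = 7*16 + 14 = 126, and ocheck 12 likewise matches check=0xc, so both sources describe the same offline request.)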
Regards
Lok
--
http://www.freelists.org/webpage/oracle-l