DB Instance Failure Issue
Date: Wed, 14 Sep 2022 01:07:23 +0530
Message-ID: <CAKna9VZfEecq9SRkDk6KyaDcva=jcuqbaktDbSbtf3ppx4Zfqw@mail.gmail.com>
Hello Listers,
The Oracle version is 19.15.0.0.0, on a 4-node RAC running on Exadata X8M (image version 19.3.7). It was upgraded from 11.2 a few days back. All the nodes crashed suddenly, and the alert logs point mainly to node 2 and one specific cell server. Below are snippets of the alert logs from the database and the cell server.
We suspect we have hit a bug and are checking with Oracle Support, but we would also like to hear from the experts here: has anybody encountered anything similar?
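For anyone comparing symptoms, here is a minimal triage sketch in plain SQL (an assumption on my part, not something from the logs: it presumes the standard GV$/V$ views and is run from a database instance with the disk group mounted; the disk group name and the ~140-second LGWR hang come from the logs below):

-- Confirm the current state of the RECO_XGMB disks that were offlined
-- (MODE_STATUS should return to ONLINE once the replacement completes).
SELECT g.name AS group_name, d.name AS disk_name, d.mode_status, d.state
FROM   v$asm_diskgroup g
JOIN   v$asm_disk d ON d.group_number = g.group_number
WHERE  g.name = 'RECO_XGMB';

-- Check whether any LGWR is still waiting on 'log file parallel write'
-- across the RAC instances (the ORA-29770 below fired at ~140 seconds).
SELECT inst_id, sid, event, seconds_in_wait
FROM   gv$session
WHERE  program LIKE '%LGWR%';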
*Database log:-*
2022-09-12T15:01:53.278393-04:00
requested by (instance=2, osid=97030 (DIA0)), summary=[process state dump request (kjznHA)]
requested by (instance=2, osid=97030 (DIA0)), summary=[process state dump request (kjznHA)]
2022-09-12T15:02:54.901871-04:00
NOTE: process _lgwr_DB12 (97118) initiating optional offline of disk
22.4042277007 (RECO_XGMB_CD_06_XXGMCL13) with mask 0x7e in group 2
(RECO_XGMB) with client assisting
2022-09-12T15:02:54.946461-04:00
NOTE: updating disk modes to 0x5 from 0x7 for disk 12
(RECO_XGMB_CD_05_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
NOTE: updating disk modes to 0x5 from 0x7 for disk 13
(RECO_XGMB_CD_02_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
2022-09-12T15:02:54.961121-04:00
Errors in file
/u01/app/oracle/diag/rdbms/DB1_xXGMb/DB12/trace/DB12_lgwr_97118.trc:
ORA-15080: synchronous I/O operation failed to write block 32665540 of disk 22 in disk group RECO_XGMB
2022-09-12T15:02:54.961169-04:00
WARNING: failed to write mirror side 2 of virtual extent 3987 logical extent 1 of file 625 in group 2 on disk 22 allocation unit 1294
NOTE: process _lgwr_DB12 (97118) initiating offline of disk 22.4042277007
(RECO_XGMB_CD_06_XGMCL13) with mask 0x7e in group 2 (RECO_XGMB) with client
assisting
2022-09-12T15:03:09.809095-04:00
NOTE: ASMB process initiating disk discovery for grp 2
(reqid:5125151858335298229)
2022-09-12T15:03:11.517788-04:00
NOTE: Found o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13 for disk RECO_XGMB_CD_05_XGMCL13
SUCCESS: disk RECO_XGMB_CD_05_XGMCL13 (12.4042276997) replaced in diskgroup RECO_XGMB path: o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13
SUCCESS: completed reconfiguration of group 2 (RECO_XGMB)
2022-09-12T15:05:13.914953-04:00
LGWR (ospid: 97118) waits for event 'log file parallel write' for 138 secs.
2022-09-12T15:05:24.010945-04:00
LGWR (ospid: 97118) waits for event 'log file parallel write' for 148 secs.
Errors in file
/u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/trace/DB12_lmhb_97063.trc
(incident=319107):
ORA-29770: global enqueue process LGWR (OSID 97118) is hung for more than 140 seconds
Incident details in:
/u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/incident/incdir_319107/DB12_lmhb_97063_i319107.trc
2022-09-12T15:05:24.744688-04:00
NOTE: ASMB process initiating disk discovery for grp 2
(reqid:5125151858335314113)
2022-09-12T15:05:26.701077-04:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_DB1_XGMB by memno 1
2022-09-12T15:05:27.038295-04:00
Process termination requested for pid 97118 [source = rdbms], [info = 0] [request issued by pid: 97063, uid: 260]
2022-09-12T15:05:27.038338-04:00
Process termination requested for pid 97041_97049 [source = rdbms], [info = 0] [request issued by pid: 97063, uid: 260]
2022-09-12T15:05:27.038441-04:00
Process termination requested for pid 97039_97050 [source = rdbms], [info = 0] [request issued by pid: 97063, uid: 260]
2022-09-12T15:05:27.038471-04:00
2022-09-12T15:05:27.054641-04:00
Dumping diagnostic data in directory=[cdmp_20220912150527], requested by
(instance=2, osid=97063 (LMHB)), summary=[abnormal instance termination].
LMHB (ospid: 97063): terminating the instance due to ORA error 29770
2022-09-12T15:05:27.064406-04:00
Cause - 'ERROR: Some process(s) is not making progress.
LMHB (ospid: 97063) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s'
2022-09-12T15:05:29.118991-04:00
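(To pull the LMHB incident and trace files referenced above from each node, a quick sketch using the standard V$DIAG_INFO view to locate the ADR directories; the names used are documented V$DIAG_INFO keys:)

-- Locate the ADR trace and incident directories the alert log points at.
SELECT inst_id, name, value
FROM   gv$diag_info
WHERE  name IN ('ADR Home', 'Diag Trace', 'Diag Incident');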
*Cell server log:-*
sendMessage completed with error 2, clientHost:
XGMdb06.XX.XXXXX.com[pid:148143]
destinationID: 1243439106 openCnt 1 requestCnt 4
Implicit fencing of this port will be performed
slos passed to ossmisc_dump_slos_alert routine is not in failure status
2022-09-11T22:10:01.317939-04:00
I/O Resource Manager: Setting intra-database plan for database DB1_XGMB: DEFAULT_MAINTENANCE_PLAN
2022-09-12T15:03:16.978494-04:00
OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid: 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL timed out due to pending I/O's)
2022-09-12T15:04:21.539419-04:00
OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid: 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL timed out due to pending I/O's)
WARNING: Cellsrv UserThread: LWPID: 59568 userId: 108 kernelId: 72 pthreadID: 13766447306496 is stuck for more than 10025222 micro sec.
Please take a look at svtrc_57983_108.trc and System State dump for more info
Errors in file
/opt/oracle/cell/log/diag/asm/cell/x073cl13/trace/svtrc_57983_16.trc
(incident=17):
ORA-00600: internal error code, arguments: [Scheduler::One_Or_More_Cellsrv_Threads_Stuck], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/XGMcl13/incident/incdir_17/svtrc_57983_16_i17.trc
2022-09-12T15:48:06.190103-04:00
[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid: 57980, srvc_pid: 57983) returned with error: 128
CELLSRV error - ORA-600 internal error
*Logwriter trace:-*
2022-09-12 15:02:54.947 :C343DCE7:KFNC:kfnc.c@12127:kfncSlaveMsgFree():1373:13291: [kfncDiskOffline] 0x247eeee5d8 alloc=0x1 msgtyp=0x1
2022-09-12 15:02:54.947 :C343DCE8:KFNC:kfnc.c@4048:kfncDiskOffline():1373:13291: 12 disks were taken offline for gnum [2.1193292329]
2022-09-12 15:02:54.961 :C343DE61:KFD:kfd.c@12756:kfdOfflinePriv():1373:13291: kfdOffline: mask=0x7e check=0xc flg=0x2 caller=1 dskCnt=1 client=1 clientAssisting=1 at kfio.c:8245
2022-09-12 15:02:54.961 :C343DE62:KFNC:kfnc.c@3929:kfncDiskOffline():1373:13291: gnum [2.1193292329], count 1 mask 126 ocheck 12
2022-09-12 15:02:54.961 :C343DE63:KFNC:kfn.c@19336:kfnSetupASMBIdx():1373:13291: Known group number 2, owner asmb index 0; inrenew = 0
2022-09-12 15:02:54.961 :C343DE64:KFNC:kfnc.c@7525:kfncGetCurrentPoolInt():1373:13291: asmb index = 0
2022-09-12 15:02:54.961 :C343DE65:KFNC:kfnc.c@12024:kfncSlaveMsgAllocInt():1373:13291: [kfncDiskOffline] usepga=0 workmsg=1 noslave=0 uga=0x7fabf6d2bc30 at kfnc.c:3952, asmb#=0
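(Side note for cross-referencing the trace with the alert log: the mask the KFNC trace prints in decimal, 126, is the hex mask from the LGWR offline message, 0x7e = 7*16 + 14 = 126, and ocheck 12 likewise matches check=0xc, so both sources describe the same offline request.)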
Regards
Lok
--
http://www.freelists.org/webpage/oracle-l