Re: DB Instance failure Issue

From: Pap <oracle.developer35_at_gmail.com>
Date: Wed, 14 Sep 2022 10:24:33 +0530
Message-ID: <CAEjw_ficp+zYoxPLzhzoKOAkNrj=ftR26QSXvekQR1HnF93scg_at_mail.gmail.com>



Your log points to a slow log writer and a disk going offline. We have also encountered similar issues with log writer response times, and it may be a bug. You may want to search MOS.
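
Not from your logs, just a generic sketch of the kind of checks we usually run after such an incident (the diskgroup number 2 / RECO_XGMB is taken from your alert log; the rest are standard GV$ views, adjust as needed):

-- 1) Are any ASM disks in the affected diskgroup still offline or missing?
select inst_id, group_number, disk_number, name, path,
       mount_status, mode_status, state
from   gv$asm_disk
where  group_number = 2          -- RECO_XGMB per the posted alert log
order  by inst_id, disk_number;

-- 2) Latency distribution for LGWR's write event; a long tail here would
--    match the "waits for event 'log file parallel write' for 138 secs" lines.
select inst_id, event, wait_time_milli, wait_count
from   gv$event_histogram
where  event = 'log file parallel write'
order  by inst_id, wait_time_milli;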

On Wed, Sep 14, 2022 at 1:07 AM Lok P <loknath.73_at_gmail.com> wrote:

> Hello Listers,
>
> It's Oracle version 19.15.0.0.0, a 4-node RAC on Exadata X8M, image
> version 19.3.7, upgraded from 11.2 a few days back. All the nodes
> crashed suddenly, and the alert logs point mainly to Node 2 and a
> specific cell server. Below are snippets of the alert logs from the DB
> and the cell server.
>
> We suspect we may have hit a bug and are checking with support, but we
> would like to hear from the experts here: has anybody encountered
> anything similar?
>
> *Database log:- *
>
> 2022-09-12T15:01:53.278393-04:00
>
> requested by (instance=2, osid=97030 (DIA0)), summary=[process state dump
> request (kjznHA)]
>
> requested by (instance=2, osid=97030 (DIA0)), summary=[process state dump
> request (kjznHA)]
>
> 2022-09-12T15:02:54.901871-04:00
>
> NOTE: process _lgwr_DB12 (97118) initiating optional offline of disk
> 22.4042277007 (RECO_XGMB_CD_06_XXGMCL13) with mask 0x7e in group 2
> (RECO_XGMB) with client assisting
>
> 2022-09-12T15:02:54.946461-04:00
>
> NOTE: updating disk modes to 0x5 from 0x7 for disk 12
> (RECO_XGMB_CD_05_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
>
> NOTE: updating disk modes to 0x5 from 0x7 for disk 13
> (RECO_XGMB_CD_02_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
>
> ---------------
>
> 2022-09-12T15:02:54.961121-04:00
>
> Errors in file
> /u01/app/oracle/diag/rdbms/DB1_xXGMb/DB12/trace/DB12_lgwr_97118.trc:
>
> ORA-15080: synchronous I/O operation failed to write block 32665540 of
> disk 22 in disk group RECO_XGMB
>
> 2022-09-12T15:02:54.961169-04:00
>
> WARNING: failed to write mirror side 2 of virtual extent 3987 logical
> extent 1 of file 625 in group 2 on disk 22 allocation unit 1294
>
> NOTE: process _lgwr_DB12 (97118) initiating offline of disk 22.4042277007
> (RECO_XGMB_CD_06_XGMCL13) with mask 0x7e in group 2 (RECO_XGMB) with client
> assisting
>
> ---------------
>
> 2022-09-12T15:03:09.809095-04:00
>
> NOTE: ASMB process initiating disk discovery for grp 2
> (reqid:5125151858335298229)
>
> 2022-09-12T15:03:11.517788-04:00
>
> NOTE: Found o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13 for disk
> RECO_XGMB_CD_05_XGMCL13
>
> SUCCESS: disk RECO_XGMB_CD_05_XGMCL13 (12.4042276997) replaced in
> diskgroup RECO_XGMB path: o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13
>
> SUCCESS: completed reconfiguration of group 2 (RECO_XGMB)
>
> -------------
>
> 2022-09-12T15:05:13.914953-04:00
>
> LGWR (ospid: 97118) waits for event 'log file parallel write' for 138 secs.
>
> 2022-09-12T15:05:24.010945-04:00
>
> LGWR (ospid: 97118) waits for event 'log file parallel write' for 148 secs.
>
> Errors in file
> /u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/trace/DB12_lmhb_97063.trc (incident=319107):
>
> ORA-29770: global enqueue process LGWR (OSID 97118) is hung for more than
> 140 seconds
>
> Incident details in:
> /u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/incident/incdir_319107/DB12_lmhb_97063_i319107.trc
>
> 2022-09-12T15:05:24.744688-04:00
>
> NOTE: ASMB process initiating disk discovery for grp 2
> (reqid:5125151858335314113)
>
> 2022-09-12T15:05:26.701077-04:00
>
> LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_DB1_XGMB by memno 1
>
> 2022-09-12T15:05:27.038295-04:00
>
> Process termination requested for pid 97118 [source = rdbms], [info = 0]
> [request issued by pid: 97063, uid: 260]
>
> 2022-09-12T15:05:27.038338-04:00
>
> Process termination requested for pid 97041_97049 [source = rdbms], [info
> = 0] [request issued by pid: 97063, uid: 260]
>
> 2022-09-12T15:05:27.038441-04:00
>
> Process termination requested for pid 97039_97050 [source = rdbms], [info
> = 0] [request issued by pid: 97063, uid: 260]
>
> 2022-09-12T15:05:27.038471-04:00
>
> *****************
>
> 2022-09-12T15:05:27.054641-04:00
>
> Dumping diagnostic data in directory=[cdmp_20220912150527], requested by
> (instance=2, osid=97063 (LMHB)), summary=[abnormal instance termination].
>
> LMHB (ospid: 97063): terminating the instance due to ORA error 29770
>
> 2022-09-12T15:05:27.064406-04:00
>
> Cause - 'ERROR: Some process(s) is not making progress.
>
> LMHB (ospid: 97063) is terminating the instance.
>
> Please check LMHB trace file for more details.
>
> Please also check the CPU load, I/O load and other system properties for
> anomalous behavior
>
> ERROR: Some process(s'
>
> 2022-09-12T15:05:29.118991-04:00
>
>
> *Cell server log :- *
>
> sendMessage completed with error 2, clientHost: XGMdb06.XX.XXXXX.com[pid:148143]
> destinationID: 1243439106 openCnt 1 requestCnt 4
>
> Implicit fencing of this port will be performed
>
> slos passed to ossmisc_dump_slos_alert routine is not in failure status
>
> 2022-09-11T22:10:01.317939-04:00
>
> I/O Resource Manager: Setting intra-database plan for database DB1_XGMB:
> DEFAULT_MAINTENANCE_PLAN
>
> 2022-09-12T15:03:16.978494-04:00
>
> OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
> 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
> timed out due to pending I/O's)
>
> 2022-09-12T15:04:21.539419-04:00
>
> OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
> 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
> timed out due to
>
> -------------
>
> WARNING: Cellsrv UserThread: LWPID: 59568 userId: 108 kernelId: 72
> pthreadID: 13766447306496 is stuck for more than 10025222 micro sec.
>
> Please take a look at svtrc_57983_108.trc and System State dump for more
> info
>
> Errors in file
> /opt/oracle/cell/log/diag/asm/cell/x073cl13/trace/svtrc_57983_16.trc (incident=17):
>
> ORA-00600: internal error code, arguments:
> [Scheduler::One_Or_More_Cellsrv_Threads_Stuck], [1], [], [], [], [], [],
> [], [], [], [], []
>
> Incident details in: /opt/oracle/cell/log/diag/asm/cell/XGMcl13
> /incident/incdir_17/svtrc_57983_16_i17.trc
>
> 2022-09-12T15:48:06.190103-04:00
>
> [RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid:
> 57980, srvc_pid: 57983) returned with error: 128
>
> CELLSRV error - ORA-600 internal error
>
> *Log writer trace:- *
>
> 2022-09-12 15:02:54.947 :C343DCE7:KFNC:kfnc.c_at_12127:kfncSlaveMsgFree():1373:13291: [kfncDiskOffline]
> 0x247eeee5d8 alloc=0x1 msgtyp=0x1
>
> 2022-09-12 15:02:54.947 :C343DCE8:KFNC:kfnc.c_at_4048:kfncDiskOffline():1373:13291:
> 12 disks were taken offline for gnum [2.1193292329]
>
> 2022-09-12 15:02:54.961 :C343DE61:KFD:kfd.c_at_12756:kfdOfflinePriv():1373:13291:
> kfdOffline: mask=0x7e check=0xc flg=0x2 caller=1 dskCnt=1 client=1
> clientAssisting=1 at kfio.c:8245
>
> 2022-09-12 15:02:54.961 :C343DE62:KFNC:kfnc.c_at_3929:kfncDiskOffline():1373:13291:
> gnum [2.1193292329], count 1 mask 126 ocheck 12
>
> 2022-09-12 15:02:54.961 :C343DE63:KFNC:kfn.c_at_19336:kfnSetupASMBIdx():1373:13291:
> Known group number 2, owner asmb index 0; inrenew = 0
>
> 2022-09-12 15:02:54.961 :C343DE64:KFNC:kfnc.c_at_7525:kfncGetCurrentPoolInt():1373:13291:
> asmb index = 0
>
> 2022-09-12 15:02:54.961 :C343DE65:KFNC:kfnc.c_at_12024:kfncSlaveMsgAllocInt():1373:13291:
> [kfncDiskOffline] usepga=0 workmsg=1 noslave=0 uga=0x7fabf6d2bc30 at
> kfnc.c:3952, asmb#=0
>
> Regards
>
> Lok
>

--
http://www.freelists.org/webpage/oracle-l
