Re: DB Instance failure Issue
Date: Wed, 14 Sep 2022 11:55:52 +0530
Message-ID: <CAEjw_fj3mutwuswh8f1qcJBve1WujLE1-2EBZ+tR_XsMKUS4iQ_at_mail.gmail.com>
See this one..

Exadata: Database hang with high log file parallel write wait and the cells
report ORA-00600: [Scheduler::One_Or_More_Cellsrv_Threads_Stuck] after
upgrading to 19.2.x (Doc ID 2604805.1)

On Wed, 14 Sep 2022, 10:24 am Pap, <oracle.developer35_at_gmail.com> wrote:
> Your log shows a slow log writer and a disk offline issue. We have also
> encountered similar issues with log writer response times, and it may be
> a bug. You may try searching MOS.
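To match against that note it helps to have the exact RU and one-off patch list handy; something like this should show it (a sketch using standard 19c dictionary views, nothing site-specific):

-- full release string, e.g. 19.15.0.0.0
select banner_full from v$version;

-- RU / one-off patch history as recorded in the database
select patch_id, action, status, description, action_time
  from dba_registry_sqlpatch
 order by action_time;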
>
> On Wed, Sep 14, 2022 at 1:07 AM Lok P <loknath.73_at_gmail.com> wrote:
>
>> Hello Listers,
>>
>> It's Oracle version 19.15.0.0.0, a 4-node RAC on Exadata X8M, image
>> version 19.3.7, upgraded from 11.2 a few days back. All the nodes
>> crashed suddenly, and the alert logs point mainly to Node-2 and one
>> specific cell server. Below are snippets of the alert logs from the DB
>> and the cell server.
>>
>> We suspect we have hit a bug and are checking with support, but we
>> wanted to ask the experts here whether anyone has encountered anything
>> similar?
>>
>> *Database log:*
>>
>> 2022-09-12T15:01:53.278393-04:00
>>
>> requested by (instance=2, osid=97030 (DIA0)), summary=[process state
>> dump request (kjznHA)]
>>
>> requested by (instance=2, osid=97030 (DIA0)), summary=[process state
>> dump request (kjznHA)]
>>
>> 2022-09-12T15:02:54.901871-04:00
>>
>> NOTE: process _lgwr_DB12 (97118) initiating optional offline of disk
>> 22.4042277007 (RECO_XGMB_CD_06_XXGMCL13) with mask 0x7e in group 2
>> (RECO_XGMB) with client assisting
>>
>> 2022-09-12T15:02:54.946461-04:00
>>
>> NOTE: updating disk modes to 0x5 from 0x7 for disk 12
>> (RECO_XGMB_CD_05_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
>>
>> NOTE: updating disk modes to 0x5 from 0x7 for disk 13
>> (RECO_XGMB_CD_02_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
>>
>> ---------------
>>
>> 2022-09-12T15:02:54.961121-04:00
>>
>> Errors in file
>> /u01/app/oracle/diag/rdbms/DB1_xXGMb/DB12/trace/DB12_lgwr_97118.trc:
>>
>> ORA-15080: synchronous I/O operation failed to write block 32665540 of
>> disk 22 in disk group RECO_XGMB
>>
>> 2022-09-12T15:02:54.961169-04:00
>>
>> WARNING: failed to write mirror side 2 of virtual extent 3987 logical
>> extent 1 of file 625 in group 2 on disk 22 allocation unit 1294
>>
>> NOTE: process _lgwr_DB12 (97118) initiating offline of disk 22.4042277007
>> (RECO_XGMB_CD_06_XGMCL13) with mask 0x7e in group 2 (RECO_XGMB) with client
>> assisting
>>
>> ---------------
>>
>> 2022-09-12T15:03:09.809095-04:00
>>
>> NOTE: ASMB process initiating disk discovery for grp 2
>> (reqid:5125151858335298229)
>>
>> 2022-09-12T15:03:11.517788-04:00
>>
>> NOTE: Found o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13 for disk
>> RECO_XGMB_CD_05_XGMCL13
>>
>> SUCCESS: disk RECO_XGMB_CD_05_XGMCL13 (12.4042276997) replaced in
>> diskgroup RECO_XGMB path: o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13
>>
>> SUCCESS: completed reconfiguration of group 2 (RECO_XGMB)
>>
>> -------------
>>
>> 2022-09-12T15:05:13.914953-04:00
>>
>> LGWR (ospid: 97118) waits for event 'log file parallel write' for 138
>> secs.
>>
>> 2022-09-12T15:05:24.010945-04:00
>>
>> LGWR (ospid: 97118) waits for event 'log file parallel write' for 148
>> secs.
>>
>> Errors in file
>> /u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/trace/DB12_lmhb_97063.trc (incident=319107):
>>
>> ORA-29770: global enqueue process LGWR (OSID 97118) is hung for more than
>> 140 seconds
>>
>> Incident details in:
>> /u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/incident/incdir_319107/DB12_lmhb_97063_i319107.trc
>>
>> 2022-09-12T15:05:24.744688-04:00
>>
>> NOTE: ASMB process initiating disk discovery for grp 2
>> (reqid:5125151858335314113)
>>
>> 2022-09-12T15:05:26.701077-04:00
>>
>> LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_DB1_XGMB by memno 1
>>
>> 2022-09-12T15:05:27.038295-04:00
>>
>> Process termination requested for pid 97118 [source = rdbms], [info = 0]
>> [request issued by pid: 97063, uid: 260]
>>
>> 2022-09-12T15:05:27.038338-04:00
>>
>> Process termination requested for pid 97041_97049 [source = rdbms], [info
>> = 0] [request issued by pid: 97063, uid: 260]
>>
>> 2022-09-12T15:05:27.038441-04:00
>>
>> Process termination requested for pid 97039_97050 [source = rdbms], [info
>> = 0] [request issued by pid: 97063, uid: 260]
>>
>> 2022-09-12T15:05:27.038471-04:00
>>
>> *****************
>>
>> 2022-09-12T15:05:27.054641-04:00
>>
>> Dumping diagnostic data in directory=[cdmp_20220912150527], requested by
>> (instance=2, osid=97063 (LMHB)), summary=[abnormal instance termination].
>>
>> LMHB (ospid: 97063): terminating the instance due to ORA error 29770
>>
>> 2022-09-12T15:05:27.064406-04:00
>>
>> Cause - 'ERROR: Some process(s) is not making progress.
>>
>> LMHB (ospid: 97063) is terminating the instance.
>>
>> Please check LMHB trace file for more details.
>>
>> Please also check the CPU load, I/O load and other system properties for
>> anomalous behavior
>>
>> ERROR: Some process(s'
>>
>> 2022-09-12T15:05:29.118991-04:00
>>
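For the LGWR side of it, the latency distribution for that wait usually tells whether the whole write path stalled or just one outlier; a rough query, assuming the GV$ views are reachable from a surviving node:

-- latency buckets for log file parallel write on each instance;
-- a populated high-millisecond tail lines up with the 138s/148s waits above
select inst_id, event, wait_time_milli, wait_count
  from gv$event_histogram
 where event = 'log file parallel write'
 order by inst_id, wait_time_milli;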
>>
>> *Cell server log:*
>>
>> sendMessage completed with error 2, clientHost: XGMdb06.XX.XXXXX.com[pid:148143]
>> destinationID: 1243439106 openCnt 1 requestCnt 4
>>
>> Implicit fencing of this port will be performed
>>
>> slos passed to ossmisc_dump_slos_alert routine is not in failure status
>>
>> 2022-09-11T22:10:01.317939-04:00
>>
>> I/O Resource Manager: Setting intra-database plan for database DB1_XGMB:
>> DEFAULT_MAINTENANCE_PLAN
>>
>> 2022-09-12T15:03:16.978494-04:00
>>
>> OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
>> 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
>> timed out due to pending I/O's)
>>
>> 2022-09-12T15:04:21.539419-04:00
>>
>> OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
>> 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
>> timed out due to
>>
>> -------------
>>
>> WARNING: Cellsrv UserThread: LWPID: 59568 userId: 108 kernelId: 72
>> pthreadID: 13766447306496 is stuck for more than 10025222 micro sec.
>>
>> Please take a look at svtrc_57983_108.trc and System State dump for more
>> info
>>
>> Errors in file
>> /opt/oracle/cell/log/diag/asm/cell/x073cl13/trace/svtrc_57983_16.trc (incident=17):
>>
>> ORA-00600: internal error code, arguments:
>> [Scheduler::One_Or_More_Cellsrv_Threads_Stuck], [1], [], [], [], [], [],
>> [], [], [], [], []
>>
>> Incident details in: /opt/oracle/cell/log/diag/asm/cell/XGMcl13
>> /incident/incdir_17/svtrc_57983_16_i17.trc
>>
>> 2022-09-12T15:48:06.190103-04:00
>>
>> [RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid:
>> 57980, srvc_pid: 57983) returned with error: 128
>>
>> CELLSRV error - ORA-600 internal error
>>
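Whether the fenced grid disks ever came back can be cross-checked from the ASM side; a sketch against the disk group named in the log (standard V$ASM_DISK / V$ASM_DISKGROUP columns, best run on the ASM instance):

-- disks of RECO_XGMB; anything still OFFLINE or MISSING after the
-- reconfiguration would need a manual online or drop
select g.name  group_name,
       d.name  disk_name,
       d.path,
       d.mode_status,
       d.state,
       d.mount_status
  from v$asm_disk d
  join v$asm_diskgroup g on g.group_number = d.group_number
 where g.name = 'RECO_XGMB'
 order by d.disk_number;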
>> *Log writer trace:*
>>
>> 2022-09-12 15:02:54.947 :C343DCE7:KFNC:kfnc.c_at_12127:kfncSlaveMsgFree():1373:13291: [kfncDiskOffline]
>> 0x247eeee5d8 alloc=0x1 msgtyp=0x1
>>
>> 2022-09-12 15:02:54.947 :C343DCE8:KFNC:kfnc.c_at_4048:kfncDiskOffline():1373:13291:
>> 12 disks were taken offline for gnum [2.1193292329]
>>
>> 2022-09-12 15:02:54.961 :C343DE61:KFD:kfd.c_at_12756:kfdOfflinePriv():1373:13291:
>> kfdOffline: mask=0x7e check=0xc flg=0x2 caller=1 dskCnt=1 client=1
>> clientAssisting=1 at kfio.c:8245
>>
>> 2022-09-12 15:02:54.961 :C343DE62:KFNC:kfnc.c_at_3929:kfncDiskOffline():1373:13291:
>> gnum [2.1193292329], count 1 mask 126 ocheck 12
>>
>> 2022-09-12 15:02:54.961 :C343DE63:KFNC:kfn.c_at_19336:kfnSetupASMBIdx():1373:13291:
>> Known group number 2, owner asmb index 0; inrenew = 0
>>
>> 2022-09-12 15:02:54.961 :C343DE64:KFNC:kfnc.c_at_7525:kfncGetCurrentPoolInt():1373:13291:
>> asmb index = 0
>>
>> 2022-09-12 15:02:54.961 :C343DE65:KFNC:kfnc.c_at_12024:kfncSlaveMsgAllocInt():1373:13291:
>> [kfncDiskOffline] usepga=0 workmsg=1 noslave=0 uga=0x7fabf6d2bc30 at
>> kfnc.c:3952, asmb#=0
>>
>> Regards
>>
>> Lok
>>
>
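Since the log writer trace shows a dozen disks offlined in group 2, it is also worth confirming the resync/rebalance finished once the cell recovered; a minimal check from the ASM instance (standard V$ASM_OPERATION):

-- any row returned means ASM is still resyncing / rebalancing;
-- no rows means the outstanding work for the group is done
select group_number, operation, pass, state, power, sofar, est_work, est_minutes
  from v$asm_operation;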
-- http://www.freelists.org/webpage/oracle-l
Received on Wed Sep 14 2022 - 08:25:52 CEST