Re: DB Instance failure Issue

From: Lok P <loknath.73_at_gmail.com>
Date: Thu, 15 Sep 2022 02:08:55 +0530
Message-ID: <CAKna9VatvV14Ov6OkzX=Coe54zyL+PSfEpg-ZncjtS7HayppnA_at_mail.gmail.com>



Thank You so much.
That doc describes the database hanging with 'log file parallel write', but in our case we were not seeing many 'log file parallel write' wait events when the crash happened. It also happened all of a sudden; we didn't experience any hang-like scenario beforehand. On the other hand, the average 'log file parallel write' response time has actually improved post 19c migration (from ~10ms on average to <2ms now during peak load), perhaps because the writes are now served from flash cache rather than from hard disk as on 11.2.
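
For anyone comparing similar numbers, a rough sketch against the standard AWR views (DBA_HIST_SNAPSHOT / DBA_HIST_SYSTEM_EVENT; Diagnostics Pack licensing assumed) would look something like the below, taking the per-snapshot delta of time waited over the delta of waits:

  -- rough sketch, assuming AWR retention still covers the window of interest;
  -- computes the per-snapshot average 'log file parallel write' latency in ms
  with ev as (
    select s.snap_id, s.instance_number, s.end_interval_time,
           e.total_waits, e.time_waited_micro
    from   dba_hist_snapshot s
           join dba_hist_system_event e
             on  e.snap_id         = s.snap_id
             and e.dbid            = s.dbid
             and e.instance_number = s.instance_number
    where  e.event_name = 'log file parallel write'
  )
  select snap_id, instance_number, end_interval_time,
         round( (time_waited_micro
                  - lag(time_waited_micro) over
                      (partition by instance_number order by snap_id))
              / nullif(total_waits
                  - lag(total_waits) over
                      (partition by instance_number order by snap_id), 0)
              / 1000, 2) as avg_wait_ms
  from   ev
  order  by instance_number, snap_id;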

We got a pointer from support towards Doc ID 2694588.1, which says large I/O or multiple parallel slaves can cause the crash. But we are still struggling to connect the dots, i.e. to understand which exact queries were causing the large I/O or the parallel slaves and thus triggered this issue. And what exactly is an MR Key?
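
On the "which SQL" question, would something like the rough ASH sketch below be a reasonable starting point? It groups approximate I/O and parallel-slave activity by SQL_ID out of DBA_HIST_ACTIVE_SESS_HISTORY (Diagnostics Pack assumed); the timestamps are only placeholders for the actual crash window:

  -- rough sketch, assuming the crash window is still covered by ASH/AWR
  -- retention; replace the placeholder timestamps with the real window
  select sql_id,
         count(distinct session_id || ',' || session_serial#)      as sessions_sampled,
         count(distinct qc_session_id)                             as px_coordinators,
         round(sum(nvl(delta_read_io_bytes,0)
                 + nvl(delta_write_io_bytes,0))/1024/1024/1024, 1) as approx_io_gb
  from   dba_hist_active_sess_history
  where  sample_time between timestamp '2022-09-12 14:45:00'
                         and timestamp '2022-09-12 15:06:00'
  group  by sql_id
  order  by approx_io_gb desc
  fetch  first 20 rows only;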

On Wed, Sep 14, 2022 at 11:56 AM Pap <oracle.developer35_at_gmail.com> wrote:

>
> See this one..
> Exadata: Database hang with high log file parallel write wait and the
> cells report ORA-00600: [Scheduler::One_Or_More_Cellsrv_Threads_Stuck]
> after upgrading to 19.2.x (Doc ID 2604805.1)
>
> On Wed, 14 Sep 2022, 10:24 am Pap, <oracle.developer35_at_gmail.com> wrote:
>
>> Your log shows a slow log writer and a disk offline issue. We have also
>> encountered similar issues with log writer response times, and it may be
>> a bug. You may try searching MOS.
>>
>> On Wed, Sep 14, 2022 at 1:07 AM Lok P <loknath.73_at_gmail.com> wrote:
>>
>>> Hello Listers,
>>>
>>> This is Oracle version 19.15.0.0.0, a 4-node RAC on Exadata X8M, image
>>> version 19.3.7, upgraded from 11.2 a few days back. All the nodes crashed
>>> suddenly, and the alert logs point mainly to node 2 and a specific cell
>>> server. Below are snippets of the alert logs from the database and the
>>> cell server.
>>>
>>> We suspect we have hit a bug and are checking with support, but we also
>>> want to understand from the experts here whether anybody has encountered
>>> anything similar.
>>>
>>> *Database log:- *
>>>
>>> 2022-09-12T15:01:53.278393-04:00
>>>
>>> requested by (instance=2, osid=97030 (DIA0)), summary=[process state
>>> dump request (kjznHA)]
>>>
>>> requested by (instance=2, osid=97030 (DIA0)), summary=[process state
>>> dump request (kjznHA)]
>>>
>>> 2022-09-12T15:02:54.901871-04:00
>>>
>>> NOTE: process _lgwr_DB12 (97118) initiating optional offline of disk
>>> 22.4042277007 (RECO_XGMB_CD_06_XXGMCL13) with mask 0x7e in group 2
>>> (RECO_XGMB) with client assisting
>>>
>>> 2022-09-12T15:02:54.946461-04:00
>>>
>>> NOTE: updating disk modes to 0x5 from 0x7 for disk 12
>>> (RECO_XGMB_CD_05_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
>>>
>>> NOTE: updating disk modes to 0x5 from 0x7 for disk 13
>>> (RECO_XGMB_CD_02_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0
>>>
>>> ---------------
>>>
>>> 2022-09-12T15:02:54.961121-04:00
>>>
>>> Errors in file
>>> /u01/app/oracle/diag/rdbms/DB1_xXGMb/DB12/trace/DB12_lgwr_97118.trc:
>>>
>>> ORA-15080: synchronous I/O operation failed to write block 32665540 of
>>> disk 22 in disk group RECO_XGMB
>>>
>>> 2022-09-12T15:02:54.961169-04:00
>>>
>>> WARNING: failed to write mirror side 2 of virtual extent 3987 logical
>>> extent 1 of file 625 in group 2 on disk 22 allocation unit 1294
>>>
>>> NOTE: process _lgwr_DB12 (97118) initiating offline of disk
>>> 22.4042277007 (RECO_XGMB_CD_06_XGMCL13) with mask 0x7e in group 2
>>> (RECO_XGMB) with client assisting
>>>
>>> ---------------
>>>
>>> 2022-09-12T15:03:09.809095-04:00
>>>
>>> NOTE: ASMB process initiating disk discovery for grp 2
>>> (reqid:5125151858335298229)
>>>
>>> 2022-09-12T15:03:11.517788-04:00
>>>
>>> NOTE: Found o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13 for disk
>>> RECO_XGMB_CD_05_XGMCL13
>>>
>>> SUCCESS: disk RECO_XGMB_CD_05_XGMCL13 (12.4042276997) replaced in
>>> diskgroup RECO_XGMB path: o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13
>>>
>>> SUCCESS: completed reconfiguration of group 2 (RECO_XGMB)
>>>
>>> -------------
>>>
>>> 2022-09-12T15:05:13.914953-04:00
>>>
>>> LGWR (ospid: 97118) waits for event 'log file parallel write' for 138
>>> secs.
>>>
>>> 2022-09-12T15:05:24.010945-04:00
>>>
>>> LGWR (ospid: 97118) waits for event 'log file parallel write' for 148
>>> secs.
>>>
>>> Errors in file
>>> /u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/trace/DB12_lmhb_97063.trc (incident=319107):
>>>
>>> ORA-29770: global enqueue process LGWR (OSID 97118) is hung for more
>>> than 140 seconds
>>>
>>> Incident details in:
>>> /u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/incident/incdir_319107/DB12_lmhb_97063_i319107.trc
>>>
>>> 2022-09-12T15:05:24.744688-04:00
>>>
>>> NOTE: ASMB process initiating disk discovery for grp 2
>>> (reqid:5125151858335314113)
>>>
>>> 2022-09-12T15:05:26.701077-04:00
>>>
>>> LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_DB1_XGMB by memno
>>> 1
>>>
>>> 2022-09-12T15:05:27.038295-04:00
>>>
>>> Process termination requested for pid 97118 [source = rdbms], [info = 0]
>>> [request issued by pid: 97063, uid: 260]
>>>
>>> 2022-09-12T15:05:27.038338-04:00
>>>
>>> Process termination requested for pid 97041_97049 [source = rdbms],
>>> [info = 0] [request issued by pid: 97063, uid: 260]
>>>
>>> 2022-09-12T15:05:27.038441-04:00
>>>
>>> Process termination requested for pid 97039_97050 [source = rdbms],
>>> [info = 0] [request issued by pid: 97063, uid: 260]
>>>
>>> 2022-09-12T15:05:27.038471-04:00
>>>
>>> *****************
>>>
>>> 2022-09-12T15:05:27.054641-04:00
>>>
>>> Dumping diagnostic data in directory=[cdmp_20220912150527], requested by
>>> (instance=2, osid=97063 (LMHB)), summary=[abnormal instance termination].
>>>
>>> LMHB (ospid: 97063): terminating the instance due to ORA error 29770
>>>
>>> 2022-09-12T15:05:27.064406-04:00
>>>
>>> Cause - 'ERROR: Some process(s) is not making progress.
>>>
>>> LMHB (ospid: 97063) is terminating the instance.
>>>
>>> Please check LMHB trace file for more details.
>>>
>>> Please also check the CPU load, I/O load and other system properties for
>>> anomalous behavior
>>>
>>> ERROR: Some process(s'
>>>
>>> 2022-09-12T15:05:29.118991-04:00
>>>
>>>
>>> *Cell server log :- *
>>>
>>> sendMessage completed with error 2, clientHost: XGMdb06.XX.XXXXX.com[pid:148143]
>>> destinationID: 1243439106 openCnt 1 requestCnt 4
>>>
>>> Implicit fencing of this port will be performed
>>>
>>> slos passed to ossmisc_dump_slos_alert routine is not in failure status
>>>
>>> 2022-09-11T22:10:01.317939-04:00
>>>
>>> I/O Resource Manager: Setting intra-database plan for database DB1_XGMB:
>>> DEFAULT_MAINTENANCE_PLAN
>>>
>>> 2022-09-12T15:03:16.978494-04:00
>>>
>>> OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
>>> 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
>>> timed out due to pending I/O's)
>>>
>>> 2022-09-12T15:04:21.539419-04:00
>>>
>>> OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
>>> 673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
>>> timed out due to
>>>
>>> -------------
>>>
>>> WARNING: Cellsrv UserThread: LWPID: 59568 userId: 108 kernelId: 72
>>> pthreadID: 13766447306496 is stuck for more than 10025222 micro sec.
>>>
>>> Please take a look at svtrc_57983_108.trc and System State dump for more
>>> info
>>>
>>> Errors in file
>>> /opt/oracle/cell/log/diag/asm/cell/x073cl13/trace/svtrc_57983_16.trc (incident=17):
>>>
>>> ORA-00600: internal error code, arguments:
>>> [Scheduler::One_Or_More_Cellsrv_Threads_Stuck], [1], [], [], [], [], [],
>>> [], [], [], [], []
>>>
>>> Incident details in: /opt/oracle/cell/log/diag/asm/cell/XGMcl13
>>> /incident/incdir_17/svtrc_57983_16_i17.trc
>>>
>>> 2022-09-12T15:48:06.190103-04:00
>>>
>>> [RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid:
>>> 57980, srvc_pid: 57983) returned with error: 128
>>>
>>> CELLSRV error - ORA-600 internal error
>>>
>>> *logwriter trace:-*
>>>
>>> 2022-09-12 15:02:54.947 :C343DCE7:KFNC:kfnc.c_at_12127:kfncSlaveMsgFree():1373:13291: [kfncDiskOffline]
>>> 0x247eeee5d8 alloc=0x1 msgtyp=0x1
>>>
>>> 2022-09-12 15:02:54.947 :C343DCE8:KFNC:kfnc.c_at_4048:kfncDiskOffline():1373:13291:
>>> 12 disks were taken offline for gnum [2.1193292329]
>>>
>>> 2022-09-12 15:02:54.961 :C343DE61:KFD:kfd.c_at_12756:kfdOfflinePriv():1373:13291:
>>> kfdOffline: mask=0x7e check=0xc flg=0x2 caller=1 dskCnt=1 client=1
>>> clientAssisting=1 at kfio.c:8245
>>>
>>> 2022-09-12 15:02:54.961 :C343DE62:KFNC:kfnc.c_at_3929:kfncDiskOffline():1373:13291:
>>> gnum [2.1193292329], count 1 mask 126 ocheck 12
>>>
>>> 2022-09-12 15:02:54.961 :C343DE63:KFNC:kfn.c_at_19336:kfnSetupASMBIdx():1373:13291:
>>> Known group number 2, owner asmb index 0; inrenew = 0
>>>
>>> 2022-09-12 15:02:54.961 :C343DE64:KFNC:kfnc.c_at_7525:kfncGetCurrentPoolInt():1373:13291:
>>> asmb index = 0
>>>
>>> 2022-09-12 15:02:54.961 :C343DE65:KFNC:kfnc.c_at_12024:kfncSlaveMsgAllocInt():1373:13291:
>>> [kfncDiskOffline] usepga=0 workmsg=1 noslave=0 uga=0x7fabf6d2bc30 at
>>> kfnc.c:3952, asmb#=0
>>>
>>> Regards
>>>
>>> Lok
>>>
>>

--
http://www.freelists.org/webpage/oracle-l
