ORA-01017 when DGMGRL observer re-authenticates primary during failover
Date: Fri, 18 Aug 2023 13:45:23 -0700
Message-ID: <ecb72507-ce9a-458a-b8e4-481547666026_at_gmail.com>
Friends and colleagues,
I have a customer testing automated failovers using the observer of Oracle DataGuard Broker with FastStart Failover (FSFO) enabled, and we're seeing a puzzling problem.
When we stop the entire VM on which the primary database resides, then everything works great. The observer sees the outage, gives the configured 30 second grace period, then starts the failover at the 30 second mark, completing the failover 20-30 seconds later.
When we SHUTDOWN ABORT just the primary database, things go awry. According to the observer.log, the observer sees the outage, but dies on an "authentication error" almost immediately...
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Aug 18
16:16:08 2023
Version 19.3.0.0.0 *<== observer on 19.3, but RDBMS is 19.18 on both
primary and standby*
Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> Connected to "dnabk01"
Connected as SYSDBA.
DGMGRL>
Configuration - FSF
Protection Mode: MaxAvailability Members: dnabk01 - Primary database dnabk02 - (*) Physical standby database
Fast-Start Failover: Enabled in Zero Data Loss Mode
Configuration Status:
SUCCESS (status updated 44 seconds ago)
*(...lots of lines edited out for brevity...)*
[S002 2023-08-18T16:24:59.672+00:00] Executing PING command.
[S002 2023-08-18T16:24:59.673+00:00] DO_OBSERVE command
<DO_OBSERVE version="19.3.0.0.0" obid="214916678"><PING miv="19"
target="2" waittime="0" fsfo_version="36"
flags="3"></PING></DO_OBSERVE>
ORA-03113: end-of-file on communication channel *<== caused by
SHUTDOWN ABORT of primary database (P001)*
Process ID: 20826
Session ID: 354 Serial number: 25580
[W000 2023-08-18T16:24:59.673+00:00] Command PING to thread P001
Warning: You are no longer connected to ORACLE.
returned status=-1
[W000 2023-08-18T16:24:59.673+00:00] Primary database cannot be reached.
[W000 2023-08-18T16:24:59.673+00:00] Fast-Start Failover threshold
has not exceeded. Retry for the next 30 seconds
[S002 2023-08-18T16:24:59.674+00:00] DO_OBSERVE result
<OBSVPING_RESULT status="0" ackseq="0" version="36" flags="3"
focond="N/A"></OBSVPING_RESULT>
[S002 2023-08-18T16:24:59.677+00:00] Ping returned SUCCESS
[W000 2023-08-18T16:25:00.673+00:00] Command PING to thread S002
returned status=0
[W000 2023-08-18T16:25:00.673+00:00] Try to connect to the primary.
[W000 2023-08-18T16:25:00.673+00:00] Sending command CONNECT to
thread P001
[P001 2023-08-18T16:25:00.674+00:00] Executing CONNECT command.
[P001 2023-08-18T16:25:00.674+00:00] Connecting to database using
dnabk01_dgmgrl.
[P001 2023-08-18T16:25:00.674+00:00] Attempt logon as SYSDBA
ORA-01017: invalid username/password; logon denied *<== how can an
ORA-01017 be returned when the database was ORA-03113 one second ago???*
[P001 2023-08-18T16:25:01.722+00:00] Authentication failed.
[W000 2023-08-18T16:25:01.722+00:00] Command CONNECT to thread P001
returned status=-2
[W000 2023-08-18T16:25:01.722+00:00] State cleared on disk.
[W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread P001
[W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread S002
[W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread B003
[B003 2023-08-18T16:25:01.722+00:00] Executing QUIT command.
[B003 2023-08-18T16:25:01.722+00:00] Quitting thread...
[S002 2023-08-18T16:25:01.722+00:00] Executing QUIT command.
[S002 2023-08-18T16:25:01.722+00:00] Quitting thread...
[P001 2023-08-18T16:25:01.723+00:00] Executing QUIT command.
[P001 2023-08-18T16:25:01.723+00:00] Quitting thread...
[W000 2023-08-18T16:25:02.722+00:00] Releasing context of thread B003.
[W000 2023-08-18T16:25:02.723+00:00] Releasing context of thread S002.
[W000 2023-08-18T16:25:02.723+00:00] Releasing context of thread P001.
2023-08-18T16:25:02.723+00:00
Observer 'l7ld2dntapp0001' stopped
So, as you can see in the annotated "observer.log" (with debug enabled), the observer receives ORA-03113 following the SHUTDOWN ABORT of the primary database, and then one second later gets an response ORA-01017 response from the dead database instance.
How is this possible? How can a database instance return an ORA-03113 error (due to the SHUTDOWN ABORT that I performed) and then one second later return an ORA-01017 error, which indicates that the database is up and running?
We have opened an Oracle SR on this, but I'm hoping someone here saw something similar recently?
Please let me know what you think?
Thanks!
-Tim
-- http://www.freelists.org/webpage/oracle-lReceived on Fri Aug 18 2023 - 22:45:23 CEST