Re: ORA-01017 when DGMGRL observer re-authenticates primary during failover

From: Mikhail Velikikh <mvelikikh_at_gmail.com>
Date: Sat, 19 Aug 2023 15:19:49 +0100
Message-ID: <CALe4HpmbBpb7NwjvqSRdK2dFUDA+8HbfMEQOG8j5-VVB_-QMpw_at_mail.gmail.com>



Hello,

When we SHUTDOWN ABORT just the primary database, things go awry.
> According to the observer.log, the observer sees the outage, but dies on an
> "authentication error" almost immediately...
>

The scenario works fine for me on 19.18. The DB alert log demonstrating shutdown abort:

> 2023-08-19T14:02:03.517381+00:00
> Shutting down ORACLE instance (abort) (OS id: 3824)
> 2023-08-19T14:02:03.517486+00:00
> Shutdown is initiated by sqlplus_at_host-a.mycompany.mydomain (TNS V1-V3).
> License high water mark = 21
> USER (ospid: 3824): terminating the instance
> 2023-08-19T14:02:04.464641+00:00
> ORA-1092 : opitsk aborting process
> 2023-08-19T14:02:04.553354+00:00
> Instance terminated by USER, pid = 3824
> 2023-08-19T14:02:04.560451+00:00
> Instance shutdown complete (OS id: 3824)
>

The observer log:

> [W000 2023-08-19T14:02:03.839+00:00] Primary database cannot be reached.
> [W000 2023-08-19T14:02:03.839+00:00] Fast-Start Failover threshold has not
> exceeded. Retry for the next 10 seconds
> [W000 2023-08-19T14:02:04.839+00:00] Try to connect to the primary.
> [P024 2023-08-19T14:02:04.847+00:00] Failed to attach to
> //host-a:1521/orcla.
> ORA-12514: TNS:listener does not currently know of service requested in
> connect descriptor
>
> ...
>
> Unable to connect to database using //host-a:1521/orcla
> [W000 2023-08-19T14:02:13.873+00:00] Primary database cannot be reached.
> [W000 2023-08-19T14:02:13.873+00:00] Fast-Start Failover threshold has
> expired.
> [W000 2023-08-19T14:02:13.873+00:00] Try to connect to the standby.
> [W000 2023-08-19T14:02:13.873+00:00] Check if the standby is ready for
> failover.
> [P024 2023-08-19T14:02:13.875+00:00] Failed to attach to
> //host-a:1521/orcla.
> ORA-12514: TNS:listener does not currently know of service requested in
> connect descriptor
>
> Unable to connect to database using //host-a:1521/orcla
> [S025 2023-08-19T14:02:13.876+00:00] Fast-Start Failover started...
>
> 2023-08-19T14:02:13.876+00:00
> Initiating Fast-Start Failover to database "orclb"...
> [S025 2023-08-19T14:02:13.876+00:00] Initiating Fast-start Failover.
> Performing failover NOW, please wait...
> Failover succeeded, new primary is "orclb"
> 2023-08-19T14:02:32.922+00:00
> [S025 2023-08-19T14:02:32.922+00:00] Fast-Start Failover finished...
> [W000 2023-08-19T14:02:32.922+00:00] Failover succeeded. Restart pinging.
> [W000 2023-08-19T14:02:32.927+00:00] Primary database has changed to orclb.
> [W000 2023-08-19T14:02:32.929+00:00] Try to connect to the primary.
> [W000 2023-08-19T14:02:32.929+00:00] Try to connect to the primary
> //host-b:1521/orclb.
> [W000 2023-08-19T14:02:33.033+00:00] The standby orcla needs to be
> reinstated
> [W000 2023-08-19T14:02:33.033+00:00] Try to connect to the new standby
> orcla.
> [W000 2023-08-19T14:02:33.033+00:00] Connection to the primary restored!
> [S029 2023-08-19T14:02:33.036+00:00] Failed to attach to
> //host-a:1521/orcla.
> ORA-12514: TNS:listener does not currently know of service requested in
> connect descriptor
>
>
>

DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Aug 18 16:16:08
> 2023
> Version 19.3.0.0.0 *<== observer on 19.3, but RDBMS is 19.18 on both
> primary and standby*
>

It does not look right. It is better to update the Oracle Home on the observer host too. For me it is 19.18 everywhere:

> DGMGRL for Linux: Release 19.0.0.0.0 - Production on Sat Aug 19 14:04:46
> 2023
> Version 19.18.0.0.0
>

How can a database instance return an ORA-03113 error (due to the SHUTDOWN
> ABORT that I performed) and then one second later return an ORA-01017
> error, which indicates that the database is up and running?
>

You can have a static registration and supply a wrong password to get ORA-1017 with no database instance running. See an example below (it is probably not your case, though):

[oracle_at_host-a ~]$ pgrep -af pmon
> 2200 asm_pmon_+ASM
> [oracle_at_host-a ~]$ sqlplus -L sys/wrong_password_at_host-a/orcla as sysdba
>
> SQL*Plus: Release 19.0.0.0.0 - Production on Sat Aug 19 14:09:17 2023
> Version 19.18.0.0.0
>
> Copyright (c) 1982, 2022, Oracle. All rights reserved.
>
> ERROR:
> ORA-01017: invalid username/password; logon denied
>
>
> SP2-0751: Unable to connect to Oracle. Exiting SQL*Plus
> [oracle_at_host-a ~]$ sqlplus -L sys/change_on_install_at_host-a/orcla as sysdba
>
> SQL*Plus: Release 19.0.0.0.0 - Production on Sat Aug 19 14:09:27 2023
> Version 19.18.0.0.0
>
> Copyright (c) 1982, 2022, Oracle. All rights reserved.
>
> *Connected to an idle instance.*
>

Best regards,
Mikhail Velikikh

On Fri, 18 Aug 2023 at 21:47, Tim Gorman <tim.evdbt_at_gmail.com> wrote:

> Friends and colleagues,
>
> I have a customer testing automated failovers using the observer of Oracle
> DataGuard Broker with FastStart Failover (FSFO) enabled, and we're seeing a
> puzzling problem.
>
> When we stop the entire VM on which the primary database resides, then
> everything works great. The observer sees the outage, gives the configured
> 30 second grace period, then starts the failover at the 30 second mark,
> completing the failover 20-30 seconds later.
>
> When we SHUTDOWN ABORT just the primary database, things go awry.
> According to the observer.log, the observer sees the outage, but dies on an
> "authentication error" almost immediately...
>
> DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Aug 18 16:16:08
> 2023
> Version 19.3.0.0.0 *<== observer on 19.3, but RDBMS is 19.18 on both
> primary and standby*
>
> Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights
> reserved.
>
> Welcome to DGMGRL, type "help" for information.
> DGMGRL> Connected to "dnabk01"
> Connected as SYSDBA.
> DGMGRL>
> Configuration - FSF
>
> Protection Mode: MaxAvailability
> Members:
> dnabk01 - Primary database
> dnabk02 - (*) Physical standby database
>
> Fast-Start Failover: Enabled in Zero Data Loss Mode
>
> Configuration Status:
> SUCCESS (status updated 44 seconds ago)
>
> *(...lots of lines edited out for brevity...)*
>
> [S002 2023-08-18T16:24:59.672+00:00] Executing PING command.
> [S002 2023-08-18T16:24:59.673+00:00] DO_OBSERVE command
> <DO_OBSERVE version="19.3.0.0.0" obid="214916678"><PING miv="19"
> target="2" waittime="0" fsfo_version="36" flags="3"></PING></DO_OBSERVE>
> ORA-03113: end-of-file on communication channel *<== caused by SHUTDOWN
> ABORT of primary database (P001)*
> Process ID: 20826
> Session ID: 354 Serial number: 25580
>
> [W000 2023-08-18T16:24:59.673+00:00] Command PING to thread P001 returned
> status=-1
> [W000 2023-08-18T16:24:59.673+00:00] Primary database cannot be reached.
> [W000 2023-08-18T16:24:59.673+00:00] Fast-Start Failover threshold has not
> exceeded. Retry for the next 30 seconds
> [S002 2023-08-18T16:24:59.674+00:00] DO_OBSERVE result
> <OBSVPING_RESULT status="0" ackseq="0" version="36" flags="3"
> focond="N/A"></OBSVPING_RESULT>
> [S002 2023-08-18T16:24:59.677+00:00] Ping returned SUCCESS
> [W000 2023-08-18T16:25:00.673+00:00] Command PING to thread S002 returned
> status=0
> [W000 2023-08-18T16:25:00.673+00:00] Try to connect to the primary.
> [W000 2023-08-18T16:25:00.673+00:00] Sending command CONNECT to thread P001
> [P001 2023-08-18T16:25:00.674+00:00] Executing CONNECT command.
> [P001 2023-08-18T16:25:00.674+00:00] Connecting to database using
> dnabk01_dgmgrl.
> [P001 2023-08-18T16:25:00.674+00:00] Attempt logon as SYSDBA
> ORA-01017: invalid username/password; logon denied *<== how can an
> ORA-01017 be returned when the database was ORA-03113 one second ago???*
>
> Warning: You are no longer connected to ORACLE.
> [P001 2023-08-18T16:25:01.722+00:00] Authentication failed.
> [W000 2023-08-18T16:25:01.722+00:00] Command CONNECT to thread P001
> returned status=-2
> [W000 2023-08-18T16:25:01.722+00:00] State cleared on disk.
> [W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread P001
> [W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread S002
> [W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread B003
> [B003 2023-08-18T16:25:01.722+00:00] Executing QUIT command.
> [B003 2023-08-18T16:25:01.722+00:00] Quitting thread...
> [S002 2023-08-18T16:25:01.722+00:00] Executing QUIT command.
> [S002 2023-08-18T16:25:01.722+00:00] Quitting thread...
> [P001 2023-08-18T16:25:01.723+00:00] Executing QUIT command.
> [P001 2023-08-18T16:25:01.723+00:00] Quitting thread...
> [W000 2023-08-18T16:25:02.722+00:00] Releasing context of thread B003.
> [W000 2023-08-18T16:25:02.723+00:00] Releasing context of thread S002.
> [W000 2023-08-18T16:25:02.723+00:00] Releasing context of thread P001.
> 2023-08-18T16:25:02.723+00:00
> Observer 'l7ld2dntapp0001' stopped
>
>
> So, as you can see in the annotated "observer.log" (with debug enabled),
> the observer receives ORA-03113 following the SHUTDOWN ABORT of the primary
> database, and then one second later gets an response ORA-01017 response
> from the dead database instance.
>
> How is this possible? How can a database instance return an ORA-03113
> error (due to the SHUTDOWN ABORT that I performed) and then one second
> later return an ORA-01017 error, which indicates that the database is up
> and running?
>
> We have opened an Oracle SR on this, but I'm hoping someone here saw
> something similar recently?
>
> Please let me know what you think?
>
> Thanks!
>
> -Tim
>

--
http://www.freelists.org/webpage/oracle-l
Received on Sat Aug 19 2023 - 16:19:49 CEST

Original text of this message