Re: ORA-01017 when DGMGRL observer re-authenticates primary during failover

From: Tim Gorman <tim.evdbt_at_gmail.com>
Date: Sun, 20 Aug 2023 09:52:00 -0700
Message-ID: <8860bf4f-eb34-43b3-bbfa-278041670371_at_gmail.com>



Stefan,

I won't know if your response is correct until Monday (tomorrow), but *thank you* for noticing my disregard for the almost-certain problem with the Oracle password file, evidenced by the immediate ORA-01017 error following one second after the ORA-03113.

All I can say is it was Friday and I was fatigued at the end of a long week, only it was the same for you (in fact 9 hours later on your Friday), so I can't even use that as an excuse.

At any rate, thank you for the proper interpretation of those symptoms, I'll follow up to the list with what we find.

Thanks again!

-Tim

On 8/19/2023 2:21 AM, Stefan Koehler wrote:
> Hello Tim,
> this looks like an issue with the password file (orapw$ORACLE_SID) to me.
>
> The difference between your two scenarios is this trace file line "Ping returned SUCCESS".
>
> So if your VM is shutdown completely the ping will also fail and no connect to primary is tried. However as the host is still reachable if you just shutdown the database/instance, the ping will succeed and the observers tries to (re-)connect to the primary database. As the database is shutdown it uses the password from the password file this time (and not the SYS password from the Oracle DDIC).
>
> What happens if you shutdown the database and try to connect remotely to primary as SYSDBA with the known SYS password?
>
>
> Best Regards
> Stefan Koehler
>
> Independent Oracle performance consultant and researcher
> Website:www.soocs.de
> Twitter: _at_OracleSK
>
>> Tim Gorman<tim.evdbt_at_gmail.com> hat am 18.08.2023 22:45 CEST geschrieben:
>>
>> Friends and colleagues,
>>
>> I have a customer testing automated failovers using the observer of Oracle DataGuard Broker with FastStart Failover (FSFO) enabled, and we're seeing a puzzling problem.
>>
>> When we stop the entire VM on which the primary database resides, then everything works great. The observer sees the outage, gives the configured 30 second grace period, then starts the failover at the 30 second mark, completing the failover 20-30 seconds later.
>>
>> When we SHUTDOWN ABORT just the primary database, things go awry. According to the observer.log, the observer sees the outage, but dies on an "authentication error" almost immediately...
>>
>>
>>> DGMGRL for Linux: Release 19.0.0.0.0 - Production on Fri Aug 18 16:16:08 2023
>>> Version 19.3.0.0.0<== observer on 19.3, but RDBMS is 19.18 on both primary and standby
>>>
>>> Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.
>>>
>>> Welcome to DGMGRL, type "help" for information.
>>> DGMGRL> Connected to "dnabk01"
>>> Connected as SYSDBA.
>>> DGMGRL>
>>> Configuration - FSF
>>>
>>> Protection Mode: MaxAvailability
>>> Members:
>>> dnabk01 - Primary database
>>> dnabk02 - (*) Physical standby database
>>>
>>> Fast-Start Failover: Enabled in Zero Data Loss Mode
>>>
>>> Configuration Status:
>>> SUCCESS (status updated 44 seconds ago)
>>> (...lots of lines edited out for brevity...)
>>>
>>> [S002 2023-08-18T16:24:59.672+00:00] Executing PING command.
>>> [S002 2023-08-18T16:24:59.673+00:00] DO_OBSERVE command
>>> <DO_OBSERVE version="19.3.0.0.0" obid="214916678"><PING miv="19" target="2" waittime="0" fsfo_version="36" flags="3"></PING></DO_OBSERVE>
>>> ORA-03113: end-of-file on communication channel<== caused by SHUTDOWN ABORT of primary database (P001)
>>> Process ID: 20826
>>> Session ID: 354 Serial number: 25580
>>>
>>> [W000 2023-08-18T16:24:59.673+00:00] Command PING to thread P001 returned status=-1
>>> [W000 2023-08-18T16:24:59.673+00:00] Primary database cannot be reached.
>>> [W000 2023-08-18T16:24:59.673+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 30 seconds
>>> [S002 2023-08-18T16:24:59.674+00:00] DO_OBSERVE result
>>> <OBSVPING_RESULT status="0" ackseq="0" version="36" flags="3" focond="N/A"></OBSVPING_RESULT>
>>> [S002 2023-08-18T16:24:59.677+00:00] Ping returned SUCCESS
>>> [W000 2023-08-18T16:25:00.673+00:00] Command PING to thread S002 returned status=0
>>> [W000 2023-08-18T16:25:00.673+00:00] Try to connect to the primary.
>>> [W000 2023-08-18T16:25:00.673+00:00] Sending command CONNECT to thread P001
>>> [P001 2023-08-18T16:25:00.674+00:00] Executing CONNECT command.
>>> [P001 2023-08-18T16:25:00.674+00:00] Connecting to database using dnabk01_dgmgrl.
>>> [P001 2023-08-18T16:25:00.674+00:00] Attempt logon as SYSDBA
>>> ORA-01017: invalid username/password; logon denied<== how can an ORA-01017 be returned when the database was ORA-03113 one second ago???
>>>
>>> Warning: You are no longer connected to ORACLE.
>>> [P001 2023-08-18T16:25:01.722+00:00] Authentication failed.
>>> [W000 2023-08-18T16:25:01.722+00:00] Command CONNECT to thread P001 returned status=-2
>>> [W000 2023-08-18T16:25:01.722+00:00] State cleared on disk.
>>> [W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread P001
>>> [W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread S002
>>> [W000 2023-08-18T16:25:01.722+00:00] Sending command QUIT to thread B003
>>> [B003 2023-08-18T16:25:01.722+00:00] Executing QUIT command.
>>> [B003 2023-08-18T16:25:01.722+00:00] Quitting thread...
>>> [S002 2023-08-18T16:25:01.722+00:00] Executing QUIT command.
>>> [S002 2023-08-18T16:25:01.722+00:00] Quitting thread...
>>> [P001 2023-08-18T16:25:01.723+00:00] Executing QUIT command.
>>> [P001 2023-08-18T16:25:01.723+00:00] Quitting thread...
>>> [W000 2023-08-18T16:25:02.722+00:00] Releasing context of thread B003.
>>> [W000 2023-08-18T16:25:02.723+00:00] Releasing context of thread S002.
>>> [W000 2023-08-18T16:25:02.723+00:00] Releasing context of thread P001.
>>> 2023-08-18T16:25:02.723+00:00
>>> Observer 'l7ld2dntapp0001' stopped
>> So, as you can see in the annotated "observer.log" (with debug enabled), the observer receives ORA-03113 following the SHUTDOWN ABORT of the primary database, and then one second later gets an response ORA-01017 response from the dead database instance.
>>
>> How is this possible? How can a database instance return an ORA-03113 error (due to the SHUTDOWN ABORT that I performed) and then one second later return an ORA-01017 error, which indicates that the database is up and running?
>>
>> We have opened an Oracle SR on this, but I'm hoping someone here saw something similar recently?
>>
>> Please let me know what you think?
>>
>> Thanks!
>>
>> -Tim

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Aug 20 2023 - 18:52:00 CEST

Original text of this message