Re: DataGuard Failover Test issues
From: Maris Elsins <elmaris_at_gmail.com>
Date: Fri, 12 Feb 2016 12:31:34 +0200
Message-ID: <CABQhObu=va5E6r8FXkkBLHFgxSa0A=dNJy_R6bkM7L8pC=2-eA_at_mail.gmail.com>
Abbas, I don't think you are wrong, and honestly saying I don't know if I'm right :)
I based my answer on few clues - 1) issue happens only when the host is down (IP is unreachable) 2) TCP connection timeouts can take long 3) we haven't seen the sqlnet.ora (probably the timeout settings have been changed?).
Date: Fri, 12 Feb 2016 12:31:34 +0200
Message-ID: <CABQhObu=va5E6r8FXkkBLHFgxSa0A=dNJy_R6bkM7L8pC=2-eA_at_mail.gmail.com>
Abbas, I don't think you are wrong, and honestly saying I don't know if I'm right :)
I based my answer on few clues - 1) issue happens only when the host is down (IP is unreachable) 2) TCP connection timeouts can take long 3) we haven't seen the sqlnet.ora (probably the timeout settings have been changed?).
I think there is a chance that the broker could try connecting several times to the primary. (easy to check using strace or tracing tool alike). I just wanted to give my opinion one where I'd start the troubleshooting.
--- Maris Elsins _at_MarisElsins <https://twitter.com/MarisElsins> www.facebook.com/maris.elsins On Fri, Feb 12, 2016 at 12:18 PM, abu fazal md abbas <abbas.appsdba_at_yahoo.inReceived on Fri Feb 12 2016 - 11:31:34 CET
> wrote:
> Maris,
>
> > $ some of the network related timeouts can be tuned in sqlnet.ora (ie.
> SQLNET.OUTBOUND_CONNECT_TIMEOUT, TCP.CONNECT_TIMEOUT), or by specific
> parameters in the TNS description in tnsnames.ora (CONNECT_TIMEOUT,
> RETRY_COUNT). Probably experimenting with these could reduce the time it
> takes to try getting the connection to the primary.
>
> My understanding is that at max it should time out within 60 secs (based
> on the default TCP.CONNECT_TIMEOUT value).
> Please correct me if I am wrong.
>
> > These timeouts can be quite long (70 to 100 seconds), and if the DG
> Broker attempts reconnecting several times it can possible take 20+ minutes.
>
> Is there a possibility that the broker will try reconnecting even though
> retry parameter is not set for TNS alias (default RETRY=0)?
>
>
>
> *Thanks and Regards,*
>
> Abbas
>
> *Abu Fazal Mohammed Abbas*
> +91-991-625-7994
> Website: www.oraclebuffer.com
>
>
> ------------------------------
> *From:* Maris Elsins <elmaris_at_gmail.com>
> *To:* oracle.blog3_at_gmail.com
> *Cc:* abu fazal md abbas <abbas.appsdba_at_yahoo.in>; Oracle Mailing List <
> oracle-l_at_freelists.org>
> *Sent:* Friday, 12 February 2016 3:22 PM
>
> *Subject:* Re: DataGuard Failover Test issues
>
> Hi,
>
> It does look like it's related to TCP connection timeouts, because they
> affect you only when the host is down and the IP is unreachable. These
> timeouts can be quite long (70 to 100 seconds), and if the DG Broker
> attempts reconnecting several times it can possible take 20+ minutes. I
> haven't experienced this by myself, but you can test a few things to get
> closer to the truth:
> on the standby when the primary is down:
> $ time telnet primary 1545 - see how long this would take before it fails
> (that would be the TCP connection timeout)
> $ tnsping test_P - see how long this takes before it fails, does this take
> as long a the previous command (this is impacted by sql*net settings and
> should be shorter than the tcp timeout)?
> $ some of the network related timeouts can be tuned in sqlnet.ora (ie.
> SQLNET.OUTBOUND_CONNECT_TIMEOUT, TCP.CONNECT_TIMEOUT), or by specific
> parameters in the TNS description in tnsnames.ora (CONNECT_TIMEOUT,
> RETRY_COUNT). Probably experimenting with these could reduce the time it
> takes to try getting the connection to the primary.
>
> ---
> Maris Elsins
> _at_MarisElsins <https://twitter.com/MarisElsins>
> www.facebook.com/maris.elsins
>
>
>
> On Thu, Feb 11, 2016 at 10:49 PM, max scalf <oracle.blog3_at_gmail.com>
> wrote:
>
> Abbas,
>
> Thank you for those doc I will check those out, I remember looking at my
> notes from the past as we are using NON default 1521 port(security teams
> wishes), after enabling the broker configuration we see below entry and
> that actually prevented us from doing a switchover with below entry and
> hence to get around that we moved it to staticconnectidentifer to TEST_P
> and TEST_S...
>
> StaticConnectIdentifier =
> '(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=primary)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=TEST_P_DGMGRL)(INSTANCE_NAME=TEST)(SERVER=DEDICATED)))'
>
> Also here is another thing that I forgot to mentioned, If i were to run
> "failover to test_s", it takes long time but works just fine. And as this
> is part of our FAILOVER test, we bring the primary back up online and
> reinstate the primary we are back in business, the show config command does
> not hang.... seems like it only hang when primary is down completely.
>
> On Thu, Feb 11, 2016 at 1:49 PM, abu fazal md abbas <
> abbas.appsdba_at_yahoo.in> wrote:
>
> Looks like you are using the static listener entry for both
> *StaticConnectIdentifier* and *DGConnectIdentifier*, which is not
> recommended as per Oracle documentation. Further, you have also defined
> static entry for broker "DGB" service which is not recommended.
>
> I would suggest to review the configuration based on the following
> documentation
>
> Oracle Data Guard Broker and Static Service Registration (Doc ID 1387859.1)
>
> If the problem still persists, you may consider enabling tracing at broker
> level to troubleshoot the issue further. Following is the document, which
> can referred be for enabling tracing.
>
> Diagnosing Connection Problems with an enabled Data Guard Broker
> Configuration (Doc ID 745201.1)
>
>
>
>
> *Thanks and Regards,*
>
> Abbas
>
> *Abu Fazal Mohammed Abbas*
> +91-991-625-7994
> Website: www.oraclebuffer.com
>
>
> ------------------------------
> *From:* max scalf <oracle.blog3_at_gmail.com>
> *To:* abu fazal md abbas <abbas.appsdba_at_yahoo.in>
> *Cc:* Oracle Mailing List <oracle-l_at_freelists.org>
> *Sent:* Friday, 12 February 2016 12:34 AM
>
> *Subject:* Re: DataGuard Failover Test issues
>
> We are using non default port of 1545, but local_listener is not pointing
> to anything. We have a static listener entry in listener.ora file....see
> below for entries from listener.ora and tnsnames.ora
>
> standby:/oracle/test> sqlplus / as sysdba
>
> SQL*Plus: Release 11.2.0.4.0 Production on Fri Feb 12 06:01:20 2016
>
> Copyright (c) 1982, 2013, Oracle. All rights reserved.
>
>
> Connected to:
> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit
> Production
> With the Partitioning, OLAP, Data Mining and Real Application Testing
> options
>
> SQL> show parameter local_list%;
>
> NAME TYPE VALUE
> ------------------------------------ -----------
> ------------------------------
> local_listener string
> SQL> exit
> Disconnected from Oracle Database 11g Enterprise Edition Release
> 11.2.0.4.0 - 64bit Production
> With the Partitioning, OLAP, Data Mining and Real Application Testing
> options
> standby:/oracle/test>
> standby:/oracle/test> cd /oracle/test/11204/network/admin/
> standby:/oracle/test/11204/network/admin>
> standby:/oracle/test/11204/network/admin> cat listener.ora
> LISTENER_test =
> (DESCRIPTION_LIST =
> (DESCRIPTION =
> (ADDRESS_LIST =
> (ADDRESS =
> (PROTOCOL = TCP)
> (HOST = standby)
> (PORT = 1545)
> )
> )
> )
> )
> STARTUP_WAIT_TIME_LISTENER_test = 0
> CONNECT_TIMEOUT_LISTENER_test = 10
> TRACE_LEVEL_LISTENER_test = OFF
>
> SID_LIST_LISTENER_test=
> (SID_LIST =
> (SID_DESC =
> (SDU = 32768)
> (SID_NAME = test)
> (GLOBAL_DBNAME=test_S)
> (ORACLE_HOME = /oracle/test/11204)
> )
> (SID_DESC =
> (SID_NAME = test)
> (GLOBAL_DBNAME=test.WORLD)
> (ORACLE_HOME = /oracle/test/11204)
> )
> (SID_DESC =
> (SID_NAME = test)
> (GLOBAL_DBNAME=test_S_DGMGRL)
> (ORACLE_HOME = /oracle/test/11204)
> )
> (SID_DESC =
> (SID_NAME = test)
> (GLOBAL_DBNAME=test_S_DGMGRL.WORLD)
> (ORACLE_HOME = /oracle/test/11204)
> )
> (SID_DESC =
> (SID_NAME = test)
> (GLOBAL_DBNAME=test_S_XPT.WORLD)
> (ORACLE_HOME = /oracle/test/11204)
> )
> (SID_DESC =
> (SID_NAME = test)
> (GLOBAL_DBNAME=test_S_DGB.WORLD)
> (ORACLE_HOME = /oracle/test/11204)
> )
> )
>
> standby:/oracle/test/11204/network/admin>
> standby:/oracle/test/11204/network/admin> cat tnsnames.ora
> test.WORLD=
> (DESCRIPTION =
> (SDU = 32768)
> (ADDRESS_LIST =
> (ADDRESS =
> (PROTOCOL = TCP)
> (HOST = primary)
> (PORT = 1545)
> )
> )
> (CONNECT_DATA =
> (SID = test)
> (GLOBAL_NAME = test.WORLD)
> )
> )
>
> TEST_P.WORLD=
> (DESCRIPTION =
> (ADDRESS_LIST =
> (ADDRESS =
> (COMMUNITY = TEST.WORLD)
> (PROTOCOL = TCP)
> (HOST = primary)
> (PORT =1545)
> )
> )
> (CONNECT_DATA =
> (SID = test)
> (GLOBAL_NAME = test)
> (SERVICE_NAME = test_P)
> )
> )
> TEST_S.WORLD=
> (DESCRIPTION =
> (ADDRESS_LIST =
> (ADDRESS =
> (COMMUNITY = TEST.WORLD)
> (PROTOCOL = TCP)
> (HOST = standby)
> (PORT =1545)
> )
> )
> (CONNECT_DATA =
> (SID = test)
> (GLOBAL_NAME = test)
> (SERVICE_NAME= test_S)
> )
> )
> standby:/oracle/test/11204/network/admin>
>
>
> On Thu, Feb 11, 2016 at 12:41 PM, abu fazal md abbas <
> abbas.appsdba_at_yahoo.in> wrote:
>
> I was suspecting a discrepancy between broker NetTimeout settings and TNS
> retries/delay settings.
> However, looks like that is not the case here.
>
> The only setting that seems suspicious to me in the broker setting is the
> *"StaticConnectIdentifier"* property which looks to be pointing to a
> dynamic listener service. Are you not using a static listener service
> (<db_unique_name>_DGMGRL.<db_domain>) for this setup?
>
> Looks like you are also using a non default port (1545) for this databases?
> Can you confirm, if the local_listener of the databases point to that non
> default port?
>
>
>
> *Thanks and Regards,*
>
> Abbas
>
> *Abu Fazal Mohammed Abbas*
> Website: www.oraclebuffer.com
>
>
> ------------------------------
> *From:* max scalf <oracle.blog3_at_gmail.com>
> *To:* abu fazal md abbas <abbas.appsdba_at_yahoo.in>
> *Cc:* Oracle Mailing List <oracle-l_at_freelists.org>
> *Sent:* Thursday, 11 February 2016 10:04 PM
>
> *Subject:* Re: DataGuard Failover Test issues
>
> Here it is...
>
> DGMGRL> show configuration;
>
> Configuration - test_config
>
> Protection Mode: MaxPerformance
> Databases:
> test_p - Primary database
> test_s - Physical standby database
>
> Fast-Start Failover: DISABLED
>
> Configuration Status:
> SUCCESS
>
> DGMGRL>
> DGMGRL> show database verbose test_p;
>
> Database - test_p
>
> Role: PRIMARY
> Intended State: TRANSPORT-ON
> Instance(s):
> test
>
> Properties:
> DGConnectIdentifier = 'test_p'
> ObserverConnectIdentifier = ''
> LogXptMode = 'ASYNC'
> DelayMins = '0'
> Binding = 'optional'
> MaxFailure = '0'
> MaxConnections = '1'
> ReopenSecs = '300'
> NetTimeout = '30'
> RedoCompression = 'DISABLE'
> LogShipping = 'ON'
> PreferredApplyInstance = ''
> ApplyInstanceTimeout = '0'
> ApplyParallel = 'AUTO'
> StandbyFileManagement = 'AUTO'
> ArchiveLagTarget = '0'
> LogArchiveMaxProcesses = '5'
> LogArchiveMinSucceedDest = '1'
> DbFileNameConvert = ''
> LogFileNameConvert = '/oracle/test/origlogA/,
> /oracle/test/origlogA/, /oracle/test/origlogB/, /oracle/test/origlogB/,
> /oracle/test/standbylog, /oracle/test/standbylog'
> FastStartFailoverTarget = ''
> InconsistentProperties = '(monitor)'
> InconsistentLogXptProps = '(monitor)'
> SendQEntries = '(monitor)'
> LogXptStatus = '(monitor)'
> RecvQEntries = '(monitor)'
> ApplyLagThreshold = '0'
> TransportLagThreshold = '0'
> TransportDisconnectedThreshold = '30'
> SidName = 'test'
> StaticConnectIdentifier = 'test_P'
> StandbyArchiveLocation = '/oracle/test/oraarch/testarch'
> AlternateLocation = ''
> LogArchiveTrace = '0'
> LogArchiveFormat = '%t_%s_%r.dbf'
> TopWaitEvents = '(monitor)'
>
> Database Status:
> SUCCESS
>
> DGMGRL>
> DGMGRL>
> DGMGRL>
> DGMGRL> show database verbose test_s;
>
> Database - test_s
>
> Role: PHYSICAL STANDBY
> Intended State: APPLY-ON
> Transport Lag: 0 seconds (computed 0 seconds ago)
> Apply Lag: 0 seconds (computed 0 seconds ago)
> Apply Rate: 78.00 KByte/s
> Real Time Query: OFF
> Instance(s):
> test
>
> Properties:
> DGConnectIdentifier = 'test_s'
> ObserverConnectIdentifier = ''
> LogXptMode = 'ASYNC'
> DelayMins = '0'
> Binding = 'OPTIONAL'
> MaxFailure = '0'
> MaxConnections = '1'
> ReopenSecs = '300'
> NetTimeout = '30'
> RedoCompression = 'DISABLE'
> LogShipping = 'ON'
> PreferredApplyInstance = ''
> ApplyInstanceTimeout = '0'
> ApplyParallel = 'AUTO'
> StandbyFileManagement = 'AUTO'
> ArchiveLagTarget = '0'
> LogArchiveMaxProcesses = '5'
> LogArchiveMinSucceedDest = '1'
> DbFileNameConvert = ''
> LogFileNameConvert = '/oracle/test/origlogA/,
> /oracle/test/origlogA/, /oracle/test/origlogB/, /oracle/test/origlogB/,
> /oracle/test/standbylog, /oracle/test/standbylog'
> FastStartFailoverTarget = ''
> InconsistentProperties = '(monitor)'
> InconsistentLogXptProps = '(monitor)'
> SendQEntries = '(monitor)'
> LogXptStatus = '(monitor)'
> RecvQEntries = '(monitor)'
> ApplyLagThreshold = '0'
> TransportLagThreshold = '0'
> TransportDisconnectedThreshold = '30'
> SidName = 'test'
> StaticConnectIdentifier = 'test_S'
> StandbyArchiveLocation = '/oracle/test/oraarch/testarch'
> AlternateLocation = ''
> LogArchiveTrace = '0'
> LogArchiveFormat = '%t_%s_%r.dbf'
> TopWaitEvents = '(monitor)'
>
> Database Status:
> SUCCESS
>
> DGMGRL> exit
> standby:/oracle/test>
> standby:/oracle/test>
> standby:/oracle/test>
> standby:/oracle/test>
> standby:/oracle/test> tnsping test_s
>
> TNS Ping Utility for Linux: Version 11.2.0.4.0 - Production on 12-FEB-2016
> 03:32:18
>
> Copyright (c) 1997, 2013, Oracle. All rights reserved.
>
> Used parameter files:
> /oracle/test/11204/network/admin/sqlnet.ora
>
>
> Used TNSNAMES adapter to resolve the alias
> Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (COMMUNITY
> = TEST.WORLD) (PROTOCOL = TCP) (HOST = standby) (PORT =1545)))
> (CONNECT_DATA = (SID = test) (GLOBAL_NAME = test) (SERVICE_NAME= test_S)))
> OK (0 msec)
> standby:/oracle/test>
> standby:/oracle/test> tnsping test_p
>
> TNS Ping Utility for Linux: Version 11.2.0.4.0 - Production on 12-FEB-2016
> 03:32:20
>
> Copyright (c) 1997, 2013, Oracle. All rights reserved.
>
> Used parameter files:
> /oracle/test/11204/network/admin/sqlnet.ora
>
>
> Used TNSNAMES adapter to resolve the alias
> Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (COMMUNITY
> = TEST.WORLD) (PROTOCOL = TCP) (HOST = primary) (PORT =1545)))
> (CONNECT_DATA = (SID = test) (GLOBAL_NAME = test) (SERVICE_NAME = test_P)))
> OK (10 msec)
> standby:/oracle/test>
>
>
> On Thu, Feb 11, 2016 at 10:20 AM, abu fazal md abbas <
> abbas.appsdba_at_yahoo.in> wrote:
>
> Hi,
>
> It seems to be an issue with connection timeout settings.
>
> Could you provide the output of the following
>
> dgmgrl> show database verbose test_p
> dgmgrl> show database verbose test_s
>
> and then from both primary and standby servers
>
> $ tnsping test_p
> $ tnsping test_s
>
>
>
> *Thanks and Regards,*
>
> Abbas
>
> *Abu Fazal Mohammed Abbas*
> Website: www.oraclebuffer.com
>
>
> ------------------------------
> *From:* max scalf <oracle.blog3_at_gmail.com>
> *To:* Oracle Mailing List <oracle-l_at_freelists.org>
> *Sent:* Thursday, 11 February 2016 8:49 PM
> *Subject:* Re: DataGuard Failover Test issues
>
> quick update to that, after about 20 mins of hanging i got below error
> from show config...
>
> Configuration Status:
> ORA-16662: network timeout when contacting a database
> DGM-17017: unable to determine configuration status
>
> DGMGRL>
>
>
> On Thu, Feb 11, 2016 at 9:18 AM, max scalf <oracle.blog3_at_gmail.com> wrote:
>
> hello list,
>
> We are in process of testing failover for our 11.2.0.4 data guard
> environement. While under normal circumstance when we ran show
> configuration for data guard broker we get success message as shown below
>
> DGMGRL> show configuration verbose;
>
> Configuration - test_config
>
> Protection Mode: MaxPerformance
> Databases:
> test_p - Primary database
> test_s - Physical standby database
>
> Properties:
> FastStartFailoverThreshold = '30'
> OperationTimeout = '30'
> FastStartFailoverLagLimit = '30'
> CommunicationTimeout = '180'
> ObserverReconnect = '0'
> FastStartFailoverAutoReinstate = 'TRUE'
> FastStartFailoverPmyShutdown = 'TRUE'
> BystandersFollowRoleChange = 'ALL'
> ObserverOverride = 'FALSE'
> ExternalDestination1 = ''
> ExternalDestination2 = ''
> PrimaryLostWriteAction = 'CONTINUE'
>
> Fast-Start Failover: DISABLED
>
> Configuration Status:
> SUCCESS
>
> DGMGRL>
>
> now when we literally go power off our primary host(test_p) and come over
> to standby host(test_p) we do show configuration it just hangs as shown
> below, I waited a good 20 mins but no luck
>
> DGMGRL> connect sys/XXXXXX;
> Connected.
> DGMGRL> show configuration verbose;
>
>
> i do see message in alert log as show below, once we shutdown primary...
> and when i issue show config in data guard broker, no errors in drcTEST.log
> file either...
>
> Fri Feb 12 02:13:00 2016
> RFS[5]: Possible network disconnect with primary database
> Fri Feb 12 02:13:00 2016
> RFS[6]: Assigned to RFS process 3786
> RFS[6]: Possible network disconnect with primary database
> Fri Feb 12 02:13:00 2016
> RFS[4]: Possible network disconnect with primary database
> Fri Feb 12 02:13:00 2016
> RFS[2]: Possible network disconnect with primary database
> Fri Feb 12 02:14:32 2016
>
>
>
> has anybody seem something similar ?? What am i doing wrong? I have
> tested switchover 10's of time successfully but when it comes to failover
> the show config part hangs....
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
-- http://www.freelists.org/webpage/oracle-l