IBM Support

75 ways to demystify DB2: #14: Expert Advice: Why does the standby keep going into RemoteCatchup state when log files are available locally?

Technical Blog Post



Consider the following scenario in a DB2 HADR configuration using ASYNC mode. The log files were extracted from TSM to a local file system on the standby server. The standby database was then deactivated and re-activated, which put the standby into LocalCatchup state. However, the standby is unable to read the local log file because the file is considered "stale", so it re-enters RemoteCatchup state. This process was repeated two or three times; each time the standby failed to read the log file and switched HADR to RemoteCatchup state. DB2 generates messages similar to the following in db2diag.log:

2015-01-28-12.26.52.860613+060 I6092198A457       LEVEL: Warning        
PID     : 1282122              TID  : 3856        PROC : db2sysc 0      
INSTANCE: db2inst              NODE : 000         DB   : MYDB            
APPHDL  : 0-8                  APPID: *LOCAL.db2inst.110729112647        
AUTHID  : DB2BIP                                                        
EDUID   : 3856                 EDUNAME: db2agent (MYDB) 0                
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21151
MESSAGE : Info: HADR Startup has begun.                               

2015-01-28-12.26.53.138837+060 E6096113A369       LEVEL: Event          
PID     : 1282122              TID  : 9254        PROC : db2sysc 0      
INSTANCE: db2inst              NODE : 000                               
EDUID   : 9254                 EDUNAME: db2hadrs (MYDB) 0                
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-LocalCatchup (was S-Boot)                

2015-01-28-12.26.53.286362+060 I6100019A470       LEVEL: Warning        
PID     : 1282122              TID  : 5142        PROC : db2sysc 0      
INSTANCE: db2inst1             NODE : 000                               
EDUID   : 5142                 EDUNAME: db2lfr (MYDB) 0                  
FUNCTION: DB2 UDB, recovery manager, sqlplfrFMReadLog, probe:5150       
MESSAGE : Found a log on a newer chain.  Updating chain number.  extNum / chainId
DATA #1 : unsigned integer, 4 bytes                                     
597293                                                                  
DATA #2 : unsigned integer, 4 bytes                                     
12   

2015-01-28-12.26.55.487649+060 I6101311A360       LEVEL: Warning        
PID     : 1282122              TID  : 5142        PROC : db2sysc 0      
INSTANCE: db2inst1             NODE : 000                               
EDUID   : 5142                 EDUNAME: db2lfr (MYDB) 0                  
FUNCTION: DB2 UDB, recovery manager, sqlplfrFMOpenLog, probe:600
MESSAGE : Extent 597294 in log path may be stale. Trying archive.

As a result, the standby enters RemoteCatchup state and asks the primary for the log files it needs to roll the database forward.

2015-01-28-12.27.00.179704+060 I6102036A364       LEVEL: Warning
PID     : 1282122              TID  : 4371        PROC : db2sysc 0
INSTANCE: db2inst1               NODE : 000
EDUID   : 4371                 EDUNAME: db2logmgr (MYDB) 0
FUNCTION: DB2 UDB, data protection services, sqlpgRetrieveLogFile, probe:4130
MESSAGE : Started retrieve for log file S0597294.LOG.
 
2015-01-28-12.27.00.259102+060 E6102975A431       LEVEL: Warning
PID     : 1282122              TID  : 4371        PROC : db2sysc 0
INSTANCE: db2inst1               NODE : 000
EDUID   : 4371                 EDUNAME: db2logmgr (MYDB) 0
FUNCTION: DB2 UDB, data protection services, sqlpgRetrieveLogFile, probe:4165
MESSAGE : ADM1847W  Failed to retrieve log file "S0597294.LOG" on chain "12" to "/db2/MYDB/log_dir/NODE0000/".

2015-01-28-12.27.00.558157+060 E6103820A385       LEVEL: Event
PID     : 1282122              TID  : 9254        PROC : db2sysc 0
INSTANCE: db2inst1               NODE : 000
EDUID   : 9254                 EDUNAME: db2hadrs (MYDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)

2015-01-28-12.27.00.656618+060 E6104206A386       LEVEL: Event
PID     : 1282122              TID  : 9254        PROC : db2sysc 0
INSTANCE: db2inst1               NODE : 000
EDUID   : 9254                 EDUNAME: db2hadrs (MYDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchup (was S-RemoteCatchupPending)
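
When a db2diag.log is long, it can help to pull out just the HADR state changes and the "may be stale" warnings. The following is a minimal Python sketch, not an IBM tool. The function names `summarize_diag` and `log_file_name` are hypothetical, and the patterns assume the message wording matches the excerpts above; other DB2 levels may word these messages differently.

```python
import re

# Patterns taken from the db2diag.log excerpts in this post (assumed wording).
STATE_CHANGE = re.compile(r"HADR state set to (\S+) \(was (\S+)\)")
STALE_EXTENT = re.compile(r"Extent (\d+) in log path may be stale")


def summarize_diag(text):
    """Return (transitions, stale_extents) found in db2diag.log text.

    transitions is a list of (old_state, new_state) tuples in file order.
    stale_extents is a list of extent numbers flagged as possibly stale.
    """
    transitions = [(old, new) for new, old in STATE_CHANGE.findall(text)]
    stale_extents = [int(n) for n in STALE_EXTENT.findall(text)]
    return transitions, stale_extents


def log_file_name(extent):
    """Map an extent number to its log file name, e.g. 597294 -> S0597294.LOG."""
    return f"S{extent:07d}.LOG"


if __name__ == "__main__":
    sample = """\
CHANGE  : HADR state set to S-LocalCatchup (was S-Boot)
MESSAGE : Extent 597294 in log path may be stale. Trying archive.
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)
CHANGE  : HADR state set to S-RemoteCatchup (was S-RemoteCatchupPending)
"""
    transitions, stale_extents = summarize_diag(sample)
    for old, new in transitions:
        print(f"{old} -> {new}")
    for extent in stale_extents:
        print("possibly stale:", log_file_name(extent))
```

The extent number reported as stale identifies the log file that needs to be replaced on the standby.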

In this case, why does the standby keep switching to RemoteCatchup state when the log files are available locally? Remote catchup can take much longer to complete than local catchup, so what can be done to resolve this issue?

Explanation:

The problem is a timing issue: it depends on when you issue the deactivate database command. If a log file is in the process of being transferred when you deactivate the standby database, the local copy of that file will be incomplete. This is because the remote catchup process does not wait for a log file to be transferred completely before it starts applying it on the standby database. As a consequence, when you subsequently re-activate the standby database, it goes into LocalCatchup state, checks the next log file it needs, finds it incomplete (because of the interrupted transfer), rejects it as "stale" and goes into RemoteCatchup state again.

Hence, unless you are careful to replace any log files that the remote catchup process downloaded from the primary, the standby will always pick up where it left off, i.e. with the log file it was processing when the database was deactivated, and that file is always incomplete. Keep in mind that the effect of an incompletely transferred log file is masked, because log files are always allocated at full size and then populated. A log file being transferred does not start at zero bytes and grow to full size; it is allocated at full size from the start, so when you list the log directory it looks the same size as a complete log file.
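
To illustrate why a directory listing cannot reveal the problem, here is a small Python sketch that simulates a preallocated, partially populated file and compares it with a complete copy. It does not use real DB2 log files. The helper names and the 64 KB size are hypothetical stand-ins, and the comparison assumes a complete copy of the same log file is available (for example, from the archive).

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large log file is not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_size_but_different(path_a, path_b):
    """True when two files have equal size but different content."""
    return (os.path.getsize(path_a) == os.path.getsize(path_b)
            and sha256_of(path_a) != sha256_of(path_b))


def demo():
    size = 64 * 1024  # stand-in for the real log file size
    payload = bytes(range(256)) * (size // 256)
    with tempfile.TemporaryDirectory() as tmp:
        complete = os.path.join(tmp, "complete.LOG")
        partial = os.path.join(tmp, "partial.LOG")
        with open(complete, "wb") as f:
            f.write(payload)
        with open(partial, "wb") as f:
            f.truncate(size)               # allocate the full size up front
            f.write(payload[: size // 4])  # "transfer" interrupted at 25%
        return (os.path.getsize(complete),
                os.path.getsize(partial),
                same_size_but_different(complete, partial))


if __name__ == "__main__":
    complete_size, partial_size, differs = demo()
    print(complete_size, partial_size, differs)
```

Both files report the same size, yet their checksums differ, which is why a size check alone cannot confirm that a log file on the standby is complete.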

Resolution:

1. Deactivate the database on the standby. Do not stop HADR, because that switches the database role to standard.
2. Replace the (stale) log file that the standby was processing when it was deactivated with a full copy of the same log file from the primary.
3. Copy in some subsequent log files from the primary for good measure.
4. Re-activate the database on the standby.
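
Steps 2 and 3 can be scripted. The following Python sketch is one possible way to do it, under several assumptions: the complete log files from the primary have already been staged in a directory on the standby server, the file names follow the `S0597294.LOG` pattern seen in the diagnostic messages, and the script runs with permission to write to the standby's log path. The function names and the `extra` parameter are hypothetical. Deactivating and re-activating the standby (steps 1 and 4, typically `db2 deactivate database <alias>` and `db2 activate database <alias>`) is left to the operator.

```python
import os
import shutil


def log_names(first_extent, extra=3):
    """File names for the stale extent and the next `extra` extents."""
    return [f"S{n:07d}.LOG" for n in range(first_extent, first_extent + extra + 1)]


def replace_logs(source_dir, log_dir, first_extent, extra=3):
    """Copy complete log files from source_dir over the standby's copies.

    Run this only while the standby database is deactivated.
    The first file (the stale one) must exist in source_dir; copying stops
    at the first subsequent file that is missing. Returns the names copied.
    """
    copied = []
    for name in log_names(first_extent, extra):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            if not copied:
                raise FileNotFoundError(f"replacement for stale log not found: {src}")
            break
        shutil.copy2(src, os.path.join(log_dir, name))
        copied.append(name)
    return copied
```

For the scenario above, a call might look like `replace_logs("/staging/logs", "/db2/MYDB/log_dir/NODE0000", 597294)`, where `/staging/logs` is a hypothetical staging directory. Stopping at the first missing file avoids leaving a gap in the sequence of copied log files.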

 


UID

ibm11141246