We had a customer with HADR+TSA setup in place. For about 3 hour period they noticed extreme workload and hence system (Primary) was almost hung most times.
During this period, they saw a lot of connections made to standby db but there was no takeover through TSA.
How did connections go to HADR standby db when no TAKEOVER command manually or by TSA was issued ?
On investigation, we could confirm that - while primary db was in hung state intermittently due to workload, on standby db there were numerous connections attempts which were turned down. Messages such as follows were seen continuously in standby for the complete period :
From Standby db2diag.log:
2017-10-17-09.30.59.933355+000 I2587636E769 LEVEL: Warning
PID : 3020 TID : 87658658759 PROC : db2sysc 0
INSTANCE: INSname NODE : 000 DB : DBname
APPHDL : 0-2652 APPID: *LOCAL.INSname.171017143845
AUTHID : authid HOSTNAME: hostname_standby
EDUID : 1083 EDUNAME: db2agent (DBname) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrCheckDb, probe:18200
MESSAGE : SQL1776N The command cannot be issued on an HADR database. Reason code = "1".
DATA #1 : Hex integer, 4 bytes
DATA #2 : sqeApplication_acbInfo, PD_TYPE_sqeApplication_acbInfo, 4 bytes
DATA #3 : String, 50 bytes
Connections are not allowed on a standby database.
There were no TSA initiated or HADR takeover commands seen in db2diag.logs on primary and standby.
So we checked the next possible place and we found the reason.
ACR (automatic client reroute) was enabled on Primary and Standby.
On Primary we could see :
Number of entries in the directory = 1
Database 1 entry:
Database alias = DBname
Database name = DBname
Local database directory = /path/
Database release level = 10.00
Directory entry type = Indirect
Catalog database partition number = 0
Alternate server hostname = hostname_standby
Alternate server port number = 50001
When ACR (automatic client reroute) is enabled the DB2 database system tries to establish the connection to primary once again when first fails. If it is still not working, the DB2 database system tries the alternate server location (host "hostname_standby" with port 50001 i.e. standby). This is what happened in this customer case.
Since Primary was hung, it was not connectable, and when connections were refused, applications which had connected before must have had this ACR information stored, so irrespective of whether HADR or TSA is in place, ACR happened and connections went to the standby machine.
Obviously, since it was a standby db, no connections were allowed and it returned SQL1776N RC=1 on all those connections attempts.
In a high availability disaster recovery (HADR) environment, ACR is also enabled if SQL1776N is returned back from the HADR standby server, so connections are next routed to Primary again.
Refer below link for more details :
Automatic client reroute description and setup
Usually, in HADR+TSA, we do not have the alternate server updated i.e. ACR enabled and have a virtual IP.
Please go through the below link for more details about how ACR and HADR work together :
In the customer case, the connections started to work seamlessly on Primary after the workload issue was addressed.