New monitoring parameters are introduced in DB2 V10.5 for HADR. They apply to both ESE and pureScale environments. We discuss a few of them in this blog.
In a DB2 HADR setup, a heartbeat is sent from the primary to the standby and vice versa. Each side uses this heartbeat to check whether the other database is up and running. The following heartbeat-related monitoring elements are introduced in DB2 V10.5:
HEARTBEAT_MISSED: the number of heartbeat messages not received on time on this log stream, accumulated since database startup on the local member. This number should be viewed relative to HEARTBEAT_EXPECTED. For example, 100 missed heartbeats indicate a network problem when HEARTBEAT_EXPECTED is 1000 (a 10% miss rate), but are much less of a concern when HEARTBEAT_EXPECTED is 10000 (a 1% miss rate).
HEARTBEAT_INTERVAL: the interval, in seconds, between two heartbeats. A very short HEARTBEAT_INTERVAL (the result of a short HADR_TIMEOUT) can lead to false alarms about missed heartbeats.
HEARTBEAT_EXPECTED: the number of heartbeat messages expected on this log stream, accumulated since database startup on the local member. Together with the HEARTBEAT_MISSED field, it lets you determine the health of the network over any given time window.
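Because both counters accumulate from database activation onward, the network's health over a specific window comes from diffing two samples rather than reading the raw totals. A minimal sketch of both calculations (the sample values are illustrative, not taken from a real system):

```python
def miss_rate(missed, expected):
    """Fraction of expected heartbeats that were missed (0.0 if none expected yet)."""
    return missed / expected if expected else 0.0

# Cumulative totals, as reported since database activation:
print(f"{miss_rate(100, 1000):.0%}")   # 10% -> likely a network problem
print(f"{miss_rate(100, 10000):.0%}")  # 1%  -> much less concerning

# Miss rate over a window: diff two (HEARTBEAT_MISSED, HEARTBEAT_EXPECTED) samples.
t0 = (4, 100)    # sample taken at the start of the window (illustrative)
t1 = (10, 460)   # sample taken at the end of the window (illustrative)
windowed = miss_rate(t1[0] - t0[0], t1[1] - t0[1])
print(f"{windowed:.1%}")  # miss rate within that window only
```

Diffing matters because a burst of misses during one bad hour can be hidden inside an otherwise healthy cumulative total.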
Recall that the HADR_TIMEOUT parameter specifies the time (in seconds) that the high availability disaster recovery (HADR) process waits before considering a communication attempt to have failed.
So if HADR_TIMEOUT = 40, then HEARTBEAT_INTERVAL is calculated as HADR_TIMEOUT/4 = 10 seconds.
The recommended value for HADR_TIMEOUT is 120 seconds, which gives a HEARTBEAT_INTERVAL of 30 seconds. For slower networks, you can decide to increase the HADR_TIMEOUT value.
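The relationship above can be sketched as a one-liner. Note this reflects only the HADR_TIMEOUT/4 rule described in this post; the actual interval DB2 chooses can involve other settings (for example HADR_PEER_WINDOW), so treat it as an approximation:

```python
def heartbeat_interval(hadr_timeout):
    """HEARTBEAT_INTERVAL in seconds, per the HADR_TIMEOUT/4 rule described above."""
    return hadr_timeout // 4

print(heartbeat_interval(40))   # 10 seconds
print(heartbeat_interval(120))  # 30 seconds (the recommended HADR_TIMEOUT)
```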
The following screenshot shows these values in the output of db2pd -hadr:
From the above output we see that the HEARTBEAT_INTERVAL is 10 seconds. Since the database was activated, 4 heartbeats have been missed out of the 10 expected, a miss rate of nearly 40%. That is a clue that there may be an issue with the network.
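If you want to extract these fields in a script rather than by eyeballing db2pd, something like the following works. The embedded fragment is illustrative: the field names follow my recollection of the V10.5 db2pd -hadr format and the values mirror the example above, so check them against your own output.

```python
import re

# Illustrative fragment of `db2pd -db <dbname> -hadr` output (not a real capture):
DB2PD_HADR = """\
HEARTBEAT_INTERVAL(seconds) = 10
HEARTBEAT_MISSED = 4
HEARTBEAT_EXPECTED = 10
"""

def parse_hadr_fields(text):
    """Collect NAME = value pairs from db2pd-style output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Z_]+(?:\(\w+\))?)\s*=\s*(\S+)", line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

f = parse_hadr_fields(DB2PD_HADR)
rate = int(f["HEARTBEAT_MISSED"]) / int(f["HEARTBEAT_EXPECTED"])
print(f"miss rate: {rate:.0%}")
if rate > 0.10:  # the 10% threshold is an arbitrary example; tune it for your site
    print("WARNING: high heartbeat miss rate, check the network")
```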
The HEARTBEAT_MISSED and HEARTBEAT_EXPECTED values are reset under the following conditions:
1. Stopping and starting HADR, or deactivating and activating the database on the primary --> resets the values on the primary only
2. Deactivating and activating the database on the standby --> resets the values on the standby only
3. Graceful takeover on the standby --> in a non-pureScale environment the values are not reset on the primary, but in a pureScale environment they are, because the database on the primary is bounced as part of the graceful takeover. On the standby these values are not reset.
4. Reintegration of host1 (old primary) after a forced takeover on host2 (old standby) --> resets the values on host1
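Because any of the events above can reset the counters, a monitoring script that diffs consecutive samples should treat a counter that went backwards as a reset rather than computing a negative delta. A simple hedge, assuming samples are taken periodically:

```python
def delta_since(prev, curr):
    """Growth of a cumulative counter between two samples.
    If the counter went backwards, it was reset in between (e.g. by a takeover
    or database reactivation), so the current value is the whole delta."""
    return curr if curr < prev else curr - prev

print(delta_since(100, 140))  # 40: normal growth between samples
print(delta_since(100, 4))    # 4: counter was reset between samples
```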