The host failure detection time is one of the key parameters associated with heartbeating. In a DB2 pureScale cluster, it is the time RSCT takes to declare a host unreachable and initiate recovery actions. A shorter host failure detection time allows faster recovery, with the trade-off that on an unreliable network it can produce false positives and unnecessary host outages (i.e., a temporary problem could be mistaken for a network failure).
The host failure detection time setting can be queried using the db2cluster command:
db2cluster -cm -list -hostfailuredetectiontime
The host failure detection time is 4 seconds.
Internally, this db2cluster command invokes lsrsrc -s "Name like 'CG%'" IBM.CommunicationGroup to obtain the parameters Sensitivity, PeriodMilliSec and PingGracePeriodMilliSec, and then calculates the host failure detection time from those values.
The formula is: (2 * (sensitivity + 1) * periodMilliSec) / 1000
Sensitivity, Period, Priority and Grace are explained as follows:
Sensitivity
The number of missed heartbeats that constitute a failure
Period
The number of milliseconds between heartbeats
Priority
The relative priority of the communication group
Grace
The number of milliseconds for the grace period
The recommended host failure detection time is 4 seconds for systems with SCSI-3 PR enabled and 8 seconds otherwise.
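The formula above can be worked through in a few lines of Python. This is only an illustrative sketch: the function name host_failure_detection_time is hypothetical and is not part of db2cluster or RSCT, and the inputs are assumed to be the Sensitivity and PeriodMilliSec values reported by lsrsrc for one communication group.

```python
def host_failure_detection_time(sensitivity, period_millisec):
    """Return the host failure detection time in seconds.

    Applies the formula (2 * (sensitivity + 1) * periodMilliSec) / 1000.
    """
    return (2 * (sensitivity + 1) * period_millisec) / 1000


# Sensitivity = 4 with a 1000 ms heartbeat period
print(host_failure_detection_time(4, 1000))

# Sensitivity = 4 with an 800 ms heartbeat period
print(host_failure_detection_time(4, 800))
```

With the same Sensitivity, a shorter heartbeat period gives a proportionally shorter detection time, so two communication groups that differ only in PeriodMilliSec will not produce the same result.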
If the hostfailuredetectiontime is not set properly, db2diag.log may report errors such as the following:
2019-07-17-13.38.32.298997+480 E21255E605 LEVEL: Error
PID : 113746 TID : 46912652740544 PROC : db2cluster
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db2pssh02-biz-m00
FUNCTION: <0>, <0>, <0>, probe:1274
RETCODE : ECF=0x90000617=-1879046633=ECF_SQLHA_COMM_GROUP_BAD_CONFIG
A communication group is in bad configuration
DATA #1 : String, 67 bytes
libsqlha: sqlhaGetObjectAttribute() call error from wrapper library
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x000000000261FD10 Size: 0 Reason: Zero-length data
DATA #3 : signed integer, 4 bytes
0
2019-07-17-13.38.32.299690+480 E21861E413 LEVEL: Error
PID : 113746 TID : 46912652740544 PROC : db2cluster
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db2pssh02-biz-m00
FUNCTION: DB2 UDB, high avail services, sqlhaGetInfoForClusterObject, probe:4601
MESSAGE : ECF=0x90000617=-1879046633=ECF_SQLHA_COMM_GROUP_BAD_CONFIG
A communication group is in bad configuration
Or, if the pureScale cluster uses more than one communication group and the calculated host failure detection time differs between them,
e.g.
CG7: (2 * (sensitivity + 1) * periodMilliSec) / 1000 = (2 * 5 * 1000) / 1000 = 10s
CG2: (2 * (sensitivity + 1) * periodMilliSec) / 1000 = (2 * 5 * 800) / 1000 = 8s
resource 1:
Name = "CG7"
Sensitivity = 4
Period = 1
UseBroadcast = 1
UseSourceRouting = 1
NIMPathName = ""
NIMParameters = ""
Priority = 1
PeriodMilliSec = 1000
PingGracePeriodMilliSec = -1
MediaType = 1
UseForNodeMembership = 1
ActivePeerDomain = "db2domain_20170621073028"
ConfigChanged = 0
resource 4:
Name = "CG2"
Sensitivity = 4
Period = 1
UseBroadcast = 1
UseSourceRouting = 1
NIMPathName = ""
NIMParameters = ""
Priority = 1
PeriodMilliSec = 800
PingGracePeriodMilliSec = 60000
MediaType = 1
UseForNodeMembership = 1
ActivePeerDomain = "db2domain_20170621073028"
ConfigChanged = 0
then the following error message may be reported when the value is queried with the db2cluster -cm -list -HostFailureDetectionTime command:
#db2cluster -cm -list -HostFailureDetectionTime
Unable to determine the host failure detection time of the shared file system cluster.
There was an internal db2cluster error. Refer to the diagnostic logs (db2diag.log or /tmp/ibm.db2.cluster.*) and the DB2 Information Center for details.
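A mismatch of this kind can be spotted ahead of time by applying the formula to each communication group in the lsrsrc output. The following Python sketch is an assumption-laden illustration, not db2cluster's actual implementation: the helper names parse_comm_groups and detection_times are hypothetical, and SAMPLE is a trimmed copy of the lsrsrc listing shown earlier, keeping only the attributes the formula uses.

```python
import re


def parse_comm_groups(lsrsrc_output):
    """Collect Sensitivity and PeriodMilliSec per communication group name."""
    groups = {}
    current = None
    for line in lsrsrc_output.splitlines():
        match = re.match(r'\s*(\w+)\s*=\s*(.*)$', line)
        if not match:
            continue  # skips header lines such as "resource 1:"
        key, value = match.group(1), match.group(2).strip()
        if key == "Name":
            current = value.strip('"')
            groups[current] = {}
        elif current is not None and key in ("Sensitivity", "PeriodMilliSec"):
            groups[current][key] = int(value)
    return groups


def detection_times(groups):
    """Apply (2 * (sensitivity + 1) * periodMilliSec) / 1000 to each group."""
    return {
        name: (2 * (attrs["Sensitivity"] + 1) * attrs["PeriodMilliSec"]) / 1000
        for name, attrs in groups.items()
    }


# Trimmed sample of lsrsrc output (assumed layout: "Attribute = value" lines)
SAMPLE = """\
resource 1:
        Name = "CG7"
        Sensitivity = 4
        PeriodMilliSec = 1000
resource 4:
        Name = "CG2"
        Sensitivity = 4
        PeriodMilliSec = 800
"""

times = detection_times(parse_comm_groups(SAMPLE))
consistent = len(set(times.values())) == 1
print(times)
print("consistent" if consistent else "mismatch between communication groups")
```

For the sample data, CG7 and CG2 yield different values, which is the condition described above. On a real system the text would come from the lsrsrc command rather than a hard-coded string.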
Therefore, ensure that the HostFailureDetectionTime has been configured correctly, and do not use RSCT or GPFS commands to change the parameters mentioned above and the GPFS totalPingTimeout separately.
The procedure for changing the host failure detection time can be found at the Knowledge Center link below:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.admin.sd.doc/doc/t0056839.html