Introduction to Host Failure Detection Time (HostFailureDetectionTime) in pureScale

The host failure detection time is one of the key parameters associated with heartbeating. In a DB2 pureScale cluster, it is the time RSCT takes to declare a host unreachable and initiate recovery actions. A shorter host failure detection time allows faster recovery, with the trade-off that on an unreliable network it can produce false positives and unnecessary host outages (that is, a temporary problem could be mistaken for a network failure).

The host failure detection time setting can be queried using the db2cluster command:

db2cluster -cm -list -hostfailuredetectiontime
The host failure detection time is 4 seconds.

Underneath, the db2cluster command above invokes lsrsrc -s "Name like 'CG%'" IBM.CommunicationGroup to obtain the Sensitivity, PeriodMilliSec and PingGracePeriodMilliSec parameters, and then calculates the host failure detection time from their values.

The formula is: (2 * (sensitivity + 1) * periodMilliSec) / 1000
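As a minimal sketch of this arithmetic, the formula can be written as a small Python function. The function name host_failure_detection_time is hypothetical and is not part of any DB2 or RSCT tooling; the inputs are assumed to be the Sensitivity and PeriodMilliSec values reported by lsrsrc.

```python
def host_failure_detection_time(sensitivity: int, period_millisec: int) -> float:
    """Return the host failure detection time in seconds.

    Implements the formula from the text:
    (2 * (sensitivity + 1) * periodMilliSec) / 1000
    The function name is hypothetical, chosen for illustration only.
    """
    return (2 * (sensitivity + 1) * period_millisec) / 1000


if __name__ == "__main__":
    # Illustrative values only: Sensitivity = 4, PeriodMilliSec = 1000
    print(host_failure_detection_time(4, 1000))  # 10.0
```

With Sensitivity = 4, a PeriodMilliSec of 400 would give 4 seconds and a PeriodMilliSec of 800 would give 8 seconds.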

Sensitivity, Period, Priority and Grace are explained as follows:

Sensitivity
   The number of missed heartbeats that constitute a failure
Period
   The number of milliseconds between heartbeats
Priority
   The relative priority of the communication group
Grace
   The number of milliseconds for the grace period

The recommended host failure detection time is 4 seconds for systems with SCSI-3 PR enabled and 8 seconds otherwise.

If the host failure detection time is not set properly, db2diag.log may report errors such as the following:

2019-07-17-13.38.32.298997+480 E21255E605            LEVEL: Error
PID     : 113746               TID : 46912652740544 PROC : db2cluster
INSTANCE: db2sdin1             NODE : 000
HOSTNAME: db2pssh02-biz-m00
FUNCTION: <0>, <0>, <0>, probe:1274
RETCODE : ECF=0x90000617=-1879046633=ECF_SQLHA_COMM_GROUP_BAD_CONFIG
          A communication group is in bad configuration
DATA #1 : String, 67 bytes
libsqlha: sqlhaGetObjectAttribute() call error from wrapper library
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x000000000261FD10 Size: 0 Reason: Zero-length data
DATA #3 : signed integer, 4 bytes
0
2019-07-17-13.38.32.299690+480 E21861E413            LEVEL: Error
PID     : 113746               TID : 46912652740544 PROC : db2cluster
INSTANCE: db2sdin1             NODE : 000
HOSTNAME: db2pssh02-biz-m00
FUNCTION: DB2 UDB, high avail services, sqlhaGetInfoForClusterObject, probe:4601
MESSAGE : ECF=0x90000617=-1879046633=ECF_SQLHA_COMM_GROUP_BAD_CONFIG
          A communication group is in bad configuration

Or, if more than one communication group is used in the pureScale cluster and each yields a different host failure detection time after calculation,
for example:

CG7: (2 * (sensitivity + 1) * periodMilliSec) / 1000 = (2 * 5 * 1000) / 1000 = 10s

CG2: (2 * (sensitivity + 1) * periodMilliSec) / 1000 = (2 * 5 * 800) / 1000 = 8s

These values are taken from lsrsrc output such as the following:
resource 1:
           Name                   = "CG7"
           Sensitivity            = 4
           Period                 = 1
           UseBroadcast           = 1
           UseSourceRouting       = 1
           NIMPathName            = ""
           NIMParameters          = ""
           Priority               = 1
           PeriodMilliSec         = 1000
           PingGracePeriodMilliSec = -1
           MediaType              = 1
           UseForNodeMembership   = 1
           ActivePeerDomain       = "db2domain_20170621073028"
           ConfigChanged          = 0
resource 4:
           Name                   = "CG2"
           Sensitivity            = 4
           Period                 = 1
           UseBroadcast           = 1
           UseSourceRouting       = 1
           NIMPathName            = ""
           NIMParameters          = ""
           Priority               = 1
           PeriodMilliSec         = 800
           PingGracePeriodMilliSec = 60000
           MediaType              = 1
           UseForNodeMembership   = 1
           ActivePeerDomain       = "db2domain_20170621073028"
           ConfigChanged          = 0

then the following error message may be reported when the value is queried with the db2cluster -cm -list -HostFailureDetectionTime command:

#db2cluster -cm -list -HostFailureDetectionTime
Unable to determine the host failure detection time of the shared file system cluster.
There was an internal db2cluster error. Refer to the diagnostic logs (db2diag.log or /tmp/ibm.db2.cluster.*) and the DB2 Information Center for details.

Therefore, ensure that the host failure detection time is configured correctly, and do not use RSCT or GPFS commands to change the parameters mentioned above or the GPFS totalPingTimeout setting separately.
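The mismatch between communication groups described above can be spotted ahead of time. The following Python sketch parses text in the lsrsrc IBM.CommunicationGroup layout shown earlier, applies the formula to each group, and reports whether all groups agree. It is an illustration only: the function names are hypothetical, the sample text is abbreviated from the output above, and the sketch does not claim to reproduce the check that db2cluster performs internally.

```python
import re


def parse_communication_groups(lsrsrc_output: str) -> dict:
    """Map each communication group name to its Sensitivity and PeriodMilliSec.

    Assumes "Attribute = value" lines as in the lsrsrc output shown above.
    """
    groups = {}
    current = None
    for line in lsrsrc_output.splitlines():
        match = re.match(r'\s*(\w+)\s*=\s*(.*)$', line)
        if not match:
            continue  # skip "resource N:" headers and blank lines
        key, value = match.group(1), match.group(2).strip()
        if key == "Name":
            current = value.strip('"')
            groups[current] = {}
        elif current is not None and key in ("Sensitivity", "PeriodMilliSec"):
            groups[current][key] = int(value)
    return groups


def detection_times(groups: dict) -> dict:
    """Apply (2 * (sensitivity + 1) * periodMilliSec) / 1000 to every group."""
    return {
        name: (2 * (attrs["Sensitivity"] + 1) * attrs["PeriodMilliSec"]) / 1000
        for name, attrs in groups.items()
    }


def is_consistent(times: dict) -> bool:
    """True when every group yields the same detection time."""
    return len(set(times.values())) <= 1


# Abbreviated sample in the same layout as the lsrsrc output above.
SAMPLE = """\
resource 1:
           Name                   = "CG7"
           Sensitivity            = 4
           PeriodMilliSec         = 1000
resource 4:
           Name                   = "CG2"
           Sensitivity            = 4
           PeriodMilliSec         = 800
"""

if __name__ == "__main__":
    times = detection_times(parse_communication_groups(SAMPLE))
    for name, seconds in times.items():
        print(f"{name}: {seconds}s")
    print("consistent" if is_consistent(times) else "mismatch")
```

For the sample above, the two groups yield 10 seconds and 8 seconds, so the sketch reports a mismatch.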

The procedure for changing the host failure detection time can be found at the Knowledge Center link below:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.admin.sd.doc/doc/t0056839.html
