IBM Support

Heartbeat shows as stopped, crm_mon shows it running

Troubleshooting


Problem

This is a known issue that is fixed in HPF 5.5 (5.5.0.1 or higher). In addition to the fix, the workaround described below must be performed.

Symptom

Heartbeat shows as stopped while crm_mon shows it running, and the HPF version is below 5.5.0.1.

nzhealthcheck reports the failure as follows:

Failures (1):
- Rule --+-------- Issue ---------+------ Component ------+- Severity --
 SHC921  | Host's cluster is not  | rack1.host1.cluster   | High
         | active                 | rack1.host2.cluster   |
- Rule --+-------- Issue ---------+------ Component ------+- Severity --

Cause

The installed HPF version is below 5.5.0.1.

Environment

IBM PDA appliance, all models

Diagnosing The Problem

'service heartbeat status' reports heartbeat as stopped while crm_mon shows it running, and the HPF version is below 5.5.0.1.
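
The comparison described above can be sketched as a small helper. This is only an illustration and not part of the original procedure: the function name classify_heartbeat is hypothetical, and it assumes that 'service heartbeat status' exits with 0 when heartbeat is running and that 'crm_mon -1' exits with 0 when it can reach the cluster.

```shell
# classify_heartbeat SERVICE_RC CRM_RC   (hypothetical helper)
#   SERVICE_RC: exit status of 'service heartbeat status' (assumed 0 = running)
#   CRM_RC:     exit status of 'crm_mon -1' (assumed 0 = cluster reachable)
classify_heartbeat() {
    svc_rc=$1
    crm_rc=$2
    if [ "$svc_rc" -ne 0 ] && [ "$crm_rc" -eq 0 ]; then
        echo "mismatch"   # the symptom described in this document
    elif [ "$svc_rc" -eq 0 ] && [ "$crm_rc" -eq 0 ]; then
        echo "running"
    elif [ "$svc_rc" -ne 0 ] && [ "$crm_rc" -ne 0 ]; then
        echo "stopped"
    else
        echo "unknown"
    fi
}

# Example use on a host, as root:
#   service heartbeat status >/dev/null 2>&1; svc_rc=$?
#   crm_mon -1 >/dev/null 2>&1; crm_rc=$?
#   classify_heartbeat "$svc_rc" "$crm_rc"
```

A result of "mismatch" together with an HPF version below 5.5.0.1 would correspond to the situation this document addresses.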

Resolving The Problem

Action plan:

1) Log in directly to each host as root. Each login must be made through the host's underlying IP address, not the VIP/floating database address.

To confirm the IP addresses, run

more /etc/hosts

2) Once you have established the two root logins, use 'crm_mon -1' to determine the active host where the cluster resources are currently running. If this command fails to return data, the heartbeat subsystem may not be running, which you can confirm by running

ps -ef | grep heartbeat

If the only process returned is the 'grep heartbeat' line itself, you should not have to proceed with the rest of this procedure for this particular host.
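
The check in step 2 can be made less error-prone by filtering out the grep command's own entry. The sketch below is an illustration only; the function name count_heartbeat_procs is hypothetical. It relies on the common '[h]eartbeat' pattern, which matches the text "heartbeat" but not the grep command line that contains the pattern.

```shell
# count_heartbeat_procs (hypothetical helper): reads 'ps -ef' output on
# stdin and prints how many lines belong to heartbeat processes.
count_heartbeat_procs() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so '|| true' keeps that from looking like a failure.
    grep -c '[h]eartbeat' || true
}

# Example use on a host:
#   ps -ef | count_heartbeat_procs
```

A printed count of 0 corresponds to the case where only the 'grep heartbeat' line would have been returned.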

3) On the active host, bring down the NPS software by issuing

su - nz
nzstop
nzstate

Once nzstate confirms that the NPS software is stopped, log out of the nz user session and return to the root user prompt.

4) Disable the automatic restart of heartbeat on both hosts. From each host, run

chkconfig heartbeat off

This will ensure that we control the startup of the heartbeat subsystem if there are any unintentional STONITH operations during the course of this workaround.
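
To verify that step 4 took effect, the runlevel settings can be inspected with 'chkconfig --list heartbeat'. The sketch below is an illustration only; the function name heartbeat_autostart_state is hypothetical, and it assumes the usual output format of one N:on or N:off field per runlevel.

```shell
# heartbeat_autostart_state (hypothetical helper): reads the output of
# 'chkconfig --list heartbeat' on stdin and prints "on" if any runlevel
# is still enabled, otherwise "off".
heartbeat_autostart_state() {
    if grep -q ':on'; then
        echo "on"
    else
        echo "off"
    fi
}

# Example use on each host, as root:
#   chkconfig --list heartbeat | heartbeat_autostart_state
```

Both hosts should report "off" before continuing, and "on" again after the final step of this procedure.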

5) Confirm that there are no active file handles into the /nz or /export/home filesystems by issuing

lsof /nz
lsof /export/home

If there are any processes with handles into these filesystems returned by the lsof commands, they need to be terminated, and the process repeated until each lsof command returns no output.
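
When lsof returns many rows in step 5, it can help to reduce the output to a list of distinct process IDs. The sketch below is an illustration only; the function name pids_from_lsof is hypothetical, and it assumes the default lsof column layout in which the first row is a header and the PID is the second column.

```shell
# pids_from_lsof (hypothetical helper): reads default 'lsof' output on
# stdin, skips the header row, and prints each distinct PID once.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -un
}

# Example use on a host, as root:
#   lsof /nz | pids_from_lsof
#   lsof /export/home | pids_from_lsof
```

An empty result for both filesystems means there are no remaining handles. As an alternative, 'lsof -t /nz' prints only the process IDs directly.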

6) We are now ready to forcibly stop the heartbeat subsystem on both hosts. The script used for this is intended only for extreme circumstances, and customers should be shielded from knowledge of its existence as much as possible. Many customers will be monitoring the repair process, however, so they should be made aware that they risk doing themselves great harm if they ever try to invoke it themselves. On each host, starting with the passive node, run

/nzlocal/scripts/defib.sh

Press Enter when prompted. The scripts do not need to be run exactly in parallel, but they should be run within a short time of each other so that the heartbeat processes are brought down on both hosts in succession. The longer the heartbeat processes on one host remain running after the defib script has been run on the other host, the greater the chance of an unintended STONITH.

7) At this point, even though the heartbeat subsystem has been forcibly stopped on both hosts, the output of 'service drbd status' will show that the DRBD filesystems are still mounted on the former heartbeat master host:

[root@q25m-3-h1 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.4.0nz2c (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@nz30027-h1, 2014-01-27 04:20:37
m:res  cs         ro                   ds                 p  mounted       fstype
0:r1   Connected  Primary/Secondary    UpToDate/UpToDate  C  /export/home  ext4
1:r0   Connected  Primary/Secondary    UpToDate/UpToDate  C  /nz           ext4

Run

/nzlocal/scripts/heartbeat.sh

to unmount the DRBD filesystems and relinquish ownership of the DRBD resources. Running 'service drbd status' again should then produce output resembling the following:

[root@q25m-3-h1 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.4.0nz2c (api:1/proto:86-100)
GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by root@nz30027-h1, 2014-01-27 04:20:37
m:res  cs         ro                   ds                 p  mounted       fstype
0:r1   Connected  Secondary/Secondary  UpToDate/UpToDate  C
1:r0   Connected  Secondary/Secondary  UpToDate/UpToDate  C
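
The before-and-after comparison in step 7 can be sketched as a check on the role column of the 'service drbd status' output. This is an illustration only; the function name drbd_local_primaries is hypothetical, and it assumes the output layout shown above, where resource rows begin with a number and a colon and the third field is the local/peer role pair.

```shell
# drbd_local_primaries (hypothetical helper): reads 'service drbd status'
# output on stdin and prints the name of every resource whose local role
# (the part before the slash in the 'ro' column) is Primary.
drbd_local_primaries() {
    awk '/^[0-9]+:/ && $3 ~ /^Primary\// {
        sub(/^[0-9]+:/, "", $1)   # strip the minor-number prefix, e.g. "0:"
        print $1
    }'
}

# Example use on the former active host, as root:
#   service drbd status | drbd_local_primaries
```

With the first output shown above this would list r1 and r0. After /nzlocal/scripts/heartbeat.sh has run, it should print nothing.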

================

NOTE

If an HPF upgrade is part of the action plan, you must upgrade to HPF 5.5.0.1 or higher. Perform the upgrade at this stage of the procedure, following the HPF readme.

================


8) We are now ready to restart heartbeat. Starting with the desired active host, start heartbeat locally and then on the other host (ha2 in this example) by issuing

service heartbeat start
ssh ha2 service heartbeat start

Heartbeat should restart normally at this point.

The output of the 'service heartbeat status' command should now correctly report that the heartbeat subsystem is running. Once you are satisfied that everything is working properly, make sure to re-enable the automatic startup of heartbeat by issuing

chkconfig heartbeat on

on both the active and standby hosts before returning the system back to the customer.

Document Information

Modified date:
17 October 2019

UID

swg21978136