IBM Support

IBM Spectrum Scale: Quick restart of ctdb under SMB load can lead to a race condition

Flashes (Alerts)


Abstract

IBM has identified an issue with IBM Spectrum Scale V4.2.0.x in which a quick restart of ctdb under SMB load can lead to a race condition that keeps ctdb stuck in an endless recovery. This can occur during an upgrade or when the SMB service is brought online and offline in rapid succession.

Content


Problem summary

If the ctdb service (part of the SMB service) is restarted quickly in the affected releases, it can hang in an endless recovery due to a race condition at startup. As a result, the SMB service never becomes healthy and SMB connections cannot be served.

This situation can be diagnosed with any of the following, which indicate that ctdb is in recovery:

- the ctdb status command (/usr/lpp/mmfs/bin/ctdb status)
- syslog entries indicating that ctdb has been in recovery for a long time
- the mmfslog, which has information if CES commands were used to start/stop ctdb
- mmhealth node show, which shows the state of CES and SMB

Some examples are shown in Recommendation number 2 below.

Users affected

Any user running SMB Cluster Export Services (CES) on IBM Spectrum Scale V4.2.0.x (where .x is any level) may find that enabled and active SMB connections are affected in the following scenarios:

- User is rebooting SMB nodes or bringing SMB services online/offline in quick succession
- User is running heavy SMB workload during a manual upgrade
- User is running heavy SMB workload during an upgrade with the install toolkit

Recommendations

1) Any customer planning to upgrade should quiesce all SMB (and NFS and Object) workload on all CES nodes prior to the upgrade. This reduces the chance that files held open through CES file access prevent GPFS from unloading, which would then require a reboot to recover. While open files and a reboot do not themselves cause this issue, recovery from these situations may lead to the scenario described in recommendation number 2.

2) Rebooting SMB nodes or bringing the SMB service online/offline should be controlled so that these activities do not occur in quick succession. During debugging, a user may often start and stop a service repeatedly. If this is necessary, the user should wait until any ongoing ctdb recovery has finished. This can be checked with the ctdb status command, by looking in syslog for entries showing that ctdb is in recovery, or with mmhealth, for example:

ctdb status output:

Number of nodes:4
pnn:0 192.168.1.100    OK (THIS NODE)
pnn:1 192.168.1.102    OK
pnn:2 192.168.1.104    OK
pnn:3 192.168.1.106    OK
Generation:2009384308
Size:4
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
Recovery mode:RECOVERY (1)
Recovery master:3
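
The "Recovery mode" line in the output above is the field to check. As a minimal sketch (the helper name recovery_mode is hypothetical and not part of any IBM tooling), the mode could be extracted from captured ctdb status text like this:

```python
import re


def recovery_mode(status_text):
    """Return the recovery mode name from `ctdb status` text.

    Returns e.g. 'NORMAL' or 'RECOVERY', or None when the text has no
    'Recovery mode:' line. Hypothetical helper, for illustration only.
    """
    match = re.search(r"^Recovery mode:\s*(\w+)", status_text, re.MULTILINE)
    return match.group(1) if match else None


# Shortened copy of the output shown above.
sample = """Number of nodes:4
pnn:0 192.168.1.100    OK (THIS NODE)
Generation:2009384308
Recovery mode:RECOVERY (1)
Recovery master:3
"""

print(recovery_mode(sample))  # prints: RECOVERY
```

Any value other than NORMAL means it is not yet safe to restart further nodes.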

mmhealth output on a node that is in recovery:

[root@fscc-x36m3-32 ~]# mmhealth node show CES

Node name:      cesnode001

Component      Status        Status Change     Reasons
------------------------------------------------------------
  CES          DEGRADED      Now               ctdb_recovery
  AUTH         HEALTHY       3 days ago        -
  AUTH_OBJ     DISABLED      15 days ago       -
  BLOCK        DISABLED      15 days ago       -
  CESNETWORK   HEALTHY       11 days ago       -
  NFS          HEALTHY       1 day ago         -
  OBJECT       DISABLED      15 days ago       -
  SMB          DEGRADED      Now               ctdb_recovery


Event             Parameter     Severity    Active Since      Event Message
------------------------------------------------------------------------------------
ctdb_recovery     SMB           WARNING     Now               CTDB Recovery detected
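
For scripted checks, the component table above can be scanned for the ctdb_recovery reason. The following is only a sketch: it assumes the column layout shown in this example (component name first, reasons last), and the function name is hypothetical.

```python
def components_in_ctdb_recovery(mmhealth_text):
    """List components whose last column (Reasons) names ctdb_recovery.

    Assumes the human-readable layout shown in this document; the layout
    may differ on other releases.
    """
    hits = []
    for line in mmhealth_text.splitlines():
        fields = line.split()
        # Component rows have at least: name, status, status change, reasons.
        if len(fields) >= 3 and "ctdb_recovery" in fields[-1].split(","):
            hits.append(fields[0])
    return hits


# Shortened copy of the output shown above.
sample = """\
Component      Status        Status Change     Reasons
------------------------------------------------------------
  CES          DEGRADED      Now               ctdb_recovery
  AUTH         HEALTHY       3 days ago        -
  SMB          DEGRADED      Now               ctdb_recovery

Event             Parameter     Severity    Active Since      Event Message
ctdb_recovery     SMB           WARNING     Now               CTDB Recovery detected
"""

print(components_in_ctdb_recovery(sample))  # prints: ['CES', 'SMB']
```

The event row is skipped because its last column is the event message, not a reason.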


Normally, waiting one minute is sufficient for the recovery to finish. Once ctdb reaches the HEALTHY state, the user can start other nodes.
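
The wait itself can be automated by polling ctdb status until the recovery mode is no longer RECOVERY. This is a sketch only: the function names, the 300-second timeout and the 5-second polling interval are illustrative assumptions, not IBM-recommended values.

```python
import subprocess
import time

CTDB = "/usr/lpp/mmfs/bin/ctdb"  # path as given in this document


def ctdb_in_recovery(status_text):
    """True if `ctdb status` text reports 'Recovery mode:RECOVERY'."""
    for line in status_text.splitlines():
        if line.startswith("Recovery mode:"):
            return line.split(":", 1)[1].strip().startswith("RECOVERY")
    # No recovery-mode line at all: treat as not settled, to be safe.
    return True


def read_ctdb_status():
    """Run `ctdb status` on the local node and return its stdout."""
    result = subprocess.run([CTDB, "status"], capture_output=True, text=True)
    return result.stdout


def wait_for_recovery(get_status=read_ctdb_status, timeout=300.0,
                      interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until ctdb leaves recovery.

    Returns True once recovery has finished, False if the (assumed)
    timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        if not ctdb_in_recovery(get_status()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A False return after several minutes suggests the node may be stuck in the endless recovery described in this flash, in which case recommendation number 3 applies.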

3) To recover, stop and start ctdb on those nodes hanging in recovery using the following procedure:

On all nodes hanging in recovery, run:
mmces service stop smb

For each stopped node in turn:
- wait for recoveries to finish
- run mmces service start smb
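
The ordering of this procedure (stop everywhere first, then restart one node at a time and only when no recovery is running) can be sketched as follows. The function and both callbacks are hypothetical: run_on_node(node, command) stands for whatever mechanism the site uses to execute a command on a given node, and wait_for_recovery() is assumed to return True once no ctdb recovery is in progress.

```python
def recover_hung_nodes(hung_nodes, run_on_node, wait_for_recovery):
    """Stop SMB on every hung node, then restart the nodes one at a time.

    Returns the list of nodes that were restarted. Both callbacks are
    caller-supplied assumptions, not part of any IBM tooling.
    """
    # Step 1: stop SMB on all nodes hanging in recovery.
    for node in hung_nodes:
        run_on_node(node, "mmces service stop smb")

    # Step 2: bring nodes back one by one, never while a recovery is running.
    restarted = []
    for node in hung_nodes:
        if not wait_for_recovery():
            break  # recovery did not settle; stop here and investigate
        run_on_node(node, "mmces service start smb")
        restarted.append(node)
    return restarted


# Dry run that only records what would be executed.
log = []
recover_hung_nodes(["cesnode001", "cesnode002"],
                   lambda node, cmd: log.append((node, cmd)),
                   lambda: True)
for node, cmd in log:
    print(node, cmd)
```

Restarting serially avoids repeating the quick start/stop pattern that triggers the race condition.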


Document Information

Modified date:
26 September 2022

UID

ssg1S1010618