Recovering from the loss of a majority of quorum nodes

Quorum loss can happen because of hardware failures on the quorum nodes. When a majority of the quorum nodes is lost, the cluster becomes inoperable.

Perform the following steps to investigate the issue with the cluster and recover CCR:
  1. Issue the mmgetstate command to check the status of the cluster. When a majority of the quorum nodes are not available, the mmgetstate command returns output similar to the following example:
    # mmgetstate -a
    mmgetstate: [E] The command was unable to reach the CCR service on the majority of quorum nodes to form CCR quorum.  Ensure the CCR service (mmfsd or mmsdrserv daemon) is running on all quorum nodes and the communication port is not blocked by a firewall.
    mmgetstate: Command failed. Examine previous error messages to determine cause.
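
    The error message points to two checks that you can run on each quorum node that is still reachable: verify that the CCR service is running and that the GPFS communication port is open. The following commands are a minimal sketch that assumes the default GPFS daemon port 1191 (the tscTcpPort setting); adjust the port if your cluster uses a different value:
    # ps -ef | egrep 'mmfsd|mmsdrserv' | grep -v grep    # verify that the CCR service is running
    # ss -tlnp | grep 1191                               # verify that the daemon is listening on the GPFS port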
    
  2. Issue the mmlscluster --noinit command to identify the quorum nodes in the cluster. The --noinit option skips the CCR initialization, so the command works even though CCR quorum is lost:
    # mmlscluster --noinit
    
    GPFS cluster information
    ========================
      GPFS cluster name:         gpfs-cluster-2.localnet.com
      GPFS cluster id:           13445038716777501310
      GPFS UID domain:           gpfs-cluster-2.localnet.com
      Remote shell command:      /usr/bin/ssh
      Remote file copy command:  /usr/bin/scp
      Repository type:           CCR
    
     Node  Daemon node name      IP address   Admin node name       Designation
    ----------------------------------------------------------------------------
       1   node-21.localnet.com  10.0.100.21  node-21.localnet.com  quorum-manager
       2   node-22.localnet.com  10.0.100.22  node-22.localnet.com  quorum-manager
       3   node-23.localnet.com  10.0.100.23  node-23.localnet.com  quorum-manager
       4   node-24.localnet.com  10.0.100.24  node-24.localnet.com
       5   node-25.localnet.com  10.0.100.25  node-25.localnet.com
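
    On a large cluster, you can filter the output for the quorum designation. This is a plain text filter; the exact designation strings depend on your cluster configuration:
    # mmlscluster --noinit | grep quorum    # show only the nodes with a quorum designation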
    
  3. Issue the ping command to verify whether the lost quorum nodes are reachable:
    # ping -c 1 node-22.localnet.com
    PING node-22.localnet.com (10.0.100.22) 56(84) bytes of data.
    From node-21.localnet.com (10.0.100.21) icmp_seq=1 Destination Host Unreachable
    
    --- node-22.localnet.com ping statistics ---
    1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
    
    # ping -c 1 node-23.localnet.com
    PING node-23.localnet.com (10.0.100.23) 56(84) bytes of data.
    From node-21.localnet.com (10.0.100.21) icmp_seq=1 Destination Host Unreachable
    
    --- node-23.localnet.com ping statistics ---
    1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
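
    To check all quorum nodes in one pass, you can wrap the ping command in a short shell loop. The following sketch uses the node names from step 2; the -W 2 option limits the wait for a reply to 2 seconds per node:
    # for n in node-22 node-23; do ping -c 1 -W 2 $n >/dev/null 2>&1 && echo "$n reachable" || echo "$n UNREACHABLE"; done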
    
  4. Issue the mmccr check command on the remaining quorum node. The output identifies the quorum nodes that cannot be reached and reports a quorum loss (error 809) for the CCR server that is running on the local node:
    # mmccr check -Ye
    mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
    mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security:OK:
    mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
    mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
    mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
    mmccr::0:1:::1:PC_LOCAL_SERVER:0:::node-21.localnet.com:OK:
    mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::node-21.localnet.com,0.000:OK:
    mmccr::0:1:::1:PC_QUORUM_NODES:1143:Ping CCR quorum nodes failed:10.0.100.22,10.0.100.23:10.0.100.21:FATAL:
    mmccr::0:1:::1:FC_COMMITTED_DIR:809:Connect local CCR server failed:10.0.100.21::WARNING:
    mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0::::OK:
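
    Because the -Y option produces colon-delimited, machine-readable output, you can filter it for the failing checks. The following sketch assumes the field layout that is shown in the HEADER line, where Severity is the 13th colon-separated field:
    # mmccr check -Ye | awk -F: '$13 == "FATAL" || $13 == "WARNING"'    # print only the checks with FATAL or WARNING severity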
    
  5. Issue the mmchnode command with the --force option to reduce the set of quorum nodes to the quorum nodes that are still available. This command takes a while to complete and prompts for confirmation before it proceeds.
    The --force option forces GPFS to continue running normally by using the copy of the CCR state that is found on the only remaining quorum node. Because CCR no longer has quorum, GPFS cannot verify whether this copy is the most recent version of the CCR state. If the other two quorum nodes failed while a GPFS command was running and some configuration data was changed during that time, the CCR state on the surviving quorum node might be stale or inconsistent with the state of one of the GPFS file systems. Therefore, use this procedure only if no recent configuration changes were made and none of the failed quorum nodes can be brought back online.
    # mmchnode --noquorum -N node-22,node-23 --force
    mmchnode: Unable to obtain the GPFS configuration file lock.
    mmchnode: Processing continues without lock protection.
    mmchnode: Entering mmchnode restricted mode of operations.
    Wed Jul  7 16:44:21 CEST 2021: mmchnode: Processing node node-23.localnet.com
    Wed Jul  7 16:44:21 CEST 2021: mmchnode: Processing node node-22.localnet.com
    mmchnode: You are attempting to override normal GPFS quorum semantics.
        This may endanger the integrity of the configuration data and prevent normal operations.
        Proceed only if this cluster is part of a disaster recovery environment that is set up
        according to the instructions in "Establishing disaster recovery for your GPFS cluster"
        in the GPFS Advanced Administration guide and you are strictly following the failover
        procedures described in that document.
        Do you want to continue? (yes/no) yes
    mmchnode: mmsdrfs propagation completed.
    mmchnode: Propagating the cluster configuration data to all
      affected nodes.  This is an asynchronous process.
    [root@node-21 ~]# Wed Jul  7 16:45:56 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation started
    Wed Jul  7 16:46:07 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0
    

    After the command returns successfully, the cluster is back in a working state because CCR can reach quorum without the quorum nodes that are no longer available. However, the failed nodes are still listed as cluster nodes.
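
    To confirm that CCR regained quorum, you can rerun the check from step 4 on the remaining quorum node and verify that the PC_QUORUM_NODES and FC_COMMITTED_DIR checks no longer report errors:
    # mmccr check -Ye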

  6. Remove the failed nodes from the cluster. First verify the current cluster configuration and state with the mmlscluster and mmgetstate commands, and then issue the mmdelnode command with the --force option, as shown in the following example:
    # mmlscluster
    
    GPFS cluster information
    ========================
      GPFS cluster name:         gpfs-cluster-2.localnet.com
      GPFS cluster id:           13445038716777501310
      GPFS UID domain:           gpfs-cluster-2.localnet.com
      Remote shell command:      /usr/bin/ssh
      Remote file copy command:  /usr/bin/scp
      Repository type:           CCR
    
     Node  Daemon node name      IP address   Admin node name       Designation
    ----------------------------------------------------------------------------
       1   node-21.localnet.com  10.0.100.21  node-21.localnet.com  quorum-manager
       2   node-22.localnet.com  10.0.100.22  node-22.localnet.com  manager
       3   node-23.localnet.com  10.0.100.23  node-23.localnet.com  manager
       4   node-24.localnet.com  10.0.100.24  node-24.localnet.com
       5   node-25.localnet.com  10.0.100.25  node-25.localnet.com
    
    # mmgetstate -a
    
     Node number  Node name  GPFS state  
    -------------------------------------
               1  node-21    active
               2  node-22    unknown
               3  node-23    unknown
               4  node-24    active
               5  node-25    active
    
    # mmdelnode -N node-22,node-23 --force
    Verifying GPFS is stopped on all affected nodes ...
    mmdsh: There are no available nodes on which to run the command.
    mmdelnode: Unable to confirm that GPFS is stopped on all of the affected nodes.
        Nodes should not be removed from the cluster if GPFS is still running.
        Make sure GPFS is down on all affected nodes before continuing. If not,
        this may cause a cluster outage.
        Do you want to continue? (yes/no) yes
    mmdelnode: Removing GPFS system files on all deleted nodes ...
    mmdelnode: [W] Could not cleanup the following unreached nodes:
    node-23.localnet.com
    node-22.localnet.com
    mmdelnode: Command successfully completed
    QOS configuration has been installed and broadcast to all nodes.
    mmdelnode: Propagating the cluster configuration data to all
      affected nodes.  This is an asynchronous process.
    Wed Jul  7 18:08:55 CEST 2021: mmcommon pushSdr_async: mmsdrfs propagation started
    # mmlscluster
    
    GPFS cluster information
    ========================
      GPFS cluster name:         gpfs-cluster-2.localnet.com
      GPFS cluster id:           13445038716777501310
      GPFS UID domain:           gpfs-cluster-2.localnet.com
      Remote shell command:      /usr/bin/ssh
      Remote file copy command:  /usr/bin/scp
      Repository type:           CCR
    
     Node  Daemon node name      IP address   Admin node name       Designation
    ----------------------------------------------------------------------------
       1   node-21.localnet.com  10.0.100.21  node-21.localnet.com  quorum-manager
       4   node-24.localnet.com  10.0.100.24  node-24.localnet.com
       5   node-25.localnet.com  10.0.100.25  node-25.localnet.com
    
    You can also use the mmhealth node show command instead of the mmlscluster --noinit command to get the list of quorum nodes. The mmhealth node show command provides the status of the IBM Storage Scale components, as shown in the following example:
    # mmhealth node show
    
    Node name:      node-21.localnet.com
    Node status:    DEGRADED
    Status Change:  7 min. ago
    
    Component      Status        Status Change     Reasons
    ----------------------------------------------------------------------------------------------
    FILESYSMGR     HEALTHY       22 hours ago      -
    GPFS           DEGRADED      7 min. ago        ccr_local_server_warn, quorum_down, ccr_quorum_nodes_fail
    NETWORK        HEALTHY       14 days ago       -
    FILESYSTEM     FAILED        8 min. ago        stale_mount(gpfs0)
    DISK           HEALTHY       14 days ago       -
    
    In addition, the mmhealth node show <COMPONENT> -v --unhealthy command lists more details about the specified component. You can find the IP addresses of the unavailable quorum nodes in the command output:
    # mmhealth node show GPFS -v --unhealthy
    
    Node name:      node-21.localnet.com
    
    Component     Status        Status Change            Reasons
    ----------------------------------------------------------------------------------------------
    GPFS          DEGRADED      2021-07-23 13:53:11      ccr_local_server_warn, quorum_down, ccr_quorum_nodes_fail
    
    
    Event                     Parameter     Severity    Active Since             Event Message
    ----------------------------------------------------------------------------------------------
    ccr_local_server_warn     GPFS          WARNING     2021-07-23 14:14:04      The local GPFS CCR server is not reachable Item=PC_LOCAL_SERVER,ErrMsg='Ping local CCR server failed',Failed='node-21.localnet.com'
    quorum_down               GPFS          ERROR       2021-07-23 13:53:11      The node is not able to reach enough quorum nodes/disks to work properly.
    ccr_quorum_nodes_fail     GPFS          ERROR       2021-07-23 13:58:54      A majority of the quorum nodes are not reachable over the management network Item=PC_QUORUM_NODES,ErrMsg='Ping CCR quorum nodes failed',Failed='10.0.100.22;10.0.100.23'
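
    After the failed nodes are repaired or replaced, you can restore the original quorum configuration by adding the nodes back to the cluster and designating them as quorum nodes again. The following commands are a sketch that assumes the repaired nodes keep their old host names and have IBM Storage Scale installed. If the nodes were also manager nodes, restore that designation with the mmchnode --manager option:
    # mmaddnode -N node-22,node-23            # add the repaired nodes back to the cluster
    # mmchnode --quorum -N node-22,node-23    # designate the nodes as quorum nodes again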