Disk failures in local storage on workers (Db2 Big SQL)

When there is a disk drive failure on the local storage that a Db2® Big SQL worker pod is bound to, the pod terminates and goes into a CrashLoopBackOff state.

Symptoms

Any queries that are running when the disk failure occurs terminate with Db2 SQL1229N errors, and the worker's Db2 process then aborts and is excluded by the Db2 Big SQL scheduler from the list of available workers. Subsequently issued queries run normally, but on one less worker. Soon after, when the container liveness check on the affected worker detects that the worker's Db2 process is no longer running, the container is automatically terminated and restarted. Because the local storage Persistent Volume (PV) is no longer accessible, permission denied errors are observed in the container log from the affected pod. For example:

# oc logs bigsql-1606663199210719-worker-3 | head -3
M:1
Starting the Big SQL container (addWorkerParallel)
rm: cannot remove '/mnt/PV/versioned/marker_files/ready.txt': Permission denied

Because the worker's Db2 process cannot start without the files on the inaccessible PV, the liveness check fails again within minutes, the container terminates and restarts once more, and the pod enters a CrashLoopBackOff state.
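The restart cycle can be observed directly from the pod's status and recorded container state. A minimal sketch, assuming the instance ID and worker index from the examples in this topic (substitute your own values):

```shell
#!/bin/sh
# Sketch only: assumes the instance ID and worker index from the examples
# in this topic; substitute your own values.
INSTANCE_ID=1606663199210719
WORKER=3
POD="bigsql-${INSTANCE_ID}-worker-${WORKER}"

check_restart_cycle() {
    # The RESTARTS column climbs each time the liveness check kills the container
    oc get pod "$POD"
    # Last State shows how the previous container instance ended
    oc describe pod "$POD" | grep -A 5 "Last State"
}

# Run only where an oc client and a cluster login are available
command -v oc >/dev/null 2>&1 && check_restart_cycle
echo "watching pod: $POD"
```

A steadily increasing restart count, together with the permission denied messages shown earlier, distinguishes this disk failure from a one-off container crash.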

Resolving the problem

There are two recovery options. The first option involves replacing the failed disk on the Red Hat® OpenShift® worker node and restarting the affected pod. The second option involves moving the affected pod to a new OpenShift worker node.

Option 1: Replace the failed disk and restart the affected pod

The following steps can be applied without interrupting query workloads that are running on the Db2 Big SQL instance.

  1. Identify the OpenShift worker where the failed disk resides by examining the PV resources for the affected worker pod. For example:
    # oc get pod -l instance=1606663199210719,app=db2-bigsql,bigsql-node-role=worker --no-headers=true
    bigsql-1606663199210719-worker-0   1/1       Running            0         1d
    bigsql-1606663199210719-worker-1   1/1       Running            0         1d
    bigsql-1606663199210719-worker-2   1/1       Running            0         21h
    bigsql-1606663199210719-worker-3   0/1       CrashLoopBackOff   129       21h
    
    # oc get pvc pv-bigsql-1606663199210719-worker-3
    NAME                                  STATUS    VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS    AGE
    pv-bigsql-1606663199210719-worker-3   Bound     pv-local-worker-3-1   500Gi      RWO            local-storage   21h
    
    # oc describe pv pv-local-worker-3-1 | egrep "hostname|Path"
        Term 0:        kubernetes.io/hostname in [worker3.ocp45.cp.fyre.ibm.com]
        Path:          /bigsql-local
  2. SSH to the OpenShift worker node with the failed disk. For example:
    ssh <user_name>@<server_name>.com
  3. Confirm that the disk has failed by reviewing the operating system logs and attempting to read or write to the disk with OS level commands. For example:
    # ls /bigsql-local
       ls: cannot access /bigsql-local: Input/output error
  4. Unmount the failed device and replace the physical disk. For example:
    # df | grep bigsql-local
    /dev/sde                                                                                                                  3905110864  16232180  3888878684   1% /bigsql-local
    
    # df | grep sde
    /dev/sde                                                                                                                  3905110864  16232180  3888878684   1% /bigsql-local
    /dev/sde                                                                                                         3905110864  16232180  3888878684   1% /var/lib/origin/openshift.local.volumes/pods/0f3a9725-3665-11eb-b9ad-40f2e9757b10/volumes/kubernetes.io~local-volume/pv-local-worker-3-1
    
    # umount /bigsql-local
    # umount /var/lib/origin/openshift.local.volumes/pods/0f3a9725-3665-11eb-b9ad-40f2e9757b10/volumes/kubernetes.io~local-volume/pv-local-worker-3-1
    
    # df | grep sde
    #
    Note: In addition to the original mount, there might be a second mount for the Kubernetes local volume, as in this example. To avoid subsequent stale file handles, confirm that the device for the failed disk is fully unmounted before you replace the disk.
  5. Repartition and format the new disk with the required file system type (for example, xfs or zfs).
  6. Mount the newly formatted disk on the original mount point, and confirm that the disk is mounted correctly. For example:
    # df | grep bigsql-local
    /dev/sde    3905110864  164  3905110700   1% /bigsql-local
  7. Restart the affected pod:
    oc delete pod bigsql-<instance_id>-worker-<n>
  8. Confirm that the pod starts successfully and re-joins the cluster:
    oc logs bigsql-<instance_id>-worker-<n>
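The unmount, reformat, and remount sequence in steps 4 through 6 can be sketched as a small script. The device name /dev/sde and mount point /bigsql-local come from the examples in this topic, and xfs is shown as one file system choice; adjust all three for your environment, and verify the device name carefully because mkfs destroys all data on the device:

```shell
#!/bin/sh
# Sketch of the unmount / reformat / remount sequence from steps 4-6.
# DEVICE and MOUNTPOINT are taken from the examples in this topic; adjust both.
# WARNING: mkfs destroys all data on DEVICE - verify the name before running.
DEVICE=/dev/sde
MOUNTPOINT=/bigsql-local

replace_disk_fs() {
    # Unmount every mount of the failed device, including any Kubernetes
    # local-volume mount, so that no stale file handles remain
    mount | awk -v dev="$DEVICE" '$1 == dev {print $3}' | while read -r m; do
        umount "$m"
    done
    # After the physical disk is replaced, create the new file system (xfs here)
    mkfs.xfs -f "$DEVICE"
    # Remount on the original mount point and confirm
    mount "$DEVICE" "$MOUNTPOINT"
    df "$MOUNTPOINT"
}

# Run only as root on the affected OpenShift worker node, and only when the
# named block device actually exists
[ "$(id -u)" -eq 0 ] && [ -b "$DEVICE" ] && replace_disk_fs
echo "target device: $DEVICE mounted at: $MOUNTPOINT"
```

Parsing `mount` output for the device, rather than hard-coding the Kubernetes local-volume path, catches the second mount regardless of the pod UID in its path.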

Option 2: Move the pod to a new OpenShift worker node

This option assumes that an OpenShift worker node exists that has a local storage PV created but no Db2 Big SQL worker currently scheduled on it, and that the volume is in an Available state. Because the Db2 Big SQL worker statefulset replicas must be scaled down, the instance must be quiet while you complete the procedure.
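Whether such a spare volume exists can be checked before you begin. A sketch, assuming the local-storage storage class name from the examples in this topic:

```shell
#!/bin/sh
# Sketch: list unbound PVs of the local-storage class, i.e. candidate targets
# for the rescheduled worker. The storage class name comes from the examples
# in this topic; substitute your own.
STORAGE_CLASS=local-storage

list_available_pvs() {
    # A PV of the right class in the Available state can be bound by the
    # rescheduled worker's PV claim
    oc get pv | grep "$STORAGE_CLASS" | grep Available
}

# Run only where an oc client and a cluster login are available
command -v oc >/dev/null 2>&1 && list_available_pvs
echo "storage class: $STORAGE_CLASS"
```

If no PV is listed as Available, create one on a suitable node before scaling the statefulset down.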

  1. Identify the OpenShift worker where the failed disk resides by examining the PV resources for the affected worker pod. For example:
    # oc get pod -l instance=1606663199210719,app=db2-bigsql,bigsql-node-role=worker --no-headers=true
    bigsql-1606663199210719-worker-0   1/1       Running            0         1d
    bigsql-1606663199210719-worker-1   1/1       Running            0         1d
    bigsql-1606663199210719-worker-2   1/1       Running            0         21h
    bigsql-1606663199210719-worker-3   0/1       CrashLoopBackOff   129       21h
    
    # oc get pvc pv-bigsql-1606663199210719-worker-3
    NAME                                  STATUS    VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS    AGE
    pv-bigsql-1606663199210719-worker-3   Bound     pv-local-worker-3-1   500Gi      RWO            local-storage   21h
    
    # oc describe pv pv-local-worker-3-1 | egrep "hostname|Path"
        Term 0:        kubernetes.io/hostname in [worker3.ocp45.cp.fyre.ibm.com]
        Path:          /bigsql-local
  2. Scale down the worker statefulset to zero replicas:
    oc scale sts bigsql-<instance_id>-worker --replicas=0
  3. Delete the PV claim and the PV for the affected worker, as identified in step 1:
    oc delete pvc pv-bigsql-<instance_id>-worker-<n>
    oc delete pv pv-local-worker-<n>-<m>  ##adjust to match exact volume name
  4. Scale the worker statefulset back to the original complement of workers:
    oc scale sts bigsql-<instance_id>-worker --replicas=<N>
  5. Confirm that all workers start successfully, that the previously failed worker pod is scheduled on the new OpenShift worker node, and that it binds to the associated local storage PV.
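The scale-up can be verified with a short wait loop. A sketch, assuming the label selector values from the examples in this topic:

```shell
#!/bin/sh
# Sketch: wait until every Db2 Big SQL worker pod reports Running.
# The label selector values come from the examples in this topic;
# substitute your own instance ID.
SELECTOR="instance=1606663199210719,app=db2-bigsql,bigsql-node-role=worker"

wait_for_workers() {
    # Loop while any worker pod is in a state other than Running
    while oc get pod -l "$SELECTOR" --no-headers | awk '$3 != "Running"' | grep -q .; do
        echo "waiting for workers to reach Running..."
        sleep 10
    done
    oc get pod -l "$SELECTOR"
}

# Run only where an oc client and a cluster login are available
command -v oc >/dev/null 2>&1 && wait_for_workers
echo "selector: $SELECTOR"
```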
Note: If the failed disk is recovered without needing a replacement (for example, by a disk reseat), then the contents of the disk must be wiped before you create a new PV on this disk for use in any subsequent Db2 Big SQL out-scaling operations.
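One way to wipe a reseated disk is to clear its file system signatures and lay down a fresh file system. A sketch, assuming the device name from the earlier examples; wipefs and mkfs destroy all data, so verify the device name first:

```shell
#!/bin/sh
# Sketch: wipe a reseated disk before creating a new PV on it, so that no
# stale Db2 Big SQL data survives into a later out-scaling operation.
# DEVICE comes from the earlier examples; double-check it - this is destructive.
DEVICE=/dev/sde

wipe_disk() {
    # Remove old file system signatures from the device
    wipefs --all "$DEVICE"
    # Lay down a fresh file system (xfs shown; use your standard choice)
    mkfs.xfs -f "$DEVICE"
}

# Run only as root on the node that owns the disk, with the device unmounted
# and only when the named block device actually exists
[ "$(id -u)" -eq 0 ] && [ -b "$DEVICE" ] && wipe_disk
echo "target device: $DEVICE"
```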