Disk failures in local storage on workers (Db2 Big SQL)
When there is a disk drive failure on the local storage that a Db2® Big SQL worker pod is bound to, the pod terminates and goes into a CrashLoopBackOff state.
Symptoms
Any queries that are running when the disk failure occurs terminate with Db2 SQL1229N errors, and the worker's Db2 process then aborts and is excluded by the Db2 Big SQL scheduler from the list of available workers. Subsequently issued queries run normally without error, but on one fewer worker. Soon after, when the container liveness check on the affected worker detects that the worker's Db2 process is no longer running, the container is automatically terminated and restarted. Because the local storage Persistent Volume (PV) is no longer accessible, permission denied errors appear in the container log of the affected pod. For example:
# oc logs bigsql-1606663199210719-worker-3 | head -3
M:1 Starting the Big SQL container (addWorkerParallel)
rm: cannot remove '/mnt/PV/versioned/marker_files/ready.txt': Permission denied
Because the worker's Db2 process cannot start without the files on the inaccessible PV, the liveness check fails again within a matter of minutes, and the container terminates and restarts once more, so the pod enters a cyclic CrashLoopBackOff state.
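The restart loop is visible in the pod's climbing restart count. A minimal sketch, assuming the oc CLI is on the PATH and using the pod name from the example output in this article:

```shell
# Print the RESTARTS column from a single `oc get pod --no-headers` line
# (NAME READY STATUS RESTARTS AGE, so RESTARTS is field 4).
restart_count() {
  awk '{print $4}'
}

# Usage against a live cluster (skipped when oc is unavailable):
if command -v oc >/dev/null 2>&1; then
  oc get pod bigsql-1606663199210719-worker-3 --no-headers | restart_count
fi
```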
Resolving the problem
There are two recovery options. The first option involves replacing the failed disk on the Red Hat® OpenShift® worker node and restarting the affected pod. The second option involves moving the affected pod to a new OpenShift worker node.
Option 1: Replace the failed disk and restart the affected pod
The following steps can be applied without interrupting query workloads that are running on the Db2 Big SQL instance.
- Identify the OpenShift worker node where the failed disk resides by examining the PV resources for the affected worker pod. For example:
  # oc get pod -l instance=1606663199210719,app=db2-bigsql,bigsql-node-role=worker --no-headers=true
  bigsql-1606663199210719-worker-0   1/1   Running            0     1d
  bigsql-1606663199210719-worker-1   1/1   Running            0     1d
  bigsql-1606663199210719-worker-2   1/1   Running            0     21h
  bigsql-1606663199210719-worker-3   0/1   CrashLoopBackOff   129   21h
  # oc get pvc pv-bigsql-1606663199210719-worker-3
  NAME                                  STATUS   VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS    AGE
  pv-bigsql-1606663199210719-worker-3   Bound    pv-local-worker-3-1   500Gi      RWO            local-storage   21h
  # oc describe pv pv-local-worker-3-1 | egrep "hostname|Path"
      Term 0:  kubernetes.io/hostname in [worker3.ocp45.cp.fyre.ibm.com]
      Path:  /bigsql-local
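The PVC-to-node lookup in this step can also be scripted. This is a sketch, assuming the oc CLI is on the PATH and reusing the PVC name from the example output above:

```shell
# Print the node hostname that hosts a local PV, given the output of
# `oc describe pv` on stdin (it matches the node-affinity hostname term).
pv_node() {
  sed -n 's/.*hostname in \[\([^]]*\)\].*/\1/p'
}

# Usage against a live cluster (skipped when oc is unavailable):
if command -v oc >/dev/null 2>&1; then
  vol=$(oc get pvc pv-bigsql-1606663199210719-worker-3 \
    -o jsonpath='{.spec.volumeName}')
  oc describe pv "$vol" | pv_node
fi
```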
- SSH to the OpenShift worker node with the failed disk. For example:
  ssh <user_name>@<server_name>.com
- Confirm that the disk has failed by reviewing the operating system logs and attempting to read or write to the disk with OS-level commands. For example:
  # ls /bigsql-local
  ls: cannot access /bigsql-local: Input/output error
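The kernel log usually corroborates the failure. A sketch of what to look for; smartctl is an assumption (smartmontools might not be installed on every node), and the device name must be verified on your node first:

```shell
# Pass through log lines that look like block-device I/O errors.
io_errors() {
  grep -Ei 'I/O error|blk_update_request|critical medium error'
}

# Usage on the affected node (guards skip anything that is unavailable):
if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | io_errors | tail -5 || true
fi
if command -v smartctl >/dev/null 2>&1; then
  smartctl -H /dev/sde || true   # verify the device name before trusting this
fi
```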
- Unmount the failed device and replace the physical disk. For example:
  # df | grep bigsql-local
  /dev/sde    3905110864   16232180   3888878684   1% /bigsql-local
  # df | grep sde
  /dev/sde    3905110864   16232180   3888878684   1% /bigsql-local
  /dev/sde    3905110864   16232180   3888878684   1% /var/lib/origin/openshift.local.volumes/pods/0f3a9725-3665-11eb-b9ad-40f2e9757b10/volumes/kubernetes.io~local-volume/pv-local-worker-3-1
  # umount /bigsql-local
  # umount /var/lib/origin/openshift.local.volumes/pods/0f3a9725-3665-11eb-b9ad-40f2e9757b10/volumes/kubernetes.io~local-volume/pv-local-worker-3-1
  # df | grep sde
  #
  Note: In this example, in addition to the original mount, there might be a second mount for the Kubernetes local volume. Confirm that the device for the failed disk is fully unmounted before you replace the disk, to avoid any subsequent stale file handles.
- Repartition and format the new disk with the required file system type (for example, xfs or zfs).
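For the xfs case, the repartition-and-format step might look like the following sketch. The destructive commands are echoed rather than executed so that they can be reviewed first; /dev/sde matches the earlier df output, but verify the device name on your node (for example, with lsblk) before running them:

```shell
# Device name taken from the df example in this article -- VERIFY it first;
# parted and mkfs.xfs destroy any data on the target disk.
disk=/dev/sde
part=${disk}1

# Echo the commands for review instead of running them directly.
cat <<EOF
parted -s $disk mklabel gpt mkpart primary xfs 0% 100%
mkfs.xfs $part
EOF
```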
- Mount the newly formatted disk on the original mount point, and confirm that the disk is mounted correctly. For example:
  # df | grep bigsql-local
  /dev/sde   3905110864   164   3905110700   1% /bigsql-local
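A sketch of the mount step that also persists the mount across reboots. It assumes the new disk was partitioned as /dev/sde1 (as in the earlier sketch) and that blkid is available; the nofail option is an assumption that keeps the node bootable if this disk ever fails again:

```shell
# Emit an /etc/fstab entry for an xfs filesystem.
# $1 = filesystem UUID, $2 = mount point.
fstab_line() {
  printf 'UUID=%s %s xfs defaults,nofail 0 0\n' "$1" "$2"
}

# Guarded so nothing runs unless the block device actually exists.
if [ -b /dev/sde1 ]; then
  mount /dev/sde1 /bigsql-local
  fstab_line "$(blkid -s UUID -o value /dev/sde1)" /bigsql-local >> /etc/fstab
fi
```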
- Restart the affected pod:
  oc delete pod bigsql-<instance_id>-worker-<n>
- Confirm that the pod starts successfully and rejoins the cluster:
  oc logs bigsql-<instance_id>-worker-<n>
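The readiness check can also be done from the pod listing. A sketch, assuming the oc CLI and the pod name from the example output earlier in this article:

```shell
# Succeed when a single `oc get pod --no-headers` line shows a Ready,
# Running pod (READY is field 2, STATUS is field 3).
pod_ready() {
  awk '{ exit !($2 == "1/1" && $3 == "Running") }'
}

# Usage against a live cluster (skipped when oc is unavailable):
if command -v oc >/dev/null 2>&1; then
  if oc get pod bigsql-1606663199210719-worker-3 --no-headers | pod_ready; then
    echo "worker-3 has rejoined the cluster"
  fi
fi
```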
Option 2: Move the pod to a new OpenShift worker node
This option assumes that there is an OpenShift worker node that has a local storage PV created but no Db2 Big SQL worker currently scheduled on it, and that the volume is in an Available state. This option requires that the Db2 Big SQL worker statefulset replicas are scaled down, and therefore requires a quiesced instance (no running workloads) to complete the procedure.
- Identify the OpenShift worker node where the failed disk resides by examining the PV resources for the affected worker pod. For example:
  # oc get pod -l instance=1606663199210719,app=db2-bigsql,bigsql-node-role=worker --no-headers=true
  bigsql-1606663199210719-worker-0   1/1   Running            0     1d
  bigsql-1606663199210719-worker-1   1/1   Running            0     1d
  bigsql-1606663199210719-worker-2   1/1   Running            0     21h
  bigsql-1606663199210719-worker-3   0/1   CrashLoopBackOff   129   21h
  # oc get pvc pv-bigsql-1606663199210719-worker-3
  NAME                                  STATUS   VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS    AGE
  pv-bigsql-1606663199210719-worker-3   Bound    pv-local-worker-3-1   500Gi      RWO            local-storage   21h
  # oc describe pv pv-local-worker-3-1 | egrep "hostname|Path"
      Term 0:  kubernetes.io/hostname in [worker3.ocp45.cp.fyre.ibm.com]
      Path:  /bigsql-local
- Scale down the worker statefulset to zero replicas:
  oc scale sts bigsql-<instance_id>-worker --replicas=0
- Delete the PV claim and PV for the affected worker, as identified in step 1:
  oc delete pvc pv-bigsql-<instance_id>-worker-<n>
  oc delete pv pv-local-worker-<n>-<m>    # adjust to match the exact volume name
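Before scaling back up, it is worth confirming that a replacement local PV is Available for the scheduler to bind. A sketch, assuming the oc CLI is on the PATH:

```shell
# Pass through `oc get pv` lines whose STATUS column is Available.
available_pvs() {
  grep -w Available
}

# Usage against a live cluster (skipped when oc is unavailable; `|| true`
# keeps the script going when no Available PV exists yet):
if command -v oc >/dev/null 2>&1; then
  oc get pv --no-headers | available_pvs || true
fi
```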
- Scale the worker statefulset back to the original complement of workers:
  oc scale sts bigsql-<instance_id>-worker --replicas=<N>
- Confirm that all workers start successfully, and that the previously failed worker pod is scheduled on the new OpenShift worker node and binds to its associated local storage PV.
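The final check can be scripted as well. A sketch, assuming the oc CLI and the label selector used in the example output earlier in this article; an empty result means every worker is Ready:

```shell
# Print the names of any pods that are not yet 1/1 Running, from
# `oc get pod --no-headers` lines on stdin.
not_ready() {
  awk '$2 != "1/1" || $3 != "Running" { print $1 }'
}

# Usage against a live cluster (skipped when oc is unavailable):
if command -v oc >/dev/null 2>&1; then
  oc get pod -l instance=1606663199210719,app=db2-bigsql,bigsql-node-role=worker \
    --no-headers | not_ready
fi
```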