Troubleshooting zen-metastoredb pod failure "cockroach server exited with error: failed to create engines: could not open rocksdb instance"

Troubleshooting

Problem

One of the zen-metastoredb pods is in a crash loop and does not start, even after the pod is deleted. Its log shows the following error:

cockroach server exited with error: failed to create engines: could not open rocksdb instance: IO error: While lock file: /cockroach/cockroach-data/LOCK: Resource temporarily unavailable

Symptom

zen-metastoredb needs 3 replicas to function correctly. Because of this error, one or more of the zen-metastoredb pods are not in the 1/1 Running state.

Cause

The zen-metastoredb pod was killed or stopped abruptly, which left the CockroachDB files in a corrupted state. Known events that cause the issue include a node going down and deleting the pods with --force --grace-period=0.

Environment

OpenShift 3.11, OpenShift 4.3, OpenShift 4.5
Cloud Pak for Data 3.0, Cloud Pak for Data 3.5

Diagnosing The Problem

oc get pods | grep "zen-metastoredb"

oc describe pod <pod-name>

oc logs <pod-name>
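To confirm that a pod is failing with this specific error rather than some other startup problem, the log output can be filtered for the LOCK message. The following is a minimal sketch; has_lock_error is a hypothetical helper name, not part of oc.

```shell
# Succeeds if the log text on stdin contains the LOCK error from this note.
# has_lock_error is a hypothetical helper name used for illustration only.
has_lock_error() {
  grep -qF 'LOCK: Resource temporarily unavailable'
}
```

For example: oc logs <pod-name> | has_lock_error && echo "LOCK error found". For a pod that is crash looping, oc logs --previous <pod-name> may be needed to see the output of the container that exited.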

Resolving The Problem

To resolve this issue:

Step 1: Remove the stale files from the failing pod's persistent volume (PV).

a. Find the PVC

oc describe pods zen-metastoredb-1  | grep -i claimname | awk '{ print $2 }'

datadir-zen-metastoredb-1

b. Find the persistent volume name

oc get pvc | grep datadir-zen-metastoredb-1 | grep pvc | awk '{ print $3 }'

pvc-37a99c4b-9329-11ea-88e1-00163e01d86d
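As an alternative to the grep and awk pipelines above, the same two names can be read directly from the resources with -o jsonpath, which does not depend on column positions. This is a sketch only; the function names and the namespace argument are assumptions made for illustration.

```shell
# Print the PVC name(s) mounted by a pod. $1 = pod name, $2 = namespace.
# Function names are hypothetical; only the oc calls are real commands.
pod_claim_name() {
  oc get pod "$1" -n "$2" \
    -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'
}

# Print the persistent volume bound to a PVC. $1 = PVC name, $2 = namespace.
pvc_volume_name() {
  oc get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}'
}
```

For example: pod_claim_name zen-metastoredb-1 zen, followed by pvc_volume_name datadir-zen-metastoredb-1 zen.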

c. Mount the PVC

Create a temporary pod that mounts the volume so that the LOCK file can be fixed. Make sure that the image and claimName values are updated to match your environment.

Create a file cx-troubleshoot.yaml with the following content:

cat cx-troubleshoot.yaml

apiVersion: v1
kind: Pod
metadata:
  name: cx-troubleshoot
spec:
  containers:
  - image: docker-registry.default.svc:5000/zen/zen-core:v3.0.1.0-x86_64-45
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: cx-troubleshoot
    volumeMounts:
    - mountPath: /cockroach/cockroach-data
      name: datadir
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000321000
  volumes:
  - name: datadir
    persistentVolumeClaim:
      claimName: datadir-zen-metastoredb-1
  restartPolicy: Always

oc create -f cx-troubleshoot.yaml -n <zen>

oc get pods | grep cx-troubleshoot

oc rsh cx-troubleshoot
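Because oc rsh fails if the temporary pod has not started yet, the check and the remote shell can be combined. The sketch below assumes that oc wait is available in your oc client version; the function name and the namespace argument are hypothetical.

```shell
# Wait until the temporary pod is Ready, then open a shell in it.
# $1 = namespace (the steps above use <zen> as the placeholder).
# Assumes the oc client supports "oc wait"; adjust the timeout as needed.
open_troubleshoot_shell() {
  oc wait --for=condition=Ready pod/cx-troubleshoot -n "$1" --timeout=120s &&
    oc rsh -n "$1" cx-troubleshoot
}
```

For example: open_troubleshoot_shell zen. If the pod does not become Ready within the timeout, the shell is not opened and oc describe pod cx-troubleshoot can be used to find out why.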

Remove the LOCK file and the temporary files:

cd /cockroach/cockroach-data
rm -rf LOCK
rm -rf temp-dirs-record.txt
rm -rf cockroach-temp0000000000

This should clear the LOCK file issue.
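The removal commands above can also be written as a small function that touches only the three named entries and reports what it removed. This is a sketch to paste into the rsh session; clear_lock_files is a hypothetical name, and the entry names are taken exactly as listed in the steps above, so the temporary directory name in your volume may differ.

```shell
# Remove the LOCK file and the temporary entries from a data directory.
# $1 = data directory (defaults to the mount path used in the pod spec).
# Entry names are copied from the steps above; nothing else is deleted.
clear_lock_files() {
  dir="${1:-/cockroach/cockroach-data}"
  for name in LOCK temp-dirs-record.txt cockroach-temp0000000000; do
    if [ -e "$dir/$name" ]; then
      rm -rf -- "$dir/$name" && echo "removed $dir/$name"
    fi
  done
}
```

Running clear_lock_files with no argument acts on /cockroach/cockroach-data. Entries that do not exist are skipped silently.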

Step 2: Restart all the zen-metastoredb pods

oc delete pods zen-metastoredb-0 zen-metastoredb-1 zen-metastoredb-2 -n zen

watch "oc get pods -n zen | grep zen-metastoredb"

All 3 zen-metastoredb pods should now be up and running.
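Instead of reading the watch output by eye, the final state can be checked with a small filter over the oc get pods output. This is a sketch under the assumption that the replicas are named zen-metastoredb-0 through zen-metastoredb-2, as in the delete command above; all_metastoredb_ready is a hypothetical helper name.

```shell
# Reads "oc get pods" output on stdin. Succeeds only if exactly three
# zen-metastoredb-<n> pods are listed and each one is 1/1 Running.
all_metastoredb_ready() {
  awk '$1 ~ /^zen-metastoredb-[0-9]+$/ {
         total++
         if ($2 == "1/1" && $3 == "Running") ready++
       }
       END { exit !(total == 3 && ready == 3) }'
}
```

For example: oc get pods -n zen | all_metastoredb_ready && echo "all replicas ready".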

Document Location

Worldwide

