Troubleshooting deployment

no endpoints available for service "ibm-spectrum-scale-webhook-service"

When creating or editing IBM Storage Scale container native custom resources, the validating or mutating webhooks can fail if the operator pod is unavailable. If you receive the error no endpoints available for service "ibm-spectrum-scale-webhook-service", check the state of the operator pod.

Example error:

# kubectl apply -f remotecluster.yaml
remotecluster.scale.spectrum.ibm.com/remotecluster-sample unchanged
Error from server (InternalError): error when creating "remotecluster.yaml": Internal error occurred: failed calling webhook "mcluster.scale.spectrum.ibm.com": failed to call webhook: Post "https://ibm-spectrum-scale-webhook-service.ibm-spectrum-scale-operator.svc:443/mutate-scale-spectrum-ibm-com-v1beta1-cluster?timeout=10s": no endpoints available for service "ibm-spectrum-scale-webhook-service"

Check the operator pod. In this example, the STATUS is ImagePullBackOff:

# kubectl get pods -n ibm-spectrum-scale-operator
NAME                                                     READY   STATUS             RESTARTS         AGE
ibm-spectrum-scale-controller-manager-64bb4798df-rrj4j   0/1     ImagePullBackOff   10 (4m14s ago)   34m

In this example, the operator cannot pull its image because the pull credentials are invalid. To remedy the no endpoints available error, resolve the operator problem first, then retry the original command.
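As a sketch of the diagnosis, you can read the pull-failure details from the pod events and decode the image pull secret to verify the registry credentials. The secret name ibm-entitlement-key is an assumption based on common usage with this deployment; substitute the name used in your environment.

```bash
# Read the events that explain the ImagePullBackOff
# (pod name taken from the example output above)
kubectl describe pod ibm-spectrum-scale-controller-manager-64bb4798df-rrj4j \
  -n ibm-spectrum-scale-operator

# Decode the image pull secret to verify the registry credentials;
# "ibm-entitlement-key" is an assumed name -- use your secret's name
kubectl get secret ibm-entitlement-key -n ibm-spectrum-scale-operator \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```

Once the credentials are corrected, delete the operator pod so that Kubernetes recreates it and retries the image pull.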

core pods are stuck in init

If the IBM Storage Scale container native cluster fails to create, the core pods on the worker nodes can get stuck in an init container.

# kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
...
worker0                            0/2     Init:1/2   0          2h
worker1                            0/2     Init:1/2   0          2h
worker2                            0/2     Init:1/2   0          2h
worker3                            0/2     Init:1/2   0          2h
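Before cleaning up, it can be useful to capture why the init step never finished. A minimal check, assuming the pods live in the ibm-spectrum-scale namespace (the init container name is a placeholder; use the one reported by describe):

```bash
# See which init container is stuck and what events it produced
kubectl describe pod worker0 -n ibm-spectrum-scale

# Pull the logs of that init container
kubectl logs worker0 -c <init-container> -n ibm-spectrum-scale
```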

There is no recovery from this problem. For more information about clean up, see Cleanup IBM Storage Scale container native and Cleanup Red Hat OpenShift nodes. For more information on redeploying, see Installing the IBM Storage Scale container native operator and cluster.

core, GUI, or collector pods are in ErrImagePull or ImagePullBackOff state

When viewing kubectl get pods -n ibm-spectrum-scale, if any pod is in the ErrImagePull or ImagePullBackOff state, use kubectl describe pod to get more details on the pod and look for any errors:

kubectl describe pod <pod-name> -n ibm-spectrum-scale
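To surface pull failures across the whole namespace at once, the warning events can be listed; this is a sketch using standard kubectl selectors:

```bash
# Warning events include the Failed/BackOff messages from image pulls
kubectl get events -n ibm-spectrum-scale \
  --field-selector type=Warning --sort-by=.lastTimestamp
```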

core, GUI, or collector pods are not up
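A reasonable starting point for this symptom, assuming the default ibm-spectrum-scale namespace:

```bash
# Pod status plus the node each pod was scheduled to
kubectl get pods -n ibm-spectrum-scale -o wide

# For a pod that is not Running, check its events for scheduling
# or container failures
kubectl describe pod <pod-name> -n ibm-spectrum-scale
```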

core, GUI, or collector pods show container restarts
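When a container keeps restarting, the logs of the previous instance usually hold the reason it exited. A minimal check (pod and container names are placeholders):

```bash
# The RESTARTS column shows which pods are cycling
kubectl get pods -n ibm-spectrum-scale

# Logs from the previous, crashed instance of a container
kubectl logs <pod-name> -c <container-name> --previous -n ibm-spectrum-scale
```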

all pods are running but the GPFS cluster is stuck in the "arbitrating" state

If the cluster is stuck in an arbitrating state, the GPFS daemons have started but quorum has not yet been established.
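One way to inspect the daemon state is from inside a core pod. This is a sketch that assumes worker0 is a running core pod; mmgetstate and mmhealth are the standard GPFS state and health commands, and you may need to add -c &lt;container&gt; if the pod has more than one container:

```bash
# GPFS daemon state on every node (active, arbitrating, down, ...)
kubectl exec worker0 -n ibm-spectrum-scale -- mmgetstate -a

# Health events that may explain why quorum is not forming
kubectl exec worker0 -n ibm-spectrum-scale -- mmhealth node show -N all
```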

remote mount file system not getting configured or mounted

* File system

    ```bash
    kubectl get filesystem.scale -n ibm-spectrum-scale
    kubectl describe filesystem.scale <name> -n ibm-spectrum-scale
    ```


Check the `Status` section and `Events` for the reason for the failure.
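The remote mount also depends on the RemoteCluster resource, so its status is worth checking the same way (the resource kind matches the remotecluster.scale custom resource shown earlier; the namespace is an assumption):

```bash
kubectl get remotecluster.scale -n ibm-spectrum-scale
kubectl describe remotecluster.scale <name> -n ibm-spectrum-scale
```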

If nothing is found there, check the operator logs for any errors:

```bash
kubectl logs $(kubectl get pods -n ibm-spectrum-scale-operator -o json | jq -r ".items[0].metadata.name") -n ibm-spectrum-scale-operator
```
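Equivalently, the operator logs can be fetched through the Deployment, whose name (ibm-spectrum-scale-controller-manager) is visible in the pod name shown earlier; filtering for errors keeps the output manageable:

```bash
kubectl logs deploy/ibm-spectrum-scale-controller-manager \
  -n ibm-spectrum-scale-operator --tail=500 | grep -i error
```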