Pods restarted regularly every 10 hours

Pods which have certificate secrets mounted are being restarted about every 10 hours in foundational services 3.13.

Symptoms

Pods that mount certificate secrets are restarted about every 10 hours. When you list pods with oc get pods, the AGE column never exceeds about 12 hours.

The YAML of each restarted pod contains a label named certmanager.k8s.io/time-restarted, and the value of this label matches the time when the pod was created.
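Both symptoms can be checked in one listing. The following is a minimal sketch, not an official procedure: it assumes that foundational services is installed in the ibm-common-services namespace (an assumption, adjust NAMESPACE if yours differs), and show_restart_labels is a hypothetical helper name. It prints the label as an extra column and sorts pods by creation time.

```shell
# Assumption: foundational services runs in the ibm-common-services namespace.
NAMESPACE="${NAMESPACE:-ibm-common-services}"

# Hypothetical helper: list pods, oldest first, with the restart label as a column.
show_restart_labels() {
  oc get pods -n "$NAMESPACE" \
    -L certmanager.k8s.io/time-restarted \
    --sort-by=.metadata.creationTimestamp
}
```

After defining the function, run show_restart_labels. Affected pods show a value in the TIME-RESTARTED column and a low AGE.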

Cause

The ibm-cert-manager-operator pod automatically restarts pods whenever the Certificate they use is renewed. However, in foundational services 3.13, there is a bug that causes the operator to incorrectly restart pods even when the Certificate has NOT been renewed.

Resolving the problem

This issue is fixed in foundational services 3.14, so upgrading is the permanent solution. If you cannot upgrade foundational services as a whole, you can instead patch the ibm-cert-manager-operator CSV with the image from foundational services 3.14. This upgrades only ibm-cert-manager-operator and leaves all other foundational services untouched.

Patching the CSV

  1. Run the following command:

    oc edit csv ibm-cert-manager-operator.v3.13.0
    
  2. Change the operator image value to quay.io/opencloudio/ibm-cert-manager-operator@sha256:bcf43bd31ed39ba8a8f559e503c0ec2db62459a7466248917c258d9a990c5d17.

    Notes:

    • The registry quay.io/opencloudio might need to be changed, depending on which registry you used for foundational services.
    • In an air-gapped scenario, the operator image and the operand images must be mirrored first.
    • The following are the operand images:
      • quay.io/opencloudio/icp-cert-manager-controller@sha256:b4ab5ef86d492b6f5caa6e2676b095ab7683ade4c42660f36ec3d43616558f3b
      • quay.io/opencloudio/icp-cert-manager-webhook@sha256:a3a4c2982f018ae500274e154ff00a39b72501c0f4d6f5293401f0dc9e16a915
      • quay.io/opencloudio/icp-cert-manager-cainjector@sha256:66447d5997e0ee3d7ed8226cfda1774c6a412e9a749515b1f6bf1fd6ac5726f8
      • quay.io/opencloudio/icp-cert-manager-acmesolver@sha256:9eabb93d88dc158c4892e298eec1da262515c9359ef70c21a91f4260b1aaf37f
      • quay.io/opencloudio/icp-configmap-watcher@sha256:ffa55e50f834d1ab79686832679ad9e97c2a529dad6a262dac6fd60467cadf19
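If your images come from a different registry, each reference above keeps its image name and digest and only the registry prefix changes. The following is a minimal sketch of that rewrite. The target registry registry.example.com/opencloudio and the helper name rewrite_image are hypothetical, so substitute your own values.

```shell
# Default registry used in the image references listed above.
DEFAULT_REGISTRY="quay.io/opencloudio"
# Hypothetical private registry (assumption); replace with your own.
TARGET_REGISTRY="${TARGET_REGISTRY:-registry.example.com/opencloudio}"

# Hypothetical helper: print an image reference with the default registry
# prefix replaced by the target registry. The image name and digest are kept.
# The argument is expected to start with "$DEFAULT_REGISTRY/".
rewrite_image() {
  printf '%s\n' "$TARGET_REGISTRY/${1#"$DEFAULT_REGISTRY"/}"
}

# Example: rewrite the operator image reference from step 2.
rewrite_image "quay.io/opencloudio/ibm-cert-manager-operator@sha256:bcf43bd31ed39ba8a8f559e503c0ec2db62459a7466248917c258d9a990c5d17"
```

The same call can be repeated for each operand image when you prepare a mirroring list for an air-gapped environment.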

If patching the CSV is also not an option, the temporary workaround is to scale the ibm-cert-manager-operator deployment down to zero replicas:

   oc scale --replicas=0 deployment/ibm-cert-manager-operator

Note that this stops all automatic pod restarts, even when a Certificate has actually been renewed. Use this method ONLY as a short-term, temporary workaround.

Another issue with this method is that new Certificates created with the v1alpha1 version are no longer converted to v1 Certificates, which means that no certificate secret is generated for them.
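Because the scale-down is only a temporary measure, it helps to keep the reverse operation at hand. The following is a minimal sketch under two assumptions that are not stated above: the operator runs in the ibm-common-services namespace, and its normal replica count is 1. The helper names are hypothetical.

```shell
# Assumption: foundational services runs in the ibm-common-services namespace.
NAMESPACE="${NAMESPACE:-ibm-common-services}"

# Hypothetical helper: apply the temporary workaround (stop the operator).
pause_cert_manager_operator() {
  oc scale --replicas=0 deployment/ibm-cert-manager-operator -n "$NAMESPACE"
}

# Hypothetical helper: revert the workaround.
# Assumption: the original replica count is 1; pass another value as $1 if not.
resume_cert_manager_operator() {
  oc scale --replicas="${1:-1}" deployment/ibm-cert-manager-operator -n "$NAMESPACE"
}
```

Run pause_cert_manager_operator to apply the workaround, and run resume_cert_manager_operator once the upgrade or the CSV patch is in place.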