Certificate manager experiences CPU resource issues

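You experience the following symptoms: the cert-manager-cainjector pod restarts constantly, cert-manager-webhook logs errors about a bad TLS certificate, and certificate manager does not acknowledge or resolve Certificates.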

This issue applies to IBM Cloud Pak foundational services versions 3.7.x and 3.8.x. It is fixed in version 3.9.x.

Cause

During performance testing, certificate manager was tested only with the Certificates that foundational services require. No additional load was generated against certificate manager. As a result, when adopters and IBM Cloud Paks create their own Certificates, the default CPU limit that is allocated to certificate manager is not enough, and the cert-manager-cainjector pod becomes CPU throttled. Eventually, Kubernetes restarts the container because it stops responding to the liveness and readiness probes.

Because cert-manager-cainjector cannot complete its duties, cert-manager-webhook does not receive the certificates that it needs to communicate with the Kubernetes API server, so cert-manager-webhook does not patch the Certificates. At that point, certificate manager can no longer acknowledge or resolve these Certificates.
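To check whether a cluster matches this description, it can help to look at the CPU limit that is currently set on cert-manager-cainjector and at the restart counts of the certificate manager pods. The following is a minimal sketch, not an official procedure. It assumes that your current project is the one where foundational services are installed and that the first container in the deployment is the cainjector container. The function name check_cainjector is hypothetical.

```shell
# Print the current CPU limit of cert-manager-cainjector and list the
# certificate manager pods with their restart counts.
# Assumption: the current project is the foundational services namespace.
check_cainjector() {
  # CPU limit of the first container in the deployment
  # (assumed to be the cainjector container).
  oc get deployment cert-manager-cainjector \
    -o=jsonpath='{.spec.template.spec.containers[0].resources.limits.cpu}'
  echo
  # The RESTARTS column shows whether a pod is being restarted repeatedly.
  oc get pods | grep cert-manager
}
```

A high and growing restart count on the cert-manager-cainjector pod is consistent with the throttling described above.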

Resolving the problem

Ideally, adopters and IBM Cloud Paks test their deployments and configure the resources that are allocated to foundational services as needed before they release their product, so that users do not have to take any action. However, if you are already experiencing this issue, the fix is to increase the CPU limit for cert-manager-cainjector.

Normally, the recommended way to configure CPU resources is through the CommonService CR. For this issue, that approach does not work on its own because cert-manager-webhook is most likely down, which blocks the common-service-operator from reconciling any changes. In this scenario, complete the following steps to temporarily fix cert-manager-webhook first and then persist the change:

  1. Scale the ibm-cert-manager-operator deployment down from 1 to 0 pods. Run the following command:

     oc scale --replicas=0 deployment ibm-cert-manager-operator
    
  2. Edit the cert-manager-cainjector deployment and set the CPU limit to 100m. Run the following command:

     oc edit deployment cert-manager-cainjector
    

    At this point, cert-manager-cainjector should stop restarting constantly, and cert-manager-webhook should no longer log errors about a bad TLS certificate.

  3. Update the CommonService custom resource to persist the changes with the following command:

     oc edit commonservice common-service
    

    Make the following changes in the spec section under ibm-cert-manager-operator:

    spec:
      services:
        - name: ibm-cert-manager-operator
          spec:
            certManager:
              certManagerCAInjector:
                resources:
                  limits:
                    cpu: 100m
    
  4. Save the changes and wait for the status Succeeded, which indicates that your changes are reconciled. You can check the status with the following command:

     oc get commonservice common-service -o=jsonpath='{.status.phase}'
    
  5. Scale the ibm-cert-manager-operator deployment back to one pod with the following command:

     oc scale --replicas=1 deployment ibm-cert-manager-operator
    

The cert-manager-cainjector CPU limit is now persistently changed to 100m so that the issue does not occur again.
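If you prefer not to use an interactive editor, steps 2 and 4 can be sketched as shell functions. This is a hedged sketch under stated assumptions rather than the documented procedure: it uses oc set resources in place of oc edit for step 2, it assumes that your current project is the foundational services namespace, and the function names, attempt count, and delay are hypothetical. Note that oc set resources applies the limit to every container in the deployment unless you narrow it down. Step 3 is left as an interactive edit because the services entry in the CommonService CR is a list, and patching it without care could overwrite other entries.

```shell
# Step 2 without an editor: set the CPU limit on the deployment directly.
# The limit defaults to 100m, the value used in the steps above.
set_cainjector_cpu_limit() {
  oc set resources deployment cert-manager-cainjector --limits=cpu="${1:-100m}"
}

# Step 4: poll the CommonService status until it reports Succeeded.
# Arguments (hypothetical defaults): number of attempts, delay in seconds.
wait_for_succeeded() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  phase=""
  while [ "$i" -lt "$attempts" ]; do
    phase="$(oc get commonservice common-service -o=jsonpath='{.status.phase}')"
    if [ "$phase" = "Succeeded" ]; then
      echo "reconciled"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out; last phase: ${phase}" >&2
  return 1
}
```

For example, run set_cainjector_cpu_limit after step 1, and run wait_for_succeeded after you save the CommonService changes in step 3. Scale the operator back up in step 5 only after the second function returns successfully.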