Certificate manager experiences CPU resource issues
You experience the following symptoms:
- The cert-manager-cainjector pod restarts many times.
- Certificates are not ready.
- The cert-manager-webhook pod has errors about a bad TLS certificate in the logs.
This issue applies to IBM Cloud Pak foundational services versions 3.7.x and 3.8.x. It is fixed in version 3.9.x.
Cause
During performance testing, certificate manager was tested only with the Certificates that foundational services require. No additional load was generated against certificate manager. As a result, when adopters and IBM Cloud Paks create their own Certificates, the default CPU limit that is allocated to certificate manager is not enough. The cert-manager-cainjector pod becomes CPU throttled, and Kubernetes eventually restarts the container because it does not respond to the liveness and readiness probes.
Because cert-manager-cainjector cannot complete its duties, cert-manager-webhook does not receive the certificates that it needs to communicate with the Kubernetes API server, so the Certificates are not patched by cert-manager-webhook. Certificate manager then cannot acknowledge or resolve these Certificates.
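The following is a minimal sketch of how you might check for the symptoms that this failure chain produces. The function name show_cert_manager_symptoms is hypothetical, and the sketch assumes that your current project is the one where certificate manager is installed, as the other commands in this document do.

```shell
# Hypothetical helper: list the certificate manager pods and show recent webhook logs.
# Assumption: the current oc project is the one that contains certificate manager.
show_cert_manager_symptoms() {
  # The RESTARTS column shows whether cert-manager-cainjector keeps restarting.
  oc get pods | grep cert-manager
  # Recent webhook log lines; look for errors about a bad TLS certificate.
  oc logs deployment/cert-manager-webhook --tail=50
}
```

Run show_cert_manager_symptoms after you log in to the cluster. A high restart count for cert-manager-cainjector together with TLS certificate errors in the webhook log is consistent with the cause that is described in this section.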
Resolving the problem
Ideally, adopters and IBM Cloud Paks test their deployments and configure the resources that are allocated to foundational services as needed before they release the product, so that users do not have to take any action. However, if you are already experiencing this issue, the fix is to increase the CPU limit for cert-manager-cainjector.
The recommended way to configure CPU resources is through the CommonService CR. For this issue, the CR option does not work properly because cert-manager-webhook is most likely down, which blocks the common-service-operator from reconciling any changes. In this scenario, complete the following steps to temporarily fix cert-manager-webhook:
- Scale the ibm-cert-manager-operator deployment down from 1 to 0 pods. Run the following command:
    oc scale --replicas=0 deployment ibm-cert-manager-operator
- Edit the cert-manager-cainjector deployment and set the CPU limit to 100m. Run the following command:
    oc edit deployment cert-manager-cainjector
  At this point, cert-manager-cainjector should stop restarting constantly, and cert-manager-webhook should no longer log errors about a bad TLS certificate.
- Update the CommonService custom resource to persist the changes with the following command:
    oc edit commonservice common-service
  Make the following changes in the spec section under ibm-cert-manager-operator:
    spec:
      services:
      - name: ibm-cert-manager-operator
        spec:
          certManager:
            certManagerCAInjector:
              resources:
                limits:
                  cpu: 100m
- Save the changes and wait for the status, Succeeded, to indicate that your changes are reconciled. You can check the status with the following command:
    oc get commonservice common-service -o=jsonpath='{.status.phase}'
- Next, scale the ibm-cert-manager-operator deployment back to one pod with the following command:
    oc scale --replicas=1 deployment ibm-cert-manager-operator
The cert-manager-cainjector CPU limit is now persistently changed to 100m so that the issue does not occur again.
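If you prefer to script parts of this procedure, the following sketch shows one possible approach. It is not the documented method: it swaps the interactive oc edit of the deployment for oc set resources, which sets the limit on every container in the deployment, and it polls the CommonService phase instead of checking it by hand. The function names, the retry count, and the polling interval are hypothetical, and the sketch assumes that your current project contains these resources.

```shell
# Hypothetical helper: non-interactive alternative to "oc edit deployment cert-manager-cainjector".
# Assumption: run this only after ibm-cert-manager-operator is scaled down to 0 pods.
set_cainjector_cpu_limit() {
  oc set resources deployment cert-manager-cainjector --limits="cpu=${1:-100m}"
}

# Hypothetical helper: poll the CommonService phase until it reports Succeeded.
# Arguments: number of attempts (default 30) and seconds between attempts (default 10).
wait_for_commonservice() {
  attempts="${1:-30}"
  interval="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(oc get commonservice common-service -o=jsonpath='{.status.phase}')
    if [ "$phase" = "Succeeded" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "CommonService did not reach Succeeded (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For example, run set_cainjector_cpu_limit 100m in place of the manual deployment edit, and run wait_for_commonservice after you save the CommonService changes and before you scale the operator back up.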