Changes to the CA certificate and key do not automatically rotate Kafka leaf certificates

For example, after you force a renewal of the CA certificate and key for the AutomationBase instance by deleting the secret, you see the following error from the Go Kafka client: failed to create protocol: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)

Cause

This issue occurs because the IBM Events Operator, which manages the Kafka instance, does not automatically rotate the leaf certificates for the cluster when it is provided with a custom CA. The Kafka cluster is provided with a custom CA so that all IBM Automation foundation components can use a common CA.

Resolving the problem

Before you proceed, take a copy of the following secrets:

If you need to renew the CA, or the CA and the key, as a part of this process, follow these steps:

  1. Read the documentation for renewing certificates here.
  2. Determine the CA certificates for Kafka that you are going to renew from the above documentation.
  3. Follow the documentation to renew the CA for Kafka and any other leaf certificates for components underneath that CA.

If you are using v1.0 or v1.1 of AutomationBase, then:

  1. Uninstall the IBM Automation foundation Operator.
  2. Edit the secret iaf-system-cluster-ca-cert and add a copy of the old ca.crt file in PEM format as ca-<expiry_date>.crt, where <expiry_date> is the certificate expiry date in the format YEAR-MONTH-DAYTHOUR-MINUTE-SECONDS.

    • The expiry date can be retrieved by using the openssl command:

      openssl x509 -enddate -noout -in <path_to_ca>
      

      For example, ca-2018-09-27T17-32-00Z.crt.

  3. Follow the steps for v1.2+ AutomationBase below.
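As an illustration of step 2, the following sketch turns the notAfter line printed by the openssl command into a key name that matches the example ca-2018-09-27T17-32-00Z.crt. The function name expiry_to_key and the file name old-ca.crt are hypothetical, and the sketch assumes GNU date (the -d option); other date implementations need a different invocation.

```shell
# Convert the "notAfter=..." line from `openssl x509 -enddate -noout`
# into a key name of the form ca-<expiry_date>.crt.
# Assumption: GNU date is available (uses -d to parse the openssl date).
expiry_to_key() {
  enddate="${1#notAfter=}"   # strip the "notAfter=" prefix
  stamp="$(date -u -d "$enddate" '+%Y-%m-%dT%H-%M-%SZ')" || return 1
  printf 'ca-%s.crt\n' "$stamp"
}

# Example (old-ca.crt is a hypothetical local copy of the old CA):
#   expiry_to_key "$(openssl x509 -enddate -noout -in old-ca.crt)"
```

The resulting name is what you would use for the extra key when you edit the secret.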

If you are using v1.2+ of AutomationBase, then:

  1. Ensure that the secret iaf-system-cluster-ca-cert contains a copy of the new CA certificate in the ca.crt field and a copy of the old CA certificate in ca-<expiry_date>.crt.
  2. Restart all the Zookeeper pods (iaf-system-zookeeper-*) one at a time. Wait for each pod to become ready before you restart the next one.
  3. Restart all the Kafka pods (iaf-system-kafka-*) one at a time. Wait for each pod to become ready before you restart the next one.
  4. Restart the entity Operator pod (iaf-system-entity-operator-*) and wait for it to become ready.
  5. If you have Apicurio installed, restart the Apicurio pod (iaf-system-apicurio-*) and wait for it to become ready.
  6. Follow the steps below to renew the leaf certificates for Kafka.
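The one-at-a-time restarts of the Zookeeper and Kafka pods can be scripted. The following is a minimal sketch, not a supported procedure. It assumes kubectl access to the cluster (oc accepts the same arguments), a placeholder NAMESPACE that you must set, and that each pod is recreated under the same name after it is deleted. Pods that come back under a generated name, which might be the case for the entity Operator and Apicurio pods, need a different wait. The function name and the timeouts are illustrative choices.

```shell
# Sketch only: NAMESPACE is a placeholder and the timeouts are arbitrary.
NAMESPACE="${NAMESPACE:-my-namespace}"

# Restart every pod whose name starts with the given prefix, one at a time.
# Assumption: each pod is recreated under the same name after deletion.
# Check which pods the prefix matches before you rely on it.
restart_one_at_a_time() {
  prefix="$1"
  pods="$(kubectl get pods -n "$NAMESPACE" -o name | grep "^pod/$prefix")" || return 1
  for pod in $pods; do
    kubectl delete "$pod" -n "$NAMESPACE" || return 1
    # Poll until the replacement pod object exists, then wait for readiness.
    tries=60
    until kubectl get "$pod" -n "$NAMESPACE" >/dev/null 2>&1; do
      tries=$((tries - 1))
      [ "$tries" -gt 0 ] || return 1
      sleep 5
    done
    kubectl wait --for=condition=Ready "$pod" -n "$NAMESPACE" --timeout=300s || return 1
  done
}

# Example:
#   restart_one_at_a_time iaf-system-zookeeper-
#   restart_one_at_a_time iaf-system-kafka-
```

Because the function returns as soon as a pod fails to become ready, a failed restart stops the sequence before the next pod is touched.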

To renew the leaf certificates for Kafka, follow these steps:

You can start here if your CA certificates and keys are in the state that you want and have not expired.

  1. Delete the secret iaf-system-cluster-operator-certs.
  2. Wait for the secret iaf-system-cluster-operator-certs to be recreated, which can take a few minutes. If it takes too long, see the Note.
  3. Delete the secret iaf-system-zookeeper-nodes.
  4. Wait for the secret iaf-system-zookeeper-nodes to be recreated, which can take a few minutes. If it takes too long, see the Note.
  5. Restart all the Zookeeper pods (iaf-system-zookeeper-*) one at a time. Wait for each pod to become ready before you restart the next one.
  6. Delete the secret iaf-system-kafka-brokers.
  7. Wait for the secret iaf-system-kafka-brokers to be recreated, which can take a few minutes. If it takes too long, see the Note.
  8. Wait for all the Kafka pods to restart. This happens automatically and can take a few minutes.
  9. Delete the secret iaf-system-entity-operator-certs.
  10. Wait for the secret iaf-system-entity-operator-certs to be recreated, which can take a few minutes. If it takes too long, see the Note.
  11. Delete the secret iaf-system-cluster-ca-cert.
  12. If you are following the v1.0/v1.1 steps, reinstall the IBM Automation foundation Operator now.
  13. Restart the entity Operator pod (iaf-system-entity-operator-*) and wait for it to become ready.
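Several of the steps above pair a secret deletion with a wait for the secret to be recreated. A minimal sketch of that pair follows, assuming kubectl access (oc accepts the same arguments) and a placeholder NAMESPACE that you must set. The function name recreate_secret, the POLL_SECONDS variable, and the default of 60 tries are illustrative choices, not values from the product documentation.

```shell
# Sketch only: NAMESPACE is a placeholder for the namespace of your instance.
NAMESPACE="${NAMESPACE:-my-namespace}"

# Delete a secret, then poll until the operator has recreated it.
# $1 = secret name, $2 = maximum number of polls (default 60).
recreate_secret() {
  secret="$1"
  tries="${2:-60}"
  kubectl delete secret "$secret" -n "$NAMESPACE" || return 1
  while [ "$tries" -gt 0 ]; do
    if kubectl get secret "$secret" -n "$NAMESPACE" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-5}"   # 60 polls x 5 seconds is about five minutes
  done
  echo "secret $secret was not recreated in time; see the Note" >&2
  return 1
}

# Example:
#   recreate_secret iaf-system-cluster-operator-certs
```

The bounded poll makes a stalled step visible instead of waiting indefinitely, which is the point at which the Note applies.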

Note: If a step is taking too long, you can restart the ibm-events-operator-* pod in the ibm-common-services namespace.
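One way to perform the restart described in the Note is to delete the operator pod and let it be recreated. The following sketch assumes kubectl access (oc accepts the same arguments); the function name is hypothetical, and the pod is matched by name prefix because its generated suffix varies.

```shell
# Delete every pod in ibm-common-services whose name starts with
# ibm-events-operator- so that it is recreated.
restart_events_operator() {
  for pod in $(kubectl get pods -n ibm-common-services -o name \
                 | grep '^pod/ibm-events-operator-'); do
    kubectl delete "$pod" -n ibm-common-services || return 1
  done
}

# Example:
#   restart_events_operator
```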

You will now be able to connect to the Kafka instance successfully.