How to Automate TLS Certificate Rotation to Avoid Outages

End-to-end protection for data in transit without running into any TLS certificate expiry issues

Certificate expiration is a common cause of unplanned outages in websites and applications. Avoiding it takes automation and proper certificate lifecycle management.

In this post, we’ll share how you can make sure you have end-to-end protection for data in transit without running into any TLS certificate expiry issues. Specifically, we’ll share our experience running our microservices on IBM Cloud Kubernetes Service and configuring them to communicate internally over HTTPS. However, our approach can be applied more broadly whenever you are using TLS certificates that can be renewed or regenerated in an automated way.

How we manage our internal certificates

Our team is running a service on IBM Cloud Kubernetes Service and deploying multiple microservices. We wanted to ensure all the internal communication between the microservices is protected and encrypted using the HTTPS protocol.

Our microservices use internal domains, so we couldn’t get domain-validated (DV) certificates signed by publicly trusted certificate authorities. Instead, we decided to use self-signed certificates: a self-signed root certificate that signs the certificate of each microservice. To manage the lifecycle of these certificates and to trigger automated renewal and deployment, we used IBM Cloud Certificate Manager. We uploaded the first set of certificates that we generated to our Certificate Manager instance and used the Certificate Manager notifications feature as a trigger for certificate rotation. Certificate Manager monitors the expiration of certificates and sends proactive notifications before certificates expire. You can add a callback URL to which Certificate Manager will send these notifications. We pointed the callback URL at an IBM Cloud Function we wrote. The function triggers a process that generates a new set of certificates, deploys them to our Kubernetes cluster as Kubernetes Secrets, and uploads them to Certificate Manager for further monitoring and visibility.

[Figure: certificate rotation flow]

Setting things up

First, we created an instance of Certificate Manager. We have about 50 internal certificates per region where our service is deployed and a few more public certificates. To generate the internal certificates, we first generated a root certificate and key and used that to sign all the other certificates. We created a script that generates the certificates for us, imports the certificates to Certificate Manager, and then deletes the certificates and keys from the machine where they were generated. The script also creates Kubernetes secret resources for all the certificates and applies them to our clusters.

Certificate rotation

Now for the interesting part. We created a Cloud Function that is called in response to a notification event from Certificate Manager so that our script to regenerate and redeploy certificates runs before our certificates expire.

Verifying the payload

When a ‘cert_about_to_expire_reimport_required’ event arrives at the Cloud Function from Certificate Manager, we first need to verify that it has not been tampered with. To do that, we get the public key of our Certificate Manager instance and use it to verify that the payload originated from Certificate Manager.

Note: If you are managing multiple instances of Certificate Manager that all use the same Cloud Function, you can first decode the message and use the instance CRN in the message to get the correct public key. However, in this case, we recommend that you have a whitelist of your instances and verify the instance prior to getting the public key.
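
As a rough illustration of the checks described above, the following shell sketch decodes the payload, checks the instance CRN against an allowlist (whitelist), and verifies the signature. It rests on assumptions that this post does not confirm: that the notification carries an RS256-signed JWT, that the decoded payload stores the CRN in a field named `instance_crn`, and that the public key is available as a PEM file. The function names and the sample CRN are made up for the example.

```shell
#!/bin/sh
# Sketch only. Assumes an RS256-signed JWT whose payload has an
# "instance_crn" field; both are assumptions, not confirmed by this post.

# Hypothetical allowlist: one CRN per line, for the instances we own.
ALLOWED_INSTANCES="crn:v1:bluemix:public:cloudcerts:us-south:a/example-account:example-instance::"

# Decode one base64url segment (JWT segments drop the '=' padding).
b64url_decode() {
  local s
  s=$(printf '%s' "$1" | tr '_-' '/+')
  case $(( ${#s} % 4 )) in
    2) s="${s}==" ;;
    3) s="${s}=" ;;
  esac
  printf '%s' "$s" | base64 -d
}

# Print the instance CRN found in the (still unverified) JWT payload.
jwt_instance_crn() {
  local payload
  payload=$(printf '%s' "$1" | cut -d. -f2)
  b64url_decode "$payload" |
    sed -n 's/.*"instance_crn"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed only if the CRN ($1) is one of our own instances.
is_allowed_instance() {
  local crn
  for crn in $ALLOWED_INSTANCES; do
    [ "$crn" = "$1" ] && return 0
  done
  return 1
}

# Verify the RS256 signature of the JWT ($1) with a PEM public key file ($2).
verify_jwt() {
  local signed_part signature sig_file status
  signed_part=$(printf '%s' "$1" | cut -d. -f1,2)
  signature=$(printf '%s' "$1" | cut -d. -f3)
  sig_file=$(mktemp)
  b64url_decode "$signature" > "$sig_file"
  printf '%s' "$signed_part" |
    openssl dgst -sha256 -verify "$2" -signature "$sig_file" >/dev/null
  status=$?
  rm -f "$sig_file"
  return "$status"
}
```

A caller would read the CRN with `jwt_instance_crn`, reject the event unless `is_allowed_instance` succeeds, fetch the public key of that instance, and trust the payload only if `verify_jwt` returns 0. A real JSON parser would be more robust than the sed expression used here.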

Rotating the certificates

Now that we have verified and decoded the payload, we can see which certificates are about to expire and need to be recreated. There are two possible scenarios:

  • The soon-to-be-expired certificate is the root certificate: we need to create a new root certificate and key and recreate all other certificates with them.
  • The soon-to-be-expired certificate is not the root certificate: we need to get the root certificate and its private key from Certificate Manager and sign the new certificates with them.

Creating the certificates within the Cloud Function seemed expensive, so we decided to use one of our VMs. Our VMs are Jenkins agents, so we created a Jenkins job to recreate the relevant certificates and publish them to our clusters and to Certificate Manager. The Cloud Function triggers this Jenkins job. Since our VMs are in our private network, the Cloud Function cannot connect to them directly. Instead, the Cloud Function uses the GitHub API to update a file in a Git repository with the notification data, and this update triggers the Jenkins job.
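
The GitHub REST API offers a "create or update file contents" endpoint (PUT /repos/{owner}/{repo}/contents/{path}) that can serve as this kind of trigger. The following is a minimal sketch of such an update, not the actual Cloud Function code. `OWNER`, `REPO`, `FILE_PATH`, and `GITHUB_TOKEN` are placeholders, and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: OWNER, REPO, FILE_PATH and GITHUB_TOKEN are placeholders
# that must be set by the caller.

# Build the JSON request body.
# $1 = commit message (must not contain quotes or backslashes)
# $2 = new file content, e.g. the names of the expiring certificates
# $3 = blob SHA of the file version being replaced
build_update_body() {
  local encoded
  # The API expects the file content base64-encoded on a single line.
  encoded=$(printf '%s' "$2" | base64 | tr -d '\n')
  printf '{"message":"%s","content":"%s","sha":"%s"}' "$1" "$encoded" "$3"
}

# Send the update. $1 = new file content, $2 = current blob SHA of the file.
push_rotation_request() {
  local body
  body=$(build_update_body "Request certificate rotation" "$1" "$2")
  curl --fail --silent --show-error -X PUT \
    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    -H "Accept: application/vnd.github+json" \
    -d "$body" \
    "https://api.github.com/repos/${OWNER}/${REPO}/contents/${FILE_PATH}"
}
```

The blob SHA is required when the file already exists and can be read with a GET request on the same URL. A Jenkins job that polls the repository, or is notified by a webhook, then picks up the commit.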

The Jenkins job runs a script to create certificates and deploy them. Here, we need to distinguish between the two scenarios mentioned above. For the first scenario, we need to first create the root certificate. The root certificate, in our case, is a self-signed certificate that we create using the openssl CLI.

# Create the root CA certificate and key
openssl genrsa -out rootCA.key 2048;
openssl req -x509 -new -nodes -key rootCA.key -sha256 -subj "<root subject string>" -days 365 -out rootCA.pem;

In the second scenario, we need to get the root certificate and private key from our Certificate Manager instance (using the Certificate Manager API) and use them to sign the CSRs. To make it easier to find the correct root certificate, we used a specific naming convention. Each certificate’s name includes the type of certificate (e.g., internal or external), the name of the cluster where it is used, and the specific microservice name (or the word ‘root’ for root certificates): “<type>-<cluster name>-<microservice name>”.
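
To make the convention concrete, here is a small shell sketch that builds names in this format and uses the ‘root’ suffix to tell the two scenarios apart. The function names and the sample values are hypothetical, and the check assumes that no microservice name ends in "-root".

```shell
#!/bin/sh
# Sketch of the naming convention "<type>-<cluster name>-<microservice name>".

# Build a certificate name. $1 = type, $2 = cluster name,
# $3 = microservice name (or "root" for the root certificate).
cert_name() {
  printf '%s-%s-%s' "$1" "$2" "$3"
}

# Succeed if the name belongs to a root certificate.
is_root_cert() {
  case $1 in
    *-root) return 0 ;;
    *) return 1 ;;
  esac
}

# Given the names of all expiring certificates, print "all" when a root
# certificate is among them (first scenario) and "leaf-only" otherwise.
rotation_scope() {
  local name
  for name in "$@"; do
    if is_root_cert "$name"; then
      echo "all"
      return 0
    fi
  done
  echo "leaf-only"
}
```

For example, `rotation_scope internal-prod-billing internal-prod-root` prints `all`, because the root certificate is in the list and every certificate signed by it has to be recreated.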

Now that we have a root cert and key in hand, we can recreate the rest of the certificates. This is done with openssl as well.

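# Create a private key and a certificate signing request (CSR) for the microservice
openssl req -new -newkey rsa:2048 -nodes -subj "<certificate subject string>" -keyout <cert name>.key -out <cert name>.csr;
# Sign the CSR with the root CA; the resulting certificate is valid for 180 days
openssl x509 -req -in <cert name>.csr -out <cert name>.pem -CA rootCA.pem -CAkey rootCA.key -sha256 -CAcreateserial -days 180;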

Deployment

Now that we have all the certificates and private keys, we need to deploy them to our Kubernetes clusters and reimport the new certificates to Certificate Manager in place of the old ones.

Deploying to Kubernetes, in our case, means creating secret resources for the certificates and applying them to the cluster. We create a subfolder for our secrets (let’s call it `cert-secrets`), read each certificate and its matching private key, and create a secret resource in the correct format with the specific data. Once we have all the secret files ready, we use kubectl (Kubernetes CLI) to apply all files.

kubectl apply -f ./cert-secrets
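
The secret files that this command applies could be generated along the following lines. This is a sketch under assumptions: it uses the standard `kubernetes.io/tls` secret type with its `tls.crt` and `tls.key` fields, which may differ from the format of our own secrets, and the function name and the `./certs` input folder are hypothetical.

```shell
#!/bin/sh
# Sketch only: writes one kubernetes.io/tls Secret manifest per certificate.

# $1 = secret name, $2 = certificate file, $3 = private key file,
# $4 = output directory for the manifest
write_tls_secret() {
  mkdir -p "$4"
  # Secret data must be base64-encoded on a single line.
  cat > "$4/$1.yaml" <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $1
type: kubernetes.io/tls
data:
  tls.crt: $(base64 < "$2" | tr -d '\n')
  tls.key: $(base64 < "$3" | tr -d '\n')
EOF
}

# Example loop over "<cert name>.pem" / "<cert name>.key" pairs in ./certs:
# for pem in ./certs/*.pem; do
#   name=$(basename "$pem" .pem)
#   write_tls_secret "$name" "$pem" "./certs/$name.key" ./cert-secrets
# done
```

Because the manifests contain private keys, the output folder should be deleted once the secrets have been applied.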

Now that all certificates have been correctly deployed to their intended locations, we can reimport them to the Certificate Manager instance in place of the old ones. By using the above-mentioned naming convention, we can easily identify the certificates to be replaced and use the Certificate Manager APIs to do so.

If something fails, Certificate Manager allows us to get the previous version of certificates and keys so we can roll back a change.

You can get help with technical questions on Stack Overflow with the ‘ibm-certificate-manager’ tag, or you can find help for non-technical questions in IBM developerWorks with the ‘ibm-certificate-manager’ tag. For defect or support needs, use the support section in the IBM Cloud menu. We would love to hear your feedback!

To get started with Certificate Manager, check it out in the IBM Cloud catalog.

Software Developer

Carmel Schindelhaim

Offering Manager - Cloud Developer Services - Security
