OpenShift Container Platform (OCP) upgrade fails from v4.14 to v4.16

Troubleshooting

Problem

An upgrade of an OpenShift Container Platform (OCP) cluster from version 4.14 to 4.16 was blocked by several issues. Initially, the cluster had pending certificate signing requests (CSRs), a degraded master Machine Config Pool (MCP), and a crashlooping config operator pod. After these issues were resolved, two infrastructure pods (etcd and openshift-kube-scheduler) on one of the master nodes remained in an 'Init' state, which prevented the upgrade from proceeding. The etcd pod issue was related to certificate expiration and initialization failures.

Symptom

The etcd pod (etcd-nonprod1ocpmaster2.tchtest.org) on master2 is stuck in 'Init:0/3' status and does not start. The openshift-kube-scheduler pod (openshift-kube-scheduler-nonprod1ocpmaster2.tchtest.org) is likewise stuck in 'Init:0/1' status. Neither pod becomes ready, and the cause is potentially related to certificate expiration. The problem is similar to a known issue documented in Red Hat solution 7124179.
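
The 'Init:0/3' status reported for a pod reflects how many of its init containers have completed. As a rough sketch only (not the tooling used in this case), the following Python takes data shaped like the output of 'oc get pods -n openshift-etcd -o json' and lists pods whose init containers have not all finished. The function name, the sample data, and the placeholder init container names are assumptions made for illustration, and restartable (sidecar) init containers are ignored for simplicity.

```python
def pods_stuck_in_init(pod_list):
    """Return (name, completed, total) for pods whose init containers
    have not all terminated with exit code 0."""
    stuck = []
    for pod in pod_list.get("items", []):
        total = len(pod.get("spec", {}).get("initContainers", []))
        statuses = pod.get("status", {}).get("initContainerStatuses", [])
        completed = sum(
            1
            for status in statuses
            if status.get("state", {}).get("terminated", {}).get("exitCode") == 0
        )
        if completed < total:
            stuck.append((pod["metadata"]["name"], completed, total))
    return stuck


# Hypothetical sample shaped like `oc get pods -n openshift-etcd -o json`.
# Init container names are placeholders, not the real etcd init containers.
sample_pods = {
    "items": [
        {
            "metadata": {"name": "etcd-nonprod1ocpmaster2.tchtest.org"},
            "spec": {
                "initContainers": [
                    {"name": "init-a"},
                    {"name": "init-b"},
                    {"name": "init-c"},
                ]
            },
            "status": {
                "initContainerStatuses": [
                    {"name": "init-a", "state": {"waiting": {"reason": "PodInitializing"}}},
                    {"name": "init-b", "state": {"waiting": {"reason": "PodInitializing"}}},
                    {"name": "init-c", "state": {"waiting": {"reason": "PodInitializing"}}},
                ]
            },
        },
        {
            "metadata": {"name": "etcd-example-master1"},
            "spec": {"initContainers": [{"name": "init-a"}]},
            "status": {
                "initContainerStatuses": [
                    {"name": "init-a", "state": {"terminated": {"exitCode": 0}}}
                ]
            },
        },
    ]
}

for name, completed, total in pods_stuck_in_init(sample_pods):
    print(f"{name}: Init:{completed}/{total}")
```

In practice the input would be loaded with json.load from the saved command output rather than defined inline.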

Cause

The upgrade was blocked by several problems in the customer's OCP cluster: pending CSRs, a degraded master MCP, and crashlooping pods. These problems must be resolved before the upgrade can proceed.
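
A degraded MCP can be confirmed from the conditions recorded on the MachineConfigPool objects. The sketch below assumes data shaped like the output of 'oc get mcp -o json' and reports every pool condition whose type ends in 'Degraded' and whose status is 'True'. The function name, the sample pools, and the sample message are hypothetical and are not taken from this case.

```python
def degraded_pools(mcp_list):
    """Return (pool name, condition type, message) for every
    *Degraded condition that is currently True."""
    degraded = []
    for pool in mcp_list.get("items", []):
        for cond in pool.get("status", {}).get("conditions", []):
            is_degraded_type = cond.get("type", "").endswith("Degraded")
            if is_degraded_type and cond.get("status") == "True":
                degraded.append(
                    (pool["metadata"]["name"], cond["type"], cond.get("message", ""))
                )
    return degraded


# Hypothetical sample shaped like `oc get mcp -o json`.
sample_mcps = {
    "items": [
        {
            "metadata": {"name": "master"},
            "status": {
                "conditions": [
                    {"type": "Updated", "status": "False"},
                    {"type": "Degraded", "status": "True", "message": "example message"},
                ]
            },
        },
        {
            "metadata": {"name": "worker"},
            "status": {
                "conditions": [
                    {"type": "Updated", "status": "True"},
                    {"type": "Degraded", "status": "False"},
                ]
            },
        },
    ]
}

for pool, cond_type, message in degraded_pools(sample_mcps):
    print(f"{pool}: {cond_type} ({message})")
```

The condition message usually points at the node or rendered config responsible, which is where the investigation of the underlying cause would start.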

Diagnosing The Problem

The following diagnostic steps were taken:
1. Reviewed the initial must-gather to identify pre-upgrade issues.
2. Identified pending CSRs and provided the command to approve them.
3. Investigated the degraded master MCP and fixed the underlying causes.
4. Applied fixes for the crashlooping config operator pod.
5. Verified cluster health in subsequent must-gathers to confirm readiness for the upgrade.
6. Provided guidance on backing up etcd before proceeding with the upgrade.
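
For the pending CSR step, a CSR is pending when it carries no Approved or Denied condition. The following is a minimal sketch, assuming data shaped like the output of 'oc get csr -o json', that lists pending CSRs and prints an approval command for each. The exact command provided in this case is not recorded here; 'oc adm certificate approve' is assumed as the standard approval command, and the function name and sample CSR names are hypothetical.

```python
def pending_csr_names(csr_list):
    """Return the names of CSRs that have no status conditions yet,
    meaning they are neither approved nor denied."""
    return [
        csr["metadata"]["name"]
        for csr in csr_list.get("items", [])
        if not csr.get("status", {}).get("conditions")
    ]


# Hypothetical sample shaped like `oc get csr -o json`.
sample_csrs = {
    "items": [
        {"metadata": {"name": "csr-example1"}, "status": {}},
        {
            "metadata": {"name": "csr-example2"},
            "status": {"conditions": [{"type": "Approved", "status": "True"}]},
        },
    ]
}

# Print one approval command per pending CSR for an administrator to review.
for name in pending_csr_names(sample_csrs):
    print(f"oc adm certificate approve {name}")
```

Printing the commands instead of running them keeps approval a deliberate step, since each pending CSR should be checked before it is approved.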

Resolving The Problem

The OCP cluster upgrade from v4.14 to v4.16 was unblocked by resolving the issues identified during the proactive assessment. The steps included:
1. Reviewing the cluster health using 'omc' commands and must-gather data.
2. Identifying and resolving issues on the master nodes, specifically using the Red Hat solution for one of the nodes (https://access.redhat.com/solutions/6990188).
3. Confirming all 11 nodes were in a 'Ready' state and all Cluster Operators (COs) were ready.
4. Verifying that the Cluster Version, CSRs, CSVs, PVs, PVCs, MCPs, StatefulSets, and Pods were in the expected state.
5. Proceeding with the upgrade on the scheduled date after validation.
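
The node and Cluster Operator checks in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this case: it expects data shaped like the output of 'oc get nodes -o json' and 'oc get co -o json', treats a node as ready when its 'Ready' condition is 'True', and treats an operator as healthy when it is Available, not Degraded, and not Progressing. All function names and sample objects are hypothetical.

```python
def condition_status(obj, cond_type):
    """Return the status string of the named condition, or None if absent."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def not_ready_nodes(node_list):
    """Return names of nodes whose Ready condition is not True."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if condition_status(node, "Ready") != "True"
    ]


def unhealthy_operators(co_list):
    """Return names of operators that are not Available, or are
    Degraded, or are still Progressing."""
    unhealthy = []
    for co in co_list.get("items", []):
        healthy = (
            condition_status(co, "Available") == "True"
            and condition_status(co, "Degraded") == "False"
            and condition_status(co, "Progressing") == "False"
        )
        if not healthy:
            unhealthy.append(co["metadata"]["name"])
    return unhealthy


def conds(**kwargs):
    """Build a conditions list from keyword arguments (sample data helper)."""
    return [{"type": t, "status": s} for t, s in kwargs.items()]


# Hypothetical samples shaped like `oc get nodes -o json` and `oc get co -o json`.
sample_nodes = {
    "items": [
        {"metadata": {"name": "example-master1"}, "status": {"conditions": conds(Ready="True")}},
        {"metadata": {"name": "example-master2"}, "status": {"conditions": conds(Ready="False")}},
    ]
}
sample_operators = {
    "items": [
        {
            "metadata": {"name": "etcd"},
            "status": {"conditions": conds(Available="True", Degraded="True", Progressing="False")},
        },
        {
            "metadata": {"name": "kube-scheduler"},
            "status": {"conditions": conds(Available="True", Degraded="False", Progressing="False")},
        },
    ]
}

print("Nodes not ready:", not_ready_nodes(sample_nodes))
print("Operators needing attention:", unhealthy_operators(sample_operators))
```

Both lists should be empty before the upgrade is started.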

Document Location

Worldwide

