Troubleshooting
Problem
During the upgrade of an OpenShift Container Platform (OCP) cluster from version 4.14 to 4.16, several issues were encountered. Initially, the cluster had pending CSRs, a degraded Master Machine Config Pool (MCP), and a crashlooping config operator pod. After resolving these issues, two infrastructure pods (etcd and openshift-kube-scheduler) on one of the master nodes remained in an 'Init' state, preventing the upgrade from proceeding. The etcd pod issue was related to certificate expiration and initialization failures.
Symptom
The etcd pod (etcd-nonprod1ocpmaster2.tchtest.org) on master2 is in 'Init:0/3' status and is not starting properly. The openshift-kube-scheduler pod (openshift-kube-scheduler-nonprod1ocpmaster2.tchtest.org) is also in 'Init:0/1' status. The issue is potentially related to certificate expiration and is affecting the readiness of these pods. The problem is similar to a known issue documented in Red Hat's solution 7124179.
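A status such as 'Init:0/3' indicates that none of the pod's three init containers has completed yet. The following is a minimal Python sketch, not part of the original case, that derives the same kind of summary from pod JSON (for example, the output of `oc get pods -n openshift-etcd -o json`). The function names are hypothetical, and the logic is a simplification of how the CLI computes the status column.

```python
def init_progress(pod):
    """Return (completed, total) init containers for one pod object."""
    total = len(pod.get("spec", {}).get("initContainers", []))
    completed = 0
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode") == 0:
            completed += 1
        else:
            # Init containers run in order, so stop at the first unfinished one.
            break
    return completed, total


def stuck_in_init(pod_list):
    """List (name, 'Init:x/y') for pods whose init containers have not all finished."""
    stuck = []
    for pod in pod_list.get("items", []):
        done, total = init_progress(pod)
        if total and done < total:
            stuck.append((pod["metadata"]["name"], f"Init:{done}/{total}"))
    return stuck
```

Passing the parsed JSON of a pod list to `stuck_in_init` returns only the pods that are still initializing, which makes it easier to spot the affected pods on a single master node.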
Cause
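The upgrade was blocked by several pre-existing problems in the customer's OCP cluster: pending CSRs, a degraded Master MCP, and a crashlooping config operator pod. These had to be resolved before the upgrade could proceed. The etcd and openshift-kube-scheduler pods that then remained in 'Init' on master2 were potentially related to certificate expiration.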
Diagnosing The Problem
- Reviewed the initial must-gather to identify pre-upgrade issues.
- Identified pending CSRs and provided the command to approve them.
- Resolved the degraded Master MCP by investigating and fixing the underlying causes.
- Addressed the crashlooping config operator pod by applying the relevant fixes.
- Verified cluster health in subsequent must-gathers to ensure readiness for upgrade.
- Provided guidance on backing up etcd before proceeding with the upgrade.
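As an illustration of the pending-CSR check in the list above, here is a minimal Python sketch, assuming the `oc` CLI is installed and logged in to the cluster. It treats a CSR as pending when it has neither an Approved nor a Denied condition. The function names are hypothetical, and this is not the exact command that was provided in the case.

```python
import json
import subprocess


def pending_csr_names(csr_list):
    """Return names of CSRs that carry no Approved or Denied condition."""
    names = []
    for item in csr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


def fetch_csrs():
    """Read all CSRs from the cluster that the current `oc` session points at."""
    result = subprocess.run(
        ["oc", "get", "csr", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `pending_csr_names(fetch_csrs())` yields the names to review. Each one can then be approved with `oc adm certificate approve <name>` once it has been confirmed as legitimate.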
Resolving The Problem
The OCP cluster upgrade from v4.14 to v4.16 was able to proceed once the issues identified during the proactive assessment had been resolved. The steps were:
1. Reviewing the cluster health using 'omc' commands and must-gather data.
2. Identifying and resolving issues on the master nodes, specifically using the Red Hat solution for one of the nodes (https://access.redhat.com/solutions/6990188).
3. Confirming all 11 nodes were in a 'Ready' state and all Cluster Operators (COs) were ready.
4. Verifying that the Cluster Version, CSRs, CSVs, PVs, PVCs, MCPs, StatefulSets, and Pods were in the expected state.
5. Proceeding with the upgrade on the scheduled date after validation.
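The node and Cluster Operator checks in the steps above can be sketched in Python as follows. This is an illustrative example under stated assumptions, not the procedure used in the case: the inputs are the parsed JSON from `oc get nodes -o json` and `oc get clusteroperators -o json`, the function names are hypothetical, and an operator is treated as healthy only when it is Available, not Progressing, and not Degraded.

```python
def condition_status(obj, cond_type):
    """Return the status string of the named condition, or None if absent."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def not_ready_nodes(node_list):
    """Names of nodes whose Ready condition is anything other than 'True'."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if condition_status(node, "Ready") != "True"
    ]


def unhealthy_operators(co_list):
    """Names of Cluster Operators that are unavailable, progressing, or degraded."""
    bad = []
    for co in co_list.get("items", []):
        healthy = (
            condition_status(co, "Available") == "True"
            and condition_status(co, "Progressing") == "False"
            and condition_status(co, "Degraded") == "False"
        )
        if not healthy:
            bad.append(co["metadata"]["name"])
    return bad
```

Both functions returning empty lists corresponds to the state described in step 3, where every node is 'Ready' and every Cluster Operator is healthy.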
Document Location
Worldwide
Document Information
Modified date:
18 December 2025
UID
ibm17254870