Troubleshooting upgrades on Kubernetes

Review the following troubleshooting guidance if you encounter a problem while you are upgrading API Connect on Kubernetes.

Subsystem is stuck in Pending state with a reason of PreUpgradeCheckInProgress

Before the subsystem microservices are upgraded, the operator triggers a set of pre-upgrade checks that must pass for the upgrade to proceed. If one or more of the checks fail, the subsystem status remains in Pending state with a reason of PreUpgradeCheckInProgress. Check the status condition of the subsystem CR to confirm the pre-upgrade check failed. The status.PreUpgradeCheck property contains a summary of the failed checks. Full logs for the checks that are carried out can be viewed in the ConfigMap referenced in the status.PreUpgradeCheck property. The pre-upgrade checks automatically retry until they successfully pass. If you are unable to rectify the problem that causes a check to fail, then open an IBM support case.
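For example, you can inspect the subsystem CR status and the referenced ConfigMap with commands similar to the following sketch. The ManagementCluster kind and the CR name management are taken from the license example later in this topic; substitute the kind and name of the subsystem that is stuck, and the ConfigMap name that is reported in status.PreUpgradeCheck:

# View the status conditions and the status.PreUpgradeCheck summary for the subsystem CR
kubectl -n <namespace> get managementcluster management -o yaml

# View the full pre-upgrade check logs in the ConfigMap that status.PreUpgradeCheck references
kubectl -n <namespace> get configmap <preupgradecheck-configmap> -o yaml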

License webhook error

If you did not update the license ID in the CR, then when you save your changes, the following webhook error might display:
admission webhook "vmanagementcluster.kb.io" denied the request: 
ManagementCluster.management.apiconnect.ibm.com "management" is invalid: 
spec.license.license: Invalid value: "L-RJON-BYGHM4": 
License L-RJON-BYGHM4 is invalid for the chosen version 10.0.8.1. 
Please refer license document https://ibm.biz/apiclicenses

To resolve the error, see API Connect licenses for the list of available license IDs and select the appropriate license ID for your deployment. Update the CR with the new license value, as in the following example, and then save and apply your changes again.
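For example, the license section of the ManagementCluster CR might look similar to the following sketch. The license ID shown is a placeholder, and only the spec.license.license field is named in the webhook error, so treat the surrounding fields as illustrative:

spec:
  license:
    accept: true
    license: <valid-license-id>   # replace with a license ID from https://ibm.biz/apiclicenses

Then reapply the CR, for example (the file name is a placeholder for your CR file):

kubectl -n <namespace> apply -f <management-cr-file>.yaml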

Taskmanager error syncing management and gateway

If the gateways are not in sync with management after the upgrade, check whether the management subsystem taskmanager pods are logging the following error message. The message first appears 15 minutes after the upgrade and repeats every 15 minutes for any stuck task.
TASK: Stale claimed task set to errored state:
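To check for the message, you can search the taskmanager pod logs; the pod name below is a placeholder for one of your management taskmanager pods:

kubectl -n <namespace> logs <management-taskmanager-pod> | grep "Stale claimed task"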
If these errors are reported, restart all the management-natscluster pods, for example: management-natscluster-1.
kubectl -n <namespace> delete pod management-natscluster-1 management-natscluster-2 management-natscluster-3

DataPower operator fails to start

There is a known issue on Kubernetes that can cause the DataPower operator to fail to start. In this case, the DataPower operator pods can fail to schedule, and display the status message: no nodes match pod topology spread constraints (missing required label). For example:
0/15 nodes are available: 12 node(s) didn't match pod topology spread constraints (missing required label), 
3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
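The scheduling message is also visible in the pod events; for example (the pod name is a placeholder):

kubectl -n <namespace> describe pod <datapower-operator-pod>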
You can work around the issue by editing the DataPower operator deployment and reapplying it, as follows:
  1. Delete the DataPower operator deployment, if it is already deployed:
    kubectl delete -f ibm-datapower.yaml -n <namespace>
  2. Open ibm-datapower.yaml, and locate the topologySpreadConstraints: section. For example:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: zone
      whenUnsatisfiable: DoNotSchedule
  3. Replace the values for topologyKey: and whenUnsatisfiable: with the corrected values that are shown in the following example:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
  4. Save ibm-datapower.yaml and deploy the file to the cluster:
    kubectl apply -f ibm-datapower.yaml -n <namespace>
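After you reapply the deployment, confirm that the operator pods are scheduled and running; for example (the pod name prefix is an assumption, so adjust it to match your deployment):

kubectl -n <namespace> get pods | grep datapower-operator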

Unexpected behavior in Cloud Manager and API Manager UIs after upgrade

Stale browser cache issues can cause problems after an upgrade. To remedy this problem, clear your browser cache, and open a new browser window.

Note: On the Help page of the Cloud Manager, API Manager, and API Designer user interfaces, there is a Product information tile that you can click to find information about your product versions, as well as Git information about the package versions in use. Note that the API Designer product information is based on its associated management server, but the Git information is based on where it was downloaded from.

Portal sites failed to be upgraded successfully

If one or more of the Portal sites fail to upgrade successfully, check the portal-www pod admin container logs to see what prevented the site upgrade from completing.
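For example, to view the admin container logs (the pod name is a placeholder for one of your portal-www pods):

kubectl -n <namespace> logs <portal-www-pod> -c admin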

To trigger the site upgrade again, exec into the admin container of the portal-www pod and run the following command, as shown in the example after this list:

upgrade_devportal -s <site_uuid> -p <platform>
where:
  • <site_uuid> can be obtained by running the command: list_sites
  • <platform> can be obtained by running the command: list_platforms
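For example, a complete sequence might look like the following sketch (the pod name is a placeholder):

kubectl -n <namespace> exec -it <portal-www-pod> -c admin -- bash
list_platforms
list_sites
upgrade_devportal -s <site_uuid> -p <platform>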