Troubleshooting upgrade on OpenShift

Review the following troubleshooting tips if you encounter a problem during an API Connect upgrade on OpenShift, including when API Connect is deployed as a component of IBM Cloud Pak for Integration (CP4I).

Subsystem is stuck in `Pending` state with a reason of `PreUpgradeCheckInProgress`

Before the subsystem microservices are upgraded, the operator triggers a set of pre-upgrade checks that must pass for the upgrade to proceed. If one or more of the checks fail, the subsystem status remains in `Pending` state with a reason of `PreUpgradeCheckInProgress`. Check the status condition of the subsystem CR to confirm that a pre-upgrade check failed. The `status.PreUpgradeCheck` property contains a summary of the failed checks. The full logs for the checks are available in the ConfigMap that is referenced in the `status.PreUpgradeCheck` property. The pre-upgrade checks retry automatically until they pass. If you cannot rectify the problem that causes a check to fail, open an IBM support case.
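
As a minimal sketch of inspecting that status, the following shell helper prints the status block of a subsystem CR so that you can look for the pending condition and the pre-upgrade check summary. The function name is illustrative, and the kind, CR name, and namespace in the usage comment are placeholders and assumptions rather than values from this topic.

```shell
# Print the status block of a subsystem CR.
# $1 = CR kind (for example ManagementCluster), $2 = CR name, $3 = namespace.
show_subsystem_status() {
  oc get "$1" "$2" -n "$3" -o jsonpath='{.status}'
}

# Usage (placeholder values; substitute your own):
#   show_subsystem_status ManagementCluster <cr_name> <apic_namespace>
```

The helper prints the whole status block rather than a single field, so it does not depend on the exact spelling of any property name.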

Incorrect `productVersion` of gateway pods after upgrade

When you upgrade on OpenShift, the rolling update might fail to start on gateway operand pods due to a gateway peering issue, so that the productVersion of the gateway pods is incorrect even though the reconciled version on the gateway CR displays correctly. This problem happens when there are multiple primary pods for gateway peering, and some of the peering pods are assigned to each primary. There can be only one primary pod for gateway-peering; the existence of multiple primaries prevents the upgrade from completing.

Verify that you have a gateway-peering issue with multiple primaries, and then resolve the issue by assigning a single primary for all gateway-peering pods. Complete the following steps:

Determine whether multiple primary pods are causing the problem:

Run one of the following commands to check the productVersion of each gateway pod.
```
oc get pods -n apic_namespace <gateway_pods> -o yaml | yq .metadata.annotations.productVersion
```
or
```
oc get pods -n apic_namespace <gateway_pods> -o custom-columns="productVersion:.metadata.annotations.productVersion"
```
where:
- apic_namespace is the namespace where API Connect is installed
- <gateway_pods> is a space-delimited list of the names of your gateway peering pods
Verify that the target version of API Connect supports the productVersion.
For information about supported versions of DataPower Gateway, see tapic_upgrade_OpenShift_consider.html#tapic_upgrade_OpenShift_consider__gwy_versions.

If any pod returns an incorrect value for the productVersion, check the DataPower operator logs for peering issues; look for messages similar to the following example:

{"level":"error","ts":"2023-10-20T16:31:47.824751077Z","logger":"controllers.DataPowerRollout","msg":"Unable to check primaries","Updater.id":"d3858eb3-4a34-4d72-8c51-e4290b6f9cda","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"mainWork","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:107"}
{"level":"error","ts":"2023-10-20T16:32:17.262171148Z","logger":"controllers.DataPowerRollout","msg":"Attempting to recover from failed updater","Updater.id":"5524b8b5-8125-492e-aad7-3e54d107fdbf","Request.Namespace":"apic","Request.Name":"production-gw-b725a391-4768-4ddc-a73a-d655cf806586","Request.Stage":"checkWorkStatus","error":"Previous updater was unable to complete work: Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).checkWorkStatus\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/utils.go:389\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/reconcile.go:46\ngithub.ibm.com/datapower/datapower-operator/controllers/datapower.(*DataPowerRolloutReconciler).Reconcile\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/controllers/datapower/datapowerrollout_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/jenkins/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235"}
{"level":"error","ts":"2023-10-20T16:32:17.856155403Z","logger":"controllers_datapowerrollout_gatewaypeering","msg":"Multiple primaries","DataPowerService":"production-gw","Namespace":"apic","Method":"isPrimary","podName":"production-gw-2","error":"Multiple primaries found: map[10.254.14.107:true 10.254.20.175:true]","stacktrace":"github.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).isPrimary\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:284\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout/gatewaypeering.(*GatewayPeeringPrimariesRolloutManager).IsPrimaryByPodName\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/gatewaypeering/gatewaypeering.go:320\ngithub.ibm.com/datapower/datapower-operator/internal/controllers/datapowerrollout.(*DataPowerRolloutUpdater).mainWork\n\t/home/jenkins/agent/workspace/datapower-operator_build_v1.6/internal/controllers/datapowerrollout/workers.go:105"}
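
To scan a long operator log for this condition, a small filter can help. The following sketch reads log lines on stdin and prints each distinct "Multiple primaries found" map once. The function name is illustrative, and the deployment name and namespace in the usage comment are assumptions; adjust them to your cluster.

```shell
# Read DataPower operator log lines on stdin and print each distinct
# "Multiple primaries found" map once. Prints nothing if there is no match.
find_multiple_primaries() {
  grep -o 'Multiple primaries found: map\[[^]]*\]' | sort -u
}

# Usage (deployment name and namespace are assumptions):
#   oc logs -n <operator_namespace> deployment/datapower-operator | find_multiple_primaries
```

If the filter prints a map with more than one address, those addresses are the pods that currently claim to be primary.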

If you confirmed that gateway-peering (multiple primaries) is the issue, complete the following steps to resolve the issue.

Attach to a gateway pod:

Run the following command to get a list of gateway pods:

oc get pods -l app.kubernetes.io/managed-by=datapower-operator

The response looks like the following example:

NAME                                                     READY   STATUS    RESTARTS   AGE
datapower-operator-conversion-webhook-585d95cf87-mktrv   1/1     Running   0          2d22h
datapower-operator-f4d8fbd85-5xt47                       1/1     Running   0          2d22h
production-gw-0                                          1/1     Running   0          2d22h
production-gw-1                                          1/1     Running   0          2d22h
production-gw-2                                          1/1     Running   0          2d22h

Select a pod from the list and run the following command to attach to it by NAME, where ns is the namespace where API Connect is installed:
```
oc -n ns attach -it <pod_NAME> -c datapower
```

Select the new primary:

Run the following commands to switch to the API Connect configuration:
```
config; switch apiconnect
```

Run the following command to show the current peering status:

show gateway-peering-status

Example response:

 Address      Name             Pending Offset     Link Primary Service port Monitor port Priority 
 ------------ ---------------- ------- ---------- ---- ------- ------------ ------------ -------- 
 10.254.12.45 api-probe        0       19114431   ok   no      16382        26382        2        
 10.254.12.45 gwd              0       1589848370 ok   no      16380        26380        2        
 10.254.12.45 rate-limit       0       19148496   ok   no      16383        26383        2        
 10.254.12.45 ratelimit-module 0       19117538   ok   no      16386        26386        2        
 10.254.12.45 subs             0       26840742   ok   no      16384        26384        2        
 10.254.12.45 tms              0       19115634   ok   no      16381        26381        2        
 10.254.12.45 tms-external     0       19116159   ok   no      16385        26385        2        
 10.254.28.61 api-probe        0       19114431   ok   no      16382        26382        1        
 10.254.28.61 gwd              0       1589849802 ok   yes     16380        26380        1        
 10.254.28.61 rate-limit       0       19148496   ok   yes     16383        26383        1        
 10.254.28.61 ratelimit-module 0       19117538   ok   yes     16386        26386        1        
 10.254.28.61 subs             0       26842695   ok   yes     16384        26384        1        
 10.254.28.61 tms              0       19117734   ok   yes     16381        26381        1        
 10.254.28.61 tms-external     0       19117965   ok   yes     16385        26385        1        
 10.254.36.32 api-probe        0       19114739   ok   yes     16382        26382        3        
 10.254.36.32 gwd              0       1589848370 ok   no      16380        26380        3        
 10.254.36.32 rate-limit       0       19148496   ok   no      16383        26383        3        
 10.254.36.32 ratelimit-module 0       19117538   ok   no      16386        26386        3        
 10.254.36.32 subs             0       26840742   ok   no      16384        26384        3        
 10.254.36.32 tms              0       19115634   ok   no      16381        26381        3        
 10.254.36.32 tms-external     0       19116159   ok   no      16385        26385        3

Take note of the IP address that shows yes in the Primary column for the largest number of functions; this IP address will be the new primary for all functions involved in gateway peering.
In this example, 10.254.28.61 is the primary for the largest number of functions (six of seven); however, api-probe is not primary on this pod (its primary is currently 10.254.36.32). In the following steps, set api-probe as primary on this pod.
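
If you save the peering status output to a file, the tally can be scripted. The following sketch assumes the column layout shown in the example (Address in column 1, Name in column 2, Primary in column 6). The function names and the file name in the usage comment are illustrative; they are not part of DataPower or the oc CLI.

```shell
# Read `show gateway-peering-status` output on stdin and print
# "<address> <count>" for each address that is primary for at least one
# function, highest count first.
count_primaries() {
  awk '$6 == "yes" { n[$1]++ }
       END { for (ip in n) print ip, n[ip] }' | sort -k2,2nr
}

# Read the same output on stdin and print the functions that are NOT
# primary on the chosen address ($1).
functions_to_switch() {
  awk -v ip="$1" '$1 == ip && $6 == "no" { print $2 }'
}

# Usage (peering-status.txt is a hypothetical file with the saved output):
#   count_primaries < peering-status.txt
#   functions_to_switch 10.254.28.61 < peering-status.txt
```

For the example output, the first function lists 10.254.28.61 with a count of 6 ahead of 10.254.36.32 with a count of 1, and the second prints api-probe.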

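For every function that needs the primary changed, attach to the pod that you selected as the new primary and set the function to primary:
1. Log out of the current pod by pressing Ctrl+P, and then pressing Ctrl+Q.
2. Run the following command to attach to the pod whose IP address you selected as the new primary. The `oc attach` command takes a pod name rather than an IP address; to map the IP address to a pod name, check the IP column in the output of `oc get pods -o wide`.
```
oc -n ns attach -it <pod_NAME> -c datapower
```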
3. Run the following command to switch to the API Connect configuration:
```
config; switch apiconnect
```
4. Run the following command to show the current peering status:
```
show gateway-peering-status
```
  Verify that the function is set to the wrong primary IP and requires updating (in the example, api-probe requires updating).
5. Run the following command to change the primary for the function:
```
gateway-peering-switch-primary <function>
```
  For example:
```
gateway-peering-switch-primary api-probe
```
6. Repeat this process for every function that needs its primary updated.
Verify that all functions are now primary on the same IP address by running the following command:
```
show gateway-peering-status
```
Log out of the gateway pod (Ctrl+P, then Ctrl+Q).

Upgrade fails with an error message about a missing cert-manager

Beginning in API Connect V10.0.7.0, the recommended cert-manager is the cert-manager Operator for Red Hat OpenShift. If you did not install the cert-manager Operator for Red Hat OpenShift as part of the upgrade, the upgrade fails with the following error message from the ibm-apiconnect pod:

cert-manager is required but not installed. Please install a cert-manager operator such as [cert-manager Operator for Red Hat OpenShift]. APIC instance being upgraded

Resolve the issue by completing the following steps:

Install the cert-manager operator for Red Hat OpenShift:
1. Log in to the OpenShift Container Platform web console.
2. Click Operators > OperatorHub.
3. In the filter box, type: cert-manager Operator for Red Hat OpenShift.
4. Select cert-manager Operator for Red Hat OpenShift and click Install.
5. On the Install Operator page, complete the following steps:
  1. Update the Update channel if needed. The channel defaults to stable-v1, which installs the latest stable release of the cert-manager Operator for Red Hat OpenShift.
  2. Select the Installed Namespace for the operator.
    The default operator namespace is cert-manager-operator; if that namespace doesn't exist, it is created for you.
  3. Select an Update approval strategy:
    - Automatic: allow Operator Lifecycle Manager (OLM) to automatically update the operator when a new version is available.
    - Manual: require a user with the appropriate credentials to approve all operator updates.
  4. Click Install.
Verify the new cert-manager installation by completing the following steps:
1. Click Operators > Installed Operators.
2. Verify that cert-manager Operator for Red Hat OpenShift is listed with a Status of Succeeded in the cert-manager-operator namespace.
3. Verify that cert-manager pods are up and running with the following command:
```
oc get pods -n cert-manager
```
  For a successful installation, the response looks like the following example:
```
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-bd7fbb9fc-wvbbt               1/1     Running   0          3m39s
cert-manager-cainjector-56cc5f9868-7g9z7   1/1     Running   0          4m5s
cert-manager-webhook-d4f79d7f7-9dg9w       1/1     Running   0          4m9s
```
Remove the obsolete IBM cert-manager operator by running the following command:
```
oc delete certmanagers.operator.ibm.com default
```

DataPower operator pod stuck waiting for lock removal

The DataPower operator cannot be upgraded while the datapower-operator pod is waiting for a lock to be removed. In this situation, the pod log contains messages similar to the following example:

{"level":"info","ts":"2021-03-08T19:29:53.432Z","logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":"2021-03-08T19:29:57.971Z","logger":"leader","msg":"Leader pod has been deleted, waiting for garbage collection to remove the lock."}

To resolve this problem, see the DataPower operator documentation.

You see the `denied: insufficient scope` error during an air-gapped deployment

Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped installation or upgrade.

Reason: This error occurs when a problem is encountered with the entitlement key that is used for obtaining images.

Solution: Obtain a new entitlement key by completing the following steps:

1. Log in to the IBM Container Library.
2. In the Container software library, select Get entitlement key.
3. After the Access your container software heading, click Copy key.
4. Copy the key to a safe location.

Apiconnect operator pod fails

Problem: During installation (or upgrade), the apiconnect operator fails with the following message:

panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request

goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
	operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
	ibm-apiconnect/cmd/manager/main.go:188 +0x4ee

Additional symptoms:

The apiconnect operator pod is in CrashLoopBackOff status.

Kube apiserver pods log the following information:

E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with:
 failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1:
 bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401

The IP logged here belongs to the package server pod present in the openshift-operator-lifecycle-manager namespace

Package server pods log the following error messages, which show that the /apis/packages.operators.coreos.com/v1 API call is being rejected with a 401 response:

E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I1122 18:10:25.614224 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370":

The problem is intermittent

Solution:

If you observe exactly the symptoms described, delete the package server pods in the openshift-operator-lifecycle-manager namespace.
The new package server pods that replace them log the 200 Success message for the same API call.
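
The deletion can be sketched as a small shell helper. The `app=packageserver` label selector is an assumption that is not stated in this topic; list the pods and their labels first, and adjust the selector or delete the pods by name if your cluster differs. The function name is illustrative.

```shell
# Delete the package server pods so that replacement pods are created.
# The app=packageserver label selector is an assumption; confirm it first with:
#   oc get pods -n openshift-operator-lifecycle-manager --show-labels
restart_packageserver() {
  oc delete pods -n openshift-operator-lifecycle-manager -l app=packageserver
}

# Usage:
#   restart_packageserver
#   oc get pods -n openshift-operator-lifecycle-manager
```

After the new pods are running, check their logs for the 200 response before you retry the installation or upgrade.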

Unexpected behavior in Cloud Manager and API Manager UIs after upgrade

Stale browser cache issues can cause problems after an upgrade. To remedy this problem, clear your browser cache, and open a new browser window.

Note: In the Help page of the Cloud Manager, API Manager, and API Designer user interfaces, there's a Product information tile that you can click to find out information about your product versions, as well as Git information about the package versions being used. Note that the API Designer product information is based on its associated management server, but the Git information is based on where it was downloaded from.