Troubleshooting installation and upgrade on OpenShift

Review the following troubleshooting tips if you encounter a problem while installing or upgrading API Connect on OpenShift.

One or more pods are in CrashLoopBackoff or Error state, and report a certificate error in the logs

In rare cases, cert-manager might detect a certificate in a bad state immediately after it is issued, and then reissue the certificate. If a CA certificate has been issued twice, any certificate that was signed by the previously issued CA is left stale and cannot be validated by the newly issued CA. In this scenario, one of the following messages is displayed in the log:
  • javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
  • Error: unable to verify the first certificate
  • ERROR: openssl verify failed to verify the Portal CA tls.crt, ca.crt chain signed the Portal Server tls.crt cert
    
Resolve the problem by completing the following steps:
  1. Use apicops (v10 version 0.10.57+ required) to validate the certificates in the system:
    apicops upgrade:stale-certs -n <namespace>
  2. If any certificate that is managed by cert-manager fails the validation, delete the stale certificate secret:
    oc delete secret <stale-secret> -n <namespace>

    Cert-manager automatically generates a new certificate to replace the one you deleted.

  3. Use apicops to make sure all certificates can be verified successfully:
    apicops upgrade:stale-certs -n <namespace>
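If the openssl verification error shown earlier was reported, you can also check a regenerated secret directly. The following is a minimal sketch, assuming that the secret contains both ca.crt and tls.crt entries; substitute the secret name for the placeholder:
oc get secret <server_certificate_secret> -n <namespace> -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/ca.crt
oc get secret <server_certificate_secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/tls.crt
# The verify command reports OK when the chain is consistent again
openssl verify -CAfile /tmp/ca.crt /tmp/tls.crt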

Additional postgres replica created during installation or upgrade

If an installation or upgrade procedure stalls with more than 2 postgres replica deployments in an n3 profile, delete the pending replica deployment by completing the following steps:
  1. Get the name of the pending postgres deployments:
    oc get deploy -n <APIC_namespace> | grep postgres
  2. Delete the pending postgres replica deployment:
    oc delete deploy <replica_deployment_name> -n <APIC_namespace>
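If it is not obvious which deployment is pending, you can list only the postgres deployments whose ready replica count is below the desired count. The following one-liner is a sketch and assumes that the jq tool is available:
# Print postgres deployments that are not fully ready (the pending replica)
oc get deploy -n <APIC_namespace> -o json | jq -r '.items[] | select(.metadata.name | test("postgres")) | select((.status.readyReplicas // 0) < .spec.replicas) | .metadata.name'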

Operator upgrade fails with error from API Connect operator and Postgres operator

If you encounter the following error condition during the upgrade from an API Connect version earlier than 10.0.1.6-eus, complete the workaround steps to patch the pgcluster CR, and then upgrade the API Connect operator again.

  1. Check for the following errors:

    From the API Connect operator:

    upgrade cluster failed: Could not upgrade cluster: there exists an ongoing upgrade task: [minimum-mgmt-56616911-postgres-upgrade]. If you believe this is an error, try deleting this pgtask
    {"level":"info","ts":1659638932.470165,"logger":"UpgradeCluster: ","msg":"Postgres DB version is less than pgoVersion. Performing upgrade ","pgoVersion: ":"4.7.4","postgresDBVersion: ":"4.5.2","clusterName":"minimum-mgmt-56616911-postgres"}

    And from the postgres operator:

    time="2022-08-04T22:05:42Z" level=error msg="Namespace Controller: error syncing Namespace 'cp4i': unsuccessful pgcluster version check: Pgcluster.crunchydata.com "minimum-mgmt-56616911-postgres" is invalid: spec.backrestStorageTypes: Invalid value: "null": spec.backrestStorageTypes in body must be of type array: "null"" 

    The errors are caused by a null value for the backrestStorageTypes property in the pgcluster CR. The workaround is to patch the pgcluster CR and correct the property: if backrestStorageTypes is set to null under spec, remove the null value and add the appropriate value shown below. If backrestStorageTypes is not present in the pgcluster CR, add it with the value that matches the backup type in use.

  2. Get the pgcluster name and edit the CR:
    oc get pgcluster -n <APIC_namespace>
    oc edit pgcluster <pgcluster_name> -n <APIC_namespace> 
  3. Add a value for the backrestStorageTypes property in the spec: section:

    Example for S3 backups:

        backrestStorageTypes:
        - s3

    Example for local and SFTP backups:

        backrestStorageTypes:
        - posix
  4. Save and apply the CR with the :wq command.
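Alternatively, if you prefer not to edit the CR interactively, the same change can be applied with a single patch command. This is a sketch only; choose the value that matches your backup type:
# Example for local and SFTP backups; use ["s3"] instead for S3 backups
oc patch pgcluster <pgcluster_name> -n <APIC_namespace> --type merge -p '{"spec":{"backrestStorageTypes":["posix"]}}'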

Upgrade error when the CRD for the new certificate manager is not found

The certificate manager was upgraded in Version 10.0.4.0 and you might encounter an upgrade error if the CRD for the new certificate manager is not found; for example:

ibm-apiconnect-5fb55f5c5c-hdks4 ibm-apiconnect error {"level":"error","ts":1637587218.9319391,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"Certificate.cert-manager.io","error":"no matches for kind "Certificate" in version "cert-manager.io/v1"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/source/source.go:117\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:143\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:184\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/manager/internal.go:661"}

To resolve the error, delete the IBM API Connect operator and re-install it. If you find that the operator is stuck in the "deleting" state in the web console, delete it manually by running the following commands in the project (namespace) where the operator is located:

oc delete subscription ibm-apiconnect 
oc delete csv ibm-apiconnect.v2.4.0 

If you installed the operator in all namespaces, the project is called openshift-operators.
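If your installed operator version differs from the example, the CSV name differs too. The following is a minimal sketch for finding the exact CSV name and deleting it, run in the project where the operator is located:
# List the installed API Connect CSV to get its exact name
oc get csv | grep ibm-apiconnect
oc delete subscription ibm-apiconnect
oc delete csv <ibm-apiconnect_csv_name>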

Gateway pods not in sync with Management after upgrade

A rare failure can occur in which some API Connect management tasks are unable to run after an upgrade. Check whether the API Connect 'taskmanager' pods log an error message similar to the following example:

TASK: Stale claimed task set to errored state:

The errors start approximately 15 minutes after the upgrade, and repeat every 15 minutes for any stuck task. If these errors are reported, run the following command to restart all of the management-natscluster pods, for example:
oc -n <namespace> delete pod management-natscluster-1 management-natscluster-2 management-natscluster-3
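To confirm that the taskmanager pods are reporting the error, you can check their logs before restarting the NATS pods. The following commands are a sketch; the pod name is an example only:
oc get pods -n <namespace> | grep taskmanager
oc logs <management-taskmanager-pod> -n <namespace> | grep "Stale claimed task"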

You see the denied: insufficient scope error during an air-gapped deployment

Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped installation or upgrade.

Reason: This error occurs when a problem is encountered with the entitlement key used for obtaining images.

Solution: Obtain a new entitlement key by completing the following steps:

  1. Log in to the IBM Container Library.
  2. In the Container software library, select Get entitlement key.
  3. Under the Access your container software heading, click Copy key.
  4. Copy the key to a safe location.
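After you obtain the new key, authenticate to the IBM entitled registry again before you re-run the image mirroring. The following login command is a sketch, assuming podman and the standard cp.icr.io entitled registry with the cp user:
podman login cp.icr.io --username cp --password <entitlement_key>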

Portal db pods restart mysqld if the database state transfer takes more than 5 minutes

If the portal database state transfer from one db pod to another takes longer than 5 minutes, the db pod that is sending the data incorrectly concludes that the database process is stuck in a bad state, and restarts the database process. This situation typically happens if you have more than 10 or 12 sites, a slow network between the db pods (such as in a geographically distant multi-site high availability setup), or both.

In such cases, the following entry would be seen in one of the ready db pods or db containers:
dbstatus: ERROR: stuck in Donor mode for 5m31s!, restarting the database
To prevent this situation from happening, edit the top level CR by using the UI, and add the environment variable as shown in the following CR snippet:
spec:
...
  portal:
    template:
    - containers:
      - env:
        - name: DONOR_STALE_SECS
          value: "7200"
        name: db
      name: db
Then, you must delete all of the db pods to ensure that they pick up the new configuration and the database can start up correctly. For example:
kubectl delete pod <portal-mydc-db-0> <portal-mydc-db-1> <portal-mydc-db-2>
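After the pods restart, you can confirm that the db containers picked up the new environment variable. This check is a sketch only; the pod name is an example:
kubectl get pod <portal-mydc-db-0> -o jsonpath='{.spec.containers[?(@.name=="db")].env}'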

Issues when installing Drupal 8 based custom modules or sub-themes into the Drupal 9-based Developer Portal

From IBM® API Connect 10.0.1.4-eus, the Developer Portal is based on the Drupal 9 content management system. If you want to install Drupal 8 custom modules or sub-themes into the Drupal 9-based Developer Portal, you must ensure that they are compatible with Drupal 9, including any custom code that they contain, and that they do not use any deprecated APIs. Tools are available for checking your custom code, such as drupal-check on GitHub, which checks Drupal code for deprecations.
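For instance, a typical check with drupal-check might look like the following sketch; installation through Composer is an assumption, and the paths to your custom code will vary:
composer require mglaman/drupal-check --dev
./vendor/bin/drupal-check modules/custom themes/custom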

For example, any Developer Portal sites that contain modules or sub-themes that don't contain a Drupal 9 version declaration will fail to upgrade, and errors like the following output will be seen in the admin logs:
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: Checking theme: emeraldgreen
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Incompatible core_version_requirement '' found for emeraldgreen
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: Checking theme: rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Incompatible core_version_requirement '8.x' found for rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Found themes incompatible with Drupal 9: emeraldgreen rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: /tmp/restore_site.355ec8 is NOT Drupal 9 compatible
...
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: Checking module: custom_mod_1
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Incompatible core_version_requirement '' found for custom_mod_1
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: Checking module: custom_mod_2
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Incompatible core_version_requirement '8.x' found for custom_mod_2
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Found modules incompatible with Drupal 9: custom_mod_1 custom_mod_2
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: site1.com is NOT Drupal 9 compatible
To fix version compatibility errors, all custom modules and sub-themes should declare a core_version_requirement key in their *.info.yml file that indicates Drupal 9 compatibility. For example:
name: Example module
type: module
description: Purely an example
core: 8.x
core_version_requirement: '^8 || ^9'
package: Example module

# Information added by Drupal.org packaging script on 2020-05-31
version: '8.x-1.3'
project: 'example_module'
datestamp: 1590905415
This example specifies that the module is compatible with all versions of Drupal 8 and 9. For more information, see Let Drupal know about your module with an .info.yml file on the drupal.org website.

If you have a backup of a site that you need to restore, and are getting the version compatibility error, but the module or theme *.info.yml file cannot be changed easily, then you have two options. Either:

  • Add an environment variable into the portal CR for the admin container of the www pod, setting SKIP_D9_COMPAT_CHECK: "true". However, if you choose this method, you must be certain that all of the custom modules and themes for your sites are Drupal 9 compatible, because otherwise the sites might end up inaccessible after the upgrade or restore. On OpenShift or IBM Cloud Pak for Integration, edit the top level CR by using the UI, and add the environment variable as shown in the following CR snippet:
    spec:
    ...
      portal:
        template:
        - containers:
          - env:
            - name: SKIP_D9_COMPAT_CHECK
              value: "true"
            name: admin
          name: www
Or:
  • Extract the site backup, edit the relevant files inside it, and then tar the backup file again. Note that this procedure will overwrite the original backup file, so ensure that you keep a separate copy of the original file before you start the extraction. For example:
    1. mkdir /tmp/backup
    2. cd /tmp/backup
    3. tar xfz path_to_backup.tar.gz
    4. Edit the custom module and theme files to make them Drupal 9 compatible, and add the correct core_version_requirement setting.
    5. rm -f path_to_backup.tar.gz
    6. tar cfz path_to_backup.tar.gz .
    7. cd /
    8. rm -rf /tmp/backup

Upgrading a 3-node profile from 10.0.1.4-eus or earlier might result in some portal-db/www pods being stuck in the Pending state

IBM API Connect 10.0.1.4-ifix1-eus introduces a required pod anti-affinity rule, which means that in a 3-node profile deployment, all 3 db and www pods can run only if there are at least 3 running worker nodes. This rule can cause some upgrades to version 10.0.1.4-ifix1-eus to become stuck in the Pending state, in which case some extra steps are needed during the upgrade to work around the issue. See the following example for detailed information about the issue, and how to continue with the upgrade.

Important: You must have a backup of your current deployment before starting the upgrade.
The following example shows 3 worker nodes:
$ oc get nodes
NAME          STATUS   ROLES    AGE   VERSION
apimdev0103   Ready    worker   42m   v1.20.0
apimdev0129   Ready    worker   45m   v1.20.0
apimdev1066   Ready    worker   39m   v1.20.0

The pods have been scheduled across only 2 of the 3 worker nodes due to a transient problem with apimdev1066, as shown in the following pod list. Pods without persistent storage, such as nginx-X, can be rescheduled to apimdev1066 as soon as they are restarted, but any pods with persistent local storage, such as db-X and www-X, have to be rescheduled onto the same worker node as that is where their files live.

$ oc get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ejs-portal-nginx-84f57ffd8c-hbf66   1/1     Running   0          5m12s   888.16.109.208   apimdev0103   <none>           <none>
ejs-portal-nginx-84f57ffd8c-mvq96   1/1     Running   0          5m12s   888.16.142.215   apimdev0129   <none>           <none>
ejs-portal-nginx-84f57ffd8c-vpmtl   1/1     Running   0          5m12s   888.16.142.214   apimdev0129   <none>           <none>
ejs-portal-site1-db-0               2/2     Running   0          4m39s   888.16.109.209   apimdev0103   <none>           <none>
ejs-portal-site1-db-1               2/2     Running   0          6m37s   888.16.109.206   apimdev0103   <none>           <none>
ejs-portal-site1-db-2               2/2     Running   0          4m39s   888.16.142.216   apimdev0129   <none>           <none>
ejs-portal-site1-www-0              2/2     Running   0          4m9s    888.16.109.210   apimdev0103   <none>           <none>
ejs-portal-site1-www-1              2/2     Running   0          6m37s   888.16.142.213   apimdev0129   <none>           <none>
ejs-portal-site1-www-2              2/2     Running   0          4m9s    888.16.142.217   apimdev0129   <none>           <none>
ibm-apiconnect-75b47f9f87-p25dd     1/1     Running   0          5m12s   888.16.109.207   apimdev0103   <none>           <none>
The upgrade to version 10.0.1.4-ifix1-eus is started and the following pod list is observed:
$ oc get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ejs-portal-nginx-84f57ffd8c-hbf66   1/1     Running   0          10m     888.16.109.208   apimdev0103   <none>           <none>
ejs-portal-nginx-84f57ffd8c-mvq96   1/1     Running   0          10m     888.16.142.215   apimdev0129   <none>           <none>
ejs-portal-nginx-84f57ffd8c-vpmtl   1/1     Running   0          10m     888.16.142.214   apimdev0129   <none>           <none>
ejs-portal-site1-db-0               2/2     Running   0          10m     888.16.109.209   apimdev0103   <none>           <none>
ejs-portal-site1-db-1               0/2     Pending   0          91s     <none>           <none>        <none>           <none>
ejs-portal-site1-db-2               2/2     Running   0          2m41s   888.16.142.218   apimdev0129   <none>           <none>
ejs-portal-site1-www-0              2/2     Running   0          9m51s   888.16.109.210   apimdev0103   <none>           <none>
ejs-portal-site1-www-1              2/2     Running   0          12m     888.16.142.213   apimdev0129   <none>           <none>
ejs-portal-site1-www-2              0/2     Pending   0          111s    <none>           <none>        <none>           <none>
ibm-apiconnect-75b47f9f87-p25dd     1/1     Running   0          10m     888.16.109.207   apimdev0103   <none>           <none>
The pod list shows that db-2 has restarted, and has been rescheduled to apimdev0129 because there were no other db pods running on that node. However, db-1 and www-2 are both stuck in Pending state, because there is already a pod of the same type running on the worker node that is hosting the local storage that they are bound to. If you run a describe command on either pod, you see the following output:
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10s (x4 over 2m59s)  default-scheduler  0/3 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict.

To resolve this situation you need to delete the PVCs for each pod, and then delete the pod itself, so that Kubernetes will regenerate the PVCs and schedule the pod on the worker node that does not have the anti-affinity conflict.

Therefore, for the db-1 pod the following commands must be run:
$ oc get pvc | grep ejs-portal-site1-db-1
db-ejs-portal-site1-db-1        Bound    local-pv-fa445e30   250Gi      RWO            local-storage   15m
dblogs-ejs-portal-site1-db-1    Bound    local-pv-d57910e7   250Gi      RWO            local-storage   15m

$ oc delete pvc db-ejs-portal-site1-db-1 dblogs-ejs-portal-site1-db-1
persistentvolumeclaim "db-ejs-portal-site1-db-1" deleted
persistentvolumeclaim "dblogs-ejs-portal-site1-db-1" deleted

$ oc delete po ejs-portal-site1-db-1
pod "ejs-portal-site1-db-1" deleted
For the www-2 pod the following commands must be run:
$ oc get pvc | grep ejs-portal-site1-www-2
admin-ejs-portal-site1-www-2    Bound    local-pv-48799536   245Gi      RWO            local-storage   51m
backup-ejs-portal-site1-www-2   Bound    local-pv-a93f5607   245Gi      RWO            local-storage   51m
web-ejs-portal-site1-www-2      Bound    local-pv-facd4489   245Gi      RWO            local-storage   51m

$ oc delete pvc admin-ejs-portal-site1-www-2 backup-ejs-portal-site1-www-2 web-ejs-portal-site1-www-2
persistentvolumeclaim "admin-ejs-portal-site1-www-2" deleted
persistentvolumeclaim "backup-ejs-portal-site1-www-2" deleted
persistentvolumeclaim "web-ejs-portal-site1-www-2" deleted

$ oc delete po ejs-portal-site1-www-2
pod "ejs-portal-site1-www-2" deleted

If the PV has persistentVolumeReclaimPolicy: Delete set on it, no cleanup is necessary, because the old data will have been deleted on the worker node that is no longer running the db-1 and www-2 pods. However, if you are using local-storage in a Kubernetes installation, then there might be some cleanup to do to remove the old data from the worker node that had the anti-affinity conflict.
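You can check which reclaim policy applies to the released volumes before deciding whether manual cleanup is needed. The following is a minimal sketch, using one of the PV names from the earlier example:
oc get pv local-pv-fa445e30 -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'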

Kubernetes can now reschedule the pods. All pods with persistent storage are now spread across the available worker nodes, and the pods whose PVCs were deleted will get a full copy of the data from the existing running pods. The following pod list is now observed in our example:
$ oc get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ejs-portal-nginx-84f57ffd8c-f85wm   1/1     Running   0          30s     888.16.29.136    apimdev1066   <none>           <none>
ejs-portal-nginx-84f57ffd8c-k5klb   1/1     Running   0          103s    888.16.142.220   apimdev0129   <none>           <none>
ejs-portal-nginx-84f57ffd8c-lqhqs   1/1     Running   0          1m53s   888.16.109.212   apimdev0103   <none>           <none>
ejs-portal-site1-db-0               2/2     Running   0          6m43s   888.16.109.211   apimdev0103   <none>           <none>
ejs-portal-site1-db-1               2/2     Running   0          8m20s   888.16.29.134    apimdev1066   <none>           <none>
ejs-portal-site1-db-2               2/2     Running   0          14m     888.16.142.218   apimdev0129   <none>           <none>
ejs-portal-site1-www-0              2/2     Running   0          93s     888.16.109.213   apimdev0103   <none>           <none>
ejs-portal-site1-www-1              2/2     Running   0          3m55s   888.16.142.219   apimdev0129   <none>           <none>
ejs-portal-site1-www-2              2/2     Running   0          7m27s   888.16.29.135    apimdev1066   <none>           <none>
ibm-apiconnect-75b47f9f87-p25dd     1/1     Running   0          22m     888.16.109.207   apimdev0103   <none>           <none>

Apiconnect operator crashes

Problem: During installation (or upgrade), the Apiconnect operator crashes with the following message:

panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request

goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
	operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
	ibm-apiconnect/cmd/manager/main.go:188 +0x4ee
Additional symptoms:
  • The Apiconnect operator is in CrashLoopBackOff status
  • Kube apiserver pods log the following information:
    E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with:
     failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1:
     bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401
  • The IP logged here belongs to the package server pod present in the openshift-operator-lifecycle-manager namespace
  • Package server pods log that the /apis/packages.operators.coreos.com/v1 API call is being rejected with a 401 error:
    E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: 
    certificate signed by unknown authority I1122 18:10:25.614224 1 httplog.go:90] 
    verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 
    UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370":
  • Problem is intermittent
Solution:
  • If you find the exact symptoms as described, the solution is to delete the package server pods in the openshift-operator-lifecycle-manager namespace (see the sketch after this list).
  • New package server pods will log the 200 Success message for the same API call.
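The following is a minimal sketch of the deletion; the grep pattern assumes the default packageserver pod naming, and the pod names are placeholders:
oc get pods -n openshift-operator-lifecycle-manager | grep packageserver
oc delete pod <packageserver_pod_1> <packageserver_pod_2> -n openshift-operator-lifecycle-manager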

Disabling the Portal web endpoint check

When you create or register a Developer Portal service, the Portal subsystem checks that the Portal web endpoint is accessible. However, sometimes the endpoint cannot be reached, for example because of the complexity of public and private networks. The following example shows the errors that you might see in the portal-www pod, admin container logs, if the endpoint cannot be reached:
An error occurred contacting the provided portal web endpoint: example.com
The provided Portal web endpoint example.com returned HTTP status code 504
In this instance, you can disable the Portal web endpoint check so that the Developer Portal service can be created successfully.
To disable the endpoint check, complete the following update:
On Kubernetes, OpenShift, and IBM Cloud Pak for Integration
Add the following section to the Portal custom resource (CR) template:
spec:
  template:
  - containers:
    - env:
      - name: PORTAL_SKIP_WEB_ENDPOINT_VALIDATION
        value: "true"
      name: admin
    name: www
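After the CR change is applied and the www pods restart, you can confirm that the admin container has the new variable. This check is a sketch only; the StatefulSet name is a placeholder:
oc set env statefulset/<portal_www_statefulset> --list -c admin | grep PORTAL_SKIP_WEB_ENDPOINT_VALIDATION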