Troubleshooting installation and upgrades on VMware

Review the following troubleshooting guidance if you encounter a problem while installing or upgrading API Connect on VMware.

The cloud-final.service fails during upgrade

Sometimes during an upgrade, an issue with the cloud-final.service on a node causes the appliance-manager to enter a bad state.

To determine whether your upgrade encountered this problem, complete the following steps:
  1. Check the output of the apic health-check command for a result similar to the following example:
    # apic health-check
    INFO[0000] Log level: info
    FATA[0000] Unable to cluster status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 9.20.153.38:9178: connect: connection refuse
  2. Check whether the output of the journalctl -u appliance-manager | grep cloud-final command looks like the following example:
    # journalctl -u appliance-manager | grep cloud-final
    Nov 21 19:41:58 apimdev1040 apic[2569]: Job for cloud-final.service failed because the control process exited with error code.
    Nov 21 19:41:58 apimdev1040 apic[2569]: See "systemctl status cloud-final.service" and "journalctl -xe" for details.
    
If you determined that this was the problem, you can correct it by running each of the following commands on every node:
  1. systemctl restart cloud-final.service
  2. apic lock
  3. apic unlock

Some containers in the kube-system namespace show a status of ErrImageNeverPull

When upgrading, some of the containers in the kube-system namespace might show a status of ErrImageNeverPull. This happens because Docker did not successfully load all of the images from the upgrade .tgz file. To resolve this issue and enable the upgrade to proceed, complete the following steps:

  1. Run the following command to determine which node is missing the control plane files:
    kubectl -n kube-system get pods -owide

    This command returns the names of the nodes containing pods with the ErrImageNeverPull status, which indicates missing control plane files.

  2. On the node that is missing the control plane files, run the following command to determine which versions of the control plane are missing:
    cat /var/lib/apiconnect/appliance-control-plane-current
  3. For each missing control plane, run the following command on the same node to add it, replacing <version> with the version of the control plane:
    docker load < /usr/local/lib/appliance-control-plane/<version>/kubernetes.tgz
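
If more than one control plane version is listed, the load can be scripted; the following is a minimal sketch, assuming that the file lists one control plane version per line:

# Load every control plane version listed in the current-version file into Docker
while read -r version; do
  docker load < /usr/local/lib/appliance-control-plane/${version}/kubernetes.tgz
done < /var/lib/apiconnect/appliance-control-plane-current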

Pod stuck in Pending status during upgrade

When upgrading, the scheduler might deploy a subset of the same microservice pods on the same node. This can prevent other pods with requiredDuringSchedulingIgnoredDuringExecution affinity rules from being deployed due to a lack of resources on a subset of nodes. To allow the pending containers to be deployed successfully, identify any pods of the same type that are scheduled on the same node and delete one of them. This will free up space and cause the deleted pod to get rescheduled. To identify pods that are eligible to be deleted, and then delete the pods, complete the following steps:

  1. Run the following command and check for any pods of the same type that are on the same node:
    kubectl get po -o=custom-columns='name:.metadata.name, node:.spec.nodeName, antiaffinity:.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution' | grep -v '<none>' | awk '{print $1" "$2}'

    In the following example snippet, one of the two apim pods on the node test0186 should be deleted:

    stv3-management-analytics-proxy-56848c8c69-phpdh test0186
    stv3-management-analytics-proxy-56848c8c69-sc45f test0187
    stv3-management-analytics-proxy-56848c8c69-scf6g test0188
    stv3-management-apim-5574796948-6lnwj test0186
    stv3-management-apim-5574796948-h9dgx test0186
    stv3-management-apim-5574796948-tsb4g test0188
  2. Run the following command to delete a pod:
    kubectl delete po <pod_name>
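
To narrow down the output from step 1, you can strip the ReplicaSet and pod hash suffixes from the pod names and look for repeated type/node pairs; the following is a minimal sketch, assuming Deployment-managed pods whose names end in two generated suffixes:

# Print (pod type, node) pairs that occur more than once; these are candidates for deletion
kubectl get po --no-headers -o=custom-columns='name:.metadata.name,node:.spec.nodeName' \
  | awk '{ type = $1; sub(/-[a-z0-9]+-[a-z0-9]+$/, "", type); print type, $2 }' \
  | sort | uniq -d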

Database replica pods stuck in Unknown or Pending state

In certain scenarios, a postgres replica pod might not recover to a healthy state after a restore completes, after a node outage occurs, or after a fresh install or upgrade. In these cases, a postgres pod remains in an Unknown or Pending state after a number of minutes and fails to reach the Running state.

This situation occurs when the replicas do not initialize properly. You can use the patronictl reinit command to reinitialize the replica. Note that this command syncs the replica's volume data from the current Primary pod.

Use the following steps to get the pod back into a working state:

  1. SSH into the VM as root.

  2. Exec onto the failing pod:
    kubectl exec -it <postgres_replica_pod_name> -n <namespace> -- bash
  3. List the cluster members:
    patronictl list
    + Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
    |                          Member                         |      Host      |  Role  |    State     | TL | Lag in MB |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+
    |    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | start failed |    |   unknown |
    | fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader |   running    |  3 |           |
    |  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        |   running    |  3 |         0 |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+

    In the example shown above, fxpk-management-01191b80-postgres-586f899fdf-6s25b is not in the running state.

    Note the clusterName and the replicaName of the member that is not up:

    • clusterName - fxpk-management-01191b80-postgres
    • replicaName - fxpk-management-01191b80-postgres-586f899fdf-6s25b
  4. Run:
    patronictl reinit <clusterName> <replicaName-which-is-not-running>

    Example:

    patronictl reinit fxpk-management-01191b80-postgres fxpk-management-01191b80-postgres-586f899fdf-6s25b
    + Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
    |                          Member                         |      Host      |  Role  |    State     | TL | Lag in MB |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+
    |    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | start failed |    |   unknown |
    | fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader |   running    |  3 |           |
    |  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        |   running    |  3 |         0 |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+
    Are you sure you want to reinitialize members fxpk-management-01191b80-postgres-586f899fdf-6s25b? [y/N]: y
    Success: reinitialize for member fxpk-management-01191b80-postgres-586f899fdf-6s25b
    
  5. Run patronictl list again.

    You might also observe that the replica is on a different timeline (TL) and possibly shows a value in the Lag in MB column. It can take a few minutes for the pod to switch onto the same TL as the others, and the lag should slowly go to 0.

    For example:

    bash-4.2$ patronictl list
    + Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+---------+----+-----------+
    |                          Member                         |      Host      |  Role  |  State  | TL | Lag in MB |
    +---------------------------------------------------------+----------------+--------+---------+----+-----------+
    |    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | running |  1 |     23360 |
    | fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running |  3 |           |
    |  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        | running |  3 |         0 |
    +---------------------------------------------------------+----------------+--------+---------+----+-----------+
    
  6. The pod that was previously in an Unknown, Pending, or (0/1) Running state is now in the (1/1) Running state.
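
To confirm step 6 from outside the pod, you can watch the replica return to a (1/1) Running state; a minimal check, where the pod and namespace names are placeholders:

kubectl get pod <postgres_replica_pod_name> -n <namespace> -w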

Additional postgres replica created during installation or upgrade

If an installation or upgrade procedure stalls with more than 2 postgres replica deployments in an n3 profile, delete the pending replica deployment by completing the following steps:
  1. SSH into the VM as root.

  2. Get the name of the pending postgres deployments:
    kubectl get deploy -n <namespace> | grep postgres
  3. Delete the pending postgres replica deployment:
    kubectl delete deploy <replica_deployment_name> -n <namespace>

etcd pod stuck in Terminating state

During an upgrade, you might see a health check report that the upgrade is stuck in the ETCD stage and that an etcd pod is stuck in the Terminating state.

Verify the issue by completing the following steps:
  1. SSH into the VM as root.

  2. Run the following command and verify that the upgrade stage is ETCD:
    apic status
  3. Run the following command to determine whether an etcd pod is stuck in the Terminating state:
    kubectl get pods -n <namespace> | grep etcd
Resolve the issue by completing the following steps:
  1. Run the following command to retrieve the names of the etcd pods:
    kubectl get pods -n <namespace> -o wide | grep etcd
  2. SSH as root into the VM that hosts the stuck pod.
  3. Run the following command to restart the pod:
    systemctl restart kubelet
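
After the kubelet restarts, you can confirm that the etcd pod is no longer stuck in the Terminating state by running the earlier check again, for example:

kubectl get pods -n <namespace> -o wide | grep etcd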

Management subsystem upgrade imagePullBackOff issue

When upgrading the management subsystem, for example from v10.0.1.6 to v10.0.1.11, the postgres pods might run into an imagePullBackOff issue, where they look for the v10.0.1.6 images that exist in the Docker registry but have not been loaded into the Containerd registry. To work around this issue, complete the following steps to manually move each image from the Docker registry to the Containerd registry:
  1. Save each of the following Docker images into a .tar file and keep the files in a secure location: ibm-apiconnect-management-k8s-init, crunchy-pgbackrest-repo, crunchy-pgbackrest, crunchy-pgbouncer, and crunchy-postgres-ha. For example, from v10.0.1.6:
    docker save 127.0.0.1:8675/10.0.1.6/crunchy-pgbouncer > crunchy-pgbouncer.tar
  2. Import each saved image into the Containerd registry. For example:
    ctr --namespace k8s.io image import crunchy-pgbouncer.tar --digests=true
  3. Re-tag the import-<datestamp>@sha256:nnn images to their appropriate ifix2 tags and digests. For example:
    ctr --namespace k8s.io image tag <source> <target>
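
Steps 1 and 2 can be combined into a small script; the following is a minimal sketch in which the registry address, version tag, and image list are taken from the examples above and might differ in your deployment:

# Assumed values from the examples above; adjust them for your deployment
REGISTRY=127.0.0.1:8675
VERSION=10.0.1.6
for IMAGE in ibm-apiconnect-management-k8s-init crunchy-pgbackrest-repo crunchy-pgbackrest crunchy-pgbouncer crunchy-postgres-ha; do
  # Export the image from the Docker registry into a local .tar file
  docker save ${REGISTRY}/${VERSION}/${IMAGE} > ${IMAGE}.tar
  # Import the saved image into the Containerd registry
  ctr --namespace k8s.io image import ${IMAGE}.tar --digests=true
done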

Portal db pods restart mysqld if the database state transfer takes more than 5 minutes

In IBM® API Connect Version 10.0.1.5-eus, if the portal database state transfer from one db pod to another takes longer than 5 minutes, then the db pod that is sending the data incorrectly concludes that the database process is stuck in a bad state, and restarts the database process. This situation typically happens if you have more than 10 or 12 sites, a slow network between the db pods (such as in a distant multi-site high availability setup), or both.

In such cases, the following entry would be seen in one of the ready db containers inside a db pod:
dbstatus: ERROR: stuck in Donor mode for 5m31s!, restarting the database
To prevent this situation from happening, add the following environment variable to your portal deployment.
  1. Create an extra values file to contain the environment variable, as follows:
    spec:
      template:
      - containers:
        - env:
          - name: DONOR_STALE_SECS
            value: "7200"
          name: db
        name: db
  2. Save the file as dbstate.yaml, and run the following command:
    apicup subsys set <portal_subsystem_name> extra-values-file dbstate.yaml
  3. Then, update the portal subsystem on VMware with the new setting by running the following command:
    apicup subsys install <portal_subsystem_name>
  4. To ensure that you get the new configuration and that the database can start up correctly, you must delete all of the db pods by running the following command:
    kubectl delete pod <portal-mydc-db-0> <portal-mydc-db-1> <portal-mydc-db-2>
    To run this command you must log in to the virtual machine portal subsystem by using an SSH tool.
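
After the db pods restart, one way to verify that the new setting reached the db containers is to print the container environment; a minimal check, where the pod name is a placeholder:

kubectl exec -it <portal-mydc-db-0> -c db -- env | grep DONOR_STALE_SECS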

Issues when installing Drupal 8 based custom modules or sub-themes into the Drupal 9 based Developer Portal

From IBM API Connect 10.0.1.4-eus, the Developer Portal is based on the Drupal 9 content management system. If you want to install Drupal 8 custom modules or sub-themes into the Drupal 9 based Developer Portal, you must ensure that they, including any custom code that they contain, are compatible with Drupal 9 and do not use any deprecated APIs. There are tools available for checking your custom code, such as drupal_check on GitHub, which checks Drupal code for deprecations.
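
One way to run such a check is sketched below; it assumes Composer is available and that your custom module lives in a web/modules/custom directory, so adjust the paths for your project:

# Install the checker as a development dependency of your Drupal project
composer require mglaman/drupal-check --dev
# Scan a custom module for code that is deprecated or removed in Drupal 9
./vendor/bin/drupal-check web/modules/custom/custom_mod_1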

For example, any Developer Portal sites that contain modules or sub-themes that don't contain a Drupal 9 version declaration will fail to upgrade, and errors like the following output will be seen in the admin logs:
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: Checking theme: emeraldgreen
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Incompatible core_version_requirement '' found for emeraldgreen
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: Checking theme: rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Incompatible core_version_requirement '8.x' found for rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Found themes incompatible with Drupal 9: emeraldgreen rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: /tmp/restore_site.355ec8 is NOT Drupal 9 compatible
...
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: Checking module: custom_mod_1
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Incompatible core_version_requirement '' found for custom_mod_1
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: Checking module: custom_mod_2
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Incompatible core_version_requirement '8.x' found for custom_mod_2
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Found modules incompatible with Drupal 9: emeraldgreen rubyred
[     queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: site1.com is NOT Drupal 9 compatible
To fix version compatibility errors, all custom modules and sub-themes should declare a core_version_requirement key in their *.info.yml file that indicates Drupal 9 compatibility. For example:
name: Example module
type: module
description: Purely an example
core: 8.x
core_version_requirement: '^8 || ^9'
package: Example module

# Information added by Drupal.org packaging script on 2020-05-31
version: '8.x-1.3'
project: 'example_module'
datestamp: 1590905415
This example specifies that the module is compatible with all versions of Drupal 8 and 9. For more information, see Let Drupal know about your module with an .info.yml file on the drupal.org website.

If you have a backup of a site that you need to restore, and are getting the version compatibility error, but the module or theme *.info.yml file cannot be changed easily, then you have two options. Either:

  • Add an environment variable into the portal CR for the admin container of the www pod stating SKIP_D9_COMPAT_CHECK: "true". However, if you choose this method, you must be certain that all of the custom modules and themes for your sites are Drupal 9 compatible, because otherwise the sites might become inaccessible after the upgrade or restore.
    1. On VMware, create an extra values file to contain the environment variable, as follows:
      spec:
        template:
        - containers:
          - env:
            - name: SKIP_D9_COMPAT_CHECK
              value: "true"
            name: admin
          name: www
      
    2. Save the file as d9compat.yaml, and run the following command:
      apicup subsys set <portal_subsystem_name> extra-values-file d9compat.yaml
    3. Then, update the portal subsystem on VMware with the new setting by running the following command:
      apicup subsys install <portal_subsystem_name>
Or:
  • Extract the site backup, edit the relevant files inside it, and then tar the backup file again. Note that this procedure will overwrite the original backup file, so ensure that you keep a separate copy of the original file before you start the extraction. For example:
    1. mkdir /tmp/backup
    2. cd /tmp/backup
    3. tar xfz path_to_backup.tar.gz
    4. Edit the custom module and theme files to make them Drupal 9 compatible, and add the correct core_version_requirement setting (see the sketch after this list for one way to find the files that need editing).
    5. rm -f path_to_backup.tar.gz
    6. tar cfz path_to_backup.tar.gz .
    7. cd /
    8. rm -rf /tmp/backup
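
For step 4, the following is a minimal sketch that lists the *.info.yml files in the extracted backup that do not declare a core_version_requirement key at all; it assumes the backup was extracted into /tmp/backup as in the previous steps:

# Print the info.yml files that never mention core_version_requirement
find /tmp/backup -name '*.info.yml' -print0 | xargs -0 grep -L 'core_version_requirement'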

Postgres pods fail to start after upgrade

When upgrading the management subsystem as part of Upgrading to 10.0.1.8-eus on VMware, you might encounter an error message when checking the subsystem health upon completion of the upgrade. For example:

apic health-check
INFO[0000] Log level: info
FATA[0006] Cluster not in good health:
ManagementCluster (current ha mode: active) is not ready | State: 15/16 Phase: Pending

To troubleshoot when a message like this occurs:

  1. Check the state of postgres pods:
    kubectl get pods | grep postgres

    For example:

    root@apimdev1146:~# kubectl get pods | grep postgres
    fxpk-management-fd8b0b1f-postgres-577594c7f-k54pk                 0/1     Init:CrashLoopBackOff   17         22h
    fxpk-management-fd8b0b1f-postgres-backrest-shared-repo-7fctp88w   1/1     Running                 2          22h 
    fxpk-management-fd8b0b1f-postgres-elbx-698f445649-rlc2g           0/1     Init:CrashLoopBackOff   16         22h
    fxpk-management-fd8b0b1f-postgres-pgbouncer-64f57b7cc7-52bk8      1/1     Running                 2          22h
    fxpk-management-fd8b0b1f-postgres-pgbouncer-64f57b7cc7-jjjvb      1/1     Running                 1          22h
    fxpk-management-fd8b0b1f-postgres-pgbouncer-64f57b7cc7-qp4zh      1/1     Running                 2          22h
    fxpk-management-fd8b0b1f-postgres-stanza-create-6pt6c             0/1     Completed               0          22h
    fxpk-management-fd8b0b1f-postgres-ubba-79ccdd5cc6-kj4zx           0/1     Init:CrashLoopBackOff   17         22h
    postgres-operator-85fb96db4b-gk8k8                                4/4     Running                 8          22h
    
  2. If any pods show the Init:CrashLoopBackOff status, restart the pods. To force a restart, delete the pods (a batch alternative is sketched after this list):
    kubectl delete pod <name_of_postgres_pod>

    For example:

    kubectl delete pod fxpk-management-fd8b0b1f-postgres-577594c7f-k54pk
    kubectl delete pod fxpk-management-fd8b0b1f-postgres-elbx-698f445649-rlc2g
    kubectl delete pod fxpk-management-fd8b0b1f-postgres-ubba-79ccdd5cc6-kj4zx 

    When pods are deleted, the deployment automatically restarts them.

  3. Re-run the health check. For example:
    apicup subsys health-check <subsys_name>
  4. When the health check is successful, return to Upgrading to 10.0.1.8-eus on VMware.
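
As a batch alternative to deleting the affected pods one at a time in step 2, the following sketch deletes every postgres pod that is currently in the Init:CrashLoopBackOff state; it assumes the same default namespace as in the examples above:

# Delete all postgres pods that are stuck in Init:CrashLoopBackOff; the deployments recreate them
kubectl get pods --no-headers | awk '/postgres/ && /Init:CrashLoopBackOff/ { print $1 }' | xargs -r kubectl delete pod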

Upgrading a 3 node profile to IBM API Connect 10.0.1.4-ifix1-eus might result in some portal-db/www pods being stuck in the Pending state

IBM API Connect 10.0.1.4-ifix1-eus introduces a required pod anti-affinity rule, which means that in a 3 node profile deployment all 3 db and www pods can run only if there are at least 3 running worker nodes. This rule can cause some upgrades from version 10.0.1.4-eus or earlier to become stuck in the Pending state, in which case some extra steps are needed during the upgrade to work around the issue. See the following example for detailed information about the issue, and how to continue with the upgrade.

If you are upgrading from 10.0.1.4-ifix1-eus or later, to 10.0.1.5-eus or later, this problem does not occur. The problem can occur only when upgrading from 10.0.1.4-eus or earlier.

Important: You must have a backup of your current deployment before starting the upgrade.
To run the commands that are needed to work around this issue, you must log in to the virtual machine portal subsystem by using an SSH tool.
  1. Run the following command to log in as apicadm, which is the API Connect ID that has administrator privileges:
    ssh portal_ip_address -l apicadm
    Where portal_ip_address is the IP address of the portal subsystem.
  2. Then get a root shell by running the following command:
    sudo -i
In the following example, the VMware stack has 3 VMs (worker nodes):
$ kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
apimdev0103   Ready    worker   42m   v1.20.0
apimdev0129   Ready    worker   45m   v1.20.0
apimdev1066   Ready    worker   39m   v1.20.0

The pods have been scheduled across only 2 of the 3 worker nodes due to a transient problem with apimdev1066, as shown in the following pod list. Pods without persistent storage, such as nginx-X, can be rescheduled to apimdev1066 as soon as they are restarted, but any pods with persistent local storage, such as db-X and www-X, have to be rescheduled onto the same worker node, because that is where their files live.

$ kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ejs-portal-nginx-84f57ffd8c-hbf66   1/1     Running   0          5m12s   888.16.109.208   apimdev0103   <none>           <none>
ejs-portal-nginx-84f57ffd8c-mvq96   1/1     Running   0          5m12s   888.16.142.215   apimdev0129   <none>           <none>
ejs-portal-nginx-84f57ffd8c-vpmtl   1/1     Running   0          5m12s   888.16.142.214   apimdev0129   <none>           <none>
ejs-portal-site1-db-0               2/2     Running   0          4m39s   888.16.109.209   apimdev0103   <none>           <none>
ejs-portal-site1-db-1               2/2     Running   0          6m37s   888.16.109.206   apimdev0103   <none>           <none>
ejs-portal-site1-db-2               2/2     Running   0          4m39s   888.16.142.216   apimdev0129   <none>           <none>
ejs-portal-site1-www-0              2/2     Running   0          4m9s    888.16.109.210   apimdev0103   <none>           <none>
ejs-portal-site1-www-1              2/2     Running   0          6m37s   888.16.142.213   apimdev0129   <none>           <none>
ejs-portal-site1-www-2              2/2     Running   0          4m9s    888.16.142.217   apimdev0129   <none>           <none>
ibm-apiconnect-75b47f9f87-p25dd     1/1     Running   0          5m12s   888.16.109.207   apimdev0103   <none>           <none>
The upgrade to version 10.0.1.4-ifix1 is started and the following pod list is observed:
$ kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ejs-portal-nginx-84f57ffd8c-hbf66   1/1     Running   0          10m     888.16.109.208   apimdev0103   <none>           <none>
ejs-portal-nginx-84f57ffd8c-mvq96   1/1     Running   0          10m     888.16.142.215   apimdev0129   <none>           <none>
ejs-portal-nginx-84f57ffd8c-vpmtl   1/1     Running   0          10m     888.16.142.214   apimdev0129   <none>           <none>
ejs-portal-site1-db-0               2/2     Running   0          10m     888.16.109.209   apimdev0103   <none>           <none>
ejs-portal-site1-db-1               0/2     Pending   0          91s     <none>           <none>        <none>           <none>
ejs-portal-site1-db-2               2/2     Running   0          2m41s   888.16.142.218   apimdev0129   <none>           <none>
ejs-portal-site1-www-0              2/2     Running   0          9m51s   888.16.109.210   apimdev0103   <none>           <none>
ejs-portal-site1-www-1              2/2     Running   0          12m     888.16.142.213   apimdev0129   <none>           <none>
ejs-portal-site1-www-2              0/2     Pending   0          111s    <none>           <none>        <none>           <none>
ibm-apiconnect-75b47f9f87-p25dd     1/1     Running   0          10m     888.16.109.207   apimdev0103   <none>           <none>
The pod list shows that db-2 has restarted and has been rescheduled to apimdev0129, because there were no other db pods running on that node. However, db-1 and www-2 are both stuck in the Pending state, because there is already a pod of the same type running on the worker node that hosts the local storage that they are bound to. If you run a describe command on either pod, you see the following output:
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10s (x4 over 2m59s)  default-scheduler  0/3 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict.

To resolve this situation, you need to delete the PVCs for each stuck pod, and then delete the pod itself, so that Kubernetes regenerates the PVCs and schedules the pod on the worker node that does not have the anti-affinity conflict.

Therefore, for the db-1 pod the following commands must be run:
$ kubectl get pvc | grep ejs-portal-site1-db-1
db-ejs-portal-site1-db-1        Bound    local-pv-fa445e30   250Gi      RWO            local-storage   15m
dblogs-ejs-portal-site1-db-1    Bound    local-pv-d57910e7   250Gi      RWO            local-storage   15m

$ kubectl delete pvc db-ejs-portal-site1-db-1 dblogs-ejs-portal-site1-db-1
persistentvolumeclaim "db-ejs-portal-site1-db-1" deleted
persistentvolumeclaim "dblogs-ejs-portal-site1-db-1" deleted

$ kubectl delete po ejs-portal-site1-db-1
pod "ejs-portal-site1-db-1" deleted
For the www-2 pod the following commands must be run:
$ kubectl get pvc | grep ejs-portal-site1-www-2
admin-ejs-portal-site1-www-2    Bound    local-pv-48799536   245Gi      RWO            local-storage   51m
backup-ejs-portal-site1-www-2   Bound    local-pv-a93f5607   245Gi      RWO            local-storage   51m
web-ejs-portal-site1-www-2      Bound    local-pv-facd4489   245Gi      RWO            local-storage   51m

$ kubectl delete pvc admin-ejs-portal-site1-www-2 backup-ejs-portal-site1-www-2 web-ejs-portal-site1-www-2
persistentvolumeclaim "admin-ejs-portal-site1-www-2" deleted
persistentvolumeclaim "backup-ejs-portal-site1-www-2" deleted
persistentvolumeclaim "web-ejs-portal-site1-www-2" deleted

$ kubectl delete po ejs-portal-site1-www-2
pod "ejs-portal-site1-www-2" deleted

If the PVC has persistentVolumeReclaimPolicy: Delete set on it, as is the case for the OVA deployments, then no cleanup is necessary as the old data will have been deleted on the worker node that is no longer running the db-1 and www-2 pods.

Kubernetes can now reschedule the pods. All pods with persistent storage are now spread across the available worker nodes, and the pods whose PVCs were deleted will get a full copy of the data from the existing running pods. The following pod list is now observed in our example:
$ kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ejs-portal-nginx-84f57ffd8c-f85wm   1/1     Running   0          30s     888.16.29.136    apimdev1066   <none>           <none>
ejs-portal-nginx-84f57ffd8c-k5klb   1/1     Running   0          103s    888.16.142.220   apimdev0129   <none>           <none>
ejs-portal-nginx-84f57ffd8c-lqhqs   1/1     Running   0          1m53s   888.16.109.212   apimdev0103   <none>           <none>
ejs-portal-site1-db-0               2/2     Running   0          6m43s   888.16.109.211   apimdev0103   <none>           <none>
ejs-portal-site1-db-1               2/2     Running   0          8m20s   888.16.29.134    apimdev1066   <none>           <none>
ejs-portal-site1-db-2               2/2     Running   0          14m     888.16.142.218   apimdev0129   <none>           <none>
ejs-portal-site1-www-0              2/2     Running   0          93s     888.16.109.213   apimdev0103   <none>           <none>
ejs-portal-site1-www-1              2/2     Running   0          3m55s   888.16.142.219   apimdev0129   <none>           <none>
ejs-portal-site1-www-2              2/2     Running   0          7m27s   888.16.29.135    apimdev1066   <none>           <none>
ibm-apiconnect-75b47f9f87-p25dd     1/1     Running   0          22m     888.16.109.207   apimdev0103   <none>           <none>

Skipping health check when re-running upgrade

The apicup subsys install command automatically runs apicup health-check before attempting the upgrade. An error is displayed if a problem is found that will prevent a successful upgrade.

In some scenarios, if you encounter an upgrade failure, an attempt to rerun apicup subsys install is blocked by errors found by apicup health-check. Even after you have fixed the error (for example, by reconfiguring an incorrect upgrade CR), the failed upgrade can continue to cause the health check to fail.

You can work around the problem by adding the --skip-health-check flag to suppress the health check:

apicup subsys install <subsystem_name> --skip-health-check

In this case, use of --skip-health-check allows the upgrade to rerun successfully.