Troubleshooting upgrades on VMware
Review the following troubleshooting guidance if you encounter a problem while upgrading API Connect on VMware.
Locating your product version
In the Help page of the Cloud Manager, API Manager, and API Designer user interfaces, you can click the Product information tile to view your product version, along with Git information about the package versions in use. Note that the API Designer product information reflects its associated management server, while the Git information reflects where API Designer was downloaded from.
False positive result from health check after upgrading to 10.0.5.6
Sometimes an upgrade to API Connect 10.0.5.6 appears to complete successfully and the health-check reports that the upgraded subsystem is healthy, but the older version is still installed. A false positive response from the health check indicates a problem with the upgrade.
Complete the following steps to resolve the issue:
- If you have not done so already, run the following command on each subsystem to determine which version of API Connect is deployed (a brief sketch of this check follows the list):
kubectl get apic
If the new version of API Connect is deployed on a subsystem, then the upgrade was successful for that subsystem.
- If the older version of API Connect is still deployed on a subsystem, run the following command to check the status of that subsystem in case there is a message about the underlying problem:
apic status
Sometimes the status indicates the problem, and you can correct it before upgrading that subsystem again.
- If you do not see any messages indicating the source of the problem, contact IBM Support for assistance.
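A minimal sketch of the version check from the first step, run as root on each appliance (resource names and output columns vary by deployment, so treat this as illustrative):
# List the API Connect custom resources; the version columns show whether the new release has been reconciled on this subsystem
kubectl get apic
# Appliance-level view of the install stage and version
apic status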
Appliance upgrade from 10.0.4.x to 10.0.5.3 and 10.0.5.4 fails
A direct upgrade from 10.0.4.x to 10.0.5.3 or to 10.0.5.4 on an appliance might fail with a message similar to the following example:
Error: failed to install the subsystem: unable to close stream to remote server: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Correct the problem by completing the following steps:
- Restore the content of the
/usr/lib/python3/dist-packages/cloudinit/config/cc_apiconnect.py file to the
original 10.0.4.x definition as shown in the following
example:
from cloudinit.settings import PER_ALWAYS
from yaml import dump
from subprocess import call
import time

frequency = PER_ALWAYS

UNLOCK_ATTEMPT_MAX_COUNT = 50
UNLOCK_ATTEMPT_DELAY = 10

def handle(name, _cfg, _cloud, log, _args):
    if (name == 'apiconnect'):
        f = open('/var/lib/cloud/instance/apiconnect.yml', 'w')
        f.write(dump(_cfg["apiconnect"]))
        f.close()

        for i in range(1, UNLOCK_ATTEMPT_MAX_COUNT):
            rc = call(["/usr/bin/apic", "unlock", "--prompt=false"])
            if rc == 0:
                return
            log.warning("apic unlock attempt %s of %s failed", i, UNLOCK_ATTEMPT_MAX_COUNT)
            time.sleep(UNLOCK_ATTEMPT_DELAY)

        log.warning("Use apic unlock to mount the secure partition and start the node")
- Reboot the appliance.
The original primary PVC is not attached to any pods in this namespace
If the apicops version:pre-upgrade check reports The original primary PVC is not attached to any pods in this namespace, and you have not started the upgrade yet (you have not run the apicup subsys install command), then follow these steps:
- Take a backup of your management subsystem: Backing up and restoring the Management subsystem.
- Restore the new backup: Restoring the management subsystem.
Original PostgreSQL primary not found
If you run the apicup subsys install <tar file> command to start an upgrade, the upgrade does not complete, and the apicup health-check command returns the following error:
apicup subsys health-check management
Error: Cluster not in good health:
expect member apicdev.ibm.com 'Install stage' to be 'Done'(actual: SETUP_SUBSYSTEM) | Detail: reconcile.go#apply subsystem not ready
then SSH into the management appliance and switch to the root user to get more details of the error:
ssh apicadm@<management appliance>
sudo -i
Run kubectl describe mgmt, and check the output for the following message:
Original PostgreSQL primary not found. Upgrade is blocked. Set apiconnect-operator/db-primary-not-found-allow-upgrade: true annotation to unblock the upgrade. A warning will be issued after the upgrade with further steps
To fix this problem, complete the following steps:
- Create an extra-values.yaml file and set the annotation:
metadata:
  annotations:
    apiconnect-operator/db-primary-not-found-allow-upgrade: "true"
Note: If you already have an extra-values.yaml file, update the existing file with the above annotation.
- Set the extra-values-file property on your management subsystem with
apicup:
apicup subsys set <management subsystem name> extra-values-file=<extra values filename>
- Restart the upgrade, with the --skip-health-check
flag:
apicup subsys install <management subsystem name> <upgrade tar file> <control plane file - if required> --skip-health-check
- When the upgrade is complete:
- Remove the annotation from your extra-values.yaml file.
- Take a backup of your management subsystem: Backing up and restoring the Management subsystem.
- Restore the new backup: Restoring the management subsystem. The action of taking and restoring a management backup fixes the problem that causes the error message The original primary PVC is not attached to any pods in this namespace. Be careful to restore from the backup that was taken after the upgrade, and not from a backup taken before the upgrade.
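If the upgrade remains blocked after you set the annotation, you can confirm that the annotation and the extra-values file were picked up. This is an optional, illustrative check (run apicup from your project directory and kubectl as root on the management appliance; the names are placeholders):
# Confirm the extra-values-file property is set on the subsystem
apicup subsys get <management subsystem name>
# Confirm the operator sees the unblock annotation
kubectl describe mgmt | grep db-primary-not-found-allow-upgrade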
The cloud-final.service
fails during upgrade
Sometimes during an upgrade, the cloud-final.service fails on a node and the appliance-manager enters a bad state. Complete the following checks:
- Check the output of the apic health-check command for a result similar to the following example:
# apic health-check
INFO[0000] Log level: info
FATA[0000] Unable to cluster status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 9.20.153.38:9178: connect: connection refuse
- Check the response to the journalctl -u appliance-manager | grep cloud-final command and see if it looks like the following example:
# journalctl -u appliance-manager | grep cloud-final
Nov 21 19:41:58 apimdev1040 apic[2569]: Job for cloud-final.service failed because the control process exited with error code.
Nov 21 19:41:58 apimdev1040 apic[2569]: See "systemctl status cloud-final.service" and "journalctl -xe" for details.
- If the output matches the example, restart the service and then lock and unlock the appliance to return the appliance-manager to a good state:
systemctl restart cloud-final.service
apic lock
apic unlock
Upgrade to API Connect 10.0.5.3 fails to start because management subsystem fails the health check
This issue might be caused by a corrupt compliance entry in the
pgbouncer.ini file. When the operator is upgraded to v10.0.5.3 but the
ManagementCluster CR is not yet upgraded, the operator might update the
pgbouncer.ini file in the PGBouncer secret with the older ManagementCluster
CR's profile file, which does not contain any value for the compliance pool_size. As a result, the value gets incorrectly set to the string <no value>.
When this issue occurs, the health check does not report the management subsystem as healthy, so the upgrade does not start. The pgBouncer log shows errors like the following example:
Wed May 17 09:00:02 UTC 2023 INFO: Starting pgBouncer..
2023-05-17 09:00:02.897 UTC [24] ERROR syntax error in connection string
2023-05-17 09:00:02.897 UTC [24] ERROR invalid value "host=mgmt-a516d013-postgres port=5432 dbname=compliance pool_size=<no value>" for parameter compliance in configuration (/pgconf/pgbouncer.ini:7)
2023-05-17 09:00:02.897 UTC [24] FATAL cannot load config file
Resolve the issue by completing the following steps:
- SSH to the appliance and switch to the root user:
ssh apicadm@<vm-hostname>
sudo -i
- Run the commands in the following
script:
NAMESPACE=<mgmt-namespace>
BOUNCER_SECRET=<management-prefix>-postgres-pgbouncer-secret
TEMP_FILE=/tmp/pgbouncer.ini

# Step 1 - Common: Get the existing pgbouncer.ini file
kubectl get secret -n $NAMESPACE $BOUNCER_SECRET -o jsonpath='{.data.pgbouncer\.ini}' | base64 -d > $TEMP_FILE

# Step 2 - Linux version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -w0 | xargs -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"

# Step 2 - Mac version: Update the file and use it to patch the Secret on the cluster
sed 's/<no value>/20/' $TEMP_FILE | base64 -b0 | xargs -S2000 -I{} kubectl patch secret -n $NAMESPACE $BOUNCER_SECRET --type='json' -p="[{'op' : 'replace' ,'path' : '/data/pgbouncer.ini' ,'value' : {} }]"

# Step 3 - Common: Restart pgbouncer to pick up the updated Secret configuration
kubectl delete pod <bouncer_pod_name> -n $NAMESPACE
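If you need to find the pgBouncer pod name for the final step, one quick way (illustrative) is to filter the pod list:
kubectl get pods -n $NAMESPACE | grep pgbouncer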
Postgres replica fails to start due to permissions issue
Symptom:
2023-04-11 11:31:40,535 INFO: Lock owner: v10-0-5-up-ce0f24dc-site1-postgres-iszz-c5d9c4f5c-9d9fj; I am v10-0-5-up-ce0f24dc-site1-postgres-75845474bd-kzcjl
2023-04-11 11:31:40,537 INFO: starting as a secondary
2023-04-11 11:31:41.032 UTC [151610][]FATAL: data directory "/pgdata/v10-0-5-up-ce0f24dc-site1-postgres" has invalid permissions
2023-04-11 11:31:41.032 UTC [151610][]DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 11:31:41,055 INFO: postmaster pid=151610
/tmp:5432 - no response
2023-04-11 11:31:41,091 INFO: Lock owner: v10-0-5-up-ce0f24dc-site1-postgres-iszz-c5d9c4f5c-9d9fj; I am v10-0-5-up-ce0f24dc-site1-postgres-75845474bd-kzcjl
2023-04-11 11:31:41,092 INFO: failed to start postgres
Resolve the issue by completing the following steps to correct the permissions and restart the replica pod:
- Run the following command to exec into the
pod:
kubectl exec -it <pg-primary-pod> -- bash
- Run the following command to reset the
permissions:
chmod 0700 /pgdata/<pg-cluster-name>
- Restart the failed pod.
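The last two steps can look like the following sketch (the cluster, pod, and namespace names are placeholders; deleting the replica pod is one way to restart it, assuming the deployment recreates it automatically):
# From inside the pod: confirm the data directory now shows 0700 permissions
ls -ld /pgdata/<pg-cluster-name>
# From the VM, after exiting the pod: delete the failed replica pod so that it is recreated
kubectl delete pod <postgres-replica-pod-name> -n <namespace>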
Upgrade stuck with "Unable to upgrade appliance-base: exit status 100" message
This error can occur when the deployment is using an incorrect version of containerd. There is a known issue where upgrades from 10.0.1.7-eus or 10.0.1.8-eus to 10.0.5.1 incorrectly upgraded the containerd version.
On each node, complete the following steps to determine the version of containerd and, if necessary, downgrade it to the correct version:
- Run the following set of commands on the healthy node to downgrade the version of containerd:
systemctl stop appliance-manager
systemctl stop kubelet
systemctl stop containerd
apt-get --allow-downgrades upgrade -y containerd.io
sed -i 's/KillMode=process/KillMode=mixed/g' /lib/systemd/system/containerd.service
systemctl daemon-reload
systemctl restart containerd
apic lock
apic unlock
- Run the following command to verify that containerd now displays containerd://1.5.11 for the CONTAINER-RUNTIME:
kubectl get nodes -o wide
Remember to complete these steps on every node.
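As an optional cross-check on each node (illustrative; the package query assumes the apt-based appliance image), confirm both the installed containerd package and the runtime that kubelet reports:
# Runtime binary version
containerd --version
# Installed package version
apt list --installed 2>/dev/null | grep containerd.io
# Kubelet-reported runtime per node (CONTAINER-RUNTIME column)
kubectl get nodes -o wide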
Some containers in the kube-system
namespace show a status of
ErrImageNeverPull
When upgrading, some of the containers in the kube-system namespace might show a status of ErrImageNeverPull. This happens when Docker does not successfully load all images from the upgrade .tgz file. To resolve this issue and enable the upgrade to proceed, complete the following steps:
- Run the following command to determine which node is missing the control plane
files:
kubectl -n kube-system get pods -owide
This command returns the names of the nodes containing pods with the ErrImageNeverPull status, which indicates missing control plane files.
- On the node that is missing the control plane files, run the following command to determine which versions of the control plane are missing:
cat /var/lib/apiconnect/appliance-control-plane-current
- For each missing control plane, run the following command on the same node to add it, replacing
<version> with the version of the control
plane:
docker load < /usr/local/lib/appliance-control-plane/<version>/kubernetes.tgz
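Optionally, before retrying the upgrade, confirm that the control plane images are now present in the local Docker image cache (an illustrative check; image names vary by Kubernetes version):
# Look for the loaded control plane images
docker image ls | grep -Ei 'kube|etcd|pause'
# Confirm the ErrImageNeverPull pods recover
kubectl -n kube-system get pods -o wide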
Pod stuck in Pending
status during upgrade
When upgrading, the scheduler might deploy a subset of the same microservice pods on the same
node. This can prevent other pods with
requiredDuringSchedulingIgnoredDuringExecution
affinity rules from being deployed
due to a lack of resources on a subset of nodes. To allow the pending containers to be deployed
successfully, identify any pods of the same type that are scheduled on the same node and delete one
of them. This will free up space and cause the deleted pod to get rescheduled. To identify pods that
are eligible to be deleted, and then delete the pods, complete the following steps:
- Run the following command and check for any pods of the same type that are on the same
node:
kubectl get po -o=custom-columns='name:.metadata.name, node:.spec.nodeName, antiaffinity:.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution' | grep -v '<none>' | awk '{print $1" "$2}'
In the following example snippet, one of the two apim pods on the node test0186 should be deleted:
stv3-management-analytics-proxy-56848c8c69-phpdh test0186
stv3-management-analytics-proxy-56848c8c69-sc45f test0187
stv3-management-analytics-proxy-56848c8c69-scf6g test0188
stv3-management-apim-5574796948-6lnwj test0186
stv3-management-apim-5574796948-h9dgx test0186
stv3-management-apim-5574796948-tsb4g test0188
- Run the following command to delete a
pod:
kubectl delete po <pod_name>
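After the delete, you can optionally confirm that the replacement pod was scheduled onto a different node (illustrative; substitute the microservice name, for example apim):
kubectl get po -o wide | grep <microservice_name>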
Database replica pods stuck in Unknown or Pending state
In certain scenarios, a postgres replica pod might not recover to a healthy state when a restore completes, when a node outage occurs, or after a fresh install or upgrade. In these cases, a postgres pod remains in an Unknown or a Pending state after a number of minutes and fails to reach a Running state.
This situation occurs when the replicas do not initialize properly. You can use the
patronictl reinit
command to reinitialize the replica. Note that this command syncs
the replica's volume data from the current Primary pod.
Use the following steps to get the pod back into a working state:
- SSH into the VM as root.
- Exec onto the failing
pod:
kubectl exec -it <postgres_replica_pod_name> -n <namespace> -- bash
- List the cluster members:
patronictl list
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
| Member                                                  | Host           | Role   | State        | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
| fxpk-management-01191b80-postgres-586f899fdf-6s25b      | 172.16.172.244 |        | start failed |    |   unknown |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running      |  3 |           |
| fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m  | 172.16.53.68   |        | running      |  3 |         0 |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
In the example shown above, fxpk-management-01191b80-postgres-586f899fdf-6s25b is not in a running state. Note the clusterName and the replicaName of the member that is not up:
clusterName - fxpk-management-01191b80-postgres
replicaName - fxpk-management-01191b80-postgres-586f899fdf-6s25b
- Run:
patronictl reinit <clusterName> <replicaName-which-is-not-running>
Example:
patronictl reinit fxpk-management-01191b80-postgres fxpk-management-01191b80-postgres-586f899fdf-6s25b
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
| Member                                                  | Host           | Role   | State        | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
| fxpk-management-01191b80-postgres-586f899fdf-6s25b      | 172.16.172.244 |        | start failed |    |   unknown |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running      |  3 |           |
| fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m  | 172.16.53.68   |        | running      |  3 |         0 |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
Are you sure you want to reinitialize members fxpk-management-01191b80-postgres-586f899fdf-6s25b? [y/N]: y
Success: reinitialize for member fxpk-management-01191b80-postgres-586f899fdf-6s25b
- Run patronictl list again. You might observe that the replica is now on a different Timeline (TL) and possibly has a Lag in MB. It can take a few minutes for the pod to switch onto the same TL as the others, and the Lag should slowly go to 0.
For example:
bash-4.2$ patronictl list
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+---------+----+-----------+
| Member                                                  | Host           | Role   | State   | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+---------+----+-----------+
| fxpk-management-01191b80-postgres-586f899fdf-6s25b      | 172.16.172.244 |        | running |  1 |     23360 |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running |  3 |           |
| fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m  | 172.16.53.68   |        | running |  3 |         0 |
+---------------------------------------------------------+----------------+--------+---------+----+-----------+
- The pod that was previously in an Unknown, Pending, or (0/1) Running state is now in a (1/1) Running state.
etcd
pod stuck in Terminating
state
During an upgrade, the health check might report that the upgrade is stuck in the ETCD stage and that an etcd pod is stuck in the Terminating state. To resolve the issue, complete the following steps:
- SSH into the VM as root.
- Run the following command and verify that the upgrade stage is ETCD:
apic status
- Run the following command to determine whether an etcd pod is stuck in the Terminating state:
kubectl get pods -n <namespace> | grep etcd
- Run the following command to retrieve the names of the etcd pods and the nodes they run on:
kubectl get pods -n <namespace> -o wide | grep etcd
- SSH as root into the VM that hosts the stuck pod.
- Run the following command to restart the
pod:
systemctl restart kubelet
Postgres pods fail to start after upgrade
When upgrading the management subsystem from v10.0.2.0 or later, as part of Upgrading to the latest release on VMware, you might encounter an error message when checking the subsystem health upon completion of the upgrade. For example:
apic health-check
INFO[0000] Log level: info
FATA[0006] Cluster not in good health:
ManagementCluster (current ha mode: active) is not ready | State: 15/16 Phase: Pending
To troubleshoot when a message like this occurs:
- Check the state of postgres pods:
kubectl get pods | grep postgres
For example:
root@apimdev1146:~# kubectl get pods | grep postgres
fxpk-management-fd8b0b1f-postgres-577594c7f-k54pk                 0/1   Init:CrashLoopBackOff   17   22h
fxpk-management-fd8b0b1f-postgres-backrest-shared-repo-7fctp88w   1/1   Running                 2    22h
fxpk-management-fd8b0b1f-postgres-elbx-698f445649-rlc2g           0/1   Init:CrashLoopBackOff   16   22h
fxpk-management-fd8b0b1f-postgres-pgbouncer-64f57b7cc7-52bk8      1/1   Running                 2    22h
fxpk-management-fd8b0b1f-postgres-pgbouncer-64f57b7cc7-jjjvb      1/1   Running                 1    22h
fxpk-management-fd8b0b1f-postgres-pgbouncer-64f57b7cc7-qp4zh      1/1   Running                 2    22h
fxpk-management-fd8b0b1f-postgres-stanza-create-6pt6c             0/1   Completed               0    22h
fxpk-management-fd8b0b1f-postgres-ubba-79ccdd5cc6-kj4zx           0/1   Init:CrashLoopBackOff   17   22h
postgres-operator-85fb96db4b-gk8k8                                4/4   Running                 8    22h
- If any pods show Init:CrashLoopBackOff status, restart the pods. To force a restart, delete the pods:
kubectl delete pod <name_of_postgres_pod>
For example:
kubectl delete pod fxpk-management-fd8b0b1f-postgres-577594c7f-k54pk
kubectl delete pod fxpk-management-fd8b0b1f-postgres-elbx-698f445649-rlc2g
kubectl delete pod fxpk-management-fd8b0b1f-postgres-ubba-79ccdd5cc6-kj4zx
When pods are deleted, the deployment automatically restarts them.
- Re-run the health check. For
example:
apicup subsys health-check <subsys_name>
- When health check is successful, return to the next upgrade step in Upgrading to the latest release on VMware.
Upgrading a 3 node profile to IBM API Connect 10.0.3.0 or later
might result in some portal-db/www
pods being stuck in the Pending
state
IBM® API Connect 10.0.3.0 introduces the pod anti-affinity required rule, meaning that in a 3 node profile deployment, all 3 db and www pods can run only if there are at least 3 running worker nodes. This rule can cause some upgrades to version 10.0.3.0 or later to become stuck in the Pending state, in which case some extra steps are needed during the upgrade to work around the issue. See the following example for detailed information about the issue, and how to continue with the upgrade.
- Run the following command to log in as apicadm, which is the API Connect ID that has administrator privileges:
ssh portal_ip_address -l apicadm
where portal_ip_address is the IP address of the portal subsystem.
- Then get a root shell by running the following command:
sudo -i
In this example, the deployment has three worker nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
apimdev0103 Ready worker 42m v1.20.0
apimdev0129 Ready worker 45m v1.20.0
apimdev1066 Ready worker 39m v1.20.0
The pods have been scheduled across only 2 of the 3 worker nodes due to a transient problem with
apimdev1066
, as shown in the following pod list. Pods without persistent storage,
such as nginx-X, can be rescheduled to apimdev1066
as soon as they are restarted,
but any pods with persistent local storage, such as db-X and www-X, have to be rescheduled onto the
same worker node as that is where their files live.
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ejs-portal-nginx-84f57ffd8c-hbf66 1/1 Running 0 5m12s 888.16.109.208 apimdev0103 <none> <none>
ejs-portal-nginx-84f57ffd8c-mvq96 1/1 Running 0 5m12s 888.16.142.215 apimdev0129 <none> <none>
ejs-portal-nginx-84f57ffd8c-vpmtl 1/1 Running 0 5m12s 888.16.142.214 apimdev0129 <none> <none>
ejs-portal-site1-db-0 2/2 Running 0 4m39s 888.16.109.209 apimdev0103 <none> <none>
ejs-portal-site1-db-1 2/2 Running 0 6m37s 888.16.109.206 apimdev0103 <none> <none>
ejs-portal-site1-db-2 2/2 Running 0 4m39s 888.16.142.216 apimdev0129 <none> <none>
ejs-portal-site1-www-0 2/2 Running 0 4m9s 888.16.109.210 apimdev0103 <none> <none>
ejs-portal-site1-www-1 2/2 Running 0 6m37s 888.16.142.213 apimdev0129 <none> <none>
ejs-portal-site1-www-2 2/2 Running 0 4m9s 888.16.142.217 apimdev0129 <none> <none>
ibm-apiconnect-75b47f9f87-p25dd 1/1 Running 0 5m12s 888.16.109.207 apimdev0103 <none> <none>
After the upgrade is started, the pod list looks like the following example:
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ejs-portal-nginx-84f57ffd8c-hbf66 1/1 Running 0 10m 888.16.109.208 apimdev0103 <none> <none>
ejs-portal-nginx-84f57ffd8c-mvq96 1/1 Running 0 10m 888.16.142.215 apimdev0129 <none> <none>
ejs-portal-nginx-84f57ffd8c-vpmtl 1/1 Running 0 10m 888.16.142.214 apimdev0129 <none> <none>
ejs-portal-site1-db-0 2/2 Running 0 10m 888.16.109.209 apimdev0103 <none> <none>
ejs-portal-site1-db-1 0/2 Pending 0 91s <none> <none> <none> <none>
ejs-portal-site1-db-2 2/2 Running 0 2m41s 888.16.142.218 apimdev0129 <none> <none>
ejs-portal-site1-www-0 2/2 Running 0 9m51s 888.16.109.210 apimdev0103 <none> <none>
ejs-portal-site1-www-1 2/2 Running 0 12m 888.16.142.213 apimdev0129 <none> <none>
ejs-portal-site1-www-2 0/2 Pending 0 111s <none> <none> <none> <none>
ibm-apiconnect-75b47f9f87-p25dd 1/1 Running 0 10m 888.16.109.207 apimdev0103 <none> <none>
In this example, db-2 has restarted and has been rescheduled to apimdev0129, as there were no other db pods running on that node. However, db-1 and www-2 are both stuck in the Pending state, because there is already a pod of the same type running on the worker node that hosts the local storage that they are bound to. If you run a describe command against either pod, you see the following output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10s (x4 over 2m59s) default-scheduler 0/3 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict.
To resolve this situation, you need to delete the PVCs for each stuck pod, and then delete the pod itself, so that Kubernetes regenerates the PVCs and schedules the pod on the worker node that does not have the anti-affinity conflict. For example, for the db-1 pod, run the following commands:
$ kubectl get pvc | grep ejs-portal-site1-db-1
db-ejs-portal-site1-db-1 Bound local-pv-fa445e30 250Gi RWO local-storage 15m
dblogs-ejs-portal-site1-db-1 Bound local-pv-d57910e7 250Gi RWO local-storage 15m
$ kubectl delete pvc db-ejs-portal-site1-db-1 dblogs-ejs-portal-site1-db-1
persistentvolumeclaim "db-ejs-portal-site1-db-1" deleted
persistentvolumeclaim "dblogs-ejs-portal-site1-db-1" deleted
$ kubectl delete po ejs-portal-site1-db-1
pod "ejs-portal-site1-db-1" deleted
For the www-2 pod, run the following commands:
$ kubectl get pvc | grep ejs-portal-site1-www-2
admin-ejs-portal-site1-www-2 Bound local-pv-48799536 245Gi RWO local-storage 51m
backup-ejs-portal-site1-www-2 Bound local-pv-a93f5607 245Gi RWO local-storage 51m
web-ejs-portal-site1-www-2 Bound local-pv-facd4489 245Gi RWO local-storage 51m
$ kubectl delete pvc admin-ejs-portal-site1-www-2 backup-ejs-portal-site1-www-2 web-ejs-portal-site1-www-2
persistentvolumeclaim "admin-ejs-portal-site1-www-2" deleted
persistentvolumeclaim "backup-ejs-portal-site1-www-2" deleted
persistentvolumeclaim "web-ejs-portal-site1-www-2" deleted
$ kubectl delete po ejs-portal-site1-www-2
pod "ejs-portal-site1-www-2" deleted
If the PVC has persistentVolumeReclaimPolicy: Delete set on it, as is the case for the OVA deployments, then no cleanup is necessary because the old data will have been deleted on the worker node that is no longer running the db-1 and www-2 pods. After the pods are rescheduled, all of the db and www pods return to the Running state across the three worker nodes, as shown in the following example:
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ejs-portal-nginx-84f57ffd8c-f85wm 1/1 Running 0 30s 888.16.29.136 apimdev1066 <none> <none>
ejs-portal-nginx-84f57ffd8c-k5klb 1/1 Running 0 103s 888.16.142.220 apimdev0129 <none> <none>
ejs-portal-nginx-84f57ffd8c-lqhqs 1/1 Running 0 1m53s 888.16.109.212 apimdev0103 <none> <none>
ejs-portal-site1-db-0 2/2 Running 0 6m43s 888.16.109.211 apimdev0103 <none> <none>
ejs-portal-site1-db-1 2/2 Running 0 8m20s 888.16.29.134 apimdev1066 <none> <none>
ejs-portal-site1-db-2 2/2 Running 0 14m 888.16.142.218 apimdev0129 <none> <none>
ejs-portal-site1-www-0 2/2 Running 0 93s 888.16.109.213 apimdev0103 <none> <none>
ejs-portal-site1-www-1 2/2 Running 0 3m55s 888.16.142.219 apimdev0129 <none> <none>
ejs-portal-site1-www-2 2/2 Running 0 7m27s 888.16.29.135 apimdev1066 <none> <none>
ibm-apiconnect-75b47f9f87-p25dd 1/1 Running 0 22m 888.16.109.207 apimdev0103 <none> <none>
Issues when installing Drupal 8 based custom modules or sub-themes into the Drupal 9 based Developer Portal
From IBM API Connect 10.0.3.0, the Developer Portal is based on the Drupal 9 content management system. If you want to install Drupal 8 custom modules or sub-themes into the Drupal 9 based Developer Portal, you must ensure that they are compatible with Drupal 9, including any custom code that they contain, and that they do not use any deprecated APIs. There are tools available for checking your custom code, such as drupal_check on GitHub, which checks Drupal code for deprecations.
For example, if a custom module or sub-theme is not Drupal 9 compatible, errors like the following example are written to the admin logs:
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: Checking theme: emeraldgreen
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Incompatible core_version_requirement '' found for emeraldgreen
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: Checking theme: rubyred
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Incompatible core_version_requirement '8.x' found for rubyred
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: Found themes incompatible with Drupal 9: emeraldgreen rubyred
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:34:49: check_d9_compat: ERROR: /tmp/restore_site.355ec8 is NOT Drupal 9 compatible
...
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: Checking module: custom_mod_1
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Incompatible core_version_requirement '' found for custom_mod_1
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: Checking module: custom_mod_2
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Incompatible core_version_requirement '8.x' found for custom_mod_2
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: Found modules incompatible with Drupal 9: emeraldgreen rubyred
[ queue stdout] 14834 729319:355ec8:a7d29c 2021-09-04 20:44:49: check_d9_compat: ERROR: site1.com is NOT Drupal 9 compatible
To fix version compatibility errors, all custom modules and sub-themes should declare a core_version_requirement key in their *.info.yml file that indicates Drupal 9 compatibility. For example:
name: Example module
type: module
description: Purely an example
core: 8.x
core_version_requirement: '^8 || ^9'
package: Example module
# Information added by Drupal.org packaging script on 2020-05-31
version: '8.x-1.3'
project: 'example_module'
datestamp: 1590905415
This example specifies that the module is compatible with all versions of Drupal 8 and 9. For more information, see Let Drupal know about your module with an .info.yml file on the drupal.org website.
If you have a backup of a site that you need to restore, and you are getting the version compatibility error, but the module or theme *.info.yml file cannot be changed easily, then you have two options. Either:
- Add an environment variable SKIP_D9_COMPAT_CHECK: "true" into the portal CR for the admin container of the www pod. However, if you choose this method, you must be positive that all of the custom modules and themes for your sites are Drupal 9 compatible, as otherwise the sites might end up inaccessible after the upgrade or restore.
- On VMware, create an extra values file to contain the environment variable, as follows:
spec:
  template:
  - containers:
    - env:
      - name: SKIP_D9_COMPAT_CHECK
        value: "true"
      name: admin
    name: www
- Save the file as d9compat.yaml, and run the following
command:
apicup subsys set <portal_subsystem_name> extra-values-file d9compat.yaml
- Then, update the portal VMware with the updated setting by running the following
command:
apicup subsys install <portal_subsystem_name>
- Extract the site backup, edit the relevant files inside it, and then tar
the backup file again. Note that this procedure will overwrite the original backup file, so ensure
that you keep a separate copy of the original file before you start the extraction. For example:
mkdir /tmp/backup
cd /tmp/backup
tar xfz path_to_backup.tar.gz
- Edit the custom module and theme files to make them Drupal 9 compatible, and add the correct core_version_requirement setting.
- Remove the original backup file, and then tar up the edited backup contents again. For example:
rm -f path_to_backup.tar.gz
tar cfz path_to_backup.tar.gz .
cd /
rm -rf /tmp/backup
Skipping health check when re-running upgrade
The apicup subsys install command automatically runs apicup health-check before attempting the upgrade. An error is displayed if a problem is found that will prevent a successful upgrade.
In some scenarios, if you encounter an upgrade failure, an attempt to rerun apicup subsys install is blocked by errors found by apicup health-check. Even when you have fixed the error (such as by reconfiguring an incorrect upgrade CR), the failed upgrade can continue to cause the health check to fail.
You can work around the problem by adding the --skip-health-check flag to suppress the health check:
apicup subsys install <subsystem_name> --skip-health-check
In this case, use of --skip-health-check
allows the upgrade to rerun
successfully.