Troubleshooting resiliency issues on OpenShift and Cloud Pak for Integration

Resolve resiliency issues with API Connect management system backups.

pgbackrest-shared-repo resiliency problems

If the Kubernetes node that hosts the pgbackrest-shared-repo pod is down, and either the PVC attached to the pod has strict zone requirements (for example, in AWS or other clouds) or the storage class is set to local-storage, then the pgbackrest-shared-repo pod is not rescheduled to another Kubernetes node.

As a result, there is a single point of failure and the following conditions might occur:

  • Backups of the management database fail
  • Disk space fills with accumulated Postgres WAL (write-ahead log) files

To avoid this problem, monitor disk usage. See Monitoring Postgres disk usage on OpenShift or Monitoring Postgres disk usage on Cloud Pak for Integration.
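
As a rough illustration of what such monitoring can look like, the following Python sketch parses the output of the POSIX `df -P` command and reports mount points at or above a usage threshold. It is not part of API Connect. The function names, the 80% threshold, the sample output, and the /pgdata mount path are all assumptions made for this example; use the linked monitoring topics for the supported procedure.

```python
from __future__ import annotations


def parse_df_usage(df_output: str) -> dict[str, int]:
    """Map each mount point in POSIX `df -P` output to its used percentage."""
    usage: dict[str, int] = {}
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a data row
        # Columns: filesystem, blocks, used, available, capacity, mount point
        capacity = fields[4]
        mount = " ".join(fields[5:])
        if capacity.endswith("%") and capacity[:-1].isdigit():
            usage[mount] = int(capacity[:-1])
    return usage


def mounts_over_threshold(df_output: str, threshold_percent: int = 80) -> list[str]:
    """Return the mount points whose usage is at or above the threshold."""
    usage = parse_df_usage(df_output)
    return sorted(m for m, pct in usage.items() if pct >= threshold_percent)


# Hypothetical sample output; /pgdata is an assumed mount path, not a documented one.
SAMPLE_DF = """\
Filesystem     1024-blocks     Used Available Capacity Mounted on
overlay          125277164 40211200  85065964      33% /
/dev/nvme1n1     103081248 92773123  10308125      90% /pgdata
"""

if __name__ == "__main__":
    for mount in mounts_over_threshold(SAMPLE_DF):
        print(f"WARNING: {mount} is at or above the usage threshold")
```

The text passed to these functions would come from running `df -P` inside the Postgres pod. A threshold well below 100% leaves time to react before accumulated WAL files fill the volume.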

When the Kubernetes node comes back up, the pod is scheduled again and all processes should resume normally.
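
To judge ahead of time whether the pgbackrest-shared-repo pod is exposed to the scheduling constraint described above, you can inspect the PersistentVolume bound to its PVC. The following Python sketch is a hypothetical helper, not an API Connect tool. It takes a PersistentVolume object already loaded as a dictionary (for example, from the JSON printed by `kubectl get pv <name> -o json`) and lists the reasons the volume might tie the pod to one zone or node. The function name and the sample volume are assumptions for illustration; the label keys checked are the standard Kubernetes zone and hostname labels, plus any provider-specific key that ends in /zone.

```python
from __future__ import annotations

HOSTNAME_KEY = "kubernetes.io/hostname"


def rescheduling_constraints(pv: dict) -> list[str]:
    """List reasons why a PersistentVolume may pin its pod to a zone or node."""
    reasons: list[str] = []
    spec = pv.get("spec") or {}

    # The storage class name called out in this topic.
    if spec.get("storageClassName") == "local-storage":
        reasons.append("storage class is local-storage")

    # Walk spec.nodeAffinity.required.nodeSelectorTerms[].matchExpressions[].
    required = (spec.get("nodeAffinity") or {}).get("required") or {}
    for term in required.get("nodeSelectorTerms") or []:
        for expr in term.get("matchExpressions") or []:
            key = expr.get("key", "")
            values = ", ".join(expr.get("values") or [])
            if key.endswith("/zone"):
                # Covers topology.kubernetes.io/zone, the legacy
                # failure-domain.beta.kubernetes.io/zone, and CSI driver keys.
                reasons.append(f"pinned to zone(s): {values}")
            elif key == HOSTNAME_KEY:
                reasons.append(f"pinned to node(s): {values}")
    return reasons


# Hypothetical volume, trimmed to the fields this sketch reads.
SAMPLE_PV = {
    "spec": {
        "storageClassName": "gp2",
        "nodeAffinity": {"required": {"nodeSelectorTerms": [
            {"matchExpressions": [
                {"key": "topology.kubernetes.io/zone",
                 "operator": "In",
                 "values": ["us-east-1a"]},
            ]},
        ]}},
    }
}

if __name__ == "__main__":
    for reason in rescheduling_constraints(SAMPLE_PV):
        print(reason)
```

An empty result means that the sketch found none of these constraints. It does not guarantee that the pod can be rescheduled, because other factors such as node capacity or taints also affect scheduling.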