Troubleshooting resiliency issues

Resolve resiliency issues with API Connect management system backups.

pgbackrest-shared-repo resiliency problems

If the Kubernetes node that hosts the pgbackrest-shared-repo pod goes down, the pod is not rescheduled to another Kubernetes node in either of the following cases: the PVC attached to the pod has strict zone requirements (for example, in AWS or other clouds), or the storage class is set to local-storage.

As a result, the pod is a single point of failure, and the following conditions might occur:

  • Backups of the management database fail
  • Disk space fills with accumulated Postgres WAL (write-ahead log) files

To catch this problem early, monitor the disk usage. See Monitoring Postgres disk usage on VMware.
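
As a rough illustration of this kind of monitoring, the following Python sketch reports whether a filesystem is fuller than a chosen threshold. It is not part of API Connect. The function names and the 80% default threshold are assumptions made for this example, and you must supply the path of the mount that holds the Postgres data in your own deployment.

```python
import shutil


def usage_percent(used: int, total: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def is_over_threshold(path: str, threshold_percent: float = 80.0) -> bool:
    """Return True when the filesystem holding `path` is fuller than the threshold.

    The 80% default is an illustrative assumption, not a product recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return usage_percent(usage.used, usage.total) > threshold_percent
```

A check such as is_over_threshold(path) could run on a schedule and raise an alert before accumulated WAL files fill the disk.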

When the Kubernetes node comes back up, the pod is scheduled again and all processes should resume normally.
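
To confirm that the pod is running again, one option is to inspect the pod list in JSON form. The following Python sketch is a minimal example, assuming that the pod name contains the string pgbackrest-shared-repo and that the input is a Kubernetes pod list as produced by kubectl get pods -o json. The function names are hypothetical and chosen only for this example.

```python
import json


def pod_phases(pod_list_json: str, name_fragment: str = "pgbackrest-shared-repo") -> dict:
    """Map pod name to phase for every pod whose name contains name_fragment.

    Matching on a name fragment is an assumption made for this sketch.
    """
    doc = json.loads(pod_list_json)
    phases = {}
    for item in doc.get("items", []):
        name = item.get("metadata", {}).get("name", "")
        if name_fragment in name:
            # status.phase is Pending, Running, Succeeded, Failed, or Unknown
            phases[name] = item.get("status", {}).get("phase", "Unknown")
    return phases


def all_running(phases: dict) -> bool:
    """Return True only if at least one pod matched and every match is Running."""
    return bool(phases) and all(phase == "Running" for phase in phases.values())
```

For example, the output of kubectl get pods -n <namespace> -o json can be passed to pod_phases, and all_running then indicates whether the matching pod has returned to the Running phase.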