Troubleshooting management database backup and restore

Common problems and how to diagnose issues with backup and restore.

Common problems with backup and restore

These are some frequent issues that can cause backup or restore to fail:

Invalid login credentials for your remote backup server. Check that the username and password that you configured in your subsystem database backup settings are correct.
The user does not have write permissions on the backup server. Check that the username you configured in your subsystem database backup settings has write permission to the specified backup path.
Remote backup server storage full. Check your remote backup server. If the storage is full, either extend the storage space or delete older backups.
No network access to the remote backup server. Check that you can communicate with your remote backup server from your API Connect environment.
TLS handshake failure with the remote backup server. If your remote backup server has a self-signed CA certificate, check that this certificate is trusted by your API Connect deployment. For more information, see the database backup configuration steps for your subsystem.
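
The network and TLS checks in the list above can be sketched as a small shell helper. This is a minimal sketch, not a supported tool: it assumes that `bash`, the coreutils `timeout` command, and `openssl` are available wherever you run it, and that the backup server exposes a TLS endpoint (as an S3-compatible object-store typically does). The function name `check_backup_server` is hypothetical, and the host and port are values that you supply.

```shell
# Hypothetical helper: probe TCP reachability, then attempt a verified TLS handshake.
# Usage: check_backup_server <host> <port>
check_backup_server() {
  if [ "$#" -ne 2 ]; then
    echo "usage: check_backup_server <host> <port>" >&2
    return 2
  fi

  # TCP check: bash opens a socket through its /dev/tcp pseudo-device.
  # Give up after 5 seconds so that a filtered port does not hang the check.
  if ! timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "cannot reach $1:$2"
    return 1
  fi

  # TLS check: -verify_return_error makes the handshake fail when the server
  # certificate is not trusted by the local CA store.
  if ! openssl s_client -connect "$1:$2" -servername "$1" \
      -verify_return_error </dev/null >/dev/null 2>&1; then
    echo "TLS handshake with $1:$2 failed"
    return 1
  fi

  echo "$1:$2 is reachable and its certificate is trusted"
}
```

Running the helper from a machine outside the cluster shows only whether that machine can reach the server. To test the path that API Connect uses, run the equivalent checks from inside your API Connect environment.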

Backup or restore CR stuck in a running or failed state

Describe the backup or restore CR for more information on the problem:

kubectl -n <namespace> describe backup management-1700143800

Name:         management-1700143800
Namespace:    apic
Labels:       k8s.enterprisedb.io/cluster=management-dc1-db
              k8s.enterprisedb.io/immediateBackup=false
              k8s.enterprisedb.io/scheduled-backup=management
Annotations:  <none>
API Version:  postgresql.k8s.enterprisedb.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2023-11-16T14:10:00Z
  Generation:          1
  Resource Version:    2316802
  UID:                 c7a72883-1f6f-44a7-b360-c18f8c3ba24d
Spec:
  Cluster:
    Name:  management-dc1-db
Status:
  Backup Id:         20231116T141005
  Backup Name:       backup-1700143802
  Begin LSN:         0/D000028
  Begin Wal:         00000001000000000000000D
  Destination Path:  s3://ent-edb-bnr/2dcdr-mgmt-active-1538
  End LSN:           0/D01AB58
  End Wal:           00000001000000000000000D
  Instance ID:
    Container ID:  cri-o://13f5f44d0b84cb18b8a88b0f747f4c95f24a5e2c1423f5ec0a8615463874ffa2
    Pod Name:      management-dc1-db-1
  Phase:           completed
  s3Credentials:
    Access Key Id:
      Key:   key
      Name:  mgmt-backup-secret
    Region:
      Key:   region
      Name:  mgmt-backup-secret
    Secret Access Key:
      Key:      keysecret
      Name:     mgmt-backup-secret
  Server Name:  management-dc1-db-2023-11-16T10:56:39Z
  Started At:   2023-11-16T14:10:05Z
  Stopped At:   2023-11-16T14:10:29Z
Events:
  Type    Reason     Age   From                            Message
  ----    ------     ----  ----                            -------
  Normal  Starting   14m   cloud-native-postgresql-backup  Starting backup for cluster management-dc1-db
  Normal  Starting   14m   instance-manager                Backup started
  Normal  Completed  13m   instance-manager                Backup completed
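
To check the state of a backup without reading the full description, you can query the phase directly. The following is a minimal sketch; it assumes that the `Phase` and `Started At` values shown in the output above are exposed as `.status.phase` and `.status.startedAt` on the backup CR. The function names `backup_phase` and `list_backup_phases` are hypothetical.

```shell
# Hypothetical helper: print the phase of one backup CR.
# Usage: backup_phase <namespace> <backup CR name>
backup_phase() {
  kubectl -n "$1" get backup "$2" -o jsonpath='{.status.phase}'
}

# Hypothetical helper: list every backup CR with its phase and start time.
# Usage: list_backup_phases <namespace>
list_backup_phases() {
  kubectl -n "$1" get backup \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,STARTED:.status.startedAt'
}
```

A backup whose phase is not `completed` long after its start time is a candidate for closer inspection with `kubectl describe`.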

Unable to find database backup CRs

If kubectl get backup does not return the backup that you want to restore, then search for the backup on your SFTP server or object-store.
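
Before you search the remote server, you can confirm whether any backup CR in the namespace reports the backup ID that you want. This is a minimal sketch that assumes the `Backup Id` value shown by `kubectl describe backup` is exposed as `.status.backupId`. The function name `backup_cr_exists` is hypothetical.

```shell
# Hypothetical helper: succeed if some backup CR reports the given backup ID.
# Usage: backup_cr_exists <namespace> <backup ID>
backup_cr_exists() {
  # Print one backup ID per line, then look for an exact whole-line match.
  kubectl -n "$1" get backup \
    -o jsonpath='{range .items[*]}{.status.backupId}{"\n"}{end}' \
    | grep -Fxq "$2"
}
```

For example, `backup_cr_exists apic 20231116T141005 || echo "not found in cluster"` prints the message only when no CR matches.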

The path to API Connect management database backups on the remote SFTP server or object-store has the following format:

<backup path>/<mgmt db cluster name>-<time when db was created>/base/<backup ID>

<backup path> is the path that is defined in the backup settings of your management CR. The format is bucket_name/folder.
If your API Connect deployment was upgraded from v10.0.6.0 or earlier, then /edb is appended to the path.
<mgmt db cluster name> is the name of your management database cluster. Identify this name with:
```
kubectl -n <management namespace> get cluster
```
See get cluster name.
<time when db was created> is a timestamp of when the management subsystem database was created. The format is: 2023-12-25T00:00:00Z.
Whenever you do a management database restore, a new management database is created, and so a new directory is created where subsequent database backups are stored.
<backup ID> is the ID of a management database backup. The format is YYYYMMDDTHHMMSS. This is a directory that contains all the files that comprise the database backup.

Example path:

s3bucket1/folder2/mgmt-db-2023-12-25T00:00:00Z/base/20231225T094400
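
The path format can be assembled with a short shell function. This is an illustrative sketch only; the function name `build_backup_path` is hypothetical, and the arguments are the four components described above.

```shell
# Hypothetical helper: assemble the remote path of one management database backup.
# Usage: build_backup_path <backup path> <cluster name> <db creation time> <backup ID>
build_backup_path() {
  # Reject backup IDs that are not in YYYYMMDDTHHMMSS form.
  if ! printf '%s\n' "$4" | grep -Eq '^[0-9]{8}T[0-9]{6}$'; then
    echo "invalid backup ID: $4" >&2
    return 1
  fi
  printf '%s/%s-%s/base/%s\n' "$1" "$2" "$3" "$4"
}

build_backup_path s3bucket1/folder2 mgmt-db 2023-12-25T00:00:00Z 20231225T094400
# prints s3bucket1/folder2/mgmt-db-2023-12-25T00:00:00Z/base/20231225T094400
```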

With this information, you can restore the backup by following the method that is described in Restoring the management database with a backup ID.

Error when restoring a management subsystem with the API discovery service

If you enable the API discovery service, the backups that you take can be restored only onto a management subsystem that also has the API discovery service enabled. If you try to restore a backup onto a management subsystem that doesn't have the API discovery service enabled, the database recovery jobs will go into an error state, and the logs will contain the following tablespaces directory error:

[Errno 30] Read-only file system: '/var/lib/postgresql/tablespaces'

The status of the management job pods will look like the following output:

management-apim-restore-1bc41b74-sjzzp                           0/1     Completed   0               3h28m
management-bf631b70-db-1-full-recovery-4lbdv                     0/1     Error       0               11m
management-bf631b70-db-1-full-recovery-725l9                     0/1     Error       0               13m
management-bf631b70-db-1-full-recovery-8mhdk                     0/1     Error       0               103s
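
To confirm that you are seeing this problem, you can list the failed recovery pods and search their logs for the tablespaces error. This is a minimal sketch that assumes the recovery pod names contain `-full-recovery-`, as in the output above. The function names `failed_recovery_pods` and `has_tablespaces_error` are hypothetical.

```shell
# Hypothetical helper: list recovery job pods that are in the Error state.
# Usage: failed_recovery_pods <namespace>
failed_recovery_pods() {
  # Columns of `kubectl get pods`: NAME READY STATUS RESTARTS AGE
  kubectl get pods -n "$1" --no-headers \
    | awk '$1 ~ /-full-recovery-/ && $3 == "Error" { print $1 }'
}

# Hypothetical helper: succeed if a pod's log contains the tablespaces error.
# Usage: has_tablespaces_error <namespace> <pod name>
has_tablespaces_error() {
  kubectl logs "$2" -n "$1" \
    | grep -Fq "Read-only file system: '/var/lib/postgresql/tablespaces'"
}
```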

In this case, you must take the following steps.

Update the ManagementCluster CR to enable the API discovery service:

spec:
  ...
  discovery:
    enabled: true
    proxyCollectorEnabled: true

Apply the updated CR by running the following command:
```
kubectl apply -f management_cr.yaml -n <management_namespace>
```
Where management_namespace is the name of the target installation namespace in the Kubernetes cluster.

Find the name of the EDB PostgreSQL cluster by running the following command:

kubectl get cluster -n <namespace>

The output will look similar to the following output:

NAME                     AGE   INSTANCES   READY   STATUS               PRIMARY
management-bf631b70-db   13m   1                   Setting up primary

Wait until the EDB PostgreSQL cluster has been updated with the tablespaces configuration. You can check this status by running the following command:
```
kubectl get cluster <cluster-name> -o yaml -n <namespace>
```
Look in the output for the tablespaces entry, which looks like the following example:
```
    tablespaces:
    - name: apidiscoverysvc
      owner:
        name: apicuser
      storage:
        resizeInUseVolumes: false
        size: 10Gi
        storageClass: local-storage
      temporary: false
```
Note that the cluster status will still be Setting up primary at this point.
Delete the EDB PostgreSQL cluster by running the following command:
```
kubectl delete cluster <cluster-name> -n <namespace>
```
After a few moments the EDB PostgreSQL cluster will be recreated by the IBM® API Connect operator. You can check this by running the following command:
```
kubectl get cluster -n <namespace>
```
The database recovery jobs are eventually recreated and succeed. You can then verify that the PVCs for the tablespaces have been recreated by running the following command:
```
kubectl get pvc -n <namespace>
```
The output of this command will show that there are one or more PVCs present, with a name similar to management-bf631b70-db-1-tbs-apidiscoverysvc.
The API discovery service pods will now start, and the restore operation will complete.
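
The wait, delete, and verify steps above can be sketched as shell helpers. This is a minimal sketch under stated assumptions, not a supported procedure: it detects the tablespaces entry by searching the cluster YAML for `name: apidiscoverysvc`, and it assumes that the tablespace PVC names end in `-tbs-apidiscoverysvc`, as in the example above. The function names, the default of 60 attempts, and the 10-second interval are all hypothetical.

```shell
# Hypothetical helper: wait until the cluster YAML lists the apidiscoverysvc tablespace.
# Usage: wait_for_tablespaces <namespace> <cluster-name> [attempts] [seconds between attempts]
wait_for_tablespaces() {
  attempts=${3:-60}
  interval=${4:-10}
  while [ "$attempts" -gt 0 ]; do
    if kubectl get cluster "$2" -o yaml -n "$1" | grep -q 'name: apidiscoverysvc'; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "tablespaces entry not found on cluster $2" >&2
  return 1
}

# Hypothetical helper: delete the cluster only after the tablespaces entry is present.
# Usage: recreate_db_cluster <namespace> <cluster-name>
recreate_db_cluster() {
  wait_for_tablespaces "$1" "$2" || return 1
  kubectl delete cluster "$2" -n "$1"
}

# Hypothetical helper: list the tablespace PVCs after the cluster is recreated.
# Usage: tablespace_pvcs <namespace>
tablespace_pvcs() {
  kubectl get pvc -n "$1" --no-headers \
    | awk '$1 ~ /-tbs-apidiscoverysvc$/ { print $1 }'
}
```

Guarding the delete behind the wait reflects the order of the steps: if the cluster is deleted before the tablespaces configuration is present, the recreated cluster can hit the same error.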

For more information about the API discovery service, see Enabling API discovery on Kubernetes.