Troubleshooting the management database
You can troubleshoot issues with the API Connect management database as described in this topic.
Database Replica Pods stuck in Unknown or Pending state
In certain scenarios, a postgres replica pod might not recover to a healthy state after a restore completes, a node outage occurs, or a fresh install or upgrade. In these cases, a postgres pod remains in an Unknown or a Pending state after a number of minutes and fails to get into a Running state.
This situation occurs when the replicas do not initialize properly. You can use the
patronictl reinit
command to reinitialize the replica. Note that this command syncs
the replica's volume data from the current Primary pod.
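The unhealthy member can also be picked out of the patronictl list output mechanically. The following is a minimal sketch, not part of the product tooling; it assumes the ASCII table layout shown in the steps below, with the member name in the second column and the state in the fifth:

```shell
# Print cluster members whose State column is not "running".
# Assumes the patronictl list table layout: | Member | Host | Role | State | TL | Lag in MB |
find_unhealthy() {
  awk -F'|' 'NF >= 6 && $2 !~ /Member/ {
    gsub(/^ +| +$/, "", $2)   # trim the member name
    gsub(/^ +| +$/, "", $5)   # trim the state
    if ($2 != "" && $5 != "running") print $2
  }'
}
```

You could then run, for example, patronictl list | find_unhealthy inside the pod to list only the members that need attention.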
Use the following steps to get the pod back into a working state:
- Exec onto the failing
pod:
kubectl exec -it <postgres_replica_pod_name> -n <namespace> -- bash
- List the cluster
members:
patronictl list
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
| Member                                                  | Host           | Role   | State        | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
| fxpk-management-01191b80-postgres-586f899fdf-6s25b      | 172.16.172.244 |        | start failed |    |   unknown |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running      |  3 |           |
| fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m  | 172.16.53.68   |        | running      |  3 |         0 |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
In the example shown above, fxpk-management-01191b80-postgres-586f899fdf-6s25b is not in a running state. Note the clusterName and the replicaName of the member that is not up:
clusterName - fxpk-management-01191b80-postgres
replicaName - fxpk-management-01191b80-postgres-586f899fdf-6s25b
- Run:
patronictl reinit <clusterName> <replicaName-which-is-not-running>
Example:
patronictl reinit fxpk-management-01191b80-postgres fxpk-management-01191b80-postgres-586f899fdf-6s25b
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
| Member                                                  | Host           | Role   | State        | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
| fxpk-management-01191b80-postgres-586f899fdf-6s25b      | 172.16.172.244 |        | start failed |    |   unknown |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running      |  3 |           |
| fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m  | 172.16.53.68   |        | running      |  3 |         0 |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
Are you sure you want to reinitialize members fxpk-management-01191b80-postgres-586f899fdf-6s25b? [y/N]: y
Success: reinitialize for member fxpk-management-01191b80-postgres-586f899fdf-6s25b
- Run
patronictl list
again. You might also observe that the replica is on a different Timeline (TL) and might have a lag in MB. It can take a few minutes for the pod to switch onto the same TL as the others, and the lag should slowly go to 0.
For example:
bash-4.2$ patronictl list
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+---------+----+-----------+
| Member                                                  | Host           | Role   | State   | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+---------+----+-----------+
| fxpk-management-01191b80-postgres-586f899fdf-6s25b      | 172.16.172.244 |        | running |  1 |     23360 |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running |  3 |           |
| fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m  | 172.16.53.68   |        | running |  3 |         0 |
+---------------------------------------------------------+----------------+--------+---------+----+-----------+
- The pod that was previously in an Unknown, a Pending, or a (0/1) Running state is now in the (1/1) Running state.
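You can also confirm readiness from outside the pod by inspecting kubectl get pods output. The helper below is a sketch, not part of the product; it assumes the default kubectl column layout (NAME, READY, STATUS, ...) and flags any postgres pod that is not fully ready:

```shell
# Report postgres pods whose READY count is not full (for example 0/1)
# or whose STATUS is not Running. Assumes the default
# "kubectl get pods" columns: NAME READY STATUS RESTARTS AGE
not_ready() {
  awk '/postgres/ {
    split($2, r, "/")          # READY is printed as ready/total
    if (r[1] != r[2] || $3 != "Running") print $1
  }'
}
```

For example, kubectl get pods -n <namespace> | not_ready prints nothing once every postgres pod is back in the (1/1) Running state.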