Troubleshooting the management database

You can troubleshoot problems with the API Connect management database.

Database Replica Pods stuck in Unknown or Pending state

In certain scenarios, a postgres replica pod might not recover to a healthy state after a restore completes, after a node outage, or after a fresh install or upgrade. In these cases, the postgres pod remains in an Unknown or a Pending state after a number of minutes, and fails to reach a Running state.

This situation occurs when the replicas do not initialize properly. You can use the patronictl reinit command to reinitialize the replica. Note that this command syncs the replica's volume data from the current Primary pod.
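Identifying the unhealthy member can also be scripted. The following is a minimal sketch, not part of API Connect: the `failing_members` helper is a hypothetical name, and it assumes the default table format of `patronictl list` as shown in the steps below. It prints each member whose State column is not `running`:

```shell
#!/bin/sh
# Minimal sketch: print every member in a `patronictl list` table whose
# State column is not "running". Function name and sample data are
# illustrative; on a live pod you would pipe `patronictl list` output in.
failing_members() {
  awk -F'|' '
    /^\+/ { next }                            # skip border lines
    $5 ~ /State/ { next }                     # skip the header row
    {
      gsub(/^[ \t]+|[ \t]+$/, "", $2)         # trim the Member column
      gsub(/^[ \t]+|[ \t]+$/, "", $5)         # trim the State column
      if ($5 != "running") print $2
    }'
}

# Feed it a sample table:
failing_members <<'EOF'
+ Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+
|                          Member                         |      Host      |  Role  |    State     | TL | Lag in MB |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
|    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | start failed |    |   unknown |
| fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader |   running    |  3 |           |
|  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        |   running    |  3 |         0 |
+---------------------------------------------------------+----------------+--------+--------------+----+-----------+
EOF
# prints: fxpk-management-01191b80-postgres-586f899fdf-6s25b
```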

Use the following steps to get the pod back into a working state:

  1. Exec into the failing pod:
    kubectl exec -it <postgres_replica_pod_name> -n <namespace> -- bash
  2. List the cluster members:
    patronictl list
    + Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
    |                          Member                         |      Host      |  Role  |    State     | TL | Lag in MB |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+
    |    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | start failed |    |   unknown |
    | fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader |   running    |  3 |           |
    |  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        |   running    |  3 |         0 |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+

    In the example shown above, fxpk-management-01191b80-postgres-586f899fdf-6s25b is not in a running state.

    Note the clusterName and the replicaName that is not up:

    • clusterName - fxpk-management-01191b80-postgres
    • replicaName - fxpk-management-01191b80-postgres-586f899fdf-6s25b
  3. Run:
    patronictl reinit <clusterName> <replicaName-which-is-not-running>

    Example:

    patronictl reinit fxpk-management-01191b80-postgres fxpk-management-01191b80-postgres-586f899fdf-6s25b
    + Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+--------------+----+-----------+
    |                          Member                         |      Host      |  Role  |    State     | TL | Lag in MB |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+
    |    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | start failed |    |   unknown |
    | fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader |   running    |  3 |           |
    |  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        |   running    |  3 |         0 |
    +---------------------------------------------------------+----------------+--------+--------------+----+-----------+
    Are you sure you want to reinitialize members fxpk-management-01191b80-postgres-586f899fdf-6s25b? [y/N]: y
    Success: reinitialize for member fxpk-management-01191b80-postgres-586f899fdf-6s25b
    
  4. Run patronictl list again.

    You might also observe that the replica is on a different timeline (TL) and possibly has a lag (Lag in MB). It can take a few minutes for the pod to switch onto the same timeline as the others, and the lag should slowly drop to 0.

    For example:

    bash-4.2$ patronictl list
    + Cluster: fxpk-management-01191b80-postgres (6893134118851096752) --------+--------+---------+----+-----------+
    |                          Member                         |      Host      |  Role  |  State  | TL | Lag in MB |
    +---------------------------------------------------------+----------------+--------+---------+----+-----------+
    |    fxpk-management-01191b80-postgres-586f899fdf-6s25b   | 172.16.172.244 |        | running |  1 |     23360 |
    | fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s | 172.16.148.51  | Leader | running |  3 |           |
    |  fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m |  172.16.53.68  |        | running |  3 |         0 |
    +---------------------------------------------------------+----------------+--------+---------+----+-----------+
    
  5. The pod that was previously in an Unknown, Pending, or (0/1) Running state is now in a (1/1) Running state.
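From outside the pod, you can confirm the recovery by checking that every postgres pod reports all of its containers ready. The following is a minimal sketch under stated assumptions: the `not_ready` helper is a hypothetical name, and the sample table stands in for real `kubectl get pods -n <namespace>` output:

```shell
#!/bin/sh
# Minimal sketch: given `kubectl get pods` output, print every pod that is
# not fully ready and Running. Helper name and sample data are illustrative;
# on a live cluster you would pipe kubectl output in instead.
not_ready() {
  awk 'NR > 1 {
    split($2, r, "/")                         # READY column, e.g. "0/1"
    if (r[1] != r[2] || $3 != "Running") print $1
  }'
}

not_ready <<'EOF'
NAME                                                      READY   STATUS    RESTARTS   AGE
fxpk-management-01191b80-postgres-586f899fdf-6s25b        1/1     Running   0          10m
fxpk-management-01191b80-postgres-rkww-795665698f-4rh4s   1/1     Running   0          2d
fxpk-management-01191b80-postgres-uvag-9475f7c5f-qr84m    1/1     Running   0          2d
EOF
# prints nothing: all replicas are back to 1/1 Running
```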