Management subsystem Postgres failover steps

If your Postgres leader pod is not the original Postgres leader, complete the failover steps to restore the original pod as leader.

Before you begin

This topic applies only to three-replica deployments of API Connect.

Follow the steps in this topic if one or more of the following conditions are reported:
  • The apicops version:pre-upgrade command reports:
    Current postgres leader is not the original leader
  • OpenShift and Cloud Pak for Integration only: Your API Connect cluster CR reports an Error status, and the output of:
    oc get apiconnectcluster -n <namespace> -o yaml
    shows:
    Original PostgreSQL primary is running as replica, please perform failover to original primary before attempting upgrade. Refer - https://ibm.biz/BdPLWU
  • You attempt to upgrade the management subsystem by editing the management CR to change the version and license, and you see this error:
    Original PostgreSQL primary is running as replica, please perform failover to original primary before attempting upgrade. Refer - https://ibm.biz/BdPLWU
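
For example, you can surface the reported condition with commands similar to the following (the grep filter is only illustrative):
  apicops version:pre-upgrade
  oc get apiconnectcluster -n <namespace> -o yaml | grep 'Original PostgreSQL primary'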

About this task

The management subsystem uses Postgres as the database. A three-replica API Connect deployment has a Postgres cluster that consists of one Postgres leader and two Postgres replicas.
Note: In Kubernetes terminology, every pod in a cluster is called a replica. In Postgres terminology, a Postgres cluster consists of members; one of the members is designated as the leader, and the others are designated as replicas. During normal operation, the Postgres cluster member that is the leader can change: one of the Postgres replicas can become the leader, and the original leader becomes a replica.
If the current Postgres cluster leader in your management deployment is not the original leader, then problems can occur during upgrade. To prevent upgrade problems, follow the steps in this topic to identify the original Postgres leader and assign it as the leader again.
Note: If you have an OVA deployment, run the steps that are documented in this topic from your management virtual appliance, as the root user: log in to your management virtual appliance with an SSH client and the username apicadm, then switch to the root user with sudo -i.
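For example, a typical login sequence on an OVA deployment is (substitute your own appliance hostname):
  ssh apicadm@<management appliance hostname>
  sudo -i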

Procedure

  1. Assign your management subsystem namespace to a new environment variable called $NAMESPACE.
    export NAMESPACE=<management namespace>
    Note: On OVA deployments, set this to "default":
    export NAMESPACE=default
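    You can confirm that the variable is set correctly with:
    echo $NAMESPACE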
  2. Identify the name of your Postgres cluster, and assign it to a new environment variable $PGCLUSTER_NAME.
    export PGCLUSTER_NAME=$(kubectl -n $NAMESPACE get pgcluster | awk 'NR==2{print $1}')
    Confirm the variable is set correctly with:
    echo $PGCLUSTER_NAME
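    If you prefer not to rely on the row order of the table output, an equivalent command (assuming exactly one Postgres cluster exists in the namespace) is:
    export PGCLUSTER_NAME=$(kubectl -n $NAMESPACE get pgcluster -o jsonpath='{.items[0].metadata.name}')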
  3. Get the PVCs named $PGCLUSTER_NAME and $PGCLUSTER_NAME-wal in your API Connect namespace. These are the PVCs that belong to the original Postgres leader.
    kubectl -n $NAMESPACE get pvc $PGCLUSTER_NAME $PGCLUSTER_NAME-wal
    
    NAME                       STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS    AGE
    m1-33253de3-postgres       Bound    local-pv-2b61c0bd   234Gi      RWO            local-storage   6h42m
    m1-33253de3-postgres-wal   Bound    local-pv-bd893695   234Gi      RWO            local-storage   6h42m
    Note: If no PVCs are returned, check the apicops error or the management CR warning. If it says The original primary PVC is not attached to any pods in this namespace, then do not proceed with these steps. Instead, re-create your Postgres cluster with a new leader:
    1. Take a new management database backup; see Backup and restore.
    2. Restore the management database from the new backup.
  4. Get the names of all the Postgres pods.
    kubectl get pods -n $NAMESPACE -l 'role in (master,replica)'
    
    NAME                                         READY   STATUS    RESTARTS   AGE
    m1-33253de3-postgres-dbd7966db-ltfhq         1/1     Running   0          61s
    m1-33253de3-postgres-dxxx-c54776d87-wldgb    1/1     Running   0          26m
    m1-33253de3-postgres-xwjf-8577b6c669-9chcp   1/1     Running   0          26m
  5. Identify which pod was the original Postgres leader.
    Describe each Postgres pod to see which one is using the two PVCs identified in step 3.
    kubectl -n $NAMESPACE describe pod <pod-name> | grep 'ClaimName'
    
    kubectl -n $NAMESPACE describe pod m1-33253de3-postgres-dbd7966db-ltfhq | grep ClaimName
        ClaimName:  m1-33253de3-postgres
        ClaimName:  m1-33253de3-postgres-wal
    
    kubectl -n $NAMESPACE describe pod m1-33253de3-postgres-dxxx-c54776d87-wldgb | grep ClaimName
        ClaimName:  m1-33253de3-postgres-dxxx
        ClaimName:  m1-33253de3-postgres-dxxx-wal
    
    kubectl -n $NAMESPACE describe pod m1-33253de3-postgres-xwjf-8577b6c669-9chcp | grep ClaimName
        ClaimName:  m1-33253de3-postgres-xwjf
        ClaimName:  m1-33253de3-postgres-xwjf-wal
    In this example, the pod m1-33253de3-postgres-dbd7966db-ltfhq is using the PVCs identified in step 3, so it is the original Postgres leader.
    Set this pod to the environment variable $ORIGINAL_POSTGRES_LEADER:
    export ORIGINAL_POSTGRES_LEADER=<pod-name>
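    To check all of the Postgres pods in one pass, you can use a loop similar to the following; it simply repeats the describe command above for each pod:
    for pod in $(kubectl -n $NAMESPACE get pods -l 'role in (master,replica)' -o name); do
      echo "== $pod"
      kubectl -n $NAMESPACE describe $pod | grep ClaimName
    done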
    Note: If none of the Postgres pods are using the PVCs identified in step 3, then do not proceed with these steps. Instead, re-create your Postgres cluster with a new leader:
    1. Take a new management database backup; see Backup and restore.
    2. Restore the management database from the new backup.
  6. Run the patronictl list command in each Postgres pod and examine the output.
    kubectl -n $NAMESPACE exec -it <postgres-pod> -- patronictl list
    Example output for each Postgres pod:
    Defaulted container "database" out of: database, set-libpq-certs (init)
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
    | Member                                     | Host            | Role    | State   | TL | Lag in MB |
    + Cluster: m1-33253de3-postgres (7039951359768572098) ---------+---------+---------+----+-----------+
    | m1-33253de3-postgres-dbd7966db-ltfhq       | 192.168.51.112  | Replica | running |  8 |         0 |
    | m1-33253de3-postgres-dxxx-c54776d87-wldgb  | 192.168.112.17  | Replica | running |  8 |         0 |
    | m1-33253de3-postgres-xwjf-8577b6c669-9chcp | 192.168.117.219 | Leader  | running |  8 |           |
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
    Each Postgres pod should show the same output, although the order of the records might differ. The names in the Member column should match the names of the Postgres pods. Check the following:
    • If the original Postgres pod that you identified in step 5 has a Lag in MB of zero and a TL the same as the Leader pod, then continue to step 7.
    • If the pod labeled Leader shows a TL that is lower than that of the other pods, or a nonempty value in Lag in MB, then proceed no further and open a support case.
    • If the original Postgres pod that you identified in step 5 shows a lower TL than the other pods, or a nonzero value in Lag in MB, then reinitialize that pod:
      kubectl exec -it $ORIGINAL_POSTGRES_LEADER -n $NAMESPACE -- patronictl reinit <cluster name> $ORIGINAL_POSTGRES_LEADER
      where:
      • <cluster name> is the name of the Postgres cluster that is shown in the patronictl list output:
        + Cluster: <cluster name> ...
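      For example, with the cluster name from the sample output above, the command would be:
      kubectl exec -it $ORIGINAL_POSTGRES_LEADER -n $NAMESPACE -- patronictl reinit m1-33253de3-postgres $ORIGINAL_POSTGRES_LEADER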
      Run patronictl list again to confirm that the Lag in MB is reducing. Do not proceed until it is zero. If it is not reducing, open a support case.
  7. Make the pod that you identified in step 5 the Leader.
    Run:
    kubectl -n $NAMESPACE exec -it <postgres-pod> -- patronictl failover
    and select your $ORIGINAL_POSTGRES_LEADER as the candidate to fail over to:
    
    Candidate ['m1-33253de3-postgres-dbd7966db-ltfhq', 'm1-33253de3-postgres-dxxx-c54776d87-wldgb'] []: m1-33253de3-postgres-dbd7966db-ltfhq
    Current cluster topology
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
    | Member                                     | Host            | Role    | State   | TL | Lag in MB |
    + Cluster: m1-33253de3-postgres (7039951359768572098) ---------+---------+---------+----+-----------+
    | m1-33253de3-postgres-dbd7966db-ltfhq       | 192.168.51.112  | Replica | running |  8 |         0 |
    | m1-33253de3-postgres-dxxx-c54776d87-wldgb  | 192.168.112.17  | Replica | running |  8 |         0 |
    | m1-33253de3-postgres-xwjf-8577b6c669-9chcp | 192.168.117.219 | Leader  | running |  8 |           |
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
    Are you sure you want to failover cluster m1-33253de3-postgres, demoting current master m1-33253de3-postgres-xwjf-8577b6c669-9chcp? [y/N]: y
    2021-12-10 22:07:02.15154 Successfully failed over to "m1-33253de3-postgres-dbd7966db-ltfhq"
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
    | Member                                     | Host            | Role    | State   | TL | Lag in MB |
    + Cluster: m1-33253de3-postgres (7039951359768572098) ---------+---------+---------+----+-----------+
    | m1-33253de3-postgres-dbd7966db-ltfhq       | 192.168.51.112  | Leader  | running |  8 |           |
    | m1-33253de3-postgres-dxxx-c54776d87-wldgb  | 192.168.112.17  | Replica | running |  8 |         0 |
    | m1-33253de3-postgres-xwjf-8577b6c669-9chcp | 192.168.117.219 | Replica | stopped |    |   unknown |
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
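    Depending on the Patroni version in the database container, the failover might also be run non-interactively; verify the available options with patronictl failover --help before relying on this form:
    kubectl -n $NAMESPACE exec -it <postgres-pod> -- patronictl failover --candidate $ORIGINAL_POSTGRES_LEADER --force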
    
  8. Run patronictl list to confirm that the pod that you identified in step 5 is now the leader, that all members show the same TL, and that the replicas show a Lag in MB of zero.
    kubectl exec -it <postgres-pod> -n $NAMESPACE -- patronictl list
    
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
    | Member                                     | Host            | Role    | State   | TL | Lag in MB |
    + Cluster: m1-33253de3-postgres (7039951359768572098) ---------+---------+---------+----+-----------+
    | m1-33253de3-postgres-dbd7966db-ltfhq       | 192.168.51.112  | Leader  | running |  9 |           |
    | m1-33253de3-postgres-dxxx-c54776d87-wldgb  | 192.168.112.17  | Replica | running |  9 |         0 |
    | m1-33253de3-postgres-xwjf-8577b6c669-9chcp | 192.168.117.219 | Replica | running |  9 |         0 |
    +--------------------------------------------+-----------------+---------+---------+----+-----------+
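    If the original warning came from apicops, you can rerun the check to confirm that the Current postgres leader is not the original leader message is no longer reported:
    apicops version:pre-upgrade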