PostgreSQL cluster replicas get out of sync
The replicas for Cloud Native PostgreSQL and EDB Postgres clusters occasionally get out of sync with the primary node.
Symptoms
This issue can manifest in several ways:
- You see a mismatch in the data between the replicas.
- The database cluster is stuck in the Waiting for the instances to become active state. To check the status of the database cluster, run:
  oc get clusters.postgresql.k8s.enterprisedb.io \
    -n ${PROJECT_CPD_INST_OPERANDS}
- One or more of the database pods are in the CrashLoopBackOff state or are stuck in the Running state. To check the status of the database pods, run:
  oc get pods \
    -n ${PROJECT_CPD_INST_OPERANDS} \
    -l k8s.enterprisedb.io/podRole=instance
  Important: If the command returns common-service-db pods in the CrashLoopBackOff state, complete step 1b in Diagnosing the problem to determine whether you need to contact IBM Support for help resolving the issue.
- A backup checkpoint job (zen-checkpoint-backup-job-*) fails. A quick way to check for failed checkpoint jobs is sketched after this list.
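To check for failed checkpoint backup jobs, one option is to list the jobs in the instance project and filter on the job name prefix. This is a convenience sketch rather than part of the documented procedure; it assumes the zen-checkpoint-backup-job-* naming shown above:
  # List checkpoint backup jobs and their completion counts
  oc get jobs -n ${PROJECT_CPD_INST_OPERANDS} | grep zen-checkpoint-backup-job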
Diagnosing the problem
To confirm that the replicas are out of sync:
- Check the logs for errors:
  - Search the PostgreSQL operator logs for the phrase WAL:
    oc logs -n ${PROJECT_CPD_INST_OPERATORS} \
      -l app.kubernetes.io/name=cloud-native-postgresql \
      | grep "WAL"
    Look for messages with the following phrases:
    - missing WAL files
    - tried restoring WALs
    A single-pass filter for both phrases is sketched after this step.
  - Search the PostgreSQL pod logs for the phrase Failed to execute pg_rewind:
    oc logs -n ${PROJECT_CPD_INST_OPERANDS} \
      -l app.kubernetes.io/name=cloud-native-postgresql \
      | grep '"level":"error"' \
      | grep 'Failed to execute pg_rewind'
    Important: If the common-service-db pods are in the CrashLoopBackOff state and you see the preceding phrases in the common-service-db pod logs, contact IBM Support for assistance resolving the following issue: common-service-db pod in CrashLoopBackOff state due to EDB WAL storage full.
    For other pods, if you see either of the preceding phrases in the log files, proceed to the next step.
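If you prefer a single pass over the operator logs, the following filter is a convenience sketch (not part of the documented procedure) that surfaces both indicator phrases at once:
  # Surface both WAL-related indicator phrases in one pass
  oc logs -n ${PROJECT_CPD_INST_OPERATORS} \
    -l app.kubernetes.io/name=cloud-native-postgresql \
    | grep -E "missing WAL files|tried restoring WALs"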
- Set the POSTGRES_CLUSTER environment variable to the name of the PostgreSQL cluster where you want to check the replica status:
  export POSTGRES_CLUSTER=<cluster-name>
  Tip: If you don't know the name of the PostgreSQL cluster, run the following command to list the PostgreSQL clusters in the project:
  oc get clusters.postgresql.k8s.enterprisedb.io \
    -n ${PROJECT_CPD_INST_OPERANDS}
- If you have the Cloud Native PostgreSQL (CNP) plug-in for kubectl, you can use it to check the status of the PostgreSQL cluster:
  - Run the following command to check the status of the cluster:
    kubectl cnp status ${POSTGRES_CLUSTER}
  - Compare the Current LSN of the primary node with the Received LSN and Replay LSN for the replicas. (An illustrative example follows this list.)
    - If the log sequence numbers (LSNs) are the same, the replicas are in sync.
    - If the log sequence numbers (LSNs) are different, proceed to Resolving the problem.
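For example, if the primary node reports a Current LSN of 0/7000060 and a replica reports a Replay LSN of 0/6000028 (illustrative values), the replica has not yet replayed all of the primary's WAL and is out of sync.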
- If you do not have the CNP plug-in, run the following commands to compare the log sequence number on the primary node and on the replicas:
  - Set the PRIMARY_POD, PRIMARY_POD_LSN, REPLICA_POD, and REPLICA_POD_LSN environment variables:
    - Environments with one replica pod
      - Set the PRIMARY_POD and REPLICA_POD environment variables:
        export PRIMARY_POD=$(oc get pod -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster=${POSTGRES_CLUSTER},role=primary -o jsonpath="{.items[0].metadata.name}")
        export REPLICA_POD=$(oc get pod -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster=${POSTGRES_CLUSTER},role=replica -o name)
      - Set the PRIMARY_POD_LSN and REPLICA_POD_LSN environment variables:
        export PRIMARY_POD_LSN=$(oc exec ${PRIMARY_POD} -n ${PROJECT_CPD_INST_OPERANDS} -c postgres -- psql -U postgres -c "SELECT pg_current_wal_lsn();" | sed -n '3p' | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//')
        export REPLICA_POD_LSN=$(oc -n ${PROJECT_CPD_INST_OPERANDS} exec -t ${REPLICA_POD} -c postgres -- psql -U postgres -c "SELECT pg_last_wal_replay_lsn();" | sed -n '3p' | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//')
    - Environments with multiple replica pods
      - Set the PRIMARY_POD environment variable:
        export PRIMARY_POD=$(oc get pod -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster=${POSTGRES_CLUSTER},role=primary -o jsonpath="{.items[0].metadata.name}")
      - Get the names of the replica pods:
        oc get pod -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster=${POSTGRES_CLUSTER},role=replica -o name
        The command returns the list of pods that are associated with the specified PostgreSQL cluster (${POSTGRES_CLUSTER}).
      - Set the REPLICA_POD environment variable to one of the replica pods:
        export REPLICA_POD=<pod-name>
      - Set the PRIMARY_POD_LSN and REPLICA_POD_LSN environment variables:
        export PRIMARY_POD_LSN=$(oc exec ${PRIMARY_POD} -n ${PROJECT_CPD_INST_OPERANDS} -c postgres -- psql -U postgres -c "SELECT pg_current_wal_lsn();" | sed -n '3p' | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//')
        export REPLICA_POD_LSN=$(oc -n ${PROJECT_CPD_INST_OPERANDS} exec -t ${REPLICA_POD} -c postgres -- psql -U postgres -c "SELECT pg_last_wal_replay_lsn();" | sed -n '3p' | sed -e 's/^[ \t]*//' -e 's/[ \t]*$//')
  - For each replica, compare the LSN on the primary node and on the replica pod:
    echo $PRIMARY_POD_LSN, $REPLICA_POD_LSN
    The LSN of the primary node and the replica pod should be the same. If the LSN of any replica pod is different from the LSN of the primary pod, the PostgreSQL cluster is in an out-of-sync state and you must complete Resolving the problem. A scripted variant of this comparison is sketched after this list.
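If the cluster has several replicas, checking each one by hand is tedious. The following loop is a convenience sketch, not part of the documented procedure: it compares the replay LSN of every replica against the primary's current LSN, uses psql -t -A to print bare values instead of parsing the output with sed, and assumes the same pod labels and postgres container name that are used in the preceding steps.
  # Current WAL LSN on the primary
  PRIMARY_POD=$(oc get pod -n ${PROJECT_CPD_INST_OPERANDS} \
    -l k8s.enterprisedb.io/cluster=${POSTGRES_CLUSTER},role=primary \
    -o jsonpath="{.items[0].metadata.name}")
  PRIMARY_LSN=$(oc exec ${PRIMARY_POD} -n ${PROJECT_CPD_INST_OPERANDS} -c postgres -- \
    psql -U postgres -t -A -c "SELECT pg_current_wal_lsn();")

  # Compare every replica's replay LSN against the primary.
  # On an active cluster the primary LSN advances continuously, so small
  # transient differences are normal; a replica that stays far behind is suspect.
  for POD in $(oc get pod -n ${PROJECT_CPD_INST_OPERANDS} \
    -l k8s.enterprisedb.io/cluster=${POSTGRES_CLUSTER},role=replica \
    -o jsonpath="{.items[*].metadata.name}"); do
    REPLICA_LSN=$(oc exec ${POD} -n ${PROJECT_CPD_INST_OPERANDS} -c postgres -- \
      psql -U postgres -t -A -c "SELECT pg_last_wal_replay_lsn();")
    if [ "${REPLICA_LSN}" = "${PRIMARY_LSN}" ]; then
      echo "${POD}: in sync (${REPLICA_LSN})"
    else
      echo "${POD}: possibly out of sync (primary=${PRIMARY_LSN}, replica=${REPLICA_LSN})"
    fi
  done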
Resolving the problem
- Check whether the EDB Postgres cluster is fenced:
  oc get clusters.postgresql.k8s.enterprisedb.io ${POSTGRES_CLUSTER} -o yaml | grep fencedInstances
- If the EDB Postgres cluster is fenced, lift the fencing by removing the fencedInstances annotation:
  oc annotate clusters.postgresql.k8s.enterprisedb.io ${POSTGRES_CLUSTER} k8s.enterprisedb.io/fencedInstances-
- Wait several minutes, then check the status of the EDB Postgres cluster:
  oc get clusters.postgresql.k8s.enterprisedb.io ${POSTGRES_CLUSTER} \
    -n ${PROJECT_CPD_INST_OPERANDS}
  - If the status is Cluster in healthy state, no additional resolution is required.
  - If the status is not Cluster in healthy state, continue to the next step.
- Get the name of the persistent volume claim that is associated with the replica pod:
  oc describe pod ${REPLICA_POD} -n ${PROJECT_CPD_INST_OPERANDS}
  In the Volumes section, get the ClaimName. The name of the persistent volume claim is usually the same as the name of the replica pod.
- Delete the persistent volume claim:
  - If the claim name is the same as the replica pod name:
    oc delete pvc ${REPLICA_POD} -n ${PROJECT_CPD_INST_OPERANDS}
  - If the claim name is different from the replica pod name:
    oc delete pvc <pvc-name> -n ${PROJECT_CPD_INST_OPERANDS}
- Delete the replica pod:
  oc delete pod ${REPLICA_POD} -n ${PROJECT_CPD_INST_OPERANDS}
  Deleting the claim and the pod lets the operator re-create the replica with a fresh volume and resynchronize it from the primary node. An optional verification sketch follows these steps.
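To verify the recovery, you can optionally watch the cluster status until it reports a healthy state again. This sketch reuses the status command from earlier in this procedure with the standard watch flag:
  # Watch the cluster status until it returns to "Cluster in healthy state" (Ctrl+C to stop)
  oc get clusters.postgresql.k8s.enterprisedb.io ${POSTGRES_CLUSTER} \
    -n ${PROJECT_CPD_INST_OPERANDS} -w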