Monitoring Postgres disk usage on OpenShift

In API Connect, you can monitor the disk space that is used by the Postgres database in the Management subsystem.

The Postgres database is the core database used in the Management subsystem. It is important to monitor the Postgres disk usage to avoid running out of space and causing an outage in your deployment.

In Version 10.0.1.4-eus and later, the APIConnect operator tracks the current disk usage of the Postgres components, and regularly updates the ManagementCluster CR's status. When one or more of the Postgres components occupy 50% of the PVC (persistent volume claim) capacity, the APIConnect operator changes the CR's status from Running to Warning.

Important: If the Postgres disk usage reaches 80%, then the APIConnect operator brings down Postgres. If this situation occurs, contact IBM Support.

Various storage classes can be used to deploy the Management subsystem; for example, local-storage or Ceph block. When local-storage is used, the entire disk is allocated to the worker node. In some cases, Kubernetes itself might detect disk pressure and start evicting pods before the APIConnect operator reacts to an 80% usage condition. In this situation, you might want to increase the size of the disk allocated to the worker node, as explained in Recovering on OpenShift and Cloud Pak for Integration when disks are filled by the management database.

Viewing the current disk usage

The ManagementCluster instance includes the .status.postgresDataStats field, where the operator displays the current disk usage of Postgres components. Run the following command to get the disk usage:

oc get mgmt m1 -n APIC_namespace -o json | jq .status.postgresDataStats

The response looks like the following example:

[
  {
    "instanceName": "m1-ed60c42d-postgres",
    "podName": "m1-ed60c42d-postgres-86766f69cb-xs7t5",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres",
    "pvcType": "PostgreSQL",
    "pvcUsed": 60895232,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-backrest-shared-repo",
    "podName": "m1-ed60c42d-postgres-backrest-shared-repo-859484b5f6-r7cv6",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-pgbr-repo",
    "pvcType": "pgBackRest",
    "pvcUsed": 16203776,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres",
    "podName": "m1-ed60c42d-postgres-86766f69cb-xs7t5",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-wal",
    "pvcType": "WAL",
    "pvcUsed": 201338880,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-fqbg",
    "podName": "m1-ed60c42d-postgres-fqbg-84869d976c-49qgc",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-fqbg",
    "pvcType": "PostgreSQL",
    "pvcUsed": 59994112,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-fqbg",
    "podName": "m1-ed60c42d-postgres-fqbg-84869d976c-49qgc",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-fqbg-wal",
    "pvcType": "WAL",
    "pvcUsed": 201334784,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-cimo",
    "podName": "m1-ed60c42d-postgres-cimo-5769868b75-5z7ln",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-cimo",
    "pvcType": "PostgreSQL",
    "pvcUsed": 59994112,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-cimo",
    "podName": "m1-ed60c42d-postgres-cimo-5769868b75-5z7ln",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-cimo-wal",
    "pvcType": "WAL",
    "pvcUsed": 201334784,
    "pvcUsedPercentage": 0
  }
]
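
In the example above, pvcUsedPercentage is reported as a whole number, so lightly used volumes all show 0. If you want finer detail, the percentage can be recomputed from pvcUsed and pvcCapacity. The following is a minimal sketch, not part of the product: the function name pg_pvc_usage is hypothetical, and it assumes that jq is installed and that the array is passed on standard input.

```shell
# Hypothetical helper (not part of API Connect): reads the postgresDataStats
# array on stdin and prints one tab-separated line per PVC:
# PVC name, PVC type, used percentage rounded to two decimal places.
pg_pvc_usage() {
  jq -r '.[]
    | [ .pvcName,
        .pvcType,
        (.pvcUsed / .pvcCapacity * 100 * 100 | round / 100 | tostring) ]
    | @tsv'
}

# Usage, assuming the instance name m1 and the namespace placeholder
# from the command above:
# oc get mgmt m1 -n APIC_namespace -o json | jq .status.postgresDataStats | pg_pvc_usage
```

Because the helper only reads standard input, it can also be run against a saved copy of the JSON output.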

Warning condition is populated at 50% usage

When one of the Postgres components uses 50% of its allocated space, the operator changes the overall status of the ManagementCluster to Warning with an appropriate warning message. The operator also updates the postgresDataStats section with the current data usage of the Postgres components.

Attention: If you encounter the Warning condition, contact IBM Support for help correcting the root cause.
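
To find out which component triggered the warning, you can filter postgresDataStats by pvcUsedPercentage. The following is a sketch under stated assumptions: the function name pvcs_at_or_above is hypothetical, the default threshold of 50 mirrors the warning level described above, and jq must be installed.

```shell
# Hypothetical helper (not part of API Connect): reads the postgresDataStats
# array on stdin and prints "<pvcName> <pvcUsedPercentage>" for every PVC
# whose reported usage is at or above the threshold (default: 50).
pvcs_at_or_above() {
  threshold="${1:-50}"
  jq -r --argjson t "$threshold" \
    '.[] | select(.pvcUsedPercentage >= $t) | "\(.pvcName) \(.pvcUsedPercentage)"'
}

# Usage, assuming the instance name m1 used in this topic:
# oc get mgmt m1 -o json | jq .status.postgresDataStats | pvcs_at_or_above 50
```

An empty result means that no PVC has reached the threshold.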

To view the status, run the following command:

oc get mgmt m1

The following example response shows the warning state:

NAME   READY   STATUS    VERSION      RECONCILED VERSION      AGE
m1     17/17   Warning   10.0.1-eus   10.0.1.4-ifix1-44-eus   26h

To view just the status, run the following command:

oc get mgmt m1 -o json | jq .status.phase

In this example, the status displayed in the response is "Warning":

"Warning"
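
If you poll the phase from a scheduled job or a monitoring probe, it can help to translate it into an exit code. The following is a minimal sketch: the function name phase_to_exit_code and the exit-code mapping are illustrative assumptions, not product behavior. Only the phase names Running, Warning, and Error come from this topic.

```shell
# Hypothetical helper (not part of API Connect): maps a ManagementCluster
# phase to an exit code that a scheduled job or monitoring probe can act on.
phase_to_exit_code() {
  case "$1" in
    Running) return 0 ;;   # healthy
    Warning) return 1 ;;   # for example, a Postgres PVC reached 50% usage
    *)       return 2 ;;   # Error or any other phase: treat as critical
  esac
}

# Usage, assuming the instance name m1 used in this topic.
# jq -r prints the phase without the surrounding quotation marks:
# phase_to_exit_code "$(oc get mgmt m1 -o json | jq -r .status.phase)"
```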

To see the list of current status conditions, run the following command:

oc get mgmt m1 -o json | jq .status.conditions

The following example response displays the list of status conditions. Notice that the "Warning" condition is set to "True" and the other conditions are set to "False":

[
  {
    "lastTransitionTime": "2021-09-08T19:49:51Z",
    "message": "Current WAL disk usage of m1-ed60c42d-postgres is 51 percent. DATABASE SHUTDOWN starts at 80 percent utilization. Please contact IBM Support immediately",
    "reason": "disk_usage_more_than_50_percent",
    "status": "True",
    "type": "Warning"
  },
  {
    "lastTransitionTime": "2021-09-08T19:42:30Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2021-09-08T19:41:35Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Pending"
  },
  {
    "lastTransitionTime": "2021-09-07T17:18:02Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Error"
  },
  {
    "lastTransitionTime": "2021-09-07T17:18:02Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Failed"
  }
]

The warning condition includes an explanation of why the condition was set to "True"; for example:

Current WAL disk usage of m1-ed60c42d-postgres is 51 percent. DATABASE SHUTDOWN starts at 80 percent utilization. Please contact IBM Support immediately
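
To show only the conditions that are currently set, together with their messages, you can filter the list on the status field. The following is a sketch under stated assumptions: the function name active_conditions is hypothetical, and jq must be installed.

```shell
# Hypothetical helper (not part of API Connect): reads the .status.conditions
# array on stdin and prints "<type>: <message>" for each condition whose
# status is "True".
active_conditions() {
  jq -r '.[] | select(.status == "True") | "\(.type): \(.message)"'
}

# Usage, assuming the instance name m1 used in this topic:
# oc get mgmt m1 -o json | jq .status.conditions | active_conditions
```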

Error condition is populated at 80% usage

When one of the Postgres components uses 80% of its allocated space, the operator changes the overall status of the ManagementCluster to Error and brings down Postgres to avoid problems that can occur if the disk becomes completely filled.

Attention: If you encounter the Error condition, contact IBM Support for help correcting the root cause.

In this case, when you run the following command to view the status conditions, the Error condition is set to "True":

oc get mgmt m1 -o json | jq .status.conditions

For example, the following condition (shown here in YAML format) has the Error status set to "True" due to disk usage:

 - lastTransitionTime: "2021-08-28T01:42:56Z"
   message: Current WAL disk usage of nihar-stac-3955b42d-site1-postgres is more
     than 70 percent, initiating DATABASE SHUTDOWN
   reason: disk_usage_more_than_70_percent
   status: "True"
   type: Error