Monitoring Postgres disk usage on OpenShift

In API Connect, you can monitor the disk space that is used by the Postgres database in the Management subsystem.

The Postgres database is the core database used in the Management subsystem. It is important to monitor the Postgres disk usage to avoid running out of space and causing an outage in your deployment.

The APIConnect operator tracks the current disk usage of the Postgres components, and regularly updates the ManagementCluster CR's status. When one or more of the Postgres components occupy 60% of the PVC (persistent volume claim) capacity, the APIConnect operator changes the CR's status from Running to Warning.

Important: If the Postgres disk usage reaches 80%, then the APIConnect operator brings down Postgres. If this situation occurs, contact IBM Support.

Note: Several types of storage class can be used to deploy the Management subsystem; for example, local-storage or Ceph block. When local-storage is used, the entire disk is allocated to the worker node. In some cases, Kubernetes itself might come under disk pressure and start evicting pods before the APIConnect operator reacts to an 80% usage condition.
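
Because node-level disk pressure can occur before the operator's own thresholds are reached, it can be useful to check the worker nodes as well. The following is a minimal sketch rather than a documented procedure: it assumes that jq is installed and that the nodes report the standard Kubernetes DiskPressure condition. The function name nodes_under_disk_pressure is hypothetical.

```shell
# Print the names of nodes whose DiskPressure condition is "True".
# Reads the JSON produced by `oc get nodes -o json` on stdin.
nodes_under_disk_pressure() {
  jq -r '.items[]
    | select(any(.status.conditions[]?;
                 .type == "DiskPressure" and .status == "True"))
    | .metadata.name'
}

# Usage against a live cluster (assumes you are logged in with oc):
# oc get nodes -o json | nodes_under_disk_pressure
```

An empty result means that no node currently reports disk pressure.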

Viewing the current disk usage

The ManagementCluster instance includes the .status.postgresDataStats field, where the operator displays the current disk usage of Postgres components. Run the following command to get the disk usage:

oc get mgmt -o json | jq '.items[].status.postgresDataStats'

The response looks like the following example:

[
  {
    "instanceName": "<mgmt cr name>-site1-db-1",
    "podName": "",
    "pvcCapacity": 51200,
    "pvcName": "",
    "pvcType": "WAL",
    "pvcUsed": 400,
    "pvcUsedPercentage": 8
  },
  {
    "instanceName": "<mgmt cr name>-site1-db-1",
    "podName": "",
    "pvcCapacity": 184320,
    "pvcName": "",
    "pvcType": "PostgreSQL",
    "pvcUsed": 95,
    "pvcUsedPercentage": 8
  }
]
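
To pick out only the components that are approaching a threshold, the same output can be filtered on the pvcUsedPercentage field. This is a minimal sketch that assumes jq is installed; the function name pg_disk_over and its default threshold of 60 are illustrative choices, not part of the product.

```shell
# Print Postgres components whose usage is at or above a threshold.
# Reads the JSON produced by `oc get mgmt -o json` on stdin.
# $1 is the threshold in percent (defaults to 60, the warning level).
pg_disk_over() {
  threshold="${1:-60}"
  jq -r --argjson t "$threshold" '
    .items[].status.postgresDataStats[]?
    | select(.pvcUsedPercentage >= $t)
    | "\(.instanceName) \(.pvcType) \(.pvcUsedPercentage)%"'
}

# Usage against a live cluster (assumes you are logged in with oc):
# oc get mgmt -o json | pg_disk_over 60
```

Each output line names the instance, the PVC type, and the percentage used. No output means that every component is below the threshold.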

Warning condition is populated at 60% usage

When one of the Postgres components uses 60% of its allocated space, the operator changes the overall status of the ManagementCluster to Warning with an appropriate warning message. The operator also updates the postgresDataStats section with the current data usage of the Postgres components.

Attention: If you encounter the Warning condition, contact IBM Support for help correcting the root cause.

To view the status, run the following command:

oc get mgmt

The following example response shows the warning state:

NAME              READY   STATUS    VERSION    RECONCILED VERSION   MESSAGE                                                          AGE
stv3-management   6/8     Warning   10.0.8.0   10.0.8.0-5363        Some services are not ready - see status condition for details   3d20h

To view just the status, run the following command:

oc get mgmt -o json | jq '.items[].status.phase'

In this example, the status displayed in the response is "Warning":

"Warning"

To see the list of current status conditions, run the following command:

oc get mgmt -o json | jq '.items[].status.conditions'

The following example response displays the list of status conditions. Notice that the "Warning" condition is set to "True" and the other conditions are set to "False":

status:
  conditions:
  - lastTransitionTime: "2022-06-02T19:38:09Z"
    message: Warning threshold=60%, Current disk usage=63%, Is wal archiving working?=false.
      Database shutdown starts at 80%. Please contact IBM Support immediately.
    reason: wal_disk_usage_more_than_warning_threshold
    status: "True"
    type: Warning
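
When the list of conditions is long, it can be easier to show only the conditions that are currently set to "True". The following sketch assumes jq is installed and that each condition carries the type, status, and reason fields shown in the example; the function name active_conditions is hypothetical.

```shell
# Print "<type>: <reason>" for every condition whose status is "True".
# Reads the JSON produced by `oc get mgmt -o json` on stdin.
active_conditions() {
  jq -r '.items[].status.conditions[]?
    | select(.status == "True")
    | "\(.type): \(.reason)"'
}

# Usage against a live cluster (assumes you are logged in with oc):
# oc get mgmt -o json | active_conditions
```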

Error condition is populated at 80% usage

When one of the Postgres components uses 80% of its allocated space, the operator changes the overall status of the ManagementCluster to Error and brings down Postgres to avoid the problems that can occur if the disk becomes completely filled.

Attention: If you encounter the Error condition, contact IBM Support for help correcting the root cause.

In this case, when you run the following command to view the status conditions, you will see that the "Error" condition is set to "True":

oc get mgmt -o json | jq '.items[].status.conditions'

For example, the following condition shows the Error status set to "True" due to disk usage:

status:
  conditions:
  - lastTransitionTime: "2022-06-03T00:40:54Z"
    message: Error threshold=80%, Current disk usage=82%, Is wal archiving working?=false.
      Database is in shutdown mode and management services are disabled. Please contact
      IBM Support immediately.
    reason: wal_disk_usage_more_than_error_threshold
    status: "True"
    type: Error
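
For scheduled checks, the overall status can be turned into an exit code that a cron job or monitoring agent can act on. This is a minimal sketch under stated assumptions: jq is installed, and the .status.phase field reads "Warning" or "Error" in the two conditions described above (the "Error" phase value is inferred from the description, not from a sample response). The function name mgmt_phase_check and the exit codes are illustrative.

```shell
# Map ManagementCluster phases to an exit code:
#   0 = neither Warning nor Error, 1 = Warning, 2 = Error.
# Reads the JSON produced by `oc get mgmt -o json` on stdin.
mgmt_phase_check() {
  phases=$(jq -r '.items[].status.phase')
  case "$phases" in
    *Error*)   echo "ERROR";   return 2 ;;
    *Warning*) echo "WARNING"; return 1 ;;
    *)         echo "OK";      return 0 ;;
  esac
}

# Usage against a live cluster (assumes you are logged in with oc):
# oc get mgmt -o json | mgmt_phase_check || echo "Contact IBM Support"
```

Error is tested before Warning so that the more severe state wins when several ManagementCluster instances are returned.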