Monitoring Postgres disk usage

Monitor disk usage by the Postgres database in the management subsystem.

Postgres is the core database of the management subsystem, so it is important to monitor its disk usage.

In 10.0.3.0 and later, the APIConnect operator tracks the current disk usage of the Postgres components and regularly updates the ManagementCluster status. When one or more of the Postgres components occupy 50% of the PVC (persistent volume claim) capacity, the APIConnect operator changes the status from Running to Warning.

The APIConnect operator brings down Postgres if disk usage reaches 80%. In this case, you must contact IBM Support.

Note: Various types of storage classes can be used to deploy the management subsystem, for example local-storage or ceph block.
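The 50% and 80% thresholds described above can be summarized in a short Python sketch. The function name, the labels it returns, and the choice to treat a value exactly at a threshold as having crossed it are illustrative assumptions; this is not the operator's own code.

```python
# Thresholds taken from the text above, in percent of PVC capacity.
WARNING_THRESHOLD = 50   # status changes from Running to Warning
SHUTDOWN_THRESHOLD = 80  # operator brings down Postgres


def classify_usage(used_percentage):
    """Map a disk usage percentage to a hypothetical state label.

    Assumption: a value equal to a threshold counts as having reached it.
    """
    if used_percentage >= SHUTDOWN_THRESHOLD:
        return "shutdown"
    if used_percentage >= WARNING_THRESHOLD:
        return "warning"
    return "ok"
```

For instance, classify_usage(51) falls in the warning range, which corresponds to the point at which the ManagementCluster status changes to Warning.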

Management Cluster disk stats

The ManagementCluster instance has a .status.postgresDataStats field, where the operator reports the current disk usage of each Postgres component. For example:

kubectl get mgmt m1 -o json | jq .status.postgresDataStats
[
  {
    "instanceName": "m1-ed60c42d-postgres",
    "podName": "m1-ed60c42d-postgres-86766f69cb-xs7t5",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres",
    "pvcType": "PostgreSQL",
    "pvcUsed": 60895232,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-backrest-shared-repo",
    "podName": "m1-ed60c42d-postgres-backrest-shared-repo-859484b5f6-r7cv6",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-pgbr-repo",
    "pvcType": "pgBackRest",
    "pvcUsed": 16203776,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres",
    "podName": "m1-ed60c42d-postgres-86766f69cb-xs7t5",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-wal",
    "pvcType": "WAL",
    "pvcUsed": 201338880,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-fqbg",
    "podName": "m1-ed60c42d-postgres-fqbg-84869d976c-49qgc",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-fqbg",
    "pvcType": "PostgreSQL",
    "pvcUsed": 59994112,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-fqbg",
    "podName": "m1-ed60c42d-postgres-fqbg-84869d976c-49qgc",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-fqbg-wal",
    "pvcType": "WAL",
    "pvcUsed": 201334784,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-cimo",
    "podName": "m1-ed60c42d-postgres-cimo-5769868b75-5z7ln",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-cimo",
    "pvcType": "PostgreSQL",
    "pvcUsed": 59994112,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-cimo",
    "podName": "m1-ed60c42d-postgres-cimo-5769868b75-5z7ln",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-cimo-wal",
    "pvcType": "WAL",
    "pvcUsed": 201334784,
    "pvcUsedPercentage": 0
  }
]
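In the sample output, pvcUsedPercentage is 0 for every entry because usage is far below 1% of capacity. When a finer-grained figure is useful, the percentage can be recomputed from pvcUsed and pvcCapacity. The following Python sketch assumes both fields are expressed in the same unit (they appear to be byte counts) and that the input is the list printed by the command above; the function name is hypothetical.

```python
def summarize_pvc_usage(stats):
    """Return (pvcName, pvcType, percent) tuples, highest usage first.

    percent is recomputed from pvcUsed / pvcCapacity and rounded to
    three decimal places. Entries with zero capacity are reported as 0.0.
    """
    rows = []
    for entry in stats:
        capacity = entry["pvcCapacity"]
        percent = 100.0 * entry["pvcUsed"] / capacity if capacity else 0.0
        rows.append((entry["pvcName"], entry["pvcType"], round(percent, 3)))
    # Sort so that the fullest volume is listed first.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

One way to use it is to save the output of kubectl get mgmt m1 -o json to a file, load it with json.load, and pass the value of ["status"]["postgresDataStats"] to the function.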

Warning condition is populated at 50% usage

When one of the Postgres components reaches 50% usage, the operator sets the overall status of the ManagementCluster to Warning, with an appropriate warning message. The operator also updates the postgresDataStats section with the current data usage of the Postgres components.

Example of warning state:

kubectl get mgmt m1

NAME   READY   STATUS    VERSION      RECONCILED VERSION      AGE
m1     17/17   Warning   10.0.1-eus   10.0.1.4-ifix1-44-eus   26h

kubectl get mgmt m1 -o json | jq .status.phase

"Warning"

Example of list of warning conditions:

kubectl get mgmt m1 -o json | jq .status.conditions

[
  {
    "lastTransitionTime": "2021-09-08T19:49:51Z",
    "message": "Current WAL disk usage of m1-ed60c42d-postgres is 51 percent. DATABASE SHUTDOWN starts at 80 percent utilization. Please contact IBM Support immediately",
    "reason": "disk_usage_more_than_50_percent",
    "status": "True",
    "type": "Warning"
  },
  {
    "lastTransitionTime": "2021-09-08T19:42:30Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2021-09-08T19:41:35Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Pending"
  },
  {
    "lastTransitionTime": "2021-09-07T17:18:02Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Error"
  },
  {
    "lastTransitionTime": "2021-09-07T17:18:02Z",
    "message": "",
    "reason": "na",
    "status": "False",
    "type": "Failed"
  }
]

The message in the warning condition explains why the condition was set. For example:

Current WAL disk usage of m1-ed60c42d-postgres is 51 percent. DATABASE SHUTDOWN starts at 80 percent utilization. Please contact IBM Support immediately

If you encounter the warning, contact IBM Support to fix the root cause.
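To pick out only the conditions that are currently set, a script can filter the .status.conditions list for entries whose status is "True". The following Python sketch assumes the JSON shape shown in the example above; the function name and its default set of condition types are hypothetical.

```python
def active_conditions(conditions, types=("Warning", "Error")):
    """Return the conditions that are currently set.

    A condition is treated as set when its status is the string "True"
    and its type is one of the requested types.
    """
    return [
        cond
        for cond in conditions
        if cond.get("status") == "True" and cond.get("type") in types
    ]
```

Applied to the example list above, the function returns only the Warning entry, whose message and reason fields can then be included in a report to IBM Support.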

Error condition is populated at 80% usage

An error condition is populated when the operator brings down Postgres because it is using 80% of the available disk space. Postgres is brought down to avoid problems, such as disk corruption, that can occur if the disk becomes completely full.

 - lastTransitionTime: "2021-08-28T01:42:56Z"
   message: Current WAL disk usage of nihar-stac-3955b42d-site1-postgres is more
     than 70 percent, initiating DATABASE SHUTDOWN
   reason: disk_usage_more_than_70_percent
   status: "True"
   type: Error

If you encounter the error, contact IBM Support to fix the root cause.

Important:

A local-storage volume uses the entire disk that is allocated to the worker node. In some cases, Kubernetes itself can encounter disk pressure and start evicting pods before the operator reacts to 80% usage. In these cases, you might need to increase the disk allocated to the worker node in order to stabilize it, by following the steps in Recovering when disks are filled by the management database.
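Kubernetes reports disk pressure through a node condition of type DiskPressure, so one way to check for this situation is to inspect the output of kubectl get nodes -o json. The following Python sketch assumes that output has been parsed into a dictionary; the function name is hypothetical, and the check is an illustration rather than part of the documented recovery procedure.

```python
def nodes_under_disk_pressure(node_list):
    """Return the names of nodes whose DiskPressure condition is "True".

    node_list is the parsed JSON from `kubectl get nodes -o json`,
    a List object whose "items" are the individual nodes.
    """
    names = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                names.append(node["metadata"]["name"])
    return names
```

An empty result means that no node currently reports disk pressure. It does not by itself show that the Postgres volumes have free space, so the postgresDataStats values are still worth checking.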