Monitoring Postgres disk usage on VMware

Monitor disk usage by the Postgres database in the management subsystem, when deployed on VMware.

The Postgres database is the core database of the management subsystem, so it is important to monitor its disk usage.

In v10.0.3.0 and later, the APIConnect operator tracks the current disk usage of the Postgres components.

The operator regularly updates the ManagementCluster instance status, which is reflected in the output of the apic health-check command.

The apic health-check command reports a WARNING state when disk utilization reaches 50%.

The apic health-check command reports an ERROR state when disk utilization reaches 70%, at which point Postgres is brought down.

Management Cluster disk stats

The ManagementCluster instance has a .status.postgresDataStats field, where the operator records the current disk usage of the Postgres components. Use kubectl to view the status.

To use kubectl, first ssh into the appliance as root.

kubectl get mgmt m1 -o json | jq .status.postgresDataStats
[
  {
    "instanceName": "m1-ed60c42d-postgres",
    "podName": "m1-ed60c42d-postgres-86766f69cb-xs7t5",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres",
    "pvcType": "PostgreSQL",
    "pvcUsed": 60895232,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-backrest-shared-repo",
    "podName": "m1-ed60c42d-postgres-backrest-shared-repo-859484b5f6-r7cv6",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-pgbr-repo",
    "pvcType": "pgBackRest",
    "pvcUsed": 16203776,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres",
    "podName": "m1-ed60c42d-postgres-86766f69cb-xs7t5",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-wal",
    "pvcType": "WAL",
    "pvcUsed": 201338880,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-fqbg",
    "podName": "m1-ed60c42d-postgres-fqbg-84869d976c-49qgc",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-fqbg",
    "pvcType": "PostgreSQL",
    "pvcUsed": 59994112,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-fqbg",
    "podName": "m1-ed60c42d-postgres-fqbg-84869d976c-49qgc",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-fqbg-wal",
    "pvcType": "WAL",
    "pvcUsed": 201334784,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-cimo",
    "podName": "m1-ed60c42d-postgres-cimo-5769868b75-5z7ln",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-cimo",
    "pvcType": "PostgreSQL",
    "pvcUsed": 59994112,
    "pvcUsedPercentage": 0
  },
  {
    "instanceName": "m1-ed60c42d-postgres-cimo",
    "podName": "m1-ed60c42d-postgres-cimo-5769868b75-5z7ln",
    "pvcCapacity": 250181844992,
    "pvcName": "m1-ed60c42d-postgres-cimo-wal",
    "pvcType": "WAL",
    "pvcUsed": 201334784,
    "pvcUsedPercentage": 0
  }
]
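
In the sample output, pvcUsedPercentage appears to be reported as a whole number, so low usage shows as 0. The following is a minimal Python sketch, assuming the JSON shape shown above, that recomputes the percentage from pvcUsed and pvcCapacity and compares it with the 50% and 70% thresholds described in this topic. The function names and the OK label are illustrative only and are not part of API Connect.

```python
import json
import subprocess

WARNING_PERCENT = 50.0  # usage at which this topic says WARNING is reported
ERROR_PERCENT = 70.0    # usage at which this topic says ERROR is reported


def classify(used_bytes, capacity_bytes):
    """Return (percent, state) for one PVC, using the thresholds above."""
    percent = 100.0 * used_bytes / capacity_bytes if capacity_bytes else 0.0
    if percent >= ERROR_PERCENT:
        state = "ERROR"
    elif percent >= WARNING_PERCENT:
        state = "WARNING"
    else:
        state = "OK"  # illustrative label, not an API Connect state
    return percent, state


def summarize(stats):
    """Build one (pvcName, pvcType, percent, state) row per stats entry."""
    rows = []
    for entry in stats:
        percent, state = classify(entry["pvcUsed"], entry["pvcCapacity"])
        rows.append((entry["pvcName"], entry["pvcType"], round(percent, 3), state))
    return rows


def fetch_stats(instance="m1"):
    """Run the kubectl command from this topic and return the stats list.

    Assumes kubectl is on the PATH and the instance name matches your deployment.
    """
    out = subprocess.run(
        ["kubectl", "get", "mgmt", instance, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out).get("status", {}).get("postgresDataStats", [])
```

On the appliance, calling summarize(fetch_stats("m1")) returns one row per PVC. Sorting the rows by percentage is a quick way to see which Postgres, WAL, or pgBackRest volume is closest to a threshold.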

Warning condition is populated at 50% usage

When one of the Postgres components reaches 50% disk usage, the operator sets the overall status of the ManagementCluster to Warning and adds a warning message, which is included in the output of the apic health-check command.

The warning message explains why the condition was set.

To use apic, first ssh into the appliance as root.

apic health-check 

FATA[0007] Cluster not in good health:
ManagementCluster is not Ready or Complete | State: 17/17 Phase: 
  Warning Message: Current WAL disk usage of nihar-stac-3955b42d-site1-postgres is 51 percent. 
  DATABASE SHUTDOWN starts at 70 percent utilization. Please contact IBM Support immediately

If you encounter this warning, contact IBM Support to fix the root cause.

Error condition is populated at 70% usage

An error condition is populated when the operator brings down Postgres because one of its components has reached 70% disk usage. Postgres is brought down to avoid problems, such as disk corruption, that can occur if the disk fills completely.

To use apic, first ssh into the appliance as root.

apic health-check 

FATA[0011] Cluster not in good health:
ManagementCluster is not Ready or Complete | State: 17/17 Phase: 
  Error Message: Current WAL disk usage of nihar-stac-3955b42d-site1-postgres is more than 70 percent, 
  initiating DATABASE SHUTDOWN

If you encounter this error, contact IBM Support to fix the root cause.
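
For automated monitoring, the warning and error messages can be pulled out of the health-check output. The following is a minimal Python sketch that assumes the output looks like the two samples in this topic, with a "Warning Message:" or "Error Message:" line followed by indented continuation lines. The patterns and the function name are assumptions based only on those samples, and the real output format may differ between releases.

```python
import re

# Patterns inferred from the sample output in this topic (an assumption).
MESSAGE_RE = re.compile(r"^\s*(Warning|Error) Message:\s*(.*)$")
PERCENT_RE = re.compile(r"(\d+) percent")


def parse_health_check(output):
    """Return (level, message, percent), or None if no message line is found.

    level is "WARNING" or "ERROR"; percent is the first "<n> percent"
    value in the message, or None if the message has no such value.
    """
    lines = output.splitlines()
    for i, line in enumerate(lines):
        match = MESSAGE_RE.match(line)
        if not match:
            continue
        level = match.group(1).upper()
        parts = [match.group(2).strip()]
        # In the samples the message wraps onto indented follow-on lines.
        for cont in lines[i + 1:]:
            if not cont.startswith(" ") or not cont.strip():
                break
            parts.append(cont.strip())
        message = " ".join(parts)
        found = PERCENT_RE.search(message)
        return level, message, int(found.group(1)) if found else None
    return None
```

A result of None means that no warning or error message was found in the text. It does not by itself confirm that the cluster is healthy.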

Important:

On VMware, because the entire disk is allocated to the worker node, in some cases Kubernetes itself can face disk pressure and start evicting pods before the operator reacts to 70% usage. In these cases, increase the disk allocated to the worker node in order to stabilize it. See Adding disk space to a VMware appliance.
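
To check whether a worker node is already reporting disk pressure, you can inspect the standard Kubernetes DiskPressure node condition. The following is a minimal Python sketch, assuming kubectl is available on the appliance as described earlier in this topic. The function names are illustrative.

```python
import json
import subprocess


def nodes_under_disk_pressure(node_list):
    """Return the names of nodes whose DiskPressure condition is "True".

    node_list is the parsed output of `kubectl get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                names.append(node["metadata"]["name"])
    return names


def fetch_nodes():
    """Fetch the node list as parsed JSON. Assumes kubectl is on the PATH."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling nodes_under_disk_pressure(fetch_nodes()) returns an empty list when no node reports disk pressure.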