PostgreSQL self-monitoring and health alerts

Overview

PostgreSQL self-monitoring provides comprehensive health monitoring for your PostgreSQL databases through automated alert rules. These alerts help you proactively identify and resolve issues that are related to connection management, replication health, autovacuum operations, and query performance.

Enabling self-monitoring

Self‑monitoring is not enabled by default in Cloud Pak for AIOps. Enable PostgreSQL self‑monitoring to proactively identify and resolve PostgreSQL health issues. To enable the feature, update the installation custom resource.

Prerequisites

PostgreSQL self‑monitoring requires the following OpenShift monitoring components:
  1. Cluster monitoring – To enable cluster monitoring, see Configuring and enabling OpenShift Container Platform monitoring.
  2. User-defined workload monitoring – To enable user-defined workload monitoring, see Enabling and configuring user-workload monitoring.
Note: OpenShift monitoring must be enabled before you manually enable PostgreSQL self‑monitoring.
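
User-defined workload monitoring is controlled by the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. As a minimal sketch (see the Red Hat documentation linked above for the full procedure), enabling it looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true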

Manually enabling or disabling self-monitoring

You must enable the feature manually by updating the Installation custom resource. You can disable it again later if required.

To enable self-monitoring, set spec.monitoring.enabled to true in the Installation custom resource:
apiVersion: orchestrator.aiops.ibm.com/v1alpha1
kind: Installation
metadata:
  name: ibm-cp-aiops
  namespace: cp4aiops
spec:
  monitoring:
    enabled: true  # set to false to disable self-monitoring

Enable self-monitoring from the command line:

# Switch to the namespace where Cloud Pak for AIOps is installed
oc project <AIOps namespace>

# Look up the Installation custom resource, then set spec.monitoring.enabled to true
INSTALLATION_NAME="$(oc get installations.orchestrator.aiops.ibm.com -o jsonpath='{.items[0].metadata.name}')"
oc patch installations.orchestrator.aiops.ibm.com "${INSTALLATION_NAME}" --type=json -p='[{"op":"replace","path":"/spec/monitoring","value":{"enabled":true}}]'
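
To confirm that the setting took effect, you can read it back from the custom resource; the command prints true when self-monitoring is enabled:

oc get installations.orchestrator.aiops.ibm.com "${INSTALLATION_NAME}" -o jsonpath='{.spec.monitoring.enabled}'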

Disable self-monitoring from the command line:

# Switch to the namespace where Cloud Pak for AIOps is installed
oc project <AIOps namespace>

# Look up the Installation custom resource, then set spec.monitoring.enabled to false
INSTALLATION_NAME="$(oc get installations.orchestrator.aiops.ibm.com -o jsonpath='{.items[0].metadata.name}')"
oc patch installations.orchestrator.aiops.ibm.com "${INSTALLATION_NAME}" --type=json -p='[{"op":"replace","path":"/spec/monitoring","value":{"enabled":false}}]'

Understanding PostgreSQL alerts

Connection management alerts

These alerts monitor connection usage and help you identify connection pool issues.

AIOpsPostgresConnectionsCritical

  • What it means: Active connections are critically high and near the maximum limit.
  • Severity: Critical
  • When you see it: Active connections exceed 90% of the maximum for 2 minutes.
  • What might be wrong:
    • Misconfigured application connection pool
    • Connection leaks
    • Excessive concurrent requests
    • Long-running queries holding connections
    • Maximum connections setting too low
  • Potential impact:
    • New connections are rejected
    • Application failures
    • Service disruption
    • Users unable to access the system
    • Database becomes unresponsive
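
As a quick diagnostic for this alert (and the warning variant that follows), you can compare current connections against the configured maximum. The pod name is a placeholder, and the psql invocation assumes you can connect as the postgres user inside the pod; adjust for your environment:

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT count(*) AS connections,
          current_setting('max_connections')::int AS max_connections,
          round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
   FROM pg_stat_activity;"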

AIOpsPostgresConnectionsWarning

  • What it means: Active connections are trending high and may reach critical levels.
  • Severity: Warning
  • When you see it: Active connections exceed 80% for 5 minutes.
  • What might be wrong:
    • Increasing load
    • Connection pool requires tuning
    • Slow queries
    • Application failing to release connections
    • Gradual connection leaks
  • Potential impact:
    • Increased risk of reaching connection limits
    • Performance degradation
    • Connection exhaustion
    • Early warning for corrective action
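
To find where the connection growth is coming from, a breakdown by state and application name often points at the leaking client (same placeholder conventions as the earlier example):

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT state, application_name, count(*) AS connections
   FROM pg_stat_activity
   GROUP BY state, application_name
   ORDER BY connections DESC;"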

AIOpsPostgresIdleInTransactionConnections

  • What it means: Many connections are sitting idle inside open transactions.
  • Severity: Warning
  • When you see it: Idle-in-transaction connections exceed 50% of the maximum for 10 minutes.
  • What might be wrong:
    • Application failing to commit or roll back
    • Connection leaks
    • Long-running open transactions
    • Application errors preventing completion
    • Poor transaction management
  • Potential impact:
    • Connections wasted
    • Locks block other queries
    • Autovacuum cannot clean dead tuples
    • Table bloat increases
    • Performance degradation
    • Connection exhaustion
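
To identify the offending sessions, list the idle-in-transaction connections together with how long each transaction has been open (placeholders as before):

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT pid, usename, application_name, now() - xact_start AS xact_age, query
   FROM pg_stat_activity
   WHERE state = 'idle in transaction'
   ORDER BY xact_age DESC;"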

Replication health alerts

These alerts monitor replication between primary and replica PostgreSQL instances.

AIOpsPostgresReplicationLagCritical

  • What it means: Replica lags significantly behind the primary.
  • Severity: Critical
  • When you see it: Lag exceeds 300 seconds for 5 minutes.
  • What might be wrong:
    • Limited network bandwidth
    • High write load
    • Slow replica disk I/O
    • Replica queries blocking replication
    • Resource constraints
    • Unstable network
  • Potential impact:
    • High data loss risk during failover
    • Replica serves stale data
    • Reduced disaster recovery capability
    • Failover results in data loss
    • Read replicas provide outdated data
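
On the primary, pg_stat_replication reports per-replica lag; the write_lag, flush_lag, and replay_lag columns are available in PostgreSQL 10 and later (pod name is a placeholder):

oc exec -n <AIOps namespace> <postgres-primary-pod> -- psql -U postgres -c \
  "SELECT application_name, state, write_lag, flush_lag, replay_lag
   FROM pg_stat_replication;"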

AIOpsPostgresReplicationLagWarning

  • What it means: Replica begins falling behind the primary.
  • Severity: Warning
  • When you see it: Lag exceeds 30 seconds for 10 minutes.
  • What might be wrong:
    • Increased primary write load
    • Network latency
    • Slow replica performance
    • Disk bottlenecks
    • Temporary network instability
  • Potential impact:
    • Mild data consistency issues
    • Increased failover risk
    • Stale read replica results
    • Early replication issue warning
    • May escalate to critical
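
You can also measure lag from the replica side as the time since the last replayed transaction. Note that this value also grows while the primary is idle, so interpret it together with the primary-side view:

oc exec -n <AIOps namespace> <postgres-replica-pod> -- psql -U postgres -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"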

Autovacuum and table maintenance alerts

These alerts track PostgreSQL autovacuum behavior, which protects long-term performance.

AIOpsPostgresAutovacuumNotRunning

  • What it means: Autovacuum is not running even though cleanup is required.
  • Severity: Warning
  • When you see it: Dead tuples exceed the threshold and autovacuum has not run for 1 hour, with the condition persisting for 30 minutes.
  • What might be wrong:
    • Autovacuum disabled
    • All workers busy
    • Long-running transactions blocking autovacuum
    • Autovacuum crashed
    • Misconfiguration
  • Potential impact:
    • Growing table bloat
    • Slower queries
    • Increasing disk usage
    • Transaction ID wraparound risk
    • Significant performance degradation
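
To see when autovacuum last ran on the tables with the most dead tuples (placeholders as before):

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT relname, n_dead_tup, last_autovacuum, last_vacuum
   FROM pg_stat_user_tables
   ORDER BY n_dead_tup DESC
   LIMIT 10;"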

AIOpsPostgresDeadTuplesWarning

  • What it means: Tables accumulate many dead tuples.
  • Severity: Warning
  • When you see it: Table exceeds 10,000 dead tuples and dead tuples exceed 10% of live tuples for 15 minutes.
  • What might be wrong:
    • Heavy DELETE/UPDATE activity
    • Autovacuum not keeping pace
    • Conservative autovacuum settings
    • Long-running transactions
    • Too few autovacuum workers
  • Potential impact:
    • Table bloat
    • Slower scans
    • Poor index performance
    • Increased disk usage
    • Cache inefficiency
    • Noticeable query slowdown
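
To find the tables behind this alert, you can compute the dead-to-live ratio directly; the thresholds below mirror the alert condition:

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT relname, n_live_tup, n_dead_tup,
          round(100.0 * n_dead_tup / GREATEST(n_live_tup, 1), 1) AS dead_pct
   FROM pg_stat_user_tables
   WHERE n_dead_tup > 10000
   ORDER BY dead_pct DESC;"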

AIOpsPostgresDeadTuplesCritical

  • What it means: The dead tuple count is dangerously high.
  • Severity: Critical
  • When you see it: Dead tuples exceed 20,000 and 20% of live tuples for 10 minutes.
  • What might be wrong:
    • Autovacuum severely behind
    • Frequent updates/deletes
    • Long-running blocking transactions
    • Misconfigured autovacuum
    • System resource issues
  • Potential impact:
    • Severe bloat
    • Major performance degradation
    • High disk usage
    • Slow queries
    • Manual cleanup required
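
If autovacuum cannot catch up, a manual vacuum of the affected table reclaims dead tuples; the table name is a placeholder. Plain VACUUM runs without an exclusive lock, whereas VACUUM FULL rewrites the table and blocks all access, so prefer the plain form on a live system:

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "VACUUM (VERBOSE, ANALYZE) <table_name>;"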

AIOpsPostgresTableNotAnalyzed

  • What it means: Table statistics are outdated and inaccurate.
  • Severity: Warning
  • When you see it: The table has not been analyzed in 24 hours, has more than 1,000 modifications, and modifications exceed 10%, with the condition persisting for 1 hour.
  • What might be wrong:
    • Autoanalyze disabled
    • Thresholds not triggered
    • Long-running transactions blocking analyze
    • Insufficient workers
    • Misconfiguration
  • Potential impact:
    • Outdated statistics harm query planning
    • Suboptimal execution plans
    • Slower query performance
    • Unpredictable behavior
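
To check which tables have stale statistics, compare the modification counters with the last analyze times; running ANALYZE on a table refreshes its statistics immediately:

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT relname, n_mod_since_analyze, last_analyze, last_autoanalyze
   FROM pg_stat_user_tables
   ORDER BY n_mod_since_analyze DESC
   LIMIT 10;"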

Performance monitoring alerts

These alerts monitor query performance, cache effectiveness, and resource usage.

AIOpsPostgresBlockedQueries

  • What it means: Many queries wait for locks.
  • Severity: Warning
  • When you see it: More than 5 queries are blocked for 5 minutes.
  • What might be wrong:
    • Long-running transactions
    • Lock contention
    • Deadlocks
  • Potential impact:
    • Query timeouts
    • Application latency
    • Deadlock risks
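
To see which sessions are blocked and what is blocking them, pg_blocking_pids (PostgreSQL 9.6 and later) maps each waiting backend to the sessions that hold the locks it needs:

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, state, query
   FROM pg_stat_activity
   WHERE cardinality(pg_blocking_pids(pid)) > 0;"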

AIOpsPostgresCacheHitRatioCritical

  • What it means: Database reads rely heavily on disk instead of cache.
  • Severity: Critical
  • When you see it: Cache hit ratio falls below 70% for 15 minutes.
  • What might be wrong:
    • Insufficient shared_buffers
    • Working set larger than memory
    • Memory pressure
    • System resource limits
  • Potential impact:
    • Severe performance degradation
    • Excessive disk I/O
    • Slow queries
    • Application performance issues
  • Understanding cache hit ratio:
    • Above 95% = Excellent
    • 85–95% = Good
    • 70–85% = Warning
    • Below 70% = Critical
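
You can compute the current cache hit ratio per database from pg_stat_database and compare it against the bands above (this also covers the warning variant that follows):

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT datname,
          round(100.0 * blks_hit / GREATEST(blks_hit + blks_read, 1), 1) AS cache_hit_pct
   FROM pg_stat_database
   WHERE blks_read + blks_hit > 0;"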

AIOpsPostgresCacheHitRatioLow

  • What it means: Cache efficiency is below optimal levels.
  • Severity: Warning
  • When you see it: Cache hit ratio falls below 85% for 20 minutes.
  • What might be wrong:
    • Low shared_buffers value
    • Large working set
    • Cache warming after restart
  • Potential impact:
    • Suboptimal performance
    • Higher disk I/O
    • Slower queries

AIOpsPostgresSequentialScansHigh

  • What it means: The database performs many full table scans.
  • Severity: Info
  • When you see it: Sequential scans exceed 100 per second on large tables for 10 minutes.
  • What might be wrong:
    • Missing indexes
    • Unused indexes
    • Poor query patterns
    • Low index selectivity
  • Potential impact:
    • Slow queries
    • High CPU usage
    • Inefficient resource utilization
    • Scalability issues
    • Poor performance on large tables
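
To identify which tables are scanned sequentially, and whether their indexes are used at all, pg_stat_user_tables tracks both counters:

oc exec -n <AIOps namespace> <postgres-pod> -- psql -U postgres -c \
  "SELECT relname, seq_scan, seq_tup_read, idx_scan
   FROM pg_stat_user_tables
   ORDER BY seq_tup_read DESC
   LIMIT 10;"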