Monitoring and alerting in Cloud Pak for Data

You can use the IBM Cloud Pak for Data monitoring and alerting framework to monitor the state of the platform. You can set up events that trigger alerts when action is needed, based on thresholds that you define.

By default, Cloud Pak for Data is initialized with one monitor, the diagnostic monitor, which runs every 10 minutes. The diagnostic monitor records the status of deployments, StatefulSets, and persistent volume claims. It also tracks your system usage of virtual processors (vCPUs) and memory. The collected data can be used for analysis and, in a production environment, to alert customers according to the alert rules that are set.

Alerting framework

Glossary

To get started, you should understand the following terms:

event
An event is the report of the state of an entity such as a pod, persistent volume claim (PVC), or other resource.
severity
The severity of the event indicates the criticality of the event. The severity of an event can be: critical, warning, or info. Each type of event includes metadata for the event, including a description and steps to resolve the event.
critical
Monitored resources are unstable. Alerting is essential if this state persists.
warning
Monitored resources have reached a warning threshold. Immediate alerting might not be required.
info
Monitored resources behave as expected. Informational messages only.
alert
An alert is an event that indicates an issue or potential issue. Alerts can be sent by using either traps (SNMP) or email (SMTP). Each alert type can be associated with different alert rules. For example, an alert type might alert immediately or wait for an event to occur a specified number of times before the alert forwarder sends an alert.
quota
A quota for a resource, such as vCPUs and memory, is a target that determines the severity of an alert. If resource usage exceeds a quota, the event is considered critical. If resource usage exceeds the percentage of the quota that is defined by the alert threshold, the event is considered a warning.
monitor
A monitor is a script that periodically checks the state of an entity and generates events. A single monitor can register events for different purposes. For example, the platform monitor that comes with Cloud Pak for Data generates events that report the status of persistent volume claims, StatefulSets, and deployments.
watchdog alert manager
A watchdog alert manager (WAM) oversees all monitors to ensure that they run on schedule. The WAM also exposes an API that listens for the events that the monitors generate. These events are persisted in Metastore and are used to generate alerts when the alerting rules are met. The persisted events can also be used to study historic patterns. For more information, see Alerting APIs.
alert profile
An alert profile defines the setup for alerting. The default profile enables SMTP and SNMP.
alert forwarder
An alert forwarder is the service that is responsible for sending the alerts and traps. After the watchdog alert manager identifies a possible alert, it invokes the alert forwarder to forward the alert to the customer environment.
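
The quota and alert threshold definitions above can be illustrated with a minimal Python sketch. The function name `classify_usage` and its parameters are hypothetical and are not part of the Cloud Pak for Data API. The sketch assumes that usage and quota are expressed in the same unit (for example, vCPUs) and that the alert threshold is a percentage of the quota.

```python
def classify_usage(usage: float, quota: float, alert_threshold_pct: float) -> str:
    """Map a resource usage reading to a severity.

    Hypothetical helper for illustration only:
    - usage above the quota                      -> "critical"
    - usage above alert_threshold_pct of quota   -> "warning"
    - anything else                              -> "info"
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    if not 0 <= alert_threshold_pct <= 100:
        raise ValueError("alert_threshold_pct must be between 0 and 100")

    if usage > quota:
        return "critical"
    if usage > quota * alert_threshold_pct / 100:
        return "warning"
    return "info"


if __name__ == "__main__":
    # Example: a quota of 8 vCPUs with an alert threshold of 80 percent.
    for used in (4, 7, 9):
        print(used, classify_usage(used, quota=8, alert_threshold_pct=80))
```

With a quota of 8 vCPUs and an 80 percent threshold, the warning range in this sketch starts just above 6.4 vCPUs and the critical range starts just above 8 vCPUs.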

How does it work?

Alerting framework flow diagram
  1. The watchdog alert manager cron job iterates through the extensions of the zen_alert_monitor extension type and creates a cron job for each monitor from the metadata that is provided. It uses product metrics as input and updates policies in Metastore.
  2. The monitor cron jobs report events by using the v1/monitoring/events API.
  3. The event is stored in the Metastore database. For example:
    Monitor_type  Event_type          Reference              Alerted_time         Metadata                       Severity  History
    diagnostics   check-pvc-status    zen-metastore-edb-1    NOT_ALERTED          {Metadata about the resource}  info      {"time": "critical/warning/info"}
    diagnostics   check-quota-status  IBM Knowledge Catalog  08-23-2020:05:03:00  {Metadata about the resource}  critical  {"time": "critical/warning/info"}

    If another event is reported for the same monitor type, event type, and reference, the existing record is updated with the latest metadata, and the severity and reported time of the event are recorded in the History column.

  4. The platform monitor runs every 10 minutes, checking the status of PVCs and pods.
  5. The alerting cron job runs every 10 minutes and checks for possible alerts, using quotas and thresholds to determine the severity of each alert. The watchdog monitoring cron job goes through all the events in the Metastore database and looks for events with critical or warning severity. The rules that are set for the alert_type and the corresponding severity define how many times an event must occur to satisfy an alert condition. Alerts are either sent or postponed until that condition is satisfied.
  6. The administrator can change the quotas, the thresholds, and the amount of flexibility that is allowed when quotas are reached. These changes are fed back into the alerting framework immediately. For more information, see Monitoring the platform.
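
Steps 3 and 5 can be sketched in Python as an in-memory model. This is an illustration under stated assumptions, not the actual Metastore schema or the watchdog implementation: the names `EventRecord`, `EventStore`, and `should_alert` are hypothetical, the rules are modeled as a plain mapping from severity to a required count, and the count is assumed to mean consecutive reports of the same severity (the text above does not say whether occurrences must be consecutive).

```python
from dataclasses import dataclass, field

NOT_ALERTED = "NOT_ALERTED"


@dataclass
class EventRecord:
    """One row per (monitor_type, event_type, reference); hypothetical model."""
    monitor_type: str
    event_type: str
    reference: str
    severity: str
    metadata: dict
    alerted_time: str = NOT_ALERTED
    history: dict = field(default_factory=dict)  # reported time -> severity


class EventStore:
    """In-memory stand-in for the events table (assumption, not Metastore)."""

    def __init__(self):
        self._records = {}

    def report(self, monitor_type, event_type, reference,
               severity, metadata, reported_time):
        key = (monitor_type, event_type, reference)
        record = self._records.get(key)
        if record is None:
            record = EventRecord(monitor_type, event_type, reference,
                                 severity, metadata)
            self._records[key] = record
        else:
            # Same key reported again: keep one row, refresh its contents.
            record.severity = severity
            record.metadata = metadata
        record.history[reported_time] = severity
        return record

    def records(self):
        return list(self._records.values())


def should_alert(record, required_counts):
    """Return True when the latest severity has been reported often enough.

    required_counts maps a severity to the number of consecutive reports
    needed before an alert is sent (assumed rule format).
    """
    required = required_counts.get(record.severity)
    if required is None:
        return False  # for example, "info" events never alert
    streak = 0
    for severity in reversed(list(record.history.values())):
        if severity != record.severity:
            break
        streak += 1
    return streak >= required


if __name__ == "__main__":
    store = EventStore()
    rules = {"critical": 1, "warning": 3}  # hypothetical rule values
    for reported_time in ("05:00", "05:10", "05:20"):
        rec = store.report("diagnostics", "check-pvc-status",
                           "zen-metastore-edb-1", "warning",
                           {"note": "example"}, reported_time)
        print(reported_time, should_alert(rec, rules))
```

In this sketch, a critical event qualifies for an alert on the first report, while a warning is postponed until it has been reported three times in a row. A real deployment would read these counts from the configured alert rules rather than from a hard-coded mapping.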

Learn more