Monitoring and alerting in Cloud Pak for Data
You can use the IBM Cloud Pak for Data monitoring and alerting framework to monitor the state of the platform. You can set up events to alert when action is needed, based on thresholds that you define.
By default, Cloud Pak for Data is initialized with one monitor that runs every 10 minutes. The diagnostic monitor records the status of deployments, StatefulSets, and persistent volume claims. It also tracks your system usage of virtual processors (vCPUs) and memory. The collected data can be used for analysis and, in a production environment, to alert customers based on the alert rules that are set.

Glossary
To get started, you should understand the following terms:
- event
- An event is the report of the state of an entity such as a pod, persistent volume claim (PVC), or other resource.
- severity
- The severity of the event indicates the criticality of the event. The severity of an event can be: critical, warning, or info. Each type of event includes metadata for the event, including a description and steps to resolve the event.
- critical
- Monitored resources are unstable. Alerting is essential if this state persists.
- warning
- Monitored resources have reached a warning threshold. Immediate alerting might not be required.
- info
- Monitored resources behave as expected. Informational messages only.
- alert
- An alert is an event that indicates an issue or potential issue. Alerts can be sent by using either traps (SNMP) or email (SMTP). Each alert type can be associated with different alert rules. For example, an alert type might alert immediately or wait for an event to occur a specified number of times before the alert forwarder sends an alert.
- quota
- A quota for a resource, such as vCPUs and memory, is a target that determines the severity of an alert. If resource usage exceeds a quota, the event is considered critical. If resource usage exceeds the percentage of the quota that is defined by the alert threshold, the event is considered a warning.
- monitor
- A monitor is a script whose purpose is to check the state of an entity periodically and generate events. A single monitor can register events for different purposes. For example, the platform monitor that comes with Cloud Pak for Data generates events to check the status of persistent volume claims, StatefulSets, and deployments.
- watchdog alert manager
- A watchdog alert manager (WAM) monitors all monitors to ensure that they run on schedule. The WAM also exposes an API that listens to events generated by the monitors. These events persist in Metastore for generating alerts when the alerting rules are met. The persisted events can also be used to study historic patterns. For more information, see Alerting APIs.
- alert profile
- An alert profile defines the setup for alerting. The default profile enables SMTP and SNMP.
- alert forwarder
- An alert forwarder is the service that is responsible for sending the alerts and traps. After the watchdog alert manager identifies a possible alert, it invokes the alert forwarder to forward the alert to the customer environment.
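
The quota definition in the glossary implies a simple rule for deriving severity from usage. The following is a minimal sketch of that rule, not product code: the function name classify_usage and the parameter alert_threshold_pct are hypothetical, and the sketch assumes that the alert threshold is expressed as a percentage of the quota.

```python
def classify_usage(usage: float, quota: float, alert_threshold_pct: float) -> str:
    """Return the severity for one resource reading (for example, vCPUs or memory).

    usage and quota share the same unit. alert_threshold_pct is the
    percentage of the quota at which a warning starts (assumed shape).
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    if not 0 <= alert_threshold_pct <= 100:
        raise ValueError("alert_threshold_pct must be between 0 and 100")

    # Usage above the quota itself is critical.
    if usage > quota:
        return "critical"
    # Usage above the threshold portion of the quota is a warning.
    if usage > quota * alert_threshold_pct / 100:
        return "warning"
    # Anything else is informational.
    return "info"
```

For example, with a quota of 8 vCPUs and an alert threshold of 80 percent, a reading of 7 vCPUs exceeds 6.4 and is classified as a warning, while a reading of 9 vCPUs exceeds the quota and is classified as critical.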
How does it work?

- The watchdog alert manager cron job iterates through the extensions of the zen_alert_monitor extension type and creates cron jobs for the monitors with the metadata provided. It uses product metrics as input and updates policies in Metastore.
- The cron job monitors report events by using the v1/monitoring/events API.
- The event is stored in the Metastore database. For example:

  | Monitor_type | Event_type | Reference | Alerted_time | Metadata | Severity | History |
  |---|---|---|---|---|---|---|
  | diagnostics | check-pvc-status | zen-metastore-edb-1 | NOT_ALERTED | {Metadata about the resource} | info | { "time":"critical/warning/info", } |
  | diagnostics | check-quota-status | IBM Knowledge Catalog | 08-23-2020:05:03:00 | {Metadata about the resource} | critical | { "time":"critical/warning/info", } |

  If the same monitor, event_type, and reference report another event, the record is updated with the latest metadata, and the event severity and reported time are recorded in the history column.
- The platform monitor runs every 10 minutes, checking the status of PVCs and pods.
- The alerting cron job runs every 10 minutes, checking for possible alerts using quotas and thresholds to determine the severity of the alert. The watchdog monitoring cron job goes through all the events in the Metastore database and checks for events with critical or warning severity. Depending on the count needed to satisfy an alert condition, which is defined by the rules set for the alert_type and corresponding severity, alerts are either sent or postponed until the conditions are satisfied.
- The administrator can change the quotas, the thresholds, and how much flexibility the system allows when quotas are reached. These changes are fed back into the alerting framework immediately. For more information, see Monitoring the platform.
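
The record update and the count-based alert check described in the steps above can be sketched in memory as follows. This is an illustration only, not the Metastore schema or the watchdog implementation: EventStore, report, consecutive_count, due_alerts, and the rules mapping (severity to required count) are hypothetical names. Counting consecutive reports of the current severity is an assumption about how the required count is measured, and the sketch does not model how Alerted_time is updated after an alert is sent.

```python
from dataclasses import dataclass, field

NOT_ALERTED = "NOT_ALERTED"


@dataclass
class EventRecord:
    monitor_type: str
    event_type: str
    reference: str
    severity: str
    metadata: dict
    alerted_time: str = NOT_ALERTED
    # (reported_time, severity) pairs, oldest first.
    history: list = field(default_factory=list)


class EventStore:
    """In-memory stand-in for the stored events (hypothetical)."""

    def __init__(self):
        self._records = {}

    def report(self, monitor_type, event_type, reference, severity, metadata, time):
        """Create the record, or update it if the same key reports again."""
        key = (monitor_type, event_type, reference)
        record = self._records.get(key)
        if record is None:
            record = EventRecord(monitor_type, event_type, reference, severity, metadata)
            self._records[key] = record
        else:
            # Same monitor, event type, and reference: keep one record
            # and overwrite it with the latest severity and metadata.
            record.severity = severity
            record.metadata = metadata
        record.history.append((time, severity))
        return record

    def records(self):
        return list(self._records.values())


def consecutive_count(record):
    """Count the most recent reports that share the record's current severity."""
    count = 0
    for _, severity in reversed(record.history):
        if severity != record.severity:
            break
        count += 1
    return count


def due_alerts(store, rules):
    """Return records whose critical or warning severity met its required count.

    rules maps a severity to the number of occurrences needed (assumed shape).
    Severities without a rule alert on the first occurrence.
    """
    due = []
    for record in store.records():
        if record.severity not in ("critical", "warning"):
            continue
        if consecutive_count(record) >= rules.get(record.severity, 1):
            due.append(record)
    return due
```

With rules such as {"critical": 1, "warning": 2}, a single warning report for a PVC is postponed, and a second consecutive warning for the same monitor, event type, and reference makes the record due for an alert.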