Monitoring and alerting in Cloud Pak for Data

IBM® Cloud Pak for Data provides a monitoring and alerting framework that you can use to monitor the state of the platform and to generate alerts when action is needed, based on thresholds that you define.

Glossary

To get started, you should understand the following terms:

  • An event is a report of the state of an entity such as a pod, persistent volume claim (PVC), or other resource. The following event types are delivered with Cloud Pak for Data.
    • check-pvc-status - Records a critical event if a PVC is unbound.
    • check-replica-status - Records a critical event if a StatefulSet or deployment has unavailable replicas.
    • check-resource-status - Records a warning event if a service has reached its alert threshold and a critical event if it has exceeded its quota. For more information about setting quotas and thresholds for services, see Managing the platform.
  • The severity of an event indicates how critical it is. The severity of an event can be: critical, warning, or info. Each event includes metadata such as a description and steps to resolve the event.
    • critical - Monitored resources are unstable. Alerting is essential if this state persists.
    • warning - Monitored resources have reached a warning threshold. Immediate alerting might not be required.
    • info - Monitored resources behave as expected. Informational messages only.
  • An alert is an event that indicates an issue or potential issue. Alerts can be sent by using either traps (SNMP) or email (SMTP). Each alert type can be associated with different alert rules. For example, an alert type might alert immediately or wait for an event to occur a specified number of times before the alert forwarder sends an alert.
  • A quota for a resource, such as vCPUs and memory, is a target that determines the severity of an alert. If resource usage exceeds the quota, the event is considered critical. If resource usage exceeds the percentage of the quota that is defined by the alert threshold, the event is considered a warning. For example, with a quota of 100 vCPUs and an alert threshold of 80%, usage of 85 vCPUs records a warning event and usage of 105 vCPUs records a critical event.
  • A monitor is a script that periodically checks the state of an entity and generates events. A single monitor can register events for different purposes. For example, the diagnostics monitor that comes with Cloud Pak for Data generates events that record the status of persistent volume claims, StatefulSets, and deployments.
  • A watchdog alert manager (WAM) ensures that all monitors run on schedule. The WAM also exposes an API that receives the events that the monitors generate. These events are persisted in the Metastore so that alerts can be generated when the alerting rules are met; the persisted events can also be used to study historic patterns. For more information, see Alerting APIs; a sketch of retrieving these events is shown after this list.
  • An alert profile defines the setup for alerting. The default profile enables SMTP and SNMP.
  • An alert forwarder is the service that is responsible for sending the alerts and traps. After the watchdog alert manager identifies a possible alert, it invokes the alert forwarder to send the alert to the customer environment.
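
For example, a minimal sketch of retrieving the persisted events from the watchdog alert manager. The GET route is an assumption based on the v1/monitoring/events path that the monitors report to; see Alerting APIs for the exact routes.

curl -X GET 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/events' \
-H 'Authorization: Bearer <authorization-token>'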

Introduction

By default, Cloud Pak for Data is initialized with one monitor that runs every 10 minutes. The diagnostics monitor records the status of deployments, StatefulSets, and persistent volume claims. It also tracks your system usage of virtual processors (vCPUs) and memory. The collected data can be used for analysis and to alert customers in a production environment based on the alert rules that are set.
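
To see the framework's scheduled jobs, you can list the cron jobs in the Cloud Pak for Data project. A minimal sketch, assuming the default zen namespace and the oc CLI; the exact job names can vary by release.

# List the monitoring and alerting cron jobs in the Cloud Pak for Data namespace
oc get cronjobs -n zen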

Alerting framework

To configure monitoring and alerting, you can perform the following tasks:

  1. Set quotas on the platform.
  2. Set up alerting rules.
  3. Understand alert profiles.
  4. Configure SMTP alerts.
  5. Configure SNMP alerts.
  6. Configure Slack alerts.

How does it work?

Alerting framework flow diagram.
  1. The watchdog alert manager cron job iterates through the extensions of the zen_alert_monitor extension type and creates cron jobs for the monitors with the metadata that is provided. It uses product metrics as input and updates policies in the Metastore.
  2. The cron job monitors report events by using the v1/monitoring/events API. (A sketch of such a request is shown after these steps.)
  3. The event is stored in the Metastore database. For example:

     Monitor_type | Event_type | Reference | Alerted_time | Metadata | Severity | History
     Diagnostics | check-pvc-status | User-home-pvc (zen) | NOT_ALERTED | {Metadata about the resource} | info | {"time": "critical/warning/info", ...}
     Diagnostics | check-resource-status | Watson™ Knowledge Catalog | 08-23-2020:05:03:00 | {Metadata about the resource} | critical | {"time": "critical/warning/info", ...}

    If the same monitor, event_type, and reference reports another event, the record is updated with the latest metadata, and the event severity and reported time are recorded in the history column.

  4. The diagnostics monitor runs every 10 minutes, checking the status of PVCs and pods.
  5. The alerting cron job runs every 10 minutes, checking for possible alerts using quotas and thresholds to determine the severity of the alert. The watchdog monitoring cron job goes through all the events in the Metastore database and checks for events with critical or warning severity. Depending on the count needed to satisfy an alert condition, which is defined by the rules set for the alert_type and corresponding severity, alerts are either sent or postponed until the conditions are satisfied.
  6. The administrator can change the quotas, thresholds, and the amount of flexibility over the system when quotas are reached. These changes take effect in the alerting framework immediately. For more information, see Managing the platform.
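
For reference, the following sketch shows a monitor reporting an event (step 2). The path follows the v1/monitoring/events API that is named above; the payload fields are assumptions modeled on the Metastore columns in step 3, so see Alerting APIs for the exact schema.

curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/events' \
-H 'Authorization: Bearer <authorization-token>' \
-H 'Content-Type: application/json' \
-d '{
  "monitor_type": "diagnostics",
  "events": [
    {
      "event_type": "check-pvc-status",
      "reference": "user-home-pvc",
      "severity": "critical",
      "metadata": { "message": "PVC is unbound" }
    }
  ]
}'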

Set up alerting rules

You can enable alerts for critical and warning events and define when to forward a particular alert to the user. You can set the throttle time so that users are not flooded with alerts while an event persists. To change the default alerting rules, you must use the Alerting APIs. For more information, see Configure alert rules.

The default alerting rules are set as follows:

  • For critical events, an alert is sent when 3 consecutive critical events are recorded during monitor runs, that is, when the condition persists for 30 minutes. After the alert is sent, alerting for that condition is snoozed for 12 hours.
  • For warning events, an alert is sent when 5 warning events are recorded during the last 20 monitor runs. After the alert is sent, alerting for that condition is snoozed for 24 hours.

You can set the following parameters:

Parameter | Description
severity | The severity that the rule applies to. Can be one of the following options: critical, warning. You can't configure alert rules for informational alerts.
trigger_type | Determines how to trigger alerts. Can be one of the following options: immediate, custom. The custom option is associated with alert_count and alert_over_count.
alert_count | Count of events with the severity type.
alert_over_count | Count of total events over which alert_count is evaluated.
snooze_time | The number of hours to suppress further alerts after an alert is sent.
notify_when_condition_clears | Determines whether to send an alert when the condition clears. This alert is sent with an alert_type of info.
Technical details

Alerting rules are defined through zen_alert_type extensions. Each alert type defines a set of rules for each event severity: critical, warning, and info. For example, the following alert extension defines the alerting rules for the diagnostics monitors.

      extensions: |
        [
          {
            "extension_point_id": "zen_alert_type",
            "extension_name": "zen_alert_type_platform",
            "display_name": "Platform alert type",
            "details": {
              "name": "platform",
              "description": "defines rules for alerting on diagnostics monitors",
              "rules": {
                "critical": { 
                  "trigger_type": "custom",
                  "alert_count": 3,
                  "alert_over_count": 3,
                  "snooze_time": 12,
                  "notify_when_condition_clears": true
                }, 
                "warning": { 
                  "trigger_type": "custom",
                  "alert_count": 5,
                  "alert_over_count": 20,
                  "snooze_time": 24
                }
              }
            }
          }
        ]
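
As a usage sketch, suppose an administrator wants the warning rule to fire sooner (for example, after 3 warnings in the last 10 monitor runs). The endpoint below is an assumption modeled on the zen-watchdog paths that are used elsewhere on this page; see Configure alert rules for the authoritative routes and payloads.

curl -X PUT 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/alerttypes/platform' \
-H 'Authorization: Bearer <authorization-token>' \
-H 'Content-Type: application/json' \
-d '{
  "rules": {
    "warning": {
      "trigger_type": "custom",
      "alert_count": 3,
      "alert_over_count": 10,
      "snooze_time": 24
    }
  }
}'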

Understand alert profiles

A default profile is installed when Cloud Pak for Data is installed. Currently, you cannot set up custom alert profiles. The default profile enables SMTP and SNMP. If an administrator has set up SMTP or SNMP, these alerts are forwarded by default. An administrator must configure Slack alerts to enable alerts to Slack.

Technical details

Alert profiles are defined in zen_alert_profile extensions. For example, the following alert profile extension enables SMTP, SNMP, and Slack alerts.

      extensions: |
        [
          {
            "extension_point_id": "zen_alert_profile",
            "extension_name": "zen_alert_profile_default",
            "display_name": "Default alert profile",
            "details": {
              "name": "default",
              "description": "Default alert profile which enables all possible alerts, as long as the respective configuration details are provided via endpoints.",
              "alerts": {
                "smtp": true,
                "snmp": true,
                "slack": true
              },
              "smtp":{
                "registered_email_ids": []
              }
            }
          }
        ]

Configure SMTP alerts

Alerts can be sent as email by using SMTP. You can configure a connection to your SMTP server in Administration > Platform configuration.

For more information, see Enabling email notifications and Configure email recipients.

Configure SNMP alerts

Alerts can be sent as traps by using the Simple Network Management Protocol (SNMP). SNMP is a standard protocol for collecting and organizing information about managed devices or services. It exposes management data in the form of variables that are defined in management information base (MIB) files.

You must configure an SNMP server with a trap listener. For more information, see Installing Net-SNMP.
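
If you only need a listener for testing, the following is a minimal sketch using Net-SNMP's snmptrapd, assuming a community string of public (adjust for your environment):

# Accept and log traps that use the "public" community string
echo 'authCommunity log public' > /etc/snmp/snmptrapd.conf

# Run the trap daemon in the foreground, logging to stdout
snmptrapd -f -Lo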

If you already have a trap listener running, you can use the following command to configure the alerting watchdog to use SNMP.

curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-F host=<value> \
-F port=<value> \
-F community=<value> \
-H 'Authorization: Bearer <authorization-token>'

Set the following parameters:

Parameter | Type | Description
host | String | The SNMP server host address.
port | String | The SNMP server port.
community | String | The community string that is associated with the SNMP connection.
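
For example, to register a trap listener that is running on snmp.example.com (hypothetical values; 162 is the standard SNMP trap port):

curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-F host=snmp.example.com \
-F port=162 \
-F community=public \
-H 'Authorization: Bearer <authorization-token>'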

To check whether your configuration is stored correctly, you can use the following command:

curl -X GET 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-H 'Authorization: Bearer <authorization-token>'

For more information, see Configure SNMP.

Configure Slack alerts

To enable Slack alerts, an administrator must provide a webhook URL, which can be set up to receive notifications on a channel. When the webhook URL is available, you can use the following command:

curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/slack' \
-F webhook=<webhook URL> \
-H 'Authorization: Bearer <authorization-token>'

Set the following parameter:

Parameter | Type | Description
webhook | String | The URL of the webhook to post to a Slack channel.
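
To confirm that the webhook itself works before you register it, you can post a test message directly to Slack (the URL is a placeholder for your incoming webhook):

curl -X POST 'https://hooks.slack.com/services/<your-webhook-path>' \
-H 'Content-type: application/json' \
-d '{"text": "Test alert from Cloud Pak for Data"}'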

For more information, see Configure Slack.