Monitoring and alerting in Cloud Pak for Data

IBM Cloud Pak® for Data provides a monitoring and alerting framework that you can use to monitor the state of the platform and to set up alerts that notify you when action is needed, based on thresholds that you define.

Glossary

To get started, you should understand the following terms:

  • An event is the report of the state of an entity such as a pod, persistent volume claim (PVC), or other resource. The following event types are delivered with Cloud Pak for Data.
    • check-instance-status - A service instance is composed of one or more pods. The check-instance-status event monitors the status of service instances to determine whether the pods that are associated with the instance are running as expected. A critical state indicates that one or more pods that are associated with the instance are in a failed or unknown state.
    • check-monitor-status - A monitor is a script that checks the state of an entity periodically and generates events based on the state of the entity. The check-monitor-status event monitors the status of monitoring jobs to determine whether the jobs completed successfully. A critical state indicates that one or more jobs did not complete successfully.
    • check-pvc-status - A persistent volume claim (PVC) is a request for storage that meets specific criteria, such as a minimum size or a specific access mode. The check-pvc-status event monitors the status of the PVCs that are associated with Cloud Pak for Data and reports any issues. A critical state indicates that the PVC is not associated with a storage volume, which means that the service cannot store data.
    • check-quota-status - An administrator sets a vCPU quota and a memory quota for services or for the platform. The check-quota-status event monitors the quotas and requests that are associated with Cloud Pak for Data to determine whether services have sufficient resources to fulfill requests. A critical state indicates that the service has insufficient resources to fulfill requests.

      For more information about setting quotas and thresholds, see Monitoring the platform.

    • check-deployment-status - Each service is configured to maintain a specific number of Deployment replicas. The check-deployment-status event monitors the status of Deployment replicas that are associated with Cloud Pak for Data and reports any issues. A critical state indicates that the service does not have enough replicas.
    • check-statefulset-status - Each service is configured to maintain a specific number of StatefulSet replicas. The check-statefulset-status event monitors the status of StatefulSet replicas that are associated with Cloud Pak for Data and reports any issues. A critical state indicates that the service does not have enough replicas.
    • check-service-status - A service is composed of pods and one or more service instances. A critical state indicates that either a service instance is in a failed state or a pod is in a failed or unknown state.
  • The severity of the event indicates the criticality of the event. The severity of an event can be: critical, warning, or info. Each type of event includes metadata for the event, including a description and steps to resolve the event.
    • critical - Monitored resources are unstable. Alerting is essential if this state persists.
    • warning - Monitored resources have reached a warning threshold. Immediate alerting might not be required.
    • info - Monitored resources behave as expected. Informational messages only.
  • An alert is an event that indicates an issue or potential issue. Alerts can be sent by using either traps (SNMP) or email (SMTP). Each alert type can be associated with different alert rules. For example, an alert type might alert immediately or wait for an event to occur a specified number of times before the alert forwarder sends an alert.
  • A quota for a resource, such as vCPUs and memory, is a target that determines the severity of an alert. If resource usage exceeds a quota, the event is considered critical. If resource usage exceeds the percentage of the quota that is defined by the alert threshold, the event is considered a warning.
  • A monitor is a script whose purpose is to check the state of an entity periodically and generate events. A single monitor can register events for different purposes. For example, the diagnostics monitor that comes with Cloud Pak for Data generates events to check the status of persistent volume claims, StatefulSets, and deployments.
  • A watchdog alert manager (WAM) monitors all monitors to ensure that they run on schedule. The WAM also exposes an API that listens for events that are generated by the monitors (see the sketch after this list for an example). These events are persisted in the Metastore so that alerts can be generated when the alerting rules are met, and the persisted events can also be used to study historical patterns. For more information, see Alerting APIs.
  • An alert profile defines the setup for alerting. The default profile enables SMTP and SNMP.
  • An alert forwarder is the service that is responsible for sending the alerts and traps. After the watchdog alert manager identifies a possible alert, it invokes the alert forwarder to send the alert to the customer environment.
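
To illustrate how a monitor, an event, and the watchdog alert manager fit together, the following sketch reports a hypothetical event to the watchdog API. The endpoint path and the payload fields are assumptions that are modeled on the v1/monitoring/events API and the event attributes (monitor type, event type, reference, severity, metadata) that are described in this topic; see Alerting APIs for the exact request format.

# Report a hypothetical event to the watchdog alert manager.
# The path and payload fields are assumptions modeled on the event
# attributes described in this topic; see Alerting APIs for the exact format.
curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/events' \
-H 'Authorization: Bearer <authorization-token>' \
-H 'Content-Type: application/json' \
-d '{
  "monitor_type": "diagnostics",
  "event_type": "check-pvc-status",
  "reference": "user-home-pvc",
  "severity": "warning",
  "metadata": {
    "namespace": "zen"
  }
}'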

Introduction

By default, Cloud Pak for Data is initialized with one monitor that runs every 10 minutes. The diagnostics monitor records the status of deployments, StatefulSets, and persistent volume claims. It also tracks your system usage of virtual processors (vCPUs) and memory. The collected data can be used for analysis and, in a production environment, to alert customers based on the alert rules that are set.

Tip: The Cloud Pak for Data monitoring-utils repository on GitHub includes monitoring modules that you can:
  • Install on your own cluster to help you monitor your Cloud Pak for Data deployment.
  • Use as examples to develop your own custom monitors.

For details on these monitors, see https://github.com/IBM-ICP4D/monitoring-utils.
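
For example, you can clone the repository to review the available monitors and use them as a starting point for your own custom monitors:

# Clone the monitoring-utils repository and review its contents.
git clone https://github.com/IBM-ICP4D/monitoring-utils.git
cd monitoring-utils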

Alerting framework

To configure monitoring and alerting, you can perform the following tasks:

  1. Set quotas on the platform.
  2. Set up alerting rules.
  3. Understand alert profiles.
  4. Configure SMTP alerts.
  5. Configure SNMP alerts.
  6. Configure Slack alerts.

How does it work?

Alerting framework flow diagram.
  1. The watchdog alert manager cron job iterates through the extensions of the zen_alert_monitor extension type and creates cron jobs for the monitors with the metadata that is provided. It uses product metrics as input and updates policies in Metastore. (To verify that these cron jobs exist and run, see the example after this list.)
  2. The monitor cron jobs report events by using the v1/monitoring/events API.
  3. The event is stored in the Metastore database. For example:

     Monitor_type  Event_type          Reference                  Alerted_time         Metadata                       Severity  History
     Diagnostics   check-pvc-status    user-home-pvc (zen)        NOT_ALERTED          {Metadata about the resource}  info      {"time": "critical/warning/info"}
     Diagnostics   check-quota-status  Watson™ Knowledge Catalog  08-23-2020:05:03:00  {Metadata about the resource}  critical  {"time": "critical/warning/info"}

    If the same monitor, event_type, and reference combination reports another event, the record is updated with the latest metadata, and the severity and reported time of the event are added to the History column.

  4. The diagnostics monitor runs every 10 minutes, checking the status of PVCs and pods.
  5. The alerting cron job runs every 10 minutes, checking for possible alerts using quotas and thresholds to determine the severity of the alert. The watchdog monitoring cron job goes through all the events in the Metastore database and checks for events with critical or warning severity. Depending on the count needed to satisfy an alert condition, which is defined by the rules set for the alert_type and corresponding severity, alerts are either sent or postponed until the conditions are satisfied.
  6. An administrator can change the quotas, the thresholds, and how much leeway the system allows when quotas are reached. These changes are applied to the alerting framework immediately. For more information, see Monitoring the platform.
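
To confirm that the monitor and alerting cron jobs exist and run on schedule, you can list them with the OpenShift CLI. This is a minimal sketch; the cron job names depend on the monitors that are registered in your deployment, and <namespace> is the project where Cloud Pak for Data is installed.

# List the cron jobs that the watchdog alert manager creates for the monitors.
oc get cronjobs -n <namespace>

# List recent monitor job runs, oldest first, to confirm that they complete.
oc get jobs -n <namespace> --sort-by=.metadata.creationTimestamp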

Set up alerting rules

You can enable alerts for critical and warning events and define when a specific alert is forwarded to users. You can set a throttle (snooze) time so that users are not flooded with alerts when an event persists. To change the default alerting rules, you must use the Alerting APIs. For more information, see Configure alert rules.

The default alerting rules are set as follows:

  • For critical events, an alert is sent when 3 consecutive critical events are recorded during monitor runs, which means that the condition has persisted for approximately 30 minutes. After the alert is sent, it is snoozed for 12 hours.
  • For warning events, an alert is sent when 5 warning events are recorded during the last 20 monitor runs. After the alert is sent, it is snoozed for 24 hours.

You can set the following parameters:

Parameter                     Description
severity                      The severity that the rule applies to. Can be one of the following options: critical or warning. You can't configure the alert rules for informational alerts.
trigger_type                  Determines how to trigger alerts. Can be one of the following options: immediate or custom. The custom option is associated with alert_count and alert_over_count.
alert_count                   The number of events with the specified severity that must be recorded before an alert is sent.
alert_over_count              The total number of recent events over which alert_count is evaluated.
snooze_time                   The number of hours to suppress further alerts after an alert is sent.
notify_when_condition_clears  Determines whether to send an alert when the condition clears. This alert is sent with an alert_type of info.
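
For example, the following sketch shows how these parameters might be passed to the Alerting APIs to update the rules for critical events. The HTTP method and the <alert-rules-endpoint> path are placeholders, not the documented API; see Configure alert rules for the exact request.

# Hypothetical request that updates the alert rules for critical events.
# <alert-rules-endpoint> and the PUT method are placeholders; the actual
# endpoint is documented in Configure alert rules (Alerting APIs).
curl -X PUT 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/<alert-rules-endpoint>' \
-H 'Authorization: Bearer <authorization-token>' \
-H 'Content-Type: application/json' \
-d '{
  "severity": "critical",
  "trigger_type": "custom",
  "alert_count": 3,
  "alert_over_count": 3,
  "snooze_time": 12,
  "notify_when_condition_clears": true
}'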

Technical details
Alerting rules are defined through zen_alert_type extensions. Each alert type defines a set of rules for each event severity: critical, warning, and info. For example, the following alert type extension defines the rules that are used to alert on the diagnostics monitors.
  extensions: |
    [
      {
        "extension_point_id": "zen_alert_type",
        "extension_name": "zen_alert_type_platform",
        "display_name": "Platform alert type",
        "details": {
          "name": "platform",
          "description": "defines rules for alerting on diagnostics monitors",
          "rules": {
            "critical": { 
              "trigger_type": "custom",
              "alert_count": 3,
              "alert_over_count": 3,
              "snooze_time": 12,
              "notify_when_condition_clears": true
            }, 
            "warning": { 
              "trigger_type": "custom",
              "alert_count": 5,
              "alert_over_count": 20,
              "snooze_time": 24
            }
          }
        }
      }
    ]

Understand alert profiles

A default profile is installed when Cloud Pak for Data is installed. Currently, you cannot set up custom alert profiles. The default profile enables SMTP and SNMP. If an administrator has set up SMTP or SNMP, these alerts are forwarded by default. An administrator must configure Slack alerts to enable alerts to Slack.

For more information, see Configure SMTP alerts, Configure SNMP alerts, and Configure Slack alerts.


Technical details
Alert profiles are defined in zen_alert_profile extensions. For example, the following alert profile extension enables SMTP, SNMP, and Slack alerts.
  extensions: |
    [
      {
        "extension_point_id": "zen_alert_profile",
        "extension_name": "zen_alert_profile_default",
        "display_name": "Default alert profile",
        "details": {
          "name": "default",
          "description": "Default alert profile which enables all possible alerts, as long as the respective configuration details are provided via endpoints.",
          "alerts": {
            "smtp": true,
            "snmp": true,
            "slack": true
          },
          "smtp":{
            "registered_email_ids": []
          }
        }
      }
    ]

Configure SMTP alerts

Alerts can be sent as email by using SMTP. You can configure a connection to your SMTP server in Administration > Platform configuration.

For more information, see Enabling email notifications and Configure email recipients.

Configure SNMP alerts

Alerts can be sent as traps by using SNMP (Simple Network Management Protocol). SNMP is a standard protocol for collecting and organizing information about managed devices or services. It exposes management data in the form of variables that are defined in management information base (MIB) files.

You must configure an SNMP server with a trap listener. For more information, see Installing Net-SNMP.
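
For example, after you install Net-SNMP you can run a trap listener in the foreground while you test the configuration. This is a minimal sketch; adjust the community string and the configuration file path to match your environment.

# Allow the listener to log traps that are sent with the "public" community
# string (adjust to match the community string that you configure below).
echo 'authCommunity log,execute,net public' >> /etc/snmp/snmptrapd.conf

# Run the Net-SNMP trap listener in the foreground (-f) and log received
# traps to standard output (-Lo). The listener uses the default trap port, 162.
snmptrapd -f -Lo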

If you already have a trap listener running, you can use the following command to configure the alerting watchdog to use SNMP.

curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-F host=<value> \
-F port=<value> \
-F community=<value> \
-H 'Authorization: Bearer <authorization-token>'

Set the following parameters:

Parameter  Type    Description
host       String  The SNMP server host address.
port       String  The SNMP server port. The default port is 162.
community  String  The community string associated with the SNMP connection.

To check whether your configuration is stored correctly, you can use the GET command.

curl -X GET 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-H 'Authorization: Bearer <authorization-token>'

For more information, see Configure SNMP.

Configure Slack alerts

To enable Slack alerts, an administrator must provide a webhook URL, which can be set up to receive notifications on a channel. When the webhook URL is available, you can use the following command:

curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/slack' \
-F webhook=<webhook URL> \
-H 'Authorization: Bearer <authorization-token>'

Set the following parameter:

Parameter  Type    Description
webhook    String  The URL of the webhook to post to a Slack channel.
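
To confirm that the webhook URL posts to the expected channel, you can send a test message to it directly. This is a minimal sketch that uses the standard Slack incoming webhook payload; <webhook URL> is the same URL that you register with the alerting framework.

# Send a test message to the Slack incoming webhook to confirm that it
# posts to the expected channel.
curl -X POST '<webhook URL>' \
-H 'Content-Type: application/json' \
-d '{"text": "Test alert from Cloud Pak for Data monitoring"}'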

For more information, see Configure Slack.