Monitoring and alerting in Cloud Pak for Data
Glossary
To get started, you should understand the following terms:
- An event is the report of the state of an entity such as a pod, persistent volume claim (PVC), or other resource. The following event types are delivered with Cloud Pak for Data:
  - check-instance-status - A service instance is composed of one or more pods. The check-instance-status event monitors the status of service instances to determine whether the pods that are associated with the instance are running as expected. A critical state indicates that one or more pods that are associated with the instance are in a failed or unknown state.
  - check-monitor-status - A monitor is a script that checks the state of an entity periodically and generates events based on the state of the entity. The check-monitor-status event monitors the status of monitoring jobs to determine whether the jobs completed successfully. A critical state indicates that one or more jobs did not complete successfully.
  - check-pvc-status - A persistent volume claim (PVC) is a request for storage that meets specific criteria, such as a minimum size or a specific access mode. The check-pvc-status event monitors the status of the PVCs that are associated with Cloud Pak for Data and reports any issues. A critical state indicates that the PVC is not associated with a storage volume, which means that the service cannot store data.
  - check-quota-status - An administrator sets a vCPU quota and a memory quota for services or for the platform. The check-quota-status event monitors the quotas and requests that are associated with Cloud Pak for Data to determine whether services have sufficient resources to fulfill requests. A critical state indicates that the service has insufficient resources to fulfill requests. For more information about setting quotas and thresholds, see Monitoring the platform.
  - check-deployment-status - Each service is configured to maintain a specific number of Deployment replicas. The check-deployment-status event monitors the status of Deployment replicas that are associated with Cloud Pak for Data and reports any issues. A critical state indicates that the service does not have enough replicas.
  - check-statefulset-status - Each service is configured to maintain a specific number of StatefulSet replicas. The check-statefulset-status event monitors the status of StatefulSet replicas that are associated with Cloud Pak for Data and reports any issues. A critical state indicates that the service does not have enough replicas.
  - check-service-status - A service is composed of pods and one or more service instances. A critical state indicates that either a service instance is in a failed state or a pod is in a failed or unknown state.
- The severity of an event indicates the criticality of the event. The severity can be critical, warning, or info. Each type of event includes metadata for the event, including a description and steps to resolve the event.
  - critical - Monitored resources are unstable. Alerting is essential if this state persists.
  - warning - Monitored resources have reached a warning threshold. Immediate alerting might not be required.
  - info - Monitored resources behave as expected. Informational messages only.
- An alert is an event that indicates an issue or potential issue. Alerts can be sent by using either traps (SNMP) or email (SMTP). Each alert type can be associated with different alert rules. For example, an alert type might alert immediately or wait for an event to occur a specified number of times before the alert forwarder sends an alert.
- A quota for a resource, such as vCPUs and memory, is a target that determines the severity of an alert. If resource usage exceeds a quota, the event is considered critical. If resource usage exceeds the percentage of the quota that is defined by the alert threshold, the event is considered a warning.
- A monitor is a script whose purpose is to check the state of an entity periodically and generate events. A single monitor can register events for different purposes. For example, the diagnostics monitor that comes with Cloud Pak for Data generates events to check the status of persistent volume claims, StatefulSets, and deployments.
- A watchdog alert manager (WAM) monitors all monitors to ensure that they run on schedule. The WAM also exposes an API that listens to events generated by the monitors. These events persist in Metastore for generating alerts when the alerting rules are met. The persisted events can also be used to study historic patterns. For more information, see Alerting APIs.
- An alert profile defines the setup for alerting. The default profile enables SMTP and SNMP.
- An alert forwarder is the service that is responsible for sending the alerts and traps. After the watchdog alert manager identifies a possible alert, it invokes the alert forwarder to forward the alert to the customer environment.
Introduction
By default, Cloud Pak for Data is initialized with one monitor that runs every ten minutes. The diagnostic monitor records the status of deployments, StatefulSets, and persistent volume claims. It also tracks your system usage of virtual processors (vCPUs) and memory. The data that is collected can be used for analysis and, in a production environment, to alert customers based on the alert rules that are set.
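If you want to confirm that these monitoring cron jobs exist on your cluster, you can list the cron jobs in the Cloud Pak for Data project. The following is a minimal sketch; the namespace placeholder and the grep pattern are assumptions, because the exact cron job names vary by release.
# List the cron jobs in the Cloud Pak for Data project.
# Replace <cpd-namespace> with the project where Cloud Pak for Data is installed.
oc get cronjobs -n <cpd-namespace>

# Narrow the list to monitoring-related jobs; adjust the pattern to match
# the names that your cluster reports.
oc get cronjobs -n <cpd-namespace> | grep -i -E 'diagnostic|watchdog'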
The monitoring-utils repository on GitHub includes monitoring modules that you can:
- Install on your own cluster to help you monitor your Cloud Pak for Data deployment.
- Use as examples to develop your own custom monitors.
For details on these monitors, see https://github.com/IBM-ICP4D/monitoring-utils.
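If you want to review or adapt these monitors, you can clone the repository locally:
# Clone the monitoring-utils repository to inspect the sample monitors.
git clone https://github.com/IBM-ICP4D/monitoring-utils.git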
To configure monitoring and alerting, you can perform the tasks that are described in the following sections.
How does it work?
- The watchdog alert manager cron job iterates through the extensions of the zen_alert_monitor extension type and creates cron jobs for the monitors with the metadata provided. It uses product metrics as input and updates policies in Metastore.
- The monitor cron jobs report events by using the v1/monitoring/events API (a sketch of such a call follows this list).
- The event is stored in the Metastore database. For example:
Monitor_type | Event_type | Reference | Alerted_time | Metadata | Severity | History |
---|---|---|---|---|---|---|
Diagnostics | check-pvc-status | user-home-pvc (zen) | NOT_ALERTED | {Metadata about the resource} | info | { "time": "critical/warning/info" } |
Diagnostics | check-quota-status | Watson™ Knowledge Catalog | 08-23-2020:05:03:00 | {Metadata about the resource} | critical | { "time": "critical/warning/info" } |
If the same monitor, event_type, and reference reports another event, the record is updated with the latest metadata, and the event severity and reported time are recorded in the history column.
- The diagnostics monitor runs every 10 minutes, checking the status of PVCs and pods.
- The alerting cron job runs every 10 minutes, checking for possible alerts using quotas and thresholds to determine the severity of the alert. The watchdog monitoring cron job goes through all the events in the Metastore database and checks for events with critical or warning severity. Depending on the count needed to satisfy an alert condition, which is defined by the rules set for the alert_type and corresponding severity, alerts are either sent or postponed until the conditions are satisfied.
- The administrator can change the quotas, the thresholds, and the amount of flexibility over the system when quotas are reached. These changes are fed back into the alerting framework immediately. For more information, see Monitoring the platform.
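As an illustration of how a monitor reports an event, the following sketch posts an event by using the monitoring events API. The full path under zen-watchdog and the payload field names are assumptions inferred from the other endpoints in this topic and from the Metastore columns in the table above; see Alerting APIs for the exact contract.
# Illustrative only: the path and the payload fields are assumptions based on
# the other zen-watchdog endpoints and the Metastore event columns shown above.
curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/events' \
-H 'Authorization: Bearer <authorization-token>' \
-H 'Content-Type: application/json' \
-d '[
      {
        "monitor_type": "diagnostics",
        "event_type": "check-pvc-status",
        "reference": "user-home-pvc (zen)",
        "severity": "info",
        "metadata": { "description": "PVC is bound and healthy" }
      }
    ]'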
Set up alerting rules
You can enable alerts for critical and warning events and define when to forward a certain alert to the user. You can set the throttle time so users are not spammed with alerts when an event persists. To change the default alerting rules, you must use the Alerting APIs. For more information, see Configure alert rules.
The default alerting rules are set as follows:
- For critical events, an alert is sent when 3 consecutive critical events are recorded during monitor runs, which means that the condition has persisted for about 30 minutes. After the alert is sent, alerting for that condition is snoozed for 12 hours.
- For warning events, an alert is sent when 5 warning events are recorded during the last 20 monitor runs. After the alert is sent, alerting for that condition is snoozed for 24 hours.
You can set the following parameters:
Parameter | Description |
---|---|
severity | The severity that the rule applies to: critical or warning. You can't configure the alert rules for informational alerts. |
trigger_type | Determines how to trigger alerts. The default platform alert type uses the custom trigger type, which alerts when alert_count events with the specified severity occur over the last alert_over_count events. |
alert_count | Count of events with the severity type that must occur before an alert is sent. |
alert_over_count | Count of total events over which alert_count is evaluated. |
snooze_time | The number of hours to wait before another alert is sent after an alert occurs. |
notify_when_condition_clears | Determines whether to send an alert when the condition clears. This alert is sent with an alert_type of info. |
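The following is a hypothetical sketch only: the endpoint placeholder is not documented in this topic, so replace it with the endpoint that is described in Configure alert rules. The request body combines the parameters from the table with the default critical rule values.
# Hypothetical sketch: replace <alert-rules-endpoint> with the endpoint that is
# documented in Configure alert rules. The body mirrors the default critical rule.
curl -X POST 'https://<my-deployment-url>/<alert-rules-endpoint>' \
-H 'Authorization: Bearer <authorization-token>' \
-H 'Content-Type: application/json' \
-d '{
      "severity": "critical",
      "trigger_type": "custom",
      "alert_count": 3,
      "alert_over_count": 3,
      "snooze_time": 12,
      "notify_when_condition_clears": true
    }'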
Technical details
Alert types are defined as zen_alert_type extensions. Each alert type defines rules for each event severity: critical, warning, and info. For example, the following alert extension defines the platform alert type, which sets the rules for alerting on events from the diagnostics monitors.
extensions: |
[
{
"extension_point_id": "zen_alert_type",
"extension_name": "zen_alert_type_platform",
"display_name": "Platform alert type",
"details": {
"name": "platform",
"description": "defines rules for alerting on diagnostics monitors",
"rules": {
"critical": {
"trigger_type": "custom",
"alert_count": 3,
"alert_over_count": 3,
"snooze_time": 12,
"notify_when_condition_clears": true
},
"warning": {
"trigger_type": "custom",
"alert_count": 5,
"alert_over_count": 20,
"snooze_time": 24
}
}
}
}
]
Understand alert profiles
A default profile is installed when Cloud Pak for Data is installed. Currently, you cannot set up custom alert profiles. The default profile enables SMTP and SNMP. If an administrator has set up SMTP or SNMP, these alerts are forwarded by default. An administrator must configure Slack alerts to enable alerts to Slack.
For more information, see Configure SMTP alerts, Configure SNMP alerts, and Configure Slack alerts.
Technical details
Alert profiles are defined as zen_alert_profile extensions. For example, the following alert profile extension enables SMTP, SNMP, and Slack alerts.
extensions: |
[
{
"extension_point_id": "zen_alert_profile",
"extension_name": "zen_alert_profile_default",
"display_name": "Default alert profile",
"details": {
"name": "default",
"description": "Default alert profile which enables all possible alerts, as long as the respective configuration details are provided via endpoints.",
"alerts": {
"smtp": true,
"snmp": true,
"slack": true
},
"smtp":{
"registered_email_ids": []
}
}
}
]
Configure SMTP alerts
Alerts can be sent as email by using SMTP. You can configure a connection to your SMTP server. For more information, see Enabling email notifications and Configure email recipients.
Configure SNMP alerts
Alerts can be sent as traps by using SNMP (simple network management protocol). SNMP is a standard protocol for collecting and organizing information about managed devices or services. It exposes management data in the form of variables that are defined in managed information base (MIB) files.
You must configure an SNMP server with a trap listener. For more information, see Installing Net-SNMP.
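If you do not yet have a trap listener, a minimal Net-SNMP setup looks roughly like the following sketch. The community string and port are example values; use the values that you plan to register with the alerting watchdog.
# Minimal snmptrapd.conf: accept and log traps for the "public" community.
# (Example value; match the community string that you register below.)
echo 'authCommunity log public' > snmptrapd.conf

# Run the trap listener in the foreground, logging to stdout, on the default
# trap port 162 (requires root, or choose an unprivileged port instead).
snmptrapd -f -Lo -C -c ./snmptrapd.conf udp:162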
If you already have a trap listener running, you can use the following command to configure the alerting watchdog to use SNMP.
curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-F host=<value> \
-F port=<value> \
-F community=<value> \
-H 'Authorization: Bearer <authorization-token>'
Set the following parameters:
Parameter | Type | Description |
---|---|---|
host | String | The SNMP server host address. |
port | String | The SNMP server port. The default port is 162. |
community | String | The community string associated with the SNMP connection. |
To check whether your configuration is stored correctly, you can use the GET command.
curl -X GET 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/snmp' \
-H 'Authorization: Bearer <authorization-token>'
For more information, see Configure SNMP.
Configure Slack alerts
To enable Slack alerts, an administrator must provide a webhook URL, which can be set up to receive notifications on a channel. When the webhook URL is available, you can use the following command:
curl -X POST 'https://<my-deployment-url>/zen-watchdog/v1/monitoring/config/slack' \
-F webhook=<webhook URL> \
-H 'Authorization: Bearer <authorization-token>'
Parameter | Type | Description |
---|---|---|
webhook | String | The URL of the webhook to post to a Slack channel. |
For more information, see Configure Slack.
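To confirm that the webhook URL posts to the intended channel, you can also send a test message directly to Slack. This is a standard Slack incoming-webhook call, not a Cloud Pak for Data API; the message text is only an example.
# Post a test message to the Slack channel behind the webhook.
curl -X POST '<webhook URL>' \
-H 'Content-Type: application/json' \
-d '{"text": "Test message: Cloud Pak for Data alert forwarding check."}'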