IBM® Cloud Private cluster monitoring

You can use the IBM Cloud Private cluster monitoring dashboard to monitor the status of your cluster and applications.

The monitoring dashboard uses Grafana and Prometheus to present detailed data about your cluster nodes and containers. For more information about Grafana, see the Grafana documentation. For more information about Prometheus, see the Prometheus documentation.

Accessing the monitoring dashboard
Role-based access
Configuring alerts
Accessing monitoring service APIs

Accessing the monitoring dashboard

Log in to the IBM Cloud Private management console.

Note: When you log in to the management console, you have administrative access to Grafana. Do not create more users within the Grafana dashboard, and do not modify the existing users or organization.
Click Menu > Platform > Monitoring. Alternatively, you can open https://<master_ip>:8443/grafana, where master_ip is the IP address of the master node.
From the Grafana dashboard, open one of the three default dashboards:
- Docker Host & Container Overview
- Kubernetes cluster monitoring (via Prometheus)
- Prometheus Stats

If you want to view other data, you can create new dashboards or import dashboards from JSON definition files for Grafana.
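As an alternative to importing through the Grafana UI, a dashboard definition file can in principle be sent to Grafana's dashboard endpoint (`POST /api/dashboards/db`). The following is a minimal sketch, not a documented IBM Cloud Private procedure. It assumes that the proxied Grafana accepts the same bearer token as the management APIs and that `MASTER_IP` and `ACCESS_TOKEN` are set in your shell. The helper names `dashboard_payload` and `import_dashboard` are hypothetical.

```shell
# Hypothetical helper: wrap a dashboard definition file in the request body
# that Grafana's create/update dashboard endpoint expects.
dashboard_payload() {
  # $1: path to a Grafana dashboard JSON definition file
  printf '{"dashboard": %s, "overwrite": false}\n' "$(cat "$1")"
}

# Hypothetical helper: POST the payload to Grafana behind the cluster proxy.
# Assumes MASTER_IP and ACCESS_TOKEN are set (both are assumptions here).
import_dashboard() {
  dashboard_payload "$1" |
    curl -k -s -X POST \
      -H "Authorization: Bearer $ACCESS_TOKEN" \
      -H "Content-Type: application/json" \
      --data-binary @- \
      "https://$MASTER_IP:8443/grafana/api/dashboards/db"
}

# Usage: import_dashboard my-dashboard.json
```

When creating a new dashboard this way, Grafana expects the `id` field in the definition to be `null`. Setting `overwrite` to `false` keeps an existing dashboard with the same title from being replaced silently.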

Role-based access

Role-based access for the monitoring API

A user with the ClusterAdministrator, Administrator, or Operator role can access the monitoring service. A user with the ClusterAdministrator or Administrator role can also perform write operations in the monitoring service, including deleting Prometheus metrics data and updating Grafana configurations.

Role-based access for monitoring data

Starting with version 1.2.0, the ibm-icpmonitoring Helm chart includes a module that provides role-based access control (RBAC) for the Prometheus metrics data.

The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines each request for authorization headers and enforces role-based controls at that point. In general, the RBAC rules are as follows:

A user with the role ClusterAdministrator can access any resource. A user with any other role can only access data in the namespaces for which that user is authorized.

Configuring alerts

You can configure Prometheus alerts or integrate external alert service providers, such as Slack or PagerDuty, with IBM Cloud Private.

Important: ConfigMap changes are lost when you upgrade, roll back, or update the monitoring release. In addition, the ConfigMap format can change between releases.

From the IBM Cloud Private management console, click Menu > Configuration > ConfigMaps.

To configure Prometheus alerts, complete the following steps:

In the same row as the monitoring-prometheus-alertrules ConfigMap, select Action > Edit.
In the data section, provide the alerts. For more information about the alert configuration, see Alerting Rules in the Prometheus documentation.

For example, to configure two sample alerts to test the alert manager dashboard, replace the data section with the following text:

"data": {
 "sample.rules": "groups:\n  - name: a.rules\n    rules:\n      - alert: NodeMemoryUsage\n        expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes))/ node_memory_MemTotal_bytes) * 100 > 75\n        for: 1m\n        labels:\n          severity: page\n        annotations:\n          DESCRIPTION: '{{  $labels.instance  }}: Memory usage is above 75% (current value is: {{  $value  }})'\n          SUMMARY: '{{ $labels.instance }}: High memory usage detected'\n      - alert: HighCPUUsage\n        expr: ((sum(node_cpu_seconds_total{mode=~\"user|nice|system|irq|softirq|steal|idle|iowait\"}) BY (instance, job)) - (sum(node_cpu_seconds_total{mode=~\"idle|iowait\"}) BY (instance, job))) / (sum(node_cpu_seconds_total{mode=~\"user|nice|system|irq|softirq|steal|idle|iowait\"}) BY (instance, job)) * 100 > 3\n        for: 1m\n        labels:\n          service: backend\n        annotations:\n          description: This machine has really high CPU usage for over 1m\n          summary: High CPU Usage"
}

After you replace the data section, click Submit.
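The escaped string in the ConfigMap is hard to read and easy to break. As a rough aid, the sketch below writes the first sample rule (NodeMemoryUsage) as plain YAML to a local file so that it can be reviewed, and optionally validated, before it is pasted into the ConfigMap. The function name `write_sample_rules` and the file name are hypothetical. The validation step assumes that Prometheus's `promtool` utility is installed on your workstation.

```shell
# Hypothetical helper: write the NodeMemoryUsage sample rule as plain YAML.
# The quoted 'EOF' delimiter keeps the shell from expanding $labels and $value.
write_sample_rules() {
  # $1: path of the rules file to create
  cat > "$1" <<'EOF'
groups:
  - name: a.rules
    rules:
      - alert: NodeMemoryUsage
        expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes) * 100 > 75
        for: 1m
        labels:
          severity: page
        annotations:
          DESCRIPTION: '{{ $labels.instance }}: Memory usage is above 75% (current value is: {{ $value }})'
          SUMMARY: '{{ $labels.instance }}: High memory usage detected'
EOF
}

# Optional check, assuming promtool is on your PATH:
#   write_sample_rules sample.rules && promtool check rules sample.rules
```

In the ConfigMap, the same content is stored as a single JSON string, with each line break written as `\n` and each double quotation mark escaped as `\"`.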

To integrate external alert services:
- In the same row as the monitoring-prometheus-alertmanager ConfigMap, select Action > Edit.
- For more information about configuring the alerts, see Configuration and Notification Template Examples in the Prometheus documentation.
Allow several minutes for the updates to take effect, and open the alert manager dashboard at https://<master_ip>:8443/alertmanager.
- If you configured the sample alerts, they are displayed.
- Any other valid alerts that you configured are also displayed.
  
  After the alert manager dashboard displays valid data, if you configured an external alert service, you can view those alerts in the dashboard for that service.
  
  You can return to this dashboard to view alerts at any time.
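For the external alert service integration described above, an Alertmanager configuration that routes every alert to a Slack channel might look like the following sketch. The receiver name, channel, and webhook URL are placeholders, not values from IBM Cloud Private, and the function name `write_alertmanager_config` is hypothetical. The optional validation step assumes that the `amtool` utility that ships with Alertmanager is installed locally.

```shell
# Hypothetical helper: write a minimal Alertmanager configuration that sends
# all alerts to one Slack receiver. Replace the placeholder webhook URL and
# channel with your own values before use.
write_alertmanager_config() {
  # $1: path of the configuration file to create
  cat > "$1" <<'EOF'
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/YOURS
        channel: '#alerts'
        send_resolved: true
EOF
}

# Optional check, assuming amtool is on your PATH:
#   write_alertmanager_config alertmanager.yml && amtool check-config alertmanager.yml
```

Review the result against the existing content of the monitoring-prometheus-alertmanager ConfigMap before you submit it, because the ConfigMap format can change between releases.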

Accessing monitoring service APIs

You can access monitoring service APIs such as Prometheus and Grafana APIs. Before you can access the APIs, you must obtain authentication tokens to specify in your request headers. For information about obtaining authentication tokens, see Preparing to run component or management API commands.

After you obtain the authentication tokens, complete the following steps to access the Prometheus and Grafana APIs.

Access the Prometheus API at the URL https://<cluster_CA_domain>:<Port>/prometheus/* and get the boot times of all nodes.
- $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
- <cluster_CA_domain> is the certificate authority (CA) domain that was set in your config.yaml file during installation.
- <Port> is the port that is used to access the management console.
```
curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<cluster_CA_domain>:<Port>/prometheus/api/v1/query?query=node_boot_time_seconds"
```
For detailed information about Prometheus APIs, see Prometheus HTTP API.
Access the Grafana API at the URL https://<cluster_CA_domain>:8443/grafana/* and obtain the sample dashboard.
- $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
- <cluster_CA_domain> is the certificate authority (CA) domain that was set in your config.yaml file during installation.
```
curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<cluster_CA_domain>:8443/grafana/api/dashboards/db/sample"
```
For detailed information about Grafana APIs, see Grafana HTTP API Reference.
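Prometheus queries that contain spaces, braces, or quotation marks must be URL-encoded before they are sent. The sketch below wraps the Prometheus request in a small helper that lets curl do the encoding and adds a check for the `status` field that the Prometheus HTTP API returns. The function names `prom_query` and `prom_ok` are hypothetical, as are the shell variables `CLUSTER_CA_DOMAIN` and `PORT`, which stand in for the <cluster_CA_domain> and <Port> placeholders.

```shell
# Hypothetical helper: run an instant query against the Prometheus API.
# Assumes ACCESS_TOKEN, CLUSTER_CA_DOMAIN, and PORT are set in your shell.
prom_query() {
  # $1: PromQL expression
  # -G appends the --data-urlencode value to the URL as a query string,
  # so characters such as spaces, braces, and quotes are encoded safely.
  curl -k -s -G \
    -H "Authorization: Bearer $ACCESS_TOKEN" \
    --data-urlencode "query=$1" \
    "https://$CLUSTER_CA_DOMAIN:$PORT/prometheus/api/v1/query"
}

# Succeeds when a Prometheus API response read from stdin reports success.
prom_ok() {
  grep -q '"status"[[:space:]]*:[[:space:]]*"success"'
}

# Usage: check that a query for node uptime in seconds is accepted.
#   prom_query 'time() - node_boot_time_seconds' | prom_ok && echo "query succeeded"
```

Keeping the expression in a separate `--data-urlencode` argument also prevents the shell from interpreting characters such as `?` or `{` in the URL.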