IBM Cloud Private cluster monitoring

You can use the IBM® Cloud Private cluster monitoring dashboard to monitor the status of your cluster and applications.

The monitoring dashboard uses Grafana and Prometheus to present detailed data about your cluster nodes and containers. For more information about Grafana, see the Grafana documentation. For more information about Prometheus, see the Prometheus documentation.

Accessing the monitoring dashboard

  1. Log in to the IBM Cloud Private management console.

    Note: When you log in to the management console, you have administrative access to Grafana. Do not create more users within the Grafana dashboard or modify the existing users or organizations.

  2. To access the Grafana dashboard, click Menu > Platform > Monitoring. Alternatively, you can open https://<IP_address>:<port>/grafana, where <IP_address> is the DNS name or IP address and <port> is the port that you use to access the IBM Cloud Private console.
  3. To access the Alertmanager dashboard, click Menu > Platform > Alerting. Alternatively, you can open https://<IP_address>:<port>/alertmanager.
  4. To access the Prometheus dashboard, open https://<IP_address>:<port>/prometheus.
  5. From the Grafana dashboard, open one of the following default dashboards:

    • IBM Multicloud Manager Monitoring

    Provides information for metrics such as CPU, Memory, and Network for managed clusters. This dashboard is available only for IBM Multicloud Manager hub clusters.

    • Elasticsearch

    Provides information about Elasticsearch cluster statistics, shards, and other system information.

    • Etcd by Prometheus

    Etcd Dashboard for Prometheus metrics scraper.

    • Helm Release Metrics

    Provides information about system metrics such as CPU and Memory for each Helm release that is filtered by pods.

    • IBM Cloud Private Namespaces Performance IBM Provided 2.5

    Provides information about namespace performance and status metrics.

    • Cluster Network Health (Calico)

    Provides Calico host, workload, and system metric performance information.

    • IBM Cloud Private Performance IBM Provided 2.5

    Provides TCP system performance information about Nodes, Memory, and Containers.

    • Kubernetes Cluster Monitoring

    Monitors Kubernetes clusters that use Prometheus. Provides information about cluster CPU, Memory, and Filesystem usage. The dashboard also provides statistics for individual pods, containers, and systemd services.

    • Kubernetes POD Overview

    Monitors pod metrics such as CPU, Memory, Network, pod status, and restarts.

    • NGINX Ingress controller
      Provides information about NGINX Ingress controller metrics that can be sorted by namespace, controller class, controller, and ingress.

    • Node Performance Summary
      Provides information about system performance metrics such as CPU, Memory, Disk, and Network for all nodes in the cluster.

    • Prometheus Stats

    Dashboard for monitoring Prometheus v2.x.x.

    • Storage GlusterFS Health

    Provides GlusterFS Health metrics such as Status, Storage, and Node.

    • Rook-Ceph

    Dashboard that provides statistics about Ceph instances.

    • Storage Minio Health

    Provides storage and network details about Minio server instances.

    • IBM Cloud Private MongoDB Overview
      Provides server status metrics such as Connections, Commands, and Operations.

    • IBM Cloud Private MongoDB ReplSet

    Provides replica-set metrics such as Members, Member status, Member elections, Replication lag, and Oplog activity.

    • IBM Cloud Private MongoDB WiredTiger

    Provides storage-engine metrics such as Cache activity, Blockmanager, Sessions, and Page-faults.

    Note: If you configure pods to use host-level resources such as the host network, the dashboards display the metrics of the host, not the metrics of the pod itself.

If you want to view other data, you can create new dashboards or import dashboards from JSON definition files for Grafana.

Metrics collected out of the box

IBM Cloud Private provides the following exporters to provide metrics. The exporters expose metrics endpoints as Kubernetes services.

Some IBM Cloud Private Kubernetes pods provide metrics endpoints for Prometheus:

In addition, Prometheus preconfigures scrape targets that communicate with several targets to scrape metrics:

Prometheus displays scrape targets in its user interface as links. These addresses are typically not accessible from a user's browser as they are on the Kubernetes cluster internal network. Only the Prometheus server needs to be able to access the addresses.

Role-based access control (RBAC)

RBAC for monitoring API

A user with the ClusterAdministrator, Administrator, or Operator role can access the monitoring service. A user with the ClusterAdministrator or Administrator role can perform write operations in the monitoring service, including deleting Prometheus metrics data and updating Grafana configurations.

RBAC for monitoring data

Starting with version 1.2.0, the ibm-icpmonitoring Helm chart offers a module that provides role-based access controls (RBAC) for access to the Prometheus metrics data.

The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines the requests for authorization headers, and at that point, enforces role-based controls. In general, the rules that concern RBAC are as follows:

A user with the ClusterAdministrator role can access any resource. A user with any other role can access data in only the namespaces for which that user is authorized.

If metrics data includes the kubernetes_namespace label, the data is recognized as belonging to the namespace that is the value of that label. If metrics data has no such label, it is recognized as system-level metrics. Only users with the ClusterAdministrator role can access system-level metrics.

In an IBM Multicloud Manager hub cluster environment, users can access metrics from managed clusters. A user with the ClusterAdministrator role can access data from all managed clusters. A user with any other role can access data from only the managed clusters whose related namespaces that user is authorized to access.
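The namespace rules above can be sketched as a small decision function. This is an illustration only, not the RBAC module's implementation: the function name and parameters are hypothetical, and the managed-cluster rule is not modeled.

```python
from typing import Mapping, Set

CLUSTER_ADMIN = "ClusterAdministrator"


def can_read_metric(labels: Mapping[str, str], role: str,
                    authorized_namespaces: Set[str]) -> bool:
    """Apply the namespace rules described above to one metric sample."""
    if role == CLUSTER_ADMIN:
        # ClusterAdministrator can access any resource.
        return True
    namespace = labels.get("kubernetes_namespace")
    if namespace is None:
        # No namespace label: treated as a system-level metric,
        # which only ClusterAdministrator can read.
        return False
    # Any other role is limited to its authorized namespaces.
    return namespace in authorized_namespaces
```

For example, an Operator who is authorized for the test namespace can read a sample labeled kubernetes_namespace="test", but not a sample without that label.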

RBAC for monitoring dashboards

Starting with version 1.5.0, the ibm-icpmonitoring Helm chart offers a new module that provides role-based access controls (RBAC) for access to the monitoring dashboards in Grafana.

In Grafana, users can belong to one or more organizations. Each organization contains its own settings for resources such as data sources and dashboards. For the Grafana running in IBM Cloud Private, each namespace in IBM Cloud Private has a corresponding organization with the same name. For example, if you create a new namespace that is named test in IBM Cloud Private, an organization that is named test is generated in Grafana. If you delete the test namespace, the test organization is also removed. The only exception is the kube-system namespace. The corresponding organization for kube-system is the Grafana default of Main Org.

Each Grafana organization includes a default data source that is named prometheus, which points to the Prometheus in the monitoring service. Each organization also includes the following dashboards:

All out of the box monitoring dashboards that are mentioned in Accessing the monitoring dashboard are imported to the Main Org organization.

When you log in to IBM Cloud Private, you can access a Grafana organization only if you are authorized to access the corresponding namespace. If you have access to more than one Grafana organization, use the Grafana console to switch to a different organization. The message UNAUTHORIZED appears when you do not have access to a Grafana organization.

Different IBM Cloud Private users access Grafana organizations by using different organization roles. In the corresponding namespace, if you are assigned the role of ClusterAdministrator or Administrator, you have Admin access to the Grafana organization. Otherwise, you have Viewer access to the Grafana organization.
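As a rough sketch, the namespace-to-organization and role mappings described in this section look like the following. The function names are hypothetical, and the organization name for kube-system is copied from the text above.

```python
def grafana_org_for_namespace(namespace: str) -> str:
    """Return the Grafana organization that corresponds to a namespace."""
    # kube-system is the only exception: it maps to the Grafana default
    # organization (name as written in the text above).
    if namespace == "kube-system":
        return "Main Org"
    return namespace


def grafana_role_for(icp_role: str) -> str:
    """Return the Grafana organization role for an IBM Cloud Private role."""
    if icp_role in {"ClusterAdministrator", "Administrator"}:
        return "Admin"
    return "Viewer"
```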

When you access Grafana as an IBM Cloud Private user, a user with the same name is created in Grafana. If the user in IBM Cloud Private is deleted, the corresponding user is not deleted from Grafana, and the Grafana user account becomes stale. Run the following command to request the removal of stale users:

  curl -k -s -X POST -H "Authorization:Bearer $ACCESS_TOKEN" https://<Cluster Master Host>:<Cluster Master API Port>/grafana/check_stale_users

For information about Grafana APIs, see Accessing monitoring service APIs.

Note: Monitoring service does not provide RBAC support for Prometheus and Alertmanager alerts.

Installing monitoring service in IBM Cloud Private

Monitoring service is installed by default during IBM Cloud Private installation. You can also install the monitoring service from the Catalog or the CLI.

Installing monitoring service from the Catalog

You can deploy the monitoring service with customized configurations from the Catalog in the IBM Cloud Private management console.

  1. From the Catalog page, click the ibm-icpmonitoring Helm chart to configure and install it.
  2. Provide required values for the following parameters:

    • Helm release name: monitoring
    • Target namespace: kube-system
    • Mode of deployment: Managed
    • Cluster access address: Specify the Domain Name Service (DNS) or IP address that is used to access the IBM Cloud Private console.
    • Cluster access port: Specify the port that is used to access the IBM Cloud Private console. The default port is 8443.
    • etcd address: Specify the Domain Name Service (DNS) or IP address for the etcd nodes.

Installing monitoring service from the CLI

  1. Install the Kubernetes command line (kubectl). For information about the kubectl CLI, see Accessing your cluster from the Kubernetes CLI (kubectl).

  2. Install the Helm command line interface (CLI). For information about the Helm CLI, see Installing the Helm CLI (Helm).

  3. Install the ibm-icpmonitoring Helm chart. Run the following command:

    helm install -n monitoring --namespace kube-system --set mode=managed --set clusterAddress=<IP_address> --set clusterPort=<port> ibm-icpmonitoring-1.4.0.tgz
    

<IP_address> is the DNS or IP address that is used to access the IBM Cloud Private console.

<port> is the port that is used to access the IBM Cloud Private console.

For more information about parameters that you can configure during installation, see Parameters.

Data persistence configuration

By default, user data in the monitoring service components such as Prometheus, Grafana, or Alertmanager, is not stored in persistent volumes. The user data is lost if the monitoring service component crashes. To store user data in persistent volumes, you must configure related parameters when you install the monitoring service. Use one of the following options to enable persistent volumes:

During configuration, select the checkbox for Persistent volume, and provide values for the following parameters:

In the following example, the value of Field to select the volume is component. The value of Value of the field to select the volume is prometheus:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
        name: monitoring-prometheus-pv
        labels:
            component: prometheus
    .......

For information about creating storage classes, PersistentVolume, and PersistentVolumeClaim, see Storage.

Configuring the Prometheus server

You can configure the following Prometheus server parameters during preinstallation or postinstallation:

Preinstallation configuration

When the monitoring service is installed as part of the IBM Cloud Private installation, you can configure the parameters in the config.yaml file before installation. For example, your config.yaml file might resemble the following content:

monitoring:
  prometheus:
    scrape_Interval: 1m
    evaluation_Interval: 1m
    retention: 24h
    resources:
      limits:
        memory: 4096Mi

If you choose to install the monitoring service from the Catalog, you can configure the parameters in related console fields.

Postinstallation configuration

You can also update configuration parameters after you install the monitoring service by editing the Prometheus resource, monitoring-prometheus.

kubectl edit prometheus monitoring-prometheus -n kube-system

You can update values for spec.scrapeInterval, spec.evaluationInterval, spec.retention, and spec.resources.limits.memory in the monitoring-prometheus resource.

Notes:

  1. When you update the retention or resources.limits.memory values, the active Prometheus pod is deleted, and a new Prometheus pod is started.
  2. Modifications to the Prometheus resource are lost if you redeploy the monitoring chart, for example, when you upgrade to a new version.
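As an alternative to interactive editing, the same fields can be collected into a JSON merge patch. The following sketch only builds the patch document. The helper name is hypothetical, and applying the result with kubectl patch and the --type merge option is an assumption that is not described in this document.

```python
import json
from typing import Optional


def prometheus_patch(scrape_interval: Optional[str] = None,
                     evaluation_interval: Optional[str] = None,
                     retention: Optional[str] = None,
                     memory_limit: Optional[str] = None) -> str:
    """Build a JSON merge patch for the monitoring-prometheus resource."""
    spec = {}
    if scrape_interval:
        spec["scrapeInterval"] = scrape_interval
    if evaluation_interval:
        spec["evaluationInterval"] = evaluation_interval
    if retention:
        spec["retention"] = retention
    if memory_limit:
        # Maps to spec.resources.limits.memory.
        spec["resources"] = {"limits": {"memory": memory_limit}}
    return json.dumps({"spec": spec})


print(prometheus_patch(retention="48h", memory_limit="8192Mi"))
```

Only the fields that you pass are included, so unrelated settings in the resource are left unchanged by a merge patch.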

Alerts

Default alerts

The capability to install default alerts is available in version 1.3.0 of the ibm-icpmonitoring chart. Some alerts provide customizable parameters to control the alert frequency. You can configure the following alerts during installation.

Field                                                         Default Value
prometheus.alerts.nodeMemoryUsage.nodeMemoryUsage.enabled     true
prometheus.alerts.nodeMemoryUsage.nodeMemoryUsageThreshold    85
prometheus.alerts.highCPUUsage.enabled                        true
prometheus.alerts.highCPUUsage.highCPUUsageThreshold          85
prometheus.alerts.failedJobs                                  true
prometheus.alerts.elasticsearchClusterHealth                  false
prometheus.alerts.podsTerminated                              true
prometheus.alerts.podsRestarting                              true

Managing alert rules

You can use the Kubernetes custom resource, PrometheusRule, to manage alert rules in IBM Cloud Private.

The following sample-rule.yaml file is an example of a PrometheusRule resource definition:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
    labels:
        component: icp-prometheus
    name: sample-rule
spec:
    groups:
      - name: a.rules
        rules:
          - alert: NodeMemoryUsage
            expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes))/ node_memory_MemTotal_bytes) * 100 > 5
            annotations:
              DESCRIPTION: '{{ $labels.instance }}: Memory usage is greater than the 5% threshold.  The current value is: {{ $value }}.'
              SUMMARY: '{{ $labels.instance }}: High memory usage detected'

You must provide the following parameter values:

    • apiVersion: monitoring.coreos.com/v1
    • kind: PrometheusRule
    • metadata.labels.component: icp-prometheus
    • spec: Contains the content of the alert rule. For detailed information about alert rule files, see Recording Rules.
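To see what the expression in the sample rule computes, the following sketch reproduces its arithmetic for one node. The input values are hypothetical, and the threshold is passed as a parameter rather than taken from the rule.

```python
def memory_usage_percent(total: float, free: float,
                         buffers: float, cached: float) -> float:
    """Mirror the left side of the rule expression, with all inputs in bytes."""
    return (total - (free + buffers + cached)) / total * 100


def alert_fires(usage_percent: float, threshold_percent: float) -> bool:
    """The rule fires when usage is strictly greater than the threshold."""
    return usage_percent > threshold_percent


# Hypothetical node: 16 GiB total, 4 GiB free, 1 GiB buffers, 3 GiB cached.
GIB = 1024 ** 3
usage = memory_usage_percent(16 * GIB, 4 * GIB, 1 * GIB, 3 * GIB)
print(f"{usage:.1f}% used")
```

Memory that is held in buffers and cache is counted as available, so this node is reported at half of its total memory rather than three quarters.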

Migrating from AlertRule to PrometheusRule

You can migrate your existing monitoring AlertRule to the PrometheusRule.

You must change the format of any existing AlertRule that is not defined by the monitoring component. The following differences exist in the format of the .yaml file.

The following example shows an AlertRule:

apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: AlertRule
metadata:
  name: failed-jobs
spec:
  enabled: true
  data: |-
    groups:
      - name: failedJobs
        rules:
          - alert: failedJobs
            expr: kube_job_failed != 0
            annotations:
              description: 'Job {{ $labels.exported_job }} in namespace {{ $labels.namespace }} failed for some reason.'
              summary: Failed job

After you migrate to PrometheusRule, your .yaml resembles the following example.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    component: icp-prometheus
  name: failed-jobs
spec:
  groups:
    - name: failedJobs
      rules:
        - alert: failedJobs
          expr: kube_job_failed != 0
          annotations:
            description: 'Job {{ $labels.exported_job }} in namespace {{ $labels.namespace }} failed for some reason.'
            summary: Failed job

After you change your .yaml file, run the following command to load your new PrometheusRule and activate it on Prometheus.

kubectl create -f {file}
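The difference between the two example files can be sketched as a text transformation: the content of spec.data moves directly under spec, the component label is added, and the apiVersion and kind change. The helper below is hypothetical and covers only the differences that are visible in the two examples. It treats the rule content as opaque text and does not validate it.

```python
import textwrap


def to_prometheus_rule(name: str, data: str) -> str:
    """Build PrometheusRule YAML text from an AlertRule name and its spec.data."""
    header = (
        "apiVersion: monitoring.coreos.com/v1\n"
        "kind: PrometheusRule\n"
        "metadata:\n"
        "  labels:\n"
        "    component: icp-prometheus\n"
        f"  name: {name}\n"
        "spec:\n"
    )
    # Normalize the block scalar content, then nest it one level under spec.
    body = textwrap.indent(textwrap.dedent(data).strip("\n"), "  ")
    return header + body + "\n"


alert_rule_data = """\
groups:
  - name: failedJobs
    rules:
      - alert: failedJobs
        expr: kube_job_failed != 0
"""
print(to_prometheus_rule("failed-jobs", alert_rule_data))
```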

Configuring Alertmanager

Edit the Kubernetes secret alertmanager-monitoring-prometheus-alertmanager to configure Prometheus Alertmanager to integrate external alert service receivers, such as Slack or PagerDuty, for IBM Cloud Private.

    kubectl edit secret alertmanager-monitoring-prometheus-alertmanager -n kube-system

Following is an example of the default secret configuration.

apiVersion: v1
data:
  alertmanager.yaml: CiAgZ2xvYmFsOgogIHJlY2VpdmVyczoKICAgIC0gbmFtZTogZGVmYXVsdC1yZWNlaXZlcgogIHJvdXRlOgogICAgZ3JvdXBfd2FpdDogMTBzCiAgICBncm91cF9pbnRlcnZhbDogNW0KICAgIHJlY2VpdmVyOiBkZWZhdWx0LXJlY2VpdmVyCiAgICByZXBlYXRfaW50ZXJ2YWw6IDNo
kind: Secret
metadata:
  name: alertmanager-monitoring-prometheus-alertmanager
type: Opaque

The content of alertmanager.yaml is base64 encoded. To update alertmanager.yaml, you must first decode it. Complete the following steps to decode alertmanager.yaml.
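A minimal sketch of the decode and re-encode round trip, using the encoded value from the example secret above. The helper names are hypothetical. Any base64 tool produces the same result.

```python
import base64

# Value of data["alertmanager.yaml"] copied from the example secret.
ENCODED = (
    "CiAgZ2xvYmFsOgogIHJlY2VpdmVyczoKICAgIC0gbmFtZTogZGVmYXVsdC1yZWNlaXZlcgog"
    "IHJvdXRlOgogICAgZ3JvdXBfd2FpdDogMTBzCiAgICBncm91cF9pbnRlcnZhbDogNW0KICAg"
    "IHJlY2VpdmVyOiBkZWZhdWx0LXJlY2VpdmVyCiAgICByZXBlYXRfaW50ZXJ2YWw6IDNo"
)


def decode_secret_value(value: str) -> str:
    """Decode a base64 secret value into readable text."""
    return base64.b64decode(value).decode("utf-8")


def encode_secret_value(text: str) -> str:
    """Encode edited text so that it can be stored in the secret again."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


decoded = decode_secret_value(ENCODED)
print(decoded)
```

After you edit the decoded text, pass it to encode_secret_value and store the result as the new value of alertmanager.yaml in the secret.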

Important: Secret changes are lost when you upgrade, roll back, or update the monitoring release. In addition, the secret format can change between releases.

Allow several minutes for the updates to take effect. Open the Alertmanager dashboard at https://<Cluster Master Host>:<Cluster Master API Port>/alertmanager. <Cluster Master Host>:<Cluster Master API Port> is defined in the Master endpoint documentation.

Managing Grafana dashboards

You can manage Grafana dashboards by operating on a Kubernetes custom resource MonitoringDashboard in IBM Cloud Private. The following sample-dashboard.yaml file is an example of a MonitoringDashboard resource definition.

apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: MonitoringDashboard
metadata:
  name: sample-dashboard
spec:
  enabled: true
  data: |-
    {
        "id": null,
        "uid": null,
        "title": "Marco Test Dashboard",
        "tags": [ "test" ],
        "timezone": "browser",
        "schemaVersion": 16,
        "version": 1
    }

You must provide the following parameter values:

    • apiVersion: monitoringcontroller.cloud.ibm.com/v1
    • kind: MonitoringDashboard
    • spec.data: Contains the content of the Grafana dashboard definition file. For more information about dashboard files, see Dashboard JSON.
    • spec.enabled: Set the flag to specify whether the dashboard is enabled.

You can use kubectl to manage the dashboard. Use the -n option to specify the namespace in which this MonitoringDashboard is to be created. The dashboard is imported to the corresponding organization in Grafana.
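If you keep dashboard definitions as JSON files, a small helper can wrap one in a MonitoringDashboard resource. This is a sketch under the assumption that you submit the result as a JSON manifest, which kubectl accepts in place of YAML. The function name is hypothetical.

```python
import json


def monitoring_dashboard(name: str, dashboard: dict, enabled: bool = True) -> dict:
    """Wrap a Grafana dashboard definition in a MonitoringDashboard resource."""
    return {
        "apiVersion": "monitoringcontroller.cloud.ibm.com/v1",
        "kind": "MonitoringDashboard",
        "metadata": {"name": name},
        "spec": {
            "enabled": enabled,
            # spec.data carries the dashboard definition as a string.
            "data": json.dumps(dashboard, indent=2),
        },
    }


sample = {"id": None, "uid": None, "title": "Sample Dashboard",
          "tags": ["test"], "timezone": "browser",
          "schemaVersion": 16, "version": 1}
manifest = monitoring_dashboard("sample-dashboard", sample)
print(json.dumps(manifest, indent=2))
```

Save the printed output to a file and create it with kubectl, using the -n option to choose the namespace and therefore the Grafana organization.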

Configure applications to use monitoring service

Modify the application to expose the metrics.
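As an illustration of what exposing metrics can look like, the following sketch serves one counter in the Prometheus text exposition format by using only the Python standard library. The metric name, the /metrics path, and the port are assumptions. How the monitoring service discovers the endpoint is not covered here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(request_count: int) -> str:
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests handled (hypothetical metric).\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {request_count}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    request_count = 0  # Updated by the application as it handles work.

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(MetricsHandler.request_count).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve the metrics endpoint until the process is stopped."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Call serve() from the application entry point. In practice, a Prometheus client library usually replaces the hand-written rendering.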

Logs and metrics management for Prometheus

You can modify the time period for metric retention by updating the storage.tsdb.retention parameter in the config.yaml file. By default, this value is set to 24h, which means that the metrics are kept for 24 hours and then purged. For more information, see Configuring the monitoring service.

However, if you need to manually remove this data from the system, you can use the REST API that is provided by the Prometheus component.

The target URL must have the format:

https://<IP_address>:<Port>/prometheus
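Upstream Prometheus provides a TSDB admin endpoint, api/v1/admin/tsdb/delete_series, that deletes the series that match one or more selectors. The following sketch builds such a URL on top of the target URL format above. Whether the admin API is enabled and reachable through this path in your cluster is an assumption to verify first. The helper name is hypothetical.

```python
from typing import Sequence
from urllib.parse import urlencode


def delete_series_url(base_url: str, matchers: Sequence[str]) -> str:
    """Build the delete_series URL for the given series selectors."""
    # Each selector is sent as a repeated match[] query parameter.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/api/v1/admin/tsdb/delete_series?{query}"


# Hypothetical address and selector.
print(delete_series_url("https://10.0.0.1:8443/prometheus",
                        ['{kubernetes_namespace="test"}']))
```

The endpoint expects a POST request with the same authorization header that is used for the other monitoring service API calls.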

Accessing monitoring service APIs

You can access monitoring service APIs such as Prometheus and Grafana APIs. Before you can access the APIs, you must obtain authentication tokens to specify in your request headers. For information about obtaining authentication tokens, see Preparing to run component or management API commands.

After you obtain the authentication tokens, complete the following steps to access the Prometheus and Grafana APIs.

  1. Access the Prometheus API at the URL https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/*. For example, run the following command to get the boot times of all nodes:

    curl -k -s -X GET -H "Authorization:Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/api/v1/query?query=node_boot_time_seconds"

    • $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
    • <Cluster Master Host> and <Cluster Master API Port> are defined in Master endpoints.
    

    For detailed information about Prometheus APIs, see Prometheus HTTP API.

  2. Access the Grafana API at the URL https://<Cluster Master Host>:<Cluster Master API Port>/grafana/*. For example, run the following command to obtain the sample dashboard:

    curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/grafana/api/dashboards/db/sample"

    • $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
    • <Cluster Master Host> and <Cluster Master API Port> are defined in Master endpoints.
    

    For detailed information about Grafana APIs, see Grafana HTTP API Reference.
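The same Prometheus query can be prepared from a script. The following sketch builds the request and parses an instant-query response without sending anything over the network. The function names, host, port, and token value are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_query_request(host: str, port: int, token: str,
                        promql: str) -> urllib.request.Request:
    """Build an authenticated instant-query request for the Prometheus API."""
    query = urlencode({"query": promql})
    url = f"https://{host}:{port}/prometheus/api/v1/query?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def boot_times(response_body: str) -> dict:
    """Map each instance label to its value from an instant-query response."""
    payload = json.loads(response_body)
    return {
        result["metric"].get("instance", ""): float(result["value"][1])
        for result in payload["data"]["result"]
    }


request = build_query_request("10.0.0.1", 8443, "abc", "node_boot_time_seconds")
print(request.full_url)
```

To send the request, pass it to urllib.request.urlopen with an SSL context that trusts the cluster certificate. The -k option in the curl command skips that verification.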

Support for custom cluster access URL in monitoring service

You can customize the cluster access URL. For more information, see Customizing the cluster access URL. After you complete the customization, you must manually edit the Prometheus and Alertmanager resources and verify that all external links are correct.

Prometheus resource

Use kubectl to edit the monitoring-prometheus resource. For example,

 kubectl edit prometheus monitoring-prometheus -n kube-system

In the monitoring-prometheus Prometheus resource, change externalUrl:* to the following value:

 externalUrl: https://<custom_host>:<custom_port>/prometheus

<custom_host> and <custom_port> are the customized host name and port that you defined in the custom cluster access URL.

Alertmanager resource

Use kubectl to edit the monitoring-prometheus-alertmanager resource. For example,

 kubectl edit alertmanager monitoring-prometheus-alertmanager -n kube-system

In the monitoring-prometheus-alertmanager Alertmanager resource, change externalUrl:* to the following value:

 externalUrl: https://<custom_host>:<custom_port>/alertmanager

<custom_host> and <custom_port> are the customized host name and port that you defined in the custom cluster access URL.