IBM Cloud Pak foundational services Monitoring service

IBM Cloud Pak® foundational services monitoring service is built on top of the Prometheus stack. It provides a preconfigured, self-updating monitoring service for clusters and applications.

Features

Metrics visualization

Grafana is installed to query and visualize your metrics. Built-in dashboards for cluster metrics visualization are created by default. You can also create custom dashboards.

Multi-tenancy

Monitoring provides Kubernetes namespace-level isolation. Grafana organizations are created automatically, one per Kubernetes namespace. Users can access only the dashboards and metrics for the namespaces to which they have access in Red Hat OpenShift Container Platform.

Alerts

Alerts can be triggered automatically and sent to third-party applications such as Slack and PagerDuty.
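Alerts are delivered through Alertmanager receivers. The following is a minimal sketch of an Alertmanager configuration that routes all alerts to Slack; the webhook URL and channel name are placeholders:

```yaml
# Sketch only: the webhook URL and channel name are placeholders.
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/T000/B000/XXXX
        channel: '#alerts'
        send_resolved: true
```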

Customization

Adopters and users can easily integrate with the monitoring service to query and visualize their application metrics, and to create alerts.

Operators

ibm-monitoring-exporters-operator

This operator installs kube-state-metrics and node-exporter for cluster metrics collection. Install this operator only when the IBM Cloud Pak foundational services Prometheus is configured as the Grafana datasource. This operator is removed from IBM Cloud Pak foundational services version 3.8.x onwards.

ibm-monitoring-prometheusext-operator

This operator installs Prometheus and Alertmanager. It is an extension of the community Prometheus operator. Install this operator only when the IBM Cloud Pak foundational services (CS) Prometheus is configured as the Grafana datasource. This operator is removed from IBM Cloud Pak foundational services version 3.8.x onwards.

ibm-monitoring-grafana-operator

Installs Grafana.

Red Hat OpenShift Container Platform monitoring mode and IBM Cloud Pak foundational services monitoring mode

Evolution of Red Hat OpenShift Container Platform monitoring

In Red Hat® OpenShift® Container Platform version 3.4 and earlier, OpenShift Container Platform does not provide an application metrics capability. The IBM Cloud Pak foundational services monitoring service installs a full Prometheus stack that includes the exporters, Prometheus, Alertmanager, and Grafana.

OpenShift Container Platform version 4.4 introduced its monitoring feature as a technology preview. IBM Cloud Pak foundational services version 3.5 also introduced a technology preview that provides a migration path to implement OpenShift Container Platform monitoring.

OpenShift Container Platform monitoring became generally available in version 4.6. IBM Cloud Pak foundational services version 3.6 offers two monitoring modes, one of which configures the Prometheus instance of OpenShift Container Platform monitoring as the Grafana datasource.

Introduction of two monitoring service modes

IBM Cloud Pak foundational services version 3.6 includes two monitoring service modes, IBM Cloud Pak foundational services monitoring, and OpenShift Container Platform monitoring.

OpenShift Container Platform monitoring means that IBM Cloud Pak foundational services monitoring installs only Grafana, and the Prometheus instance of OpenShift Container Platform monitoring is configured as the Grafana datasource. You must configure this mode before installation.

IBM Cloud Pak foundational services (CS) monitoring means that IBM Cloud Pak foundational services installs its full Prometheus stack, which is configured as the Grafana datasource. This mode is the default.

| Mode | Operators | Support for OCP 4.5 and earlier | Support for OCP 4.6 and later |
| --- | --- | --- | --- |
| IBM Cloud Pak foundational services monitoring | Exporter, PrometheusExt, Grafana | Yes | Yes |
| OpenShift Container Platform monitoring | Grafana | No | Yes |

Note: IBM Cloud Pak foundational services monitoring mode is removed from IBM Cloud Pak foundational services version 3.8.x onwards.

Accessing the monitoring dashboard

  1. Log in to the IBM Cloud Pak foundational services console.

    Note: When you log in to the console, you have administrative access to Grafana. Do not create more users within the Grafana dashboard or modify the existing users or organizations.

  2. To access the Grafana dashboard, click Menu > Monitor Health > Monitoring.

    Alternatively, you can open https://<IP_address>:<port>/grafana, where <IP_address> is the DNS or IP address that is used to access the console. <port> is the port that is used to access the console.

    Note: If you are logged in as a Cluster Administrator, you can also access the Monitoring dashboard from the Administration Hub dashboard, which you open by clicking Home in the main navigation menu. The Administration Hub gives Cluster Administrators an overview of clusters, including key metrics for various services and components, and provides links to the dashboards, pages, and consoles for administering those services and components. From the Administration Hub, click the Monitoring link on the Welcome widget to open the Grafana dashboard. Only Cluster Administrators can access the Administration Hub dashboard.

  3. (For CS monitoring mode) To access the Alertmanager dashboard, open https://<IP_address>:<port>/alertmanager.

  4. (For CS monitoring mode) To access the Prometheus dashboard, open https://<IP_address>:<port>/prometheus.
  5. The following default Grafana dashboards are created in the Grafana Main Org. To see them, you must first grant the user access to the ibm-common-services namespace.

Role-based access control (RBAC)

RBAC for monitoring API

A user with the role ClusterAdministrator, Administrator, or Operator can access the monitoring service. A user with the role ClusterAdministrator or Administrator can use write operations in the monitoring service, including deleting Prometheus metrics data and updating Grafana configurations.

RBAC for monitoring data

Starting with version 1.2.0, the ibm-icpmonitoring Helm chart offers a module that provides role-based access controls (RBAC) for access to the Prometheus metrics data.

The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines requests for authorization headers and enforces role-based controls. The general RBAC rules are as follows.

A user with the ClusterAdministrator role can access any resource. A user with any other role can access data in only the namespaces for which that user is authorized.

If metrics data includes the label kubernetes_namespace, then it is recognized as belonging to the namespace that is the value of that label. If metrics data has no such label, then it is recognized as system-level metrics. Only users with the role ClusterAdministrator can access system-level metrics.
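For example, given the following two scraped samples (the metric names are illustrative), the first is namespace-scoped while the second has no kubernetes_namespace label and is therefore system level:

```
# Namespace-scoped: visible to users authorized for the "test" namespace
my_app_requests_total{kubernetes_namespace="test"} 42

# No kubernetes_namespace label: system level, ClusterAdministrator only
node_load1 0.5
```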

In an IBM Multicloud Manager hub cluster environment, users can access metrics from managed clusters. A user with the role ClusterAdministrator can access data from all managed clusters. A user with any other role can access data from only those managed clusters whose related namespaces the user is authorized to access.

RBAC for monitoring dashboards

Starting with version 1.5.0, the ibm-icpmonitoring Helm chart offers a new module that provides role-based access controls (RBAC) for access to the monitoring dashboards in Grafana.

In Grafana, users can belong to one or more organizations. Each organization contains its own settings for resources such as data sources and dashboards. For the Grafana running in your product, each namespace in your product has a corresponding organization with the same name. For example, if you create a new namespace that is named test in your product, an organization that is named test is generated in Grafana. If you delete the test namespace, the test organization is also removed. The only exception is the ibm-common-services namespace. The corresponding organization for ibm-common-services is the Grafana default of Main Org.

When you log in to your product, you can access a Grafana organization only if you are authorized to access the corresponding namespace. If you have access to more than one Grafana organization, use the Grafana console to switch to a different organization. The message UNAUTHORIZED appears when you do not have access to a Grafana organization.

Different users access Grafana organizations by using different organization roles. In the corresponding namespace, if you are assigned the role of ClusterAdministrator or Administrator, you have Admin access to the Grafana organization. Otherwise, you have Viewer access to the Grafana organization.

When you access Grafana as a user of your product, a user with the same name is created in Grafana. If the user in your product is deleted, the corresponding user is not deleted from Grafana. The user account becomes stale. Run the following command to request the removal of stale users:

  curl -k -s -X POST -H "Authorization:$ACCESS_TOKEN" https://<Cluster Master Host>:<Cluster Master API Port>/grafana/check_stale_users

For information about Grafana APIs, see Accessing monitoring service APIs.

Note: The monitoring service does not provide RBAC support for Prometheus and Alertmanager alerts.

Installing monitoring service

Prerequisites

Installing IBM Cloud Pak foundational services

Complete the following steps to install IBM Cloud Pak foundational services. For more information, see Installing IBM Cloud Pak foundational services online.

  1. Create or edit the OperandRequest CR.

    The following example shows a CR for Red Hat OpenShift Container Platform monitoring.

    apiVersion: operator.ibm.com/v1alpha1
    kind: OperandRequest
    metadata:
      name: common-service
      namespace: ibm-common-services
    spec:
      requests:
        - operands:
            - name: ibm-monitoring-grafana-operator
          registry: common-service
    

    For IBM Cloud Pak foundational services version 3.6, you must edit the OperandConfig CR to enable Red Hat OpenShift Container Platform monitoring, as shown in the following example. You must enable Red Hat OpenShift Container Platform monitoring before you create the OperandRequest CR in Step 1.

       - name: ibm-monitoring-grafana-operator
         spec:
           grafana:
             datasourceConfig:       ### this is the configuration
               type: "openshift"     ### to enable OCP monitoring
           operandRequest: {}
    

    Example CR for CS monitoring:

    apiVersion: operator.ibm.com/v1alpha1
    kind: OperandRequest
    metadata:
      name: common-service
      namespace: ibm-common-services
    spec:
      requests:
        - operands:
            - name: ibm-monitoring-exporters-operator
            - name: ibm-monitoring-prometheusext-operator
            - name: ibm-monitoring-grafana-operator
          registry: common-service
    
  2. Run the following command to check the status of your pods.

    oc get po -n ibm-common-services | grep monitoring
    

    For Red Hat OpenShift Container Platform monitoring:

    Your output might resemble the following example, which shows that all pods are Running and all containers are available; for example, 4/4 for the Grafana pod.

       ibm-monitoring-grafana-5b9bbdcd-495dg              4/4     Running     15         3d21h
       ibm-monitoring-grafana-operator-76bc8bbdc8-5vsns   1/1     Running     0          3d22h
    
    **Note:** Four containers are running in the Grafana pod for Red Hat OpenShift Container Platform monitoring, and three containers are running in the Grafana pod for CS monitoring. 
    

    For CS monitoring:

       alertmanager-ibm-monitoring-alertmanager-0                3/3       Running                      0          6m46s
       ibm-monitoring-collectd-694dd7868-wsvss                   2/2       Running                      0          6m48s
       ibm-monitoring-exporters-operator-55fd6c876d-44h67        1/1       Running                      0          9m37s
       ibm-monitoring-grafana-7cbc65885f-gnsgk                   3/3       Running                      4          8m55s
       ibm-monitoring-grafana-operator-c8867db64-7b4lj           1/1       Running                      0          9m22s
       ibm-monitoring-kube-state-6f588b8dfd-fl447                2/2       Running                      0          6m47s
       ibm-monitoring-mcm-ctl-6647759b47-2qfv8                   1/1       Running                      0          6m46s
       ibm-monitoring-nodeexporter-6qlhg                         2/2       Running                      0          6m48s
       ibm-monitoring-nodeexporter-7jgsb                         2/2       Running                      0          6m48s
       ibm-monitoring-nodeexporter-nw5qg                         2/2       Running                      0          6m48s
       ibm-monitoring-prometheus-operator-6bbb48d8cb-wd5r5       1/1       Running                      0          7m18s
       ibm-monitoring-prometheus-operator-ext-86cdbc7644-qb4ph   1/1       Running                      0          9m20s
       prometheus-ibm-monitoring-prometheus-0                    4/4       Running                      4          6m47s
    

Configuring monitoring service

You can configure the monitoring service by editing the Operand Deployment Lifecycle Manager (ODLM) OperandConfig CR. Following is an example of a default CR.

apiVersion: operator.ibm.com/v1alpha1
kind: OperandConfig
metadata:
  name: common-service
  namespace: ibm-common-services
spec:
  services:
    - name: ibm-monitoring-exporters-operator
      spec:
        exporter: {}
    - name: ibm-monitoring-prometheusext-operator
      spec:
        prometheusExt: {}
    - name: ibm-monitoring-grafana-operator
      spec:
        grafana: {}

You can update the configuration parameters. For more information, see Configuring IBM Cloud Pak foundational services by using the CommonService custom resource.
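Parameters are supplied by replacing the empty {} objects in the service specs. The following sketch switches Grafana to the OpenShift datasource (shown earlier) and sets an assumed Prometheus retention parameter; treat the retention name as an illustration and check the documented parameters for your operator version:

```yaml
spec:
  services:
    - name: ibm-monitoring-prometheusext-operator
      spec:
        prometheusExt:
          retention: 14d          # assumed parameter name, for illustration
    - name: ibm-monitoring-grafana-operator
      spec:
        grafana:
          datasourceConfig:
            type: "openshift"     # switches to OCP monitoring mode
```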

Configuring applications to use the monitoring service

You can configure your applications in any namespace to expose metrics to the monitoring service.

  1. Create a Service object and add the specified annotations to it. This step is required for CS monitoring.

The following example illustrates annotations for metrics. It also illustrates how to create certificates by using the Red Hat OpenShift Container Platform service.beta.openshift.io/serving-cert-secret-name annotation.

  apiVersion: v1
  kind: Service
  metadata:
    name: prometheus-metrics-server-demo
    namespace: default
    labels:
      name: prometheus-metrics-server-demo
    annotations:
      ## Generate certificate secret which is used by metrics pod. Only works on OpenShift
      service.beta.openshift.io/serving-cert-secret-name: prometheus-metrics-server-demo
      ## enable cs monitoring metrics scrape
      prometheus.io/scrape: "true"
      ## use the https port 8443.
      ## remove this annotation to scrape the http port 8080 instead
      prometheus.io/scheme: "https"
  spec:
    ports:
    - name: https
      port: 8443
      protocol: TCP
      targetPort: 8443
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    selector:
      name: prometheus-metrics-server-demo
    type: ClusterIP

You can choose to add the annotations to Pod objects instead of Service objects. However, Service objects are recommended because they support TLS.
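If you do annotate Pod objects instead, the annotations go on the pod template. A minimal sketch of a Deployment follows; the image and names are placeholders, and note that this path scrapes over plain HTTP:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-metrics-server-demo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      name: prometheus-metrics-server-demo
  template:
    metadata:
      labels:
        name: prometheus-metrics-server-demo
      annotations:
        ## enable cs monitoring metrics scrape on the pod itself
        prometheus.io/scrape: "true"
        ## pod-level scraping has no TLS; expose the http port only
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: metrics-server
          image: example.com/metrics-server-demo:latest   # placeholder
          ports:
            - containerPort: 8080
```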

  2. Create ServiceMonitor or PodMonitor CRs in the same namespace as your Service object. This step is required for Red Hat OpenShift Container Platform monitoring. For more information, see the Prometheus Operator documentation.

    Following is an example of a ServiceMonitor CR.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: prometheus-metrics-server-demo
      namespace: default
    spec:
      selector:
        matchLabels:
          name: prometheus-metrics-server-demo
      endpoints:
      - scheme: https
        port: https
        tlsConfig:
          insecureSkipVerify: true
    

    Following is an example of a PodMonitor CR. PodMonitor is not recommended because it does not support TLS.

      apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        name: prometheus-metrics-server-demo
        namespace: default
      spec:
        selector:
          matchLabels:
           name: prometheus-metrics-server-demo
        podMetricsEndpoints:
        - scheme: http
          targetPort: 9157
    

Managing Grafana dashboards

You can create custom Grafana dashboards by creating MonitoringDashboard CRs. The CRs can be created in any namespace, and each dashboard appears in the Grafana organization that corresponds to that namespace.

To create a custom dashboard, complete the following steps:

  1. Create a dashboard on Grafana, and then generate a JSON string for the dashboard. From the dashboard, click Dashboard Settings > JSON Model. For more information about dashboard files, see Dashboard JSON.

  2. Create the MonitoringDashboard CR in following format:

    apiVersion: monitoringcontroller.cloud.ibm.com/v1
    kind: MonitoringDashboard
    metadata:
      name: sample-dashboard
    spec:
      enabled: true
      data: |-
         {
            ...
         }
    
  3. Copy the generated JSON string and use it as the value in the spec.data field of the MonitoringDashboard CR from Step 2.

Note: Remove the id and uid fields from the top-level object.

Following is an example of the MonitoringDashboard CR.

   apiVersion: monitoringcontroller.cloud.ibm.com/v1
   kind: MonitoringDashboard
   metadata:
     name: dashboard-demo
     namespace: default
   spec:
     enabled: true
     data: |-
      {
        "annotations": {
          "list": [
            {
              "builtIn": 1,
              "datasource": "-- Grafana --",
              "enable": true,
              "hide": true,
              "iconColor": "rgba(0, 211, 255, 1)",
              "name": "Annotations & Alerts",
              "type": "dashboard"
            }
          ]
        },
        "editable": true,
        "gnetId": null,
        "graphTooltip": 0,
        "links": [],
        "panels": [
          {
            "cacheTimeout": null,
            "colorBackground": false,
            "colorValue": false,
            "colors": [
              "#299c46",
              "rgba(237, 129, 40, 0.89)",
              "#d44a3a"
            ],
            "datasource": "prometheus",
            "format": "none",
            "gauge": {
              "maxValue": 100,
              "minValue": 0,
              "show": false,
              "thresholdLabels": false,
              "thresholdMarkers": true
            },
            "gridPos": {
              "h": 9,
              "w": 12,
              "x": 0,
              "y": 0
            },
            "id": 2,
            "interval": null,
            "links": [],
            "mappingType": 1,
            "mappingTypes": [
              {
                "name": "value to text",
                "value": 1
              },
              {
                "name": "range to text",
                "value": 2
              }
            ],
            "maxDataPoints": 100,
            "nullPointMode": "connected",
            "nullText": null,
            "options": {},
            "postfix": "",
            "postfixFontSize": "50%",
            "prefix": "",
            "prefixFontSize": "50%",
            "rangeMaps": [
              {
                "from": "null",
                "text": "N/A",
                "to": "null"
              }
            ],
            "sparkline": {
              "fillColor": "rgba(31, 118, 189, 0.18)",
              "full": false,
              "lineColor": "rgb(31, 120, 193)",
              "show": false,
              "ymax": null,
              "ymin": null
            },
            "tableColumn": "",
            "targets": [
              {
                "expr": "sum(kube_pod_info{namespace=~\"ibm-common-services\"})",
                "refId": "A"
              }
            ],
            "thresholds": "",
            "timeFrom": null,
            "timeShift": null,
            "title": "Demo Panel",
            "type": "singlestat",
            "valueFontSize": "80%",
            "valueMaps": [
              {
                "op": "=",
                "text": "N/A",
                "value": "null"
              }
            ],
            "valueName": "avg"
          }
        ],
        "schemaVersion": 21,
        "style": "dark",
        "tags": [],
        "templating": {
          "list": []
        },
        "time": {
          "from": "now-6h",
          "to": "now"
        },
        "timepicker": {
          "refresh_intervals": [
            "5s",
            "10s",
            "30s",
            "1m",
            "5m",
            "15m",
            "30m",
            "1h",
            "2h",
            "1d"
          ]
        },
        "timezone": "",
        "title": "Demo Dashboard",
        "version": 0
      }
  4. Save the YAML string as a file and run the command oc apply -f <file location>.

  5. Log in to Grafana and switch to the ibm-common-services organization to check the new dashboard.

  6. To delete the dashboard, run the command oc delete monitoringdashboards/dashboard-demo -n default.
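The wrapping in Steps 2 and 3 can be scripted. The following is a minimal sketch that indents an exported dashboard JSON file under the spec.data block scalar; the file and CR names are hypothetical, and the JSON here is a stand-in for a real exported model:

```shell
# Stand-in for a dashboard JSON exported from Grafana (id/uid already removed)
cat > dashboard.json <<'EOF'
{ "title": "Demo Dashboard" }
EOF

# Emit the MonitoringDashboard CR with the JSON nested under spec.data
{
  cat <<'EOF'
apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: MonitoringDashboard
metadata:
  name: dashboard-demo
  namespace: default
spec:
  enabled: true
  data: |-
EOF
  sed 's/^/    /' dashboard.json   # indent every line by four spaces
} > dashboard-cr.yaml

cat dashboard-cr.yaml
```

You can then apply the generated file with oc apply -f dashboard-cr.yaml.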

Alerts

Default alerts created by IBM Cloud Pak foundational services monitoring

The capability to install default alerts is available in version 1.3.0 of the ibm-icpmonitoring chart. Some alerts provide customizable parameters to control the alert frequency. You can configure the following alerts during installation.

| Field | Default value |
| --- | --- |
| prometheus.alerts.nodeMemoryUsage.nodeMemoryUsage.enabled | True |
| prometheus.alerts.nodeMemoryUsage.nodeMemoryUsageThreshold | 85 |
| prometheus.alerts.highCPUUsage.enabled | True |
| prometheus.alerts.highCPUUsage.highCPUUsageThreshold | 85 |
| prometheus.alerts.failedJobs | True |
| prometheus.alerts.podsTerminated | True |
| prometheus.alerts.podsRestarting | True |

Managing alert rules

You can use the Kubernetes custom resource, PrometheusRule, to manage alert rules in your product.

The following sample-rule.yaml file is an example of a PrometheusRule resource definition:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
    labels:
        component: icp-prometheus
    name: sample-rule
spec:
    groups:
      - name: a.rules
        rules:
          - alert: NodeMemoryUsage
            expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes))/ node_memory_MemTotal_bytes) * 100 > 5
            annotations:
              DESCRIPTION: '{{ $labels.instance }}: Memory usage is greater than the 5% threshold. The current value is: {{ $value }}.'
              SUMMARY: '{{ $labels.instance }}: High memory usage detected'

You must provide the following parameter values:

- apiVersion: Must be monitoring.coreos.com/v1.
- kind: Must be PrometheusRule.
- metadata.labels.component: Must be icp-prometheus.
- spec.groups: Contains the content of the alert rule. For more information, see Recording Rules.
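The upstream PrometheusRule schema also supports fields such as for and labels, which make rules more practical in production. The following sketch illustrates them; the alert name, expression, and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
    labels:
        component: icp-prometheus
    name: sample-rule-with-severity
spec:
    groups:
      - name: b.rules
        rules:
          - alert: HighPodRestartRate
            # Fire only after the condition has held for 10 minutes
            expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
            for: 10m
            labels:
              severity: warning
            annotations:
              SUMMARY: '{{ $labels.pod }}: container is restarting frequently'
```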

Accessing monitoring service APIs

You can access monitoring service APIs such as Prometheus and Grafana APIs. Before you can access the APIs, you must obtain authentication tokens to specify in your request headers. For information about obtaining authentication tokens, see Preparing to run component or management API commands.

After you obtain the authentication tokens, complete the following steps to access the Prometheus and Grafana APIs.

  1. (For CS monitoring mode only) Access the Prometheus API at the URL https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/*. For example, get the boot times of all nodes:

    curl -k -s -X GET -H "Authorization:Bearer $ACCESS_TOKEN" https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/api/v1/query?query=node_boot_time_seconds
    

    For more information, see Prometheus HTTP API.

  2. Access the Grafana API at the URL https://<Cluster Master Host>:<Cluster Master API Port>/grafana/*. For example, obtain the sample dashboard:

    curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/grafana/api/dashboards/db/sample"
    

    For more information, see Grafana HTTP API Reference.