Monitoring service
IBM Cloud® foundational services monitoring service is built on top of the Prometheus stack. It provides a preconfigured, self-updating monitoring service for clusters and applications.
Features
Metrics visualization
Grafana is installed to query and visualize your metrics. Built-in dashboards for cluster metrics visualization are created by default. You can also create your own custom dashboards.
Multi-tenancy
Monitoring provides Kubernetes namespace-level isolation. Grafana organizations are created automatically, one per Kubernetes namespace. Users can access only the dashboards and metrics for the namespaces to which they have access in OpenShift.
Alerts
Alerts can be triggered automatically and sent to third-party applications such as Slack and PagerDuty.
Customization
Adopters and users can easily integrate with the monitoring service to query and visualize their application metrics and to create alerts.
Operators
ibm-monitoring-exporters-operator
This operator installs `kube-state-metrics` and `node-exporter` for cluster metrics collection. Install this operator only when IBM Cloud Pak foundational services Prometheus is configured as the Grafana datasource.
ibm-monitoring-prometheusext-operator
Installs Prometheus and Alertmanager. It is an extension of the community Prometheus operator. Install this operator only when IBM Cloud Pak foundational services (CS) Prometheus is configured as the Grafana datasource.
ibm-monitoring-grafana-operator
Installs Grafana.
Red Hat OpenShift Container Platform monitoring mode and IBM Cloud Pak foundational services monitoring mode
Evolution of Red Hat OpenShift Container Platform monitoring
In Red Hat® OpenShift® Container Platform version 3.4 and earlier, the platform does not provide an application metrics capability. The IBM Cloud Pak foundational services monitoring service installs a full Prometheus stack that includes exporters, Prometheus, Alertmanager, and Grafana.
OpenShift Container Platform version 4.4 introduced its application monitoring feature as a technology preview. IBM Cloud Pak foundational services version 3.5 also introduced a technology preview that provides a migration path to OpenShift Container Platform monitoring.
OpenShift Container Platform monitoring became generally available in version 4.6. IBM Cloud Pak foundational services version 3.6 offers two monitoring modes, one of which configures the OpenShift Container Platform monitoring Prometheus as the Grafana datasource.
Introduction of two monitoring service modes
IBM Cloud Pak foundational services version 3.6 includes two monitoring service modes: IBM Cloud Pak foundational services monitoring and OpenShift Container Platform monitoring.
OpenShift Container Platform monitoring means that IBM Cloud Pak foundational services monitoring installs only Grafana, and Prometheus is configured as the datasource for OpenShift Container Platform monitoring. This mode is the default.
IBM Cloud Pak foundational services (CS) monitoring means that IBM Cloud Pak foundational services installs its full Prometheus stack, which is configured as the Grafana datasource. You must configure this mode before installation.
Mode | Operators | Support for OCP 4.5 and earlier | Support for OCP 4.6 and later |
---|---|---|---|
IBM Cloud Pak foundational services monitoring | Exporter, PrometheusExt, Grafana | Yes | Yes |
OpenShift Container Platform monitoring | Grafana | No | Yes |
- Accessing the monitoring dashboard
- Role-based access control (RBAC)
- Installing monitoring service
- Configuring monitoring service
- Configuring applications to use monitoring service
- Managing Grafana dashboards
- Alerts
- Accessing monitoring service APIs
Accessing the monitoring dashboard
1. Log in to the IBM Cloud Pak foundational services console.

   Note: When you log in to the console, you have administrative access to Grafana. Do not create more users within the Grafana dashboard or modify the existing users or organizations.

2. To access the Grafana dashboard, click Menu > Monitor Health > Monitoring. Alternatively, you can open `https://<IP_address>:<port>/grafana`, where `<IP_address>` is the DNS name or IP address that is used to access the console and `<port>` is the port that is used to access the console.

   Note: If you are logged in as a Cluster Administrator, you can also access the Monitoring dashboard from the Administration Hub dashboard, which gives Cluster Administrators an overview of clusters, including key metrics for various services and components, and links to other dashboards, pages, and consoles for administering those services and components. On the Welcome widget of the Administration Hub dashboard, click the Monitoring link to open the Grafana dashboard. To open the Administration Hub, click Home in the main navigation menu. Only Cluster Administrators can access the Administration Hub dashboard.

3. (For CS monitoring mode) To access the Alertmanager dashboard, open `https://<IP_address>:<port>/alertmanager`.

4. (For CS monitoring mode) To access the Prometheus dashboard, open `https://<IP_address>:<port>/prometheus`.
The following default Grafana dashboards are created in the Grafana `main-org` organization. You must first grant the user access to the `ibm-common-services` namespace.

- Namespaces Performance IBM Provided 2.5

  Provides information about namespace performance and status metrics.

- Performance IBM Provided 2.5

  Provides TCP system performance information about `Nodes`, `Memory`, and `Containers`.

- Kubernetes Cluster Monitoring

  Monitors Kubernetes clusters that use Prometheus. Provides information about cluster `CPU`, `Memory`, and `Filesystem` usage. The dashboard also provides statistics for individual pods, containers, and systemd services.

- Kubernetes POD Overview

  Monitors pod metrics such as `CPU`, `Memory`, `Network`, pod status, and restarts.

- NGINX Ingress controller

  Provides information about NGINX Ingress controller metrics that can be sorted by namespace, controller class, controller, and ingress.

- Node Performance Summary

  Provides information about system performance metrics such as `CPU`, `Memory`, `Disk`, and `Network` for all nodes in the cluster.

- Prometheus Stats

  Dashboard for monitoring Prometheus v2.x.x.
Role-based access control (RBAC)
RBAC for monitoring API
A user with the role `ClusterAdministrator`, `Administrator`, or `Operator` can access the monitoring service. A user with the role `ClusterAdministrator` or `Administrator` can use write operations in the monitoring service, including deleting Prometheus metrics data and updating Grafana configurations.
RBAC for monitoring data
Starting with version 1.2.0, the `ibm-icpmonitoring` Helm chart includes a module that provides role-based access control (RBAC) for access to the Prometheus metrics data.
The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines the requests for authorization headers, and at that point, enforces role-based controls. The general RBAC rules are as follows.
A user with the `ClusterAdministrator` role can access any resource. A user with any other role can access data only in the namespaces for which that user is authorized.

If metrics data includes the `kubernetes_namespace` label, it is recognized as belonging to the namespace that is the value of that label. If metrics data has no such label, it is recognized as system-level metrics. Only users with the `ClusterAdministrator` role can access system-level metrics.
In an IBM Multicloud Manager hub cluster environment, users can access metrics from managed clusters. A user with the `ClusterAdministrator` role can access data from all managed clusters. A user with any other role can access data only from the managed clusters whose related namespaces that user is authorized to access.
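The rules above can be summarized as a small decision function. The following is an illustrative sketch only, not the actual proxy implementation; the function name and the input format are assumptions:

```shell
# can_access ROLE AUTHORIZED_NAMESPACES_CSV METRIC_NAMESPACE
# METRIC_NAMESPACE is the value of the kubernetes_namespace label;
# pass an empty string for metrics that carry no such label (system level).
can_access() {
  role=$1; authorized=$2; metric_ns=$3
  # A ClusterAdministrator can access any resource
  if [ "$role" = "ClusterAdministrator" ]; then echo allow; return; fi
  # No kubernetes_namespace label means system-level metrics:
  # only ClusterAdministrator may read those
  if [ -z "$metric_ns" ]; then echo deny; return; fi
  # Any other role: allow only namespaces the user is authorized for
  case ",$authorized," in
    *",$metric_ns,"*) echo allow ;;
    *) echo deny ;;
  esac
}

can_access Administrator "default,test" "test"   # prints: allow
can_access Administrator "default,test" ""       # prints: deny
can_access ClusterAdministrator "" ""            # prints: allow
```

The same two checks (role first, then namespace membership) apply to managed-cluster metrics on a hub cluster.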
RBAC for monitoring dashboards
Starting with version 1.5.0, the `ibm-icpmonitoring` Helm chart offers a module that provides role-based access control (RBAC) for access to the monitoring dashboards in Grafana.
In Grafana, users can belong to one or more organizations. Each organization contains its own settings for resources such as data sources and dashboards. For the Grafana instance running in IBM Cloud Pak for Multicloud Management, each namespace in IBM Cloud Pak for Multicloud Management has a corresponding organization with the same name. For example, if you create a namespace that is named `test` in IBM Cloud Pak for Multicloud Management, an organization that is named `test` is generated in Grafana. If you delete the `test` namespace, the `test` organization is also removed. The only exception is the `ibm-common-services` namespace. The corresponding organization for `ibm-common-services` is the Grafana default of `Main Org.`.
When you log in to IBM Cloud Pak for Multicloud Management, you can access a Grafana organization only if you are authorized to access the corresponding namespace. If you have access to more than one Grafana organization, use the Grafana console to switch to a different organization. The message `UNAUTHORIZED` appears when you do not have access to a Grafana organization.
Different users access Grafana organizations with different organization roles. If you are assigned the `ClusterAdministrator` or `Administrator` role in the corresponding namespace, you have `Admin` access to the Grafana organization. Otherwise, you have `Viewer` access to the Grafana organization.
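The role mapping described above amounts to a simple case split; the following is an illustrative sketch, and the function name is an assumption:

```shell
# Map an IBM Cloud Pak role in a namespace to the Grafana organization role.
grafana_org_role() {
  case "$1" in
    ClusterAdministrator|Administrator) echo Admin ;;
    *) echo Viewer ;;   # every other role gets read-only access
  esac
}

grafana_org_role Administrator   # prints: Admin
grafana_org_role Operator        # prints: Viewer
```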
When you access Grafana as a user of IBM Cloud Pak for Multicloud Management, a user with the same name is created in Grafana. If the user in IBM Cloud Pak for Multicloud Management is deleted, the corresponding user is not deleted from Grafana. The user account becomes stale. Run the following command to request the removal of stale users:
```shell
curl -k -s -X POST -H "Authorization:$ACCESS_TOKEN" https://<Cluster Master Host>:<Cluster Master API Port>/grafana/check_stale_users
```
For information about Grafana APIs, see Accessing monitoring service APIs.
Note: Monitoring service does not provide RBAC support for Prometheus and Alertmanager alerts.
Installing monitoring service
Prerequisites
- Common services

  The monitoring service depends on other services that are provided by IBM Cloud Pak foundational services. If IBM Cloud Pak foundational services is not installed in your OpenShift cluster, see Installing IBM Cloud Platform Common Services online to install the bootstrap operator and initial custom resource (CR) instances in the `ibm-common-services` namespace.

- Dynamic volume provisioning and storage class for CS monitoring

  Prometheus and Alertmanager, which are included in the IBM Cloud Pak foundational services monitoring service, store metrics and alerts in persistent volumes (PV). A ReadWriteOnce (RWO) mode storage class and a corresponding provisioner are required. The cluster default storage class is used by default.

- OpenShift application monitoring must be enabled and configured for OpenShift monitoring. For more information, see Configuring the monitoring stack.
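For example, an existing RWO-capable storage class can be marked as the cluster default with the standard Kubernetes `storageclass.kubernetes.io/is-default-class` annotation. In this sketch the class name and provisioner are placeholders for your environment:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rwo-default                      # placeholder: your storage class name
  annotations:
    # Makes this class the cluster default, which monitoring PVCs bind against
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: example.com/provisioner     # placeholder: your CSI provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

You can confirm which class is the default with `oc get storageclass`; the default is marked `(default)` in the output.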
Installing IBM Cloud Pak foundational services
Complete the following steps to install IBM Cloud Pak foundational services. For more information, see Installing IBM Cloud Platform Common Services online.
1. Create or edit the OperandRequest CR.

   The following example resembles a CR for OpenShift monitoring:

   ```yaml
   apiVersion: operator.ibm.com/v1alpha1
   kind: OperandRequest
   metadata:
     name: common-service
     namespace: ibm-common-services
   spec:
     requests:
       - operands:
           - name: ibm-monitoring-grafana-operator
         registry: common-service
   ```

   For IBM Cloud Pak foundational services version 3.6, you must edit the OperandConfig CR to enable OpenShift monitoring, as shown in the following example. You must enable OpenShift monitoring before you create the OperandRequest CR in Step 1.

   ```yaml
   - name: ibm-monitoring-grafana-operator
     spec:
       grafana:
         datasourceConfig:
           ## this configuration enables OCP monitoring
           type: "openshift"
       operandRequest: {}
   ```

   Example CR for CS monitoring:

   ```yaml
   apiVersion: operator.ibm.com/v1alpha1
   kind: OperandRequest
   metadata:
     name: common-service
     namespace: ibm-common-services
   spec:
     requests:
       - operands:
           - name: ibm-monitoring-exporters-operator
           - name: ibm-monitoring-prometheusext-operator
           - name: ibm-monitoring-grafana-operator
         registry: common-service
   ```
2. Run the following command to check the status of your pods:

   ```shell
   oc get po -n ibm-common-services | grep monitoring
   ```

   For OpenShift monitoring, your output might resemble the following example, which shows that all pods are `Running` and all containers are available, for example, 4/4 for the Grafana pod:

   ```
   ibm-monitoring-grafana-5b9bbdcd-495dg              4/4   Running   15   3d21h
   ibm-monitoring-grafana-operator-76bc8bbdc8-5vsns   1/1   Running   0    3d22h
   ```

   Note: Four containers run in the Grafana pod for OpenShift monitoring, and three containers run in the Grafana pod for CS monitoring.

   For CS monitoring, your output might resemble the following example:

   ```
   alertmanager-ibm-monitoring-alertmanager-0                 3/3   Running   0   6m46s
   ibm-monitoring-collectd-694dd7868-wsvss                    2/2   Running   0   6m48s
   ibm-monitoring-exporters-operator-55fd6c876d-44h67         1/1   Running   0   9m37s
   ibm-monitoring-grafana-7cbc65885f-gnsgk                    3/3   Running   4   8m55s
   ibm-monitoring-grafana-operator-c8867db64-7b4lj            1/1   Running   0   9m22s
   ibm-monitoring-kube-state-6f588b8dfd-fl447                 2/2   Running   0   6m47s
   ibm-monitoring-mcm-ctl-6647759b47-2qfv8                    1/1   Running   0   6m46s
   ibm-monitoring-nodeexporter-6qlhg                          2/2   Running   0   6m48s
   ibm-monitoring-nodeexporter-7jgsb                          2/2   Running   0   6m48s
   ibm-monitoring-nodeexporter-nw5qg                          2/2   Running   0   6m48s
   ibm-monitoring-prometheus-operator-6bbb48d8cb-wd5r5        1/1   Running   0   7m18s
   ibm-monitoring-prometheus-operator-ext-86cdbc7644-qb4ph    1/1   Running   0   9m20s
   prometheus-ibm-monitoring-prometheus-0                     4/4   Running   4   6m47s
   ```
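As a quick sanity check, the `READY` column of that output can be parsed to flag pods that are not fully up. This is a hedged helper sketch; the function name is an assumption, and in practice you would pipe in live `oc get po` output:

```shell
# Flags pods whose containers are not all ready (for example, 3/4) or whose
# status is not Running; prints ALL READY when every pod passes.
check_pods() {
  awk '{
    split($2, r, "/")
    if (r[1] != r[2] || $3 != "Running") { print "NOT READY: " $1; bad = 1 }
  } END { if (!bad) print "ALL READY" }'
}

# Sample input resembling the listings above; in practice:
#   oc get po -n ibm-common-services | grep monitoring | check_pods
printf '%s\n' \
  'ibm-monitoring-grafana-5b9bbdcd-495dg 4/4 Running 15 3d21h' \
  'prometheus-ibm-monitoring-prometheus-0 3/4 Running 4 6m47s' \
| check_pods   # prints: NOT READY: prometheus-ibm-monitoring-prometheus-0
```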
Configuring monitoring service
You can configure the monitoring service by editing the Operand Deployment Lifecycle Manager (ODLM) OperandConfig CR. Following is an example of a default CR:
```yaml
apiVersion: operator.ibm.com/v1alpha1
kind: OperandConfig
metadata:
  name: common-service
  namespace: ibm-common-services
spec:
  services:
    - name: ibm-monitoring-exporters-operator
      spec:
        exporter: {}
    - name: ibm-monitoring-prometheusext-operator
      spec:
        prometheusExt: {}
    - name: ibm-monitoring-grafana-operator
      spec:
        grafana: {}
```
You can update the configuration parameters. For more information, see Configuring IBM Cloud Platform Common Services by using the CommonService custom resource.
Configuring applications to use the monitoring service
You can configure your applications in any namespace to expose metrics to the monitoring service.
1. Create a `Service` object and add the specified annotations to it. This step is required for CS monitoring.

   - `prometheus.io/scrape: 'true'`

     Required.

   - `prometheus.io/scheme: 'https'`

     Optional. Use this annotation when TLS is enabled for your metrics endpoint. Prometheus is configured to skip certificate verification, so you can use any certificate to secure your endpoint. For example, you can use the OpenShift annotation `service.beta.openshift.io/serving-cert-secret-name` or the IBM Certificate Manager service.

   - `prometheus.io/path`

     Optional. Use this annotation when your metrics endpoint is not the default `/metrics`.

   - `prometheus.io/port`

     Optional. Use this annotation to specify the port for metrics.
   The following example illustrates annotations for metrics. It also illustrates how to create certificates by using the OpenShift `service.beta.openshift.io/serving-cert-secret-name` annotation.
   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: prometheus-metrics-server-demo
     namespace: default
     labels:
       name: prometheus-metrics-server-demo
     annotations:
       ## Generate the certificate secret that the metrics pod uses. Works only on OpenShift.
       service.beta.openshift.io/serving-cert-secret-name: prometheus-metrics-server-demo
       ## Enable CS monitoring metrics scrape.
       prometheus.io/scrape: "true"
       ## Scrape port 8443 over https; comment out this annotation to scrape port 8080 over http.
       prometheus.io/scheme: "https"
   spec:
     ports:
       - name: https
         port: 8443
         protocol: TCP
         targetPort: 8443
       - name: http
         port: 8080
         protocol: TCP
         targetPort: 8080
     selector:
       name: prometheus-metrics-server-demo
     type: ClusterIP
   ```
   You can choose to add the annotations to `Pod` objects instead of `Service` objects. However, `Service` objects are recommended because they support TLS.
2. Create `ServiceMonitor` or `PodMonitor` CRs in the same namespace as your `Service` object. This step is required for OpenShift monitoring. For more information, see the Prometheus Operator documentation.

   Following is an example of a `ServiceMonitor` CR:

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: prometheus-metrics-server-demo
     namespace: default
   spec:
     selector:
       matchLabels:
         name: prometheus-metrics-server-demo
     endpoints:
       - scheme: https
         port: https
         tlsConfig:
           insecureSkipVerify: true
   ```

   Following is an example of a `PodMonitor` CR. `PodMonitor` objects are not recommended because they do not support TLS:

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PodMonitor
   metadata:
     name: prometheus-metrics-server-demo
     namespace: default
   spec:
     selector:
       matchLabels:
         name: prometheus-metrics-server-demo
     podMetricsEndpoints:
       - scheme: http
         targetPort: 9157
   ```
Managing Grafana dashboards
You can create custom Grafana dashboards by creating `MonitoringDashboard` CRs. CRs can be created in any namespace, and each dashboard appears in the corresponding Grafana organization.
Notes:
- You must switch to the Grafana organization before you browse the dashboard.
- Dashboards that are created directly in Grafana are lost when you restart pods.
1. Create a dashboard in Grafana, and then generate a JSON string for the dashboard. From the dashboard, click Dashboard Settings > JSON Model. For more information about dashboard files, see Dashboard JSON.
2. Create the `MonitoringDashboard` CR in the following format:

   ```yaml
   apiVersion: monitoringcontroller.cloud.ibm.com/v1
   kind: MonitoringDashboard
   metadata:
     name: sample-dashboard
   spec:
     enabled: true
     data: |-
       {
         ...
       }
   ```
3. Copy the generated JSON string and use it as the value of the `spec.data` field in the `MonitoringDashboard` CR from Step 2.

   Note: Remove the `id` and `uid` fields from the top-level object.
   Following is an example of the `MonitoringDashboard` CR:

   ```yaml
   apiVersion: monitoringcontroller.cloud.ibm.com/v1
   kind: MonitoringDashboard
   metadata:
     name: dashboard-demo
     namespace: default
   spec:
     enabled: true
     data: |-
       {
         "annotations": {
           "list": [
             {
               "builtIn": 1,
               "datasource": "-- Grafana --",
               "enable": true,
               "hide": true,
               "iconColor": "rgba(0, 211, 255, 1)",
               "name": "Annotations & Alerts",
               "type": "dashboard"
             }
           ]
         },
         "editable": true,
         "gnetId": null,
         "graphTooltip": 0,
         "links": [],
         "panels": [
           {
             "cacheTimeout": null,
             "colorBackground": false,
             "colorValue": false,
             "colors": [
               "#299c46",
               "rgba(237, 129, 40, 0.89)",
               "#d44a3a"
             ],
             "datasource": "prometheus",
             "format": "none",
             "gauge": {
               "maxValue": 100,
               "minValue": 0,
               "show": false,
               "thresholdLabels": false,
               "thresholdMarkers": true
             },
             "gridPos": {
               "h": 9,
               "w": 12,
               "x": 0,
               "y": 0
             },
             "id": 2,
             "interval": null,
             "links": [],
             "mappingType": 1,
             "mappingTypes": [
               {
                 "name": "value to text",
                 "value": 1
               },
               {
                 "name": "range to text",
                 "value": 2
               }
             ],
             "maxDataPoints": 100,
             "nullPointMode": "connected",
             "nullText": null,
             "options": {},
             "postfix": "",
             "postfixFontSize": "50%",
             "prefix": "",
             "prefixFontSize": "50%",
             "rangeMaps": [
               {
                 "from": "null",
                 "text": "N/A",
                 "to": "null"
               }
             ],
             "sparkline": {
               "fillColor": "rgba(31, 118, 189, 0.18)",
               "full": false,
               "lineColor": "rgb(31, 120, 193)",
               "show": false,
               "ymax": null,
               "ymin": null
             },
             "tableColumn": "",
             "targets": [
               {
                 "expr": "sum(kube_pod_info{namespace=~\"ibm-common-services\"})",
                 "refId": "A"
               }
             ],
             "thresholds": "",
             "timeFrom": null,
             "timeShift": null,
             "title": "Demo Panel",
             "type": "singlestat",
             "valueFontSize": "80%",
             "valueMaps": [
               {
                 "op": "=",
                 "text": "N/A",
                 "value": "null"
               }
             ],
             "valueName": "avg"
           }
         ],
         "schemaVersion": 21,
         "style": "dark",
         "tags": [],
         "templating": {
           "list": []
         },
         "time": {
           "from": "now-6h",
           "to": "now"
         },
         "timepicker": {
           "refresh_intervals": [
             "5s",
             "10s",
             "30s",
             "1m",
             "5m",
             "15m",
             "30m",
             "1h",
             "2h",
             "1d"
           ]
         },
         "timezone": "",
         "title": "Demo Dashboard",
         "version": 0
       }
   ```
4. Save the YAML as a file and run the command `oc apply -f <file location>`.
5. Log in to Grafana and switch to the `ibm-common-services` organization to check the new dashboard.
6. To delete the dashboard, run the command `oc delete monitoringdashboards/dashboard-demo -n default`.
Alerts
Default alerts created by IBM Cloud Pak foundational services monitoring
The capability to install default alerts is available in version 1.3.0 of the `ibm-icpmonitoring` chart. Some alerts provide customizable parameters to control the alert frequency. You can configure the following alerts during installation.
- Node memory usage

  Default alert that triggers when node memory usage exceeds the 85% threshold. The alert is installed by default and the threshold is configurable. If you use the CLI, the following values control this alert:

  Field | Default Value |
  ---|---|
  `prometheus.alerts.nodeMemoryUsage.nodeMemoryUsage.enabled` | True |
  `prometheus.alerts.nodeMemoryUsage.nodeMemoryUsageThreshold` | 85 |

- High CPU usage

  Default alert that triggers when CPU usage exceeds the 85% threshold. The alert is installed by default and the threshold is configurable. If you use the CLI, the following values control this alert:

  Field | Default Value |
  ---|---|
  `prometheus.alerts.highCPUUsage.enabled` | True |
  `prometheus.alerts.highCPUUsage.highCPUUsageThreshold` | 85 |

- Failed jobs

  Default alert that triggers if a job does not complete successfully. The alert is installed by default. If you use the CLI, the following value controls this alert:

  Field | Default Value |
  ---|---|
  `prometheus.alerts.failedJobs` | True |

- Pods terminated

  Default alert that triggers if a pod is terminated and does not complete successfully. The alert is installed by default. If you use the CLI, the following value controls this alert:

  Field | Default Value |
  ---|---|
  `prometheus.alerts.podsTerminated` | True |

- Pods restarting

  Default alert that triggers if a pod restarts more than five times in 10 minutes. The alert is installed by default. If you use the CLI, the following value controls this alert:

  Field | Default Value |
  ---|---|
  `prometheus.alerts.podsRestarting` | True |
Managing alert rules
You can use the Kubernetes custom resource `PrometheusRule` to manage alert rules in IBM Cloud Pak for Multicloud Management.
The following `sample-rule.yaml` file is an example of a `PrometheusRule` resource definition:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    component: icp-prometheus
  name: sample-rule
spec:
  groups:
    - name: a.rules
      rules:
        - alert: NodeMemoryUsage
          expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes) * 100 > 5
          annotations:
            DESCRIPTION: '{{ $labels.instance }}: Memory usage is greater than the 5% threshold. The current value is: {{ $value }}.'
            SUMMARY: '{{ $labels.instance }}: High memory usage detected'
```
You must provide the following parameter values:

- `apiVersion`: `monitoring.coreos.com/v1`
- `kind`: `PrometheusRule`
- `metadata.labels.component`: `icp-prometheus`
- `spec`: Contains the content of the alert rule. For more information, see Recording Rules.
Accessing monitoring service APIs
You can access monitoring service APIs such as Prometheus and Grafana APIs. Before you can access the APIs, you must obtain authentication tokens to specify in your request headers. For information about obtaining authentication tokens, see Preparing to run component or management API commands.
After you obtain the authentication tokens, complete the following steps to access the Prometheus and Grafana APIs.
1. (For CS monitoring mode only) Access the Prometheus API at the URL `https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/*`. For example, get the boot times of all nodes. `$ACCESS_TOKEN` is the variable that stores the authentication token for your cluster. `<Cluster Master Host>` and `<Cluster Master API Port>` are defined in Master endpoints.

   ```shell
   curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/api/v1/query?query=node_boot_time_seconds"
   ```

   For more information, see Prometheus HTTP API.
2. Access the Grafana API at the URL `https://<Cluster Master Host>:<Cluster Master API Port>/grafana/*`. For example, obtain the `sample` dashboard. `$ACCESS_TOKEN` is the variable that stores the authentication token for your cluster. `<Cluster Master Host>` and `<Cluster Master API Port>` are defined in Master endpoints.

   ```shell
   curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/grafana/api/dashboards/db/sample"
   ```

   For more information, see Grafana HTTP API Reference.