Monitoring with Prometheus and Grafana
You can use Prometheus as a monitoring system and Grafana as a visualization tool to monitor your existing installation of IBM Cloud Pak for AIOps. Prometheus and Grafana can be installed and configured on a Red Hat OpenShift Container Platform cluster.
Prometheus is an open-source toolkit that can be used with Grafana to create real-time dashboards for monitoring Cloud Pak for AIOps stability and performance.
Note: The following instructions are updated to support the Grafana v5 operator, which is compatible with OpenShift version 4.11.x or higher. Previously, these instructions used the Grafana v4 operator, which is not supported on OpenShift version 4.16.x or higher. If you followed the instructions for the Grafana v4 operator, you might encounter compatibility issues if you upgrade your OpenShift version to 4.16.x or higher. To check your version of the Grafana operator, go to Operators > Installed Operators in your OpenShift console. To upgrade your Grafana operator from v4 to v5, run the cleanup script, and then follow the instructions in the Viewing metrics in Grafana section.
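You can also check the installed operator version from the command line. The following is a minimal sketch; it assumes that the operator is installed in the monitoring-grafana project, as described later in this topic:
# list the Grafana operator CSV and its version
oc get csv -n monitoring-grafana | grep grafana-operator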
Warning: The cleanup script deletes any existing dashboards in Grafana. Before you run it, copy the JSON of any dashboards that you want to keep.
Cleanup script (optional):
oc project monitoring-grafana
# delete CRs
oc delete grafana grafana
oc delete grafanadatasource prometheus
# delete subscription & CSV
CSV=$(oc get subscription monitoring-grafana -o json | jq -r '.status.installedCSV')
oc delete subscription monitoring-grafana
oc delete csv $CSV
# delete CRDs
# IMPORTANT: if you created any additional CRD instances, make sure to delete them before deleting the definition
oc delete crd grafanadashboards.integreatly.org
oc delete crd grafanadatasources.integreatly.org
oc delete crd grafanafolders.integreatly.org
oc delete crd grafananotificationchannels.integreatly.org
oc delete crd grafanas.integreatly.org
# delete the rest of v4 associated resources
oc delete clusterrole grafana-proxy
oc delete rolebinding grafana-proxy
oc delete secret grafana-k8s-proxy
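After the script completes, you can verify that the v4 resources are removed. For example, the following quick check lists any remaining integreatly.org CRDs and should return nothing:
oc get crd | grep integreatly.org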
Notes:
- The monitoring stack that is described in the following sections uses third-party components that are not owned by IBM; use them with caution.
- The monitoring stack that is described in the following sections currently works with Cloud Pak for AIOps that is installed in an OpenShift Container Platform cluster. Cloud Pak for AIOps that is installed in an air-gapped (offline) environment is not supported by the described monitoring stack.
Prerequisites
- You need network connectivity to the OpenShift Container Platform cluster where Cloud Pak for AIOps is installed.
- The OpenShift command-line interface (oc) is required. For more information, see Getting started with the OpenShift CLI.
- You need to install jq for your operating system to run the commands for JSON processing. For more information, see Download jq.
Use the following sections to set up monitoring for your Cloud Pak for AIOps environment:
Setting up user workload monitoring
OpenShift Container Platform comes with a Prometheus-based monitoring stack to track metrics about the cluster. You can extend this monitoring to user workloads by enabling user workload monitoring. User workloads include anything that is not a core OpenShift Container Platform service. For more information about user workload monitoring and advanced configuration, see User workload monitoring.
To set up user workload monitoring, you need to enable the capability.
Note: Once you run the following commands, the associated ConfigMaps (configuration settings) are applied cluster-wide. Do not run the commands on a shared cluster that includes applications other than Cloud Pak for AIOps.
Log in to your OpenShift Container Platform cluster by using the command line. Run the following commands to enable the user workload monitoring capability:
oc create -n openshift-monitoring -f - <<EOF || true
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/created-by: IBM
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
oc create -n openshift-user-workload-monitoring -f - <<EOF || true
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/created-by: IBM
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
EOF
After you run the preceding commands, you can view metrics in the OpenShift Container Platform web console.
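To confirm that user workload monitoring is enabled, you can check that the monitoring pods are running in the openshift-user-workload-monitoring project. This is a quick check; the exact pod names can vary by OpenShift version:
oc get pods -n openshift-user-workload-monitoring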
Viewing metrics in OpenShift Container Platform cluster
To view the metrics, go to Observe > Metrics in the OpenShift Container Platform web console. You can enter a valid Prometheus query and graph the result. The following are some example queries that you can use to view metrics for your Cloud Pak for AIOps environment:
- The total number of pods that are in a running state across all namespaces:
sum(kube_pod_status_phase{namespace=~".*", phase="Running"})
- The total amount of memory that is allocated across all the nodes:
sum(machine_memory_bytes{node=~"^.*$"})
- Deployment replicas across all the namespaces:
kube_deployment_status_replicas{namespace=~".*"}
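You can also scope these queries to the project where Cloud Pak for AIOps is installed. The following sketch assumes a hypothetical cp4aiops project name; substitute your own:
sum(kube_pod_status_phase{namespace="cp4aiops", phase="Running"})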
Enabling additional metrics
In some cases, additional configuration is needed to enable more metrics for your Cloud Pak for AIOps environment.
Kafka
Because a large portion of communication within Cloud Pak for AIOps takes place over Kafka, it is highly beneficial to collect metrics from Kafka.
Notes:
- Kafka metrics, such as those displayed in the usage dashboard, are only available from the moment that the metrics are enabled. Viewing time windows that precede the enabling of the metrics results in nothing being displayed.
- The Kafka custom resource (CR) must be in a ready state for metric scraping to be enabled successfully. To view the status of the Kafka CR, navigate to Home > Search, search for the Kafka resource, and click the Kafka instance. The Conditions section at the end of the page shows the Status. You can also check the status from the command line, as shown in the sketch after these notes.
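A quick command-line check of the Kafka CR readiness follows. This is a minimal sketch; it assumes the Kafka CR is named iaf-system, as used in the patch command later in this section:
# prints "True" when the Kafka CR reports a Ready condition
oc get kafka iaf-system -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'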
Use the following steps to enable Kafka metrics:
- Switch to the project (namespace) where Cloud Pak for AIOps is installed:
oc project <namespace>
Where <namespace> is the project (namespace) where Cloud Pak for AIOps is installed.
- Define the Kafka ConfigMap and PodMonitor resources to configure the Kafka Prometheus exporter:
  - Define the Kafka ConfigMap
resource:KAFKA_METRICS_CONFIGMAP=' kind: ConfigMap apiVersion: v1 metadata: name: kafka-metrics labels: app: strimzi data: kafka-metrics-config.yml: | # See https://github.com/prometheus/jmx_exporter for more info about JMX Prometheus Exporter metrics lowercaseOutputName: true rules: # Special cases and very specific rules - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value name: kafka_server_$1_$2 type: GAUGE labels: clientId: "$3" topic: "$4" partition: "$5" - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value name: kafka_server_$1_$2 type: GAUGE labels: clientId: "$3" broker: "$4:$5" - pattern: kafka.server<type=(.+), cipher=(.+), protocol=(.+), listener=(.+), networkProcessor=(.+)><>connections name: kafka_server_$1_connections_tls_info type: GAUGE labels: cipher: "$2" protocol: "$3" listener: "$4" networkProcessor: "$5" - pattern: kafka.server<type=(.+), clientSoftwareName=(.+), clientSoftwareVersion=(.+), listener=(.+), networkProcessor=(.+)><>connections name: kafka_server_$1_connections_software type: GAUGE labels: clientSoftwareName: "$2" clientSoftwareVersion: "$3" listener: "$4" networkProcessor: "$5" - pattern: "kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+):" name: kafka_server_$1_$4 type: GAUGE labels: listener: "$2" networkProcessor: "$3" - pattern: kafka.server<type=(.+), listener=(.+), networkProcessor=(.+)><>(.+) name: kafka_server_$1_$4 type: GAUGE labels: listener: "$2" networkProcessor: "$3" # Some percent metrics use MeanRate attribute # Ex) kafka.server<type=(KafkaRequestHandlerPool), name=(RequestHandlerAvgIdlePercent)><>MeanRate - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>MeanRate name: kafka_$1_$2_$3_percent type: GAUGE # Generic gauges for percents - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*><>Value name: kafka_$1_$2_$3_percent type: GAUGE - pattern: kafka.(\w+)<type=(.+), name=(.+)Percent\w*, (.+)=(.+)><>Value name: kafka_$1_$2_$3_percent type: GAUGE labels: "$4": "$5" # Generic per-second counters with 0-2 key/value pairs - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+), (.+)=(.+)><>Count name: kafka_$1_$2_$3_total type: COUNTER labels: "$4": "$5" "$6": "$7" - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count name: kafka_$1_$2_$3_total type: COUNTER labels: "$4": "$5" - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*><>Count name: kafka_$1_$2_$3_total type: COUNTER # Generic gauges with 0-2 key/value pairs - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Value name: kafka_$1_$2_$3 type: GAUGE labels: "$4": "$5" "$6": "$7" - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Value name: kafka_$1_$2_$3 type: GAUGE labels: "$4": "$5" - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value name: kafka_$1_$2_$3 type: GAUGE # Emulate Prometheus 'Summary' metrics for the exported 'Histogram's. # Note that these are missing the '_sum' metric! 
- pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Count name: kafka_$1_$2_$3_count type: COUNTER labels: "$4": "$5" "$6": "$7" - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*), (.+)=(.+)><>(\d+)thPercentile name: kafka_$1_$2_$3 type: GAUGE labels: "$4": "$5" "$6": "$7" quantile: "0.$8" - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Count name: kafka_$1_$2_$3_count type: COUNTER labels: "$4": "$5" - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*)><>(\d+)thPercentile name: kafka_$1_$2_$3 type: GAUGE labels: "$4": "$5" quantile: "0.$6" - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Count name: kafka_$1_$2_$3_count type: COUNTER - pattern: kafka.(\w+)<type=(.+), name=(.+)><>(\d+)thPercentile name: kafka_$1_$2_$3 type: GAUGE labels: quantile: "0.$4" zookeeper-metrics-config.yml: | # See https://github.com/prometheus/jmx_exporter for more info about JMX Prometheus Exporter metrics lowercaseOutputName: true rules: # replicated Zookeeper - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+)><>(\\w+)" name: "zookeeper_$2" type: GAUGE - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+)><>(\\w+)" name: "zookeeper_$3" type: GAUGE labels: replicaId: "$2" - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+)><>(Packets\\w+)" name: "zookeeper_$4" type: COUNTER labels: replicaId: "$2" memberType: "$3" - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+)><>(\\w+)" name: "zookeeper_$4" type: GAUGE labels: replicaId: "$2" memberType: "$3" - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+), name3=(\\w+)><>(\\w+)" name: "zookeeper_$4_$5" type: GAUGE labels: replicaId: "$2" memberType: "$3" '
  - Define the PodMonitor resource:
KAFKA_PODMONITOR='
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-resources-metrics
  labels:
    app: strimzi
spec:
  selector:
    matchExpressions:
      - key: "ibmevents.ibm.com/kind"
        operator: In
        values: ["Kafka", "KafkaConnect", "KafkaMirrorMaker", "KafkaMirrorMaker2"]
  namespaceSelector:
    matchNames:
      - myproject
  podMetricsEndpoints:
  - path: /metrics
    port: tcp-prometheus
    relabelings:
    - separator: ;
      regex: __meta_kubernetes_pod_label_(ibmevents_ibm_com_.+)
      replacement: $1
      action: labelmap
    - sourceLabels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      targetLabel: namespace
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      targetLabel: kubernetes_pod_name
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      separator: ;
      regex: (.*)
      targetLabel: node_name
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_host_ip]
      separator: ;
      regex: (.*)
      targetLabel: node_ip
      replacement: $1
      action: replace
'
- Create the ConfigMap and PodMonitor resources by using the apply command:
oc apply -f - <<EOF
${KAFKA_METRICS_CONFIGMAP}
---
${KAFKA_PODMONITOR}
EOF
- Enable the Prometheus JMX exporter by patching the Kafka instance with the ConfigMap that you created in the preceding step.
Note: When you run the following command, the Kafka brokers perform a rolling restart to enable the metric exporter.
oc patch kafka iaf-system --type=merge -p '{ "spec": { "kafka": { "metricsConfig": { "type": "jmxPrometheusExporter", "valueFrom": { "configMapKeyRef": { "key": "kafka-metrics-config.yml", "name": "kafka-metrics" } } } } } }'
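You can optionally confirm that the metrics configuration was applied to the Kafka CR. The following is a quick check; the jsonpath expression is illustrative:
# expected output: jmxPrometheusExporter
oc get kafka iaf-system -o jsonpath='{.spec.kafka.metricsConfig.type}'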
You can now explore the Kafka broker metrics in your cluster.
For more information about Kafka Prometheus Exporter, see Kafka Prometheus Exporter.
To view the metrics, go to Observe > Metrics in the OpenShift Container Platform web console. You can enter a valid Prometheus query and graph the result. For example, you can use the following Prometheus query to get the active count of all the Kafka topic partitions, including the replicas.
sum(kafka_server_replicamanager_partitioncount)
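Another example query charts the rate of incoming messages for each topic. This is a sketch that assumes the metric names produced by the JMX exporter rules shown earlier:
sum(rate(kafka_server_brokertopicmetrics_messagesin_total{topic!=""}[5m])) by (topic)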
Viewing metrics in Grafana
You can use Grafana to visualize metrics and logs that come from multiple sources, which include Prometheus and other monitoring tools. Grafana provides a web-based interface for creating and customizing dashboards, which can be used to display various metrics and logs from different sources. You need to install Grafana in your OpenShift Container Platform cluster and configure it to connect to Thanos. For more information about Thanos, see Thanos.
Grafana gives access to all metrics across all of the Prometheus instances (both cluster and user workload metrics). You can use the following commands to install the Grafana v5 operator, set up a Grafana instance, and connect Prometheus as a Grafana data source.
Run the following script:
oc apply -f - <<EOF apiVersion: v1 kind: Namespace metadata: name: monitoring-grafana --- apiVersion: operators.coreos.com/v1 kind: OperatorGroup metadata: name: operatorgroup namespace: monitoring-grafana spec: targetNamespaces: - monitoring-grafana --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: monitoring-grafana namespace: monitoring-grafana spec: channel: v5 installPlanApproval: Automatic name: grafana-operator source: community-operators sourceNamespace: openshift-marketplace EOF oc project monitoring-grafana # wait for the Grafana operator to finish installing installReady=false while [ "$installReady" != true ] do installPlan=`oc get subscription.operators.coreos.com monitoring-grafana -n monitoring-grafana -o json | jq -r .status.installplan.name` if [ -z "$installPlan" ] then installReady=false else installReady=`oc get installplan -n monitoring-grafana "$installPlan" -o json | jq -r '.status|.phase == "Complete"'` fi if [ "$installReady" != true ] then sleep 5 fi done installReady=false while [ "$installReady" != true ] do csv=`oc get subscription.operators.coreos.com monitoring-grafana -n monitoring-grafana -o json | jq -r .status.currentCSV` if [ -z "$csv" ] then installReady=false else installReady=`oc get csv -n monitoring-grafana "$csv" -o json | jq -r '.status.phase == "Succeeded"'` fi if [ "$installReady" != true ] then sleep 5 fi done # set up PVC for persisting dashboards (optional) # if not using PVC, remove oc apply -n monitoring-grafana -f - <<EOF apiVersion: v1 kind: PersistentVolumeClaim metadata: name: grafana-data namespace: monitoring-grafana spec: accessModes: - ReadWriteOnce volumeMode: Filesystem resources: requests: storage: 2Gi EOF # generate random secret for the proxy oc create secret generic -n monitoring-grafana grafana-k8s-proxy --from-literal=session_secret=$(openssl rand --hex 32) || true # create a serviceaccount for the grafana datasource instance oc apply -n monitoring-grafana -f - <<EOF apiVersion: v1 kind: ServiceAccount metadata: name: grafana-cluster-monitoring namespace: monitoring-grafana EOF oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-cluster-monitoring -n monitoring-grafana oc apply -n monitoring-grafana -f - <<EOF apiVersion: v1 kind: Secret metadata: name: grafana-auth-secret namespace: monitoring-grafana annotations: kubernetes.io/service-account.name: grafana-cluster-monitoring type: kubernetes.io/service-account-token EOF # create the Grafana instance along with a cluster role and role binding oc apply -n monitoring-grafana -f - <<EOF apiVersion: grafana.integreatly.org/v1beta1 kind: Grafana metadata: labels: dashboards: "grafana" name: grafana namespace: monitoring-grafana spec: client: preferIngress: false config: auth: disable_login_form: "false" disable_signout_menu: "true" auth.anonymous: enabled: "true" org_role: Admin auth.basic: enabled: "true" log: level: warn mode: console deployment: spec: template: spec: containers: - args: - -provider=openshift - -pass-basic-auth=false - -https-address=:9091 - -http-address= - -email-domain=* - -upstream=http://localhost:3000 - "\$(SAR)" - '-openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}' - -tls-cert=/etc/tls/private/tls.crt - -tls-key=/etc/tls/private/tls.key - -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token - -cookie-secret-file=/etc/proxy/secrets/session_secret - -openshift-service-account=grafana-sa - -openshift-ca=/etc/pki/tls/cert.pem - 
-openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt - -skip-auth-regex=^/metrics env: - name: SAR value: '-openshift-sar={"resource": "namespaces", "verb": "get"}' image: registry.redhat.io/openshift4/ose-oauth-proxy:v4.10 imagePullPolicy: Always name: grafana-proxy ports: - containerPort: 9091 name: grafana-proxy protocol: TCP resources: {} volumeMounts: - mountPath: /etc/tls/private name: secret-grafana-k8s-tls readOnly: false - mountPath: /etc/proxy/secrets name: secret-grafana-k8s-proxy readOnly: false volumes: - name: secret-grafana-k8s-tls secret: secretName: grafana-k8s-tls - name: secret-grafana-k8s-proxy secret: secretName: grafana-k8s-proxy - name: grafana-data persistentVolumeClaim: claimName: grafana-pvc persistentVolumeClaim: spec: accessModes: - ReadWriteOnce resources: requests: storage: 2Gi route: spec: port: targetPort: grafana-proxy tls: termination: reencrypt to: kind: Service name: grafana-service weight: 100 wildcardPolicy: None service: metadata: annotations: service.alpha.openshift.io/serving-cert-secret-name: grafana-k8s-tls spec: ports: - name: grafana-proxy port: 9091 protocol: TCP targetPort: grafana-proxy serviceAccount: metadata: annotations: serviceaccounts.openshift.io/oauth-redirectreference.primary: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"grafana-route"}}' --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: grafana-proxy namespace: monitoring-grafana rules: - apiGroups: - authentication.k8s.io resources: - tokenreviews verbs: - create - apiGroups: - authorization.k8s.io resources: - subjectaccessreviews verbs: - create --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: grafana-proxy namespace: monitoring-grafana roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: grafana-proxy subjects: - kind: ServiceAccount name: grafana-sa EOF # Create the grafanadatasource instance oc apply -f - <<EOF apiVersion: grafana.integreatly.org/v1beta1 kind: GrafanaDatasource metadata: name: prometheus namespace: monitoring-grafana spec: datasource: access: proxy isDefault: true jsonData: httpHeaderName1: 'Authorization' timeInterval: 5s tlsSkipVerify: true name: prometheus secureJsonData: httpHeaderValue1: 'Bearer \${token}' type: prometheus url: 'https://thanos-querier.openshift-monitoring.svc.cluster.local:9091' instanceSelector: matchLabels: dashboards: grafana valuesFrom: - targetPath: secureJsonData.httpHeaderValue1 valueFrom: secretKeyRef: key: token name: grafana-auth-secret EOF echo "Monitoring will be available at https://$(oc get route -n monitoring-grafana grafana-route -o jsonpath='{.status.ingress[0].host}')"
While the Grafana operator is still deploying, you might see the following error. You can ignore it; the script continues to retry until the operator installation completes:
Error from server (NotFound): installplans.operators.coreos.com "null" not found
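After the script completes, you can verify the deployment and retrieve the Grafana URL again at any time. This is a quick check; the grafana-route route is created by the Grafana instance that is defined in the script:
oc get pods -n monitoring-grafana
oc get route grafana-route -n monitoring-grafana -o jsonpath='{.status.ingress[0].host}'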
Importing dashboards
Dashboards can be imported into Grafana either by an ID or by importing the dashboard JSON itself.

Importing Cloud Pak for AIOps dashboards to Grafana
Several dashboards are provided to assist in the operational monitoring of Cloud Pak for AIOps.
For more information about how to import a dashboard in Grafana, see Import dashboards.
Cloud Pak for AIOps provides the following custom dashboards as JSON, which can be uploaded or copied and pasted into Grafana. For more information, see the following sections:
Importing other dashboards to Grafana
You can import the following dashboards into Grafana. These dashboards are developed by Grafana community members and can be imported by using their IDs:
- K8 Cluster Detail Dashboard: ID 10856
- kubernetes-networking-cluster: ID 16371
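If you prefer to work with the dashboard JSON directly, you can also download a community dashboard by its ID and then import the file. This is a sketch that assumes the public grafana.com download endpoint; adjust the ID as needed:
# download the K8 Cluster Detail Dashboard (ID 10856) as JSON
curl -sL https://grafana.com/api/dashboards/10856/revisions/latest/download -o k8-cluster-detail.json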
Viewing available metrics
The preceding dashboards show only a fraction of the metrics that are made available by Prometheus. To view the list of available metrics in your OpenShift Container Platform cluster, run the following commands, as described in Authentication with Bearer token.
- Obtain the API URL for the Thanos Querier:
oc get routes -n openshift-monitoring thanos-querier -o jsonpath='{.status.ingress[0].host}'
- Retrieve the list of available metric APIs and store the result in metrics.json:
Note: The following command might not work on Federal Information Processing Standards (FIPS) enabled clusters because of networking restrictions. To work around this issue, log in to the OpenShift Container Platform web console, open a terminal on any pod, and run the following command from there. Alternatively, you can log in to the OpenShift Container Platform cluster by using the command line, exec into a pod, and then run the following command.
curl -k -H "Authorization: Bearer $(oc whoami -t)" https://<thanos_querier_route>/api/v1/metadata > metrics.json
Where <thanos_querier_route> is the route that you obtained in the preceding command.
All of the metric APIs listed in the output of the preceding command can be copied and pasted into the console under Observe > Metrics. These queries can also be used in Grafana to create new panels, which can be saved in a new dashboard or in one of the existing custom (community) dashboards.
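For example, you can list the names of the captured metrics from metrics.json. This is a minimal sketch; it assumes the standard Prometheus metadata response format, where metric names are the keys of the data object:
# print one metric name per line
jq -r '.data | keys[]' metrics.json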