Configuring and enabling OpenShift Container Platform monitoring
OpenShift Container Platform comes with a built-in Prometheus instance that is configured by default to scrape workload metrics from all components of the cluster.
Workloads are platform-level components that are managed by OpenShift, such as containers, pods, and nodes. These default workload metrics provide platform-level visibility, but nothing specific to the IBM Cloud Pak for AIOps deployment. IBM Cloud Pak for AIOps exposes metrics, but OpenShift Container Platform does not scrape them by default because they are user workloads. The sections below cover how to configure the default workload monitoring, and how to enable and configure user-workload monitoring.
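For context, once user-workload monitoring is enabled (as described below), a user workload is scraped only when a ServiceMonitor (or PodMonitor) resource selects its service. The following is an illustrative sketch only; the names, namespace, labels, and port are hypothetical placeholders, not values from the IBM Cloud Pak for AIOps deployment, which manages its own monitoring resources:

```yaml
# Hypothetical ServiceMonitor: all names here are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app-metrics
  namespace: example-namespace
spec:
  selector:
    matchLabels:
      app: example-app      # matches the labels on the workload's Service
  endpoints:
    - port: metrics         # named service port that exposes /metrics
      interval: 30s
```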
Prerequisites
- You must have access to the cluster as a user with the `cluster-admin` role.
- The OpenShift command-line interface (`oc`) is required. For more information, see Getting started with the OpenShift CLI.
Configuring default workload monitoring
You can configure the core OpenShift Container Platform monitoring components by creating the `cluster-monitoring-config` configmap object in the `openshift-monitoring` project. The Cluster Monitoring Operator (CMO) then configures the core components of the monitoring stack.
1. Verify configmap: Check if the `cluster-monitoring-config` configmap object exists and verify that it contains configuration options for CPU, Memory, and PVC as shown in step 3:

   ```
   oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml
   ```

   If the `cluster-monitoring-config` configmap object exists and contains the appropriate configuration options, then this procedure is complete. Otherwise, proceed with the next step.
2. Select a StorageClass: Choose a StorageClass to provision PVCs for Prometheus. Prometheus will use these PVCs to store workload metrics. SSD-backed, block-storage StorageClasses such as `rook-ceph-rbd` are recommended. Shared file system StorageClasses such as `rook-cephfs` should be avoided. For more details on the underlying storage interactions, see the Prometheus Storage documentation.

   - View available StorageClasses:

     ```
     oc get sc
     ```

   - Export the `OCP_MONITOR_STORAGE_CLASS` variable:

     ```
     OCP_MONITOR_STORAGE_CLASS=<StorageClass name>
     ```
3. Apply PVC, Retention, CPU, and Memory configurations:

   The following command configures both Prometheus workload server pods with:

   - Storage (PVC): a 35GiB PVC for storage of workload metrics.
   - Storage retention: a 4 day data retention period for the Prometheus workload metrics. Metrics older than this will be cleaned out by Prometheus.
   - CPU: a minimum resource request of 1 CPU and a limit of 2 CPUs for the Prometheus container.
   - Memory: a minimum resource request of 3 GiB and a limit of 6 GiB of memory for the Prometheus container.

   Important: Prometheus uses 2 workload server pods (`prometheus-k8s-0` and `prometheus-k8s-1`) for scraping metrics and serving requests. A PVC is allocated per pod, which means 2 PVCs will be created (70GiB total). CPU and Memory requests/limits will be applied at the container level for both Prometheus workload server pods. All settings here can be modified according to available resources. Around 8GiB of PVC storage should be allocated per day of retention, although this will vary depending on the size of the cluster.

   ```
   cat << EOF | oc apply --validate -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |
       enableUserWorkload: true
       alertmanagerMain:
         enableUserAlertmanagerConfig: true
       prometheusK8s:
         retention: 4d
         resources:
           requests:
             cpu: 1
             memory: 3Gi
           limits:
             cpu: 2
             memory: 6Gi
         volumeClaimTemplate:
           spec:
             storageClassName: $OCP_MONITOR_STORAGE_CLASS
             volumeMode: Filesystem
             resources:
               requests:
                 storage: 35Gi
   EOF
   ```
4. Validate that the Prometheus workload pods are running with the new configuration:

   The Prometheus workload pods (`prometheus-k8s-0` and `prometheus-k8s-1`), running in the `openshift-monitoring` namespace, should have restarted after applying the configmap. Make sure they have started up successfully and are in a running state. Ensure the PVCs were created successfully in the `openshift-monitoring` namespace.

   If the pods are not entering a running state, check the logs for this error: `disk quota exceeded`. This error indicates that the allocated storage is not large enough to hold the existing metrics currently stored in ephemeral storage on the node. To address this, the PVC can be expanded to accommodate the data. Check the amount of data stored in ephemeral storage and make sure the PVC is large enough to accommodate it. To check this, open a shell in the Prometheus server pods (for example, with `oc rsh`) and run the following command:

   ```
   du -sh /prometheus
   ```

   Alternatively, the existing data can be wiped out by applying the configmap from step 3 without the `volumeClaimTemplate` configuration. This will clear out any existing data. After doing so, wait for the pods to restart, and then re-apply the configmap with the `volumeClaimTemplate` configuration. If this error is not present but the pods are still not entering a ready state, it could mean the pods are too under-resourced to complete their startup. In this case, the CPU/memory should be increased to ensure a successful start up.
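The sizing guidance in the steps above (roughly 8GiB of PVC storage per day of retention, allocated once per server pod) can be sketched as a quick back-of-the-envelope calculation. This is an illustrative estimate, not an exact formula, and the helper name is ours; actual usage varies with cluster size:

```shell
# Rough PVC sizing estimate for the Prometheus workload pods.
# Assumption (from the guidance above): ~8 GiB of storage per day of retention.
pvc_estimate_gib() {
  retention_days=$1
  gib_per_day=$2
  echo $(( retention_days * gib_per_day ))
}

PER_POD=$(pvc_estimate_gib 4 8)   # 4d retention x 8 GiB/day = 32 GiB per pod
TOTAL=$(( PER_POD * 2 ))          # two pods: prometheus-k8s-0 and prometheus-k8s-1
echo "per-pod estimate: ${PER_POD}GiB, total: ${TOTAL}GiB"
```

The 35Gi value in the configmap above is this per-pod estimate rounded up for headroom; increasing the retention period should be matched by a proportional increase in PVC size.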
Enabling and configuring user-workload monitoring
You can configure the user-workload monitoring components with the `user-workload-monitoring-config` configmap object in the `openshift-user-workload-monitoring` project. The Red Hat OpenShift Container Platform Cluster Monitoring Operator (CMO) then configures the components that monitor user-defined projects.
1. Verify configmap: Check if the `user-workload-monitoring-config` configmap object exists and verify that it contains configuration options for CPU, Memory, and PVC as shown in step 3:

   ```
   oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config -o yaml
   ```

   If the `user-workload-monitoring-config` configmap object exists and contains the appropriate configuration options, then this procedure is complete. Otherwise, proceed with the next step.
2. Select a StorageClass: Choose a StorageClass to provision PVCs for Prometheus. Prometheus will use these PVCs to store user-workload metrics. SSD-backed, block-storage StorageClasses such as `rook-ceph-rbd` are recommended. Shared file system StorageClasses such as `rook-cephfs` should be avoided. For more details on the underlying storage interactions, see the Prometheus Storage documentation.

   - View available StorageClasses:

     ```
     oc get sc
     ```

   - Export the `OCP_MONITOR_STORAGE_CLASS` variable:

     ```
     OCP_MONITOR_STORAGE_CLASS=<StorageClass name>
     ```
3. Apply PVC, Retention, CPU, and Memory configurations:

   The following command configures both Prometheus user-workload server pods with:

   - Storage (PVC): a 5GiB PVC for storage of user-workload metrics.
   - Storage retention: a 4 day data retention period for the Prometheus user-workload metrics. Metrics older than this will be cleaned out by Prometheus.
   - CPU: a minimum resource request of 200 millicores and a limit of 1 CPU for the Prometheus container.
   - Memory: a minimum resource request of 2 GiB and a limit of 5 GiB of memory for the Prometheus container.

   Important: Prometheus uses 2 user-workload server pods (`prometheus-user-workload-0` and `prometheus-user-workload-1`) for scraping metrics and serving requests. A PVC is allocated per pod, which means 2 PVCs will be created (10GiB total). CPU and Memory requests/limits will be applied at the container level for both Prometheus user-workload server pods. All settings here can be modified according to available resources. Around 1GiB of PVC storage should be allocated per day of retention, although this will vary with the size of the cluster.

   ```
   cat << EOF | oc apply --validate -f -
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: user-workload-monitoring-config
     namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |
       prometheus:
         retention: 4d
         resources:
           requests:
             memory: 2Gi
             cpu: 200m
           limits:
             memory: 5Gi
             cpu: 1
         volumeClaimTemplate:
           spec:
             storageClassName: $OCP_MONITOR_STORAGE_CLASS
             resources:
               requests:
                 storage: 5Gi
   EOF
   ```
4. Validate that the Prometheus user-workload pods are running with the new configuration:

   The Prometheus user-workload pods (`prometheus-user-workload-0` and `prometheus-user-workload-1`), running in the `openshift-user-workload-monitoring` namespace, should have restarted after applying the configmap. Make sure they have started up successfully and are in a running state. If the pods do not enter a ready state, the CPU/memory may have to be increased to ensure they start up successfully. Ensure the PVCs were created successfully in the `openshift-user-workload-monitoring` namespace.
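The CPU values in the configurations above mix two Kubernetes unit styles: millicores (`200m`) for the request and whole CPUs (`1`, `2`) for the limits. As a quick reference, both forms can be normalized to millicores for comparison; a minimal sketch (the helper name is ours, not part of `oc` or any other CLI), covering only the two forms used in this document:

```shell
# cpu_to_millicores QUANTITY
# Normalizes a Kubernetes CPU quantity to millicores.
# Handles only the two forms used above: "200m" and whole CPUs like "1" or "2".
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;            # already millicores: strip the suffix
    *)  echo $(( $1 * 1000 )) ;;    # whole CPUs: 1 CPU = 1000m
  esac
}

cpu_to_millicores 200m   # request per user-workload Prometheus container
cpu_to_millicores 1      # limit per user-workload Prometheus container
```

So the user-workload request (200m) is one fifth of its limit (1000m), while the default workload Prometheus requests 1000m with a 2000m limit.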
For more information about Prometheus configuration options, see Cluster Monitoring Operator configuration reference in the Red Hat OpenShift documentation.