Configuring Prometheus in Kubernetes from the command line
This procedure describes how to configure Prometheus to point to the probe's webhook running on Red Hat OpenShift Container Platform (OCP). It can also be used with the on-premises version of the probe.
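In the steps that follow, <probehost> stands for the host through which the probe's webhook can be reached. If the probe is deployed on Kubernetes rather than on-premises, one way to work out the webhook URL is to inspect the probe's Service. The commands below are a sketch only; the namespace name (netcool) is an assumption and may differ in your deployment.

```shell
# Sketch: locate the probe's webhook service (the namespace is an assumption).
kubectl get svc --namespace netcool

# Combine the reachable host and port with the probe's webhook path to form
# the URL used in the Alert Manager configuration, for example:
#   http://<probehost>:80/probe/webhook/prometheus
```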
Modifying Prometheus Alert Manager and Alert Rules Configuration for OCP Monitoring
- Determine the Prometheus Alert Manager configuration secret in the cluster. The default Secret that contains the Alert Manager configuration is in the openshift-monitoring namespace. See Applying custom Alertmanager configuration.
- A sample Alert Manager configuration with the probe webhook config applied is shown below. The sample endpoint is http://<probehost>:80/probe/webhook/prometheus.

```yaml
global:
  resolve_timeout: '5m'
receivers:
  - name: 'null'
  - name: 'netcool_probe'
    webhook_configs:
      - url: 'http://<probehost>:80/probe/webhook/prometheus'
        send_resolved: true
route:
  group_by:
    - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: netcool_probe
  repeat_interval: 5s
  routes:
    - receiver: netcool_probe
      match:
        alertname: Watchdog
```

- Apply the updated Alert Manager configuration file. A command sketch follows this list.
- For details about applying custom alerting rules, see Managing cluster alerts.
- Verify that your probe is receiving the OCP Monitoring alerts and that events appear on the Netcool/OMNIbus Event List.
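The exact mechanism for applying the configuration depends on how monitoring is set up in your cluster. As a sketch only, assuming the default alertmanager-main secret in the openshift-monitoring project and a local file named alertmanager.yaml containing the configuration above, the update could look like the following; verify the exact commands against the Applying custom Alertmanager configuration documentation for your OCP version.

```shell
# Sketch: extract, edit, and re-apply the default Alert Manager configuration.
# Dump the current configuration from the alertmanager-main secret.
oc -n openshift-monitoring get secret alertmanager-main \
  --template='{{ index .data "alertmanager.yaml" }}' | base64 -d > alertmanager.yaml

# Edit alertmanager.yaml to add the netcool_probe receiver and route,
# then replace the secret with the updated file.
oc -n openshift-monitoring create secret generic alertmanager-main \
  --from-file=alertmanager.yaml --dry-run=client -o yaml \
  | oc -n openshift-monitoring replace -f -
```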
Modifying Prometheus Alert Manager and Alert Rules on IBM Cloud Platform Common Services in Red Hat OCP 4.2
To modify the default CS Monitoring configuration, use the following steps:
- Determine the Prometheus Alert Manager config map in the kube-system namespace. Among the default config maps in the kube-system namespace, it is monitoring-prometheus-alertmanager.
- Edit the Prometheus Alert Manager config map to add a new receiver in the receivers section. The default Prometheus deployment config map name is monitoring-prometheus-alertmanager in the kube-system namespace. If a separate Prometheus or CS Monitoring instance is deployed, determine its Alert Manager config map and add the new receiver there. To do this from the command line, configure the kubectl client and follow the steps below.
- Load the config map into a file using the following command:

```shell
kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml
```

- Edit the alertmanager.yaml file and add the configuration shown below:
```yaml
route:
  receiver: 'netcool_probe'
receivers:
  - name: 'netcool_probe'
    webhook_configs:
      - url: 'http://<probehost>:80/probe/webhook/prometheus'
        send_resolved: true
```

Replace the url parameter with the probe's webhook URL. This can be the webhook URL of a probe deployed either on Kubernetes or on-premises.
- Save the changes in the file and replace the config map using the following command:

```shell
$ kubectl replace configmaps monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml
configmap "monitoring-prometheus-alertmanager" replaced
```

- Review the sample alert rules CRD YAML below. You may update the rules or add more rules to generate more alerts to monitor your cluster. The Message Bus Probe rules file expects the following attributes from the alerts generated by Prometheus Alert Manager:
- labels.severity: The severity of the alert. Should be set to critical, major, minor, or warning. This is mapped to the Severity field in the ObjectServer alerts.status table.
- labels.instance: The instance generating the alert. This is mapped to the Node field in the ObjectServer alerts.status table.
- labels.alertname: The alert rule name. This is mapped to the AlertGroup field in the ObjectServer alerts.status table.
- annotations.description: (Optional) The full description of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table.
- annotations.summary: A short description or summary of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table if annotations.description is unset.
- annotations.type: The alert type. For example, "Container", "Server", or "Service". This is mapped to the AlertKey field in the ObjectServer alerts.status table.
- labels.release: (Optional) If set, this is mapped to the ScopeId field in the ObjectServer alerts.status table, which is used as the first-level group to group related events.
- labels.job: (Optional) If set, this is mapped to the SiteName field in the ObjectServer alerts.status table, which is used as the sub-group to group related events.
Note: Sample alert-rules CRD. This file is also available in the included CloudPak under pak_extensions/prometheus-rules.

```yaml
# File: netcool-rules.yaml
# Please modify these rules to monitor specific workloads,
# containers, services or nodes in your cluster
apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: AlertRule
metadata:
  name: netcool-rules
spec:
  enabled: true
  data: |-
    groups:
      - name: alertrules.rules
        rules:
          ## Sample workload monitoring rules
          - alert: jenkins_down
            expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Jenkins container is down for more than 30 seconds.
              summary: Jenkins down
              type: Container
          - alert: jenkins_high_cpu
            expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m])) / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins CPU usage is {{ humanize $value}}%.
              summary: Jenkins high CPU usage
              type: Container
          - alert: jenkins_high_memory
            expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins memory consumption is at {{ humanize $value}}.
              summary: Jenkins high memory usage
              type: Container
          ## End - Sample workload monitoring rules.
          ## Sample container monitoring rules
          - alert: container_restarts
            expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
            for: 10s
            labels:
              severity: warning
            annotations:
              description: The container {{ $labels.container }} in pod {{ $labels.pod }} has restarted at least {{ humanize $value}} times in the last hour on instance {{ $labels.instance }}.
              summary: Containers are restarting
              type: Container
          ## End - Sample container monitoring rules.
          ## Sample node monitoring rules
          - alert: high_cpu_load
            expr: node_load1 > 1.5
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server under high load
              type: Server
          - alert: high_memory_load
            expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server memory is almost full
              type: Server
          - alert: high_storage_load
            expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server storage is almost full
              type: Server
          - alert: monitor_service_down
            expr: up == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Service {{ $labels.instance }} is down.
              summary: Monitor service non-operational
              type: Service
          ## End - Sample node monitoring rules.
```

- Use the following command to create a new AlertRule in the kube-system namespace.
```shell
$ kubectl apply -f netcool-rules.yaml --namespace kube-system
```

Note: It usually takes a couple of minutes for Prometheus to reload the updated config maps and apply the new configuration.
- Verify that Prometheus events appear on the OMNIbus Event List. A manual webhook test is sketched after this list.
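If no events arrive, one way to narrow down the problem is to post a test alert directly to the probe's webhook. The following is a sketch only: it assumes the probe accepts the standard Prometheus Alert Manager webhook JSON format, and the host, label, and annotation values are illustrative, chosen to match the attribute mapping described above.

```shell
# Sketch: send a hand-built test alert to the probe webhook (values are illustrative).
# The body mimics an Alert Manager webhook payload, including the labels and
# annotations that the probe rules file maps to ObjectServer fields.
curl -X POST http://<probehost>:80/probe/webhook/prometheus \
  -H 'Content-Type: application/json' \
  -d '{
    "version": "4",
    "status": "firing",
    "receiver": "netcool_probe",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "jenkins_down",
          "severity": "critical",
          "instance": "worker-node-1"
        },
        "annotations": {
          "summary": "Jenkins down",
          "description": "Jenkins container is down for more than 30 seconds.",
          "type": "Container"
        },
        "startsAt": "2021-01-01T00:00:00Z"
      }
    ]
  }'
```

If the probe accepts the payload, a corresponding event should appear in the Event List with Severity taken from labels.severity, Node from labels.instance, and AlertGroup from labels.alertname; otherwise, check the probe logs for parsing or connectivity errors.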