Integrating Prometheus Alert Manager with Netcool Operations Insight
To modify the default Prometheus configuration, use the following steps:
- Deploy the `ibm-netcool-probe` chart.
- After a successful deployment, get the Prometheus probe's Endpoint Host and Port from the Workloads > Deployments page.
  - If `prometheusProbe.service.type` is set to `ClusterIP`, the full webhook URL has the following format: `http://<service name>.<namespace>:<externalPort>/probe/webhook/prometheus`. To obtain the service name and port from the command line, run the following commands, substituting `<namespace>` with the namespace where the release is deployed and `<release_name>` with the Helm release name.

```
# Get the Service name
export SVC_NAME=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].metadata.name}")

# Get the Service port number
export SVC_PORT=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].spec.ports[0].port}")
```

  - If `prometheusProbe.service.type` is set to `NodePort`, the full webhook URL has the following format: `http://<External IP>:<Node Port>/probe/webhook/prometheus`. To obtain the NodePort number from the command line, run the following commands, substituting `<namespace>` with the namespace where the release is deployed and `<release_name>` with the Helm release name.

```
# Get the NodePort number from the Service resource
export NODE_PORT_PROMETHEUS=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].spec.ports[0].nodePort}")

# On ICP 3.1.1, you can obtain the External IP from the IBM Cloud Cluster Info ConfigMap using the command below.
export NODE_IP_PROMETHEUS=$(kubectl get configmap --namespace kube-public ibmcloud-cluster-info -o jsonpath="{.data.proxy_address}")
echo http://$NODE_IP_PROMETHEUS:$NODE_PORT_PROMETHEUS/probe/webhook/prometheus

# On ICP 3.1.0, get the External IP from the Nodes resource. This command requires the Cluster Administrator role.
export NODE_IP_PROMETHEUS=$(kubectl get nodes -l proxy=true -o jsonpath="{.items[0].status.addresses[0].address}")
```
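For the `ClusterIP` case, the exported service name and port can be assembled into the full webhook URL with plain shell substitution. The values below are hypothetical stand-ins for the output of the `kubectl` commands above:

```shell
# Hypothetical values standing in for the kubectl output above.
SVC_NAME="netcool-probe-prometheusprobe"
NAMESPACE="default"
SVC_PORT="8080"

# Assemble the ClusterIP-style webhook URL.
WEBHOOK_URL="http://${SVC_NAME}.${NAMESPACE}:${SVC_PORT}/probe/webhook/prometheus"
echo "$WEBHOOK_URL"
# → http://netcool-probe-prometheusprobe.default:8080/probe/webhook/prometheus
```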
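Before wiring Alert Manager to the probe, you can optionally smoke-test the webhook endpoint by POSTing a minimal payload in the Alertmanager webhook (version 4) format. This sketch only prints the `curl` command rather than running it, since the probe may not be reachable from the shell where you build the URL; the URL and alert name below are hypothetical:

```shell
# Hypothetical webhook URL; substitute the one derived in the previous step.
WEBHOOK_URL="http://netcool-probe-prometheusprobe.default:8080/probe/webhook/prometheus"

# Minimal Alertmanager-style webhook payload for a connectivity test.
PAYLOAD='{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"probe_connectivity_test","severity":"warning"},"annotations":{"summary":"Netcool probe connectivity test"}}]}'

# Print the command; copy and run it from a host that can reach the service.
echo curl -X POST -H "Content-Type: application/json" -d "$PAYLOAD" "$WEBHOOK_URL"
```

If the probe receives the payload, a corresponding test event should appear in the OMNIbus Event List.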
- Determine the Prometheus Alert Manager and Alert Rules ConfigMaps in the same namespace. In this procedure, the ConfigMaps in the `kube-system` namespace are `monitoring-prometheus-alertmanager` and `monitoring-prometheus-alertrules` respectively.
- Edit the Prometheus Alert Manager ConfigMap to add a new receiver in the `receivers` section. If a separate Prometheus is deployed, determine its Alert Manager ConfigMap and add the new receiver there. To do this from the command line, load the `monitoring-prometheus-alertmanager` ConfigMap into a file using the following command:

```
kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml
```

- Edit the alertmanager.yaml file and add a new webhook receiver configuration. A sample ConfigMap configuration is shown below. Use the full webhook URL from Step 2 in the `url` parameter.
```
$ cat alertmanager.yaml
apiVersion: v1
data:
  alertmanager.yml: |-
    global:
    receivers:
      - name: 'netcool_probe'
        webhook_configs:
          - url: 'http://<ip_address>:<port>/probe/webhook/prometheus'
            send_resolved: true
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: 'netcool_probe'
      repeat_interval: 3h
kind: ConfigMap
metadata:
  creationTimestamp: 2018-04-18T02:38:14Z
  labels:
    app: monitoring-prometheus
    chart: ibm-icpmonitoring-1.3.0
    component: alertmanager
    heritage: Tiller
    release: monitoring
  name: monitoring-prometheus-alertmanager
  namespace: kube-system
  resourceVersion: "1856489"
  selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertmanager
  uid: 8aef5f39-42b1-11e8-bd3d-0050569b6c73
```

Note: The `send_resolved` flag should be set to `true` so that the probe receives resolution events.

- Save the changes in the file and replace the ConfigMap using the following command:

```
$ kubectl replace configmap monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml
configmap "monitoring-prometheus-alertmanager" replaced
```

- Load the `monitoring-prometheus-alertrules` ConfigMap into a file, update the data section to add your alerting rules, and save the file. Sample rules for Prometheus 2.0 or newer are shown below.

```
$ kubectl get configmap monitoring-prometheus-alertrules --namespace=kube-system -o yaml > alertrules.yaml
$ cat alertrules.yaml
apiVersion: v1
data:
  alert.rules: |-
    groups:
      - name: alertrules.rules
        rules:
          - alert: jenkins_down
            expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Jenkins container is down for more than 30 seconds.
              summary: Jenkins down
              type: Container
          - alert: jenkins_high_cpu
            expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m])) / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins CPU usage is {{ humanize $value}}%.
              summary: Jenkins high CPU usage
              type: Container
          - alert: jenkins_high_memory
            expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins memory consumption is at {{ humanize $value}}.
              summary: Jenkins high memory usage
              type: Container
          - alert: container_restarts
            expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
            for: 10s
            labels:
              severity: warning
            annotations:
              description: The container {{ $labels.container }} in pod {{ $labels.pod }} has restarted at least {{ humanize $value}} times in the last hour on instance {{ $labels.instance }}.
              summary: Containers are restarting
          - alert: high_cpu_load
            expr: node_load1 > 1.5
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server under high load
          - alert: high_memory_load
            expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server memory is almost full
          - alert: high_storage_load
            expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server storage is almost full
          - alert: monitor_service_down
            expr: up == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Service {{ $labels.instance }} is down.
              summary: Monitor service non-operational
kind: ConfigMap
metadata:
  creationTimestamp: 2018-04-18T02:38:14Z
  labels:
    app: monitoring-prometheus
    chart: ibm-icpmonitoring-1.3.0
    component: prometheus
    heritage: Tiller
    release: monitoring
  name: monitoring-prometheus-alertrules
  namespace: kube-system
  resourceVersion: "1856491"
  selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertrules
  uid: 8aee6593-42b1-11e8-bd3d-0050569b6c73
```

- Replace the ConfigMap with the updated configuration using the following command:

```
kubectl replace configmap monitoring-prometheus-alertrules --namespace=kube-system -f alertrules.yaml
configmap "monitoring-prometheus-alertrules" replaced
```

- Prometheus usually takes a couple of minutes to reload the updated ConfigMaps and apply the new configuration. Verify that Prometheus events appear in the OMNIbus Event List.
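As a sanity check on the thresholds above, the `high_memory_load` expression can be evaluated by hand. With hypothetical sample values (8 GB total memory, 0.8 GB free + buffers + cached), plain shell arithmetic reproduces the comparison Prometheus performs:

```shell
# Hypothetical samples for the node_memory_* metrics (in bytes).
MEM_TOTAL=8000000000          # sum(node_memory_MemTotal_bytes)
MEM_FREE_BUF_CACHE=800000000  # sum(MemFree + Buffers + Cached)

# (total - free-ish) / total * 100, as in the high_memory_load expr.
USAGE_PCT=$(( (MEM_TOTAL - MEM_FREE_BUF_CACHE) * 100 / MEM_TOTAL ))
echo "memory usage: ${USAGE_PCT}%"
# → memory usage: 90%

# The alert fires once usage stays above the 85% threshold for 30s.
if [ "$USAGE_PCT" -gt 85 ]; then
  echo "high_memory_load would fire"
fi
```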