Integrating Prometheus Alert Manager with Netcool Operations Insight

To integrate the Prometheus Alert Manager with Netcool Operations Insight, modify the default Prometheus configuration using the following steps:

  1. Deploy the ibm-netcool-probe chart.
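
    For example, a minimal Helm v2 install sketch (the repository name, release name, and namespace are placeholders; the --tls flag is typically required when connecting to the ICP Tiller):

    # Install the ibm-netcool-probe chart from a configured Helm repository (placeholder names)
    helm install --name <release_name> --namespace <namespace> <repo_name>/ibm-netcool-probe --tls
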
  2. After a successful deployment, get the Prometheus probe's Endpoint Host and Port from the Workloads > Deployments page.
    • If prometheusProbe.service.type is set to ClusterIP, the full webhook URL will have the following format: http://<service name>.<namespace>:<externalPort>/probe/webhook/prometheus

      To obtain the service name and port from the command line, use the following commands, substituting <namespace> with the namespace where the release is deployed and <release_name> with the Helm release name.

      # Get the Service name
      export SVC_NAME=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].metadata.name}")

      # Get the Service port number
      export SVC_PORT=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].spec.ports[0].port}")

    • If prometheusProbe.service.type is set to NodePort, the full webhook URL will have the following format: http://<External IP>:<Node Port>/probe/webhook/prometheus

      To obtain the NodePort number from the command line, use the following commands, substituting <namespace> with the namespace where the release is deployed and <release_name> with the Helm release name.

      # Get the NodePort number from the Service resource
      export NODE_PORT_PROMETHEUS=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].spec.ports[0].nodePort}")

      # On ICP 3.1.1, you can obtain the External IP from the IBM Cloud Cluster Info ConfigMap using the command below.
      export NODE_IP_PROMETHEUS=$(kubectl get configmap --namespace kube-public ibmcloud-cluster-info -o jsonpath="{.data.proxy_address}")
      echo http://$NODE_IP_PROMETHEUS:$NODE_PORT_PROMETHEUS/probe/webhook/prometheus

      # On ICP 3.1.0, get the External IP from the Nodes resource. This command requires the Cluster Administrator role.
      export NODE_IP_PROMETHEUS=$(kubectl get nodes -l proxy=true -o jsonpath="{.items[0].status.addresses[0].address}")
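
      The full webhook URL can then be printed from these variables, as in the ICP 3.1.1 example above; a minimal sketch for the ClusterIP and ICP 3.1.0 cases (substitute <namespace> in the ClusterIP URL):

      # ClusterIP: print the full webhook URL
      echo http://$SVC_NAME.<namespace>:$SVC_PORT/probe/webhook/prometheus

      # NodePort on ICP 3.1.0: print the full webhook URL
      echo http://$NODE_IP_PROMETHEUS:$NODE_PORT_PROMETHEUS/probe/webhook/prometheus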

  3. Identify the Prometheus Alert Manager and Alert Rules ConfigMaps in the namespace where Prometheus is deployed. In this procedure, the ConfigMaps in the kube-system namespace are monitoring-prometheus-alertmanager and monitoring-prometheus-alertrules respectively.
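
    To list candidate ConfigMaps from the command line, you can filter on the app=monitoring-prometheus label that the monitoring chart applies (a sketch; names and labels may differ in your environment):

    # List the monitoring ConfigMaps in the kube-system namespace
    kubectl get configmaps --namespace kube-system -l app=monitoring-prometheus
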
  4. Edit the Prometheus Alert Manager ConfigMap to add a new receiver in the receivers section. If a separate Prometheus instance is deployed, determine its Alert Manager ConfigMap and add the new receiver there. To do this from the command line, save the monitoring-prometheus-alertmanager ConfigMap to a file using the following command:

    kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml

  5. Edit the alertmanager.yaml file and add a new webhook receiver configuration. A sample ConfigMap is shown below. Use the full webhook URL from Step 2 as the url parameter.
    $ cat alertmanager.yaml 
    apiVersion: v1
    data:
      alertmanager.yml: |-
        global:
        receivers:
        - name: 'netcool_probe'
          webhook_configs:
          - url: 'http://<ip_address>:<port>/probe/webhook/prometheus'
            send_resolved: true
    
        route:
          group_wait: 10s
          group_interval: 5m
          receiver: 'netcool_probe'
          repeat_interval: 3h
    kind: ConfigMap
    metadata:
      creationTimestamp: 2018-04-18T02:38:14Z
      labels:
        app: monitoring-prometheus
        chart: ibm-icpmonitoring-1.3.0
        component: alertmanager
        heritage: Tiller
        release: monitoring
      name: monitoring-prometheus-alertmanager
      namespace: kube-system
      resourceVersion: "1856489"
      selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertmanager
      uid: 8aef5f39-42b1-11e8-bd3d-0050569b6c73
    
    Note: The send_resolved flag should be set to true so that the probe receives resolution events.
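
    Optionally, you can validate the embedded Alert Manager configuration before replacing the ConfigMap. A sketch, assuming the yq (v4) and amtool utilities are installed locally:

    # Extract the embedded alertmanager.yml from the edited file and check its syntax
    yq '.data."alertmanager.yml"' alertmanager.yaml > /tmp/alertmanager.yml
    amtool check-config /tmp/alertmanager.yml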
  6. Save the changes in the file and replace the ConfigMap using the following command:

    $ kubectl replace configmap monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml

    configmap "monitoring-prometheus-alertmanager" replaced

  7. Load the monitoring-prometheus-alertrules ConfigMap into a file, update the data section to add your alerting rules, and save the file. Sample rules for Prometheus 2.0 or later are shown below.

    $ kubectl get configmap monitoring-prometheus-alertrules --namespace=kube-system -o yaml > alertrules.yaml

    $ cat alertrules.yaml
    apiVersion: v1
    data:
      alert.rules: |-
        groups:
        - name: alertrules.rules
          rules:
          - alert: jenkins_down
            expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Jenkins container is down for more than 30 seconds.
              summary: Jenkins down
              type: Container
          - alert: jenkins_high_cpu
            expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m]))
              / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins CPU usage is {{ humanize $value}}%.
              summary: Jenkins high CPU usage
              type: Container
          - alert: jenkins_high_memory
            expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins memory consumption is at {{ humanize $value}}.
              summary: Jenkins high memory usage
              type: Container
          - alert: container_restarts
            expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
            for: 10s
            labels:
              severity: warning
            annotations:
              description: The container {{ $labels.container }} in pod {{ $labels.pod }}
                has restarted at least {{ humanize $value}} times in the last hour on instance
                {{ $labels.instance }}.
              summary: Containers are restarting
          - alert: high_cpu_load
            expr: node_load1 > 1.5
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Docker host is under high load, the avg load 1m is at {{ $value}}.
                Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server under high load
          - alert: high_memory_load
            expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes
              + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host memory usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server memory is almost full
          - alert: high_storage_load
            expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"})
              / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host storage usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server storage is almost full
          - alert: monitor_service_down
            expr: up == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Service {{ $labels.instance }} is down.
              summary: Monitor service non-operational
    
    kind: ConfigMap
    metadata:
      creationTimestamp: 2018-04-18T02:38:14Z
      labels:
        app: monitoring-prometheus
        chart: ibm-icpmonitoring-1.3.0
        component: prometheus
        heritage: Tiller
        release: monitoring
      name: monitoring-prometheus-alertrules
      namespace: kube-system
      resourceVersion: "1856491"
      selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertrules
      uid: 8aee6593-42b1-11e8-bd3d-0050569b6c73
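
    Optionally, validate the rule syntax before replacing the ConfigMap. A sketch, assuming the yq (v4) and promtool (Prometheus 2.x) utilities are installed locally:

    # Extract the embedded alert.rules from the edited file and check the rule syntax
    yq '.data."alert.rules"' alertrules.yaml > /tmp/alert.rules
    promtool check rules /tmp/alert.rules
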
  8. Replace the ConfigMap with the updated configuration using the following command:

    $ kubectl replace configmap monitoring-prometheus-alertrules --namespace=kube-system -f alertrules.yaml

    configmap "monitoring-prometheus-alertrules" replaced

  9. Prometheus usually takes a couple of minutes to reload the updated ConfigMaps and apply the new configuration. Verify that Prometheus events appear in the OMNIbus Event List.
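
    If events do not appear, checking the probe's pod logs for incoming webhook requests can help. A sketch, assuming the probe pods carry the same labels as the Service selector used in Step 2:

    # Tail the Prometheus probe logs to confirm webhook requests are arriving
    kubectl logs --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" --tail=50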