Integrating Prometheus Alert Manager with Netcool Operations Insight

To integrate the Prometheus Alert Manager with Netcool Operations Insight, modify the default Prometheus configuration using the following steps:

  1. Deploy the ibm-netcool-probe chart.
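
    For example, a minimal Helm v2 install sketch (the repository name, release name, and namespace are placeholders; the --tls flag is typically required when connecting to the ICP Tiller):

    # Install the ibm-netcool-probe chart from a configured Helm repository (placeholder names)
    helm install --name <release_name> --namespace <namespace> <repo_name>/ibm-netcool-probe --tls
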
  2. After a successful deployment, get the Prometheus probe's Endpoint Host and Port from the Workloads > Deployments page.
    • If prometheusProbe.service.type is set to ClusterIP, the full webhook URL will have the following format: http://<service name>.<namespace>:<externalPort>/probe/webhook/prometheus

      To obtain the service name and port from the command line, use the following commands, substituting <namespace> with the namespace where the release is deployed and <release_name> with the Helm release name.

      # Get the Service name
      export SVC_NAME=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].metadata.name}")

      # Get the Service port number
      export SVC_PORT=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].spec.ports[0].port}")

    • If prometheusProbe.service.type is set to NodePort, the full webhook URL will have the following format: http://<External IP>:<Node Port>/probe/webhook/prometheus

      To obtain the NodePort number from the command line, use the following commands, substituting <namespace> with the namespace where the release is deployed and <release_name> with the Helm release name.

      # Get the NodePort number from the Service resource
      export NODE_PORT_PROMETHEUS=$(kubectl get services --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" -o jsonpath="{.items[0].spec.ports[0].nodePort}")

      # On ICP 3.1.1, you can obtain the External IP from the IBM Cloud Cluster Info ConfigMap using the command below.
      export NODE_IP_PROMETHEUS=$(kubectl get configmap --namespace kube-public ibmcloud-cluster-info -o jsonpath="{.data.proxy_address}")
      echo http://$NODE_IP_PROMETHEUS:$NODE_PORT_PROMETHEUS/probe/webhook/prometheus

      # On ICP 3.1.0, get the External IP from the Nodes resource. This command requires the Cluster Administrator role.
      export NODE_IP_PROMETHEUS=$(kubectl get nodes -l proxy=true -o jsonpath="{.items[0].status.addresses[0].address}")
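
      The full webhook URL can then be printed from these variables, as in the ICP 3.1.1 example above; a minimal sketch for the ClusterIP and ICP 3.1.0 cases (substitute <namespace> in the ClusterIP URL):

      # ClusterIP: print the full webhook URL
      echo http://$SVC_NAME.<namespace>:$SVC_PORT/probe/webhook/prometheus

      # NodePort on ICP 3.1.0: print the full webhook URL
      echo http://$NODE_IP_PROMETHEUS:$NODE_PORT_PROMETHEUS/probe/webhook/prometheus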

  3. Identify the Prometheus Alert Manager and Alert Rules ConfigMaps in the namespace where Prometheus is deployed. In this procedure, the ConfigMaps in the kube-system namespace are monitoring-prometheus-alertmanager and monitoring-prometheus-alertrules respectively.
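
    To list candidate ConfigMaps from the command line, you can filter on the app=monitoring-prometheus label that the monitoring chart applies (a sketch; names and labels may differ in your environment):

    # List the monitoring ConfigMaps in the kube-system namespace
    kubectl get configmaps --namespace kube-system -l app=monitoring-prometheus
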
  4. Edit the Prometheus Alert Manager ConfigMap to add a new receiver in the receivers section. If a separate Prometheus instance is deployed, determine its Alert Manager ConfigMap and add the new receiver there. To do this from the command line, save the monitoring-prometheus-alertmanager ConfigMap to a file using the following command:

    kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml

  5. Edit the alertmanager.yaml file and add a new webhook receiver configuration. A sample ConfigMap is shown below. Use the full webhook URL from Step 2 as the url parameter.
    $ cat alertmanager.yaml 
    apiVersion: v1
    data:
      alertmanager.yml: |-
        global:
        receivers:
        - name: 'netcool_probe'
          webhook_configs:
          - url: 'http://<ip_address>:<port>/probe/webhook/prometheus'
            send_resolved: true
    
        route:
          group_wait: 10s
          group_interval: 5m
          receiver: 'netcool_probe'
          repeat_interval: 3h
    kind: ConfigMap
    metadata:
      creationTimestamp: 2018-04-18T02:38:14Z
      labels:
        app: monitoring-prometheus
        chart: ibm-icpmonitoring-1.3.0
        component: alertmanager
        heritage: Tiller
        release: monitoring
      name: monitoring-prometheus-alertmanager
      namespace: kube-system
      resourceVersion: "1856489"
      selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertmanager
      uid: 8aef5f39-42b1-11e8-bd3d-0050569b6c73
    
    Note: The send_resolved flag should be set to true so that the probe receives resolution events.
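
    Optionally, you can validate the embedded Alert Manager configuration before replacing the ConfigMap. A sketch, assuming the yq (v4) and amtool utilities are installed locally:

    # Extract the embedded alertmanager.yml from the edited file and check its syntax
    yq '.data."alertmanager.yml"' alertmanager.yaml > /tmp/alertmanager.yml
    amtool check-config /tmp/alertmanager.yml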
  6. Save the changes in the file and replace the ConfigMap using the following command:

    $ kubectl replace configmap monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml

    configmap "monitoring-prometheus-alertmanager" replaced

  7. Load the monitoring-prometheus-alertrules ConfigMap into a file, update the data section to add your alerting rules, and save the file. Sample rules for Prometheus 2.0 or later are shown below.

    $ kubectl get configmap monitoring-prometheus-alertrules --namespace=kube-system -o yaml > alertrules.yaml

    $ cat alertrules.yaml
    apiVersion: v1
    data:
      alert.rules: |-
        groups:
        - name: alertrules.rules
          rules:
          - alert: jenkins_down
            expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Jenkins container is down for more than 30 seconds.
              summary: Jenkins down
              type: Container
          - alert: jenkins_high_cpu
            expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m]))
              / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins CPU usage is {{ humanize $value}}%.
              summary: Jenkins high CPU usage
              type: Container
          - alert: jenkins_high_memory
            expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins memory consumption is at {{ humanize $value}}.
              summary: Jenkins high memory usage
              type: Container
          - alert: container_restarts
            expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
            for: 10s
            labels:
              severity: warning
            annotations:
              description: The container {{ $labels.container }} in pod {{ $labels.pod }}
                has restarted at least {{ humanize $value}} times in the last hour on instance
                {{ $labels.instance }}.
              summary: Containers are restarting
          - alert: high_cpu_load
            expr: node_load1 > 1.5
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Docker host is under high load, the avg load 1m is at {{ $value}}.
                Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server under high load
          - alert: high_memory_load
            expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes
              + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host memory usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server memory is almost full
          - alert: high_storage_load
            expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"})
              / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host storage usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server storage is almost full
          - alert: monitor_service_down
            expr: up == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Service {{ $labels.instance }} is down.
              summary: Monitor service non-operational
    
    kind: ConfigMap
    metadata:
      creationTimestamp: 2018-04-18T02:38:14Z
      labels:
        app: monitoring-prometheus
        chart: ibm-icpmonitoring-1.3.0
        component: prometheus
        heritage: Tiller
        release: monitoring
      name: monitoring-prometheus-alertrules
      namespace: kube-system
      resourceVersion: "1856491"
      selfLink: /api/v1/namespaces/kube-system/configmaps/monitoring-prometheus-alertrules
      uid: 8aee6593-42b1-11e8-bd3d-0050569b6c73
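
    Optionally, validate the rule syntax before replacing the ConfigMap. A sketch, assuming the yq (v4) and promtool (Prometheus 2.x) utilities are installed locally:

    # Extract the embedded alert.rules from the edited file and check the rule syntax
    yq '.data."alert.rules"' alertrules.yaml > /tmp/alert.rules
    promtool check rules /tmp/alert.rules
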
  8. Replace the ConfigMap with the updated configuration using the following command:

    $ kubectl replace configmap monitoring-prometheus-alertrules --namespace=kube-system -f alertrules.yaml

    configmap "monitoring-prometheus-alertrules" replaced

  9. Prometheus usually takes a couple of minutes to reload the updated ConfigMaps and apply the new configuration. Verify that Prometheus events appear in the OMNIbus Event List.
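
    If events do not appear, checking the probe's pod logs for incoming webhook requests can help. A sketch, assuming the probe pods carry the same labels as the Service selector used in Step 2:

    # Tail the Prometheus probe logs to confirm webhook requests are arriving
    kubectl logs --namespace <namespace> -l "app.kubernetes.io/instance=<release_name>,app.kubernetes.io/component=prometheusprobe" --tail=50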