SevOne NMS Admin Notifications Quick Start Guide
About
SevOne NMS self-monitoring, based on integration with Prometheus, expands the concept of SevOne admin messages. Prometheus is chosen for its ability to collect small-scale, short-duration data, while also serving as a backup notification system that is fully decoupled from SevOne NMS. This integration is provided to users at no additional cost.
Each NMS peer periodically checks itself and other appliances for potential problems and records the information locally. SevOne NMS automatically aggregates and stores this information for a short duration by leveraging Prometheus and Alertmanager. Alertmanager notifies the NMS administrators if peers fail to push updates to Prometheus, or if the data it receives indicates a potential problem.
As Prometheus operates outside of the SevOne NMS software, it is not affected by SevOne NMS outages unless the host(s) running it are completely unavailable. Prometheus can be configured to run redundantly, so the odds of missing a notification for a critical outage are significantly reduced. For example, if MySQL goes down, the NMS host is neither able to collect self-monitoring data nor alert on it, whereas the Prometheus Alertmanager is not affected by this outage. This allows the system to notify the appropriate individuals of potential NMS service disruptions before end users open service tickets.
The original NMS self-monitoring tool is still required for monitoring SevOne NMS appliance performance metrics such as utilization, process statistics, database performance, etc., and for storage of raw data for one year.
SevOne NMS now comes with pre-configured and ready-to-use admin notifications. All services are up and running, enabling the collection of out-of-the-box metrics. The alerts are predefined and triggered automatically.
Note: If there is any reference to master, or a CLI command or its output contains master, it means leader. Similarly, any reference to slave means follower.
Admin Notifications
The Admin notifications trigger if:
- Peers stop reporting status updates to the Prometheus alertmanager pipeline
- Peers are in split-brain or unknown states
- Failovers take too long to execute or do not complete
- Critical processes are not running on appliances, based on their type (PAS vs. DNC) and role (primary vs. secondary (HSA))
- Ingestion is delayed or the indicators per second (IPS) load on an appliance exceeds the specification
- Writes to an appliance's database fail
Configure Alertmanager
By default, the email notifications from Prometheus' alertmanager are disabled. However, services for both Prometheus and alertmanager are running in the background to aggregate metrics and to evaluate the alerts.
Administration > Access Configuration > User Role Manager has two user-level permissions for the Prometheus Alertmanager.
- Can consume Admin notifications
  - A user with this permission can subscribe/unsubscribe to Alertmanager email notifications from Administration > My Preferences > field Admin Notifications.
  - A subscribed user will receive notifications for triggered alerts. An unsubscribed user will not receive any notifications.
  - A user without this permission will still have the option to edit the field.
- Can edit Admin Notifications
  - A user with this permission can enable/disable the Alertmanager email service from Administration > Cluster Manager > Cluster Settings tab > General subtab > field Enable Alert-Manager Admin Notifications. If the service is disabled, no emails will be sent. If enabled, subscribed users will receive the emails.
  - A user without this permission will not have an option to edit the field; it is disabled.
- The Alertmanager email service is enabled by default from Administration > Cluster Manager > Cluster Settings tab > General subtab > field Enable Alert-Manager Admin Notifications.
- Each user is required to subscribe to notifications manually from Administration > My Preferences > field Admin Notifications.
To receive admin notifications:
- Enable the Alertmanager email service from Administration > Cluster Manager > Cluster Settings tab > General subtab > field Enable Alert-Manager Admin Notifications.
- Subscribe to notifications from Administration > My Preferences > field Admin Notifications.
Verify using Prometheus Graphical User Interface
- To check if the data collection is working correctly, log in to the Prometheus graphical user interface and verify that all targets are up and green. For example, enter the URL http://<Cluster Leader IP address>/prometheus/targets in your web browser. Repeat this for the Cluster Leader's HSA, if there is one.
- To check if all the rules are properly parsed and operational, enter http://<Cluster Leader IP address>/prometheus/rules in your web browser.
- To check for active alerts, enter http://<Cluster Leader IP address>/prometheus/alerts in your web browser. If any alerts are being sent, you will see the details for them.
After the SevOne-requestd process is turned off on an active appliance, the triggered alert appears as follows in the graphical user interface.
Example: SevOne-requestd alert on the same peer at a later time from the Alerts tab in Prometheus
Verify using Alertmanager Graphical User Interface
- To check if Alertmanager is working properly, enter http://<Cluster Leader IP address>/alertmanager/#/status in your web browser.
Important: The mail server used by the Alertmanager service must be configured from your NMS > Administration > Cluster Manager > Cluster Settings tab > Email subtab.
- Alertmanager is a separate service from Prometheus and much of the information for alertmanager is exposed from the Prometheus user interface.
- To see active alerts, or to acknowledge and silence them, use Prometheus' user interface. Enter http://<Cluster Leader IP address>/prometheus in your web browser.
Notification Delivery Mechanisms
Prometheus can send notifications via email and Webhooks. Administrators may refer to https://prometheus.io/docs/alerting/latest/configuration/#webhook_config to set up Webhook notification.
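For reference, a minimal sketch of a webhook receiver in /config/alertmanager/alertmanager.yml is shown below; the receiver name and URL are placeholders for illustration, not values shipped with SevOne NMS.
route:
  receiver: 'ops-webhook'          # route alerts to the hypothetical receiver below
receivers:
  - name: 'ops-webhook'            # illustrative receiver name
    webhook_configs:
      - url: 'http://example.internal:5001/alerts'   # replace with your own HTTP listener
        send_resolved: true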
SNMP Trap Support
SNMP traps are not currently supported by Prometheus.
Port Configuration
The following ports are required to be open on both the Cluster Leader and its HSA, if there is one.
Port | Service |
---|---|
TCP 9094 | Alertmanager clustering port |
UDP 9094 | Alertmanager clustering port |
TCP 80 / 443 | Prometheus and Alertmanager use port 80 (for HTTP protocol) and port 443 (for HTTPS protocol) |
Stop or start all Prometheus-related Services
To stop or start all Prometheus services on a single SevOne NMS node or peer, execute one of the following commands as root on the host.
Stop or start all Prometheus services on a single SevOne NMS node or peer
systemctl stop nms-prometheus.service
systemctl start nms-prometheus.service
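To confirm the current state of the service after a stop or start, you can use a standard systemctl status check (a sketch; output will vary by system).
$ systemctl status nms-prometheus.service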
About Prometheus
Prometheus is used for self-monitoring the SevOne NMS cluster. Prometheus collects various metrics of the components of the SevOne NMS cluster and generates alerts if an alerting rule is satisfied. SevOne NMS users can specify their email address to receive notifications on the alerts they are interested in. This guide walks you through the basic concepts and configurations.
Architecture
Prometheus has two main functions.
- Collect metrics
- Send alerts based on metric rules
Collect Metrics from Long Running Daemon
To collect (known as scrape in Prometheus) metrics from a long-running daemon, Prometheus requires the daemon to expose an endpoint, for example /metrics, which returns a list of {metric_name, value} pairs as a response. Prometheus calls the endpoint periodically and saves the resulting metrics inside its storage.
Collect Metrics from Cron Job
To collect metrics from a cron job, Prometheus cannot simply ask the cron job to expose an endpoint because the cron job is not long-running. When the cron job completes, there is no way for Prometheus to collect the metrics. To solve this problem, Prometheus is bundled with a daemon called Pushgateway. The cron job can send its metrics to Pushgateway, which saves the metrics and exposes the /metrics endpoint for Prometheus to scrape, as shown in the sketch below.
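As an illustration of this push model, a short-lived job can send a metric to the local Pushgateway with a plain HTTP POST; the metric name and job name below are hypothetical examples, not metrics shipped with SevOne NMS.
# Push a single sample to the local Pushgateway (port 9091, local only); Prometheus scrapes it on the next cycle.
$ echo "my_cron_duration_seconds 42" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/my_cron_job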
Rules & Alerts
Rules can be defined on top of the collected metrics in Prometheus. When a rule is satisfied, Prometheus generates an alert. For example, if you have a metric for disk free space, you can define a rule disk free space < 1G so that when the disk free space of an appliance is less than 1G, an alert is generated. After Prometheus generates an alert, it sends the alert to another daemon called Alertmanager. Alertmanager receives the alert and notifies any users that are interested in it by email.
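As a hedged illustration of the disk free space example above, such a rule in rules.yml could look like the following; the alert name and threshold are illustrative and are not part of the shipped rule set.
groups:
  - name: Example Rules
    rules:
      - alert: LowDiskFreeSpace                      # illustrative name, not an OOTB rule
        expr: node_filesystem_free_bytes < 1e9       # fires when less than ~1 GB is free
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Free disk space is below 1 GB on {{ $labels.instance }}."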
Prometheus Deployment in SevOne NMS Cluster
The following figure shows the architecture of Prometheus deployment in a 4-peer SevOne NMS cluster.
Prometheus and Alertmanager are deployed only on the Cluster Leader and its HSA, in an active-active mode. Prometheus on both the Cluster Leader and its HSA scrapes the metrics of processes on all peers via nginx. Nginx is responsible for routing a scrape request to the appropriate /metrics endpoint exposed by a process. For example, if Prometheus on the Cluster Leader would like to scrape the metrics of process P1 on Peer A, Prometheus sends a request to Peer A. The nginx on Peer A receives the request and routes it to the /metrics endpoint on process P1.
Prometheus and Alertmanager are configured with a set of alerting rules to achieve self-monitoring. You must configure an email address to receive alert notifications; other than that, it works right out-of-the-box (OOTB). Please refer to Set up OOTB Rules for Self-monitoring to set rules and Set up Alert Email Configuration to set email address(es).
Self-monitoring System Components
The self-monitoring system is currently comprised of 5 distinct components.
Component | Service | Port | Description | URL |
---|---|---|---|---|
1. | Prometheus | | The main data collection service (only runs on the Cluster Leader and its HSA). | https://www.prometheus.io/docs/introduction/overview/ |
2. | Alertmanager | | The main alerting service (only runs on the Cluster Leader and its HSA). | https://www.prometheus.io/docs/alerting/latest/alertmanager/ |
 | Alertmanager | TCP 9094, UDP 9094 | The high availability clustering port for Alertmanager. | https://github.com/prometheus/alertmanager#high-availability |
3. | Pushgateway | 9091 (local only) | Local service to store metrics for ephemeral scripts. It will supersede SevOne-statsd (runs on all peers). | https://prometheus.io/docs/practices/pushing/ |
4. | Node Exporter | 9100 (local only) | Local service that exposes various metrics for each peer (runs on all peers). | https://github.com/prometheus/node_exporter |
Execute the following command before running the SevOne-* commands below.
$ podman exec -it nms-nms-nms /bin/bash
Component | Script | Description | Core Schedule | Log File |
---|---|---|---|---|
5. | SevOne-notify-core-processes-status.py | Exposes metrics for the running core processes. | Every 1 minute | SevOne-notify-core-processes-status.log |
SevOne-notify-peer-state-check.py | Exposes metrics for the peer state. | Every 1 minute | SevOne-notify-peer-state-check.log | |
SevOne-notify-peer-takeover-status.py | Exposes metrics during HSA failover/takeover. | Every 1 minute | SevOne-notify-peer-takeover-status.log | |
SevOne-notify-DBTable-crash.py | Exposes metrics when a database table crashes and cannot be repaired automatically. | Every day at 6:00 am UTC | SevOne-notify-DBTable-crash.log | |
SevOne-notify-split-brain | Exposes metrics for the split-brain state. | Every 1 minute | SevOne-notify-split-brain.log |
The following processes expose metrics to Prometheus. However, these metrics are exposed on local ports only and are only accessible from within the system.
Process | Description | Local Port |
---|---|---|
SevOne-datad | Exposes metrics when SevOne-datad experiences database write errors. | 60123 |
SevOne-ingestion-resolved | Exposes metrics if a peer is overloaded, i.e. if a peer exceeds its Indicators per second (IPS) capacity. | 6666 |
Using curl, verify services are exporting metrics
# SevOne-datad metrics
$ curl http://localhost:60123/metrics
# SevOne-ingestion-resolved metrics
$ curl http://localhost:6666/metrics
Gather Data using Prometheus
The Prometheus service gathers data by polling predefined REST API endpoints. The URLs expose the metrics that Prometheus parses and stores locally. A list of all configured collection targets can be found on the Cluster Leader's Prometheus instance at CLUSTER_LEADER_IP/prometheus/targets. The following table lists the URLs that Prometheus polls periodically.
URL | Service |
---|---|
CLUSTER_LEADER_IP/prometheus/metrics | Prometheus metrics. |
https://PEER_IP/metrics/datad | SevOne-datad metrics for a peer. |
https://PEER_IP/metrics/ingestion-resolved | SevOne-ingestion-resolved metrics per peer. |
https://PEER_IP/metrics/push | Pushgateway metrics per peer. |
https://PEER_IP/metrics/node | Node Exporter metrics per peer. |
https://PEER_IP/metrics/ocd | SevOne-ocd metrics for a peer. |
https://PEER_IP/metrics/kafka | Kafka metrics for a peer. |
https://PEER_IP/metrics/config-db | Config database metrics for a peer. |
https://PEER_IP/metrics/data-db | Data database metrics for a peer. |
All metrics are under a common /metrics/ endpoint. This allows you to use the nginx proxy to expose all local ports. The local ports are not accessible from outside a peer and all communication with the metrics endpoints is over TLS.
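If you prefer the command line over the targets page, the same information is available from Prometheus' HTTP API; a sketch is shown below, assuming the basic_auth credentials configured with the SevOne-act prometheus credential-change command described later in this guide.
# List all configured scrape targets via the Prometheus HTTP API.
$ curl -k -u <username>:<password> https://<Cluster Leader IP address>/prometheus/api/v1/targets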
View Metrics
The metrics are available for the entire cluster on Prometheus user interface.
- Navigate to http://<Cluster Leader IP address>/prometheus.
- Click the - insert metric at cursor - drop-down to view the collected metrics.
- Select the metric you want to see. For example, node_filesystem_free_bytes.
- Select Graph tab.
- Click Execute button.
The graph for the selected metric is plotted for all the peers in the cluster. The default time interval is 1 hour, but it can be changed. The individual peer data can be selected or unselected by clicking on its entry in the legend.
Example: http://<Cluster Master IP address>/prometheus
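Alternatively, you can type a PromQL expression directly into the expression field before clicking Execute; for example, the following query restricts the metric to the root filesystem (the mountpoint label value is an assumption and may differ on your appliances).
node_filesystem_free_bytes{mountpoint="/"}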
Start & Run Self-monitoring System
systemctl start nms-prometheus.service
systemctl restart nms-prometheus.service
systemctl stop nms-prometheus.service
When the services are operational, the following URLs can be used.
URL | Service |
---|---|
<Cluster Leader IP address>/prometheus | Prometheus |
<Cluster Leader IP address>/alertmanager | Alert Manager |
Configure Self-monitoring System
All communication between Prometheus, Alertmanager, and the various scrape nodes is secured via basic_auth provided by Nginx. The following command can be used to configure the basic_auth username and password.
If your appliance is on SevOne NMS version <= 5.7.2.25 and you have configured Prometheus without the SevOne-act prometheus credential-change script, then you must follow these steps.
- Upgrade to SevOne NMS 5.7.2.32.
- Upgrade to SevOne NMS 6.1.0.
- By default, Prometheus will be disabled until you configure it with the SevOne-act prometheus credential-change script (please see the command below). This is a required step.
$ podman exec -it nms-nms-nms /bin/bash
$ SevOne-act prometheus credential-change --username <username> --password <password>
The username and password entered in the command above can be used by SevOne NMS users to view Prometheus and Alertmanager via the Web user interface. The configuration URLs for Prometheus and Alertmanager can be found below.
The command above also auto-generates the username and password that can be used for API communication between prometheus, alertmanager, and scrape nodes.
To access the Prometheus and Alertmanager graphical user interfaces from SevOne NMS, you are required to set the password. Execute the following commands to set it.
The default username is SevOneUser.
Set password for default username SevOneUser
$ podman exec -it nms-nms-nms /bin/bash
$ SevOne-act prometheus credential-change --username SevOneUser --password <password>
If you choose to change the username and password, enter the new <username> and <password> in the following commands.
Example: Set username & password
$ podman exec -it nms-nms-nms /bin/bash
$ SevOne-act prometheus credential-change --username <username> --password <password>
The configuration files can be found in the following location.
File | Description |
---|---|
/config/prometheus/prometheus.yml | The main Prometheus configuration file. It includes the configuration for each peer in the cluster and the available Alert Managers. |
/config/prometheus/rules.yml | Defines the rules to use in order to trigger alerts. |
/config/alertmanager/alertmanager.yml | The Alert Manager configuration file. It is automatically populated with the configured SMTP settings in SevOne NMS upon the start of the service(s). |
Enable / Disable Rule for Alert
Rules to trigger alerts are defined in /config/prometheus/rules.yml file.
Sample section of /config/prometheus/rules.yml file
groups:
  - name: SevOne Rules
    rules:
      - alert: SevOne Core Processes are not running
        expr: core_processes_info{not_running_services!=""}
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "SevOne Core Processes like {{ $labels.not_running_services }} are not running on {{ $labels.peer_ip }}."
          description: "Please run `SevOne-act check services` on {{ $labels.peer_ip }}."
      - alert: SevOne Peer State information unavailable
        expr: peer_state_info{healthy="0"}
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Peer State information is unavailable on a peer"
          description: "{{ $labels.peerstateinfo }}"
To disable an alert from being triggered, open the /config/prometheus/rules.yml file with a text editor of your choice and comment out the alert block you want to disable (by adding # at the start of each line). Save the file and restart the Prometheus service.
Comment out the alert you want to disable in /config/prometheus/rules.yml file
groups:
  - name: SevOne Rules
    rules:
      # - alert: SevOne Core Processes are not running
      #   expr: core_processes_info{not_running_services!=""}
      #   for: 10s
      #   labels:
      #     severity: critical
      #   annotations:
      #     summary: "SevOne Core Processes like {{ $labels.not_running_services }} are not running on {{ $labels.peer_ip }}."
      #     description: "Please run `SevOne-act check services` on {{ $labels.peer_ip }}."
      - alert: SevOne Peer State information unavailable
        expr: peer_state_info{healthy="0"}
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Peer State information is unavailable on a peer"
          description: "{{ $labels.peerstateinfo }}"
To validate the syntax of /config/prometheus/rules.yml file prior to running the prometheus application normally, execute the following command.
Validate syntax of /config/prometheus/rules.yml file
$ podman exec -it nms-prometheus-prometheus /bin/sh
$ promtool check rules /config/prometheus/rules.yml
Checking /config/prometheus/rules.yml
SUCCESS: 1 rules found
The response from this command lets you know how many rules are configured in the file.
Restart Prometheus service
$ systemctl restart nms-prometheus.service
For details on how to configure Prometheus and Alert Manager, please refer to the following configuration URLs.
Service | Configuration URL |
---|---|
Prometheus | https://prometheus.io/docs/prometheus/latest/configuration/configuration/ |
Alert Manager | https://www.prometheus.io/docs/alerting/latest/alertmanager/ |
Set up OOTB Rules for Self-monitoring
- Using a text editor of your choice, open the Prometheus configuration file /config/prometheus/prometheus.yml. The rule_files section lists the path of all the rule files.
Important: After an upgrade, the /config/prometheus/prometheus.yml file is subject to be reset. Although the file can be modified, you must proceed with CAUTION!
Example: /config/prometheus/prometheus.yml file
$ vi /config/prometheus/prometheus.yml

# my global config
global:
  scrape_interval: 1m # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    dc: master1

# Alertmanager configuration
alerting:
  alert_relabel_configs:
    - source_labels: [dc]
      regex: (.+)\d+
      target_label: dc
  alertmanagers:
    - static_configs:
      scheme: http
      file_sd_configs:
        - files:
            - 'alertmanager_peers.json'
          refresh_interval: 1m

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'sevone'
    tls_config:
      insecure_skip_verify: true
    scheme: 'https'
    file_sd_configs:
      - files:
          - 'prometheus_peers.json'
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        regex: '[^/]+(/.*)'              # capture '/...' part
        target_label: __metrics_path__   # change metrics path
      - source_labels: [__address__]
        regex: '([^/]+)/.*'              # capture host:port
        target_label: __address__        # change target
- Using a text editor of your choice, you may open each rule file to see its details. The rules are readable. As required, more rules will be added.
Example: /config/prometheus/rules.yml file
Important: After an upgrade, the /config/prometheus/rules.yml file is subject to be reset. Although the file can be modified, you must proceed with CAUTION!
$ vi /config/prometheus/rules.yml

groups:
  - name: SevOne Rules
    rules:
      - alert: SevOne Core Processes are not running
        expr: core_processes_info{not_running_services!=""}
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "SevOne Core Processes like {{ $labels.not_running_services }} are not running on {{ $labels.peer_ip }}."
          description: "Please run `SevOne-act check services` on {{ $labels.peer_ip }}."
      - alert: SevOne Peer State information unavailable
        expr: peer_state_info{healthy="0"}
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Peer State information is unavailable on a peer"
          description: "{{ $labels.peerstateinfo }}"
      - alert: IngestionPipelineOverload
        expr: incoming_data_points{State="Overload"} > 0
        annotations:
          title: 'Ingestion Pipeline Overload'
          description: 'The incoming data points per second exceeds the pipeline limit.'
        labels:
          severity: 'critical'
      - alert: HighDbInsertionErrorRate
        expr: rate(db_failed_insertions_total[5m]) / rate(db_insertions_total[5m]) > 0.5
        for: 1m
        annotations:
          title: 'SevOne-datad has a high database insertion error rate'
          description: 'The database insertion error rate of SevOne-datad has been higher than 50% for more than 1 minute.'
        labels:
          severity: 'critical'
      - alert: PAS-HSA Pair Split Brain
        expr: SplitBrain_info{notify = "1"}
        for: 10s
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.exported_job }} in Split Brain"
          description: " Current state is {{ $labels.current_state }}\n pair_status: {{ $labels.pair_status }}"
      - alert: SevOne Take Over Action
        expr: take_over_info{healthy="0"}
        for: 1s
        labels:
          severity: critical
        annotations:
          summary: "Peer take over time exceeds set time"
          description: "Please check take over action : {{ $labels.takeoverinfo }}"
Set up Alert Email Configuration
You may optionally configure email addresses to receive alert notifications. From the SevOne NMS user interface, if you have set up the details in Administration > Cluster Manager > Cluster Settings tab > Email subtab, the SMTP settings are written to /config/alertmanager/alertmanager.yml when the alertmanager service is started.
Start alertmanager service
$ systemctl restart nms-prometheus.service
- On the Cluster Leader and the Cluster Leader's HSA, using a text editor of your choice, edit the /config/alertmanager/alertmanager.yml file.
Important: After an upgrade, the /config/alertmanager/alertmanager.yml file is subject to be reset. Although the file can be modified, you must proceed with CAUTION!
$ vi /config/alertmanager/alertmanager.yml
Add new email address(es) to email_configs section.
Example: /config/alertmanager/alertmanager.yml file

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
      - to: foo@gmail.com,bar@gmail.com
        smarthost: smtp.gmail.com:587
        auth_username: foo@gmail.com
        auth_identity: foo@gmail.com
        auth_password: xxxxx
        send_resolved: true
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Important: Only TLS connections to the SMTP server are supported. For details, please refer to https://www.prometheus.io/docs/alerting/latest/configuration/#email_config.
Note: To receive email alerts on multiple email IDs, the to field in the example above can contain email addresses separated by commas.
Optionally, after editing the alertmanager.yml file, it is recommended that you validate the config file for any syntax or other errors before starting or restarting the alertmanager service.
Validate config file
$ podman exec -it nms-prometheus-alertmanager /bin/sh
$ amtool --alertmanager.url=http://localhost/alertmanager check-config \
    /config/alertmanager/alertmanager.yml
- Restart the alertmanager service after updating the alertmanager.yml file.
$ systemctl restart nms-prometheus.service
- Repeat the steps above in this section on the Cluster Leader's HSA.
- Test if the new email configuration is working.
Example
$ podman exec -it nms-prometheus-alertmanager /bin/sh
$ amtool --alertmanager.url=http://localhost/alertmanager alert add Testing123
Important:
- After updating the email settings from SevOne NMS user interface, the alertmanager service must be restarted for the email settings to take effect.
- If you are editing alertmanager.yml manually, make sure it is owned by user/group prometheus.
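If the ownership has changed (for example, after editing the file as root), it can be restored with a standard chown, assuming the user and group are both named prometheus as stated above.
$ chown prometheus:prometheus /config/alertmanager/alertmanager.yml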
Silence Alert for some Duration
During maintenance windows or the upgrade process, alerts must either be disabled or silenced for some time. This can be achieved via the graphical user interface.
Silence Alert with User Interface
To silence an alert with the user interface, open the Alertmanager user interface from your browser.
Example: http://<Cluster Master IP address>/alertmanager/#/alerts
- Click the New Silence button in the upper-right corner to open the New Silence page for the alertname="SevOne Core Processes are not running" alert.
- Based on the example above, alertname="SevOne Core Processes are not running" is an alert and it contains various alert labels such as active_appliance="PRIMARY", capacity="5000", ..., job="sevone", etc.
Note: An alert can contain multiple labels and a label can be a part of more than one alert.
- In the Name field, enter the label name of the alert you want to silence. For example, job.
- In the Value field, enter the value for the label name selected in the Name field. For example, sevone.
- In the Creator field, enter the name of the person who is silencing the alert.
- In the Comment field, enter the reason for silencing the alert. For example, Maintenance Windows, Upgrade Process, etc.
- In the Duration field, enter the period (in hours) you want to silence this alert.
- Click the Create button to create the rule to silence the alert. The Preview Alerts button shows the alerts that match your configuration; click it to confirm which alerts will be affected by the silence configuration.
Example: Silence alert 'SevOne Take Over Action' generated by SevOne NMS
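Silences can also be created from the command line with amtool, which is bundled with Alertmanager; the following is a sketch, assuming the alert name from the example above and a two-hour maintenance window.
$ podman exec -it nms-prometheus-alertmanager /bin/sh
$ amtool --alertmanager.url=http://localhost/alertmanager silence add \
    alertname="SevOne Take Over Action" \
    --duration=2h --author="admin" --comment="Maintenance window"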
Alert Manager High Availability Configuration
High Availability Prometheus Alerting & Notifications
To gracefully handle any machine that is failing, two Prometheus servers and two Alert Managers are running - one on the Cluster Leader and the other on its HSA.
Example: /config/prometheus/prometheus.yml file
# my global config
global:
  scrape_interval: 1m # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    dc: master1 # Note that this is only going to differ by the trailing number on CM's HSA

# Alertmanager configuration
alerting:
  alert_relabel_configs:
    - source_labels: [dc]
      regex: (.+)\d+
      target_label: dc
  alertmanagers:
    - static_configs:
      scheme: http
      file_sd_configs:
        - files:
            - 'alertmanager_peers.json'
          refresh_interval: 1m
- Both the Cluster Leader's and the Cluster Leader's HSA's Prometheus servers talk to both Alert Managers. In addition, the two Prometheus servers have slightly different external labels so that their data does not conflict.
- Alert relabelling ensures that a High Availability pair of Prometheus servers with different external labels sends identically labelled alerts, which the Alert Manager automatically deduplicates. Alert relabelling is applied after external labels.
- High Availability Prometheus Alerting & Notification ensures that if the Cluster Leader's Prometheus or Alert Manager fails, no alerts are missed, as Prometheus and Alert Manager are running on the HSA as well. In the normal case, this also ensures that alerts are deduplicated.
Change Scrape Interval
In the /config/prometheus/prometheus.yml file, the scrape_configs section specifies a set of targets and parameters describing how to scrape them.
- scrape_interval defines how frequently Prometheus scrapes metrics from monitored targets. The default value is 1m. The scrape_interval can be configured globally and then overridden per job, as shown in the sketch after the example below.
Example: /config/prometheus/prometheus.yml
In this example, prometheus is configured to scrape metrics every 1 minute.
# my global config
global:
scrape_interval: 1m # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
external_labels:
dc: master1
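A per-job override is sketched below; the 15s value is illustrative only and should be tuned to your environment, and the job name matches the 'sevone' job from the example configuration above.
scrape_configs:
  - job_name: 'sevone'
    scrape_interval: 15s   # overrides the global 1m scrape_interval for this job only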
Troubleshooting
Verify Correct Metrics are exposed on a peer
To verify if the correct metrics are exposed on a peer, connect to the peer's nginx proxy endpoints for each exposed metric.
$ curl -k -X GET https://localhost:443/metrics/ingestion-resolved
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 1398319.000000
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 3675.000000
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 3675
exposer_request_latencies_sum 1830967.000000
exposer_request_latencies{quantile="0.500000"} 387.000000
exposer_request_latencies{quantile="0.900000"} 406.000000
exposer_request_latencies{quantile="0.990000"} 406.000000
# HELP incoming_data_points Number of data points ingestion pipeline is receiving
# TYPE incoming_data_points gauge
incoming_data_points{Process="SevOne-ingestion-resolved",State="Overload"} 0.000000
incoming_data_points{Process="SevOne-ingestion-resolved",Count="Total",Granularity="30min"} 0.000000
This can be done for all supported metric endpoints.
$ curl -k -X GET https://localhost:443/metrics/push
$ curl -k -X GET https://localhost:443/metrics/node
$ curl -k -X GET https://localhost:443/metrics/datad
$ curl -k -X GET https://localhost:443/metrics/ingestion-resolved
You can also navigate to the URLs from your web browser. Each peer must export these metrics. DNCs normally do not have SevOne-datad and SevOne-ingestion-resolved running on them so these endpoints will fail.
Verify Metrics Endpoints
Each metric endpoint is served by nginx via a proxy configuration in /config/nginx/proxy.d/20_prometheus.conf file.
Example: 20_prometheus.conf file
location ~ ^/metrics/push {
proxy_intercept_errors on;
set $upstream http://127.0.0.1:9091/metrics;
proxy_pass $upstream;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
include /config/nginx/proxy_settings.conf;
proxy_read_timeout 18000;
}
location ~ ^/metrics/node {
proxy_intercept_errors on;
set $upstream http://127.0.0.1:9100/metrics;
proxy_pass $upstream;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
include /config/nginx/proxy_settings.conf;
proxy_read_timeout 18000;
}
location ~ ^/metrics/datad {
proxy_intercept_errors on;
set $upstream http://127.0.0.1:60123/metrics;
proxy_pass $upstream;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
include /config/nginx/proxy_settings.conf;
proxy_read_timeout 18000;
}
location ~ ^/metrics/ingestion-resolved {
proxy_intercept_errors on;
set $upstream http://127.0.0.1:6666/metrics;
proxy_pass $upstream;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
include /config/nginx/proxy_settings.conf;
proxy_read_timeout 18000;
}
Each proxy endpoint redirects to local-only ports as follows.
Pushgateway | 127.0.0.1:9091 |
Node Exporter | 127.0.0.1:9100 |
SevOne-datad | 127.0.0.1:60123 |
SevOne-ingestion-resolved | 127.0.0.1:6666 |
Kafka Exporter | 127.0.0.1:18080 |
MariaDB Exporter | 127.0.0.1:9104 |
If the nginx proxy is not working, try to reach the local port of the peer. You will need to SSH into that peer since the port is not exposed.
Obtain SevOne-datad metrics
$ ssh root@peer1
$ curl -X GET http://127.0.0.1:60123/metrics
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 1275032.000000
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 3774.000000
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 3774
exposer_request_latencies_sum 1748553.000000
exposer_request_latencies{quantile="0.500000"} 574.000000
exposer_request_latencies{quantile="0.900000"} 589.000000
exposer_request_latencies{quantile="0.990000"} 589.000000
# HELP db_insertions_total The number of insertion queries to save data points
# TYPE db_insertions_total counter
db_insertions_total 0.000000
# HELP db_failed_insertions_total The number of failed insertion queries to save data points
# TYPE db_failed_insertions_total counter
db_failed_insertions_total 0.000000
If the local port is reachable but the nginx endpoint is not, then something is wrong with the proxy configuration and must be debugged. If the local port is not reachable as well, then something is wrong with the actual daemon (SevOne-datad, in this case).
Prometheus & Alertmanager Fail to Start
Prometheus and Alertmanager daemons run only on the Cluster Leader and its HSA; they will not run on any other peer. The service responsible for running them is:
- systemctl restart nms-prometheus.service
If you are unable to start the daemons, look at the log files in /var/log/prometheus/ for any obvious reason why the daemons are failing to start.
One important consideration is that the service discovery files /config/prometheus/prometheus_peers.json and /config/prometheus/alertmanager_peers.json need to be owned by the user prometheus. If not, you will see permission denied issues in the logs.
Running the wrapper scripts manually under root while debugging them is not recommended. This will change the ownership of the service discovery files to root and the services will not start due to permission denied errors. You must delete the service discovery files /config/prometheus/*.json or change their ownership to the prometheus user after you are done debugging.
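A sketch of restoring the ownership after debugging, assuming the user and group are both named prometheus:
$ chown prometheus:prometheus /config/prometheus/prometheus_peers.json /config/prometheus/alertmanager_peers.json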
Alerts
You may use amtool, which is bundled with the Alertmanager installation. For additional details, please see https://github.com/prometheus/alertmanager#amtool.
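For example, the currently firing alerts can be listed from the command line as shown below (a sketch using the amtool alert query subcommand).
$ podman exec -it nms-prometheus-alertmanager /bin/sh
$ amtool --alertmanager.url=http://localhost/alertmanager alert query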
Prometheus Service is Down
If prometheus service is down on both the Cluster Leader and its HSA,
- You will not be able to access the prometheus user interface from both the Cluster Leader and its HSA.
- You will not receive any new alerts.
If prometheus service is down on either the Cluster Leader or its HSA,
- You will not be able to access the prometheus user interface on the machine where the prometheus service is down.
- You will receive all alerts.
Alertmanager Service is Down
If alertmanager service is down on both the Cluster Leader and its HSA,
- You will not be able to access the alertmanager user interface from both the Cluster Leader and its HSA.
- You will not receive any new alert emails.
- You will still see new alerts in the Prometheus user interface.
If alertmanager service is down on either the Cluster Leader or its HSA,
- You will not be able to access the alertmanager user interface on the machine where the alertmanager service is down.
- You will receive new alert emails.
References / Related Documents
Document | URL |
---|---|
Alertmanager Configuration | https://www.prometheus.io/docs/alerting/latest/alertmanager/ |
Alertmanager Installation (amtool) | https://github.com/prometheus/alertmanager#amtool |
Node Exporter | https://github.com/prometheus/node_exporter |
Prometheus Configuration | https://prometheus.io/docs/prometheus/latest/configuration/configuration/ |
Pushgateway | https://prometheus.io/docs/practices/pushing/ |
Set up Webhook Notification | https://prometheus.io/docs/alerting/latest/configuration/#webhook_config |
Service Alertmanager | https://www.prometheus.io/docs/alerting/latest/alertmanager/ |
Service Prometheus | https://www.prometheus.io/docs/introduction/overview/ |
TLS connections to SMTP server | https://www.prometheus.io/docs/alerting/latest/configuration/#email_config |