Smart Alerts for infrastructure
With Smart Alerts, you can automatically receive alerts based on the infrastructure metrics that you select.
When you select the infrastructure metrics for which you want to receive alerts, Instana suggests the thresholds and the remaining configuration. You can add multiple alert channels to the configuration, and Instana automatically creates a customized alert for you.
Kubernetes-specific Smart Alerts
If you need to monitor Kubernetes environments, you can access a specialized view of infrastructure Smart Alerts that shows only Kubernetes-related alerts. This view is available in the Instana UI and provides a focused experience for Kubernetes administrators.
Adding an alert
To add an alert, complete the following steps:
- From the navigation menu in the Instana UI, select Infrastructure.
- Select the Smart Alerts tab.
- Click Add Smart Alert.
The alert configuration dialog opens, where you can configure the Smart Alert.
The alert configuration process includes the following steps:
- Define the scope.
- Define the threshold for violations.
- Define the time threshold that determines when you are alerted.
- Select the alert channels that are to be notified.
- Define the alert properties.
- Add custom payloads to be included in alerts.
Defining the scope
To define the scope of the alerting, complete the following steps in the Scope section:
- Select a metric from the Metric list by using one of the following options:
- On the List tab, search the metrics list by using keywords.
- On the Regex tab, define the metric scope through regular expressions.
Figure 1. Selecting the metrics
- Define the aggregation as follows:
- Time aggregation: Select the preferred cross-time aggregation. This method combines data points into a single bucket.
- Cross-series aggregations: To sum the buckets across the data series, set Use SUM for cross-series aggregation to on. Usually, the cross-series aggregation is the same as the cross-time aggregation.
- Select one of the following alerting methods:
- Custom aggregation: To group the metric data based on specific custom tags that you define, click Custom aggregation. This method is enabled by default.
Figure 2. Custom aggregation
Use custom aggregation in the following scenarios:
- Monitoring aggregated metrics across multiple entities (for example, average CPU usage across all hosts in a specific zone or region).
- Reducing alert noise by grouping similar entities together.
- Analyzing overall trends and patterns rather than individual entity behavior.
- Tracking overall capacity or resource utilization across a cluster or environment.
Example: Monitor average CPU load across all production hosts grouped by availability zone. You receive a single alert when the average CPU load in a zone exceeds the threshold, rather than individual alerts for each host.
- Per entity alerting: To monitor each metric individually and trigger alerts for each individual entity, click Per entity alerting.
Figure 3. Per entity alerting
Use per entity alerting in the following scenarios:
- Identifying and responding to issues on specific individual entities (for example, a particular host, container, or database instance).
- Providing separate attention and remediation for each entity.
- Monitoring critical resources where individual failures matter.
- Tracking entity-specific SLAs or performance requirements.
Example: Monitor CPU usage on each production database server individually. You receive a separate alert for each database server that exceeds the CPU threshold, allowing you to pinpoint and address the specific problematic instance.
Per entity alerting supports grouping by metric-related tags such as metricId or custom tags such as device, mountpoint, and state. This grouping enables individual alerts for each unique metric variant when you use metric patterns or metrics with custom tags.
Figure 4. Per entity alerting with grouping
- Add filters to further narrow the scope.
- Group the metric results by the available grouping tags. You can use up to 5 tags for grouping metrics.
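The two aggregation stages in the Scope section can be illustrated with a minimal Python sketch. It is not Instana's implementation: the function names, the 60-second bucket size, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean

def time_aggregate(points, bucket_seconds, agg=mean):
    """Cross-time aggregation: combine the points of one series into buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {start: agg(values) for start, values in sorted(buckets.items())}

def cross_series_sum(bucketed_series):
    """Cross-series aggregation: sum the buckets that share a start time."""
    totals = defaultdict(float)
    for buckets in bucketed_series:
        for start, value in buckets.items():
            totals[start] += value
    return dict(totals)

# Hypothetical samples: (timestamp in seconds, metric value) for two hosts.
host_a = [(0, 10.0), (30, 20.0), (60, 30.0)]
host_b = [(0, 1.0), (30, 3.0), (60, 5.0)]

per_host = [time_aggregate(series, 60) for series in (host_a, host_b)]
combined = cross_series_sum(per_host)
```

In this sketch, each host is first averaged per 60-second bucket, and the resulting buckets are then summed across hosts, which mirrors the effect of setting Use SUM for cross-series aggregation to on.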
Defining the threshold
When you set up a Smart Alert for infrastructure, you can choose to use static or adaptive thresholds.
Static
Static thresholds do not change over time. You can set them when you create or modify the Smart Alert. You can provide different thresholds for the warning and critical severities. A static threshold might stop being relevant after the underlying metric changes significantly. In response, you can manually adjust or recalculate the threshold at any point in time. You can select a threshold operator to define the threshold condition.
When to use static threshold
Static thresholds work best in the following situations:
- Regardless of seasonality of the underlying metric, the metric must not exceed or fall below a constant value.
- The underlying metric is seasonal, and therefore different thresholds exist depending on the time of day or week. However, these thresholds themselves don't change over time. Gradual changes to these thresholds over long periods of time are undesirable.
Adaptive
Adaptive thresholds continuously evolve and adjust themselves with new data that Instana observes. This means that the threshold continuously accounts for seasonal changes to the underlying metric without any human intervention. For more information, see the adaptive threshold documentation.
When to use adaptive threshold
Adaptive thresholds work best in the following situations:
- The underlying metric is not seasonal. The threshold is expected to gradually change over time, but any sudden deviation from this trend is undesirable.
- The underlying metric is seasonal and different thresholds exist for different times of the day or week. The thresholds themselves are expected to gradually change over time, but any sudden deviation from this trend is undesirable.
Adaptive threshold requirements
The adaptive threshold requires at least 6 hours of continuous metric data. If this requirement is not fulfilled, you can still create the Smart Alert. Issue detection and alerting start as soon as enough data is available to initialize the model.
Alert preview
After you define the scope and threshold, a chart plots the historical data for the selected metric. A maximum of 7 days of historical data is available for visualization in the chart. You can switch the time range from the last 24 hours up to the last 7 days to visualize the historical variations of the metric data.
Based on the historical data and threshold conditions, the chart displays the alerts that the current threshold value might trigger.
If you select any grouping options, the grouping results might appear in a table below the chart. To analyze the metric data trends in the chart for each group, select the respective rows in the table.
Defining the time threshold
In the Time threshold section, you can add conditions that determine when a violation of the defined threshold triggers an alert.
The following condition is commonly used in practice:
- Persistence over time: Select a time window and the number of consecutive violations. You receive an alert when the metric violates the defined threshold persistently over the defined time window.
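One possible reading of the persistence condition is sketched below in Python. The function name and the rule that only the most recent consecutive violations count are assumptions for illustration; the actual evaluation logic in Instana might differ.

```python
def persistent_violation(values, threshold, required_consecutive):
    """Return True if `values` ends with at least `required_consecutive`
    consecutive samples above `threshold`."""
    streak = 0
    for value in values:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
    return streak >= required_consecutive

# Hypothetical CPU readings within the time window, oldest first.
steady_breach = persistent_violation([50, 95, 96, 97], threshold=90, required_consecutive=3)
interrupted = persistent_violation([95, 96, 50, 97], threshold=90, required_consecutive=3)
```

Requiring several consecutive violations suppresses alerts for short spikes, at the cost of a slightly later notification.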
Optional: Get alerted in advance using forecast alerting
You can set up forecast alerting to receive proactive alerts that help you address potential issues before they affect your system. For example, you might want to receive an alert when a disk is nearing full capacity or when the memory usage of a process is close to the container limits, which might indicate a memory leak. With Instana, you can configure alerts based on metric forecasts.
When you opt for the forecast alerting feature, configure the following two time windows:
- Historical data time window: This window specifies the time frame of the metric that is used to fit the model for forecasting. It allows you to specify whether you care about the short-term or long-term trends of the selected metric.
- Forecasted time window: This window specifies the time frame of the linear forecast that is used for alerting. Larger forecasted time values increase the chance of false alerts.
The following image illustrates these time windows for two example metrics and their linear forecast with different outcome in alerting:
An alert is triggered when either the metric value or the metric forecast exceeds the threshold based on the configured rule.
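The interaction of the two time windows can be sketched in Python as follows. The sketch assumes an ordinary least-squares line as the linear forecast and uses hypothetical function names and sample data; the text does not specify how Instana fits its model, so treat this as an illustration of the idea only.

```python
def linear_forecast(timestamps, values, horizon):
    """Fit y = slope * t + intercept by least squares over the historical
    window and evaluate it `horizon` time units after the last timestamp.
    Requires at least two distinct timestamps."""
    n = len(timestamps)
    mean_t = sum(timestamps) / n
    mean_y = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(timestamps, values))
    slope = cov / var_t
    intercept = mean_y - slope * mean_t
    return slope * (timestamps[-1] + horizon) + intercept

def should_alert(timestamps, values, horizon, threshold):
    """Alert when the latest value or the forecast exceeds the threshold."""
    forecast = linear_forecast(timestamps, values, horizon)
    return values[-1] > threshold or forecast > threshold

# Hypothetical disk usage in percent, sampled once per hour.
hours = [0, 1, 2, 3]
usage = [70.0, 72.0, 74.0, 76.0]

alert_far = should_alert(hours, usage, horizon=6, threshold=85.0)
alert_near = should_alert(hours, usage, horizon=2, threshold=85.0)
```

With the same data and threshold, the 6-hour forecast triggers an alert while the 2-hour forecast does not, which illustrates why a larger forecasted time window alerts earlier but also raises the chance of false alerts.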
Adding alert channels
You can configure different alert channels for different severities in Smart Alerts for infrastructure. To add alert channels, complete the following steps:
- Click Select Alert Channel.
- From the list of preconfigured channels, select the channels through which you want to receive the alerts.
If a threshold value is set for warning and critical severities, you can set the alert channels for each severity. If a threshold value is set for both severities, all the alert channels are selected for the warning severity by default.
Alert channels with both severities configured:
If a threshold value is set only for one severity, the severity is displayed for every alert channel as the alert level.
Alert channels with one severity configured:
For more information about creating channels, see Alert Channels.
Selecting Alert properties
In this section, you can optionally configure various properties related to the alerts created by using the Smart Alert configuration.
Title
Instana suggests a default title based on the selected entity type and metric. However, you can override this title with your custom static text or use a dynamic title by inserting placeholders.
You can insert dynamic placeholders into the alert title in the Insert Placeholder dropdown. These placeholders help identify the context of the alert more clearly when it is triggered.
- You can now include the ${severity} placeholder in the title. This placeholder is useful when you configure multiple severity levels within a single alert. For example, a title such as High CPU Usage - ${severity} indicates the severity level directly in the alert title.
- The available placeholders vary depending on the selected alerting method:
- With custom aggregation, when you group by a tag (zone), you can use the grouped tag as a placeholder (${zone}). This placeholder is now available in the Insert Placeholder dropdown.
- When configuring per-entity alerts, the ${entity.label} placeholder is available. This placeholder identifies the specific entity that triggered the alert.
- When you use metric patterns with regex capturing groups (for example, fs\.(.+)\.free), you can include the captured values as placeholders in the title. The Insert Placeholder menu contains options similar to the following items:
- Regex 1st capturing group for the first capturing group in your pattern
- Regex 2nd capturing group, Regex 3rd capturing group, and so on for additional capturing groups
With these placeholders, you can include parts of the matched metric name in your alert titles dynamically.
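The following Python sketch illustrates how such placeholders could be resolved and what a regex capturing group extracts from a metric name. The render_title function, the sample metric name, and the context values are hypothetical; Instana performs the substitution itself when the alert is triggered.

```python
import re

# Matches ${name}, including dotted names such as ${entity.label}.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def render_title(template, context):
    """Replace each ${name} with its value; unknown names stay unchanged."""
    return PLACEHOLDER.sub(lambda m: context.get(m.group(1), m.group(0)), template)

# Metric pattern from the text; group 1 captures the part between "fs." and ".free".
pattern = re.compile(r"fs\.(.+)\.free")
match = pattern.fullmatch("fs./var/log.free")  # hypothetical metric name
captured = match.group(1) if match else ""

title = render_title(
    "High CPU Usage - ${severity} on ${entity.label}",
    {"severity": "critical", "entity.label": "db-host-1"},  # hypothetical values
)
```

In this sketch, the first capturing group yields the file system part of the metric name, which is the kind of value that the Regex 1st capturing group option inserts into a title.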
Triggers incident
Use the toggle to automatically trigger an Incident when the alert is generated. The alert is recorded as the triggering event for the incident. The incident includes related events and provides recommended actions.
Description
Optionally, add a description for the alert. The description summarizes the alert's purpose and outlines suggested steps for investigation or resolution.
You can now insert dynamic placeholders into the description. Use the Insert Placeholder dropdown and add placeholders that provide contextual information when the alert is triggered.
Adding custom payloads
To include additional payload fields in the notifications that Instana sends for a specific alert configuration, click Add Row in the Custom Payloads section.
Both global custom payloads and alert-specific custom payloads are included in alert notifications if applicable, but the alert-specific configuration takes precedence over the global configuration. As a result, if you use the same key, the value of the global custom payload field is overridden by the alert-specific one.
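The precedence rule can be expressed as a simple dictionary merge. The following Python sketch uses hypothetical keys and values and is only meant to show which value survives when both payloads define the same key.

```python
def merge_custom_payloads(global_payload, alert_payload):
    """Combine both payloads; alert-specific values win on duplicate keys."""
    return {**global_payload, **alert_payload}

# Hypothetical payload fields.
global_payload = {"team": "platform", "environment": "production"}
alert_payload = {"team": "database", "runbook": "disk-cleanup"}

merged = merge_custom_payloads(global_payload, alert_payload)
```

Here the notification carries all three keys, and the team field takes the alert-specific value.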
The following image shows globally defined custom payloads that are used in the alert configuration:
For information about global custom payloads, see Configure Custom Payload Globally.
Currently, the public preview does not support dynamic global custom payloads.
Terraform Support
Instana enables Infrastructure as Code (IaC) capabilities by providing a Terraform resource for managing infrastructure Smart Alerts programmatically. This capability allows DevOps and SRE teams to define, deploy, and maintain alert configurations as code. It helps improve automation and consistency across environments.
For more information about managing infrastructure Smart Alerts by using Terraform, see the Instana infrastructure alert configuration documentation.
FAQ
Why migrate Custom Events on infrastructure metrics to Smart Alerts?
Infrastructure Smart Alerts provide the following advantages over Custom Events:
- More flexible metric selection with support for metric patterns and regex.
- Direct assignment of alert channels with severity-based routing.
- Enhanced grouping and aggregation capabilities.
- Forecast alerting for proactive monitoring.
- Dynamic placeholders in alert titles and descriptions.
- Previews based on historical data when you set up the alert configuration.
How to migrate a Custom Event to a Smart Alert
Before you migrate, note the following limitations:
- Multi-metric Custom Events: Only Custom Events with a single metric can be migrated.
- Aggregation types: Custom Events that use the following aggregations cannot be migrated:
- Relative difference
- Absolute difference
- System Rules: The following built-in system rules cannot be migrated to infrastructure Smart Alerts:
- Offline event detection
- Hosts that do not have matching entities running
- Host availability detection
- Hosts that have an unexpected number of entities running
Semi-automatic migration
- Mark as migrated: Marks a deprecated Custom Event as migrated and disables it. Select this option when you manually migrate a Custom Event to a Smart Alert. After marking the Custom Event as migrated, you can still view it in the list and review its configuration for reference. This function helps track your migration progress and ensures that no Custom Event is migrated more than once.
- Migrate to Smart Alert: Opens the Smart Alert dialog with pre-populated values from the Custom Event. These values can include the name, description, severity, incident flag, metric, evaluation granularity, aggregation, operator, and threshold. Instana attempts to migrate these fields on a best-effort basis. The scope defined in the Custom Event's Dynamic Focus Query (DFQ) or selected entities is transferred by selecting the respective entities or by using tag filters. If the DFQ cannot be fully mapped, a warning message is displayed, and you can then manually adjust the scope. Saving the Smart Alert automatically disables the previous Custom Event and marks it as migrated.
Manual migration
- Metric mapping: When migrating, ensure that you select the equivalent metric in the infrastructure Smart Alert configuration. Use the metric list or regex pattern matching to identify the correct metric.
- Scope and entity selection: Infrastructure Smart Alerts use tag-based filtering instead of Dynamic Focus Queries. To replicate the scope of your Custom Event:
- Identify the entities that are in scope of the Custom Event.
- Use the Add filter option in the Smart Alert configuration to apply equivalent tag filters.
- Choose between Custom aggregation (for aggregated metrics across entities) or Per entity alerting (for individual entity monitoring).
- Aggregation mapping: Map the aggregation type from your Custom Event to the equivalent option in Smart Alerts:
- Time aggregation: Select the cross-time aggregation method (for example, avg, min, max, sum).
- Cross-series aggregation: Enable Use SUM for cross-series aggregation if you want to sum across multiple entities.
- Threshold adjustment: When migrating threshold values, consider the following:
- Metric granularity differences: Custom Events and infrastructure Smart Alerts use different underlying metric rollups:
- Custom Events use 1-second metric streams for time windows under 30 minutes, and 5-second rollups (averaged from 1-second metrics) for time windows of 30 minutes or longer. The aggregation is applied to a sliding window of these values.
- Infrastructure Smart Alerts use evaluation cycles with 10-second rollup values (averaged from 1-second metrics). Each evaluation cycle performs cross-time aggregation (and optionally cross-series aggregation) based on these 10-second rollups.
- Adjust threshold values based on the new evaluation granularity and aggregation type.
- Use the preview chart to validate that your threshold triggers alerts as expected with historical data.
Example for SUM aggregation: If your Custom Event for an ActiveMQ entity had a threshold of 100 messages per second with a 1-minute time window using SUM aggregation, and you are using a 5-minute evaluation granularity with SUM aggregation in the Smart Alert, adjust the threshold to 30,000 messages (100 × 60 × 5).
Example for MEAN aggregation: If your Custom Event used MEAN (average) aggregation with a threshold of 100 messages per second, and you are using MEAN aggregation in the Smart Alert, the threshold value remains approximately the same (100 messages per second or ~1,000 messages per 10-second rollup), as the average is computed over the evaluation window rather than summed.
The key difference: SUM aggregation requires scaling the threshold by the evaluation window duration, whereas MEAN, MIN, and MAX aggregations typically do not require threshold scaling.
- Time threshold mapping: Map the Custom Event grace period and time window to the Smart Alert time threshold options:
- Use Persistence over time to require consecutive violations before alerting.
- The evaluation granularity in Smart Alerts provides more stable metrics compared to per-second granularity in Custom Events, reducing the need for extensive grace periods.
- Alert channel assignment: Unlike Custom Events that rely on alert configurations for routing, infrastructure Smart Alerts allow direct assignment of alert channels with severity-based routing. Assign the appropriate channels for warning and critical severities as needed.
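The threshold adjustment for SUM and MEAN aggregation in the manual migration steps can be checked with a short calculation. The Python sketch below reproduces the arithmetic of the examples (100 × 60 × 5); the function names are hypothetical, and the result is a starting point that you still validate in the preview chart.

```python
def scaled_sum_threshold(per_second_threshold, window_minutes):
    """Scale a per-second SUM threshold to a window given in minutes."""
    return per_second_threshold * 60 * window_minutes

def migrated_threshold(aggregation, per_second_threshold, window_minutes):
    """SUM scales with the window; MEAN, MIN, and MAX keep the same value."""
    if aggregation.upper() == "SUM":
        return scaled_sum_threshold(per_second_threshold, window_minutes)
    return per_second_threshold

# Values from the examples: 100 messages per second, 5-minute evaluation granularity.
sum_threshold = migrated_threshold("SUM", 100, 5)
mean_threshold = migrated_threshold("MEAN", 100, 5)
```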
What are the differences between Custom aggregation and Per entity alerting?
Custom aggregation groups metric data based on tags you define and evaluates the aggregated metric against the threshold. This approach is useful for monitoring overall trends across multiple entities and reducing alert noise. You receive a single alert when the aggregated metric violates the threshold.
Per entity alerting monitors each entity individually and triggers separate alerts for each entity that violates the threshold. This approach is useful when you need to identify and respond to issues on specific entities. You receive individual alerts for each affected entity.
Choose the alerting method based on whether you want to monitor aggregated behavior (custom aggregation) or individual entity behavior (per entity alerting).
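The difference in alert volume between the two methods can be illustrated with a small Python sketch. The readings, host names, zones, and threshold are hypothetical, and the functions only model the grouping behavior described here, not Instana's evaluation engine.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical CPU load readings: (host, zone, cpu_percent).
readings = [
    ("host-1", "zone-a", 95.0),
    ("host-2", "zone-a", 55.0),
    ("host-3", "zone-b", 92.0),
    ("host-4", "zone-b", 94.0),
]
THRESHOLD = 90.0

def custom_aggregation_alerts(rows, threshold):
    """One alert per group (zone) whose average exceeds the threshold."""
    by_zone = defaultdict(list)
    for _host, zone, cpu in rows:
        by_zone[zone].append(cpu)
    return sorted(zone for zone, values in by_zone.items() if mean(values) > threshold)

def per_entity_alerts(rows, threshold):
    """One alert per entity (host) whose own value exceeds the threshold."""
    return sorted(host for host, _zone, cpu in rows if cpu > threshold)

zone_alerts = custom_aggregation_alerts(readings, THRESHOLD)
host_alerts = per_entity_alerts(readings, THRESHOLD)
```

With this data, custom aggregation raises a single alert for zone-b, while per entity alerting raises three alerts, including one for host-1, whose zone average stays below the threshold.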
How does forecast alerting work in infrastructure Smart Alerts?
Forecast alerting uses the following two time windows:
- Historical data time window: Specifies the time frame of past data used to fit the forecasting model.
- Forecasted time window: Specifies how far into the future to forecast.
An alert is triggered when either the current metric value or the forecasted value exceeds the threshold. This proactive approach helps you address potential issues before they impact your system, such as disk space running out or memory leaks approaching container limits.
Larger forecasted time windows increase the chance of false alerts, so balance proactive alerting with alert accuracy based on your specific use case.