Anomaly functions apply a sliding window to a signal of time series data to capture patterns in the signal. The window size parameter sets how many data points each window contains.
In the Analytics Service, a signal represents all data points that are included when the pipeline runs an anomaly function. For example, the pipeline might include 60 data points.
An anomaly function breaks up a signal into segments or windows. It uses the user-defined window size parameter to define the segments. For example, you might set the window size to 12 data points.
To capture each segment, the anomaly function slides the window by one data point (see figure 1). Consecutive segments are therefore offset by one data point and share all of their other data points.
Figure 1. Sliding window
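The segmentation described above can be sketched in a few lines of Python. This is an illustration only, not the Analytics Service implementation: the `sliding_windows` helper and its `step` parameter are hypothetical names, and the sketch assumes that only full windows are kept.

```python
def sliding_windows(signal, window_size, step=1):
    """Return every full window of `window_size` points, advancing by `step`."""
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be positive")
    last_start = len(signal) - window_size
    # range() is empty when the signal is shorter than one window
    return [signal[i:i + window_size] for i in range(0, last_start + 1, step)]


signal = list(range(60))                      # 60 data points, as in the example
windows = sliding_windows(signal, window_size=12)

print(len(windows))                           # 49 full windows (60 - 12 + 1)
print(windows[0][:3], windows[1][:3])         # [0, 1, 2] [1, 2, 3]
```

Under these assumptions, a signal of n data points yields n - window_size + 1 segments. The count that the service produces might differ slightly depending on how it handles the ends of the signal.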
An anomaly function builds an internal catalog of patterns from the segments it captures. The function uses these patterns to match each segment to an existing pattern. For example, the KMeansAnomalyScore function groups similar patterns into categories over time. Then, for each new window, the function finds the closest pattern from its catalog. It subtracts the window signal from the category signal. The result is a noisy signal. The function has some built-in maximum and minimum values based on a normal distribution. If any part of the noisy signal is outside of the normal levels, an anomaly is detected.
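The match-and-subtract idea can be illustrated with a deliberately simplified sketch. It is not the KMeansAnomalyScore implementation: the catalog is supplied by hand instead of being learned, Euclidean distance is assumed as the measure of closeness, and the hypothetical `limit` argument stands in for the built-in maximum and minimum values.

```python
import math


def closest_pattern(window, catalog):
    """Return the catalog pattern with the smallest Euclidean distance to `window`."""
    return min(catalog, key=lambda pattern: math.dist(window, pattern))


def is_anomalous(window, catalog, limit):
    """Flag the window if any point of the residual falls outside +/- `limit`."""
    pattern = closest_pattern(window, catalog)
    # Residual: what is left of the window after removing the matched pattern
    residual = [w - p for w, p in zip(window, pattern)]
    return any(abs(r) > limit for r in residual)


# Hypothetical catalog with two patterns of window size 4
catalog = [[0, 0, 0, 0], [1, 2, 3, 4]]

print(is_anomalous([1, 2, 3, 4.1], catalog, limit=0.5))   # False
print(is_anomalous([1, 2, 9, 4], catalog, limit=0.5))     # True
```

Both windows match the second pattern. The first leaves a small residual, while the second leaves a residual of 6 at the third point and is flagged.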
Table 1 displays a typical window size and minimum window size for each function. It also displays the minimum number of data points that are required to support the typical window size.
Anomaly function | Typical window size | Minimum window size | Minimum number of data points |
---|---|---|---|
FFTbasedGeneralizedAnomalyScore | 12 | 3 | 24 |
GeneralizedAnomalyScore | 12 | 6 | 24 |
KMeansAnomalyScore | 12 | 1 | 24 |
NoDataAnomalyScore | 12 | 6 | 24 |
SpectralAnomalyScore | 12 | 6 | 24 |
SaliencybasedGeneralizedAnomalyScore | 12 | 6 | 24 |
MatrixProfileAnomalyScore | 12 | 6 | 24 |
Cap the window size at the typical window size values. A small window size is preferable because you gather more patterns for analysis. However, if the volume of data points is small, the anomaly detector might not have enough data points to perform the analysis. For example, if you run a KMeansAnomalyScore function with only 12 data points and you set a window size of 12, the function has one segment of data points and nothing to compare this window to.
A minimum number of data points must be available in a pipeline run for the anomaly detectors to work effectively. As a rule, each signal or pipeline run must contain at least twice as many data points as the window size (see Table 1).
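The two constraints, the minimum window size from Table 1 and the rule of at least twice the window size in data points, can be checked before a function is configured. The following sketch uses a hypothetical `check_window_size` helper; the minimum values are copied from Table 1, and nothing here calls the Analytics Service.

```python
# Minimum window sizes copied from Table 1
MIN_WINDOW_SIZE = {
    "FFTbasedGeneralizedAnomalyScore": 3,
    "GeneralizedAnomalyScore": 6,
    "KMeansAnomalyScore": 1,
    "NoDataAnomalyScore": 6,
    "SpectralAnomalyScore": 6,
    "SaliencybasedGeneralizedAnomalyScore": 6,
    "MatrixProfileAnomalyScore": 6,
}


def check_window_size(function_name, window_size, data_points):
    """Return a list of problems; an empty list means both rules are satisfied."""
    problems = []
    minimum = MIN_WINDOW_SIZE[function_name]
    if window_size < minimum:
        problems.append(
            f"window size {window_size} is below the minimum of {minimum}"
        )
    if data_points < 2 * window_size:
        problems.append(
            f"{data_points} data points is fewer than twice the window size "
            f"({2 * window_size})"
        )
    return problems


print(check_window_size("KMeansAnomalyScore", 12, 12))   # one problem reported
print(check_window_size("KMeansAnomalyScore", 12, 60))   # []
```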
For an understanding of the impact that window size has on computational complexity, see Computational complexity of anomaly models.
Before you define the scheduling criteria for an anomaly function, identify the number of data points you expect for each pipeline run. The size of a pipeline run is determined by these factors:
Example:
Your devices send events roughly every 5 seconds. You schedule an anomaly function to run every 5 minutes as part of a pipeline run. You do not include historical data in your analysis. The pipeline for your device type typically completes within 5 minutes, so the start of the next pipeline run is not delayed. The signal therefore has approximately 60 data points. If the signal has 60 data points and the window size is 10, the anomaly function breaks the signal into roughly 50 sliding windows (60 - 10 + 1 = 51 if every full window is counted).
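The arithmetic in this example can be written out as a short sketch. The helper names are hypothetical, and the sketch assumes that a run without historical data covers exactly one schedule interval and that the window slides one data point at a time.

```python
def data_points_per_run(event_interval_s, lookback_s):
    """Estimate how many data points fall inside the period that one run covers."""
    return lookback_s // event_interval_s


def full_windows(data_points, window_size):
    """Count the full windows produced when the window slides one point at a time."""
    return max(data_points - window_size + 1, 0)


# Events every 5 seconds, run every 5 minutes, no historical data
points = data_points_per_run(event_interval_s=5, lookback_s=5 * 60)

print(points)                     # 60
print(full_windows(points, 10))   # 51
```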
Table 2 provides some guidance on how to set window size and configure the schedule parameters for anomaly functions based on the frequency of your data:
Data frequency | Window size | Historical data | Schedule: Critical data | Schedule: Non-critical data |
---|---|---|---|---|
1 event per day | 12 data points | Last 24 days of data | Run once a day | Run every 12 days |
1 event per hour | 12 data points | Last 24 hours of data | Run once an hour | Run every 12 hours |
1 event per 5 min | 12 data points | Last 2 hours of data | Run every 5 minutes | Run every 60 minutes |
1 event per min | 12 data points | Last 1 hour of data | Run every 5 minutes | Run every 12 minutes |
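
As a quick consistency check, the sketch below computes how many data points each historical-data setting in Table 2 supplies and compares the result with twice the window size. It assumes that events arrive at exactly the stated frequency; the `points_in_history` helper is a hypothetical name.

```python
MINUTE, HOUR, DAY = 60, 3600, 86400
WINDOW_SIZE = 12

# (data frequency, seconds between events, seconds of historical data), from Table 2
TABLE_2 = [
    ("1 event per day", DAY, 24 * DAY),
    ("1 event per hour", HOUR, 24 * HOUR),
    ("1 event per 5 min", 5 * MINUTE, 2 * HOUR),
    ("1 event per min", MINUTE, 1 * HOUR),
]


def points_in_history(event_interval_s, history_s):
    """Number of data points contained in the historical period."""
    return history_s // event_interval_s


for label, interval, history in TABLE_2:
    points = points_in_history(interval, history)
    enough = points >= 2 * WINDOW_SIZE
    print(f"{label}: {points} data points, at least twice the window size: {enough}")
```

Under that assumption, the first three rows supply exactly 24 data points and the last row supplies 60, so each row meets the rule of at least twice the window size.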
Verify that the parameters you specify for anomaly models work well in your environment.