# Real-time anomaly detection using the InfoSphere Streams TimeSeries Toolkit

Create an application to monitor systems for outlier conditions

An anomaly is a deviation from the standard behavior of a system. Automated anomaly detection in critical systems is highly recommended because large systems are difficult to monitor with traditional means, given that the monitoring process must deal with data that include many variables at each instant. The monitoring process enables the user to automatically take corrective action when an anomaly is detected, thus preventing it from causing damage to the system.

Consider a large data center that must balance the competing needs of users for memory to keep the data center secure and running. This scenario requires an analytical system that can detect anomalies in memory usage in real time.

Using data center memory use as an example, this article illustrates a method for implementing operators in the InfoSphere Streams TimeSeries Toolkit to set up real-time anomaly detection using the outlier detection approach. The plots used throughout depict samples taken at a particular period.

This article is intended for people who have basic skills in designing and running Streams Processing Language (SPL) application jobs from InfoSphere® Streams and who have introductory knowledge of the InfoSphere Streams TimeSeries Toolkit.

## Basic components of an anomaly detection system

As shown in Figure 1, an anomaly detection system includes the following steps:

**Data preprocessing**— Pre-process multidimensional time series data, such as the memory consumption of data center users. Preprocessing can include data normalization and noise removal.**Data decomposition**— Decompose time series data for multivariate data analysis. The decomposition can be carried out using wavelet transformation, Fast Fourier Transforms (FFT), or discrete cosine transforms (DCT). Decomposition reveals the finer details and trends in the input data.**Data tracking or prediction**— Tracking involves modeling expected behavior of the data and computing the difference between that expected behavior and real-time behavior.**Anomaly detection**— An anomaly is an outlier in the behavior of the data. This step computes metrics on the difference between the normal and actual data movement and flags unexpected deviations as anomalies. This step uses the probabilistic operator, which identifies the given input as an anomaly or not.

##### Figure 1. Block diagram of the steps of a real-time anomaly detection system

## Operators used to implement the steps of an anomaly detection system

The following sections cover the operators used for each step, their purpose, and the output of each operator. In general the anomaly detection application collectively analyzes the memory usage statistics of four data center machines and sends alerts if there is any deviation in usage patterns in the data coming from any of the four data sources.

Figure 2 shows the operators used to implement the anomaly detection application. Each operator implements one of the steps illustrated in Figure 1.

##### Figure 2. Operators used in the anomaly detection system

The sample input data for application is shown below.

##### Figure 3. Data center memory usage statistics used as input to the anomaly detection system

### Data preprocessing step

Data preprocessing is an important step in analyzing time series data
because environmental impediments must be removed first to prevent
incorrect results. The most common preprocessing steps are
*normalization* and *filtering*, as shown in Figure 1.

*Normalization* is the process of transforming the input
time series data into zero-mean and unit-variance data. For this anomaly
detection application, we are monitoring memory usage of four machines in a
data center. Therefore, the input data is a vector of four time series
values. Since we are analyzing them collectively, all the sources have to
be normalized to bring them to a common range. The application uses the
`normalize`

operator from the analysis namespace of the
TimeSeries Toolkit, as shown in Listing 1. The
operator is trained using the initial 240 samples. Because the readings
are taken every 36,000 milliseconds, we will have 240 readings for a day,
as depicted in the following code snippet.

##### Listing 1. Normalize each set of time series data to zero mean and unit variance

// use a season of one day= 240 samples to normalize data stream<uint64 timepoint,list<float64> normalizedTS,list<float64> inpTS> normalizedStream= Normalize(InpTS) { param initSamples: 240u; inputTimeSeries: inpTS; output normalizedStream: normalizedTS=normalizedTimeSeries(); }

*Noise filtering* is the process of using output data from
the normalization process as input to the `DSPFilter`

operator
to remove any noisy variations. The coefficients for
`DSPFilter`

operators are chosen to implement an exponential
smoothing algorithm, as depicted in the code snippet below.

##### Listing 2. Exponential smoothing filter to smooth out noisy variation

stream<uint64 timepoint,list<float64> filteredTS,list<float64> inpTS> filteredStream = DSPFilter(normalizedStream) { param inputTimeSeries: normalizedTS; xcoef: {0u:0.9}; ycoef: {0u:1.0, 1u:-0.1}; output filteredStream: filteredTS=filteredTimeSeries(); }

After preprocessing the input data looks different.

##### Figure 4. Filtered and normalized memory usage data

### Data decomposition step

The multidimensional input time series data can be decomposed for
multivariate analysis using the Discrete Wavelet Transformer (DWT)
operator. The wavelet transform maps the original time series data into a
space where general trends and fine details of the data are made more
prominent. The `DWT`

operator code is depicted below.

##### Listing 3. Wavelet transform to perform multivariate analysis

stream<uint64 timepoint,list<float64> transformedTS,list<float64> inpTS> transformedStream = DWT(filteredStream) { param inputTimeSeries: filteredTS; output transformedStream: transformedTS=DWTTransform(); }

Figure 5 shows the transformed time series data that
is the output of the `DWT`

operator.

##### Figure 5. Transformed time series data shows general trends

### Data prediction

The data prediction step extracts the finer details and trends of the data decomposition phase and models the results to predict future trends. Trend prediction is a vital step in anomaly detection because any input that varies from the trend line signals data incorrectness.

The data prediction operators, such as `FMPFilter`

(shown in
Listing 4), can also predict anomalies by using
threshold values specified by the user. If the input trend line does not
fall within the acceptable threshold, the trend line is flagged as an
anomaly. Note that when a threshold value is fixed by the user, the
threshold does not evolve based on the input. Therefore, to make the
solution more dynamic and to fine-tune the result, the output of trend
predictors can be further modelled using a probabilistic algorithm for
finding outliers, such as the `GMM`

operator described in the
next section.

##### Listing 4. Polynomial filter of degree 1 to predict the next sample

stream<uint64 timepoint, list<float64> predictedTS,list<float64> transformedTS, list<boolean> flags,list<float64> inpTS> predictedStream = FMPFilter(transformedStream) { param inputTimeSeries: transformedTS; memoryLength: 5u; degree: 1u; integration:3u; thresholdFactor:2.5u; output predictedStream: predictedTS=predictedTimeSeries(), flags=anomalousFlags(); }

Figure 6 depicts the predicted trend line plot using
output from the `FMPFilter`

operator.

##### Figure 6. Predicted trends

### Anomaly detection step

The difference between output trend prediction and current input trend can
be modeled using a probabilistic algorithm such as `GMM`

, which
identifies the probability that input data from a given time series is an
outlier. The outlier probability score determines if the given input
multivariate time series data is an outlier.

##### Listing 5. Anomaly score and outlier probability computation

// compute the anomaly score as a distance between expected data and real data stream<scoreType> scoreStream = Custom(predictedStream) { logic state:{ mutable int32 i=0;mutable scoreType T1; mutable float64 metric=0.0; } onTuple predictedStream: { metric=EuclidDistance(predictedTS, transformedTS); T1={timepoint=timepoint,score=metric,inpTS=predictedStream.inpTS}; submit(T1,scoreStream); } } // use the GMM to estimate the probability of an anomaly, given score // the higher the probability, the likely the anomaly at the given sample stream <uint64 timepoint, float64 anomalyProbability,list<float64> inpTS> anomalyDetect=GMM(scoreStream) { param inputTimeSeries:score; trainingSize:2600u; output anomalyDetect:anomalyProbability=outlierProbability(); }

Figure 7 depicts the predicted trend line plot using
output from the `FMPFilter`

operator.

##### Figure 7. Anomaly detection plot as seen in InfoSphere Streams console

## Conclusion

The solution described in this article is best suited for detecting general anomalies, such as short-term drifts or sudden spikes. This solution is not designed to detect long-term or gradual drift or other exceptional conditions.

This article describes a quick and effective way to build an anomaly detection application using operators from the InfoSphere Streams TimeSeries Toolkit. Use the concepts and steps covered here to build your own anomaly detection application.

#### Downloadable resources

#### Related topics

- Read "Synchronize data with control signals in the InfoSphere Streams TimeSeries Toolkit."
- Find more details about various operators and functions available in the TimeSeries Toolkit information center.
- Check out this IBM Redbooks publication titled "IBM InfoSphere Streams Harnessing Data in Motion."
- Learn more about the Streams Processing Language.
- Access a directory of all information on InfoSphere Streams in the InfoSphere Streams Playbook (also available at this tinyurl).
- Download InfoSphere Streams Quick Start Edition, a complimentary, downloadable, non-production version of InfoSphere Streams.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.