Introduction
An anomaly is a deviation from the standard behavior of a system. Automated anomaly detection in critical systems is highly recommended because large systems are difficult to monitor with traditional means, given that the monitoring process must deal with data that include many variables at each instant. The monitoring process enables the user to automatically take corrective action when an anomaly is detected, thus preventing it from causing damage to the system.
Consider a large data center that must balance the competing needs of users for memory to keep the data center secure and running. This scenario requires an analytical system that can detect anomalies in memory usage in real time.
Using data center memory use as an example, this article illustrates how to use operators from the InfoSphere Streams TimeSeries Toolkit to set up real-time anomaly detection using the outlier detection approach. The plots used throughout depict samples taken at a particular period.
This article is intended for people who have basic skills in designing and running Streams Processing Language (SPL) application jobs from InfoSphere® Streams and who have introductory knowledge of the InfoSphere Streams TimeSeries Toolkit.
Basic components of an anomaly detection system
As shown in Figure 1, an anomaly detection system includes the following steps:
 Data preprocessing: Preprocess multidimensional time series data, such as the memory consumption of data center users. Preprocessing can include data normalization and noise removal.
 Data decomposition: Decompose time series data for multivariate data analysis. The decomposition can be carried out using wavelet transformation, Fast Fourier Transforms (FFT), or discrete cosine transforms (DCT). Decomposition reveals the finer details and trends in the input data.
 Data tracking or prediction: Tracking involves modeling the expected behavior of the data and computing the difference between that expected behavior and the real-time behavior.
 Anomaly detection: An anomaly is an outlier in the behavior of the data. This step computes metrics on the difference between the expected and actual data movement and flags unexpected deviations as anomalies. It uses a probabilistic operator, which estimates how likely the given input is to be an anomaly.
Figure 1. Block diagram of the steps of a real-time anomaly detection system
Operators used to implement the steps of an anomaly detection system
The following sections cover the operators used for each step, their purpose, and the output of each operator. In general, the anomaly detection application collectively analyzes the memory usage statistics of four data center machines and sends alerts if there is any deviation in usage patterns in the data coming from any of the four data sources.
Figure 2 shows the operators used to implement the anomaly detection application. Each operator implements one of the steps illustrated in Figure 1.
Figure 2. Operators used in the anomaly detection system
The sample input data for the application is shown in Figure 3.
Figure 3. Data center memory usage statistics used as input to the anomaly detection system
Data preprocessing step
Data preprocessing is an important step in analyzing time series data because noise and differences in scale among the sources must be removed first to prevent incorrect results. The most common preprocessing steps are normalization and filtering, as shown in Figure 1.
Normalization is the process of transforming the input time series data into zero-mean and unit-variance data. For this anomaly detection application, we are monitoring the memory usage of four machines in a data center. Therefore, the input data is a vector of four time series values. Because we are analyzing them collectively, all the sources have to be normalized to bring them to a common range. The application uses the Normalize operator from the analysis namespace of the TimeSeries Toolkit, as shown in Listing 1. The operator is trained using the initial 240 samples. Because the readings are taken every 360,000 milliseconds (6 minutes), we have 240 readings for a day, as noted in the comment in Listing 1.
Listing 1. Normalize each set of time series data to zero mean and unit variance
// use a season of one day = 240 samples to normalize data
stream<uint64 timepoint, list<float64> normalizedTS, list<float64> inpTS> normalizedStream = Normalize(InpTS)
{
    param
        initSamples: 240u;
        inputTimeSeries: inpTS;
    output
        normalizedStream: normalizedTS = normalizedTimeSeries();
}
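To make the zero-mean, unit-variance transformation concrete, here is a minimal Python sketch of the idea behind Listing 1. It is not the Normalize operator's implementation: the function names `fit_normalizer` and `normalize` are hypothetical, and the readings are made-up values standing in for memory usage from two machines.

```python
import math

def fit_normalizer(training_rows):
    """Estimate the mean and standard deviation of each series from the training window."""
    n = len(training_rows)
    width = len(training_rows[0])
    means = [sum(row[j] for row in training_rows) / n for j in range(width)]
    stds = []
    for j in range(width):
        variance = sum((row[j] - means[j]) ** 2 for row in training_rows) / n
        # A constant series has zero deviation; fall back to 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds

def normalize(row, means, stds):
    """Shift and scale one vector of readings with the trained statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

# Made-up readings (illustration only): each row is one timepoint, each column one machine.
training = [[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]]
means, stds = fit_normalizer(training)
normalized = [normalize(row, means, stds) for row in training]
```

In the application, the training window would be the initial 240 samples rather than three rows, and every later reading would be scaled with the statistics learned from that window.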
Noise filtering passes the output of the normalization step through the DSPFilter operator to remove noisy variations. The coefficients for the DSPFilter operator are chosen to implement an exponential smoothing algorithm, as depicted in Listing 2.
Listing 2. Exponential smoothing filter to smooth out noisy variation
stream<uint64 timepoint, list<float64> filteredTS, list<float64> inpTS> filteredStream = DSPFilter(normalizedStream)
{
    param
        inputTimeSeries: normalizedTS;
        xcoef: {0u:0.9};
        ycoef: {0u:1.0, 1u:0.1};
    output
        filteredStream: filteredTS = filteredTimeSeries();
}
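The exponential smoothing recurrence itself is short enough to sketch in Python. The sketch assumes that the coefficients in Listing 2 correspond to y[n] = 0.9 * x[n] + 0.1 * y[n-1]; the exact sign and indexing convention that DSPFilter applies to xcoef and ycoef should be checked against the toolkit documentation. The function name and the zero initial state are assumptions made for illustration.

```python
def exponential_smooth(samples, alpha=0.9):
    """Apply y[n] = alpha * x[n] + (1 - alpha) * y[n-1], taking y[-1] as 0 (an assumption)."""
    smoothed = []
    previous = 0.0
    for x in samples:
        previous = alpha * x + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed

# Made-up series with a single spike; the output follows the input closely
# but carries a small share of the previous output forward.
smoothed = exponential_smooth([1.0, 1.0, 1.0, 5.0, 1.0])
```

With a weight of 0.9 on the current sample, the filter smooths only lightly; a smaller alpha would suppress more noise at the cost of a slower response to genuine changes. For the four-machine vector, the same recurrence would be applied to each series independently.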
Figure 4 shows the input data after normalization and filtering.
Figure 4. Filtered and normalized memory usage data
Data decomposition step
The multidimensional input time series data can be decomposed for multivariate analysis using the Discrete Wavelet Transform (DWT) operator. The wavelet transform maps the original time series data into a space where general trends and fine details of the data are made more prominent. The DWT operator code is depicted in Listing 3.
Listing 3. Wavelet transform to perform multivariate analysis
stream<uint64 timepoint, list<float64> transformedTS, list<float64> inpTS> transformedStream = DWT(filteredStream)
{
    param
        inputTimeSeries: filteredTS;
    output
        transformedStream: transformedTS = DWTTransform();
}
Figure 5 shows the transformed time series data that is the output of the DWT operator.
Figure 5. Transformed time series data shows general trends
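To illustrate how a wavelet transform separates general trends from fine details, the following Python sketch computes one level of the Haar transform. The article does not say which wavelet the DWT operator uses, so Haar is chosen here only because it is the simplest; the function name `haar_dwt` and the sample vector are hypothetical.

```python
import math

def haar_dwt(samples):
    """One level of the Haar wavelet transform for an even-length sequence."""
    if len(samples) % 2 != 0:
        raise ValueError("Haar DWT needs an even number of samples")
    scale = math.sqrt(2.0)
    pairs = range(0, len(samples), 2)
    # Scaled pairwise sums capture the general trend of the data.
    approximation = [(samples[i] + samples[i + 1]) / scale for i in pairs]
    # Scaled pairwise differences capture the fine details.
    detail = [(samples[i] - samples[i + 1]) / scale for i in pairs]
    return approximation, detail

# Made-up four-value vector: the first pair is flat, the second pair differs.
approx, detail = haar_dwt([4.0, 4.0, 2.0, 6.0])
```

The detail coefficient for the flat pair is zero, while the pair that differs produces a non-zero detail value, which is how the transform makes local variation more prominent.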
Data prediction
The data prediction step takes the finer details and trends produced by the data decomposition phase and models them to predict future trends. Trend prediction is a vital step in anomaly detection because any input that deviates from the trend line signals a possible anomaly.
The data prediction operators, such as FMPFilter (shown in Listing 4), can also flag anomalies by using threshold values specified by the user. If the input trend line does not fall within the acceptable threshold, it is flagged as an anomaly. Note that when a threshold value is fixed by the user, the threshold does not evolve based on the input. Therefore, to make the solution more dynamic and to fine-tune the result, the output of trend predictors can be further modeled using a probabilistic algorithm for finding outliers, such as the GMM operator described in the next section.
Listing 4. Polynomial filter of degree 1 to predict the next sample
stream<uint64 timepoint, list<float64> predictedTS, list<float64> transformedTS, list<boolean> flags, list<float64> inpTS> predictedStream = FMPFilter(transformedStream)
{
    param
        inputTimeSeries: transformedTS;
        memoryLength: 5u;
        degree: 1u;
        integration: 3u;
        thresholdFactor: 2.5;
    output
        predictedStream: predictedTS = predictedTimeSeries(), flags = anomalousFlags();
}
Figure 6 depicts the predicted trend line plot using output from the FMPFilter operator.
Figure 6. Predicted trends
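The idea of predicting the next sample with a degree-1 polynomial and flagging samples that stray too far from the prediction can be sketched in Python as follows. This is a stand-in, not the FMPFilter algorithm: it fits an ordinary least-squares line to a short window, and the names `predict_next`, `is_anomalous`, and `typical_error` are hypothetical. How the operator actually scales its thresholdFactor is not described in this article.

```python
def predict_next(window):
    """Fit a least-squares line to the window (two or more samples) and extrapolate one step."""
    n = len(window)
    mean_t = (n - 1) / 2.0
    mean_y = sum(window) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(window))
    variance_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = covariance / variance_t
    intercept = mean_y - slope * mean_t
    return intercept + slope * n

def is_anomalous(actual, predicted, typical_error, threshold_factor=2.5):
    """Flag a sample whose prediction error exceeds a fixed multiple of the typical error."""
    return abs(actual - predicted) > threshold_factor * typical_error

# Made-up window of five samples, matching memoryLength in Listing 4.
window = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = predict_next(window)
small_deviation = is_anomalous(6.2, predicted, typical_error=0.1)
large_deviation = is_anomalous(9.0, predicted, typical_error=0.1)
```

Because the threshold factor is fixed, the boundary between normal and anomalous never adapts to the data, which is the limitation that motivates the probabilistic step in the next section.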
Anomaly detection step
The difference between the predicted trend and the current input trend can be modeled using a probabilistic algorithm such as a Gaussian mixture model (the GMM operator), which estimates the probability that input data from a given time series is an outlier. The outlier probability score determines whether the given multivariate time series input is an outlier.
Listing 5. Anomaly score and outlier probability computation
// compute the anomaly score as a distance between expected data and real data
// (scoreType is not shown in this listing; it is assumed to be declared elsewhere as
// tuple<uint64 timepoint, float64 score, list<float64> inpTS>, and EuclidDistance
// is assumed to be a user-defined helper function)
stream<scoreType> scoreStream = Custom(predictedStream)
{
    logic
        state:
        {
            mutable scoreType T1;
            mutable float64 metric = 0.0;
        }
        onTuple predictedStream:
        {
            metric = EuclidDistance(predictedTS, transformedTS);
            T1 = {timepoint = timepoint, score = metric, inpTS = predictedStream.inpTS};
            submit(T1, scoreStream);
        }
}

// use the GMM to estimate the probability of an anomaly, given the score;
// the higher the probability, the more likely the anomaly at the given sample
stream<uint64 timepoint, float64 anomalyProbability, list<float64> inpTS> anomalyDetect = GMM(scoreStream)
{
    param
        inputTimeSeries: score;
        trainingSize: 2600u;
    output
        anomalyDetect: anomalyProbability = outlierProbability();
}
Figure 7 depicts the anomaly detection plot using output from the GMM operator.
Figure 7. Anomaly detection plot as seen in InfoSphere Streams console
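The two parts of Listing 5 can be illustrated with a short Python sketch: a Euclidean distance score, followed by a probability that the score is unusual compared with the training scores. As a deliberate simplification, the sketch fits a single Gaussian rather than a mixture, so it does not reproduce the GMM operator or its outlierProbability() function. All function names and sample values are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev

def anomaly_score(predicted, actual):
    """Euclidean distance between the predicted and observed vectors."""
    return math.dist(predicted, actual)

def fit_score_model(training_scores):
    """Fit one Gaussian to the training scores (a simplification of a mixture model).

    The training scores must not all be identical, or the spread would be zero.
    """
    return NormalDist(fmean(training_scores), pstdev(training_scores))

def outlier_probability(model, score):
    """Share of the fitted distribution that lies closer to the mean than the score does."""
    distance = abs(score - model.mean)
    return model.cdf(model.mean + distance) - model.cdf(model.mean - distance)

# Made-up values: a four-machine prediction compared with the observed vector.
score = anomaly_score([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
model = fit_score_model([1.0, 2.0, 3.0])
typical = outlier_probability(model, 2.0)
extreme = outlier_probability(model, 10.0)
```

A score at the centre of the training distribution receives a probability of zero, and the probability approaches one as the score moves away from what the model saw during training. Unlike a fixed threshold, the boundary here is learned from the data.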
Conclusion
The solution described in this article is best suited for detecting general anomalies, such as short-term drifts or sudden spikes. This solution is not designed to detect long-term or gradual drift or other exceptional conditions.
This article describes a quick and effective way to build an anomaly detection application using operators from the InfoSphere Streams TimeSeries Toolkit. Use the concepts and steps covered here to build your own anomaly detection application.