Real-time anomaly detection using the InfoSphere Streams TimeSeries Toolkit

Create an application to monitor systems for outlier conditions

InfoSphere® Streams, which processes data in real time, includes the TimeSeries Toolkit for building real-time analytical solutions. With the TimeSeries Toolkit operators for preprocessing, analyzing, and modeling multidimensional time series data in real time, you can create an anomaly detection application to monitor systems across the domains of cybersecurity, infrastructure, data center management, healthcare, and environment.

Bharath Kumar Devaraju (bhdevara@in.ibm.com), Software Engineer, IBM

Bharath Kumar Devaraju has worked with IBM since 2009 and is currently working on InfoSphere Streams toolkit development. He is a QualityStage and DataStage certified solution developer. He has worked extensively on customer POCs and assisted in pre-sales activities for growth markets.


Dattaram Rao (dattarao@in.ibm.com), Staff Software Engineer, IBM

Dattaram Rao is currently working on the TimeSeries Toolkit in InfoSphere Streams and has 9 1/2 years of experience in IT. He works on data mining and analytics and has also worked extensively on system software development.



05 November 2013

Introduction

An anomaly is a deviation from the standard behavior of a system. Automated anomaly detection in critical systems is highly recommended because large systems are difficult to monitor with traditional means, given that the monitoring process must deal with data that include many variables at each instant. The monitoring process enables the user to automatically take corrective action when an anomaly is detected, thus preventing it from causing damage to the system.

Consider a large data center that must balance the competing needs of users for memory to keep the data center secure and running. This scenario requires an analytical system that can detect anomalies in memory usage in real time.

Using data center memory use as an example, this article illustrates a method for implementing operators in the InfoSphere Streams TimeSeries Toolkit to set up real-time anomaly detection using the outlier detection approach. The plots used throughout depict samples taken at a particular period.

This article is intended for people who have basic skills in designing and running Streams Processing Language (SPL) application jobs from InfoSphere® Streams and who have introductory knowledge of the InfoSphere Streams TimeSeries Toolkit.

Basic components of an anomaly detection system

As shown in Figure 1, an anomaly detection system includes the following steps:

  1. Data preprocessing: Preprocess multidimensional time series data, such as the memory consumption of data center users. Preprocessing can include data normalization and noise removal.
  2. Data decomposition: Decompose time series data for multivariate data analysis. The decomposition can be carried out using wavelet transforms, fast Fourier transforms (FFT), or discrete cosine transforms (DCT). Decomposition reveals the finer details and trends in the input data.
  3. Data tracking or prediction: Model the expected behavior of the data and compute the difference between that expected behavior and the real-time behavior.
  4. Anomaly detection: An anomaly is an outlier in the behavior of the data. This step computes metrics on the difference between the expected and actual data movement and flags unexpected deviations as anomalies. It uses a probabilistic operator (GMM in this application) that estimates how likely a given input is to be an anomaly.

Figure 1. Block diagram of the steps of a real-time anomaly detection system
Image shows flow diagram of anomaly detection steps

Operators used to implement the steps of an anomaly detection system

The following sections cover the operators used for each step, their purpose, and the output of each operator. The anomaly detection application collectively analyzes the memory usage statistics of four data center machines and sends an alert when the usage pattern of any of the four data sources deviates from its expected behavior.

Figure 2 shows the operators used to implement the anomaly detection application. Each operator implements one of the steps illustrated in Figure 1.

Figure 2. Operators used in the anomaly detection system
Image shows TimeSeries operators used in the application

Figure 3 shows the sample input data for the application.

Figure 3. Data center memory usage statistics used as input to the anomaly detection system
Image shows sample data/memory usage statistics

Data preprocessing step

Data preprocessing is an important step in analyzing time series data because noise and differences in scale among the sources must be removed first to prevent incorrect results. The most common preprocessing steps are normalization and filtering, as shown in Figure 1.

Normalization is the process of transforming the input time series data into zero-mean and unit-variance data. This anomaly detection application monitors the memory usage of four machines in a data center, so the input data is a vector of four time series values. Because the sources are analyzed collectively, all of them have to be normalized to bring them to a common range. The application uses the Normalize operator from the analysis namespace of the TimeSeries Toolkit, as shown in Listing 1. The operator is trained using the initial 240 samples, which correspond to one day of data: with a reading taken every 360,000 milliseconds (6 minutes), a day of 86,400,000 milliseconds yields 240 readings.

Listing 1. Normalize each set of time series data to zero mean and unit variance
// use a season of one day = 240 samples to normalize data
stream<uint64 timepoint, list<float64> normalizedTS, list<float64> inpTS>
	normalizedStream = Normalize(InpTS)
{
	param
		initSamples: 240u;
		inputTimeSeries: inpTS;
	output
		normalizedStream: normalizedTS=normalizedTimeSeries();
}
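
To make the arithmetic behind this step concrete, the following Python sketch estimates a mean and standard deviation per source from an initial batch of readings and then rescales each reading. It is a minimal illustration under stated assumptions, not the Normalize operator's implementation: the function names and the sample readings are hypothetical, and the operator may estimate its statistics differently.

```python
from statistics import mean, pstdev

def fit_normalizer(training_rows):
    """Estimate per-source mean and standard deviation from training rows.

    training_rows: list of readings, each a list with one value per source.
    """
    columns = list(zip(*training_rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def normalize(row, means, stds):
    """Map one multi-source reading to zero-mean, unit-variance coordinates."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

# Hypothetical memory readings from two machines, for illustration only.
training = [[100.0, 2000.0], [110.0, 2100.0], [90.0, 1900.0], [100.0, 2000.0]]
means, stds = fit_normalizer(training)
normalized = [normalize(row, means, stds) for row in training]
```

After this rescaling, both hypothetical sources vary over the same range even though their raw values differ by a factor of about 20, which is what allows them to be analyzed collectively.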

Noise filtering passes the output of the normalization step to the DSPFilter operator, which removes noisy variations. The coefficients of the DSPFilter operator are chosen to implement an exponential smoothing algorithm, as shown in Listing 2.

Listing 2. Exponential smoothing filter to smooth out noisy variation
stream<uint64 timepoint,list<float64> filteredTS,list<float64> inpTS> 
	filteredStream = DSPFilter(normalizedStream) 
	{
		param
			inputTimeSeries: normalizedTS;
			xcoef: {0u:0.9};
			ycoef: {0u:1.0, 1u:-0.1};
		output
			filteredStream: filteredTS=filteredTimeSeries();
	}
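
The following Python sketch shows why these coefficients amount to exponential smoothing. It assumes the common difference-equation convention in which the x coefficients weight the inputs and the y coefficients (after the leading one) are subtracted as feedback; whether DSPFilter follows exactly this convention is an assumption here, and the function names and sample values are hypothetical.

```python
def dsp_filter(samples, xcoef, ycoef):
    """Apply y[n] = (sum_k xcoef[k]*x[n-k] - sum_{k>=1} ycoef[k]*y[n-k]) / ycoef[0]."""
    out = []
    for n in range(len(samples)):
        # Feed-forward part: weighted current and past inputs.
        acc = sum(b * samples[n - k] for k, b in xcoef.items() if n - k >= 0)
        # Feedback part: weighted past outputs (missing history counts as zero).
        acc -= sum(a * out[n - k] for k, a in ycoef.items() if k >= 1 and n - k >= 0)
        out.append(acc / ycoef[0])
    return out

def exponential_smoothing(samples, alpha):
    """Textbook recursion: s[n] = alpha*x[n] + (1 - alpha)*s[n-1], with s[-1] = 0."""
    out, prev = [], 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out

# Hypothetical noisy samples from a single source.
noisy = [1.0, 3.0, 2.0, 5.0, 4.0]
filtered = dsp_filter(noisy, {0: 0.9}, {0: 1.0, 1: -0.1})
smoothed = exponential_smoothing(noisy, alpha=0.9)
```

Under that convention, the coefficients in Listing 2 give y[n] = 0.9 x[n] + 0.1 y[n-1], so the two functions produce the same output for a smoothing factor of 0.9.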

Figure 4 shows the input data after normalization and filtering.

Figure 4. Filtered and normalized memory usage data
Image shows line graph of the filtered data by source

Data decomposition step

The multidimensional input time series data can be decomposed for multivariate analysis using the discrete wavelet transform (DWT) operator. The wavelet transform maps the original time series data into a space where general trends and fine details of the data are made more prominent. Listing 3 shows the DWT operator code.

Listing 3. Wavelet transform to perform multivariate analysis
stream<uint64 timepoint,list<float64> transformedTS,list<float64> inpTS> 
transformedStream = DWT(filteredStream) 
{	
	param
		inputTimeSeries:		filteredTS;
	output
		transformedStream: transformedTS=DWTTransform();
}
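
As an illustration of how a wavelet transform separates trends from fine details, the following Python sketch computes one level of the Haar transform, the simplest wavelet. The choice of the Haar wavelet, the function name, and the sample window are assumptions made for this example; the article does not state which wavelet the DWT operator uses.

```python
import math

def haar_dwt(samples):
    """One level of the Haar wavelet transform.

    Returns (approximation, detail) coefficient lists; len(samples) must be even.
    """
    if len(samples) % 2 != 0:
        raise ValueError("Haar DWT needs an even number of samples")
    scale = 1.0 / math.sqrt(2.0)
    pairs = list(zip(samples[0::2], samples[1::2]))
    approximation = [(a + b) * scale for a, b in pairs]  # scaled pair sums: general trend
    detail = [(a - b) * scale for a, b in pairs]         # scaled pair differences: fine detail
    return approximation, detail

# Hypothetical window of filtered samples from a single source.
window = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 5.0, 1.0]
trend, detail = haar_dwt(window)
```

The approximation coefficients follow the slow movement of the window, while the detail coefficients are large only where neighboring samples differ sharply, such as the last pair in this window.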

Figure 5 shows the transformed time series data that is the output of the DWT operator.

Figure 5. Transformed time series data shows general trends
Image shows line graph of the transformed data by source

Data prediction step

The data prediction step takes the finer details and trends produced by the data decomposition step and models them to predict future trends. Trend prediction is a vital step in anomaly detection because any input that deviates from the predicted trend line signals a possible anomaly.

The data prediction operators, such as FMPFilter (shown in Listing 4), can also flag anomalies by using threshold values specified by the user. If the input does not fall within the acceptable threshold around the trend line, it is flagged as an anomaly. Note that a threshold value fixed by the user does not evolve with the input. Therefore, to make the solution more dynamic and to fine-tune the result, the output of trend predictors can be modeled further using a probabilistic algorithm for finding outliers, such as the GMM operator described in the next section.

Listing 4. Polynomial filter of degree 1 to predict the next sample
stream<uint64 timepoint, list<float64> predictedTS, list<float64> transformedTS,
	list<boolean> flags, list<float64> inpTS> predictedStream = FMPFilter(transformedStream)
{
	param
		inputTimeSeries: transformedTS;
		memoryLength: 5u;
		degree: 1u;
		integration: 3u;
		thresholdFactor: 2.5;	// floating-point value, so no u suffix
	output
		predictedStream: predictedTS=predictedTimeSeries(),
			flags=anomalousFlags();
}
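
The idea of predicting the next sample with a degree-1 polynomial and flagging large deviations can be sketched in Python as follows. This is a stand-in, not the FMPFilter algorithm: it fits an ordinary least-squares line over a sliding window, and it assumes that the threshold is a multiple of the spread of earlier prediction errors. The function names, that threshold rule, and the sample series are hypothetical.

```python
from statistics import pstdev

def predict_next(history):
    """Fit a least-squares line to history and extrapolate one step ahead."""
    n = len(history)
    t_mean = (n - 1) / 2.0
    y_mean = sum(history) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    slope = sxy / sxx
    return y_mean + slope * (n - t_mean)

def flag_anomalies(series, memory_length=5, threshold_factor=2.5):
    """Return (prediction, flag) for every sample that has enough history."""
    results, residuals = [], []
    for i in range(memory_length, len(series)):
        predicted = predict_next(series[i - memory_length:i])
        residual = series[i] - predicted
        # Assumed rule: flag when the error exceeds threshold_factor times
        # the spread of earlier errors (needs at least two earlier errors).
        spread = pstdev(residuals) if len(residuals) >= 2 else None
        flagged = spread is not None and abs(residual) > threshold_factor * max(spread, 1e-9)
        results.append((predicted, flagged))
        residuals.append(residual)
    return results

# Hypothetical series: a noisy upward trend that ends in a sudden spike.
series = [-0.1, 1.1, 1.9, 3.1, 3.9, 5.1, 5.9, 7.1, 7.9, 9.1, 20.0]
results = flag_anomalies(series)
```

In this series only the final sample is flagged, because it lands far from the extrapolated trend line while the earlier samples stay within the usual error.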

Figure 6 depicts the predicted trend line plot using output from the FMPFilter operator.

Figure 6. Predicted trends
Image shows line graph of predicted trends by source

Anomaly detection step

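The difference between the predicted trend and the current input trend can be modeled using a probabilistic algorithm such as a Gaussian mixture model (GMM), which estimates the probability that input data from a given time series is an outlier. The outlier probability score determines whether the given multivariate input is treated as an outlier. Listing 5 first computes an anomaly score as the distance between the predicted and actual data, and then passes that score to the GMM operator.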

Listing 5. Anomaly score and outlier probability computation
// compute the anomaly score as the distance between expected data and real data
// scoreType is assumed to be declared elsewhere in the application as
// tuple<uint64 timepoint, float64 score, list<float64> inpTS>
stream<scoreType> scoreStream = Custom(predictedStream)
{
	logic
		state: {
			mutable scoreType T1;
			mutable float64 metric = 0.0;
		}
		onTuple predictedStream:
		{
			metric = EuclidDistance(predictedTS, transformedTS);
			T1 = {timepoint=timepoint, score=metric, inpTS=predictedStream.inpTS};
			submit(T1, scoreStream);
		}
}

// use the GMM to estimate the probability of an anomaly, given the score;
// the higher the probability, the more likely an anomaly at the given sample
stream<uint64 timepoint, float64 anomalyProbability, list<float64> inpTS>
	anomalyDetect = GMM(scoreStream)
{
	param
		inputTimeSeries: score;
		trainingSize: 2600u;
	output
		anomalyDetect: anomalyProbability=outlierProbability();
}
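
The two parts of this step can be sketched in Python as follows. The distance function mirrors the score computation. The probability part is deliberately simplified: it fits a single Gaussian to training scores instead of a mixture, so it only illustrates the idea behind the GMM operator and does not reproduce it. All names and sample values are hypothetical.

```python
import math
from statistics import mean, pstdev

def euclid_distance(predicted, actual):
    """Euclidean distance between the predicted and the actual vector."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)))

def fit_gaussian(training_scores):
    """Estimate the mean and spread of scores observed during training."""
    return mean(training_scores), max(pstdev(training_scores), 1e-9)

def outlier_probability(score, mu, sigma):
    """Share of the fitted Gaussian that lies closer to the mean than score.

    Values near 0 mean the score is typical; values near 1 mean it is extreme.
    """
    z = abs(score - mu) / sigma
    return math.erf(z / math.sqrt(2.0))

# Hypothetical anomaly scores collected while the system behaved normally.
training_scores = [1.0, 1.2, 0.8, 1.1, 0.9]
mu, sigma = fit_gaussian(training_scores)
typical = outlier_probability(1.0, mu, sigma)
extreme = outlier_probability(3.0, mu, sigma)
```

A score close to the training mean yields a probability near 0, and a score many standard deviations away yields a probability near 1. A mixture of several Gaussians follows the same principle but can describe score distributions with more than one normal mode.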

Figure 7 depicts the anomaly detection plot produced from the output of the GMM operator.

Figure 7. Anomaly detection plot as seen in InfoSphere Streams console
Image shows x axis: Timestamp and y-axis: Memory usage

Conclusion

The solution described in this article is best suited for detecting general anomalies, such as short-term drifts or sudden spikes. This solution is not designed to detect long-term or gradual drift or other exceptional conditions.

This article describes a quick and effective way to build an anomaly detection application using operators from the InfoSphere Streams TimeSeries Toolkit. Use the concepts and steps covered here to build your own anomaly detection application.
