Real-time anomaly detection using the InfoSphere Streams TimeSeries Toolkit

Create an application to monitor systems for outlier conditions

An anomaly is a deviation from the standard behavior of a system. Automated anomaly detection is highly recommended for critical systems because large systems are difficult to monitor by traditional means: the monitoring process must deal with data that include many variables at each instant. An automated monitoring process lets the user take corrective action as soon as an anomaly is detected, before it can damage the system.

Consider a large data center that must balance its users' competing demands for memory while keeping the data center secure and running. This scenario requires an analytical system that can detect anomalies in memory usage in real time.

Using data center memory use as an example, this article illustrates a method for implementing operators in the InfoSphere Streams TimeSeries Toolkit to set up real-time anomaly detection using the outlier detection approach. The plots used throughout depict samples taken at a particular period.

This article is intended for people who have basic skills in designing and running Streams Processing Language (SPL) application jobs from InfoSphere® Streams and who have introductory knowledge of the InfoSphere Streams TimeSeries Toolkit.

Basic components of an anomaly detection system

As shown in Figure 1, an anomaly detection system includes the following steps:

  1. Data preprocessing: Preprocess multidimensional time series data, such as the memory consumption of data center users. Preprocessing can include data normalization and noise removal.
  2. Data decomposition: Decompose time series data for multivariate data analysis. The decomposition can be carried out using a wavelet transform, a fast Fourier transform (FFT), or a discrete cosine transform (DCT). Decomposition reveals the finer details and trends in the input data.
  3. Data tracking or prediction: Tracking involves modeling the expected behavior of the data and computing the difference between that expected behavior and the real-time behavior.
  4. Anomaly detection: An anomaly is an outlier in the behavior of the data. This step computes metrics on the difference between the expected and actual data movement and flags unexpected deviations as anomalies. It uses a probabilistic operator that classifies each input as anomalous or not.
Figure 1. Block diagram of the steps of a real-time anomaly detection system
Image shows flow diagram of anomaly detection steps

Operators used to implement the steps of an anomaly detection system

The following sections cover the operators used for each step, their purpose, and the output of each operator. The anomaly detection application analyzes the memory usage statistics of four data center machines collectively and sends an alert if the usage pattern of any of the four data sources deviates from its expected behavior.

Figure 2 shows the operators used to implement the anomaly detection application. Each operator implements one of the steps illustrated in Figure 1.

Figure 2. Operators used in the anomaly detection system
Image shows TimeSeries operators used in the application

The sample input data for the application is shown in Figure 3.

Figure 3. Data center memory usage statistics used as input to the anomaly detection system
Image shows sample data/memory usage statistics

Data preprocessing step

Data preprocessing is an important step in analyzing time series data because noise and differences in scale between the sources must be removed first to prevent incorrect results. The most common preprocessing steps are normalization and filtering, as shown in Figure 1.

Normalization is the process of transforming the input time series data into zero-mean and unit-variance data. This anomaly detection application monitors the memory usage of four machines in a data center, so the input data is a vector of four time series values. Because the four sources are analyzed collectively, each one must be normalized to bring them all to a common range. The application uses the Normalize operator from the analysis namespace of the TimeSeries Toolkit, as shown in Listing 1. The operator is trained using the initial 240 samples, which correspond to one day of data: with one reading every 360,000 milliseconds (6 minutes), a day of 86,400,000 milliseconds yields 240 readings.

Listing 1. Normalize each set of time series data to zero mean and unit variance
// use a season of one day = 240 samples to normalize the data
stream<uint64 timepoint,list<float64> normalizedTS,list<float64> inpTS> 
	normalizedStream= Normalize(InpTS) 
	{
		param
			initSamples:		240u;
			inputTimeSeries: inpTS;
		output
			normalizedStream: normalizedTS=normalizedTimeSeries();
	}
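
The arithmetic behind this step can be sketched outside of SPL. The Python fragment below is only an illustration of zero-mean, unit-variance scaling learned from an initial training window; the function names fit_normalizer and normalize are hypothetical, and the sketch does not claim to reproduce how the Normalize operator is implemented internally.

```python
import math

def fit_normalizer(training_samples):
    """Estimate the per-source mean and standard deviation from a training window.

    training_samples is a list of vectors, one vector per time point
    (for example, 240 vectors of four memory readings each).
    """
    n = len(training_samples)
    dims = len(training_samples[0])
    means = [sum(row[d] for row in training_samples) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        variance = sum((row[d] - means[d]) ** 2 for row in training_samples) / n
        # Assumption: a constant source is left unscaled to avoid dividing by zero.
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds

def normalize(sample, means, stds):
    """Scale one incoming vector with the statistics learned during training."""
    return [(x - m) / s for x, m, s in zip(sample, means, stds)]

# Two sources on very different scales end up in a common range.
training = [[100.0, 2000.0], [110.0, 2100.0], [90.0, 1900.0]]
means, stds = fit_normalizer(training)
print(normalize([105.0, 2050.0], means, stds))
```

After scaling, a reading from a machine that reports memory in the thousands and one that reports in the hundreds deviate from their own means by comparable amounts, which is what makes the later collective analysis meaningful.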

Noise filtering passes the output of the normalization step through the DSPFilter operator to remove noisy variations. The coefficients for the DSPFilter operator are chosen to implement an exponential smoothing algorithm, as shown in Listing 2.

Listing 2. Exponential smoothing filter to smooth out noisy variation
stream<uint64 timepoint,list<float64> filteredTS,list<float64> inpTS> 
	filteredStream = DSPFilter(normalizedStream) 
	{
		param
			inputTimeSeries: normalizedTS;
			xcoef: {0u:0.9};
			ycoef: {0u:1.0, 1u:-0.1};
		output
			filteredStream: filteredTS=filteredTimeSeries();
	}
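
To see why these coefficients amount to exponential smoothing, it helps to write the filter as a difference equation. The Python sketch below assumes the usual digital-filter convention, in which the ycoef terms sit on the output side of the equation and the xcoef terms on the input side; whether DSPFilter uses exactly this sign convention is an assumption here, and iir_filter is a hypothetical helper, not a toolkit function.

```python
def iir_filter(samples, xcoef, ycoef):
    """Apply ycoef[0]*y[n] + ycoef[1]*y[n-1] + ... = xcoef[0]*x[n] + xcoef[1]*x[n-1] + ...

    Outputs and inputs before the first sample are treated as zero.
    """
    outputs = []
    for n in range(len(samples)):
        # Feed-forward part: weighted sum of current and past inputs.
        acc = sum(b * samples[n - k] for k, b in enumerate(xcoef) if n - k >= 0)
        # Feedback part: move the past-output terms to the right-hand side.
        acc -= sum(a * outputs[n - k] for k, a in enumerate(ycoef)
                   if k >= 1 and n - k >= 0)
        outputs.append(acc / ycoef[0])
    return outputs

# With the coefficients from Listing 2 this reduces to
# y[n] = 0.9 * x[n] + 0.1 * y[n-1], an exponential smoother.
smoothed = iir_filter([1.0, 1.0, 1.0, 1.0], xcoef=[0.9], ycoef=[1.0, -0.1])
print(smoothed)
```

Under that convention the two weights, 0.9 on the new sample and 0.1 on the previous output, sum to 1, so a constant input is eventually passed through unchanged while sample-to-sample jitter is damped.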

Figure 4 shows the input data after normalization and filtering.

Figure 4. Filtered and normalized memory usage data
Image shows line graph of the filtered data by source

Data decomposition step

The multidimensional input time series data can be decomposed for multivariate analysis using the discrete wavelet transform (DWT) operator. The wavelet transform maps the original time series data into a space where the general trends and the fine details of the data are more prominent. The DWT operator code is shown in Listing 3.

Listing 3. Wavelet transform to perform multivariate analysis
stream<uint64 timepoint,list<float64> transformedTS,list<float64> inpTS> 
transformedStream = DWT(filteredStream) 
{	
	param
		inputTimeSeries:		filteredTS;
	output
		transformedStream: transformedTS=DWTTransform();
}
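
As a rough illustration of how a wavelet transform separates trend from detail, the Python sketch below performs a single level of the Haar transform, the simplest wavelet. The choice of the Haar wavelet and the name haar_dwt are assumptions made for this example; the article does not state which wavelet family the DWT operator applies.

```python
import math

def haar_dwt(samples):
    """One level of the Haar wavelet transform.

    Returns (approximation, detail): scaled pairwise sums capture the
    general trend, scaled pairwise differences capture the fine detail.
    """
    if len(samples) % 2:
        raise ValueError("a Haar step needs an even number of samples")
    scale = 1.0 / math.sqrt(2.0)
    approximation = [(samples[i] + samples[i + 1]) * scale
                     for i in range(0, len(samples), 2)]
    detail = [(samples[i] - samples[i + 1]) * scale
              for i in range(0, len(samples), 2)]
    return approximation, detail

trend, detail = haar_dwt([4.0, 6.0, 10.0, 12.0])
print(trend, detail)
```

The approximation coefficients follow the slow rise in the input, while the detail coefficients stay small and constant, which is the separation of trends from fine details that the decomposition step relies on.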

Figure 5 shows the transformed time series data that is the output of the DWT operator.

Figure 5. Transformed time series data shows general trends
Image shows line graph of the transformed data by source

Data prediction

The data prediction step takes the finer details and trends produced by the data decomposition step and models them to predict future trends. Trend prediction is a vital step in anomaly detection because any input that varies from the predicted trend line signals a possible anomaly.

The data prediction operators, such as FMPFilter (shown in Listing 4), can also flag anomalies by using threshold values specified by the user. If the input does not fall within the acceptable threshold around the predicted trend line, it is flagged as an anomaly. Note that when a threshold value is fixed by the user, the threshold does not evolve based on the input. Therefore, to make the solution more dynamic and to fine-tune the result, the output of trend predictors can be further modeled using a probabilistic algorithm for finding outliers, such as the GMM operator described in the next section.

Listing 4. Polynomial filter of degree 1 to predict the next sample
 stream<uint64 timepoint, list<float64> predictedTS,list<float64> transformedTS,
 list<boolean> flags,list<float64> inpTS> predictedStream = FMPFilter(transformedStream) 
{
	param
		inputTimeSeries: transformedTS;
		memoryLength: 5u;
		degree: 1u;
		integration:3u;
		thresholdFactor:2.5;
	output
		predictedStream: predictedTS=predictedTimeSeries(),
		flags=anomalousFlags();
}
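
The idea of predicting the next sample with a degree-1 polynomial and flagging large deviations can be sketched as follows. This Python fragment uses the textbook degree-1 fading-memory (g-h) filter, with gains g = 1 - theta^2 and h = (1 - theta)^2. The fading factor theta, the way it would relate to memoryLength, and the rule of flagging a residual that exceeds threshold_factor times the running RMS of earlier residuals are all assumptions made for illustration, not the documented behavior of FMPFilter.

```python
import math

def fmp_degree1(samples, theta=0.8, threshold_factor=2.5):
    """One-step-ahead prediction with a degree-1 fading-memory filter.

    Returns (predictions, flags) for samples[1:], where a flag marks a
    sample that is far from its prediction relative to past residuals.
    """
    g = 1.0 - theta ** 2        # gain applied to the level estimate
    h = (1.0 - theta) ** 2      # gain applied to the slope estimate
    level, slope = samples[0], 0.0
    predictions, flags = [], []
    squared_sum, count = 0.0, 0
    for x in samples[1:]:
        predicted = level + slope
        residual = x - predicted
        rms = math.sqrt(squared_sum / count) if count else 0.0
        # Hypothetical rule: flag only once a non-zero residual history exists.
        flags.append(rms > 0.0 and abs(residual) > threshold_factor * rms)
        predictions.append(predicted)
        # Correct the level and slope using the prediction error.
        level = predicted + g * residual
        slope = slope + h * residual
        squared_sum += residual ** 2
        count += 1
    return predictions, flags

series = [1.0, 1.1, 0.9] * 5 + [5.0, 1.0]
predictions, flags = fmp_degree1(series)
print([i + 1 for i, flagged in enumerate(flags) if flagged])
```

Because the threshold in this sketch is a fixed multiple chosen up front, it illustrates the limitation noted above: the rule itself does not adapt to the data, which is the motivation for the probabilistic step that follows.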

Figure 6 depicts the predicted trend line plot using output from the FMPFilter operator.

Figure 6. Predicted trends
Image shows line graph of predicted trends by source

Anomaly detection step

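The difference between the predicted trend and the current input can be modeled using a probabilistic algorithm such as a Gaussian mixture model (GMM), which estimates the probability that the input data from a given time series is an outlier. This outlier probability score determines whether the given multivariate input is treated as an outlier. Listing 5 first computes an anomaly score as the distance between the predicted and actual data and then passes that score to the GMM operator.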

Listing 5. Anomaly score and outlier probability computation
	// compute the anomaly score as the distance between expected data and real data
	// scoreType is assumed to be defined elsewhere in the application as
	// tuple<uint64 timepoint, float64 score, list<float64> inpTS>
 stream<scoreType> scoreStream = Custom(predictedStream)
	{
		logic
		state:{
			 mutable scoreType T1;
			 mutable float64 metric=0.0;
			 }
		onTuple predictedStream:
		{
		metric=EuclidDistance(predictedTS, transformedTS);
		T1={timepoint=timepoint,score=metric,inpTS=predictedStream.inpTS};
		submit(T1,scoreStream);
		}
	}
	
       // use the GMM to estimate the probability of an anomaly, given the score
       // the higher the probability, the more likely an anomaly at the given sample
stream <uint64 timepoint, float64 anomalyProbability,list<float64> inpTS>
 anomalyDetect=GMM(scoreStream)
	{
		param
		inputTimeSeries:score;
		
		trainingSize:2600u;
		output
		anomalyDetect:anomalyProbability=outlierProbability();
	}
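
The two stages of Listing 5 can be approximated in a few lines of Python. The sketch below computes a Euclidean distance score and then fits a small one-dimensional Gaussian mixture to training scores with expectation-maximization. The number of components, the deterministic initialization, and in particular the definition of the outlier probability (one minus the closeness to the nearest mixture component) are assumptions made for this illustration; the GMM operator may compute outlierProbability() differently.

```python
import math

def euclid_distance(a, b):
    """Distance between the predicted vector and the observed vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_gmm_1d(scores, components=2, iterations=50, min_std=1e-3):
    """Fit a 1-D Gaussian mixture to the training scores with EM."""
    n = len(scores)
    lo, hi = min(scores), max(scores)
    # Deterministic start: spread the means evenly over the score range.
    means = [lo + (hi - lo) * (k + 0.5) / components for k in range(components)]
    centre = sum(scores) / n
    spread = math.sqrt(sum((s - centre) ** 2 for s in scores) / n)
    stds = [max(spread, min_std)] * components
    weights = [1.0 / components] * components
    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            dens = [w * gaussian_pdf(s, m, sd)
                    for w, m, sd in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and spread of each component.
        for k in range(components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            stds[k] = max(math.sqrt(var), min_std)  # floor avoids a collapsed component
            weights[k] = nk / n
    return weights, means, stds

def outlier_probability(score, model):
    """Assumed definition: 0 at a component mean, approaching 1 far from all of them."""
    _, means, stds = model
    closeness = max(math.exp(-0.5 * ((score - m) / sd) ** 2)
                    for m, sd in zip(means, stds))
    return 1.0 - closeness

training = [0.1, 0.2, 0.15, 0.12, 0.18, 1.0, 1.1, 0.9, 1.05, 0.95]
model = fit_gmm_1d(training)
print(outlier_probability(0.15, model), outlier_probability(5.0, model))
```

Unlike a fixed threshold, the mixture is learned from the scores themselves, so what counts as an unusually large distance depends on the data seen during training.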

Figure 7 depicts the anomaly detection results produced from the output of the GMM operator, as seen in the InfoSphere Streams console.

Figure 7. Anomaly detection plot as seen in InfoSphere Streams console
Image shows x axis: Timestamp and y-axis: Memory usage

Conclusion

The solution described in this article is best suited for detecting general anomalies, such as short-term drifts or sudden spikes. This solution is not designed to detect long-term or gradual drift or other exceptional conditions.

This article describes a quick and effective way to build an anomaly detection application using operators from the InfoSphere Streams TimeSeries Toolkit. Use the concepts and steps covered here to build your own anomaly detection application.


publish-date=11052013