Synchronize data with control signals in the InfoSphere Streams Time Series Toolkit

Preserve model relevance even with noisy input time series data

InfoSphere® Streams is the real-time component of the IBM big data platform. It provides the platform and toolkits for building real-time analytical solutions. The Time Series Toolkit included with Streams provides operators for preprocessing, analyzing, and modeling time series data in real time. Modeling operators in the toolkit use incoming time series data to build an internal model for forecasting or tracking. In real-world scenarios, the incoming data can become noisy, and noisy data should be kept out of the model-building process. Once the incoming data is clean again, the model might have to be retrained. This article describes how to use the control port feature to synchronize model building and operator behavior with the quality of the incoming data.

Bharath Kumar Devaraju (bhdevara@in.ibm.com), Software Engineer, IBM China

Bharath Kumar Devaraju has worked with IBM since 2009 and is currently working on InfoSphere Streams toolkit development. He is a QualityStage and DataStage certified solution developer. He has worked extensively on customer POCs, and assisted in pre-sales activities for growth markets.



Dattaram Rao (dattarao@in.ibm.com), Staff Software Engineer, IBM China

Dattaram Rao is currently working on the Time Series Toolkit in InfoSphere Streams and has 9 1/2 years of experience in IT. He works on data mining and analytics and has also worked extensively on system software development.



11 June 2013

Also available in Chinese

Introduction

The InfoSphere Streams Time Series Toolkit provides many operators for building forecasting, tracking, regression, and prediction models. In some real-life scenarios, the input time series can shift to a different band, become noisy over time, or start to have missing data. Models built from such poor-quality data perform poorly. It is therefore necessary either to rebuild the model or to suspend updates to the model's parameters while the data quality is poor. Doing so at run time is challenging.

The Time Series Toolkit's modeling operators support model retraining, suspension, and resumption through a control port to which specific control signals can be sent. Once an abnormality or change is detected in the input data, a control signal can be sent to the modeling operator to change its behavior. However, this control signal must be synchronized with the data being monitored, or anomalous data will spill over into the model. For example, a slight delay in the control signal can cause the model to train on bad data. In a streaming environment, controlling this delay is tricky because no guarantee can be given about the speed at which control signals and data move between operators.
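
The spill-over effect can be illustrated outside Streams with a few lines of plain Python. This is only a sketch under stated assumptions: the function name, the sample values, and the idea of measuring delay in whole samples are hypothetical and are not part of the toolkit. It shows which samples a model would train on when the Suspend and Resume signals arrive in step with the data, and when they arrive one sample late.

```python
def samples_used_for_training(samples, threshold, signal_delay):
    """Return the samples a model trains on when each Suspend/Resume signal
    arrives `signal_delay` samples after the data that triggered it.
    (Hypothetical helper for illustration; not a toolkit function.)"""
    used = []
    for i, value in enumerate(samples):
        # The model only knows the quality of the sample that is
        # `signal_delay` positions behind the one it is reading now.
        j = i - signal_delay
        suspended = j >= 0 and samples[j] > threshold
        if not suspended:
            used.append(value)
    return used


data = [100.0, 120.0, 300.0, 310.0, 130.0, 110.0]

# Signals in step with the data: both out-of-band samples are discarded.
in_sync = samples_used_for_training(data, threshold=250.0, signal_delay=0)

# Signals one sample late: 300.0 leaks into training and 130.0 is lost.
delayed = samples_used_for_training(data, threshold=250.0, signal_delay=1)

print(in_sync)
print(delayed)
```

With a delay of even one sample, the first out-of-band value reaches the model before the Suspend signal does, and the first clean value after the glitch is discarded because the Resume signal has not yet arrived.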

Consider the example of forecasting electricity usage for an area. During the model-building cycle, electricity glitches must be ignored, or they can skew the forecast. The glitch, or out-of-band data, needs to be discarded, and the control port feature can help with this. This article describes how the detection and submission of control signals can be synchronized with the data used for model learning. The high-level solution architecture is shown in Figure 1.

Figure 1. Real-time control signal and data synchronization high-level solution diagram
Image shows data synchronization solution architecture

Prerequisites

  • Business prerequisites: Readers should have basic skills in designing and running SPL application jobs in InfoSphere Streams and introductory knowledge of the Time Series Toolkit.
  • Software prerequisites: InfoSphere Streams 3.1

Time Series Toolkit control signals — What are they?

  • The InfoSphere Streams Time Series Toolkit provides several control signals that can be sent in real time to the modeling operators that support them. The control signals and their significance are listed in Table 1.
Table 1. Various control signals and their significance
Signal     Description
Suspend    Suspends model training. Send this signal when the data might be noisy, to prevent the operator from building an incorrect model.
Resume     Resumes the suspended model training operation.
Retrain    Retrains the model because the trend of the data has changed over time or the model has lost its significance.
Monitor    Reports the current calculated coefficients of the model. This is useful for domain experts who want to diagnose the model, and the coefficients can be preserved for reloading into the operator later.
Load       Loads an existing model into the operator. Use this signal when a pre-existing model seems more relevant to the pattern or trend of the input data.

Each control signal has a specific schema requirement and format. Refer to the InfoSphere Streams Time Series Toolkit documentation for details of the modeling operators that support control signals.
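
One way to read the five signals in Table 1 is as commands to a small state machine. The following plain-Python sketch is an assumption-laden illustration, not the toolkit's implementation: the `ToyModel` class, the `Signal` enum, and the use of a running mean as the "model" are all hypothetical stand-ins chosen to keep the example short.

```python
from enum import Enum


class Signal(Enum):
    # Names follow Table 1; the enum itself is hypothetical.
    SUSPEND = "Suspend"
    RESUME = "Resume"
    RETRAIN = "Retrain"
    MONITOR = "Monitor"
    LOAD = "Load"


class ToyModel:
    """Hypothetical stand-in for a modeling operator.
    The 'model' here is just a running mean of the samples seen."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.suspended = False

    def on_data(self, value):
        # While suspended, the sample is discarded and the model is unchanged.
        if self.suspended:
            return
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def on_control(self, signal, payload=None):
        if signal is Signal.SUSPEND:
            self.suspended = True
        elif signal is Signal.RESUME:
            self.suspended = False
        elif signal is Signal.RETRAIN:
            # Forget what was learned and start again from the next sample.
            self.count, self.mean = 0, 0.0
        elif signal is Signal.MONITOR:
            # Expose the current coefficients so they can be inspected or saved.
            return {"count": self.count, "mean": self.mean}
        elif signal is Signal.LOAD:
            # Replace the current coefficients with previously saved ones.
            self.count, self.mean = payload["count"], payload["mean"]
        return None


model = ToyModel()
model.on_data(100.0)
model.on_data(120.0)
saved = model.on_control(Signal.MONITOR)   # snapshot of the model so far
model.on_control(Signal.SUSPEND)
model.on_data(300.0)                       # ignored while suspended
model.on_control(Signal.RESUME)
model.on_control(Signal.RETRAIN)
model.on_data(200.0)                       # model now reflects only this sample
retrained_mean = model.mean
model.on_control(Signal.LOAD, saved)       # restore the earlier snapshot
```

The real operators define their own schema for each signal, so the payload shape used for Monitor and Load here should be treated purely as a placeholder.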


Synchronizing the signal with the input data

  • When using a control signal, take care that the signal is synchronized with the data to prevent spill-over of anomalous data into the operator's internal model. Synchronization can be achieved in several ways. This article highlights one such method, which uses an SPL Custom operator. The logic in the Custom operator detects out-of-band data and then submits the Suspend signal. Once out-of-band data is detected, the control signal is sent to the modeling operator, after which subsequent input data is ignored until the Resume signal is received.
    Listing 1. Synchronization of data and Time Series Toolkit control signals using an SPL Custom operator in InfoSphere Streams
    // controlSig is used as in the original listing; its tuple type is assumed to be
    // defined elsewhere in the application with the attributes SuspendSignal,
    // suspendValue, and inputts.

    // Read the time series from a CSV file.
    stream<float64 ts> Src = FileSource()
    {
        param
            file: "excg.csv";
    }

    // Slow the input to one tuple per second.
    stream<float64 ts> SrcDel = Throttle(Src)
    {
        param
            rate: 1.0;
    }

    /* The Custom operator below detects noisy data. Here, noisy data is assumed
       to be any value above 250.0. Exchange rate data of 250.0 or below is
       used for modeling and forecasting. */
    stream<controlSig> controlInpSuspend = Custom(SrcDel)
    {
        logic
            onTuple SrcDel:
            {
                if (ts > 250.0)
                {
                    // Out-of-band sample: submit the Suspend signal.
                    controlSig sig = {SuspendSignal = Suspend, suspendValue = 40.0,
                                      inputts = ts};
                    submit(sig, controlInpSuspend);
                }
                else
                {
                    /* In-band sample: submit the Resume signal. If the model was
                       suspended, ARIMA resumes; if not, ARIMA continues to update
                       the model and forecast. */
                    controlSig sig = {SuspendSignal = Resume, suspendValue = 0.0,
                                      inputts = ts};
                    submit(sig, controlInpSuspend);
                }
            }
    }

    /* This ARIMA operator receives the input tuples on its first port and the
       signals on its control port. */
    stream<list<float64> forecast> ARIMALearn = ARIMA(SrcDel; controlInpSuspend)
    {
        param
            inputTimeSeries: ts;
            initSamples: 10u;
            stepAhead: 1u;
            // Attribute on the control port that carries the Suspend and Resume signals.
            controlSignal: SuspendSignal;
        output
            ARIMALearn: forecast = forecastedTimeSeriesStep();
    }
  • Figure 2 shows the output of the SPL program in Listing 1. When the input goes out of band (above 250 units in this use case), forecasting and model updating are suspended and the input sample is discarded. The Resume signal is sent when the input falls back into the allowable band (250 or below), and forecasting of future values continues.
Figure 2. Control signal output plot
Image shows data synchronization solution architecture
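
The per-sample decision made by the Custom operator in Listing 1 can be traced with a short plain-Python sketch. It is a simulation under stated assumptions: the `detect` and `run` functions and the sample values are hypothetical, and the model is reduced to a list of the samples it would train on. Only the threshold of 250.0 and the strict greater-than comparison are taken from the listing.

```python
THRESHOLD = 250.0  # same out-of-band limit as Listing 1


def detect(value):
    """Mirror the comparison in Listing 1: Suspend above the threshold,
    Resume otherwise. (Hypothetical helper, not toolkit code.)"""
    return "Suspend" if value > THRESHOLD else "Resume"


def run(samples):
    """Derive a signal from each sample and return the samples that
    would be used for training."""
    trained_on = []
    for value in samples:
        signal = detect(value)      # signal derived from this very sample
        if signal != "Suspend":
            trained_on.append(value)
    return trained_on


samples = [240.0, 250.0, 251.0, 400.0, 249.9]
signals = [detect(v) for v in samples]
trained = run(samples)

print(signals)
print(trained)
```

Because the comparison is strict, a value of exactly 250.0 is treated as in band, and a signal is produced for every sample rather than only when the state changes.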

Conclusion

This article has explained an effective way to synchronize data with control signals to discard noisy data during forecasting. In this way, model relevance can be preserved even with noisy input time series data.
