Get to know the R-project Toolkit in InfoSphere Streams

Apply complex data mining algorithms to detect patterns of interest in data streams

InfoSphere® Streams addresses a crucial emerging need for platforms and architectures that can process vast amounts of generated streaming data in real time. The R language is popular and widely used among statisticians and data miners for developing statistical software for data manipulation, statistical computations, and graphical displays. Learn about the InfoSphere Streams R-project Toolkit that integrates with the powerful R suite of software facilities and packages.

Sherif Sakr (ssakr@cse.unsw.edu.au), Senior Research Scientist, National ICT Australia

Dr. Sherif Sakr is a senior research scientist in the Software Systems Group at National ICT Australia (NICTA), Sydney, Australia. He is also a conjoint senior lecturer in the School of Computer Science and Engineering at University of New South Wales. He received his doctorate in computer science from Konstanz University, Germany, in 2007. His bachelor's and master's degrees in computer science are from Cairo University, Egypt. In 2011, Dr. Sakr held a visiting research scientist position in the eXtreme Computing Group (XCG) at Microsoft Research, in Redmond, Wash. In 2012, he held a research MTS position in Alcatel-Lucent Bell Labs.



17 September 2013


Overview of the R Project

R is a free integrated suite of software facilities for data manipulation, calculation, and graphical display. It provides effective data handling and storage, along with many other facilities, including:

  • A suite of operators for calculations on arrays and matrices.
  • A large, coherent, and integrated collection of intermediate tools for data analysis operations.
  • A well-developed and simple environment for the S programming language, a statistical programming language for analyzing data, which includes conditionals, loops, user-defined recursive functions, and input and output facilities.

R is a true object-oriented programming language, much like C++ (and others) where objects can be just about anything: a single value, a variable, datasets, lists of several types of objects, etc. R provides a variety of graphical and statistical techniques, such as linear and non-linear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.

The R programming language is highly extensible. It allows users to write new functions and package those functions in an R package (or R library). In practice, the default installation of R and its accompanying packages provides a fully functioning statistical environment in which you can conduct many typical and advanced analyses. In addition, there are more than 5,300 user-contributed packages that provide a vast amount of additional functionality. Therefore, if you need a particular analysis, it is likely there is already an R package devoted to it.

A primary strength of the R environment is the ease with which well-designed plots can be produced, including mathematical symbols and formulae where needed. R also has extensive and powerful graphic facilities.

Using R as a calculator

Use simple arithmetic expressions to get R to calculate a mathematical answer.

Listing 1. R as a calculator for integer summation
  > 1+2 
  ## [1] 3

You can also use mathematical functions, such as sqrt, exp, and log.

Listing 2. R as a calculator for mathematical functions
  > log(0.3/(1-0.3)) 
  ## [1] -0.8472979

Coding data structures in R

In principle, R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. Use the following command to set up vector x consisting of five numbers: 10.4, 5.6, 3.1, 6.4, and 21.7.

Listing 3. Vector assignment in R
  > x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

This is an assignment statement using the function c(), which takes an arbitrary number of vector arguments and arrives at a vector value by concatenating the arguments end to end.

The elementary arithmetic operators are the usual +, -, *, /, and ^ for raising to a power. In addition, all of the common arithmetic functions (log, exp, sin, cos, tan, sqrt) are available with their usual meaning.
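As a brief illustration of the operators and functions just mentioned, the following sketch applies them to the vector x from Listing 3. The names y and z are introduced here only for illustration.

```r
# Vector from Listing 3
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Arithmetic operators work element by element on the whole vector
y <- 2 * x + 1

# Arithmetic functions are also applied to every element
z <- sqrt(x)^2

print(y)
print(length(z))
```

Because the operations are element-wise, y and z have the same length as x.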

Functions:

  • max and min select the largest and smallest elements of a vector, respectively.
  • range is a function whose value is a vector of length two, namely c(min(x), max(x)).
  • length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their product.
  • The statistical function mean(x) calculates the sample mean, which is the same as sum(x)/length(x).
  • var(x) gives sum((x-mean(x))^2)/(length(x)-1) or the sample variance.
  • sort(x) returns a vector of the same size as x with the elements arranged in increasing order. However, there are other, more flexible, sorting facilities available.

The following example defines a simple function that returns the mean of an input vector x and stops with an error if the input is not numeric.

Listing 4. Mean calculation using R
  mymean = function(x) {
    if (!is.numeric(x)) {
      stop("Input Error")
    }
    return(sum(x)/length(x))
  }

  myVar = 1:5
  mymean(myVar)
  ## [1] 3

  mymean("string value")
  ## Error in mymean("string value") : Input Error
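
To make the summary functions listed above concrete, here is a short sketch that applies each of them to the vector x from Listing 3. The comments restate the definitions given in the list.

```r
# Vector from Listing 3
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

max(x)      # largest element
min(x)      # smallest element
range(x)    # vector of length two: c(min(x), max(x))
length(x)   # number of elements
sum(x)      # total of the elements
prod(x)     # product of the elements
mean(x)     # sample mean, same as sum(x)/length(x)
var(x)      # sample variance, sum((x-mean(x))^2)/(length(x)-1)
sort(x)     # elements arranged in increasing order
```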

It is out of the scope of this article to cover all the details and capabilities of the R language. For complete information, refer to the R Project manual (see Resources).


R Project and InfoSphere Streams

InfoSphere® Streams is designed to uncover meaningful patterns from information in motion (data flows) during a window of minutes to hours. The platform provides business value by supporting low-latency insight and better outcomes for time-sensitive applications, such as fraud detection or network management. The types of business problems that InfoSphere Streams is designed to address are fundamentally different from those regularly addressed by common relational databases, standard extract-transform-load (ETL) software platforms, or batch-processing analytic platforms, such as Hadoop.

R-project Toolkit in InfoSphere Streams

InfoSphere Streams is an advanced computing platform that allows user-developed applications to ingest, analyze, and correlate information quickly as it arrives from thousands of real-time sources, handling very high data throughput rates of up to millions of events or messages per second. It includes an R-project Toolkit that enables you to apply complex data mining algorithms to detect patterns of interest in data streams. Learn more and give it a try.

The term stream describes a continuous flow of data from a data source. Stream data takes the form of tuples, each composed of a fixed set of attributes. By contrast, a standard database server generally offers static data models and data, and supports dynamic queries. The data is stored in persistent form, and queries can arrive at any time to answer any question that the historical data and data model can answer.

However, in an InfoSphere Streams environment, data does not need to be persistently stored. Instead, it is generally expected that the observed data volume is so large that you could not afford to make it persistent. Therefore, InfoSphere Streams lets you build applications that can handle the arrival of these huge volumes of data, which must be analyzed and reported upon with low latency, in real time (or close to real time).

Queries in InfoSphere Streams are defined by the streams applications and run continuously until the user cancels them. A streams application is a defined collection of operators, connected by streams:

  • A streams application defines how the runtime should analyze a set of stream data.
  • An operator is a component of processing functionality that takes one or more streams as input, processes the tuples and attributes in the streams, and produces one or more streams as output.
  • SPL, the programming language for InfoSphere Streams, is a distributed data-flow composition language. It is an extensible and full-featured language like C++ or Java™ that supports user-defined data types. The user can write custom functions in SPL or a native language (C++ or Java). The user can also write user-defined operators in C++ or Java.

One of the strengths of the InfoSphere Streams platform is its toolkits, which are collections of assets that facilitate the development of a solution for a particular industry or functionality. Their functionality can be incorporated into any InfoSphere Streams application. Examples of these toolkits include:

  • Mining Toolkit— Enables scoring of real-time data in a streams application. The scoring in the Streams Mining Toolkit assumes (and uses) a predefined model. A variety of model types and scoring algorithms are supported: classification, regression, clustering, and association, among others.
  • Financial Services Toolkit— Delivers competitive advantage to financial institutions.
  • Database Toolkit— Provides the means to enable streams applications to connect to different data stores such as solidDB, Netezza, DB2®, and Oracle.
  • Internet Toolkit— Provides the means for receiving text-based data from remote sources using HTTP, and generates an input stream from this content.

InfoSphere Streams 3.1 supports R analytics, which allows users to use R scoring in a native format while taking advantage of the scalability and parallelism features of InfoSphere Streams to achieve high throughput. R support extends the InfoSphere Streams family of analytic options. The following sections give an overview of the R-project Toolkit in InfoSphere Streams.


R-project Toolkit

The R-project Toolkit provides an operator that facilitates integration between InfoSphere Streams and the R environment. A cornerstone of the tool is the RScript operator, which is capable of invoking a user-defined R script each time a tuple is received on the required input port.

In particular, it maps the attributes of each input tuple to objects that can be used in R commands. It then runs a script that contains R commands and maps the objects that are output from the script to output tuple attributes.

How the RScript operator works

When a tuple is received on the required input port, the operator uses the streamAttributes parameter to map the input tuple attributes to the objects specified in the rObjects parameter. The operator then runs the script specified in the rScriptFileName parameter and processes the results. The operator uses the custom output function fromR to map the values produced by the output statements in the R script to output tuple attributes.

The operator supports an optional input port that accepts an rstring attribute specifying the name of an R script. This script is run once. You can use the script to update or replace the analytic code in the initialization or processing scripts. For example, you can run R commands that refresh the model used for scoring, or you can replace an R function definition. The operator also supports an optional output port that is used to capture error information that occurs while running the R script. In addition, the operator receives the following parameters:
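
As a rough sketch of the kind of update script that could be sent through the optional input port, the R code below refits a model and replaces a function definition. Everything in it is an assumption made for illustration: the file name refresh.r, the in-memory data, and the helper predictGrowth are hypothetical, and the model name growthModel is borrowed from the fertilizer example later in this article. It also assumes the update script runs in the same R session as the processing script.

```r
# refresh.r: hypothetical update script (the name is an assumption)

# Illustrative replacement data; a real script would load fresh observations.
newGrowthData <- data.frame(
  fertilizer = c(1, 2, 3, 4, 5),
  height     = c(3, 5, 7, 9, 11)   # made-up values on the line 2 * fertilizer + 1
)

# Refresh the model under the same name so later scoring uses the new fit.
growthModel <- lm(height ~ fertilizer, data = newGrowthData)

# Replace a function definition in the same way (hypothetical helper).
predictGrowth <- function(f) {
  unname(predict(growthModel, list(fertilizer = f)))
}
```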

  • rScriptFileName— Provides information about the path of an R script to run for each incoming tuple.
  • streamAttributes— Describes a list of expressions to use for mapping to R objects used within the R script.
  • rObjects— Provides a list of strings that represent the names of R objects that will be populated before the R script is run. Must be a 1:1 mapping with streamAttributes parameter.
  • initializationScriptFileName (optional) — Provides information about the path of an R script to run during operator initialization.
  • rCommand (optional) — Allows the user to specify different path and options when invoking the R shell.

The RScript operator supports all primitive SPL data types (integer, string, double, and time). Furthermore, it can handle mapping between SPL data types and R vectors.

RScript operator in action

The following code shows a simple example of using the RScript operator for calling an R script that sums two integer input parameters.

Listing 5. RScript in action: Integer summation
 stream<int32 a, int32 b, int32 c> analyzedStream =
 	RScript(inStream) {
 	param
 		rScriptFileName : "../process.r" ;
 		streamAttributes : a, b;
 		rObjects : "in1", "in2";
 	output
 		analyzedStream:
			c = fromR("out1");
 }

In the code above, process.r is an R script that receives the two parameters (a and b) after mapping them to in1 and in2 parameters. The returned out1 parameter from the R script is mapped to the parameter c.
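
The article does not show the contents of process.r for this example. A minimal sketch of what it might contain follows, assuming the operator has already bound the tuple attributes a and b to the R objects in1 and in2. The fallback assignments are an addition for illustration so that the script can also be run on its own outside Streams.

```r
# process.r: hypothetical contents for the integer summation example

# Fallback values, used only when the script is run outside Streams
# (inside Streams the operator is expected to create in1 and in2).
if (!exists("in1")) in1 <- 2L
if (!exists("in2")) in2 <- 3L

# The result is left in the object out1, which the output clause reads
# through the output function call on "out1".
out1 <- in1 + in2
```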

The following code is another example that uses a linear model to predict how fertilizer will affect growth.

Listing 6. RScript in action: Applying a prediction model
 stream<int32 identifier, float64 fert, float64 sizeChange,
 	rstring modelSource> analyzedStream = RScript(inStream) {
 	param
 		initializationScriptFileName : "../init.r";
 		rScriptFileName : "../process.r" ;
 		streamAttributes : fert;
 		rObjects : "f";
 	output
 		analyzedStream:
 			sizeChange = fromR("growth");
 }

In the code above, the init.r script runs once during operator startup. It creates the connection to the input data and the linear model growthModel.

 growthData <- read.csv(file="../growthMonday.txt", sep=' ', header=TRUE)
 attach(growthData)
 growthModel = lm(height~fertilizer)

The process.r script runs each time a tuple arrives on the required input port. The fertilizer amount is input through the object f. The script outputs the predicted growth from the linear model (growthModel) in the object growth.

 onerow <- list(fertilizer = f)
 growth = predict(growthModel,onerow)
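
To try the two scripts outside InfoSphere Streams, something like the following dry run can help. It is only a sketch under stated assumptions: the in-memory data frame stands in for ../growthMonday.txt with made-up values, and f is assigned by hand where the operator would normally bind the fert attribute of an incoming tuple.

```r
# Stand-in for init.r: build the model from illustrative in-memory data
# instead of reading ../growthMonday.txt.
growthData <- data.frame(
  fertilizer = c(1, 2, 3, 4, 5),
  height     = c(3, 5, 7, 9, 11)   # made-up values: height = 2 * fertilizer + 1
)
growthModel <- lm(height ~ fertilizer, data = growthData)

# Stand-in for the operator: bind one tuple's fert attribute to the object f.
f <- 6

# Body of process.r, as shown above.
onerow <- list(fertilizer = f)
growth <- predict(growthModel, onerow)

print(unname(growth))
```

Because the made-up data lies exactly on a straight line, the predicted value for f = 6 should be 13 up to floating-point rounding, which makes it easy to check that the scoring step behaves as expected before wiring it into a streams application.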

Conclusion

Helping companies manage, analyze, and benefit from big data is the key focus of the IBM big data platform. InfoSphere Streams, IBM's software platform for storing and processing streaming data, enables integration with new R Project software suites, which offer powerful data manipulation, statistical computation, and graphics display capabilities. If you're ready to get started with InfoSphere Streams and its various toolkits, see Resources for free training materials and software.

Resources

Learn

Get products and technologies

Discuss
