Overview of the R Project
R is a free, integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes an effective data handling and storage facility and provides, among other things:
 A suite of operators for calculations on arrays and matrices.
 A large, coherent, and integrated collection of intermediate tools for data analysis operations.
 A well-developed and simple environment for S, a statistical programming language for analyzing data that includes conditionals, loops, user-defined recursive functions, and input and output facilities.
R is an object-oriented programming language, much like C++ and others, in which an object can be just about anything: a single value, a variable, a dataset, or a list of several types of objects. R provides a variety of graphical and statistical techniques, such as linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more.
The R programming language is highly extensible. It allows users to write new functions and package those functions in an R package (or R library). In practice, the default installation of R and its accompanying packages provides a fully functioning statistical environment in which one may conduct any number of typical and advanced analyses. In addition, there are more than 5,300 user-contributed packages that provide a vast amount of additional functionality. Therefore, if you come across a particular analysis, it is likely that an R package is already devoted to it.
A primary strength of the R environment is the ease with which well-designed plots can be produced, including mathematical symbols and formulae where needed. R also has extensive and powerful graphic facilities.
Using R as a calculator
Use simple arithmetic expressions to get R to calculate a mathematical answer.
Listing 1. R as a calculator for integer summation
> 1+2
## [1] 3
You can also use mathematical functions, such as sqrt, exp, and log.
Listing 2. R as a calculator for mathematical functions
> log(0.3/(1-0.3))
## [1] -0.8472979
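As a further illustration (a small sketch, not one of the original listings), the three functions named above can be combined and checked against each other. The variable names are chosen here for illustration only; log is the natural logarithm, so it undoes exp up to floating-point rounding.

```r
# sqrt, exp, and log are base R functions; log() is the natural logarithm.
root <- sqrt(16)        # square root of 16
e1   <- exp(1)          # Euler's number e
back <- log(exp(2.5))   # log undoes exp, up to rounding error

print(root)
print(e1)
print(back)
```

Because the results are floating-point numbers, comparisons such as all.equal(back, 2.5) are generally safer than exact equality tests.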
Coding data structures in R
R operates on named data structures. The simplest such structure is the numeric vector, a single entity consisting of an ordered collection of numbers. Use the following command to set up a vector x consisting of five numbers: 10.4, 5.6, 3.1, 6.4, and 21.7.
Listing 3. Vector assignment in R
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
This is an assignment statement using the function c(), which takes an arbitrary number of vector arguments and arrives at a vector value by concatenating the arguments end to end.
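Because the arguments to c() may themselves be vectors, concatenation can be used to build longer vectors from existing ones. The following sketch illustrates this; the name y is introduced here only for the example.

```r
# c() concatenates its arguments end to end into a single vector.
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Vectors can themselves be arguments: y holds a copy of x,
# then a single 0, then another copy of x (5 + 1 + 5 = 11 elements).
y <- c(x, 0, x)

print(length(y))
print(y)
```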
The elementary arithmetic operators are the usual +, -, *, /, and ^ for raising to a power. In addition, all of the common arithmetic functions (log, exp, sin, cos, tan, sqrt) are available with their usual meaning.
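When applied to vectors, these operators and functions work element by element. The following sketch shows this, using the same five numbers as Listing 3; the result names are illustrative only.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Arithmetic is applied element by element; a single number such as 2
# is recycled across every element of x.
doubled <- 2 * x
squared <- x^2
shifted <- x - 1
ratio   <- x / x      # every element divided by itself
roots   <- sqrt(x)    # functions are applied element by element too

print(doubled)
print(squared)
print(roots)
```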
Functions:
 max and min select the largest and smallest elements of a vector, respectively.
 range is a function whose value is a vector of length two, namely c(min(x), max(x)).
 length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their product.
 The statistical function mean(x) calculates the sample mean, which is the same as sum(x)/length(x).
 var(x) gives sum((x-mean(x))^2)/(length(x)-1), or the sample variance.
 sort(x) returns a vector of the same size as x with the elements arranged in increasing order. However, there are other, more flexible, sorting facilities available.
The following example is for a simple function to produce the mean of an input vector x.
Listing 4. Mean calculation using R
mymean <- function(x) {
  # Reject non-numeric input before attempting any arithmetic
  if (!is.numeric(x)) {
    stop("Input Error")
  }
  return(sum(x) / length(x))
}

myVar <- 1:5
mymean(myVar)
## [1] 3
mymean("string value")
## Error in mymean("string value") : Input Error
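The built-in functions described before Listing 4 can be checked against their definitions in the same way. The following sketch does this for the vector from Listing 3; manual_mean and manual_var are names introduced here for illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# range(x) is the length-two vector c(min(x), max(x)).
rng <- range(x)

# mean(x) and var(x) written out from their definitions.
manual_mean <- sum(x) / length(x)
manual_var  <- sum((x - mean(x))^2) / (length(x) - 1)

# sort(x) returns the elements in increasing order.
sorted <- sort(x)

print(rng)
print(all.equal(manual_mean, mean(x)))
print(all.equal(manual_var, var(x)))
print(sorted)
```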
It is beyond the scope of this article to cover all the details and capabilities of the R language. For complete information, refer to the R Project manual (see Resources).
R Project and InfoSphere Streams
InfoSphere® Streams is designed to uncover meaningful patterns from information in motion (data flows) during a window of minutes to hours. The platform provides business value by supporting low-latency insight and better outcomes for time-sensitive applications, such as fraud detection or network management. The types of business problems that InfoSphere Streams is designed to address are fundamentally different from those regularly addressed by common relational databases, standard extract-transform-load (ETL) software platforms, or batch-processing analytic platforms, such as Hadoop.
The term stream describes a continuous flow of data from a data source. Stream data is in the form of tuples, which are composed of a fixed set of attributes. By contrast, a standard database server generally holds static data and data models and supports dynamic queries: the data is stored in persistent form, and queries can arrive at any time to answer any question that the historical data and data model can address.
In an InfoSphere Streams environment, however, data does not need to be persistently stored. Instead, it is generally expected that the observed data volume is so large that you could not afford to make it persistent. Therefore, InfoSphere Streams supports building applications that can handle the arrival of these huge volumes of data, which must be analyzed and reported upon with low latency and in real time (or close to real time).
Queries in InfoSphere Streams are defined by the streams applications and run continuously until the user cancels them. A streams application is a defined collection of operators, connected by streams:
 A streams application defines how the runtime should analyze a set of stream data.
 An operator is a component of processing functionality that takes one or more streams as input, processes the tuples and attributes in the streams, and produces one or more streams as output.
 SPL, the programming language for InfoSphere Streams, is a distributed data-flow composition language. It is an extensible and full-featured language like C++ or Java™ that supports user-defined data types. The user can write custom functions in SPL or a native language (C++ or Java). The user can also write user-defined operators in C++ or Java.
One of the strengths of the InfoSphere Streams platform is its toolkits, which are collections of assets that facilitate the development of a solution for a particular industry or functionality. Their functionality can be added to any InfoSphere Streams application. Examples of these toolkits include:
 Mining Toolkit: Enables scoring of real-time data in a streams application. The scoring in the Streams Mining Toolkit assumes (and uses) a predefined model. A variety of model types and scoring algorithms are supported: classification, regression, clustering, and association, among others.
 Financial Services Toolkit: Delivers competitive advantage to financial institutions.
 Database Toolkit: Provides the means to enable streams applications to connect to different data stores such as solidDB, Netezza, DB2®, and Oracle.
 Internet Toolkit: Provides the means for receiving text-based data from remote sources using HTTP, and generates an input stream from this content.
InfoSphere Streams 3.1 supports R analytics, which allows users to use R scoring in a native format while taking advantage of the scalability and parallelism features of InfoSphere Streams to achieve exceptional throughput. R support extends the InfoSphere Streams family of analytic options. The following sections give an overview of the R-project Toolkit in InfoSphere Streams.
R-project Toolkit
The R-project Toolkit provides an operator that facilitates integration between InfoSphere Streams and the R environment. A cornerstone of the toolkit is the RScript operator, which is capable of invoking a user-defined R script each time a tuple is received on the required input port.
In particular, it maps the attributes of each input tuple to objects that can be used in R commands. It then runs a script that contains R commands and maps the objects that are output from the script to output tuple attributes.
How the RScript operator works
When a tuple is received on the required input port, the operator uses the streamAttributes parameter to map the input tuple attributes to the objects specified in the rObjects parameter. The operator runs the script specified in the rScriptFileName parameter and processes the results. The operator uses the custom output function fromR to map the values produced by the output statements in the R script to output tuple attributes.
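The three steps described above (map attributes to R objects, run the script, read the results back) can be imitated in plain R. The following is only a sketch of the idea, not the RScript operator or any InfoSphere Streams API: the function run_script_on_tuple and its arguments are hypothetical names, a tuple is modeled as a named list, and the script is passed as text rather than read from a file.

```r
# Plain-R imitation of the mapping steps described above.
# This is NOT the RScript operator; run_script_on_tuple is a hypothetical helper.
run_script_on_tuple <- function(tuple, stream_attributes, r_objects,
                                script_text, output_object) {
  # The two lists must pair up one-to-one.
  stopifnot(length(stream_attributes) == length(r_objects))

  env <- new.env()

  # Step 1: copy each input attribute into the R object paired with it.
  for (i in seq_along(stream_attributes)) {
    assign(r_objects[i], tuple[[stream_attributes[i]]], envir = env)
  }

  # Step 2: run the script in the environment holding those objects.
  eval(parse(text = script_text), envir = env)

  # Step 3: read the named result object back out.
  get(output_object, envir = env, inherits = FALSE)
}

result <- run_script_on_tuple(
  tuple             = list(a = 2L, b = 3L),
  stream_attributes = c("a", "b"),
  r_objects         = c("in1", "in2"),
  script_text       = "out1 <- in1 + in2",
  output_object     = "out1"
)
print(result)
```

Running the script in its own environment keeps the objects created for one tuple separate from the caller's workspace; whether the real operator isolates state in this way is not stated in this article.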
The operator supports an optional input port that accepts an rstring attribute specifying the name of an R script. This script is run once. You can use the script to update or replace the analytic code in the initialization or processing scripts. For example, you can run R commands that refresh the model used for scoring, or you can replace an R function definition. The operator also supports an optional output port that is used to capture error information that occurs while running the R script.
In addition, the operator accepts the following parameters:
 rScriptFileName: Provides the path of an R script to run for each incoming tuple.
 streamAttributes: Describes a list of expressions to use for mapping to R objects used within the R script.
 rObjects: Provides a list of strings that represent the names of R objects that will be populated before the R script is run. Must map one-to-one to the entries of the streamAttributes parameter.
 initializationScriptFileName (optional): Provides the path of an R script to run during operator initialization.
 rCommand (optional): Allows the user to specify a different path and options when invoking the R shell.
The RScript operator supports all primitive SPL data types (integer, string, double, and time). Furthermore, it can handle mapping between SPL data types and R vectors.
RScript operator in action
The following code shows a simple example of using the RScript operator for calling an R script that sums two integer input parameters.
Listing 5. RScript in action: Integer summation
stream<int32 a, int32 b, int32 c> analyzedStream = RScript(inStream)
{
    param
        rScriptFileName : "../process.r";
        streamAttributes : a, b;
        rObjects : "in1", "in2";
    output
        analyzedStream : c = fromR("out1");
}
In the code above, process.r is an R script that receives the two parameters (a and b) after they are mapped to the R objects in1 and in2. The out1 object returned from the R script is mapped to the output attribute c.
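The article does not show the contents of process.r for Listing 5. A minimal sketch of what such a script might contain follows, assuming that the int32 attributes arrive as R integers. Inside the operator, in1 and in2 would already exist when the script starts; they are assigned here only so that the sketch runs on its own.

```r
# Hypothetical contents of process.r for Listing 5 (not from the article).
# Stand-in values: in the operator these objects would be populated
# from the tuple attributes a and b before the script runs.
in1 <- 2L
in2 <- 3L

# The script body: compute the object that fromR("out1") would read.
out1 <- in1 + in2
print(out1)
```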
The following code is another example, which uses a linear model to predict how fertilizer will affect growth.
Listing 6. RScript in action: Applying a prediction model
stream<int32 identifier, float64 fert, float64 sizeChange, rstring modelSource> analyzedStream = RScript(inStream)
{
    param
        initializationScriptFileName : "../init.r";
        rScriptFileName : "../process.r";
        streamAttributes : fert;
        rObjects : "f";
    output
        analyzedStream : sizeChange = fromR("growth");
}
In the code above, the init.r script runs once during operator startup. It creates the connection to the input data and the linear model growthModel.
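# Read the space-separated training data, including its header row
growthData <- read.csv(file="../growthMonday.txt", sep=' ', header=TRUE)
# Make the columns visible by name, then fit height as a function of fertilizer
attach(growthData)
growthModel <- lm(height~fertilizer)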
The process.r script runs each time a tuple arrives on the required input port. The fertilizer amount is passed in through the object f. The script outputs the predicted growth from the linear model using the object growthModel.
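# Wrap the incoming fertilizer amount as new data for the model
onerow <- list(fertilizer = f)
# Predicted growth; read back by fromR("growth")
growth <- predict(growthModel, onerow)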
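To try the two scripts outside InfoSphere Streams, the same fit-once, score-per-tuple pattern can be reproduced in a single R session. The following sketch rests on several assumptions: the data frame holds made-up numbers standing in for growthMonday.txt, score_tuple is a hypothetical helper that plays the role of process.r, and the model is fitted with the data argument instead of attach().

```r
# Made-up data standing in for growthMonday.txt; not from the article.
growthData <- data.frame(
  fertilizer = c(1, 2, 3, 4, 5),
  height     = c(3.1, 4.9, 7.2, 8.8, 11.0)
)

# Equivalent of init.r: fit the model once.
# Passing data = growthData avoids the need for attach().
growthModel <- lm(height ~ fertilizer, data = growthData)

# Equivalent of process.r, wrapped in a hypothetical helper so that it
# can be called once per incoming fertilizer value.
score_tuple <- function(f) {
  onerow <- data.frame(fertilizer = f)
  unname(predict(growthModel, onerow))
}

growth <- score_tuple(2.5)
print(growth)
```

Fitting in the initialization step and only predicting per tuple keeps the per-tuple work small, which matches the split between init.r and process.r described above.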
Conclusion
Helping companies manage, analyze, and benefit from big data is the key focus of the IBM big data platform. InfoSphere Streams, IBM's software platform for processing and analyzing streaming data, integrates with the R Project software suite, which offers powerful data manipulation, statistical computation, and graphics display capabilities. If you're ready to get started with InfoSphere Streams and its various toolkits, see Resources for free training materials and software.
Resources
Learn
 Read about The S Language and System.
 Visit The R Project.
 See the R Project manual for details and capabilities of the R programming language.
 Read "An introduction to InfoSphere Streams: A platform for analyzing big data in motion."
 Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
 Get a detailed overview of InfoSphere Streams in the IBM Redbooks® publication titled "InfoSphere Streams: Harnessing data in motion."
 Dig deeper into the Streams Processing Language (SPL).
 Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
 Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
 Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
 Stay current with developerWorks technical events and webcasts.
 Follow developerWorks on Twitter.
Get products and technologies
 Download InfoSphere Streams, available as a native software installation or as a VMware image.
 Use InfoSphere Streams on IBM SmartCloud Enterprise.
 Access IBM trial software available as a download or in a cloud environment and innovate in your next open source development project using software especially for developers.
 Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
 Build your next development project with IBM trial software, available for download directly from developerWorks.
Discuss
 Ask questions and get answers in the InfoSphere Streams forum.
 Ask questions and get answers in the InfoSphere BigInsights forum.
 Check out the developerWorks blogs and get involved in the developerWorks community.