R is a powerful language used for in-memory statistical computing and graphical display. It is similar to SAS, IBM SPSS® Statistics, MATLAB, or FORTRAN, except that it is open source. Converters exist that easily move data held in SAS, IBM SPSS Statistics, or MATLAB format into R. In addition, R comes with a wide array of packages available through the Comprehensive R Archive Network (CRAN). CRAN serves a similar function to CPAN for the Perl language or Rubygems.org for the Ruby language. R is also integrated with InfoSphere BigInsights (see Resources).
Set up the development environment
First, you need to set up the development environment. You need R and,
optionally, the RStudio integrated development environment (IDE). Most
package management systems have R version 2.15 available as the default,
but if you want the most up-to-date version of R (3.0.2 at the time of
this writing), you must edit your available sources.
To edit the available sources in Debian, for example, open the file /etc/apt/sources.list to add the following line:
deb http://favorite-cran-mirror/bin/linux/debian wheezy-cran3/
Replace favorite-cran-mirror with the mirror closest to you. See Resources for a list of CRAN mirrors. Your CRAN mirror provides instructions for other distros, such as Red Hat Enterprise Linux®, SuSE, and Ubuntu, in addition to a terminal-based interface to R.
Next, you need to make R aware of how to read JSON data. You do this through the JSON for R (rjson) package, which is available on CRAN. Start by opening an R terminal. The syntax for importing a package into your local R installation is:
install.packages("rjson")
Then make it available with:
library(rjson)
You can replace rjson with the name of any package available on CRAN.
The most basic commands in rjson are fromJSON() and toJSON(). This
article explores only fromJSON() because it is the most useful method
when interacting with existing data, such as data InfoSphere BigInsights
creates or parses.
If you haven't already done so, download the sample file grades.json and save it to an accessible folder (see Download). Then import the file into R with the following command:
grades=fromJSON(file = '/path_to_file/grades.json', unexpected.escape = "error")
Replace path_to_file with the path to your file. The unexpected.escape
flag tells rjson how to treat an unexpected escape character. The options
for this flag are error, keep, and skip.
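The import step can be tried without the sample download by writing a small JSON file first. The following is a minimal sketch, not the contents of grades.json itself: the two records are assumptions modeled on the structure shown in Listing 1, and json_text, json_path, and grades_demo are hypothetical names. The rjson call is guarded so the snippet still runs when the package is not installed.

```r
# Hypothetical sample records (assumed shape, modeled on Listing 1)
json_text <- '[{"name":"Amy","grade1":35,"grade2":41,"grade3":53},
               {"name":"Bob","grade1":44,"grade2":37,"grade3":28}]'

# Write the sample to a temporary file so fromJSON() can read it by path
json_path <- tempfile(fileext = ".json")
writeLines(json_text, json_path)

if (requireNamespace("rjson", quietly = TRUE)) {
  # Same call as in the article, pointed at the temporary file
  grades_demo <- rjson::fromJSON(file = json_path, unexpected.escape = "error")
  print(length(grades_demo))       # number of student records read
  print(grades_demo[[1]]$name)     # name field of the first record
} else {
  message("rjson is not installed; run install.packages(\"rjson\") first")
}
```

Setting unexpected.escape to "error" is the strictest of the three options, which makes it a reasonable choice while you are still checking that a file is well formed.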
Basic R commands
You can learn more about any given command within R by using the ? operator:
?name_of_command
Or you can use the following command:
help(name_of_command)
Replace name_of_command with the name of the command about which you want to learn. For instance, to read the manual page for the library command, you simply enter help(library). This opens the manual page for the library function. Enter q to exit the manual page and return to the R shell.
It is often useful to clean up data with another language, such as Perl or Ruby, before attempting to import it into R. Instructions for doing so are outside the scope of this article, but R also provides its own tools to clean up data. In addition, the rjson package provides a flag to skip unescaped special characters. Internal to R are functions such as sub() and gsub(), which alter data based on regular expression (regex) rules. The syntax for gsub is as follows:
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub("A", "a", grades) converts every uppercase
A to a lowercase a in the return value. The
perl flag causes R to use the Perl-compatible regular
expressions library instead of standard regex, which may be useful if you
have existing regex rules you want to import from Perl.
The sub() function changes only the first occurrence of a matched
pattern, whereas gsub() changes all occurrences.
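The difference between sub() and gsub(), and the effect of the ignore.case, perl, and fixed flags, is easiest to see on a small character vector. This is an illustrative sketch; the vector x and the result names are made up for the example.

```r
x <- c("AAA", "BAnAnA")

# sub() replaces only the first match in each element
first_only <- sub("A", "a", x)

# gsub() replaces every match in each element
all_hits <- gsub("A", "a", x)

# ignore.case = TRUE matches both "a" and "A"
any_case <- gsub("a", "4", x, ignore.case = TRUE)

# perl = TRUE enables Perl-style features such as lookbehind:
# replace only an "A" that directly follows a "B"
after_b <- gsub("(?<=B)A", "a", x, perl = TRUE)

# fixed = TRUE treats the pattern as a literal string, not a regex
literal_dot <- gsub(".", "-", "grades.json", fixed = TRUE)

print(first_only)   # "aAA"    "BanAnA"
print(all_hits)     # "aaa"    "Banana"
print(any_case)     # "444"    "B4n4n4"
print(after_b)      # "AAA"    "BanAnA"
print(literal_dot)  # "grades-json"
```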
Another useful tool available in R is the fix() command, along with the closely related edit() command. With them, you can edit the data set currently in use. For example, edit(grades) opens the R object grades, which was created earlier in the article, in a text editor. With this tool, you can change the data on the fly. Note that edit() returns the modified copy, whereas fix(grades) saves the changes back to grades. R defaults to the editor vi. If you want to use a different editor, you can change it with options(editor = "nano"). Substitute nano in the previous command with the text editor of your choice, for example, Pico or gedit.
In addition, you can call other languages directly from R by using the system() command, which calls a command in the underlying shell and prints its output in the R session. For example, system('echo "something"') breaks out of the R session and passes the command echo "something" to the underlying shell. The command's standard output (stdout), in this case the word something, is printed in the R session, while system() itself returns the command's exit status. The following intern flag makes the captured output the return value, an R object you can manipulate.
system('echo "something"', intern=TRUE)
Perhaps more usefully, the same pattern creates an R object from the output of any arbitrary script in any language available in the underlying shell: pass the script's command line to system() with intern=TRUE.
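As a hedged illustration of capturing external output, the sketch below uses echo as a stand-in for an arbitrary script and assumes a Unix-like shell. The names raw_output and scores are hypothetical. In practice you would replace the echo command with the command line of your own script.

```r
# Assumption: a Unix-like shell where `echo` is available.
# `echo` stands in for any script that prints numbers to stdout.
raw_output <- system('echo "35 41 53"', intern = TRUE)

# With intern = TRUE, the output arrives as a character vector, one element per line
print(raw_output)

# Parse the whitespace-separated numbers into a numeric vector
scores <- scan(text = raw_output, quiet = TRUE)
print(mean(scores))
```

Because the captured value is always character data, a parsing step such as scan() or as.numeric() is usually needed before any statistical function can use it.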
R JSON data types
To understand how R treats JSON data, start with the rjson library, which imports data in list format. To learn more about R data types, check out the link in Resources. The list data type is the most flexible because you can decompose list data into any of the other data types and because data does not have to be of equal lengths in list type (as opposed to vector type, which does have that limitation). However, many of the statistical functions you can apply to a data set are not available for data in the list format. Therefore, you have to extract the useful data points from the list-formatted data into another data type.
A useful command for exploring how R views a given piece of data is
str(). If you use the
grades object created
earlier, the output of
str(grades) should look like Listing 1.
Listing 1. Structure of JSON data in R
str(grades)
List of 4
 $ :List of 4
  ..$ name  : chr "Amy"
  ..$ grade1: num 35
  ..$ grade2: num 41
  ..$ grade3: num 53
 $ :List of 4
  ..$ name  : chr "Bob"
  ..$ grade1: num 44
  ..$ grade2: num 37
  ..$ grade3: num 28
 $ :List of 4
  ..$ name  : chr "Charles"
  ..$ grade1: num 68
  ..$ grade2: num 65
  ..$ grade3: num 61
 $ :List of 4
  ..$ name  : chr "David"
  ..$ grade1: num 72
  ..$ grade2: num 78
  ..$ grade3: num 81
Decide which data points to extract and use the concatenate function,
c(), to extract the data, as shown below:
grade1.num <- c(grades[[1]]$grade1, grades[[2]]$grade1, grades[[3]]$grade1, grades[[4]]$grade1)
This command creates a new object (grade1.num) that consists of each student's grade on the first exam. grade1.num is now a numeric vector, the most basic data type in R. To remove an R object, issue the rm() command. For example, rm(grade1.num) removes the object just created from the R session. To create an object for a given student's grades, you can issue the following command:
Amy.grade <- c(grades[[1]]$grade1, grades[[1]]$grade2, grades[[1]]$grade3)
This command uses the assignment operator, which consists of the greater-than (>) or less-than (<) sign, depending on the direction of the assignment, combined with a hyphen (-). Items held in list format have their individual elements accessed by index number (for example, grades[[1]] is the first list) or by name with the dollar sign operator (for example, grades[[1]]$grade1).
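The two access styles can be tried on a hand-built list. The sketch below reconstructs the grades object from the values in Listing 1 rather than reading the JSON file, and it uses vapply() as a loop-free alternative to writing out each index by hand. That swap is a suggestion, not the article's method.

```r
# Reconstruction of the grades list, using the values shown in Listing 1
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

# Double brackets return the element itself; $ then selects a field by name
first_name <- grades[[1]]$name

# Single brackets return a list of length 1 that still wraps the element
wrapped <- grades[1]

# Alternative to c(grades[[1]]$grade1, grades[[2]]$grade1, ...):
# apply the same extraction to every student and require one number each
grade1.num <- vapply(grades, function(student) student$grade1, numeric(1))

print(first_name)   # "Amy"
print(grade1.num)   # 35 44 68 72
```

The vapply() form keeps working unchanged if the JSON file gains more students, whereas the explicit c() call has to be edited for each new index.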
A more useful data type in R is
data.frame, which is a composite of vectors. To create a
data.frame from the example data, first create numeric vectors for the
remaining students' grades, as shown below:
Bob.grade <- c(grades[[2]]$grade1, grades[[2]]$grade2, grades[[2]]$grade3)
Charles.grade <- c(grades[[3]]$grade1, grades[[3]]$grade2, grades[[3]]$grade3)
David.grade <- c(grades[[4]]$grade1, grades[[4]]$grade2, grades[[4]]$grade3)
Next, combine all of the vectors into a single data frame with the following command:
All.grades <- data.frame(Amy.grade, Bob.grade, Charles.grade, David.grade)
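The same data frame can also be assembled directly from the list without creating one vector per student. The following sketch is one possible approach under the assumption that every record has the fields grade1, grade2, and grade3. The names exam_cols and score_matrix are hypothetical. Unlike the command above, it labels the rows grade1 through grade3 instead of 1 through 3.

```r
# Reconstruction of the grades list, using the values shown in Listing 1
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

exam_cols <- c("grade1", "grade2", "grade3")

# One column per student, one row per exam
score_matrix <- sapply(grades, function(student) unlist(student[exam_cols]))

# Name the columns after the students, matching the article's naming scheme
colnames(score_matrix) <- paste0(
  sapply(grades, function(student) student$name), ".grade"
)

All.grades <- as.data.frame(score_matrix)
print(All.grades)
print(colMeans(All.grades))   # mean grade per student
```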
Data imported into R from comma-separated values or spreadsheet programs
also becomes a data.frame object through handlers built into R.
R statistical functions
To gain basic knowledge of the statistical nature of a numeric vector
such as Amy.grade, use the summary command:
summary(Amy.grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     35      38      41      43      47      53
The summary command also works on data frames. The output of
summary(All.grades) looks like Listing 2.
Listing 2. Summary of data frame in R
summary(All.grades)
   Amy.grade    Bob.grade     Charles.grade    David.grade
 Min.   :35   Min.   :28.00   Min.   :61.00   Min.   :72.0
 1st Qu.:38   1st Qu.:32.50   1st Qu.:63.00   1st Qu.:75.0
 Median :41   Median :37.00   Median :65.00   Median :78.0
 Mean   :43   Mean   :36.33   Mean   :64.67   Mean   :77.0
 3rd Qu.:47   3rd Qu.:40.50   3rd Qu.:66.50   3rd Qu.:79.5
 Max.   :53   Max.   :44.00   Max.   :68.00   Max.   :81.0
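The values that summary() prints can also be pulled out individually when you need them in later calculations. This short sketch assumes the Amy.grade values from Listing 1. The name amy_summary is made up for the example.

```r
Amy.grade <- c(35, 41, 53)

# summary() on a numeric vector returns a named numeric object
amy_summary <- summary(Amy.grade)

# Individual statistics are available by name
amy_median <- amy_summary[["Median"]]
amy_mean   <- amy_summary[["Mean"]]

# quantile() gives the same quartiles directly
amy_quartiles <- quantile(Amy.grade, probs = c(0.25, 0.75))

print(amy_median)      # 41
print(amy_mean)        # 43
print(amy_quartiles)   # 25%: 38, 75%: 47
```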
To determine whether two vectors are statistically correlated, use cor(), an easy-to-use R function that acts on numeric vectors. For example, you can examine the correlation of the Amy.grade and Bob.grade objects by using the following command:
cor(Amy.grade, Bob.grade)
[1] -0.9930365
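Note: Covariance and variance use the same syntax as correlation, except with the functions cov() and var(), respectively, as shown below. When var() is given two vectors, it returns their covariance, which is why both calls return the same value.
cov(Amy.grade, Bob.grade)
[1] -73
var(Amy.grade, Bob.grade)
[1] -73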
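The relationship between these functions can be checked by hand: correlation is the covariance divided by the product of the two standard deviations. The sketch below uses the Amy.grade and Bob.grade values from Listing 1. The names covariance and manual_cor are illustrative.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

# Covariance of the two vectors
covariance <- cov(Amy.grade, Bob.grade)

# Correlation rebuilt from covariance and the standard deviations
manual_cor <- covariance / (sd(Amy.grade) * sd(Bob.grade))

print(covariance)                  # -73
print(manual_cor)                  # matches cor(Amy.grade, Bob.grade)
print(var(Amy.grade))              # single-vector variance: 84
print(var(Amy.grade, Bob.grade))   # two-vector form equals the covariance
```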
To calculate the median absolute deviation, use the mad() function:
mad(Charles.grade)
[1] 4.4478
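To see where this number comes from, the calculation can be reproduced step by step. By default, mad() scales the median of the absolute deviations by the constant 1.4826. The names deviations and manual_mad below are illustrative.

```r
Charles.grade <- c(68, 65, 61)

# Absolute distance of each grade from the median (65)
deviations <- abs(Charles.grade - median(Charles.grade))

# Default scaling constant used by mad()
manual_mad <- 1.4826 * median(deviations)

print(deviations)           # 3 0 4
print(manual_mad)           # 4.4478
print(mad(Charles.grade))   # same value from the built-in function
```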
Another useful function that acts on numeric vectors is sd(). This function computes the standard deviation of a numeric vector:
sd(Amy.grade)
[1] 9.165151
This function shows that the standard deviation of Amy's grades is 9.165151. Notice that the output has an index number of [1], which lets you know you can treat the output of this function the same way you treat any other object, or you can assign it a name, for example, x <- sd(Amy.grade). If you later change the contents of Amy.grade, for example by using the fix() command, the value of x does not change automatically. Run the assignment again to update it.
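A quick check of how x behaves after Amy.grade is modified: R stores the computed number in x, not a link back to Amy.grade, so x keeps its old value until the assignment is run again. The replacement grade of 90 and the name x_after_edit are made up for this sketch.

```r
Amy.grade <- c(35, 41, 53)

# Store the standard deviation under a name
x <- sd(Amy.grade)

# Change the underlying data (hypothetical new third grade)
Amy.grade[3] <- 90

# x still holds the value computed from the original data
x_after_edit <- x
print(x_after_edit)   # 9.165151

# Recompute to pick up the change
x <- sd(Amy.grade)
print(x)
```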
Another useful statistical function is the Kolmogorov-Smirnov test. You can use this test with the example data to determine whether the probability distributions differ between two students.
Listing 3. Kolmogorov-Smirnov test
ks.test(Amy.grade, Bob.grade)

	Two-sample Kolmogorov-Smirnov test

data:  Amy.grade and Bob.grade
D = 0.3333, p-value = 1
alternative hypothesis: two-sided
You can see from the p-value of 1 that the test finds no evidence against
Amy.grade and Bob.grade having come from the same distribution.
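The result of ks.test() is an ordinary R object, so the statistic and p-value can be extracted and used in a script instead of being read off the printed output. This sketch uses the grade values from Listing 1. The 0.05 threshold is a conventional choice assumed for illustration, not something the article specifies.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

result <- ks.test(Amy.grade, Bob.grade)

# Largest gap between the two empirical distribution functions
d_stat <- unname(result$statistic)

# Probability of a gap at least this large if both samples share a distribution
p_val <- result$p.value

# Assumed significance level of 0.05 for the decision
reject_null <- p_val < 0.05

print(d_stat)
print(p_val)
print(reject_null)   # FALSE: no evidence that the distributions differ
```

With only three observations per student, the test has very little power, so a large p-value here should be read as an absence of evidence, not as proof that the distributions match.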
R is well known for its ability to create data visualizations. The simplest
function in this family is plot(). For instance, plot(David.grade) creates
a simple scatter plot diagram of that object. To add a title and labels for
the axes, use the following command:
plot(David.grade, main = "David's Grades", ylab = "Grade", xlab = "Test Number")
The dotchart() function is similar to plot() and takes many of the same
arguments, except that it also accepts the labels=row.names(x) flag, which
enables you to use the row names (if any) to label the chart.
R can also produce a histogram, for example, hist(All.grades[[1]]). This
command is equivalent to hist(Amy.grade) but extracts the vector from the
data frame.
R can also produce bar plots with the barplot() function, for example,
barplot(Charles.grade). To change the orientation of the graph, add the
horiz=TRUE flag. You can add color by using the col flag, as shown below:
barplot(Charles.grade, horiz=TRUE, col="darkblue")
The graphics function boxplot(), for example boxplot(All.grades), is
especially useful because it lets you compare the columns of an entire data
frame side by side. To add axis labels, notches for median comparison, a
title, and colors, enter the following command:
boxplot(All.grades, main = "Class Grades", ylab = "Grade", xlab = "Student", col = c("gold", "darkgreen", "blue", "red"), notch = TRUE)
Notice the use of the concatenate function, c(), nested within the boxplot() call to supply the colors.
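When R runs in a script or on a server without a display, the same plotting calls can be directed to a file. The sketch below is one way to do that with the built-in pdf() device. The file name grades-plots.pdf and the variable plot_path are hypothetical, and the data frame is rebuilt from the Listing 1 values so the example stands alone. The notch flag is left out here because notches on three-point samples trigger warnings.

```r
# Data frame rebuilt from the values in Listing 1
All.grades <- data.frame(
  Amy.grade     = c(35, 41, 53),
  Bob.grade     = c(44, 37, 28),
  Charles.grade = c(68, 65, 61),
  David.grade   = c(72, 78, 81)
)

# Hypothetical output location in the session's temporary directory
plot_path <- file.path(tempdir(), "grades-plots.pdf")

# Open a PDF device; each high-level plot call below starts a new page
pdf(plot_path)

plot(All.grades$David.grade, main = "David's Grades",
     ylab = "Grade", xlab = "Test Number")

hist(All.grades[[1]], main = "Amy's Grades", xlab = "Grade")

barplot(All.grades$Charles.grade, horiz = TRUE, col = "darkblue",
        names.arg = 1:3)

boxplot(All.grades, main = "Class Grades", ylab = "Grade",
        col = c("gold", "darkgreen", "blue", "red"))

# Close the device so the file is written to disk
invisible(dev.off())

print(file.exists(plot_path))
```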
This article explains how the R language provides a powerful tool for statistically analyzing data and displaying the results graphically. R has found a variety of uses in business sectors such as finance, biology, and engineering. Now that InfoSphere BigInsights products include R functions, R will gain in popularity.
Download: Sample data for this article, grades.json.zip (1KB)
- Check out the R InfoSphere BigInsights documentation.
- Check the CRAN mirror list for the mirror nearest you.
- Learn more about R data types.
- Peruse the official R documentation.
- R-bloggers contains many useful articles, tips, and ideas related to the R language.
- Read about R use in statistical biology.
- Learn about R use in public policy.
- Read about R use in conjunction with open government data sets.
- Read about R use in physics. It's interesting to note that CERN is a heavy internal user of R and produces a Red Hat clone, named Scientific Linux, that includes R out of the box.
- Read about R use in engineering.
- Learn more about the Kolmogorov–Smirnov test.
- Check out a good style guide produced by Google for R.
- quantmod is an interesting R package designed for financial modeling.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
Get products and technologies
- R-Forge is a site that hosts R packages and forums in addition to CRAN.
- Download the RStudio IDE.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Find other R users in your area.
- Ask questions and get answers in the InfoSphere BigInsights forum.