Extract meaningful statistical measures from data in JSON using R
Integrate JSON data with R
R is a powerful language used for in-memory statistical computing and graphical display. It is similar to SAS, IBM SPSS® Statistics, MATLAB, or FORTRAN, except that it is open source. Converters exist that easily move data held in SAS, IBM SPSS Statistics, or MATLAB format into R. In addition, R comes with a wide array of packages available through the Comprehensive R Archive Network (CRAN). CRAN serves a similar function to CPAN for the Perl language or Rubygems.org for the Ruby language. R is also integrated with InfoSphere BigInsights (see Related topics).
InfoSphere BigInsights uses the JSON format to retain and display data, and this article explains how to use JSON with R. JSON is a text-based data format built on key-value pairs that maps directly to JavaScript objects. This article uses an example set of students and their respective grades on examinations.
Set up the development environment
First, you need to set up the development environment. You need R and,
optionally, the RStudio integrated development environment (IDE). Most
package management systems (apt, yum, and
others) have R version 2.15 available as the default, but if you want the
most up-to-date version of R (3.0.2 at the time of this writing), you must
edit your available sources.
To edit the available sources in Debian, for example, open the file /etc/apt/sources.list to add the following line:
deb http://favorite-cran-mirror/bin/linux/debian wheezy-cran3/
Replace favorite-cran-mirror with the mirror closest to you. See Related topics for a list of CRAN mirrors. Your CRAN mirror provides instructions for other distros, such as Red Hat Enterprise Linux®, SuSE, and Ubuntu, in addition to a terminal-based interface to R.
Next, you need to make R aware of how to read JSON data. You do this through the JSON for R (rjson) package, which is available on CRAN. Start by opening an R terminal. The syntax for importing a package into your local R installation is:
install.packages("rjson")
Then make it available with:
library("rjson")
You can replace rjson with the name of any package available
on CRAN.
The most basic commands in rjson are fromJSON() and
toJSON(). This article explores only fromJSON()
because it is the most useful method when interacting with existing data,
such as data InfoSphere BigInsights creates or parses.
If you haven't already done so, download the sample file grades.json and save it to an accessible folder (see Download). Then import the file into R with the following command:
grades=fromJSON(file = '/path_to_file/grades.json', unexpected.escape = "error")
Replace path_to_file with the path to your file.
The flag unexpected.escape tells rjson how to treat an
unexpected escape character. The options for this flag are error, keep,
and skip.
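If you want to experiment before downloading the sample file, the following minimal sketch parses a JSON string directly. The inline records are an assumption: they stand in for grades.json, with a layout inferred from the structure that str() reports in Listing 1, and the object name grades_demo is hypothetical.

```r
library("rjson")

# Hypothetical inline JSON standing in for grades.json; the layout is
# inferred from Listing 1, not copied from the real file.
json_text <- '[
  {"name": "Amy", "grade1": 35, "grade2": 41, "grade3": 53},
  {"name": "Bob", "grade1": 44, "grade2": 37, "grade3": 28}
]'

# json_str parses a string directly instead of reading from a file.
grades_demo <- fromJSON(json_str = json_text, unexpected.escape = "error")

length(grades_demo)      # number of student records parsed
grades_demo[[1]]$name    # name field of the first record
```

Parsing from a string this way is handy for checking how rjson maps a small JSON fragment to R objects before you point it at a larger file.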
Basic R commands
You can learn more about any given command within R by using the
help function:
help(name_of_command)
Or you can use the following command:
?name_of_command
Replace name_of_command with the name of the command about which
you want to learn. For instance, to read the manual page for the library
command, you simply enter help(library). This opens the
manual page for the library function. Enter q to exit the
manual page and return to the R shell.
It is often useful to clean up data with another language, such as Perl or
Ruby, before attempting to import it into R. Further instructions are
outside the scope of this article, but R also provides its own tools to
clean up data. In addition, the rjson package provides a flag
(unexpected.escape, described earlier) to skip unexpected escape
characters. Built into R are functions such as sub() and
gsub(), which alter data based on regular expression (regex)
rules. The syntax for gsub() is as follows:
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
For example, gsub("A", "a", grades) converts every uppercase
A to a lowercase a in the return value. The
perl flag causes R to use the Perl-compatible regular
expressions library instead of standard regex, which may be useful if you
have existing regex rules you want to import from Perl.
The sub() function changes only the first occurrence of a matched
pattern in each element. The gsub() function changes all occurrences.
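The difference between the two functions, and the effect of the perl flag, is easiest to see on a small character vector. The vector fruit below is a hypothetical example, not part of the sample data.

```r
# Hypothetical character vector used only for illustration.
fruit <- c("banana", "papaya")

sub("a", "A", fruit)    # first match in each element: "bAnana" "pApaya"
gsub("a", "A", fruit)   # every match in each element: "bAnAnA" "pApAyA"

# perl = TRUE enables Perl-style constructs such as lookbehind:
# replace an "n" only when it directly follows an "a".
gsub("(?<=a)n", "N", fruit, perl = TRUE)   # "baNaNa" "papaya"
```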
Another useful tool available in R is the edit() or
fix() command. With it, you can edit the data set currently
in use. For example, edit(grades) opens the R object
grades, which was created earlier in the article, and returns the
edited copy, so assign the result (for example, grades <- edit(grades))
to keep your changes. The fix() command saves the changes back to the
object directly. With these tools, you can change the data on the fly.
R defaults to the editor vi. If you want to use a different editor, you
can change it with options(editor = "nano"). Substitute nano
in the previous command with the text editor of your choice, for
example, Pico or gedit.
In addition, you can call other languages directly from R by using the
system() command, which runs a command in the underlying
shell and prints its output in the R session. For example,
system('echo "something"') passes the command echo "something"
to the underlying shell, and the command's standard output (stdout), in
this case the word something, appears in the R session. Adding the
intern flag, as in the following command, captures that output as an
R object you can manipulate.
system('echo "something"', intern=TRUE)
Perhaps a more useful example is found with the following command, which creates an R object from the return value from any arbitrary script in any language available in the underlying shell:
system('./my/script/here.pl', intern=TRUE)
R JSON data types
To understand how R treats JSON data, start with the rjson library, which imports data in list format. To learn more about R data types, check out the link in Related topics. The list data type is the most flexible because you can decompose list data into any of the other data types and because the elements of a list do not have to be of the same type or length (as opposed to an atomic vector, whose elements must all share one type). However, many of the statistical functions you can apply to a data set are not available for data in the list format. Therefore, you have to extract the useful data points from the list-formatted data into another data type.
A useful command for exploring how R views a given piece of data is
str(). If you use the grades object created
earlier, the output of str(grades) should look like Listing
1.
Listing 1. Structure of JSON data in R
str(grades)
List of 4
$ :List of 4
..$ name : chr "Amy"
..$ grade1: num 35
..$ grade2: num 41
..$ grade3: num 53
$ :List of 4
..$ name : chr "Bob"
..$ grade1: num 44
..$ grade2: num 37
..$ grade3: num 28
$ :List of 4
..$ name : chr "Charles"
..$ grade1: num 68
..$ grade2: num 65
..$ grade3: num 61
$ :List of 4
..$ name : chr "David"
..$ grade1: num 72
..$ grade2: num 78
..$ grade3: num 81
Decide which data points to extract and use the c() (concatenate)
function to extract the data, as shown below:
grade1.num <- c(grades[[1]]$grade1, grades[[2]]$grade1, grades[[3]]$grade1, grades[[4]]$grade1)
This command creates a new object (grade1.num), which
consists of each student's grade on the first exam.
grade1.num is now a numeric vector, the most basic data type
in R. To remove an R object, issue the rm() command. For
example, rm(grade1.num) removes the grade1.num
object just created from the R session. To create an object for a given
student's grades, you can issue the following command:
Amy.grade <- c(grades[[1]]$grade1, grades[[1]]$grade2, grades[[1]]$grade3)
This command uses the assignment operator, which consists of the
greater-than (>) or less-than (<) symbol,
depending on the direction of the assignment, combined with a hyphen
(-). Items held in list format have their individual elements
accessed by index number (for example, [[1]] is the first
list) or by name with the dollar sign operator (for example,
$grade1).
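Typing each index by hand becomes tedious as the list grows. As an alternative sketch (not the approach the article itself uses), sapply() can apply the same accessor to every element. The grades object below is rebuilt by hand from the values in Listing 1 so that the example runs on its own; in practice it would come from fromJSON().

```r
# Stand-in for the object returned by fromJSON(); values copied from Listing 1.
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

# One value per student: every student's score on the first exam.
grade1.num <- sapply(grades, function(student) student$grade1)

# One value per exam: all three scores for the first student.
Amy.grade <- unlist(grades[[1]][c("grade1", "grade2", "grade3")],
                    use.names = FALSE)

grade1.num
Amy.grade
```

Both results are plain numeric vectors, so they behave exactly like the ones built with c().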
A more useful data type in R is data.frame, which is a
composite of vectors. To create a data.frame from the example data, first
create numeric vectors for the remaining students' grades, as shown below:
Bob.grade <- c(grades[[2]]$grade1, grades[[2]]$grade2, grades[[2]]$grade3)
Charles.grade <- c(grades[[3]]$grade1, grades[[3]]$grade2, grades[[3]]$grade3)
David.grade <- c(grades[[4]]$grade1, grades[[4]]$grade2, grades[[4]]$grade3)
Next, combine all of the vectors into a single data frame with the following command:
All.grades <- data.frame(Amy.grade, Bob.grade, Charles.grade, David.grade)
Data imported into R from comma-separated values or spreadsheet programs
also becomes a data.frame object through handlers built into R.
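The per-student vectors can also be assembled without naming each student by hand. The following is a minimal sketch, assuming the list layout shown in Listing 1; the helper names exam_names and score_matrix are hypothetical.

```r
# Stand-in for the object returned by fromJSON(); values copied from Listing 1.
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

exam_names <- c("grade1", "grade2", "grade3")

# sapply() returns a numeric matrix: one row per exam, one column per student.
score_matrix <- sapply(grades, function(student) unlist(student[exam_names]))

# Convert to a data frame and label each column with the student's name.
All.grades <- as.data.frame(score_matrix)
names(All.grades) <- paste0(sapply(grades, function(student) student$name),
                            ".grade")

str(All.grades)
```

The resulting data frame has the same column names as the one built with data.frame() above, so the later commands in this article work on it unchanged.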
R statistical functions
To gain basic knowledge of the statistical nature of
Amy.grade, use the summary command:
summary(Amy.grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     35      38      41      43      47      53
The summary command also works on data frames. The output of
summary(All.grades) looks like Listing 2.
Listing 2. Summary of data frame in R
summary(All.grades)
   Amy.grade    Bob.grade     Charles.grade    David.grade
 Min.   :35   Min.   :28.00   Min.   :61.00   Min.   :72.0
 1st Qu.:38   1st Qu.:32.50   1st Qu.:63.00   1st Qu.:75.0
 Median :41   Median :37.00   Median :65.00   Median :78.0
 Mean   :43   Mean   :36.33   Mean   :64.67   Mean   :77.0
 3rd Qu.:47   3rd Qu.:40.50   3rd Qu.:66.50   3rd Qu.:79.5
 Max.   :53   Max.   :44.00   Max.   :68.00   Max.   :81.0
To determine whether two vectors are statistically correlated, use the
cor() function, which acts on numeric vectors.
For example, you can examine the correlation of Bob.grade
with the Amy.grade object by using the following command:
cor(Amy.grade, Bob.grade)
[1] -0.9930365
Note: Variance and covariance use the same syntax as
correlation, except with the functions var() and
cov(), respectively. When var() is given two vectors, it returns
their covariance, as shown below:
cov(Amy.grade, Bob.grade)
[1] -73
var(Amy.grade, Bob.grade)
[1] -73
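As a quick check of how these functions relate, the sketch below recomputes the correlation from the covariance and the two variances. The vectors repeat the values from Listing 1 so that the example runs on its own; manual_cor is a hypothetical name.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

var(Amy.grade)              # variance of a single vector
cov(Amy.grade, Bob.grade)   # covariance of two vectors
var(Amy.grade, Bob.grade)   # with two arguments, var() returns the covariance

# Correlation is the covariance scaled by the square roots of both variances.
manual_cor <- cov(Amy.grade, Bob.grade) /
  (sqrt(var(Amy.grade)) * sqrt(var(Bob.grade)))

all.equal(manual_cor, cor(Amy.grade, Bob.grade))
```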
To calculate the median absolute deviation, use the mad()
function:
mad(Charles.grade)
[1] 4.4478
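To see where that number comes from, the following sketch reproduces the calculation by hand. By default, mad() takes the median of the absolute deviations from the median and multiplies it by the constant 1.4826; the names abs_dev and manual_mad are hypothetical.

```r
Charles.grade <- c(68, 65, 61)

# Absolute distance of each score from the median score.
abs_dev <- abs(Charles.grade - median(Charles.grade))

# Default scaling constant used by mad() is 1.4826.
manual_mad <- 1.4826 * median(abs_dev)

all.equal(manual_mad, mad(Charles.grade))
```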
Another useful function that acts on numeric vectors is sd().
This function calculates the standard deviation of a single numeric vector:
sd(Amy.grade)
[1] 9.165151
This function shows that the standard deviation of Amy's grades is 9.165151.
Notice that the output has an index number of [1]. This lets
you know that the result is itself a vector, so you can treat the output
of this function the same way you treat any other object or assign it a
name, for example, x <- sd(Amy.grade). Note that x stores the value
computed at the time of the assignment: if you later change the contents
of the vector Amy.grade (for example, with the fix() command),
the value of x does not change until you rerun the assignment.
Another useful statistical function is the Kolmogorov-Smirnov test. You can use this test with the example data to determine whether the probability distributions differ between two students.
Listing 3. Kolmogorov-Smirnov test
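ks.test(Amy.grade, Bob.grade)

	Two-sample Kolmogorov-Smirnov test

data:  Amy.grade and Bob.grade
D = 0.3333, p-value = 1
alternative hypothesis: two-sided
The p-value of 1 gives no evidence that Amy.grade and Bob.grade
come from different distributions, although three observations per
student are too few for the test to detect anything but extreme differences.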
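The printed report is convenient to read, but the test result is also an ordinary R object whose parts you can use in a script. The sketch below is one way to do that; the 0.05 significance level stored in alpha is an assumption chosen for illustration, not something the sample data dictates.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

result <- ks.test(Amy.grade, Bob.grade)

result$statistic   # the D value from the report
result$p.value     # the p-value from the report

# Assumed significance level, chosen only for this illustration.
alpha <- 0.05
if (result$p.value < alpha) {
  message("Evidence that the two distributions differ")
} else {
  message("No evidence that the two distributions differ")
}
```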
R visualizations
R is well known for its ability to create data visualizations. The simplest
function in this family is plot(). For instance,
plot(David.grade) creates a simple scatter plot of
that object. To add a title and labels for the axes, use the following
command:
plot(David.grade, main = "David's Grades", ylab = "Grade", xlab = "Test Number")
The function dotchart is similar to plot and
takes many of the same arguments, except that dotchart takes
the labels=row.names(x) flag, which enables you to use the
row names (if any) to label the chart.
R can also produce a histogram, for
example, hist(All.grades[[1]]). This command is equivalent to
hist(Amy.grade) but extracts the vector from the data frame
created earlier.
R can also produce bar plots with the barplot() function
— for example, barplot(Charles.grade). To change the
orientation of the graph, add the horiz=TRUE flag. You can
add color by using the col flag, as shown below:
barplot(Charles.grade, horiz=TRUE, col="darkblue")
The graphics function boxplot(), for example,
boxplot(All.grades), draws one box per column. With
boxplot(), you can compare all of the columns of a data frame at a
glance. To add axis labels, notches for median comparison, a title, and
colors, enter the following command:
boxplot(All.grades, main = "Class Grades", ylab = "Grade",
xlab = "Student", col=(c("gold","darkgreen","blue","red")),
notch=TRUE)
Notice the use of the concatenate function nested within the
boxplot function.
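The plotting commands in this section open a window when you run them interactively. In a script, you may prefer to write the graphics to a file instead. The following is a minimal sketch under that assumption: the output path out_file and the test labels are hypothetical, and the data frame is rebuilt from the values in Listing 1 so that the example runs on its own.

```r
# Data frame rebuilt from the values in Listing 1.
All.grades <- data.frame(
  Amy.grade     = c(35, 41, 53),
  Bob.grade     = c(44, 37, 28),
  Charles.grade = c(68, 65, 61),
  David.grade   = c(72, 78, 81)
)

# Hypothetical output location; each plot becomes one page of the PDF.
out_file <- file.path(tempdir(), "class_grades.pdf")
pdf(out_file)

# Horizontal bar plot of one student's scores, labeled by test.
barplot(All.grades$Charles.grade, horiz = TRUE, col = "darkblue",
        names.arg = c("Test 1", "Test 2", "Test 3"),
        main = "Charles's Grades")

# One box per student, colored with a vector built by c().
boxplot(All.grades, main = "Class Grades", ylab = "Grade",
        col = c("gold", "darkgreen", "blue", "red"))

# Close the device so the file is written to disk.
invisible(dev.off())

file.exists(out_file)
```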
Conclusion
This article explains how the R language provides a powerful tool for statistically analyzing data and displaying the results graphically. R has found a variety of uses in fields such as finance, biology, and engineering. Now that InfoSphere BigInsights products include R functions, R is likely to gain further in popularity.
Downloadable resources
- Sample data for this article (grades.json.zip | 1KB)
Related topics
- Check out the R InfoSphere BigInsights documentation.
- Check the CRAN mirror list for the mirror nearest you.
- Learn more about R data types.
- Peruse the official R documentation.
- R-bloggers contains many useful articles, tips, and ideas related to the R language.
- Read about R use in statistical biology.
- Learn about R use in public policy.
- Read about R use in conjunction with open government data sets.
- Read about R use in physics. It's interesting to note that CERN is a heavy internal user of R and produces a Red Hat clone, named Scientific Linux, that includes R out of the box.
- Read about R use in engineering.
- Learn more about the Kolmogorov–Smirnov test.
- quantmod is an interesting R package designed for financial modeling.
- R-Forge is a site that hosts R packages and forums in addition to CRAN.
- Download the RStudio IDE.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.