Extract meaningful statistical measures from data in JSON using R

Integrate JSON data with R

This article explains how to integrate JavaScript Object Notation (JSON) data, such as the data InfoSphere® BigInsights™ produces, with R, a language for statistical computing. It also introduces basic R data types, commands, and graphical display functions.

William Hurley (willhurley004@gmail.com), Librarian, American Irish Historical Society

William Hurley has spent many years turning large data sets into high-powered web applications for nonprofits in and around New York City.



21 January 2014

R is a powerful language used for in-memory statistical computing and graphical display. It is similar to SAS, IBM SPSS® Statistics, MATLAB, or FORTRAN, except that it is open source. Converters exist that easily move data held in SAS, IBM SPSS Statistics, or MATLAB format into R. In addition, R comes with a wide array of packages available through the Comprehensive R Archive Network (CRAN). CRAN serves a similar function to CPAN for the Perl language or Rubygems.org for the Ruby language. R is also integrated with InfoSphere BigInsights (see Resources).

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

InfoSphere BigInsights uses the JSON format to retain and display data, and this article explains how to use JSON with R. JSON is a text-based data-interchange format built on key-value pairs and ordered lists, and it maps directly to JavaScript objects. This article uses an example set of students and their respective grades on examinations.

Set up the development environment

First, you need to set up the development environment. You need R and, optionally, the RStudio integrated development environment (IDE). Most package management systems (apt, yum, and others) have R version 2.15 available as the default, but if you want the most up-to-date version of R (3.0.2 at the time of this writing), you must edit your available sources.

To edit the available sources in Debian, for example, open the file /etc/apt/sources.list to add the following line:

deb http://favorite-cran-mirror/bin/linux/debian wheezy-cran3/

Replace favorite-cran-mirror with the mirror closest to you. See Resources for a list of CRAN mirrors. Your CRAN mirror provides instructions for other distros, such as Red Hat Enterprise Linux®, SuSE, and Ubuntu, in addition to a terminal-based interface to R.

Next, you need to make R aware of how to read JSON data. You do this through the JSON for R (rjson) package, which is available on CRAN. Start by opening an R terminal. The syntax for installing a package into your local R installation is:

install.packages("rjson")

Then make it available with:

library("rjson")

You can replace rjson with the name of any package available on CRAN.

The most basic commands in rjson are fromJSON() and toJSON(). This article explores only fromJSON() because it is the most useful method when interacting with existing data, such as data InfoSphere BigInsights creates or parses.

If you haven't already done so, download the sample file grades.json and save it to an accessible folder (see Download). Then import the file into R with the following command:

grades=fromJSON(file = '/path_to_file/grades.json',  unexpected.escape = "error")

Replace path_to_file with the path to your file.

The flag unexpected.escape tells rjson how to treat an unexpected escape character. The options for this flag are error, keep, and skip.
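
If the sample file is not at hand, the same call can be tried on a JSON string. The following is a minimal sketch that assumes the rjson package is already installed; the variable names json_text and sample_grades are illustrative, and the two records are an assumption based on the values this article reports for Amy and Bob, not a copy of grades.json.

```r
library(rjson)  # assumes install.packages("rjson") has already been run

# Hypothetical stand-in for two records of grades.json (an assumption,
# based on the values the article reports for Amy and Bob).
json_text <- '[
  {"name": "Amy", "grade1": 35, "grade2": 41, "grade3": 53},
  {"name": "Bob", "grade1": 44, "grade2": 37, "grade3": 28}
]'

# Parse from a string instead of a file; "error" stops on an unexpected escape.
sample_grades <- fromJSON(json_str = json_text, unexpected.escape = "error")

length(sample_grades)      # number of student records parsed
sample_grades[[1]]$name    # name field of the first record
```

Each JSON object becomes a named R list, and the enclosing JSON array becomes an unnamed list of those records.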


Basic R commands

You can learn more about any given command within R by using the help function:

help(name_of_command)

Or you can use the following command:

?name_of_command

Replace name_of_command with the name of the command about which you want to learn. For instance, to read the manual page for the library command, you simply enter help(library). This opens the manual page for the library function. Enter q to exit the manual page and return to the R shell.

It is often useful to clean up data with another language, such as Perl or Ruby, before importing it into R. Instructions for doing so are outside the scope of this article, but R also provides its own tools for cleaning up data. For example, setting the rjson flag unexpected.escape to skip drops unexpected escape characters during import. R also has built-in functions such as sub() and gsub(), which alter data based on regular expression (regex) rules. The syntax for gsub is as follows:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)

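For example, gsub("A", "a", grades) converts every uppercase A to a lowercase a in the return value. Note that gsub() coerces its input to a character vector and returns a character vector, so the list structure of grades is not preserved in the result. The perl flag causes R to use the Perl-compatible regular expressions library instead of its default extended regular expressions, which may be useful if you have existing regex rules you want to import from Perl.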

R support in IBM SPSS Statistics and SPSS Modeler

You can execute R algorithms within SPSS Statistics and SPSS Modeler and use algorithms and statistical techniques in SPSS Statistics that have been validated and proven over 40 years of use and testing. An SPSS Statistics Programmability Extension enables you to extend SPSS Statistics with external programming languages such as Python, R, .NET version of Microsoft® Visual Basic, and the Java™ programming language. It also allows external applications to access the SPSS Statistics processor and draw upon its vast wealth of functionality. Learn more about SPSS Statistics and SPSS Modeler, and give SPSS Statistics a try at no cost (get a trial download).

The sub() function changes only the first occurrence of a matched pattern in each string. The gsub() function changes all occurrences.
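
The difference is easiest to see on a small character vector. The sketch below uses only base R; the vector fruit is an illustrative name that is not part of the sample data.

```r
fruit <- c("banana", "papaya")

sub("a", "A", fruit)     # first match in each string: "bAnana" "pApaya"
gsub("a", "A", fruit)    # every match in each string: "bAnAnA" "pApAyA"

# perl = TRUE enables Perl-style constructs such as lookbehind:
# replace an "n" only when it directly follows an "a".
gsub("(?<=a)n", "N", fruit, perl = TRUE)   # "baNaNa" "papaya"
```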

Other useful tools available in R are the edit() and fix() commands, which let you edit an R object by hand. For example, edit(grades) opens the R object grades, which was created earlier in the article, in a text editor. Note that edit() returns the modified copy without changing the original, so assign the result (grades <- edit(grades)) to keep your changes; fix(grades) saves the changes back to grades directly. R defaults to the editor vi. If you want to use a different editor, you can change it with options(editor = "nano"). Substitute nano in the previous command with the text editor of your choice, for example, Pico or gedit.

In addition, you can call other languages directly from R by using the system() command, which runs a command in the underlying shell and prints its output in the R session. For example, system('echo "something"') passes the command echo "something" to the underlying shell and displays what the command writes to standard output (stdout), in this case the word something. Adding the intern flag, as in the following command, captures that output as an R object you can manipulate.

system('echo "something"', intern=TRUE)

Perhaps a more useful example is found with the following command, which creates an R object from the return value from any arbitrary script in any language available in the underlying shell:

system('./my/script/here.pl', intern=TRUE)
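
As a minimal sketch of what intern=TRUE returns, the following captures the output of echo and then treats it as an ordinary character vector. It assumes a Unix-like shell in which echo is available; shell_output is an illustrative name.

```r
# Capture the shell's standard output as an R character vector
# (assumes a Unix-like shell where echo is available).
shell_output <- system('echo "something"', intern = TRUE)

toupper(shell_output)   # the captured text behaves like any other string
nchar(shell_output)     # number of characters captured
```

Each line the command writes to stdout becomes one element of the returned vector.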

R JSON data types

To understand how R treats JSON data, start with the rjson library, which imports data in list format. To learn more about R data types, check out the link in Resources. The list data type is the most flexible because you can decompose list data into any of the other data types and because its elements do not have to be of the same type or length (unlike an atomic vector, whose elements must all share one type, or a data frame, whose columns must all have the same length). However, many of the statistical functions you can apply to a data set are not available for data in the list format. Therefore, you have to extract the useful data points from the list-formatted data into another data type.

A useful command for exploring how R views a given piece of data is str(). If you use the grades object created earlier, the output of str(grades) should look like Listing 1.

Listing 1. Structure of JSON data in R
str(grades)
List of 4
 $ :List of 4
  ..$ name  : chr "Amy"
  ..$ grade1: num 35
  ..$ grade2: num 41
  ..$ grade3: num 53
 $ :List of 4
  ..$ name  : chr "Bob"
  ..$ grade1: num 44
  ..$ grade2: num 37
  ..$ grade3: num 28
 $ :List of 4
  ..$ name  : chr "Charles"
  ..$ grade1: num 68
  ..$ grade2: num 65
  ..$ grade3: num 61
 $ :List of 4
  ..$ name  : chr "David"
  ..$ grade1: num 72
  ..$ grade2: num 78
  ..$ grade3: num 81

Decide which data points to extract, and use the c(), or concatenate, function to extract the data, as shown below:

grade1.num <- c(grades[[1]]$grade1, grades[[2]]$grade1, grades[[3]]$grade1,
grades[[4]]$grade1)

This command creates a new object (grade1.num), which consists of each student's grade on the first exam. grade1.num is now a numeric vector, the most basic data type in R. To remove an R object, issue the rm() command. For example, rm(grade1.num) removes the grade1.num object just created from the R session. To create an object for a given student's grades, you can issue the following command:

Amy.grade <- c(grades[[1]]$grade1, grades[[1]]$grade2, grades[[1]]$grade3)

This command uses the assignment operator, which consists of the less-than (<) or greater-than (>) symbol, depending on the direction of the assignment, combined with a hyphen (-), with no space between the two characters (<- or ->). Items held in list format have their individual elements accessed by index number (for example, [[1]] is the first element of the list) or by name with the dollar sign operator (for example, $grade1).
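
When the list holds more than a handful of records, typing every index by hand becomes tedious. The sketch below shows one alternative using base R's sapply(). Because it does not read grades.json, it builds a stand-in for the grades object by hand from the values in Listing 1; that hand-built list is an assumption made so the example is self-contained.

```r
# Hand-built stand-in for the object fromJSON() returns (values from Listing 1).
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

grades[[2]]$name           # access by index, then by name: "Bob"
grades[[2]][["grade3"]]    # the same kind of access with [[ ]]: 28

# Collect the first exam score from every student without typing each index.
grade1.num <- sapply(grades, function(student) student$grade1)
grade1.num                 # 35 44 68 72
```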

A more useful data type in R is data.frame, which is a composite of vectors. To create a data.frame from the example data, first create numeric vectors for the remaining students' grades, as shown below:

Bob.grade <- c(grades[[2]]$grade1, grades[[2]]$grade2, grades[[2]]$grade3)
Charles.grade <- c(grades[[3]]$grade1, grades[[3]]$grade2, grades[[3]]$grade3)
David.grade <- c(grades[[4]]$grade1, grades[[4]]$grade2, grades[[4]]$grade3)

Next, combine all of the vectors into a single data frame with the following command:

All.grades <- data.frame(Amy.grade, Bob.grade, Charles.grade, David.grade)

Data imported into R from comma-separated values or spreadsheet programs also become a data.frame object using handlers built into R.
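
The per-student vectors can also be assembled programmatically. The following is a sketch of one possible approach, not the only one: it pulls the three grade fields from every record with sapply() and converts the resulting matrix into a data frame. The hand-built grades list and the helper names score_fields and score_matrix are assumptions introduced for illustration.

```r
# Hand-built stand-in for the object fromJSON() returns (values from Listing 1).
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

score_fields <- c("grade1", "grade2", "grade3")

# One column per student, one row per exam.
score_matrix <- sapply(grades, function(student) unlist(student[score_fields]))
colnames(score_matrix) <- paste0(
  sapply(grades, function(student) student$name), ".grade"
)

All.grades <- as.data.frame(score_matrix)
All.grades
```

The result has the same columns as the data frame built by hand above; the only difference is that its rows are labeled grade1, grade2, and grade3 instead of 1, 2, and 3.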


R statistical functions

To gain basic knowledge of the statistical nature of a numeric vector such as Amy.grade, use the summary command:

summary(Amy.grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     35      38      41      43      47      53
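
The figures that summary() reports can be reproduced with the base R functions quantile() and mean(), as in this short sketch. The vector is re-created here from the values in Listing 1 so the example runs on its own.

```r
Amy.grade <- c(35, 41, 53)   # Amy's three grades from Listing 1

quantile(Amy.grade)   # 0%, 25%, 50%, 75%, 100%: 35 38 41 47 53
mean(Amy.grade)       # 43
```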

The summary command also works on data frames. The output of summary(All.grades) looks like Listing 2.

Listing 2. Summary of data frame in R
summary(All.grades)
   Amy.grade    Bob.grade     Charles.grade    David.grade  
 Min.   :35   Min.   :28.00   Min.   :61.00   Min.   :72.0  
 1st Qu.:38   1st Qu.:32.50   1st Qu.:63.00   1st Qu.:75.0  
 Median :41   Median :37.00   Median :65.00   Median :78.0  
 Mean   :43   Mean   :36.33   Mean   :64.67   Mean   :77.0  
 3rd Qu.:47   3rd Qu.:40.50   3rd Qu.:66.50   3rd Qu.:79.5  
 Max.   :53   Max.   :44.00   Max.   :68.00   Max.   :81.0

To determine whether two numeric vectors are correlated, use the cor() function. For example, you can examine the correlation of Bob.grade with the Amy.grade object by using the following command:

cor(Amy.grade, Bob.grade)
[1] -0.9930365

Note: Covariance and variance use the same syntax as correlation, with the functions cov() and var(), respectively. When var() is given two vectors, it returns their covariance, which is why both calls below print the same value:

cov(Amy.grade, Bob.grade)
[1] -73
var(Amy.grade, Bob.grade)
[1] -73
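
The three measures are related: correlation is the covariance divided by the product of the two standard deviations. The sketch below checks that relationship on the sample values; manual_cor is an illustrative name, and the vectors are re-created from Listing 1 so the example is self-contained.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

var(Amy.grade)               # variance of a single vector: 84
cov(Amy.grade, Bob.grade)    # covariance of the pair: -73

# Correlation is the covariance scaled by both standard deviations.
manual_cor <- cov(Amy.grade, Bob.grade) / (sd(Amy.grade) * sd(Bob.grade))
all.equal(manual_cor, cor(Amy.grade, Bob.grade))   # TRUE
```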

To calculate the median absolute deviation, use the mad() function:

mad(Charles.grade)
[1] 4.4478
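
To see where this number comes from, the calculation can be written out by hand. With its default arguments, mad() takes the median of the absolute deviations from the median and multiplies it by the constant 1.4826. The names abs_dev and manual_mad below are illustrative.

```r
Charles.grade <- c(68, 65, 61)

abs_dev <- abs(Charles.grade - median(Charles.grade))   # 3 0 4
manual_mad <- 1.4826 * median(abs_dev)                  # 1.4826 * 3

all.equal(manual_mad, mad(Charles.grade))               # TRUE
```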

Another useful function that acts on numeric vectors is sd(), which computes the standard deviation of a single numeric vector:

sd(Amy.grade)
[1] 9.165151

This function shows that the standard deviation of Amy's grades is 9.165151.
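
As a cross-check, the same value can be computed from the definition of the sample standard deviation, which divides the sum of squared deviations by n - 1 before taking the square root. The names deviations and manual_sd are illustrative.

```r
Amy.grade <- c(35, 41, 53)

deviations <- Amy.grade - mean(Amy.grade)    # -8 -2 10
manual_sd <- sqrt(sum(deviations^2) / (length(Amy.grade) - 1))

all.equal(manual_sd, sd(Amy.grade))          # TRUE
```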

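Notice that the output is prefixed with the index [1], which marks the position of the first value printed on that line. The result is an ordinary numeric vector, so you can treat it the same way you treat any other object, or you can assign it a name, for example, x <- sd(Amy.grade). Note that x stores the value computed at the time of the assignment. If you later change the contents of the vector Amy.grade, for instance with the fix() command, you must run the assignment again to update x.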

Another useful statistical function is the Kolmogorov-Smirnov test. You can use this test with the example data to determine whether the probability distributions differ between two students.

Listing 3. Kolmogorov-Smirnov test
ks.test(Amy.grade, Bob.grade)
Two-sample Kolmogorov-Smirnov test
data:  Amy.grade and Bob.grade
D = 0.3333, p-value = 1
alternative hypothesis: two-sided

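With a p-value of 1, the test finds no evidence that Amy.grade and Bob.grade come from different distributions. With only three observations per student, however, the test has very little power, so this result does not establish that the two distributions are the same.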
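
The object that ks.test() returns can also be stored and inspected field by field, which is convenient when the result feeds into further code. This is a minimal sketch; ks_result is an illustrative name, and the 0.05 threshold is a common convention used here as an assumption, not something the test prescribes.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

ks_result <- ks.test(Amy.grade, Bob.grade)

ks_result$statistic        # D, the largest gap between the two empirical CDFs
ks_result$p.value          # p-value of the two-sided test

# TRUE would mean the test detects a difference at the (assumed) 0.05 level.
ks_result$p.value < 0.05
```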


R visualizations

R is well known for its ability to create data visualizations. The simplest function in this family is plot(). For instance, plot(David.grade) creates a simple scatter plot of that object. To add a title and labels for the axes, use the following command:

plot(David.grade, main = "David's Grades", ylab = "Grade", xlab = "Test Number")

The function dotchart is similar to plot and takes many of the same arguments, except that dotchart takes the labels=row.names(x) flag, which enables you to use the row names (if any) to label the chart.
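
A short sketch of dotchart() on the sample data follows. Because David.grade is a plain vector with no row names, the labels are supplied explicitly; the test_labels vector and the title text are assumptions made for illustration.

```r
David.grade <- c(72, 78, 81)

# A plain vector has no row names, so build the labels by hand.
test_labels <- paste("Test", seq_along(David.grade))

# dotchart() draws the values along the x axis, one labeled row per value.
dotchart(David.grade, labels = test_labels,
         main = "David's Grades", xlab = "Grade")
```

When the data is a data frame or matrix that has row names, labels = row.names(x) serves the same purpose.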

R can also produce a histogram, for example, hist(All.grades[[1]]). This command is equivalent to hist(Amy.grade) but extracts the vector from the data frame created earlier.

R can also produce bar plots with the barplot() function — for example, barplot(Charles.grade). To change the orientation of the graph, add the horiz=TRUE flag. You can add color by using the col flag, as shown below:

barplot(Charles.grade, horiz=TRUE, col="darkblue")

The boxplot() function, for example, boxplot(All.grades), is useful for comparing distributions side by side. Given a data frame, it draws one box per column, so here each student gets a box. To add axis labels, notches for median comparison, a title, and colors, enter the following command:

boxplot(All.grades, main = "Class Grades", ylab = "Grade",
xlab = "Student", col = c("gold", "darkgreen", "blue", "red"),
notch = TRUE)

Notice the use of the concatenate function nested within the boxplot function.
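
The numbers behind each box can be inspected without drawing anything by passing plot = FALSE. The sketch below rebuilds the data frame from the values in Listing 1 so it runs on its own; box_stats is an illustrative name.

```r
All.grades <- data.frame(
  Amy.grade     = c(35, 41, 53),
  Bob.grade     = c(44, 37, 28),
  Charles.grade = c(68, 65, 61),
  David.grade   = c(72, 78, 81)
)

# plot = FALSE returns the statistics a box plot is drawn from.
box_stats <- boxplot(All.grades, plot = FALSE)

box_stats$names         # one box per column of the data frame
box_stats$stats[3, ]    # row 3 holds the medians: 41 37 65 78
```

The rows of box_stats$stats are, in order, the lower whisker, lower hinge, median, upper hinge, and upper whisker of each box.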


Conclusion

This article explains how the R language provides a powerful tool for statistically analyzing data and displaying the results graphically. R has found a variety of uses in fields such as finance, biology, and engineering. Now that InfoSphere BigInsights products include R functions, R will gain in popularity.


Download

Description                     Name               Size
Sample data for this article    grades.json.zip    1KB

Resources

Learn

Get products and technologies

Discuss
