Extract meaningful statistical measures from data in JSON using R

Integrate JSON data with R

R is a powerful language used for in-memory statistical computing and graphical display. It is similar to SAS, IBM SPSS® Statistics, MATLAB, or FORTRAN, except that it is open source. Converters exist that easily move data held in SAS, IBM SPSS Statistics, or MATLAB format into R. In addition, R comes with a wide array of packages available through the Comprehensive R Archive Network (CRAN). CRAN serves a similar function to CPAN for the Perl language or Rubygems.org for the Ruby language. R is also integrated with InfoSphere BigInsights (see Related topics).

InfoSphere BigInsights uses the JSON format to retain and display data, and this article explains how to use JSON with R. JSON is a lightweight, text-based data-interchange format built on key-value pairs and ordered lists, and it maps directly to JavaScript objects. This article uses an example set of students and their respective grades on examinations.

Set up the development environment

First, you need to set up the development environment. You need R and, optionally, the RStudio integrated development environment (IDE). Most package management systems (apt, yum, and others) have R version 2.15 available as the default, but if you want the most up-to-date version of R (3.0.2 at the time of this writing), you must edit your available sources.

To edit the available sources in Debian, for example, open the file /etc/apt/sources.list to add the following line:

deb http://favorite-cran-mirror/bin/linux/debian wheezy-cran3/

Replace favorite-cran-mirror with the mirror closest to you. See Related topics for a list of CRAN mirrors. Your CRAN mirror provides instructions for other distros, such as Red Hat Enterprise Linux®, SuSE, and Ubuntu, in addition to a terminal-based interface to R.

Next, you need to make R aware of how to read JSON data. You do this through the JSON for R (rjson) package, which is available on CRAN. Start by opening an R terminal. The syntax for installing a package into your local R installation is:

install.packages("rjson")

Then make it available with:

library("rjson")

You can replace rjson with the name of any package available on CRAN.

The most basic commands in rjson are fromJSON() and toJSON(). This article explores only fromJSON() because it is the most useful method when interacting with existing data, such as data InfoSphere BigInsights creates or parses.

If you haven't already done so, download the sample file grades.json and save it to an accessible folder (see Download). Then import the file into R with the following command:

grades <- fromJSON(file = '/path_to_file/grades.json', unexpected.escape = "error")

Replace path_to_file with the path to your file.

The flag unexpected.escape tells rjson how to treat an unexpected escape character. The options for this flag are error, keep, and skip.
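
If you do not have the sample file at hand, the following minimal sketch parses JSON text held in a string instead of a file. The JSON content is a hypothetical stand-in whose structure is inferred from the str() output shown later in this article, not a copy of grades.json. It assumes the rjson package is installed.

```r
library(rjson)

# Hypothetical stand-in for grades.json (structure assumed, not the real file)
json_text <- '[
  {"name": "Amy",     "grade1": 35, "grade2": 41, "grade3": 53},
  {"name": "Bob",     "grade1": 44, "grade2": 37, "grade3": 28},
  {"name": "Charles", "grade1": 68, "grade2": 65, "grade3": 61},
  {"name": "David",   "grade1": 72, "grade2": 78, "grade3": 81}
]'

# json_str takes the JSON text directly instead of a file path
grades <- fromJSON(json_str = json_text, unexpected.escape = "error")

print(length(grades))     # number of student records
print(grades[[1]]$name)   # name field of the first record
```

Each JSON object becomes a named R list, and the enclosing JSON array becomes a list of those lists.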

Basic R commands

You can learn more about any given command within R by using the help function:

help(name_of_command)

Or you can use the following command:

?name_of_command

Replace name_of_command with the name of the command about which you want to learn. For instance, to read the manual page for the library command, you simply enter help(library). This opens the manual page for the library function. Enter q to exit the manual page and return to the R shell.

Before importing data into R, it is often useful to clean it up with another language, such as Perl or Ruby. Instructions for doing so are outside the scope of this article, but R also provides its own cleanup tools. For example, the rjson package's unexpected.escape flag can skip unexpected escape characters, and base R includes functions such as sub() and gsub(), which alter data based on regular expression (regex) rules. The syntax for gsub() is as follows:

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)

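For example, gsub("A", "a", grades) returns a copy of the data in which every uppercase A is replaced by a lowercase a; the grades object itself is not modified. Note that gsub() coerces its input to a character vector, so the result no longer has the nested list structure. The perl flag causes R to use the Perl-compatible regular expressions (PCRE) library instead of its default regex engine, which may be useful if you have existing regex rules you want to import from Perl.

The sub() function replaces only the first occurrence of a matched pattern in each string. The gsub() function replaces all occurrences.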
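
The difference is easiest to see on a single string. The following sketch uses a made-up input ("banana") rather than the grades data, and the lookbehind pattern is only an illustration of a construct that needs perl = TRUE.

```r
x <- "banana"  # illustrative input, not part of the grades data

first_only  <- sub("a", "A", x)    # replaces only the first match
all_matches <- gsub("a", "A", x)   # replaces every match

# perl = TRUE enables Perl-style constructs such as lookbehind:
# replace each n that is immediately preceded by an a
after_a <- gsub("(?<=a)n", "N", x, perl = TRUE)

print(c(first_only, all_matches, after_a))
```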

Other useful tools available in R are the edit() and fix() commands, which let you edit a data set by hand. For example, edit(grades) opens the R object grades, which was created earlier in the article, in a text editor and returns the edited copy. To keep the changes, assign the result back with grades <- edit(grades), or call fix(grades), which saves the changes to the object for you. On UNIX-like systems, R uses the editor named in the EDITOR or VISUAL environment variable and falls back to vi. If you want to use a different editor, you can change it with options(editor = "nano"). Substitute nano in the previous command with the text editor of your choice (for example, Pico or gedit).

In addition, you can call other languages directly from R by using the system() command, which runs a command in the underlying shell. For example, system('echo "something"') passes the command echo "something" to the shell. By default, the command's standard output (stdout), in this case the word something, is printed in the R session, and the value that system() returns is the command's exit status. Setting the intern flag captures the output instead and returns it as an R object (a character vector) that you can manipulate:

system('echo "something"', intern=TRUE)

Perhaps a more useful example is found with the following command, which creates an R object from the return value from any arbitrary script in any language available in the underlying shell:

system('./my/script/here.pl', intern=TRUE)
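
The following sketch contrasts the two modes of system(). It assumes a UNIX-like shell in which echo is available; the variable names are illustrative.

```r
# Without intern, the output is printed and the (invisible) return value
# is the exit status of the command
status <- system('echo "something"')

# With intern = TRUE, standard output is captured as a character vector
out <- system('echo "something"', intern = TRUE)

# The captured value can be processed like any other R object
shouted <- toupper(out)
print(shouted)
```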

R JSON data types

To understand how R treats JSON data, start with the rjson library, which imports data in list format. To learn more about R data types, check out the link in Related topics. The list data type is the most flexible because you can decompose list data into any of the other data types and because the elements of a list can differ in type and length (as opposed to an atomic vector, whose elements must all share one type, or a data frame, whose columns must all be the same length). However, many of the statistical functions you can apply to a data set are not available for data in the list format. Therefore, you have to extract the useful data points from the list-formatted data into another data type.

A useful command for exploring how R views a given piece of data is str(). If you use the grades object created earlier, the output of str(grades) should look like Listing 1.

Listing 1. Structure of JSON data in R
str(grades)
List of 4
 $ :List of 4
  ..$ name  : chr "Amy"
  ..$ grade1: num 35
  ..$ grade2: num 41
  ..$ grade3: num 53
 $ :List of 4
  ..$ name  : chr "Bob"
  ..$ grade1: num 44
  ..$ grade2: num 37
  ..$ grade3: num 28
 $ :List of 4
  ..$ name  : chr "Charles"
  ..$ grade1: num 68
  ..$ grade2: num 65
  ..$ grade3: num 61
 $ :List of 4
  ..$ name  : chr "David"
  ..$ grade1: num 72
  ..$ grade2: num 78
  ..$ grade3: num 81

Decide which data points to extract and use the c() (combine, or concatenate) function to collect them into a vector, as shown below:

grade1.num <- c(grades[[1]]$grade1, grades[[2]]$grade1, grades[[3]]$grade1,
grades[[4]]$grade1)

This command creates a new object (grade1.num), which consists of each student's grade on the first exam. grade1.num is now a numeric vector, the most basic data type in R. To remove an R object, issue the rm() command. For example, rm(grade1.num) removes the grade1.num object just created from the R session. To create an object for a given student's grades, you can issue the following command:

Amy.grade <- c(grades[[1]]$grade1, grades[[1]]$grade2, grades[[1]]$grade3)

This command uses the assignment operator, which combines a hyphen (-) with the less-than (<) or greater-than (>) symbol, depending on the direction of the assignment: x <- value assigns to the name on the left, and value -> x assigns to the name on the right. Write the operator without a space between the two characters, because x < - value is read as a comparison against a negative value. Items held in list format have their individual elements accessed by index number (for example, [[1]] is the first element of the list) or by name with the dollar sign operator (for example, $grade1).
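
Typing every index by hand does not scale beyond a few records. As a hedged alternative, the sketch below uses sapply() to pull one field out of every record. The grades list is rebuilt by hand here so that the example runs on its own; its values are taken from Listing 1, and the helper name student.names is illustrative.

```r
# Hand-built copy of the list that fromJSON() produces (values from Listing 1)
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

# Apply the same extraction to every record and simplify to a vector
grade1.num    <- sapply(grades, function(student) student$grade1)
student.names <- sapply(grades, function(student) student$name)

# Naming the elements allows lookup by student instead of by position
names(grade1.num) <- student.names
print(grade1.num["Charles"])
```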

A more useful data type in R is data.frame, which is a composite of equal-length vectors arranged as columns. To create a data.frame from the example data, first create numeric vectors for the remaining students' grades, as shown below:

Bob.grade <- c(grades[[2]]$grade1, grades[[2]]$grade2, grades[[2]]$grade3)
Charles.grade <- c(grades[[3]]$grade1, grades[[3]]$grade2, grades[[3]]$grade3)
David.grade <- c(grades[[4]]$grade1, grades[[4]]$grade2, grades[[4]]$grade3)

Next, combine all of the vectors into a single data frame with the following command:

All.grades <- data.frame(Amy.grade, Bob.grade, Charles.grade, David.grade)

Data imported into R from comma-separated values or spreadsheet programs also become a data.frame object using handlers built into R.
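
The same data frame can be built without naming each student individually. The following is one possible sketch, not the only way to do it: vapply() collects each student's three grades into a matrix column, and the column names are derived from the name field. The grades list is rebuilt by hand (values from Listing 1) so the example is self-contained; score.matrix is an illustrative name.

```r
grades <- list(
  list(name = "Amy",     grade1 = 35, grade2 = 41, grade3 = 53),
  list(name = "Bob",     grade1 = 44, grade2 = 37, grade3 = 28),
  list(name = "Charles", grade1 = 68, grade2 = 65, grade3 = 61),
  list(name = "David",   grade1 = 72, grade2 = 78, grade3 = 81)
)

# One column per student, one row per exam; numeric(3) states the expected
# shape of each result so that malformed records fail loudly
score.matrix <- vapply(
  grades,
  function(s) c(s$grade1, s$grade2, s$grade3),
  numeric(3)
)
colnames(score.matrix) <- paste0(sapply(grades, function(s) s$name), ".grade")

All.grades <- as.data.frame(score.matrix)
print(All.grades)
```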

R statistical functions

To gain basic knowledge of the statistical nature of a numeric vector such as Amy.grade, use the summary command:

summary(Amy.grade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     35      38      41      43      47      53

The summary command also works on data frames. The output of summary(All.grades) looks like Listing 2.

Listing 2. Summary of data frame in R
summary(All.grades)
   Amy.grade    Bob.grade     Charles.grade    David.grade  
 Min.   :35   Min.   :28.00   Min.   :61.00   Min.   :72.0  
 1st Qu.:38   1st Qu.:32.50   1st Qu.:63.00   1st Qu.:75.0  
 Median :41   Median :37.00   Median :65.00   Median :78.0  
 Mean   :43   Mean   :36.33   Mean   :64.67   Mean   :77.0  
 3rd Qu.:47   3rd Qu.:40.50   3rd Qu.:66.50   3rd Qu.:79.5  
 Max.   :53   Max.   :44.00   Max.   :68.00   Max.   :81.0

To determine whether two numeric vectors are correlated, use the cor() function, which returns the Pearson correlation coefficient by default. For example, you can examine the correlation of Bob.grade with the Amy.grade object by using the following command:

cor(Amy.grade, Bob.grade)
[1] -0.9930365

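Note: Covariance and variance use the same syntax as correlation, except with the functions cov() and var(), respectively. When var() is given two vectors, it returns their covariance, which is why both commands below produce the same value; var(Amy.grade) alone returns the variance of Amy's grades.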

cov(Amy.grade, Bob.grade)
[1] -73
var(Amy.grade, Bob.grade)
[1] -73
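
To make the relationship between these functions concrete, the sketch below recomputes the covariance and correlation by hand from the textbook sample formulas and compares them with the built-in results. The vectors are typed in directly from the grades in Listing 1; manual.cov and manual.cor are illustrative names.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)
n <- length(Amy.grade)

# Sample covariance: sum of products of deviations, divided by n - 1
manual.cov <- sum((Amy.grade - mean(Amy.grade)) *
                  (Bob.grade - mean(Bob.grade))) / (n - 1)

# Correlation is the covariance rescaled by both standard deviations
manual.cor <- manual.cov / (sd(Amy.grade) * sd(Bob.grade))

print(all.equal(manual.cov, cov(Amy.grade, Bob.grade)))
print(all.equal(manual.cor, cor(Amy.grade, Bob.grade)))

# With one argument, var() is the variance of that vector
print(var(Amy.grade))
```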

To calculate the median absolute deviation, use the mad() function:

mad(Charles.grade)
[1] 4.4478
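
The value is larger than the raw median of the absolute deviations because mad() multiplies by a scaling constant (1.4826 by default), which makes the result comparable to a standard deviation for normally distributed data. A short sketch of the calculation, with illustrative variable names:

```r
Charles.grade <- c(68, 65, 61)

center     <- median(Charles.grade)                # middle grade
raw.mad    <- median(abs(Charles.grade - center))  # unscaled deviation
scaled.mad <- 1.4826 * raw.mad                     # default constant of mad()

print(raw.mad)
print(all.equal(scaled.mad, mad(Charles.grade)))
```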

Another useful function that acts on numeric vectors is sd(). This function computes the sample standard deviation of a single numeric vector:

sd(Amy.grade)
[1] 9.165151

This function shows that the standard deviation of Amy's grades is 9.165151.
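
To get a standard deviation for every student in one call, one option is to apply sd() to each column of the data frame with sapply(). The sketch below rebuilds All.grades from the values in Listing 1 so that it runs on its own, and also checks Amy's result against the sample formula; sd.by.student and manual.sd are illustrative names.

```r
All.grades <- data.frame(
  Amy.grade     = c(35, 41, 53),
  Bob.grade     = c(44, 37, 28),
  Charles.grade = c(68, 65, 61),
  David.grade   = c(72, 78, 81)
)

# sapply() treats the data frame as a list of columns
sd.by.student <- sapply(All.grades, sd)
print(sd.by.student)

# Sample standard deviation by hand: note the n - 1 divisor
amy <- All.grades$Amy.grade
manual.sd <- sqrt(sum((amy - mean(amy))^2) / (length(amy) - 1))
print(all.equal(manual.sd, sd(amy)))
```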

Notice that the output begins with the index number [1], which marks the position of the first value printed on that line. The result is an ordinary R object, so you can use it like any other object or assign it a name (for example, x <- sd(Amy.grade)). The value of x is computed once, at assignment: if you later change the contents of the vector Amy.grade, x does not change until you run the assignment again.

Another useful statistical function is the Kolmogorov-Smirnov test. You can use this test with the example data to determine whether the probability distributions differ between two students.

Listing 3. Kolmogorov-Smirnov test
ks.test(Amy.grade, Bob.grade)
Two-sample Kolmogorov-Smirnov test
data:  Amy.grade and Bob.grade
D = 0.3333, p-value = 1
alternative hypothesis: two-sided

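The p-value of 1 means the test finds no evidence that Amy.grade and Bob.grade come from different distributions. With only three grades per student the test has very little power, so this result does not prove that the two distributions are the same.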
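
The printed report is also available as an object, which is convenient when the decision has to be made in code. In the sketch below, the significance level alpha is a hypothetical choice rather than something the test prescribes, and the vectors are typed in from Listing 1.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)

result <- ks.test(Amy.grade, Bob.grade)

d.stat <- unname(result$statistic)  # largest gap between the two empirical CDFs
p.val  <- result$p.value

alpha <- 0.05  # hypothetical significance level
if (p.val < alpha) {
  message("Reject the hypothesis that both samples share a distribution")
} else {
  message("No evidence of a difference at this sample size")
}
```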

R visualizations

R is well known for its ability to create data visualizations. The simplest function in this family is plot(). For instance, plot(David.grade) creates a simple scatter plot of that object. To add a title and labels for the axes, use the following command:

plot(David.grade, main = "David's Grades", ylab = "Grade", xlab = "Test Number")

The function dotchart is similar to plot and takes many of the same arguments, except that dotchart takes the labels=row.names(x) flag, which enables you to use the row names (if any) to label the chart.

R can also produce a histogram with the hist() function (for example, hist(All.grades[[1]])). This command is equivalent to hist(Amy.grade) but extracts the vector from the data frame created earlier.
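
A quick way to confirm the equivalence, and to see what the histogram would contain without drawing it, is to call hist() with plot = FALSE. This is a small sketch with a two-column data frame built from Listing 1 values; the name h is illustrative.

```r
Amy.grade <- c(35, 41, 53)
Bob.grade <- c(44, 37, 28)
All.grades <- data.frame(Amy.grade, Bob.grade)

# The first column of the data frame is the same vector as Amy.grade
print(identical(All.grades[[1]], Amy.grade))

# plot = FALSE returns the bin boundaries and counts instead of drawing
h <- hist(All.grades[[1]], plot = FALSE)
print(h$breaks)
print(h$counts)
```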

R can also produce bar plots with the barplot() function — for example, barplot(Charles.grade). To change the orientation of the graph, add the horiz=TRUE flag. You can add color by using the col flag, as shown below:

barplot(Charles.grade, horiz=TRUE, col="darkblue")

The boxplot() function (for example, boxplot(All.grades)) draws one box per column, so you can compare all of the vectors in a data frame side by side. To add axis labels, notches for median comparison, a title, and colors, enter the following command:

boxplot(All.grades, main = "Class Grades", ylab = "Grade",
        xlab = "Student", col = c("gold", "darkgreen", "blue", "red"),
        notch = TRUE)

Notice the use of the concatenate function nested within the boxplot function.
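
When R runs as a script rather than interactively, a plot has to be sent to a file. The following sketch wraps a box plot in a pdf() graphics device; the temporary output path and the "Student" axis label are choices made for this example, and the data frame is rebuilt from Listing 1 values. The notch argument is left out here because, with only three values per student, R typically warns that the notches extend outside the boxes.

```r
All.grades <- data.frame(
  Amy.grade     = c(35, 41, 53),
  Bob.grade     = c(44, 37, 28),
  Charles.grade = c(68, 65, 61),
  David.grade   = c(72, 78, 81)
)

out.file <- tempfile(fileext = ".pdf")  # illustrative output location

pdf(out.file)                # open a file-based graphics device
boxplot(All.grades, main = "Class Grades", ylab = "Grade",
        xlab = "Student", col = c("gold", "darkgreen", "blue", "red"))
dev.off()                    # close the device so the file is written

print(file.exists(out.file))
```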

Conclusion

This article explains how the R language provides a powerful tool for statistically analyzing data and displaying the results graphically. R has found a variety of uses in business sectors such as finance, biology, and engineering. Now that InfoSphere BigInsights products include R functions, R will gain in popularity.


Downloadable resources


Related topics


Published: January 21, 2014