R and the world of data

The value of statistics

From bioinformatics to interfacing with Google Maps, the R programming language is coming into its own as a powerful tool for professionals in many fields. This article explores unique applications of R with an eye toward inspiring its use in your profession.

Bill Zimmerly, Freelance Writer and Knowledge Engineer, Author

Bill Zimmerly is a knowledge engineer, a low-level systems programmer with expertise in various versions of UNIX and Microsoft® Windows®, and a free thinker who worships at the altar of Logic. Creating new technologies and writing about them are his passions. He resides in rural Hillsboro, Missouri, where the air is fresh, the views are inspiring, and good wineries are all around. You can reach him at bill@zimmerly.com.



26 November 2013

As expressed by Dr. Eppes at the beginning of the television series NUMB3RS, mathematics is critically important to modern science and engineering. As Sir Isaac Newton observed, we stand on the shoulders of giants: science and engineering are communal activities in which tools created for one purpose can often be used to increase productivity in other realms.

We all use math every day — to predict weather, to tell time, to handle money. Math is more than formulas and equations; it's logic, it's rationality, it's using your mind to solve the biggest mysteries we know.

Fictional character Dr. Charles Eppes, from the TV series "NUMB3RS"

It is difficult to overstate the importance of digital computers to modern science and engineering. The scientific method's imperative to observe and measure may begin the process, but analysis follows closely and requires tools that are powerful and easy to use to make sense of the volume of data collected.

Statistics is the mathematical discipline that makes sense of data. It is a broad field that covers the collection, organization, analysis, interpretation, and presentation of data.

Guided by the desire to produce a powerful tool for such statistical analysis of data, the engineers of Bell Labs produced a programming language in 1976 called S. As S grew in power and popularity, it was transformed over time into S-Plus, a commercial software package distributed by TIBCO. (See Resources to follow the history of S-Plus.)

As often happens with successful commercial tools, an open source implementation of S followed: R, created by Ross Ihaka and Robert Gentleman at the University of Auckland and now distributed as part of the GNU project. One major difference between S-Plus and R is that R is primarily a command-line-oriented software package, whereas S-Plus offers a graphical user interface (GUI).

The characteristics of R that make it particularly useful are that it is:

  • Interactive — You type commands and see results immediately.
  • Simple — You can easily obtain useful results from the moment you install it and begin to use it.
  • Comprehensive — R draws on the vast libraries of statistical analysis software that have grown up with the package.
  • Extensible — You can easily create your own libraries of functionality and share them with the R community.

R also offers a variety of tools you can use to easily import your data into the package for analysis.

R-project Toolkit in InfoSphere Streams

InfoSphere® Streams is an advanced computing platform that allows user-developed applications to ingest, analyze, and correlate information quickly as it arrives from thousands of real-time sources, handling very high data throughput rates: up to millions of events or messages per second. It includes an R-project Toolkit that enables you to apply complex data mining algorithms to detect patterns of interest in data streams. Learn more and download a trial version of InfoSphere Streams.

The importance of statistics

The discipline of statistics guides the decision-making process. Gathering meaningful data on the topic that interests you and determining statistical values such as the minimum, maximum, mode, mean, median, and standard deviation constrains the possible solutions and helps you make better decisions. For example, if the materials used in a product will melt from too much heat, you need to know that before mass production: you have to make sure the maximum expected temperature is well below the minimum melting point of the materials.
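
As a minimal sketch of the values named above, the following R commands compute them for a small set of temperature readings. The readings and the melting point are invented for illustration, and because base R has no built-in function for the statistical mode, the hypothetical helper stat_mode() derives it from a frequency table.

```r
# Hypothetical temperature readings in degrees Celsius (illustrative only)
temps <- c(71, 68, 74, 71, 69, 80, 71, 77)

# Hypothetical helper: the most frequent value
# (on ties, the smallest of the tied values is returned)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

stats <- c(min    = min(temps),
           max    = max(temps),
           mode   = stat_mode(temps),
           mean   = mean(temps),
           median = median(temps),
           sd     = sd(temps))
print(stats)

# Assumed minimum melting point of the material, used for a safety check
min_melting_point <- 120
safe <- max(temps) < min_melting_point
print(safe)
```

In a real project the comparison would also allow for a safety margin rather than a bare less-than test.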

Statistics help make intelligent decisions

Making informed, intelligent decisions requires work. No matter what the field of endeavor, you must apply due diligence to understand it before making decisions. There is no free lunch. If, for example, your goal is to make money as an investor (deciding which stocks have growth potential, at what price to buy them, and what the target selling price is), you must study the promising characteristics of the stock, and one of those important characteristics is how it has performed in the past.

Such statistical data isn't difficult to find. You can download it from sources that your broker can recommend. Downloading the historical data into a common spreadsheet format or a comma-separated text file makes input of this data into R easy. With the data in R, you can readily employ several statistical analysis tools to tease out the information you need to make informed decisions.
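
The import step might look something like the sketch below. The prices are invented, and the text is embedded in the script only to keep the example self-contained; with a real download you would pass a file name (for example a hypothetical prices.csv) to read.csv() instead.

```r
# Hypothetical historical prices; in practice this would be a downloaded
# file read with read.csv("prices.csv"), where the file name is an assumption.
csv_text <- "date,close
2013-11-18,41.20
2013-11-19,41.85
2013-11-20,40.95
2013-11-21,42.10
2013-11-22,42.60"

prices <- read.csv(text = csv_text, stringsAsFactors = FALSE)
prices$date <- as.Date(prices$date)

# Day-over-day percentage change in the closing price (first day has none)
prices$change_pct <- c(NA, 100 * diff(prices$close) / head(prices$close, -1))

print(prices)
print(summary(prices$close))
```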

The same is true for the engineering or science realm. NASA engineers spend a great deal of time building prototype devices such as rocket engines and testing them so they not only provide the required thrust but have a significant margin of safety. Choosing a particular design must be an informed decision so that money and lives aren't wasted in rockets that explode soon after launch. Reams and reams of test data have to be analyzed to guide these decisions.

Statistics help interpret data

R support in SPSS Statistics and SPSS Modeler

You can execute R algorithms within SPSS Statistics and SPSS Modeler and use algorithms and statistical techniques in SPSS Statistics that have been validated and proven over 40 years of use and testing. An SPSS Statistics Programmability Extension enables you to extend SPSS Statistics with external programming languages such as Python, R, .NET version of Microsoft Visual Basic, and the Java language. It also allows external applications to access the SPSS Statistics processor and draw upon its vast wealth of functionality. Learn more about SPSS Statistics and SPSS Modeler, and give SPSS Statistics a try at no cost.

Part of the process of statistical analysis is interpreting the data, that is, assigning meaning to it and determining the implications of that meaning. When, for example, herbicides are being designed, agricultural scientists apply them to carefully isolated plant populations and evaluate how well the plants do over time. The idea behind herbicides is to kill weeds without harming the intended crops, and a simple measure of their effectiveness might be the ratio of crop to weed over a fixed period of time.

If such data is gathered and the expected results do not appear, other clues in the data may hint at why. If other coincidental data was gathered, such as how much the plants were watered per day and how much sunshine fell on them, scientists might discover hidden patterns that point to the reason for the failed tests. Such a pattern may be as simple as too little water per day for the herbicide to be effective. Assigning meaning of that kind would not be possible without gathering and analyzing the herbicide performance data.
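
One way to look for such a pattern is to examine correlations between the measurements. The sketch below uses invented trial data and hypothetical column names; it is only meant to show the kind of commands involved, not a real herbicide study.

```r
# Hypothetical trial plots (all values invented for illustration)
trials <- data.frame(
  water_l_per_day = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
  sun_hours       = c(6.1, 5.8, 6.4, 6.0, 6.2, 5.9),
  crop_weed_ratio = c(1.1, 1.3, 1.9, 2.4, 2.8, 3.1)
)

# Pairwise correlations; values near 1 or -1 suggest a linear relationship
print(round(cor(trials), 2))

# A simple linear model of the ratio against both covariates
fit <- lm(crop_weed_ratio ~ water_l_per_day + sun_hours, data = trials)
print(coef(fit))
```

In this made-up data the ratio rises steadily with watering, which is the sort of hint that would prompt a closer look. A correlation alone does not establish the cause.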

Statistics help establish Bayesian inference

As additional data is collected, you can begin to employ Bayesian inference, a method of updating the probability estimate for a hypothesis. In recent years and with the advent of digital computer technology, Bayesian techniques for updating probabilities with new data have really come into their own. R is an ideal tool for applying Bayesian inference, because a significant number of contributed packages in the Comprehensive R Archive Network exist for applying it.
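
The idea of updating an estimate as data arrives can be sketched in base R without any contributed package. The example assumes a success/failure outcome with a Beta prior, a textbook conjugate model chosen here for brevity; the batch counts are invented.

```r
# Prior belief about a success probability: Beta(1, 1), that is, uniform
prior_a <- 1
prior_b <- 1

# Hypothetical batches of new observations: c(successes, failures)
batches <- list(c(7, 3), c(12, 8), c(30, 10))

a <- prior_a
b <- prior_b
for (batch in batches) {
  # Conjugate update: add successes to a and failures to b
  a <- a + batch[1]
  b <- b + batch[2]
  cat(sprintf("posterior Beta(%g, %g), mean = %.3f\n", a, b, a / (a + b)))
}

# 95% credible interval for the final posterior
interval <- qbeta(c(0.025, 0.975), a, b)
print(interval)
```

Each batch narrows the posterior, so the interval printed at the end is tighter than it would be after the first batch alone.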

Statistics enable you to mine for hidden treasures

The interactive nature of R enables a refreshing degree of freedom over the old batch methods of exploring data. With R, you are constantly inputting commands that build and display objects as your sense of exploration drives you. This flexibility can enable you to find hidden treasures in the data that add to the body of scientific knowledge.


R data types

Computer models of the real world are little more than mathematical abstractions. The system being modeled in an R session can be represented with many different data types.

R data types are stored in named variables, and to display the content or value of the variable, you need only type the name. Variables in R are case-sensitive (A is not the same as a) and must be unique. After starting R, you can enter commands such as those in Listing 1.

Listing 1. Named variables in R
> a <- 7
> a
[1] 7

> hours_per_day <- 24
> days_per_week <- 7
> hours_per_week <- hours_per_day * days_per_week
> hours_per_week
[1] 168

Scalars and vectors

Mathematics is the language of science, and the simplest of mathematical objects is the scalar, a single number that represents the value or magnitude of something. In R, you create variables with scalar values by using a simple assignment. Each of the following three commands assigns 3.14 to a variable named pi (which masks R's built-in constant pi for the rest of the session):

> pi <- 3.14
> 3.14 -> pi
> pi = 3.14
> pi
[1] 3.14

The R manual defines a vector as "a single entity consisting of a collection of things." The c() function for constructing vectors appears like this:

> days_per_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

Furthermore, you can attach names to elements of a vector as follows:

> names(days_per_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> days_per_month
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
 31  28  31  30  31  30  31  31  30  31  30  31

Notice that the input data to the names vector are character strings. A vector can hold any of R's basic data types, including character strings, but all elements of one vector share the same type.
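
Once a vector has names, you can index it by name and operate on it as a whole. The following sketch rebuilds the same vector, using R's built-in month.abb constant for the names, and shows a few such operations.

```r
days_per_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(days_per_month) <- month.abb  # built-in vector "Jan" ... "Dec"

print(days_per_month["Feb"])                   # look up one element by name
print(days_per_month[c("Jun", "Jul", "Aug")])  # several elements at once

total_days  <- sum(days_per_month)             # arithmetic over the whole vector
long_months <- names(days_per_month)[days_per_month == 31]

print(total_days)
print(long_months)
```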

Arrays and matrices

You can create multidimensional arrays and matrices in R simply and in many ways. One of the simplest ways is to create and fill the vector first, then apply the dimensional parameters for it. Listing 2 provides an example.

Listing 2. Constructing multidimensional arrays
> a <- c(2,4,5,67,34)
> plot(a)
> rm(a)
> ls()
character(0)
> a <- 1:20
> a
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> dim(a) <- c(4, 5)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
> a[3,2]
[1] 7

It is also possible to use the array() function to create the whole array in one command, as shown in Listing 3.

Listing 3. A multidimensional array with one command
> b <- array(1:20, dim=c(4, 5))
> b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

You can perform arithmetic operations on the arrays, too (see Listing 4).

Listing 4. Array arithmetic
> a+b
     [,1] [,2] [,3] [,4] [,5]
[1,]    2   10   18   26   34
[2,]    4   12   20   28   36
[3,]    6   14   22   30   38
[4,]    8   16   24   32   40
> a-b
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
> a*b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   25   81  169  289
[2,]    4   36  100  196  324
[3,]    9   49  121  225  361
[4,]   16   64  144  256  400
> a/b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    1    1    1    1
[3,]    1    1    1    1    1
[4,]    1    1    1    1    1
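
Note that the * operator in Listing 4 works element by element. If you want a true matrix product, R provides the %*% operator. The sketch below contrasts the two, transposing b with t() so that the dimensions are compatible.

```r
a <- array(1:20, dim = c(4, 5))
b <- array(1:20, dim = c(4, 5))

elementwise <- a * b        # 4 x 5: each cell multiplied by its counterpart
product     <- a %*% t(b)   # 4 x 4: matrix product of a and the transpose of b

print(dim(elementwise))
print(dim(product))
print(product[1, 1])        # sum of squares of the first row of a
```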

Next, define a square matrix and calculate the eigenvalues and eigenvectors for it (see Listing 5).

Listing 5. Calculating eigenvalues and eigenvectors
> c <- array(1:25, dim=c(5, 5))
> eigen(c)
$values
[1]  6.864208e+01+0.000000e+00i -3.642081e+00+0.000000e+00i
[3]  6.638046e-16+1.280454e-15i  6.638046e-16-1.280454e-15i
[5]  3.657972e-17+0.000000e+00i

$vectors
              [,1]           [,2]                  [,3]                  [,4]
[1,] -0.3800509+0i  0.76703416+0i  0.3058009-0.1907904i  0.3058009+0.1907904i
[2,] -0.4124552+0i  0.48590617+0i  0.1370806+0.2690219i  0.1370806-0.2690219i
[3,] -0.4448594+0i  0.20477817+0i -0.7527335+0.0000000i -0.7527335+0.0000000i
[4,] -0.4772637+0i -0.07634982+0i -0.1289781-0.0439041i -0.1289781+0.0439041i
[5,] -0.5096680+0i -0.35747782+0i  0.4388302-0.0343274i  0.4388302+0.0343274i
               [,5]
[1,] -0.11454104+0i
[2,]  0.01126422+0i
[3,]  0.58902185+0i
[4,] -0.75367218+0i
[5,]  0.26792716+0i

As you can see, R is highly interactive and can be fun to work with, too.
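
The matrix in Listing 5 has rank 2, so three of its eigenvalues are zero up to rounding error, which is why tiny values such as 6.6e-16 appear in the output. To check a result from eigen() yourself, you can test the defining property that multiplying the matrix by an eigenvector gives the eigenvector scaled by its eigenvalue. The sketch below does this for a small symmetric matrix, chosen here (as an assumption, for simplicity) so that the eigenvalues are real.

```r
# A small symmetric matrix, chosen so the eigenvalues are real
m <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(m)
print(e$values)   # eigenvalues, largest first

# Check the defining property m %*% v = lambda * v for each pair
for (i in seq_along(e$values)) {
  v   <- e$vectors[, i]
  lhs <- as.vector(m %*% v)
  rhs <- e$values[i] * v
  cat("pair", i, "holds:", isTRUE(all.equal(lhs, rhs)), "\n")
}
```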

Data frames

Data frames in R are a complex data type that consists of a list of vectors of equal length. You can picture a data frame as a spreadsheet in which each vector is a column and each row is one record (see Listing 6).

Listing 6. Data frames
> name <- c("Joe", "Mark", "Tom")
> age <- c(23, 35, 64)
> working <- c(TRUE, TRUE, FALSE)
> people <- data.frame(name, age, working)
> people
  name age working
1  Joe  23    TRUE
2 Mark  35    TRUE
3  Tom  64   FALSE

You can easily apply the R summary() function to summarize statistical data about any R object. In Listing 7, I apply the function to the data frame that was just created.

Listing 7. The R summary() function
> summary(people)
   name        age         working       
 Joe :1   Min.   :23.00   Mode :logical  
 Mark:1   1st Qu.:29.00   FALSE:1        
 Tom :1   Median :35.00   TRUE :2        
          Mean   :40.67   NA's :0        
          3rd Qu.:49.50                  
          Max.   :64.00
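
Beyond summary(), most day-to-day work with a data frame involves picking out columns and rows. The following sketch rebuilds the people data frame and shows a few common operations; the senior column and its age threshold are invented for the example.

```r
people <- data.frame(
  name    = c("Joe", "Mark", "Tom"),
  age     = c(23, 35, 64),
  working = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

workers  <- people[people$working, ]  # rows where working is TRUE
mean_age <- mean(people$age)          # statistic over a single column
people$senior <- people$age >= 60     # add a derived column (hypothetical rule)

print(workers)
print(mean_age)
print(people)
```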

Lists

In R, a list is a generic vector containing other objects. Because an ordinary vector must contain elements of a single type, lists let you group objects of different types, including vectors of different lengths. Listing 8 provides some examples.

Listing 8. The R list() function
> a <- 1:5
> a
[1] 1 2 3 4 5
> b <- c("xx", "yy", "zz")
> b
[1] "xx" "yy" "zz"
> c <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
> c
[1]  TRUE  TRUE FALSE  TRUE FALSE
> d <- list(a, b, c)
> d
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "xx" "yy" "zz"

[[3]]
[1]  TRUE  TRUE FALSE  TRUE FALSE

> summary(d)
     Length Class  Mode     
[1,] 5      -none- numeric  
[2,] 3      -none- character
[3,] 5      -none- logical
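
List components can also be given names, which makes them easier to retrieve. The sketch below builds a list like the one in Listing 8 with names that are invented for the example, then accesses its components in a few ways.

```r
d <- list(ids   = 1:5,
          codes = c("xx", "yy", "zz"),
          flags = c(TRUE, TRUE, FALSE, TRUE, FALSE))

print(d$codes)           # access a component by name
print(d[[1]])            # access a component by position
print(d[["flags"]][3])   # third element of the logical vector

# Apply a function to every component and collect the results
lengths_by_name <- sapply(d, length)
print(lengths_by_name)

str(d)                   # compact overview of the whole structure
```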

Factors

Factors in R are known as categorical variables. They are useful in that they have a limited number of different values, often used for categorizing data. For example, consider different categories of wrestlers based on their weight:

  • Heavyweights are 190 pounds or heavier.
  • Middleweights range from 165 to 189 pounds.
  • Lightweights are those under 165 pounds.

Listing 9 shows this information in a data frame.

Listing 9. A data frame of wrestlers
> name <- c("Joe", "Mark", "Tom")
> weight <- c(135, 176, 169)
> wrestlers <- data.frame(name, weight)
> wrestlers
  name weight
1  Joe    135
2 Mark    176
3  Tom    169

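Now that I have my table of wrestlers in a data frame, I'll create categories for the class in which these guys wrestle. Notice that to properly "cut" the categories, there must always be one more break than there are labels. The breaks and labels arguments are passed by name, and right=FALSE makes each interval include its lower bound, so that a 165-pound wrestler is a middleweight and a 190-pound wrestler is a heavyweight.

Listing 10. Factors or categorical variables
> labels <- c('Lightweight', 'Middleweight', 'Heavyweight')
> breaks <- c(1, 165, 190, 500)
> class <- cut(wrestlers$weight, breaks=breaks, labels=labels, right=FALSE)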
> table(class)
class
 Lightweight Middleweight  Heavyweight 
           1            2            0

Of my three wrestlers, one is a lightweight and two are middleweights. There are no heavyweights.
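
To see how cut() treats weights that fall exactly on a boundary, and how a factor is stored, consider the following sketch. The weights are invented; the breaks and labels arguments are named explicitly, and right = FALSE is passed so that each interval includes its lower bound, matching the weight classes listed earlier.

```r
labels <- c("Lightweight", "Middleweight", "Heavyweight")
breaks <- c(1, 165, 190, 500)

# right = FALSE makes each interval include its lower bound: [165, 190)
weights <- c(135, 164, 165, 189, 190)
class <- cut(weights, breaks = breaks, labels = labels, right = FALSE)

print(levels(class))       # the fixed set of categories
print(as.integer(class))   # factors are stored as integer codes
print(table(class))        # counts per category, including empty ones
```

With the default right = TRUE, the intervals would include their upper bounds instead, so 165 would count as a lightweight and 190 as a middleweight.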


R transformations

It is often necessary to make on-the-fly transformations of your data to get a better picture of what's happening. Sometimes, such transformations can make the data much more uniform and less skewed. For example, consider the growth of the Internet over the six-year period from 1996 to 2002. The number of registered domains in each year indicates explosive growth during this period. (See Resources for the source of this data.)

> years <- c(1996, 1998, 2000, 2002)
> domains <- c(1560000, 3900000, 15600000, 32500000)
> plot(years, domains)

Figure 1 shows registered Internet domains by year.

Figure 1. Registered domains by year

As you can see, the graph that this command produced shows a strong curvature upward, indicating approximately exponential growth during this period. If you suspect exponential growth, you would expect the logarithms of the data points to fall along a nearly straight line. Making this transformation on the domain data confirms this:

> logD <- log(domains)
> plot(years, logD)

Figure 2 shows registered domains logarithmically.

Figure 2. Registered domains by year (logarithmic)

It might also prove useful to examine histograms of the domain data and their logarithms:

> hist(domains)
> hist(logD)

Figure 3 shows registered domains as a histogram.

Figure 3. Registered domains by year (histogram)

Figure 4 shows the histogram of the logarithms.

Figure 4. Registered domain histogram (logarithmic)

Notice that the logarithmic data is much less skewed. Such transformations can tease out details you may not see at first glance and assist in interpreting the results. Because R makes such transformations easy, why not leverage this power as much as you can?
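
One way to put a number on the growth is to fit a straight line to the logarithms. The sketch below uses lm() for an ordinary least-squares fit; with only four data points the result is a rough estimate at best, and the variable names rate and doubling_years are my own.

```r
years   <- c(1996, 1998, 2000, 2002)
domains <- c(1560000, 3900000, 15600000, 32500000)

# Fit a straight line to the logarithms: log(domains) = intercept + rate * year
fit  <- lm(log(domains) ~ years)
rate <- unname(coef(fit)["years"])   # growth rate per year on the log scale

# For exponential growth, the doubling time is log(2) / rate
doubling_years <- log(2) / rate

print(rate)
print(doubling_years)
```

Calling plot(years, log(domains)) followed by abline(fit) draws the fitted line over the transformed points.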


A simple R project

This simple R project assumes you are using a computer running Debian Linux®. Installing R on other operating systems is similarly easy. The goal of the project is to calculate basic statistics for an input dataset found on the Internet after a Google search.

Installing R

Installing R is easy. Issuing the standard Debian apt-get command downloads and installs the package:

$ sudo apt-get install r-base
Password: _
.
(Installation messages)
.

After installation, you can start R using the Tk GUI like this:

$ R -g Tk &

Finding a data source

The world is awash in data on many different topics, so finding a data source you can play with to learn R isn't difficult. For the sake of this article, a timely topic might be "Government spending on healthcare." By typing that exact phrase into Google, I was instantly presented with more than 99 million links. I chose to click the first link that seemed to be from the U.S. government and immediately saw a table with a download link.

Importing the data

I clicked the download link on the "U.S. Health Care Spending for 2013" page and chose to download the comma-separated values file usgs_2013.csv to my workstation. As always seems to be the case with data files, a little editing was called for: I had to delete a few rows above and below the table so only the column headings and the data appeared. After that, I started R with the command given above and entered the following command to read the table data:

> hc <- read.csv(file="usgs_2013.csv", stringsAsFactors=FALSE)

Note: stringsAsFactors=FALSE prevents R from converting character strings to factors. (In R 4.0.0 and later, this is already the default.)

Analyzing the data

Suppose that the only thing I want from this data is the mean and sum of all of the numbers in the Fed column. I entered the following command, and it failed:

> sum(as.numeric(hc$Fed))
[1] NA
Warning message:
NAs introduced by coercion

The problem was that some of the values had a comma in them, which prevented the proper conversion of the numeric values. (Those with commas were "coerced" into NAs by the R conversion functions.) Listing 11 shows how the commas were removed to fix the problem.

Listing 11. Fixing errors in the data
> hc$Fed <- as.numeric(gsub(",", "", hc$Fed, fixed=TRUE))
> as.numeric(hc$Fed)
 [1]   874.3   882.2   510.5     0.0     4.6    33.1     0.0   333.9    98.0
[10]   856.5   422.4    37.0    94.5    55.9   141.4   222.8     0.0  3685.0
[19]   972.9 17249.3

Then it was smooth sailing to get the values I wanted.

Listing 12. Displaying the mean() and the sum()
> mean(as.numeric(hc$Fed))
[1] 1323.715
> sum(as.numeric(hc$Fed))
[1] 26474.3

Note: These values don't have any useful meaning other than for demonstration purposes.
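
If you want to try the cleaning steps without downloading anything, the following self-contained sketch reproduces the situation with a few invented rows. It uses gsub() to remove every comma in a value and na.rm = TRUE to skip entries that still cannot be converted; the column names and figures are hypothetical.

```r
# Hypothetical extract with thousands separators and one unusable entry
csv_text <- 'Category,Fed
"A","1,874.3"
"B","882.2"
"C","n/a"
"D","17,249.3"'

hc <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Remove every comma, then convert; unparseable entries become NA
fed <- suppressWarnings(as.numeric(gsub(",", "", hc$Fed, fixed = TRUE)))

print(fed)
print(sum(fed, na.rm = TRUE))    # ignore NA values when summing
print(mean(fed, na.rm = TRUE))
```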


The world is awash in data

As each of the following paragraphs and links will show, R is being used to process data in useful ways throughout the world. The creativity of people is amazing when they have access to the right tools for the job.

Airline performance

In 2009, the American Statistical Association held its biennial Data Exposition event. This event included various competitions, one of which was on the topic "Airline on-time performance." From the association's website:

"The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed."

That's called big data.

Although the SAS Institute produced the winning poster, teams from Iowa State University and Yale University used R to "tease some fascinating facets out of the data," as well.

Bioinformatics

Although we have reached the stage where our science can read the book of life, also known as DNA, making sense of it is still vast, mostly uncharted territory. Our greatest tools are being brought to bear on the problem of trying to tease out the meaning of the various subsequences of DNA's bases, and R is a powerful tool in this battle.

Avril Coghlan wrote three free booklets on using R in bioinformatics. In them, she described how R is being used for multivariate analysis, time series analysis, and biomedical statistics (see Resources).

Finance and economics

R is being employed by various economists and financial planners to study the ebb and flow of financial markets. A series of articles in The Economist at Large is dedicated to illustrating how R is being put to use in analyzing the performance of various financial instruments with an eye toward more efficient portfolio management.

Investing is part art, part science, and statistical analysis is at the heart of most of the science. It should come as no surprise that the R programming language can be found in this important realm.

Google Maps

Google Maps is one of the most powerful tools available for presenting data that has a mapping component to it. Examples of such data abound: the density of doctors per 10,000 citizens, the average home value for a given region, and so on. What truly makes Google Maps so powerful is the application programming interface (API) that lets programmers merge this tool's capabilities with their own databases.

Thanks to the efforts of various individuals in the R community, a bridge between the Google Maps API and R has been built and is being actively used and improved on (see Resources).

Government

As has already been demonstrated, the government is a great source for data of all kinds. A simple Google search with "government data on" reveals many links on a huge variety of topics. It is necessary for those who lead a nation to know what they are doing, and the only way to accomplish that is to collect and analyze data and then make informed, data-driven decisions. Because all of this activity is paid for by taxpayers, most of the data is made available to download and use (see Resources).

Oceanography

Oceanographers like Dan Kelley have access to many databases of oceanographic data. He has written OCE, a statistical analysis package for R that enables complex analysis of many of these databases. The package includes tools to calculate seawater salinity from temperature and density, read oceanographic data files, calculate geodesic distances on earth, plot a coastline, and more. It is a comprehensive and powerful package (see Resources).

The weather

Every six hours, the U.S. National Centers for Environmental Prediction (NCEP) forecasts the weather for the entire world. The data that NCEP uses is available for download by anyone who wants it, and Joe Wheatley of the R community is using that data to do his own studies (see Resources).


Conclusion

As Dr. Charles Eppes observed, we all use math every day. Nearly every product or service we use wouldn't exist were it not for mathematics. Furthermore, we could not apply mathematics as efficiently or effectively as we have without digital computers and software like R being readily available. It is my sincere desire to see many more people take an active interest in taking advantage of this powerful tool for their businesses.

Resources

Learn

Get products and technologies

Discuss
