As expressed by Dr. Eppes in the beginning of the television series NUMB3RS, mathematics is critically important to modern science and engineering. As Sir Isaac Newton observed, we stand on the shoulders of giants: science and engineering are communal activities in which tools created for one purpose can often be used to increase productivity in other realms.
“We all use math every day — to predict weather, to tell time, to handle money. Math is more than formulas and equations; it's logic, it's rationality, it's using your mind to solve the biggest mysteries we know.”
Fictional character Dr. Charles Eppes, from the TV series "NUMB3RS"
It is difficult to overstate the importance of digital computers to modern science and engineering. The scientific method's imperative to observe and measure may begin the process, but analysis follows closely and requires tools that are powerful and easy to use to make sense of the volume of data collected.
The primary purpose of the mathematical discipline of statistics is to enable the collection and analysis of data. Statistics is a broad category and covers the collection, organization, analysis, interpretation, and presentation of data.
Guided by the desire to produce a powerful tool for such statistical analysis of data, the engineers of Bell Labs produced a programming language in 1976 called S. As S grew in power and popularity, it was transformed over time into S-Plus, a commercial software package distributed by TIBCO. (See Resources to follow the history of S-Plus.)
As often happens to successful commercial tools, an open source counterpart appeared: R, an implementation of the S language that was created by Ross Ihaka and Robert Gentleman at the University of Auckland and is now part of the GNU project. One major difference between S-Plus and R is that R is primarily a command-line-oriented software package, whereas S-Plus offers a graphical user interface (GUI).
The characteristics of R that make it particularly useful are that it is:
- Interactive — You type commands and see results immediately.
- Simple — You can easily obtain useful results from the moment you install it and begin to use it.
- Comprehensive — R draws on the vast libraries of statistical analysis software that have grown up with the package.
- Extensible — You can easily create your own libraries of functionality and share them with the R community.
R also offers a variety of tools you can use to easily import your data into the package for analysis.
The importance of statistics
The discipline of statistics guides the decision-making process. Gathering meaningful data on the topic that interests you and determining statistical values such as the minimum, maximum, mode, mean, median, and standard deviation of that data constrains the possible solutions and helps you make better decisions. For example, if the materials used in a product will melt from too much heat, you need to know that before mass production: you must make sure the maximum expected temperature is far less than the minimum melting point of the materials being used.
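As a minimal sketch of the values named above, the following R commands compute each one for a small set of temperature readings. The readings, the melting point, and the helper stat_mode() are hypothetical and exist only for illustration. Base R has no built-in function for the statistical mode (its mode() function reports the storage mode of an object), so the helper tabulates the values and picks the most frequent one.

```r
# Hypothetical temperature readings (degrees C); not real measurements
temps <- c(71.2, 68.5, 74.9, 71.2, 69.8, 73.4, 71.2, 70.1)

# Hypothetical helper: the most frequent value in x
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Collect the basic statistics in one named vector
stats <- c(min = min(temps), max = max(temps), mode = stat_mode(temps),
           mean = mean(temps), median = median(temps), sd = sd(temps))
print(round(stats, 2))

# Hypothetical minimum melting point of the material, and the margin it leaves
min_melting_point <- 120
margin <- min_melting_point - max(temps)
print(margin)
```

The design decision then rests on whether that margin is large enough for the product in question.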
Statistics help make intelligent decisions
Making informed, intelligent decisions requires work. No matter what the field of endeavor, you must apply due diligence to understand it before making decisions. There is no free lunch. If, for example, your goal is to make money as an investor (deciding which stocks have growth potential, at what price to buy them, and what the target selling price is), you must study the promising characteristics of the stock, and one of those important characteristics is how it has performed in the past.
Such statistical data isn't difficult to find. You can download it from sources that your broker can recommend. Downloading the historical data into a common spreadsheet format or a comma-separated text file makes input of this data into R easy. With the data in R, you can readily employ several statistical analysis tools to tease out the information you need to make informed decisions.
The same is true for the engineering or science realm. NASA engineers spend a great deal of time building prototype devices such as rocket engines and testing them so they not only provide the required thrust but have a significant margin of safety. Choosing a particular design must be an informed decision so that money and lives aren't wasted in rockets that explode soon after launch. Reams and reams of test data have to be analyzed to guide these decisions.
Statistics help interpret data
Part of the process of statistical analysis is interpreting the data — that is, assigning meaning to it and determining the implications of that meaning. When, for example, herbicides are being designed, agricultural scientists apply them to carefully isolated plant populations and evaluate how well they did over time. The idea behind herbicides is to kill weeds without harming the intended crops, and a simple measure of their effectiveness might be the ratio of crop to weed over a fixed period of time.
If the gathered data does not show the expected results, other clues in the data may hint at why. If other coincidental data was gathered, such as how much the plants were watered per day and how much sunshine fell on them, scientists might discover hidden patterns that point to the reason for the failed tests. Such a pattern may be as simple as not having enough water per day to make the herbicide effective. Assigning meaning of that kind would not be possible without gathering and analyzing the herbicide performance data.
Statistics help establish Bayesian inference
As additional data is collected, you can begin to employ Bayesian inference, a method of updating the probability estimate for a hypothesis. In recent years and with the advent of digital computer technology, Bayesian techniques for updating probabilities with new data have really come into their own. R is an ideal tool for applying Bayesian inference, because a significant number of contributed packages in the Comprehensive R Archive Network exist for applying it.
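One hedged illustration of updating a probability estimate is the conjugate Beta-binomial model, which needs only base R rather than a contributed package. The prior parameters and the trial counts below are assumptions chosen for illustration, not data from this article.

```r
# Prior belief about a success probability: Beta(a, b); Beta(1, 1) is uniform
prior_a <- 1
prior_b <- 1

# Hypothetical new data: 7 successes and 3 failures in 10 trials
successes <- 7
failures  <- 3

# Conjugate update: add the successes to a and the failures to b
post_a <- prior_a + successes
post_b <- prior_b + failures

# Posterior mean, and a 95% credible interval from the Beta quantile function
post_mean <- post_a / (post_a + post_b)
interval  <- qbeta(c(0.025, 0.975), post_a, post_b)

print(post_mean)
print(interval)
```

When more data arrives, the posterior parameters simply become the prior for the next update.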
Statistics enable you to mine for hidden treasures
The interactive nature of R enables a refreshing degree of freedom over the old batch methods of exploring data. With R, you are constantly inputting commands that build and display objects as your sense of exploration drives you. This flexibility can enable you to find hidden treasures in the data that add to the body of scientific knowledge.
R data types
Computer models of the real world are little more than mathematical abstractions. The system being modeled in an R session can be represented with many different data types.
R data types are stored in named variables, and to display the content or value of the variable, you need only type the name. Variables in R are case-sensitive (A is not the same as a) and must be unique. After starting R, you can enter commands such as those in Listing 1.
Listing 1. Named variables in R
> a <- 7
> a
[1] 7
> hours_per_day <- 24
> days_per_week <- 7
> hours_per_week <- hours_per_day * days_per_week
> hours_per_week
[1] 168
Scalars and vectors
Mathematics is the language of science, and the simplest of mathematical objects is the scalar, a single number that represents the value or magnitude of something. In R, you create variables with scalar values by using a simple assignment. The first three of the following commands mean the same thing: assign 3.14 to a variable named pi.
> pi <- 3.14
> 3.14 -> pi
> pi = 3.14
> pi
[1] 3.14
The R manual defines a vector as "a single entity consisting of a collection of things." The c() function for constructing vectors appears like this:
> days_per_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
Furthermore, you can attach names to elements of a vector as follows:
> names(days_per_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> days_per_month
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
 31  28  31  30  31  30  31  31  30  31  30  31
Notice that the input data to the names vector are character strings. Vectors can contain a collection of any data type R offers, including character strings.
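Once a vector has names, its elements can be looked up by name as well as by position. The following sketch rebuilds the same vector and shows a few lookups. It uses month.abb, a constant built into base R, instead of typing the twelve abbreviations; the variable names feb, summer, total, and long_months are mine, chosen for illustration.

```r
days_per_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(days_per_month) <- month.abb   # built-in constant: "Jan", "Feb", ..., "Dec"

feb    <- days_per_month["Feb"]                    # look up one element by name
summer <- days_per_month[c("Jun", "Jul", "Aug")]   # several names at once
total  <- sum(days_per_month)                      # days in a non-leap year

# Logical indexing: names of the months that have 31 days
long_months <- names(days_per_month)[days_per_month == 31]

print(feb)
print(summer)
print(total)
print(long_months)
```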
Arrays and matrices
You can create multidimensional arrays and matrices in R simply and in many ways. One of the simplest ways is to create and fill the vector first, then apply the dimensional parameters for it. Listing 2 provides an example.
Listing 2. Constructing multidimensional arrays
> a <- c(2,4,5,67,34)
> plot(a)
> rm(a)
> ls()
character(0)
> a <- 1:20
> a
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> dim(a) <- c(4, 5)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
> a[3,2]
[1] 7
It is also possible to use the array() function to create the whole array in one command, as shown in Listing 3.
Listing 3. A multidimensional array with one command
> b <- array(1:20, dim=c(4, 5))
> b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
You can perform arithmetic operations on the arrays, too (see Listing 4).
Listing 4. Array arithmetic
> a+b
     [,1] [,2] [,3] [,4] [,5]
[1,]    2   10   18   26   34
[2,]    4   12   20   28   36
[3,]    6   14   22   30   38
[4,]    8   16   24   32   40
> a-b
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
> a*b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   25   81  169  289
[2,]    4   36  100  196  324
[3,]    9   49  121  225  361
[4,]   16   64  144  256  400
> a/b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    1    1    1    1
[3,]    1    1    1    1    1
[4,]    1    1    1    1    1
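Note that the arithmetic operators work element by element, so a*b is not the matrix product of linear algebra. The sketch below contrasts the two, using R's %*% operator for the matrix product. Because two 4 x 5 arrays cannot be matrix-multiplied directly, it multiplies by the transpose t(b); that choice is mine, made only so the dimensions conform.

```r
a <- array(1:20, dim = c(4, 5))
b <- array(1:20, dim = c(4, 5))

# Element-by-element product: same shape as the inputs (4 x 5)
elementwise <- a * b

# Matrix product: (4 x 5) times (5 x 4) gives a 4 x 4 result
matrix_product <- a %*% t(b)

print(dim(elementwise))
print(dim(matrix_product))
print(matrix_product)
```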
Next, define a square matrix and calculate the eigenvalues and eigenvectors for it (see Listing 5).
Listing 5. Calculating eigenvalues and eigenvectors
> c <- array(1:25, dim=c(5, 5))
> eigen(c)
$values
[1]  6.864208e+01+0.000000e+00i -3.642081e+00+0.000000e+00i
[3]  6.638046e-16+1.280454e-15i  6.638046e-16-1.280454e-15i
[5]  3.657972e-17+0.000000e+00i

$vectors
              [,1]           [,2]                  [,3]                  [,4]
[1,] -0.3800509+0i  0.76703416+0i  0.3058009-0.1907904i  0.3058009+0.1907904i
[2,] -0.4124552+0i  0.48590617+0i  0.1370806+0.2690219i  0.1370806-0.2690219i
[3,] -0.4448594+0i  0.20477817+0i -0.7527335+0.0000000i -0.7527335+0.0000000i
[4,] -0.4772637+0i -0.07634982+0i -0.1289781-0.0439041i -0.1289781+0.0439041i
[5,] -0.5096680+0i -0.35747782+0i  0.4388302-0.0343274i  0.4388302+0.0343274i
               [,5]
[1,] -0.11454104+0i
[2,]  0.01126422+0i
[3,]  0.58902185+0i
[4,] -0.75367218+0i
[5,]  0.26792716+0i
As you can see, R is highly interactive and can be fun to work with, too.
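A quick way to convince yourself that the eigen() output is sensible is to check the defining relation: a matrix times one of its eigenvectors equals the eigenvalue times that eigenvector. The following is a sketch of that check for the leading eigenpair of the same 5 x 5 matrix. The names m, lambda, v, and residual are mine, and Mod() is used because eigen() may return complex values for a non-symmetric matrix.

```r
m <- array(1:25, dim = c(5, 5))
e <- eigen(m)

lambda <- e$values[1]      # eigenvalue with the largest modulus
v      <- e$vectors[, 1]   # its eigenvector

# m %*% v should equal lambda * v, up to floating-point rounding
residual <- m %*% v - lambda * v
print(max(Mod(residual)))

# The eigenvalues should also sum to the trace (sum of the diagonal) of m
print(Re(sum(e$values)))
print(sum(diag(m)))
```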
Data frames in R are a complex data type that consists of a list of vectors of equal length. You can picture them as spreadsheet rows and columnar data (see Listing 6).
Listing 6. Data frames
> name <- c("Joe", "Mark", "Tom")
> age <- c(23, 35, 64)
> working <- c(TRUE, TRUE, FALSE)
> people <- data.frame(name, age, working)
> people
  name age working
1  Joe  23    TRUE
2 Mark  35    TRUE
3  Tom  64   FALSE
You can easily apply the R summary() function to summarize statistical data about any R object. In Listing 7, I apply the function to the data frame that was just created.
Listing 7. The R summary() function
> summary(people)
   name        age         working
 Joe :1   Min.   :23.00   Mode :logical
 Mark:1   1st Qu.:29.00   FALSE:1
 Tom :1   Median :35.00   TRUE :2
          Mean   :40.67   NA's :0
          3rd Qu.:49.50
          Max.   :64.00
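Because a data frame is a list of equal-length vectors, you can pull out a single column with the $ operator and select rows with a logical condition. The sketch below rebuilds the same three-person data frame and shows both operations; the variable names ages, employed, and mean_age_employed are mine, chosen for illustration.

```r
people <- data.frame(
  name    = c("Joe", "Mark", "Tom"),
  age     = c(23, 35, 64),
  working = c(TRUE, TRUE, FALSE)
)

ages     <- people$age                # one column, returned as a vector
employed <- people[people$working, ]  # only the rows where working is TRUE

# Average age of the people who are working
mean_age_employed <- mean(employed$age)

print(ages)
print(employed)
print(mean_age_employed)
str(people)   # compact view of the column types
```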
In R, a list is a generic vector containing other objects. Because vectors must contain objects of the same type, lists allow you to group collections of diverse object vectors. Listing 8 provides some examples.
Listing 8. The R list
> a <- 1:5
> a
[1] 1 2 3 4 5
> b <- c("xx", "yy", "zz")
> b
[1] "xx" "yy" "zz"
> c <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
> c
[1]  TRUE  TRUE FALSE  TRUE FALSE
> d <- list(a, b, c)
> d
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "xx" "yy" "zz"

[[3]]
[1]  TRUE  TRUE FALSE  TRUE FALSE

> summary(d)
     Length Class  Mode
[1,] 5      -none- numeric
[2,] 3      -none- character
[3,] 5      -none- logical
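To get an element back out of a list, use double brackets, which return the object itself; single brackets return a shorter list. Giving the elements names also allows access with $. The sketch below shows these access styles on a list like the one above; the element names numbers, labels, and flags are hypothetical additions for illustration.

```r
d <- list(numbers = 1:5,
          labels  = c("xx", "yy", "zz"),
          flags   = c(TRUE, TRUE, FALSE, TRUE, FALSE))

second      <- d[[2]]     # the character vector itself
also_second <- d$labels   # the same element, accessed by name
sub_list    <- d[2]       # a list of length 1 that holds the vector

# Apply length() to every element and collect the results
lengths_by_name <- sapply(d, length)

print(second)
print(class(sub_list))
print(lengths_by_name)
```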
Factors in R are known as categorical variables. They are useful in that they have a limited number of different values, often used for categorizing data. For example, consider different categories of wrestlers based on their weight:
- Heavyweights are 190 pounds or heavier.
- Middleweights range from 165 to 189 pounds.
- Lightweights are those under 165 pounds.
Listing 9 shows this information in a data frame.
Listing 9. A data frame of wrestlers
> name <- c("Joe", "Mark", "Tom")
> weight <- c(135, 176, 169)
> wrestlers <- data.frame(name, weight)
> wrestlers
  name weight
1  Joe    135
2 Mark    176
3  Tom    169
Now that I have my table of wrestlers in a data frame, I'll create categories for the class in which these guys wrestle. Notice that to properly "cut" the categories, there must always be one more break than there are labels.
Listing 10. Factors or categorical variables
> labels=c('Lightweight', 'Middleweight', 'Heavyweight')
> breaks=c(1, 165, 190, 500)
> class = cut(wrestlers$weight, breaks=breaks, labels=labels, right=FALSE)
> table(class)
class
 Lightweight Middleweight  Heavyweight
           1            2            0
Of my three wrestlers, one is a lightweight and two are middleweights. There are no heavyweights.
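One detail worth checking when you cut data into categories is which side of each interval is closed. By default, cut() treats intervals as closed on the right, so a weight of exactly 165 would fall into the lower class; passing right = FALSE closes them on the left, which matches the "190 pounds or heavier" wording of the weight classes. The sketch below compares the two settings. The weights are hypothetical and deliberately sit on the class boundaries, and the names class_breaks and class_labels are mine.

```r
# Hypothetical weights, chosen to sit on or next to the class boundaries
weights      <- c(135, 164, 165, 189, 190, 240)
class_breaks <- c(1, 165, 190, 500)
class_labels <- c("Lightweight", "Middleweight", "Heavyweight")

# Intervals closed on the left: [1,165), [165,190), [190,500)
left_closed <- cut(weights, breaks = class_breaks, labels = class_labels,
                   right = FALSE)

# Default behavior, intervals closed on the right: (1,165], (165,190], (190,500]
right_closed <- cut(weights, breaks = class_breaks, labels = class_labels)

print(data.frame(weights, left_closed, right_closed))
```

The two columns differ only for the weights of exactly 165 and 190 pounds.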
It is often necessary to make on-the-fly transformations of your data to get a better picture of what's happening. Sometimes, such transformations can make the data much more uniform and less skewed. For example, consider the growth of the Internet over the six-year period from 1996 to 2002. The number of registered domains in each year indicates explosive growth during this period. (See Resources for the source of this data.)
> years <- c(1996, 1998, 2000, 2002)
> domains <- c(1560000, 3900000, 15600000, 32500000)
> plot(years, domains)
Figure 1 shows registered Internet domains by year.
Figure 1. Registered domains by year
As you can see, the graph that this command produced shows a strong curvature upward, indicating approximately exponential growth during this period. If you suspect exponential growth, you would expect the logarithm of the data, plotted against the year, to fall close to a straight line. Making this transformation on the domain data confirms this:
> logD <- log(domains)
> plot(years, logD)
Figure 2 shows registered domains logarithmically.
Figure 2. Registered domains by year (logarithmic)
It might also prove useful to examine histograms of the domain data and their logarithms:
> hist(domains)
> hist(logD)
Figure 3 shows registered domains as a histogram.
Figure 3. Registered domains by year (histogram)
Figure 4 shows the registered domain histogram.
Figure 4. Registered domain histogram (logarithmic)
Notice that the logarithmic data is much less skewed. Such transformations can tease out details you may not see at first glance and assist in interpreting the results. Because R makes such transformations easy, why not leverage this power as much as you can?
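If you want a number to go with the picture, one option is to fit a straight line to the logarithm of the data with lm(). This is a sketch of my own rather than part of the original analysis: the slope of the fitted line estimates the growth rate per year, log(2) divided by that slope estimates the doubling time, and the R-squared value indicates how close to a straight line the transformed data is. With only four data points, treat the results as rough.

```r
years   <- c(1996, 1998, 2000, 2002)
domains <- c(1560000, 3900000, 15600000, 32500000)

# Linear model of log(domains) as a function of the year
fit <- lm(log(domains) ~ years)

growth_rate   <- coef(fit)[["years"]]    # slope: change in log(domains) per year
doubling_time <- log(2) / growth_rate    # years needed for the count to double
r_squared     <- summary(fit)$r.squared  # closeness of the fit to a straight line

print(c(growth_rate = growth_rate,
        doubling_time = doubling_time,
        r_squared = r_squared))
```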
A simple R project
This simple R project assumes you are using a computer running Debian Linux®. Installing R on other operating systems is similarly easy. The goal of the project is to calculate basic statistics for an input dataset found on the Internet after a Google search.
Installing R is easy. Issuing the standard Debian apt-get command downloads and installs the package:
$ sudo apt-get install r-base
Password: _
.
(Installation messages)
.
After installation, you can start R using the Tk GUI like this:
$ R -g Tk &
Finding a data source
The world is awash in data on many different topics, so finding a data source you can play with to learn R isn't difficult. For the sake of this article, a timely topic might be "Government spending on healthcare." By typing that exact phrase into Google, I was instantly presented with more than 99 million links. I chose to click the first link that seemed to be from the U.S. government and immediately saw a table with a download link.
Importing the data
I clicked the download link on the "U.S. Health Care Spending for 2013" page and chose to download the comma-separated values file usgs_2013.csv to my workstation. As always seems to be the case with data files, a little editing was called for: I had to delete a few rows above and below the table so only the column headings and the data appeared. After that, I started R with the command given above and entered the following command to read the table data:
> hc <- read.csv(file="usgs_2013.csv", stringsAsFactors=FALSE)
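Setting stringsAsFactors=FALSE keeps read.csv() from converting character columns into factors. (In R 4.0.0 and later, this is already the default, so the argument is harmless but no longer required.)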
Analyzing the data
Suppose that the only thing I want from this data is the mean and sum of all of the numbers in the Fed column. I entered the following command, and it failed:
> sum(as.numeric(hc$Fed))
[1] NA
Warning message:
NAs introduced by coercion
The problem was that some of the values had a comma in them, which prevented the proper conversion of the numeric values. (Those with commas were "coerced" into NAs by the R conversion functions.)
Listing 11 shows how the commas were removed to fix the problem.
Listing 11. Fixing errors in the data
> hc$Fed <- as.numeric(sub("\\,", "", hc$Fed))
> as.numeric(hc$Fed)
 [1]   874.3   882.2   510.5     0.0     4.6    33.1     0.0   333.9    98.0
[10]   856.5   422.4    37.0    94.5    55.9   141.4   222.8     0.0  3685.0
[19]   972.9 17249.3
Then it was smooth sailing to get the values I wanted.
Listing 12. Displaying the mean and sum
> mean(as.numeric(hc$Fed))
[1] 1323.715
> sum(as.numeric(hc$Fed))
[1] 26474.3
Note: These values don't have any useful meaning other than for demonstration purposes.
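One caveat about the cleanup step: sub() replaces only the first match in each string, which is enough when no value has more than one comma. For values in the millions, gsub() is the safer choice because it replaces every match. The following sketch shows the whole import-and-clean sequence on a tiny, made-up table; the data, the column names, and the variable hc_demo are hypothetical, and read.csv() reads from a text string here so the example needs no file on disk.

```r
# Hypothetical CSV content standing in for a downloaded file; the values are made up
csv_text <- 'Item,Fed
"Program A","1,234,567.5"
"Program B","874.3"
"Program C","17,249.3"'

hc_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# gsub() removes every comma; fixed = TRUE treats "," as a literal string
hc_demo$Fed <- as.numeric(gsub(",", "", hc_demo$Fed, fixed = TRUE))

print(hc_demo$Fed)
print(mean(hc_demo$Fed))
print(sum(hc_demo$Fed))
```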
The world is awash in data
As each of the following paragraphs and links will show, R is being used to process data in useful ways throughout the world. The creativity of people is amazing when they have access to the right tools for the job.
In 2009, the American Statistical Association held its biennial Data Exposition event. This event included various competitions, one of which was on the topic "Airline on-time performance." From the association's website:
"The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed."
That's called big data.
Although the SAS Institute produced the winning poster, teams from Iowa State University and Yale University used R to "tease some fascinating facets out of the data," as well.
Although we have reached the stage where our science can read the book of life, also known as DNA, making sense of it is still vast, mostly uncharted territory. Our greatest tools are being brought to bear on the problem of trying to tease out the meaning of the various subsequences of DNA's bases, and R is a powerful tool in this battle.
Avril Coghlan wrote three free booklets on using R in bioinformatics. In them, she described how R is being used for multivariate analysis, time series analysis, and biomedical statistics (see Resources).
Finance and economics
R is being employed by various economists and financial planners to study the ebb and flow of financial markets. A series of articles in The Economist at Large is dedicated to illustrating how R is being put to use in analyzing the performance of various financial instruments with an eye toward more efficient portfolio management.
Investing is part art, part science, and statistical analysis is at the heart of most of the science. It should come as no surprise that the R programming language can be found in this important realm.
Google Maps is one of the most powerful tools available for presenting data that has a mapping component to it. Examples of such data abound: the density of doctors per 10,000 citizens, the average home value for a given region, and so on. What truly makes Google Maps so powerful is the application programming interface (API) that lets programmers merge this tool's capabilities with their own databases.
Thanks to the efforts of various individuals in the R community, a bridge between the Google Maps API and R has been built and is being actively used and improved on (see Resources).
As has already been demonstrated, the government is a great source for data of all kinds. A simple Google search with "government data on" reveals many links on a huge variety of topics. It is necessary for those who lead a nation to know what they are doing, and the only way to accomplish that is to collect and analyze data and then make informed, data-driven decisions. Because all of this activity is paid for by taxpayers, most of the data is made available to download and use (see Resources).
Oceanographers like Dan Kelley have access to many databases of oceanographic data. He has written OCE, a statistical analysis package for R that enables complex analysis of many of these databases. The package includes tools to calculate seawater salinity from temperature and density, read oceanographic data files, calculate geodesic distances on earth, plot a coastline, and more. It is a comprehensive and powerful package (see Resources).
Every six hours, the U.S. National Centers for Environmental Prediction (NCEP) forecasts the weather for the entire world. The data that NCEP uses is available for download by anyone who wants it, and Joe Wheatley of the R community is using that data to do his own studies (see Resources).
As Dr. Charles Eppes observed, we all use math every day. Most of the products and services we use wouldn't exist were it not for mathematics. Furthermore, we could not apply mathematics as efficiently or effectively as we do without digital computers and readily available software like R. It is my sincere desire to see many more people take an active interest in using this powerful tool for their businesses.
Resources
- Check out The R Journal.
- Learn more about the history of S-Plus.
- Find the data and uses of R I mentioned on:
- Check out Avril Coghlan's free books on R for bioinformatics.
- Find more information about the use of R with Google Maps, as well as the Google Maps API for R.
- If you're asking yourself Do I need to learn R? check out this description of R, a flexible programming language.
- Learn about Optimization in R.
- Learn more about the NCEP Global Forecast System.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Visit the R project site for more information about this powerful statistical language.
- Check out Dan Kelley's OCE package for R.
- Download a trial version of InfoSphere Streams, a high-performance analytics platform that includes the R-project toolkit.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Use InfoSphere Streams on IBM SmartCloud Enterprise.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Check out the developerWorks blogs and get involved in the developerWorks community.
- Check out IBM big data and analytics on Facebook.