Server clinic: R handy for crunching data

Good choices for sophisticated statistical processing

R is sophisticated open-source software for managing statistical calculations. It's easy enough to use that it can benefit you even if you need only a fraction of its capabilities.


Cameron Laird (claird@phaseit.net), Vice president, Phaseit, Inc.

Cameron is a full-time consultant for Phaseit, Inc. He writes and speaks frequently on open source and other technical topics.



30 July 2003


Want statistics? Get R

developerWorks has published several recent articles on the expanding role of open source software in scientific and engineering work (please see Resources for links to those). A recurring point the scientists made in interviews for those articles is that they're assessing open source applications that are worthy competitors to their commercial counterparts in the dimensions that matter: the programs have nearly all the capabilities of proprietary products, and occasionally more.

R is just such a program. And although it came up often during that earlier cycle of profiles, I found then that I had to exclude it from those stories, simply to limit the articles to manageable sizes. Several researchers have since emphasized to me that, while R might be statistical rather than scientific in some pedantic sense, it's so important that it deserves prompt attention. Let's take a look, then, at R and related software, with a view to discussing what R means for the server-side developers and administrators who read this column.

R is an industrial-strength, general-purpose, statistics-oriented open source language based on more than two decades of production experience.


R re-implements S

A Bell Labs team began developing a research project called "S" back in the mid-'70s. Eventually, the project became a full-blown, general-purpose computing language, with rich statistical capabilities. This was a revelation at the time; in the early '80s, most of the engineers and experimentalists working in this area assumed that serious computer programs required painstaking low-level craftsmanship and expertise. S demonstrated "generic" solutions of high quality. Project leader Dr. John Chambers received the ACM Software System Award in 1999, in recognition that, among other achievements, "S has forever altered the way people analyze, visualize, and manipulate data." Among S's many strengths, it "plays nicely" with modules written in such other languages as Fortran and C.

Insightful Corporation sells a commercially successful, widely respected descendant of S it calls S-PLUS. In the early 1990s, Robert Gentleman and Ross Ihaka of the University of Auckland began work on R, which they released as free software, and which evolved (according to Ihaka) to resemble S quite closely. R's implementation, though, along with a few of its interfaces, is entirely different from S and S-PLUS: the core development team, which Chambers joined in 1997, wrote R largely in C, and drew much of the language's design from Scheme. As statistical computing specialist Patrick Burns writes, "at the present time, neither S-PLUS nor R dominates the other ... Some things are better and/or faster in S-PLUS, others are better and/or faster in R."


What's the payoff?

Why does this matter to you? Start by looking at it from the perspective of Lucent Technologies and other major enterprises that use S internally. Lucent, the present-day name of Bell Labs' parent, advertises that it has saved millions of dollars by acting on the conclusions of S-based data analysis of cellular telephone fraud patterns, circuit-manufacturing quality measurements, and much more. S made results possible that Lucent simply wasn't going to achieve any other way. These early successes were so impressive that they crucially influenced the development of what we now call "data mining."

Cymer, Inc., "the world's leading supplier of excimer light sources," headquartered in San Diego, has publicized more recent successes in the same vein. Cymer's customers are semiconductor manufacturers at over a hundred sites around the world. Such chip factories are very large, highly automated, and extremely expensive. Variations in quality, let alone outright failures, can waste millions of dollars in days. Cymer uses S-PLUS-based "analytics" to report in real time on product performance. This has accelerated decision-making cycles "from weeks to minutes."

How does statistical analysis produce such miraculous results? It augments human judgment, and replaces it for tasks we characteristically perform poorly. People are good at detecting patterns -- sometimes too good, though, and only at certain kinds of patterns. Good statistical analysis can pick up tiny systematic signals in, for example, chip production, telephone fraud, or consumer buying patterns long before decision-makers would notice a trend. Early detection gives an organization far greater opportunity to correct or tune a process before it escalates to the status of "problem."
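
To make that concrete, here is a minimal R sketch -- with invented numbers, not drawn from any of the cases above -- of one such detection: a shift in a process mean much smaller than the measurement noise, invisible on a chart but flagged by a routine two-sample t-test.

# Invented data: a 0.4-unit drift buried in noise with
# a standard deviation of 2 units
set.seed(42)
baseline <- rnorm(500, mean = 100.0, sd = 2)  # last month's measurements
current  <- rnorm(500, mean = 100.4, sd = 2)  # this month's measurements

# Welch two-sample t-test; a small p-value flags the shift
# long before it would stand out on a plot
t.test(baseline, current)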

The workplaces of Server clinic's readers present plenty of such instances. Many of us drown in logs of Web traffic or other network processes. It's natural to ask questions like, "How much bandwidth do I need to handle 98 percent of requests within one second?" or "What's the best time of day to send out offsite backups so they don't interfere with customer traffic?" Those questions remain idle, though, without either careful data reduction by a skilled statistician, or the kinds of analysis and reporting that S makes a snap.
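
As a sketch of how little code the first question demands, suppose a hypothetical log file, response_times.log, holds one response time in seconds per request; the file name and format here are assumptions for illustration.

# Read one response time (in seconds) per line
latency <- scan("response_times.log")

# The response time that 98 percent of requests beat
quantile(latency, probs = 0.98)

# TRUE if 98 percent of requests already finish within a second
quantile(latency, probs = 0.98) < 1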

So put that power to work for your reporting and analysis needs. Most computer users with a requirement for statistical results turn to desktop-oriented "packages," starting with popular spreadsheet and related "office automation" products, and many of these are quite effective in their roles. As a general-purpose computing language, though, S has far more ability to abstract and scale up to handle large, server-side problems. If you want to do (soft) real-time statistical calculations based on incoming data from e-commerce, Web services, network loads, sensor readings, or other common challenges, you'll likely benefit from that abstractive power, as well as from S's openness to other languages. If your experience is at all typical, you'll discover surprises that you wouldn't have learned any other way: for example, that one specific warehouse has anomalous "losses," that customers enter incorrect data in a characteristic pattern, or that you don't need a new server, you just need to cache a couple of key calculations.
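
The warehouse surprise, for instance, is the kind of thing a robust outlier screen turns up in a line or two of R. The loss rates below are invented for illustration; median and mad (median absolute deviation) are standard R functions.

# Invented loss rates for five warehouses
losses <- c(w1 = 0.011, w2 = 0.009, w3 = 0.012, w4 = 0.031, w5 = 0.010)

# Flag any warehouse more than three median absolute deviations
# from the median loss rate; here only w4 stands out
abs(losses - median(losses)) / mad(losses) > 3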

Perhaps you can imagine that S is good for statistics, but you believe that that's a specialized pursuit whose only application for you is a year-end report or occasional budget request. This is simply false. All of us make statistical inferences constantly: judgments about the likelihood of traffic or weather problems, the need for more hardware, the chance of five simultaneous customer crises on a Friday afternoon, and so on. For most of our lives, we get by on the implicit models our intuition provides. Research has shown how faulty these are, though, and we're beginning to understand their costs. The alternative S provides is the chance to make our models explicit, and calculate real costs and benefits.

S itself is easy enough to learn and use that its payoff is practical, not merely "theoretical." It might help to think of it as a technology like "writing to-do lists" or "budgeting." There's absolutely no guarantee that a model expressed in S is accurate, just as plenty of spreadsheeted profit-and-loss projections have been obvious works of fiction. The first thing all these exercises do, though, is shine light on our assumptions and their consequences. That's the only sure way to improve them.


Technologies for high-level statistical work

The R re-implementation of S owes much of its design to the Scheme programming language. Among other benefits, this gives R lexical scoping. R's standard Frequently Asked Questions document (see Resources for a link) provides interesting examples of how (lexical) closures allow more natural expression of common function definitions. The S-PLUS documentation suggests that the way to compute the density function of the rth order statistic from a sample of size n is:

Listing 1. Density of the rth order statistic, in S
  dorder <- function(n, r, pfun, dfun) {
      # Start with a function of x; its body gets filled in below
      f <- function(x) NULL
      # The constant n! / ((r-1)! (n-r)!)
      con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
      # Unevaluated calls pfun(x) and dfun(x)
      PF <- call(substitute(pfun), as.name("x"))
      DF <- call(substitute(dfun), as.name("x"))
      # Splice con * PF^(r-1) * (1 - PF)^(n-r) * DF in as the body of f
      f[[length(f)]] <-
           call("*", con, call("*", call("^", PF, r - 1),
               call("*", call("^", call("-", 1, PF), n - r), DF)))
      f
  }

Lexical scoping allows R to express the same thing as:

Listing 2. Density of the rth order statistic, in R
  dorder <- function(n, r, pfun, dfun) {
      con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
      # The returned closure captures con, r, n, pfun, and dfun lexically
      function(x) {
          con * pfun(x)^(r - 1) * (1 - pfun(x))^(n - r) * dfun(x)
      }
  }
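
To see the closure in action, here's a quick use of the R version -- an assumed interactive session, with the expected value noted in a comment: the density of the sample median (r = 3) of a standard normal sample of size n = 5, evaluated at zero.

# Build the density of the median of a normal sample of size 5
dmedian <- dorder(5, 3, pnorm, dnorm)
dmedian(0)   # roughly 0.75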

As a development environment, R's greatest advantage might be the Comprehensive R Archive Network (CRAN), a package archive analogous to Perl's CPAN. CRAN (again, see Resources for a link) makes reusing code written by others far easier than it would otherwise be.
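
Installing a contributed package from CRAN takes two lines at the R prompt; "boot" below is just one example of a real CRAN package, chosen arbitrarily.

# Download and install a contributed package from a CRAN mirror,
# then attach it to the current session
install.packages("boot")
library(boot)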

While S is standard in professional statistical circles, there certainly are alternative powerful approaches. Matlab, SAS, and other commercial scientific packages mentioned in the earlier "open science" series (see Resources) boast statistical libraries of generally satisfactory quality.

Yet another solution is to work within a general-purpose computing language, and add on statistical capabilities. Good libraries are available for C, Fortran, Java, and other languages. As with most problems, Server clinic favors higher-level languages. Perl, Python, and Yorick enjoy particularly polished libraries, including PDL, Numeric, Scientific Python, stats.py, and SalStat. Several of these also build in visualization and documentation capabilities to produce quick and satisfying graphical results.

SalStat author Alan James Salmoni offers this example of how straightforward it is to use his package in a natural object-oriented Python idiom:

Listing 3. Simple SalStat example
import salstat_stats
a = [2,3,4,3,4,5] # first data set
b = [6,7,8,7,8,4] # second data set
x = salstat_stats.TwoSampleTests(a,b)
x.TTestUnpaired()
print x.df # prints out the degrees of freedom
print x.t  # t statistic
print x.prob # probability from the t statistic and the df

Code this lucid encourages statistical and software experts to collaborate fruitfully on real statistical problems, rather than dissipating their energies on the constraints of poorly designed packages.


Conclusion

My thanks to all the correspondents who've written me to share their experiences with R and related projects. An overwhelming excess of software claims to address the problems that appear in the Server clinic. The only hope I have of keeping up with the best solutions is to make good use of help from readers.

It's not enough, of course, just to know that solutions exist, or even to know their names. You also need to learn how to use them. One final tip for this installment of the column is to take a look at Computer Science & Perl Programming (CS&PP). This seven-hundred-page collection of seventy of the best articles published in The Perl Journal, updated for book publication, appeared on bookshelves at the end of 2002. As you'd expect, CS&PP is filled with good server-side programming examples, as well as principles that apply beyond the confines of Perl development. Moreover, specific pieces cover topics this column has broached over the past year, including:

  • Automation of Microsoft Office and data manager components
  • Scientific software, including bioinformatics
  • Security
  • E-mail management

While there are only a few brief examples in CS&PP of the sort of high-powered statistical processing at which R specializes, there's a higher-level connection between the book and the language. One way to think about R is that it's an efficient, flexible tool for managing, summarizing, and exploiting large-scale data. The same mentality motivates Perl: succinct, powerful expressions which can economically extract results from masses of observations. Both make it practical to explore new ideas, to play with them to see where they lead.

Resources

  • Check out the previous chapters of Server clinic.
  • The R home page is maintained as a GNU project.
  • The Comprehensive R Archive Network -- or CRAN -- is a remarkably successful archiving and retrieval site.
  • The best way to get a sense of the value R provides is to read R News, the project's newsletter. It's fascinating, and well-edited besides. There, you can catch up with statistical methods for predicting gas consumption in the United Kingdom, strategies for coping with genomic data, and the how and why of all R's other capabilities.
  • Kurt Hornik's FAQ on R answers questions on, among other topics, the relation of R to S.
  • John Chambers' professional home page points to numerous publications on S and more.
  • Patrick Burns offers an "Intro to the S language" and two other tutorials from his frames-based site.
  • Insightful Corporation sells S-PLUS, among other products.
  • StatLib -- Software and extensions for the S (Splus) language is a library maintained at Carnegie Mellon University.
  • The creator of S, John Chambers, has posted interesting information about the S System and its evolution.
  • The description of S from the Dictionary of Programming Languages is interesting and useful to those with no prior familiarity with the language.
  • The column mentioned R's ability to use code written in other languages. One example is the R-Tcl/Tk package, described in this issue of R News, along with many other riches.
  • stats.py is a Python module written by Gary Strangman.
  • Scientific Python has introductory statistical capability.
  • SalStat adapts and extends stats.py. Like the latter, SalStat is written in and best used from Python.
  • The RPy module embeds R within a Python interpreter. It's quite polished, and savvy enough, for example, to pass NumPy arrays to and from R.
  • "Perl Data Language gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing," according to its home page.
  • Computer Science & Perl Programming is the first of three massive volumes on Perl techniques O'Reilly and Associates published this year.
  • cephes implements low-level forms of many mathematical functions in portable C.

Although R didn't appear in the published form of this series, many of the experimentalists working with these programs mentioned their reliance on and affection for R.
