Want statistics? Get R
developerWorks has published several recent articles on the expanding role of open source software in scientific and engineering work (see Resources for links to those). One recurring point the scientists made in interviews for those articles is that the open source applications whose adoption they're assessing are worthy competitors to their commercial counterparts in the dimensions that matter: the programs have nearly all the capabilities of proprietary products, and occasionally more.
R is just such a program. And although it came up often during that earlier cycle of profiles, I found then that I had to exclude it from those stories, simply to limit the articles to manageable sizes. Several researchers have since emphasized to me that, while R might be statistical rather than scientific in some pedantic sense, it's so important that it deserves prompt attention. Let's take a look, then, at R and related software, with a view to discussing what R means for the server-side developers and administrators who read this column.
R is an industrial-strength, general-purpose, statistics-oriented open source language based on more than two decades of production experience.
R reimplements S
A Bell Labs team began developing a research project called "S" back in the mid-'70s. Eventually, the project became a full-blown, general-purpose computing language with rich statistical capabilities. This was a revelation at the time; in the early '80s, most of the engineers and experimentalists working in this area assumed that serious computer programs required painstaking low-level craftsmanship and expertise. S demonstrated "generic" solutions of high quality. Project leader Dr. John Chambers received the ACM Software System Award in 1999, in recognition that, among other achievements, "S has forever altered the way people analyze, visualize, and manipulate data." Among S's many strengths, it "plays nicely" with modules written in such other languages as Fortran and C.
Insightful Corporation sells a commercially successful, widely respected descendant of S called S-PLUS. In the early 1990s, Robert Gentleman and Ross Ihaka of the University of Auckland began work on R, which they released as free software, and which evolved (according to Ihaka) to resemble S quite closely. R's implementation, though, along with a few of its interfaces, is entirely different from that of S and S-PLUS; much of its design derives from Scheme. Chambers himself joined the R core development team in 1997. As statistical computing specialist Patrick Burns writes, "at the present time, neither S-PLUS nor R dominates the other ... Some things are better and/or faster in S-PLUS, others are better and/or faster in R."
What's the payoff?
Why does this matter to you? Start by looking at it from the perspective of Lucent Technologies and other major enterprises that use S internally. Lucent, the present-day name of Bell Labs' parent, advertises that it has saved millions of dollars by acting on the conclusions of S-based data analysis of cellular telephone fraud patterns, circuit-manufacture quality measurements, and much more. S made results possible that Lucent simply wasn't going to achieve any other way. These early successes were so impressive that they crucially influenced development of what we now call "data mining."
Cymer, Inc., "the world's leading supplier of excimer light sources," headquartered in San Diego, has publicized more recent successes in the same vein. Cymer's customers are semiconductor manufacturers at over a hundred sites around the world. Such chip factories are very large, highly automated, and extremely expensive. Variations in quality, let alone failures, can waste millions of dollars within days. Cymer uses S-PLUS-based "analytics" to report in real time on product performance. This has accelerated decision-making cycles "from weeks to minutes."
How does statistical analysis produce such miraculous results? It augments human judgment, and replaces it for tasks we characteristically perform poorly. People are good at detecting patterns (sometimes too good), but only certain kinds of patterns. Good statistical analysis can pick up tiny systematic signals in, for example, chip production, telephone fraud, or consumer buying patterns, long before decision-makers would notice a trend. Early detection gives an organization far greater opportunity to correct or tune a process before it escalates into a "problem."
The workplaces of Server clinic's readers present plenty of such instances. Many of us drown in logs of Web traffic or other network processes. It's natural to ask questions like, "How much bandwidth do I need to handle 98 percent of requests within one second?" or "What's the best time of day to send out off-site backups so they don't interfere with customer traffic?" Those questions remain idle, though, without either careful data reduction by a skilled statistician or the kinds of analysis and reporting that S makes a snap.
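Questions like the bandwidth one above boil down to a percentile calculation over logged response times. As a minimal illustration (in Python rather than S, with invented latency figures standing in for a real log):

```python
# Hypothetical example: what response time covers 98 percent of requests?
# The numbers stand in for per-request latencies (in seconds) parsed from
# a Web server log.
def percentile(samples, fraction):
    """Return the value at or below which `fraction` of samples fall."""
    ordered = sorted(samples)
    # Index of the last sample inside the requested fraction.
    index = max(0, int(round(fraction * len(ordered))) - 1)
    return ordered[index]

latencies = [0.12, 0.30, 0.45, 0.08, 1.90, 0.22, 0.75, 0.41, 0.33, 0.27]
print("98th percentile latency: %.2f s" % percentile(latencies, 0.98))
```

A real deployment would parse the latencies out of the log itself, and a language like S makes it just as easy to plot the whole distribution rather than a single summary number.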
So put that power to work for your reporting and analysis needs. Most computer users who need statistical results turn to desktop-oriented "packages," starting with popular spreadsheet and related "office automation" products, and many of these are quite effective in their roles. As a general-purpose computing language, though, S has far more ability to abstract and scale up to handle large, server-side problems. If you want to do (soft) real-time statistical calculations based on incoming e-commerce data, Web service traffic, network loads, sensor readings, or other common challenges, you likely will benefit from that abstractive power, as well as from S's openness to other languages. If your experience is at all typical, you'll discover surprises that you wouldn't have learned any other way: for example, that one specific warehouse has anomalous "losses," that customers enter incorrect data in a characteristic pattern, or that you don't need a new server, you just need to cache a couple of key calculations.
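To make the "(soft) real-time" idea concrete, one standard technique is Welford's online algorithm, which maintains a running mean and variance in constant memory so that each incoming reading updates the summary without storing the stream. A sketch in Python (the sample readings are invented):

```python
# Welford's online algorithm: running mean and variance in O(1) per update,
# suitable for summarizing a stream of sensor readings or request rates.
class RunningStats(object):
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def variance(self):
        # Sample variance; defined only once there are two or more readings.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

stats = RunningStats()
for reading in [10.0, 12.0, 9.0, 11.0, 13.0]:
    stats.update(reading)
print("mean=%.1f variance=%.1f" % (stats.mean, stats.variance()))
```

An alert that fires when a new reading falls several standard deviations from the running mean is exactly the kind of "tiny systematic signal" detector described above.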
Perhaps you can imagine that S is good for statistics, but believe that's a specialized pursuit whose only application for you is a year-end report or an occasional budget request. This is simply false. All of us make statistical inferences constantly: judgments about the likelihood of traffic or weather problems, the need for more hardware, the chance of five simultaneous customer crises on a Friday afternoon, and so on. For most of our lives, we get by on the implicit models our intuition provides. Research has shown how faulty these are, though, and we're beginning to understand their costs. The alternative S provides is the chance to make our models explicit, and to calculate real costs and benefits.
S itself is easy enough to learn and use that its payoff is practical, not merely "theoretical." It might help to think of it as a technology like "writing to-do lists" or "budgeting." There's absolutely no guarantee that a model expressed in S is accurate, just as plenty of spreadsheeted profit-and-loss projections have been obvious works of fiction. The first thing all these exercises do, though, is shine light on our assumptions and their consequences. That's the only sure way to improve them.
Technologies for high-level statistical work
The R reimplementation of S owes much of its design to the Scheme programming language. Among other benefits, this gives R lexical scoping. R's standard Frequently Asked Questions document (see Resources for a link) provides interesting examples of how (lexical) closures allow more natural expression of common function definitions. The S-PLUS documentation suggests that the way to compute the density function of the rth-order statistic from a sample of size n is:
Listing 1. Density of rth-order statistic, in S
dorder <- function(n, r, pfun, dfun) {
    f <- function(x) NULL
    con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
    PF <- call(substitute(pfun), as.name("x"))
    DF <- call(substitute(dfun), as.name("x"))
    f[[length(f)]] <- call("*", con,
                           call("*", call("^", PF, r - 1),
                                call("*", call("^", call("-", 1, PF), n - r),
                                     DF)))
    f
}
Lexical scoping allows R to express the same thing as:
Listing 2. Density of rth-order statistic, in R
dorder <- function(n, r, pfun, dfun) {
    con <- round(exp(lgamma(n + 1) - lgamma(r) - lgamma(n - r + 1)))
    function(x) {
        con * pfun(x)^(r - 1) * (1 - pfun(x))^(n - r) * dfun(x)
    }
}
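The closure idiom in Listing 2 is not unique to R; any lexically scoped language can express it. For comparison, here is a rough Python analogue (a sketch for illustration, not code from either S or R):

```python
import math

# Rough Python analogue of Listing 2: dorder returns a closure over the
# precomputed constant `con`, just as the R version does.
def dorder(n, r, pfun, dfun):
    con = round(math.exp(math.lgamma(n + 1) - math.lgamma(r)
                         - math.lgamma(n - r + 1)))
    def density(x):
        return con * pfun(x) ** (r - 1) * (1 - pfun(x)) ** (n - r) * dfun(x)
    return density

# Density of the minimum (r = 1) of n = 3 uniform(0, 1) draws: 3 * (1 - x)^2.
f = dorder(3, 1, lambda x: x, lambda x: 1.0)
print(f(0.0))  # prints 3.0
```

The point in both languages is the same: the returned function carries its environment (here, `con`) with it, so no call-building machinery is needed.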
As a development environment, R's greatest advantage might be the Comprehensive R Archive Network (CRAN), a package archive homologous to Perl's CPAN. CRAN (again, see Resources for a link) makes the use of code written by others far, far easier than it would otherwise be.
While S is standard in professional statistical circles, there certainly are alternative powerful approaches. Matlab, SAS, and other commercial scientific packages mentioned in the earlier "open science" series (see Resources) boast statistical libraries of generally satisfactory quality.
Yet another solution is to work within a general-purpose computing language and add on statistical capabilities. Good libraries are available for C, Fortran, Java, and other languages. As with most problems, Server clinic favors higher-level languages. Perl, Python, and Yorick enjoy particularly polished libraries, including PDL, Numeric, Scientific Python, stats.py, and SalStat. Several of these also build in visualization and documentation capabilities to produce quick and satisfying graphical results.
SalStat author Alan James Salmoni offers this example of how straightforward it is to use his package in a natural objectoriented Python idiom:
Listing 3. Simple SalStat example
import salstat_stats

a = [2, 3, 4, 3, 4, 5]    # first data set
b = [6, 7, 8, 7, 8, 4]    # second data set
x = salstat_stats.TwoSampleTests(a, b)
x.TTestUnpaired()
print x.df      # prints out the degrees of freedom
print x.t       # t statistic
print x.prob    # probability from the t statistic and the df
Code this lucid encourages statistical and software experts to collaborate fruitfully on real statistical problems, rather than dissipating their energies on the constraints of poorly designed packages.
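For the curious, the conventional pooled-variance unpaired t statistic that a call like TTestUnpaired reports can be computed by hand in a few lines. (That SalStat pools the variances this way, rather than using the Welch variant, is an assumption made here for illustration.)

```python
import math

# Pooled-variance unpaired t statistic, computed from first principles.
def t_unpaired(a, b):
    na, nb = len(a), len(b)
    mean_a = sum(a) / float(na)
    mean_b = sum(b) / float(nb)
    ss_a = sum((x - mean_a) ** 2 for x in a)  # sum of squares, sample a
    ss_b = sum((x - mean_b) ** 2 for x in b)  # sum of squares, sample b
    df = na + nb - 2                          # degrees of freedom
    pooled = (ss_a + ss_b) / df               # pooled variance estimate
    t = (mean_a - mean_b) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, df

# The same two data sets as Listing 3.
t, df = t_unpaired([2, 3, 4, 3, 4, 5], [6, 7, 8, 7, 8, 4])
print("t=%.3f df=%d" % (t, df))
```

Converting the t statistic and df to a probability requires the incomplete beta function, which is exactly the sort of tedium packages like SalStat exist to hide.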
Conclusion
My thanks to all the correspondents who've written me to share their experiences with R and related projects. An overwhelming excess of software claims to address the problems that appear in the Server clinic. The only hope I have of keeping up with the best solutions is to make good use of help from readers.
It's not enough, of course, just to know that solutions exist, or even to know their names. You also need to learn how to use them. One final tip for this installment of the column is to take a look at Computer Science & Perl Programming (CS&PP). This seven-hundred-page collection of seventy of the best articles published in The Perl Journal, updated for book form, appeared on bookshelves at the end of 2002. As you'd expect, CS&PP is filled with good server-side programming examples, as well as principles that apply beyond the confines of Perl development. Moreover, specific pieces cover topics this column has broached over the past year, including:
 Automation of Microsoft Office and data manager components
 Scientific software, including bioinformatics
 Security
 Email management
While CS&PP has only a few brief examples of the sort of high-powered statistical processing at which R specializes, there's a higher-level connection between the book and the language. One way to think about R is as an efficient, flexible tool for managing, summarizing, and exploiting large-scale data. The same mentality motivates Perl: succinct, powerful expressions that can economically extract results from masses of observations. Both make it practical to explore new ideas, to play with them and see where they lead.
Resources
 Check out the previous chapters of Server clinic.
 The R home page is maintained as a GNU project.
 The Comprehensive R Archive Network, or CRAN, is a remarkably successful archiving and retrieval site.
 The best way to get a sense of the value R provides is to read R News, the project's newsletter. It's fascinating, and well-edited besides. There, you can catch up with statistical methods for predicting gas consumption in the United Kingdom, strategies for coping with genomic data, and the how and why of all R's other capabilities.
 Kurt Hornik's FAQ on R answers questions on, among other topics, the relation of R to S.
 John Chambers' professional home page points to numerous publications on S and more.
 Patrick Burns offers an "Intro to the S language" and two other tutorials on his site.
 Insightful Corporation sells SPLUS, among other products.
 StatLib, a library of software and extensions for the S (S-PLUS) language, is maintained at Carnegie Mellon University.
 The creator of S, John Chambers, has posted interesting information about the S System and its evolution.
 The description of S from the Dictionary of Programming Languages is interesting and useful to those with no prior familiarity with the language.
 "Statistics in Advanced Manufacturing," "Process Improvement Through ... Analysis ..." and "Detecting Fraud in the Real World" are just a few of the titles generated as applications of Statistics Research at Bell Labs.
 Modern Applied Statistics with S-PLUS, 3rd ed., provides a practical introduction to the statistical inferences that can improve processes and profits significantly.
 A Brief History of Data Mining tangentially alludes to the work at Bell Labs and elsewhere done with S.
 "Statistical Software Engineering" describes alternatives to intuition in project management and execution.
 "Telling the Truth with Statistics" is a universitylevel class targeted at particle physicists delivered as a Webcast.
 The Wikipedia page on statistical inference defines the term in a sensible way, and points usefully to related concepts.
 It is unfortunate but true that the study of statistics is, all too often, tedious and even boring. If this column has left you feeling that you know you should be applying this material in day-to-day work, but can't bear the thought, rekindle your enthusiasm for statistics and probability by browsing the Exploring Data pages at Central Queensland University.
 "Statistical Inference" defines the mathematical structure of the topic.
 Intuition: Its Powers and Perils is David G. Myers' book on the possibilities of formalizing our decision models.
 The column mentioned R's ability to use code written in other languages. One example is the R Tcl/Tk package, described in this issue of R News, along with many other riches.
 stats.py is a Python module written by Gary Strangman.
 Scientific Python has introductory statistical capability.
 SalStat adapts and extends stats.py. Like the latter, SalStat is written in and best used from Python.
 The RPy module embeds R within a Python interpreter. It's quite polished, and savvy enough, for example, to pass NumPy arrays to and from R.
 "Perl Data Language gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing," according to its home page.
 Computer Science & Perl Programming is the first of three massive volumes on Perl techniques O'Reilly and Associates published this year.
 cephes implements lowlevel forms of many mathematical functions in portable C.
 Read these developerWorks articles by Cameron related to scientific topics:
Although R didn't appear in the published form of this series, many of the experimentalists working with these programs mentioned their reliance on and affection for R.
 Read Cameron's personal notes on open source for science.
 This IDG article on BioConductor discusses the use of R in the biosciences.
 "Yorick Plays a Role" introduces the capabilities of this highlevel scientific computing language, which emphasizes easy visualization.
 Cameron's personal notes on Scheme provide references to the voluminous online literature about the language.