The first two installments of this series looked at R in fairly "real world" usage. We explored various statistical analyses and graphing capabilities using a large series of temperature data that the coauthor of those installments had collected. As mentioned in those prior articles, what we actually explored barely scratched the surface of the deep and rich statistical libraries in R.
In this article, I want to leave aside consideration of further statistical analysis per se (in large part because I do not personally have the requisite knowledge of statistics to decide the most relevant techniques; both my earlier coauthor, Brad Huntting, and many readers know far more in these areas). As a complement to the rich statistical concepts we offered in the first two articles, I would like to provide readers with some pointers to the underlying language facilities of R. The prior installments made gestures towards the functional programming orientation of R; my hunch is that many readers will be more familiar with imperative and object-oriented languages.
As well, we have previously only looked at R in a rather ad hoc fashion. In this installment, I'll discuss creating reusable and modular components for R development.
Before getting to R's notion of object orientation, let's review and clarify R's data and functions. The main thing to keep in mind about R data is that "everything is a vector." Even objects that look superficially distinct from vectors -- matrices, arrays, data.frames, etc. -- are really just vectors with extra (mutable) attributes that tell R to treat them in special ways.
The dimension (spelled dim) is one of the most important attributes that (some) R vectors have. The functions matrix(), array(), and dim() are simply convenience functions for setting the dimensions of a vector. R's OOP system similarly boils down to what (if anything) is in an object's class attribute.
Let's review dimensioning by first looking at the code in Listing 1:
Listing 1. Creating vectors and assigning dimensions
> v = 1:1000 > typeof(v) [1] "integer" > attributes(v) NULL > dim(v) = c(10,10,10) # (Re)dimension > attributes(v) $dim [1] 10 10 10 > v2 = matrix(1:1000, nrow=100, ncol=10) > typeof(v2) [1] "integer" > attributes(v2) $dim [1] 100 10 > attr(v2,'dim') = c(10,10,10) # Redimension > attributes(v2) $dim [1] 10 10 10 |
In short, there are several syntax conveniences for attaching a dim attribute to a vector, but at heart that is all the syntax sugar does.
One thing that can be confusing about R's "everything is a vector" approach is that row-wise and column-wise operations may not adhere to your intuition. For example, it's simple enough to create a 2D array (a matrix) and even to operate on a single column or row:
Listing 2. Operating on matrix vector and row-wise
> m = matrix(1:12, nrow=3, ncol=4)
> m
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> sum(m) # sum of all elements of m
[1] 78
> sum(m[1,]) # sum of first row
[1] 22
|
But if you wanted to create a vector to sum each row, you might be inclined to try the following:
Listing 3. The wrong way to perform multiple row-wise operations
> sum(m[c(1,2,3),]) # NOT sum of each row [1] 78 |
You could construct a loop here, but that feels at odds with R's functional and vector-oriented operations. Instead, the trick is to use the function apply():
Listing 4. apply() function for row-wise operations
> apply(m, 1, sum) # by row
[1] 22 26 30
> apply(m, 2, sum) # by column
[1] 6 15 24 33
> apply(m, c(1,2), sum) # by column AND row (sum of each single cell)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
# Kinda worthless to sum each single cell, but what about this:
> a = array(1:24,c(3,4,2))
> apply(a, c(1,2), sum) # sum by depth in 3-D array
[,1] [,2] [,3] [,4]
[1,] 14 20 26 32
[2,] 16 22 28 34
[3,] 18 24 30 36
> apply(a, c(3), sum) # sum of each depth slice
[1] 78 222
|
A construct that is sometimes useful to have, for perfectly practical reasons, is an infinite sequence of numbers. For example, my coauthor of the prior installments was doing some analysis of Monte Carlo integration techniques and for his purposes, it was useful to have an infinitely long sequence of random numbers. Understand that the type of infinite sequence needed is not simply the ability to generate a new number as needed; it is also necessary to be able to refer back to a specific previously referenced element and have it refer to the same value it had before.
Obviously, no computer language nor computer can store an infinite sequence -- what they can store is a lazy and unbounded sequence. More elements can be added to a realized list when and only when, access is required. For example, in Python you might accomplish this by creating a list-like object with a custom .__getitem__() method that extends the internal list as needed. In Haskell, laziness is built deeply into the language -- in effect, everything is lazy. In my Haskell tutorial (Resources) I used the example of creating a list of all the prime numbers:
Listing 5. Haskell list of all primes with Sieve of Eratosthenes
primes :: [Int] primes = sieve [2 .. ] sieve (x:xs) = x : sieve [y | y <- xs, (y `rem` x)/=0] |
In respect to the infinite, R is closer to Python than it is to Haskell. You need to explicitly construct more list elements as needed. We need to wait for the OOP section to let vector indexing itself launch the behind-the-scenes mechanism; there is still not much scaffolding involved.
Listing 6. Define a vector and a means to dynamically expand it
inf_vector = rnorm(10, 0, 1) # arbritrarily start w/ init 10 items
assure <- function(index) {
extend_by = max(index-length(inf_vector), 0)
extras = rnorm(extend_by, 0, 1)
v <- c(inf_vector, extras)
assign("inf_vector", v, env=.GlobalEnv)
return(index)
}
getRand <- function(index) {
assure(index)
return(inf_vector[index])
}
|
Probably the preferred usage is to access values through the getRand() wrapper function. Notice that you are perfectly free to use slices or computed values, as well as single indices:
Listing 7. Using wrapper function as proxy to infinite virtual vector
> getRand(3) # Single index [1] 0.5557101 > getRand(1:5) # Range [1] -0.05472011 -0.30419695 0.55571013 0.91667175 -0.40644081 > getRand(sqrt(c(4,16))) # Computed index collection [1] -0.3041970 0.9166717 > getRand(100) # Force background vector extension [1] 0.6577079 |
If you prefer, you can simply assure() the vector is large enough before using elements:
Listing 8. Extending vector (if needed) before access
> assure(2000) [1] 2000 > inf_vector[1500] [1] 1.267652 |
R is capable of fully general object-oriented programming, but to understand this, you need to step back a bit in your thinking about what OOP is. Users of languages like Java™ and C++, or even of Python, Ruby, or Smalltalk, might have a relatively circumscribed picture of object orientation. Not wrong, but limited to one model.
R's approach to OOP is based on generic functions rather than on class hierarchies. This concept will be familiar to readers who have used Lisp's CLOS or who have read my discussion of multiple dispatch using Python (Resources). Unfortunately, R's approach is still the single-dispatch approach; in that respect, it is equivalent to the "traditional" languages I mentioned (C++, Java, etc.).
I should note -- though I will not discuss it in detail in this article -- that the recent version of R comes with a package called methods that defines and manipulates so-called "formal methods." In many ways, using these formal methods imposes much of the discipline (and limitations) you find in traditional OOP languages. In any case, formal OOP in R is built atop the "informal OOP" that I talk about in this article. The methods package is still somewhat tentative from what I can tell, but some moderately tweaked version of it seems certain to continue in later R versions. Resources will offer more background.
The thing to keep in mind in understanding OOP sui generis is that OOP is not really a matter of inheritance specifically, but of dispatch decisions more generally. That is, a call to obj.method() in a traditional OOP language will look through the method resolution order (MRO) of an object to find the "first" ancestor class of obj that has a method .method().
What "first" means is a more subtle question than is obvious at first (see Resources for a good discussion of the evolving MRO design in Python). R makes the same decision, but it turns the idea of inheritance inside out -- rather than have a bunch of classes which may define and override various methods in their bodies, R creates a family of generic functions that carry a tag describing what type of object they want to operate on.
As a simple example, let us create a generic function called whoami() and some tagged methods to dispatch to:
Listing 9. Creating a generic function and some tagged methods
#------------- Create a generic method
> whoami <- function(x, ...) UseMethod("whoami")
> whoami.foo <- function(x) print("I am a foo")
> whoami.bar <- function(x) print("I am a bar")
> whoami.default <- function(x) print("I don't know who I am")
|
The key here is that every object in R may belong to zero, one, or more classes. Specifically, the MRO of any given object (relative to a particular method) is simply the vector of named classes, if any, in its class attribute. For example:
Listing 10. Tagging objects with class memberships
> a = 1:10
> b = 2:20
> whoami(a) # No class assigned
[1] "I don't know who I am"
> attr(a,'class') <- 'foo'
> attr(b,'class') <- c('baz','bam','bar')
> whoami(a)
[1] "I am a foo"
> whoami(b) # Search MRO for defined method
[1] "I am a bar"
> attr(a,'class') <- 'bar' # Change the class of 'a'
> whoami(a)
[1] "I am a bar"
|
As with traditional inheritance languages, an object need not utilize the same class for every method it calls. Traditionally, if Child inherits from Mom and Dad an object of type Child might utilize .meth1() from Mom and .meth2() from Dad. You can do this in R, naturally, but Mom and Dad are nothing substantial, just names:
Listing 11. Dispatch resolution per method
> meth1 <- function(x) UseMethod("meth1")
> meth1.Mom <- function(x) print("Mom's meth1")
> meth1.Dad <- function(x) print("Dad's meth1")
> meth2 <- function(x) UseMethod("meth2")
> meth2.Dad <- function(x) print("Dad's meth2")
> attr(a,'class') <- c('Mom','Dad')
> meth1(a) # Even though meth1.Dad exists, Mom comes first for a
[1] "Mom's meth1"
> meth2(a)
[1] "Dad's meth2"
|
It might seem limiting to need to explicitly specify the MRO of an object rather than rely on its implicit resolution through inheritance syntax. In point of fact, you can perfectly easily implement inheritance-based MROs with a minimal wrapper functions. The MRO I use in Listing 11 is probably not the best one possible (see Simionato's essay in Resources), but it shows the idea:
Listing 12. Implementing inheritance-based MRO with minimal wrapper functions
char0 = character(0)
makeMRO <- function(classes=char0, parents=char0) {
# Create a method resolution order from an optional
# explicit list and an optional list of parents
mro <- c(classes)
for (name in parents) {
mro <- c(mro, name)
ancestors <- attr(get(name),'class')
mro <- c(mro, ancestors[ancestors != name])
}
return(mro)
}
NewInstance <- function(value=0, classes=char0, parents=char0) {
# Create a new object based on initial value,
# explicit classes and parents (all optional)
obj <- value
attr(obj,'class') <- makeMRO(classes, parents)
return(obj)
}
MaternalGrandma <- NewInstance()
PaternalGrandma <- NewInstance()
Mom <- NewInstance(classes='Mom', parents='MaternalGrandma')
Dad <- NewInstance(0, classes=c('Dad','Uncle'), 'PaternalGrandma')
Me <- NewInstance(value='Hello World', 'Me', c('Mom','Dad'))
|
In action, the code in Listing 12 looks like this:
Listing 13. An object with an inheritance-generated MRO
> print(Me) [1] "Hello World" attr(,"class") [1] "Me" "Mom" "MaternalGrandma" "Dad" [5] "Uncle" "PaternalGrandma" |
If you wish to follow a traditional approach to class/instance relationships, you want to include the name of the class you create (like Mom in its classes argument). Actually, given that each class is itself a perfectly good object, the above system is closer to prototype-based OOP systems than class-based ones.
Then again, the whole system is flexible enough to include all variations. You are perfectly free to segregate class objects from instance objects if you wish -- you might distinguish classes with a naming convention (like Mom versus mom) by attaching another attribute (like type could be class or instance; and utility functions would check that) or by other means.
Now that we have some OOP scaffolding to work with, we can actually do a better job with the infinite vector that was presented previously. The first solution is perfectly workable, but it might be nice to have an even more seamless and invisible infinite vector.
Operators in R are just shorthand ways to make function calls; you are just as free to specialize operator behavior on a class basis as you are any other function call. While we are at it, we can fix a few other weaknesses in the first system:
- We would like to be able to generate as many distinct infinite vectors as need.
- We would like to be able to configure the random distribution used.
- We would like to have the option of initializing an infinite random vector with the values in another vector.
So let's do all that:
Listing 14. Define an indexable infinite random vector
"[.infinite_random" <- function(v, index) {
name <- attr(v, 'name')
rfunc <- attr(v, 'rfunc')
extend_by = max(index-length(v), 0)
extras = rfunc(extend_by)
new <- c(v, extras)
makeInfiniteRandomVector(name, v=new, rfunc)
return(new[index])
}
unitnorm <- function(n) return(rnorm(n,0,1))
empty <- vector('numeric', 0)
makeInfiniteRandomVector <- function(name, v=empty, rfunc=unitnorm) {
# Create an infinite vector
# optionally extend existing vector, configurable rand func
attr(v,'class') <- 'infinite_random'
attr(v,'name') <- name
attr(v,'rfunc') <- rfunc
assign(name, v, env=.GlobalEnv)
}
makeInfiniteRandomVector('v')
# makeInfiniteRandomVector('inf_poisson', rfunc=my_poisson)
# Usage is just, e.g.: v[1]; v[10]; v[9:12]; etc.
|
Indexing is already defined by R as a generic function, so there is no need to call UseMethod() to set it up; you just define as many new specializations as you wish. Likewise, the built-in print() function is also generic. We could specialize that with:
Listing 15. Printing an infinite vector
print.infinite_random <- function(v) {
a_few = 5
len = length(v)
end_range = (len-a_few)+1
cat('* Infinite Random Vector *\n')
cat('[1] ', v[1:a_few], '...\n')
cat('[')
cat(end_range)
cat('] ', v[end_range:len], '...\n')
}
|
In action, the code produces the following:
Listing 16. Example of printing infinite vector
> v[1000] [1] -1.341881 > print(v) * Infinite Random Vector * [1] -0.6247392 1.308057 1.654919 1.691754 -2.251065 ... [996] 1.027440 0.8376 -0.7066545 -0.7778386 -1.341881 ... |
Programming general purpose functions, objects, and classes in R requires taking a step back from the ways of thinking that traditional procedural and object-oriented programmers are used to. The prior two installments showed you some examples of ad hoc statistical exploration and did not really require such a rethinking, but once you wish to reuse your code, it is worth understanding the concepts of generic functions and the "inside out" OOP you can write with them (the inside out form is actually more general).
The trick is to think of OOP entirely as a question of "what code gets called" and "how is the decision made." Do not get hung up on the specific syntax familiar languages -- whether C++, Objective C, Java, Ruby, or Python -- use to express that; focus on the concept of dispatch itself.
Learn
-
Statistical programming with R: Part 1. Dabbling with a wealth of statistical facilities (developerWorks, September 2004) introduces R and relates it to basic statistical analysis.
-
Statistical programming with R: Part 2. Functional programming and data exploration (developerWorks, October 2004) adds to part one by diving into the language's functionality.
-
Wikipedia defines Monte Carlo integration, algorithms for solving various kinds of computational problems by using random numbers.
-
Michele Simionato's article, The Python 2.3 Method Resolution Order, has a good discussion of the concepts of method resolution order (MRO) and the merits of different MRO algorithms.
-
The R formal methods package provides an implementation of formal classes and methods.
-
The R development core team offers R: A Language and Environment for Statistical Computing (12MB PDF file), a full reference manual for R; Chapter 4 discusses formal methods and the
methodspackage. -
Beginning Haskell (developerWorks, September 2001) is a gentle tutorial introduction to the paradigm of functional programming with specific illustrations in the Haskell 98 language.
-
Phillip J. Eby gave a nice summary presentation on generic functions at PYCON '05 and has made the slides available.
-
Charming Python: Multiple dispatch (developerWorks, March 2003) provides a solid framework for the generalization of OOP to generic functions (a generic framework that is also a superset of R's capabilities).
-
Find more resources for Linux developers in the developerWorks Linux zone.
Get products and technologies
-
Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
-
Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.
Discuss
-
Get involved in the developerWorks community by participating in developerWorks blogs.

To David Mertz, all the world is a stage; his career is devoted to providing marginal staging instructions. For more on his life, see his personal Web page. He's been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book, Text Processing in Python. You can contact David at mertz@gnosis.cx.
Comments (Undergoing maintenance)





