Skip to main content

Statistical programming with R, Part 3: Reusable and object-oriented programming

A look at R's underlying features

David Mertz (mertz@gnosis.cx), Developer, Gnosis Software
David Mertz
To David Mertz, all the world is a stage; his career is devoted to providing marginal staging instructions. For more on his life, see his personal Web page. He's been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book, Text Processing in Python. You can contact David at mertz@gnosis.cx.

Summary:  R is rich statistical environment, released as free software, that includes a programming language, an interactive shell, and extensive graphing capability. This article follows up David's two prior installments (written with Brad Huntting) and looks at the object-orientation in R along with some additional general programming concepts in R.

View more content in this series

Date:  26 Jan 2006
Level:  Intermediate
Activity:  4226 views

The first two installments of this series looked at R in fairly "real world" usage. We explored various statistical analyses and graphing capabilities using a large series of temperature data that the coauthor of those installments had collected. As mentioned in those prior articles, what we actually explored barely scratched the surface of the deep and rich statistical libraries in R.

In this article, I want to leave aside consideration of further statistical analysis per se (in large part because I do not personally have the requisite knowledge of statistics to decide the most relevant techniques; both my earlier coauthor, Brad Huntting, and many readers know far more in these areas). As a complement to the rich statistical concepts we offered in the first two articles, I would like to provide readers with some pointers to the underlying language facilities of R. The prior installments made gestures towards the functional programming orientation of R; my hunch is that many readers will be more familiar with imperative and object-oriented languages.

As well, we have previously only looked at R in a rather ad hoc fashion. In this installment, I'll discuss creating reusable and modular components for R development.

Back to basics

Before getting to R's notion of object orientation, let's review and clarify R's data and functions. The main thing to keep in mind about R data is that "everything is a vector." Even objects that look superficially distinct from vectors -- matrices, arrays, data.frames, etc. -- are really just vectors with extra (mutable) attributes that tell R to treat them in special ways.

The dimension (spelled dim) is one of the most important attributes that (some) R vectors have. The functions matrix(), array(), and dim() are simply convenience functions for setting the dimensions of a vector. R's OOP system similarly boils down to what (if anything) is in an object's class attribute.

Let's review dimensioning by first looking at the code in Listing 1:


Listing 1. Creating vectors and assigning dimensions

> v = 1:1000
> typeof(v)
[1] "integer"
> attributes(v)
NULL
> dim(v) = c(10,10,10)  # (Re)dimension
> attributes(v)
$dim
[1] 10 10 10
> v2 = matrix(1:1000, nrow=100, ncol=10)
> typeof(v2)
[1] "integer"
> attributes(v2)
$dim
[1] 100  10
> attr(v2,'dim') = c(10,10,10)  # Redimension
> attributes(v2)
$dim
[1] 10 10 10

In short, there are several syntax conveniences for attaching a dim attribute to a vector, but at heart that is all the syntax sugar does.

One thing that can be confusing about R's "everything is a vector" approach is that row-wise and column-wise operations may not adhere to your intuition. For example, it's simple enough to create a 2D array (a matrix) and even to operate on a single column or row:


Listing 2. Operating on matrix vector and row-wise

> m = matrix(1:12, nrow=3, ncol=4)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> sum(m)  # sum of all elements of m
[1] 78
> sum(m[1,])  # sum of first row
[1] 22

But if you wanted to create a vector to sum each row, you might be inclined to try the following:


Listing 3. The wrong way to perform multiple row-wise operations

> sum(m[c(1,2,3),])  # NOT sum of each row
[1] 78

You could construct a loop here, but that feels at odds with R's functional and vector-oriented operations. Instead, the trick is to use the function apply():


Listing 4. apply() function for row-wise operations

> apply(m, 1, sum) # by row
[1] 22 26 30
> apply(m, 2, sum) # by column
[1]  6 15 24 33
> apply(m, c(1,2), sum) # by column AND row (sum of each single cell)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Kinda worthless to sum each single cell, but what about this:
> a = array(1:24,c(3,4,2))
> apply(a, c(1,2), sum)  # sum by depth in 3-D array
     [,1] [,2] [,3] [,4]
[1,]   14   20   26   32
[2,]   16   22   28   34
[3,]   18   24   30   36
> apply(a, c(3), sum)    # sum of each depth slice
[1]  78 222


An infinite sequence

A construct that is sometimes useful to have, for perfectly practical reasons, is an infinite sequence of numbers. For example, my coauthor of the prior installments was doing some analysis of Monte Carlo integration techniques and for his purposes, it was useful to have an infinitely long sequence of random numbers. Understand that the type of infinite sequence needed is not simply the ability to generate a new number as needed; it is also necessary to be able to refer back to a specific previously referenced element and have it refer to the same value it had before.

Obviously, no computer language nor computer can store an infinite sequence -- what they can store is a lazy and unbounded sequence. More elements can be added to a realized list when and only when, access is required. For example, in Python you might accomplish this by creating a list-like object with a custom .__getitem__() method that extends the internal list as needed. In Haskell, laziness is built deeply into the language -- in effect, everything is lazy. In my Haskell tutorial (Resources) I used the example of creating a list of all the prime numbers:


Listing 5. Haskell list of all primes with Sieve of Eratosthenes

primes :: [Int]
primes = sieve [2 .. ]
sieve (x:xs) = x : sieve [y | y <- xs, (y `rem` x)/=0]

In respect to the infinite, R is closer to Python than it is to Haskell. You need to explicitly construct more list elements as needed. We need to wait for the OOP section to let vector indexing itself launch the behind-the-scenes mechanism; there is still not much scaffolding involved.


Listing 6. Define a vector and a means to dynamically expand it

inf_vector = rnorm(10, 0, 1)   # arbritrarily start w/ init 10 items
assure <- function(index) {
  extend_by = max(index-length(inf_vector), 0)
  extras = rnorm(extend_by, 0, 1)
  v <- c(inf_vector, extras)
  assign("inf_vector", v, env=.GlobalEnv)
  return(index)
}
getRand <- function(index) {
  assure(index)
  return(inf_vector[index])
}

Probably the preferred usage is to access values through the getRand() wrapper function. Notice that you are perfectly free to use slices or computed values, as well as single indices:


Listing 7. Using wrapper function as proxy to infinite virtual vector

> getRand(3)                # Single index
[1] 0.5557101
> getRand(1:5)              # Range
[1] -0.05472011 -0.30419695  0.55571013  0.91667175 -0.40644081
> getRand(sqrt(c(4,16)))    # Computed index collection
[1] -0.3041970  0.9166717
> getRand(100)              # Force background vector extension
[1] 0.6577079

If you prefer, you can simply assure() the vector is large enough before using elements:


Listing 8. Extending vector (if needed) before access

> assure(2000)
[1] 2000
> inf_vector[1500]
[1] 1.267652


Object-oriented R

R is capable of fully general object-oriented programming, but to understand this, you need to step back a bit in your thinking about what OOP is. Users of languages like Java™ and C++, or even of Python, Ruby, or Smalltalk, might have a relatively circumscribed picture of object orientation. Not wrong, but limited to one model.

R's approach to OOP is based on generic functions rather than on class hierarchies. This concept will be familiar to readers who have used Lisp's CLOS or who have read my discussion of multiple dispatch using Python (Resources). Unfortunately, R's approach is still the single-dispatch approach; in that respect, it is equivalent to the "traditional" languages I mentioned (C++, Java, etc.).

I should note -- though I will not discuss it in detail in this article -- that the recent version of R comes with a package called methods that defines and manipulates so-called "formal methods." In many ways, using these formal methods imposes much of the discipline (and limitations) you find in traditional OOP languages. In any case, formal OOP in R is built atop the "informal OOP" that I talk about in this article. The methods package is still somewhat tentative from what I can tell, but some moderately tweaked version of it seems certain to continue in later R versions. Resources will offer more background.

The thing to keep in mind in understanding OOP sui generis is that OOP is not really a matter of inheritance specifically, but of dispatch decisions more generally. That is, a call to obj.method() in a traditional OOP language will look through the method resolution order (MRO) of an object to find the "first" ancestor class of obj that has a method .method().

What "first" means is a more subtle question than is obvious at first (see Resources for a good discussion of the evolving MRO design in Python). R makes the same decision, but it turns the idea of inheritance inside out -- rather than have a bunch of classes which may define and override various methods in their bodies, R creates a family of generic functions that carry a tag describing what type of object they want to operate on.

Generic functions

As a simple example, let us create a generic function called whoami() and some tagged methods to dispatch to:


Listing 9. Creating a generic function and some tagged methods

#------------- Create a generic method
> whoami <- function(x, ...) UseMethod("whoami")
> whoami.foo <- function(x) print("I am a foo")
> whoami.bar <- function(x) print("I am a bar")
> whoami.default <- function(x) print("I don't know who I am")

The key here is that every object in R may belong to zero, one, or more classes. Specifically, the MRO of any given object (relative to a particular method) is simply the vector of named classes, if any, in its class attribute. For example:


Listing 10. Tagging objects with class memberships

> a = 1:10
> b = 2:20
> whoami(a)                 # No class assigned
[1] "I don't know who I am"
> attr(a,'class') <- 'foo'
> attr(b,'class') <- c('baz','bam','bar')
> whoami(a)
[1] "I am a foo"
> whoami(b)                 # Search MRO for defined method
[1] "I am a bar"
> attr(a,'class') <- 'bar'  # Change the class of 'a'
> whoami(a)
[1] "I am a bar"

As with traditional inheritance languages, an object need not utilize the same class for every method it calls. Traditionally, if Child inherits from Mom and Dad an object of type Child might utilize .meth1() from Mom and .meth2() from Dad. You can do this in R, naturally, but Mom and Dad are nothing substantial, just names:


Listing 11. Dispatch resolution per method

> meth1 <- function(x) UseMethod("meth1")
> meth1.Mom <- function(x) print("Mom's meth1")
> meth1.Dad <- function(x) print("Dad's meth1")
> meth2 <- function(x) UseMethod("meth2")
> meth2.Dad <- function(x) print("Dad's meth2")
> attr(a,'class') <- c('Mom','Dad')
> meth1(a)   # Even though meth1.Dad exists, Mom comes first for a
[1] "Mom's meth1"
> meth2(a)
[1] "Dad's meth2"

Including ancestors

It might seem limiting to need to explicitly specify the MRO of an object rather than rely on its implicit resolution through inheritance syntax. In point of fact, you can perfectly easily implement inheritance-based MROs with a minimal wrapper functions. The MRO I use in Listing 11 is probably not the best one possible (see Simionato's essay in Resources), but it shows the idea:


Listing 12. Implementing inheritance-based MRO with minimal wrapper functions

char0 = character(0)
makeMRO <- function(classes=char0, parents=char0) {
    # Create a method resolution order from an optional
    # explicit list and an optional list of parents
    mro <- c(classes)
    for (name in parents) {
        mro <- c(mro, name)
        ancestors <- attr(get(name),'class')
        mro <- c(mro, ancestors[ancestors != name])
    }
    return(mro)
}
NewInstance <- function(value=0, classes=char0, parents=char0) {
    # Create a new object based on initial value,
    # explicit classes and parents (all optional)
    obj <- value
    attr(obj,'class') <- makeMRO(classes, parents)
    return(obj)
}
MaternalGrandma <- NewInstance()
PaternalGrandma <- NewInstance()
Mom <- NewInstance(classes='Mom', parents='MaternalGrandma')
Dad <- NewInstance(0, classes=c('Dad','Uncle'), 'PaternalGrandma')
Me <- NewInstance(value='Hello World', 'Me', c('Mom','Dad'))

In action, the code in Listing 12 looks like this:


Listing 13. An object with an inheritance-generated MRO

> print(Me)
[1] "Hello World"
attr(,"class")
[1] "Me"              "Mom"             "MaternalGrandma" "Dad"
[5] "Uncle"           "PaternalGrandma"

If you wish to follow a traditional approach to class/instance relationships, you want to include the name of the class you create (like Mom in its classes argument). Actually, given that each class is itself a perfectly good object, the above system is closer to prototype-based OOP systems than class-based ones.

Then again, the whole system is flexible enough to include all variations. You are perfectly free to segregate class objects from instance objects if you wish -- you might distinguish classes with a naming convention (like Mom versus mom) by attaching another attribute (like type could be class or instance; and utility functions would check that) or by other means.

Revisiting an infinite vector

Now that we have some OOP scaffolding to work with, we can actually do a better job with the infinite vector that was presented previously. The first solution is perfectly workable, but it might be nice to have an even more seamless and invisible infinite vector.

Operators in R are just shorthand ways to make function calls; you are just as free to specialize operator behavior on a class basis as you are any other function call. While we are at it, we can fix a few other weaknesses in the first system:

  • We would like to be able to generate as many distinct infinite vectors as need.
  • We would like to be able to configure the random distribution used.
  • We would like to have the option of initializing an infinite random vector with the values in another vector.

So let's do all that:


Listing 14. Define an indexable infinite random vector

"[.infinite_random" <- function(v, index) {
    name <- attr(v, 'name')
    rfunc <- attr(v, 'rfunc')
    extend_by = max(index-length(v), 0)
    extras = rfunc(extend_by)
    new <- c(v, extras)
    makeInfiniteRandomVector(name, v=new, rfunc)
    return(new[index])
}
unitnorm <- function(n) return(rnorm(n,0,1))
empty <- vector('numeric', 0)
makeInfiniteRandomVector <- function(name, v=empty, rfunc=unitnorm) {
    # Create an infinite vector
    # optionally extend existing vector, configurable rand func
    attr(v,'class') <- 'infinite_random'
    attr(v,'name') <- name
    attr(v,'rfunc') <- rfunc
    assign(name, v, env=.GlobalEnv)
}
makeInfiniteRandomVector('v')
# makeInfiniteRandomVector('inf_poisson', rfunc=my_poisson)
# Usage is just, e.g.: v[1]; v[10]; v[9:12]; etc.

Indexing is already defined by R as a generic function, so there is no need to call UseMethod() to set it up; you just define as many new specializations as you wish. Likewise, the built-in print() function is also generic. We could specialize that with:


Listing 15. Printing an infinite vector

print.infinite_random <- function(v) {
    a_few = 5
    len = length(v)
    end_range = (len-a_few)+1
    cat('* Infinite Random Vector *\n')
    cat('[1] ', v[1:a_few], '...\n')
    cat('[')
    cat(end_range)
    cat('] ', v[end_range:len], '...\n')
}

In action, the code produces the following:


Listing 16. Example of printing infinite vector

> v[1000]
[1] -1.341881
> print(v)
* Infinite Random Vector *
[1]  -0.6247392 1.308057 1.654919 1.691754 -2.251065 ...
[996]  1.027440 0.8376 -0.7066545 -0.7778386 -1.341881 ...


Conclusion

Programming general purpose functions, objects, and classes in R requires taking a step back from the ways of thinking that traditional procedural and object-oriented programmers are used to. The prior two installments showed you some examples of ad hoc statistical exploration and did not really require such a rethinking, but once you wish to reuse your code, it is worth understanding the concepts of generic functions and the "inside out" OOP you can write with them (the inside out form is actually more general).

The trick is to think of OOP entirely as a question of "what code gets called" and "how is the decision made." Do not get hung up on the specific syntax familiar languages -- whether C++, Objective C, Java, Ruby, or Python -- use to express that; focus on the concept of dispatch itself.


Resources

Learn

Get products and technologies

  • Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

  • Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.


Discuss

About the author

David Mertz

To David Mertz, all the world is a stage; his career is devoted to providing marginal staging instructions. For more on his life, see his personal Web page. He's been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book, Text Processing in Python. You can contact David at mertz@gnosis.cx.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=102649
ArticleTitle=Statistical programming with R, Part 3: Reusable and object-oriented programming
publish-date=01262006
author1-email=mertz@gnosis.cx
author1-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers