SPSS Statistics was, as far as I know, the first commercial software to deliver an integration with the R statistical language. It first appeared in SPSS 16, over six years ago at this writing, complementing the Python language integration that first appeared in SPSS 14. This post reviews the rationale for this feature and how it has developed.

R has become the computational language of the statistics profession. It's the way a new statistical algorithm is first published, and the R library contains a vast collection of statistical functions. The R jungle contains many gems, but it has drawbacks, too. This isn't the place to discuss all the good and the bad, but suffice it to say that using R directly imposes a style of statistical analysis based on a programming model that does not always suit an analyst, and the output from an R package is usually not in a format suitable for publication. And, while various point-and-click interfaces can be added on for some R packages, serious usage requires the user to learn the R language, which is not easy.

SPSS Statistics and, as of Version 16, SPSS Modeler let a user keep the ease of use and output presentation capabilities of these products while still tapping the power and packages of R. While a user can write programs in the R language that run within Statistics or Modeler, the typical SPSS user instead takes advantage of R packages that have already been integrated using the published APIs and tools, with no need to learn R or deal with it directly. R packages can extend the statistical capabilities of these products without sacrificing the benefits of SPSS software. The R connection requires an extra installation step (R itself and, via the SPSS Community website, the R Essentials), but all the pieces for this are free. Statistics and Modeler can be a great way to deploy the functionality of R.

Organizations and individuals can do their own, private integrations of R packages, but the SPSS Community site provides a means of sharing integrations with everyone. Instructions for sharing are on the front page of the site. For SPSS Statistics, you can start here to see what has been shared. With Statistics version 22 or later, you can also download and install package integrations from the *Utilities* menu within Statistics without even visiting the site.

The image also shows extensions implemented in Python.

As of this writing, there are 25 R packages that have been integrated by IBM and 10 contributed by users. Package integrations generally include a dialog box interface produced by the Statistics Custom Dialog Builder and traditional SPSS syntax for the package. They produce their output as SPSS pivot tables and R graphic images that appear in the Statistics Viewer along with other output produced by native SPSS commands. For packages that are included in the R Essentials, the dialog and output are usually translated into all the languages that Statistics itself provides.

For Modeler 16, an adaptation of the Custom Dialog Builder is included, and nodes can build models and provide code to be used with those models for scoring. Using the new Hadoop integration, the scoring can be performed on the Hadoop cluster with big performance benefits. As with Statistics, the user of R-based nodes sees the same behavior that comes with native nodes.

Producing a package integration for Statistics is usually easy for someone who knows the R language. It can be as simple as adding a line to fetch the data. Usually the integration will also convert plain text R output to one or more pivot tables, and it may create new datasets. That takes a little longer, but it is still typically a matter of a few hours to a few days. The APIs for all this are covered in detail in the help installed with the R plug-in: Help > Programmability > R Plug-In. Since the package integrations are usually distributed in source form, they can serve as examples. Integration creators can also translate their dialogs and output, but not many are prepared to handle this. The SPSS forums are a good place to ask questions about this technology.
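To make the "adding a line to fetch the data" idea a little more concrete, here is a minimal sketch in Python that generates the syntax for a `BEGIN PROGRAM R` block. The helper name `wrap_r_for_statistics`, the data frame name `dta`, and the variables `age` and `income` are hypothetical, made up for illustration. `spssdata.GetDataFromSPSS` and `spsspivottable.Display` are functions from the R plug-in, but the arguments shown here are assumptions that should be checked against the plug-in help.

```python
def wrap_r_for_statistics(r_body, variables=None, frame_name="dta"):
    """Return SPSS syntax text that runs r_body inside BEGIN PROGRAM R.

    Hypothetical helper for illustration only: it builds text and does not
    talk to Statistics or R.
    """
    if variables:
        # Ask for only the named variables to keep the data transfer small.
        quoted = ", ".join('"{}"'.format(name) for name in variables)
        fetch = "{} <- spssdata.GetDataFromSPSS(variables=c({}))".format(
            frame_name, quoted
        )
    else:
        # Fetch the whole active dataset.
        fetch = "{} <- spssdata.GetDataFromSPSS()".format(frame_name)
    return "\n".join(["BEGIN PROGRAM R.", fetch, r_body.strip(), "END PROGRAM."])


# Ordinary R code. The only Statistics-specific parts are the fetch line
# added by the helper and the pivot table call at the end.
R_BODY = """
fit <- lm(income ~ age, data = dta)
spsspivottable.Display(coef(summary(fit)), title = "Coefficients")
"""

syntax = wrap_r_for_statistics(R_BODY, variables=["age", "income"])
print(syntax)
```

The generated text is what you would paste into a syntax window. The middle lines are plain R, which is the point: the integration work is mostly the fetch at the top and the pivot table call at the bottom.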

A white paper on the benefits and technology of using R with Statistics provides more details.

In sum, the R integrations for Statistics and Modeler allow access to the large R library but package it in a form that fits in with the native capabilities of these products. It's a win for everyone, and it's all free.

## Comments (10)

1. Gwylym commented: Why is the R Essentials limited to using R 2.15? The current version is 3.0.2.

2. JonPeck commented: We know. We generally integrate with the latest stable version of R at the time we have to make our cut for a Statistics release. We have to use R header files from a particular R version in order to compile our code, and we have to get that R distribution approved by Legal before we can ship as well as do all our QA testing. We hope to loosen the requirement of a specific R version at some point in the future. However, you can have multiple R versions installed on the same machine without them interfering with each other. I have six.

3. Gwylym commented: Is there a document that gives a quick guide to installing multiple versions of R?

4. JonPeck commented: There are no special instructions for installing multiple R versions. Beyond the obvious requirement that they be installed in different directories, you would just select the version you want from the CRAN site (not always easy to find, but look under Previous Releases) and do the standard R install. Only the first two parts of the R version number matter for Statistics. I avoid installing R under Program Files, since then Admin rights are required in some cases, but that's just a convenience.

5. Flavio_F commented: Dear Jon,

I tried to move away from the basic "BEGIN PROGRAM R - END PROGRAM" implementation and wrapped the functions in an extension command, in the hope that this might increase speed, but the running time remained exactly the same.

Or do I have to live with it?

6. JonPeck commented: The R code running within Statistics should run at the same speed as running in R standalone once the R engine is started. And once started, that process will continue to be available through the Statistics session without restarting. However, data transfer between the Statistics backend process and the R process or vice versa is slow. Can you identify where the time is being spent in the R code?

7. Flavio_F commented: Dear Jon. Thank you so much for the extremely quick reply! You are absolutely right, all the time is indeed spent on the data transfer. I checked it by just transferring my dataset from SPSS to R and back without doing any computations in between. The problem probably lies in the size of the dataset, which is quite large (ca. 23000 rows x 300 columns). I guess the only solution is then to first subset the dataset in SPSS and send to R only the specific subset needed for a given analysis? I should find a way to automate that though; I'm unfortunately not yet very familiar with SPSS syntax.

8. JonPeck commented: The biggest slowdown is in writing back to Statistics, especially with a lot of variables, so if you can request only the variables you actually need, things should go faster.

9. Flavio_F commented: ... oh well, of course I can transfer the whole dataset once at the beginning of the session and that's it, instead of having the data transfer step embedded in the function at each request. Well, thank you so much!!

10. Flavio_F commented: Sorry, to avoid misunderstandings, I wrote the last comment before having seen yours. I guess it should also be possible to transfer the dataset once at the beginning of the session, and then run multiple requests on it from SPSS?