Calling R from SPSS

An introduction to the R plug-in for SPSS

Starting with version 16, IBM® SPSS® provides a free plug-in that enables you to run R syntax from within SPSS. The plug-in connects R to the active database. You can write results that are obtained from R into a new SPSS database for further manipulation in SPSS. This article is for the reader who is familiar with R and SPSS but who has not yet tried to use them in tandem.

Catherine Dalzell (mail@catherinedalzell.ca), Statistician, Dalzell Consulting

Catherine Dalzell is a statistician with more than 15 years of experience in data mining and data analytics, mostly in a healthcare setting. She first used the S language in the 1980s. She has followed with enthusiasm the development of the S language through S-Plus and R as the language has brought flexible data analytics and high-level graphics to the desktop. She holds a doctorate from Carnegie-Mellon University and a master's degree in Biomathematics from the University of Oxford. Currently, she teaches at the University of Ottawa and runs her own statistical consulting business.



31 October 2013


IBM® SPSS® can talk to R. It's something of a well-kept secret, judging from the low level of activity in the R blogosphere on this point. The low level of interest is not surprising: SPSS users are, more often than not, people who use only SPSS for their data analysis; and R users are accustomed to applying ugly hacks as part of doing business with R. An R user who wants to analyse data in .sav format typically opens the file in SPSS, saves it to comma-separated values (CSV) format, and opens the result in R by using the read.csv() function. A cleaner way is to save to SPSS Statistics Portable (POR) format from SPSS and open the result by using the read.spss() function from the foreign package. This approach usually works, in the sense that only a few dozen lines of R code are then required to cope with categories, missing values, time variables, and other features that are either lost or damaged in translation. If you need to return data from R to SPSS, the journey is more awkward still.
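To see what "lost in translation" means in practice, here is a minimal sketch in plain R that needs no SPSS at all. The small data frame is invented for illustration; it stands in for an SPSS file with an ordered category and a date. Writing it to CSV and reading it back shows how the column classes degrade.

```r
# A small data frame standing in for an SPSS file (hypothetical values)
original <- data.frame(
  Rating = factor(c("Poor", "Excellent", "Average"),
                  levels = c("Poor", "Average", "Excellent"),
                  ordered = TRUE),
  Date = as.Date(c("2013-07-31", "2013-08-01", "2012-08-02"))
)

# Round trip through a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
write.csv(original, csv_file, row.names = FALSE)
reread <- read.csv(csv_file, stringsAsFactors = FALSE)

# Compare the first class of each column before and after the round trip
classes_before <- sapply(original, function(x) class(x)[1])
classes_after <- sapply(reread, function(x) class(x)[1])
print(classes_before)
print(classes_after)
```

After the round trip, both columns come back as plain character strings, so the ordering of the rating levels and the date class have to be rebuilt by hand.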

Tedious data manipulation notwithstanding, you can certainly work both applications without a plug-in to connect them. Is the effort of learning the plug-in worth the gain in productivity? Is there a gain in productivity, or are the advantages of a different sort? To these questions, I would answer Yes and Yes. Translating from one data format to another is always tricky and time consuming. When you use R from SPSS, you can apply R functions to SPSS data while you maintain the integrity of the original database.

The R integration plug-in has a further advantage: where R and SPSS are both used on the same data, it fosters reproducible research.

Reproducible research

Reproducible research is mainly an organizational principle. Given the original data file and the syntax file, it is possible to re-create every step of the analysis from these two files. Months later, if you need to return to the problem with additional data or a new analysis, it is possible to rebuild the original project. With SPSS, you can maintain a record of every procedure that is run on the data, be it a transformation of the data, the creation of new variables, or an analysis. If R is to play a role in the analysis, either as an assist in recoding variables or to supply a function not currently available in SPSS, maintaining both SPSS and R syntax in the same syntax file has value. You can run SPSS and R code from the same SPSS syntax file and apply it to the same database. Everything stays together.


Extending the functionality of SPSS

In a previous article, I argued that data analysts should learn R. Briefly, most advances in statistics appear first as R packages before they are added to the drop-down menus. R gives the SPSS user more tools for the job, and although you might implement these tools outside of SPSS by exporting the data, data export is never seamless. With the R plug-in, you retain all the features of an SPSS database, particularly the labels of category data and the long descriptors.


R extensions

SPSS allows you to create more menu items and add them to the existing menu bar. In particular, R functions can be bundled as extensions and supplied to you through the menu. You can implement a function in R with no knowledge of R programming. Writing extensions goes beyond the scope of this article, but they are an important reason to learn to use the R plug-in. Through this plug-in, you can supply R functions to SPSS users who are unfamiliar with R.


Finding and installing the plug-in

Installing the plug-in is fairly straightforward, but the process does contain a few hurdles. For one thing, you must start several pages before the actual download page. You also need to register with IBM developerWorks, if you have not already done so. It's free.

Another hurdle in the installation is that the plug-in works with only one version of R, not necessarily the current one. Which version of R you need depends on the version of SPSS you are running. Unfortunately, the download page does not specify. However, for SPSS version 22, use R-2.15. For SPSS version 21, use R-2.14.0.

Be warned that the R integration plug-in is specific about the R version. For SPSS version 21, for example, you must install R-2.14.0. If you install 2.14.1 or 2.14.2, it will not work. During the installation process, the plug-in looks for a folder that contains the correct version of R. For example, if you use SPSS version 21 on Windows®, it looks for C:\Program Files\R\R-2.14.0. The installer queries you for the location of R if it can't find the folder that it wants. From this query, you can infer the precise version of R you need:

  1. Download the appropriate version of R from CRAN (the Comprehensive R Archive Network) and install it.

    If you already use a different version of R and you want to keep it as your default, be sure to clear the Store version number in registry check box. If you want to install R packages to run with SPSS, you need to install them from the version of R that SPSS uses. Packages that are installed under a different version of R are invisible to the R integration plug-in.

  2. To find the plug-in for download, click Help > Working with R from the menu bar in SPSS to reach the opening page.
  3. Midway down the page, click the link for SPSS plug-ins.

    SPSS has many plug-ins, but select the one for R. This link brings you to the login screen for IBM downloads.

  4. On the login page, log in or register (it's free). Proceed to the download page.
  5. Each version of SPSS has its own plug-in. Find the one for your version, download it, and install it.

    At this stage, if you don't have the correct version of R installed, you see a message that the installer can't find it. Install the version of R that the installer is looking for, and try again.

  6. If installation is successful, the installer displays a large documentation file.

    With the installation of the plug-in, this file is available from the SPSS Help menu under Programmability > R plugin. The Working with R menu command now points to more documentation and tutorials.
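Because the plug-in is tied to one R version and its package library, it can be worth confirming which R installation you are actually talking to. The following sketch uses only base R. Run from SPSS, it would sit between BEGIN PROGRAM R. and END PROGRAM.; run from an ordinary R console, it reports that console's installation instead, which makes for a handy comparison.

```r
# Report which R installation is running and where it looks for packages
r_version <- R.version.string
lib_paths <- .libPaths()

cat("Running:", r_version, "\n")
cat("Package libraries:\n")
cat(paste(" ", lib_paths), sep = "\n")
```

If the library paths differ between the two sessions, packages installed from one are probably not visible to the other.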


Using R from SPSS

The R integration plug-in does two things: It opens communication between SPSS and R, and it provides R with a package of functions with which to translate SPSS data structures into R objects.

Hello R!

Open a syntax file, and type the following lines. Select and run the command by clicking the green arrow:

BEGIN PROGRAM R.
cat("\t\tHello R!\n")
END PROGRAM.

The line BEGIN PROGRAM R. launches R and loads the requisite library of data management functions. It also sets several option variables for R that override any options that you might set in your .First() function.

The first and last lines here follow the conventions of SPSS syntax code and end with a period (.). All code between those two lines is interpreted as R code and must obey the rules of R syntax, so no period marks the end of a line.

When SPSS meets the END PROGRAM. statement, it interprets subsequent commands as SPSS syntax, but it does not quit the R session. Any variables that an R chunk creates are available to subsequent R chunks during the SPSS session.


Reading data into R and returning changes to SPSS

R chunks that are called from SPSS can read and write data from external sources in the usual way. But if you run R from SPSS, it's because you want access to an SPSS database. I created a simple test database to illustrate different data types, available with the downloads. Consider the lines in Listing 1.

Listing 1. Read and write a database
BEGIN PROGRAM R. 
# Pull the data into a data frame
testData = spssdata.GetDataFromSPSS() 

# Pull the data dictionary into another data frame
testDict = spssdictionary.GetDictionaryFromSPSS()

# Take a look 
print(testData) 
print(testDict)

# Check what data types the variables of the R data frame have

lapply(testData, class)

# Set up a new SPSS database with the same dictionary 
spssdictionary.SetDictionaryToSPSS("Test2",testDict) 

# Copy the data to the new SPSS database
spssdata.SetDataToSPSS("Test2", testData) 

# Tell SPSS you're done creating data
spssdictionary.EndDataStep() 

END PROGRAM.

When you run this code, the output in Listing 2 should appear in an SPSS output file.

Listing 2. Output reading and writing a database
             CustName Age Rating        Date Weight 
1 Mary                  21      1 13594608000   55.2 
2 John                  45      3 13594694400   73.4 
3 Henry                 33      2 13563244800   80.0 
                               X1    X2              X3                  X4     X5
varName                  CustName   Age          Rating                Date Weight
varLabel            Customer Name   Age Customer rating Date of first trans Weight
varType                        20     0               0                   0      0
varFormat                     A20    F8              F6             ADATE10   F5.1
varMeasurementLevel       nominal scale         ordinal               scale  scale

$CustName 
[1] "factor" 
 
$Age 
[1] "numeric" 
 
$Rating 
[1] "numeric" 
 
$Date 
[1] "numeric" 
 
$Weight 
[1] "numeric"

What just happened?

The great strength of SPSS as a data vault lies in the detailed data dictionary that you can create. You can store some of this information (variable types and names) as class and variable names in an R data frame but not without some loss of detail. The R integration plug-in lets you create two data frames from the active SPSS data set: one for the data and one for the data dictionary.

Data conversion from SPSS to R

Look at each variable in turn from the test database and see what happens when it is read into R:

  • CustName. This variable is a string variable of length 20 in SPSS, nominal type. It becomes a factor in R.
  • Age. This variable is numeric in SPSS, scale type, of length 6 with no decimals. It becomes numeric in R.
  • Rating. This variable is numeric of type ordinal. It becomes numeric in R. The numeric codes were given descriptive labels in SPSS that are lost in translation. (For more about categorical data, see Working with categories.)
  • Date. This variable is a date, formatted dd-mmm-yyyy. It becomes numeric in R. (For more about dates, see Working with dates.)
  • Weight. A numeric variable that is formatted in SPSS to have one decimal. It becomes numeric in R.
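One detail in Listing 2 is easy to miss: the customer names appear to arrive padded with trailing blanks to the full string width of 20. The sketch below, in plain R, mocks up such a column and trims it. The padded values are my reconstruction from the printed output, not data read from SPSS, so treat the padding as an assumption.

```r
# Mock-up of the imported column: strings padded to width 20, stored as a factor
padded <- factor(sprintf("%-20s", c("Mary", "John", "Henry")))

# Remove trailing blanks and rebuild the factor
CustName <- factor(sub(" +$", "", as.character(padded)))

print(nchar(levels(padded)))    # widths before trimming
print(nchar(levels(CustName)))  # widths after trimming
```

Trimming before any comparison or merge avoids mismatches between "Mary" and a padded version of the same name.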

The data dictionary

The data dictionary can be imported to a data frame in R, as shown in Listing 1. You don't need this dictionary to work on the data in R, but you do need to build a data dictionary to create an SPSS database. The data dictionary is a data frame of character vectors. It has one column for each variable of the SPSS database and one row for each entry in the dictionary. As you can see from the example in Listing 2, a range of format types is available. The complete list is given in the documentation for the R plug-in.
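To make the layout concrete, here is a minimal sketch that builds the first two columns of the dictionary in Listing 2 by hand in plain R. It is an approximation for illustration only; the object that the plug-in returns might carry additional attributes that are not reproduced here.

```r
# One column per SPSS variable, one row per dictionary entry
dict <- data.frame(
  X1 = c("CustName", "Customer Name", "20", "A20", "nominal"),
  X2 = c("Age", "Age", "0", "F8", "scale"),
  stringsAsFactors = FALSE
)
row.names(dict) <- c("varName", "varLabel", "varType", "varFormat",
                     "varMeasurementLevel")

# Look up a single entry, or a whole row of entries
print(dict["varFormat", "X1"])
print(dict["varMeasurementLevel", ])
```

Every entry is a character string, including the numeric-looking varType values, which is why the dictionary is described as a data frame of character vectors.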


Working with dates

14 October 1582: A date to remember

Nothing actually happened on this date in history. Realizing that it was later than he thought, Pope Gregory XIII decreed that 4 October 1582 would be followed, the next day, by 15 October. This change, with some adjustments to leap years, was the Gregorian reform of the calendar. SPSS stores dates as the number of elapsed seconds from midnight, 14 October 1582, the notional start of the Gregorian calendar. R stores date-times as elapsed seconds from 1 January 1970.

The R integration function spssdata.GetDataFromSPSS(), with no arguments, transforms dates into numbers. The number that you get is the elapsed time in seconds from midnight, 14 October 1582.

To convert the date variable for use in R, I might add testData$Date = as.POSIXlt(testData$Date, origin="1582-10-14", tz="UTC"). Alternatively, I can take advantage of a useful argument of the GetDataFromSPSS() function (see Listing 3).

Listing 3. Reading dates from SPSS into R
BEGIN PROGRAM R. 
# Pull the data into a data frame adjusting for dates
testData = spssdata.GetDataFromSPSS(rDate="POSIXct") 
testDict = spssdictionary.GetDictionaryFromSPSS()
print(testData) 
END PROGRAM. 


Running this code prints the dates in readable form:

              CustName Age Rating       Date Weight 
1 Mary                  21      1 2013-07-31   55.2 
2 John                  45      3 2013-08-01   73.4
3 Henry                 33      2 2012-08-02   80.0
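As a check on the arithmetic, the numeric values from Listing 2 can be converted by hand in plain R, with no plug-in involved. This sketch assumes an origin of midnight, 14 October 1582, in UTC, which is the origin that reproduces the dates printed for these customers.

```r
# Numeric date values as printed in Listing 2 (seconds on the SPSS time scale)
spss_seconds <- c(13594608000, 13594694400, 13563244800)

# Assumed origin: midnight, 14 October 1582, UTC
converted <- as.POSIXct(spss_seconds, origin = "1582-10-14", tz = "UTC")
print(format(converted, "%Y-%m-%d"))

# Gap between the SPSS origin and the R origin, in days
origin_gap_days <- as.numeric(as.Date("1970-01-01") - as.Date("1582-10-14"))
print(origin_gap_days)
```

Fixing the time zone to UTC keeps the result from shifting by a few hours, and possibly a calendar day, on machines in other time zones.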

Writing time data to SPSS

The next example shows how to write date-time data from R to SPSS. File IBM.csv contains a record of NYSE stock market data for IBM stock, obtained from the well-known finance site on Yahoo.com. Listing 4 shows the first few lines of data, reading back from 28 August 2013.

Listing 4. First lines of IBM.csv
Date        Open    High    Low     Close   Volume   Adj Close
28/08/2013  182.68  183.47  181.1   182.16  3979200  182.16
27/08/2013  183.63  184.5   182.57  182.74  3179300  182.74
26/08/2013  185.27  187     184.68  184.74  2170400  184.74
23/08/2013  185.34  185.74  184.57  185.42  2292700  185.42
22/08/2013  185.65  186.25  184.25  185.19  2354300  185.19
21/08/2013  184.67  186.57  184.28  184.86  3551000  184.86

I can read the data into SPSS, but the date format is not a format that the SPSS date-time wizard supports. R to the rescue! Using R syntax from SPSS, I can open the file from R, convert the date to an appropriate format, and create an SPSS database with the results. Here are the steps:

  1. The default working directory for the R integration plug-in is somewhere deep in the SPSS program directory tree. That's not what you want. Set the working directory to the location of your data file so that R can find it.
  2. These lines of code read in the dates, in character format, and convert them to Portable Operating System Interface for UNIX® (POSIX) date-time format, which the plug-in can write to an SPSS date variable.
  3. The spssdictionary.CreateSPSSDictionary() function automates some features of building up the data dictionary. Format DATE11 invokes date format 28-Aug-2013.
  4. Create the database and populate it.

Listing 5 shows how to carry out these steps.

Listing 5. Reading data directly into R and creating an SPSS database from them
BEGIN PROGRAM R.
# Set the working directory
setwd("C:\\Users\\Catherine\\SPSSWork") # (1)
IBM = read.csv("IBM.csv", header=TRUE, stringsAsFactors=FALSE)
# strptime() parses the day/month/year strings; as.POSIXct() stores the result
PosixDate = as.POSIXct(strptime(IBM$Date, format="%d/%m/%Y")) # (2)
IBM.spss = data.frame(Date=PosixDate, IBM[,-1])
head(IBM.spss)

# Create the data dictionary (3)
IBM.dict = 
  spssdictionary.CreateSPSSDictionary(c("Date","Trading date", "0", "DATE11","scale"), 
 c("Open","Opening price","0","F8.2","scale"),
 c("High","High price","0","F8.2","scale"),
 c("Low","Low price","0","F8.2","scale"),
 c("Close","Closing price","0","F8.2","scale"),
 c("Volume","Trading volume","0","F8.2","scale"),
 c("AdjClose","Adjusted closing","0","F8.2","scale")
)

# Create the new database (4)

spssdictionary.SetDictionaryToSPSS("IBM",IBM.dict)
spssdata.SetDataToSPSS("IBM",IBM.spss)
spssdictionary.EndDataStep()

END PROGRAM.
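The date handling in step 2 can be tried on its own in plain R before any SPSS database is involved. The sketch below parses the first few date strings from IBM.csv and shows how far apart the R and SPSS time scales are. It illustrates the arithmetic only, under the assumption of a 14 October 1582 origin; it is not a description of what the plug-in does internally.

```r
# Dates as they appear in the first rows of IBM.csv
raw_dates <- c("28/08/2013", "27/08/2013", "26/08/2013")

# Parse day/month/year strings; UTC avoids daylight-saving surprises
parsed <- as.POSIXct(strptime(raw_dates, format = "%d/%m/%Y", tz = "UTC"))
print(parsed)

# Seconds since 1 January 1970 (the R time scale)
unix_seconds <- as.numeric(parsed)

# Seconds since the assumed SPSS origin, 14 October 1582
spss_origin <- as.POSIXct("1582-10-14", tz = "UTC")
spss_seconds <- as.numeric(difftime(parsed, spss_origin, units = "secs"))

# The two scales differ by a constant offset
print(spss_seconds - unix_seconds)
```

A string that does not match the format comes back as NA, so a quick any(is.na(parsed)) after parsing is a cheap safeguard before writing the data anywhere.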

Working with categories

My simple example did not handle the categorical variable Rating at all well. R got the numeric codes for that variable but not the descriptive labels for the different levels the variable might take: Poor, Average, and Excellent.

You can do something about that issue. The factorMode argument that is shown in Listing 6 imports category levels instead of numeric values.

Listing 6. The factorMode argument
BEGIN PROGRAM R. 
testData = spssdata.GetDataFromSPSS(rDate="POSIXct", factorMode="labels") 
testDict = spssdictionary.GetDictionaryFromSPSS() 
print(testData) 
END PROGRAM. 

The Rating column now shows the labels instead of the codes:

              CustName Age    Rating       Date Weight 
1 Mary                  21      Poor 2013-07-31   55.2 
2 John                  45 Excellent 2013-08-01   73.4 
3 Henry                 33   Average 2012-08-02   80.0
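If the data were already imported with numeric codes, the labels can also be attached by hand in plain R. This sketch assumes the mapping 1 = Poor, 2 = Average, 3 = Excellent, which I infer from comparing Listing 2 with the output above; check it against your own data dictionary before relying on it.

```r
# Numeric codes as imported by default (from Listing 2)
rating_codes <- c(1, 3, 2)

# Assumed mapping of codes to labels
rating <- factor(rating_codes,
                 levels = c(1, 2, 3),
                 labels = c("Poor", "Average", "Excellent"),
                 ordered = TRUE)
print(rating)

# The codes remain recoverable from the factor
print(as.integer(rating))
```

Declaring the factor as ordered preserves the ordinal measurement level, so comparisons such as rating >= "Average" behave as expected.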

Building a dictionary for categorical variables

The factorMode argument gives me a choice, depending on whether I want numeric codes or labels for a categorical variable. But I need more if I want to create an SPSS database with categorical data. The solution lies in adding further structure to the data dictionary. The example in Listing 7 illustrates how to build an SPSS database from an R data frame with factors.

The famous iris data set is bundled with base R. It is a data frame with four numeric variables and one factor, denoting one of three species of iris. To build a database in SPSS, I complete the following steps:

  1. Create a data dictionary for the iris data.

    This dictionary is a data frame of five columns (one for each variable of the iris set).

  2. Create a category dictionary for the factor.

    The R structure here is complex. It is a list of length 2. The first component contains the names of the factors. The second component is a list of lists. Each item is a list of length 2: one component for the numeric codes and one component for their labels.

  3. Begin creation of an SPSS database by "setting" the data and category dictionaries.
  4. Populate the database.
  5. End the data step.
  6. Run the code.

    Doing so creates a database in SPSS but does not save it to disk. The active database remains whatever it was.

Listing 7. Building a dictionary for categorical data
BEGIN PROGRAM R.
data(iris)
head(iris)
iris.dict = vector(mode="list", length=5)
# Name the columns
names(iris.dict) = paste("X", 1:5, sep="")
# Fill in the numeric variables

for(i in 1:4){
  iris.dict[[i]] = c(names(iris)[i],"","0","F3.2","scale")
}

#
# Fill information for the category
iris.dict[[5]] = c("Species","Species of Iris","0","F3","nominal")
# Square it off and add row names
iris.dict = data.frame(iris.dict)
row.names(iris.dict) = c("varName","varLabel","varType","varFormat",
	"varMeasurementLevel")
#
# Now build the category dictionary
iris.cat = vector(mode="list",length=2)
names(iris.cat) = c("name","dictionary")
iris.cat$name = "Species"

# Note that the dictionary is a list of lists
# With only one category, the first list has length 1
# The dictionary list contains two lists
#

iris.cat$dictionary = vector(mode="list", length=1)
iris.cat$dictionary[[1]] = list(levels=c(1,2,3), 
	labels=levels(iris$Species))

#
# Now build the SPSS database.
spssdictionary.SetDictionaryToSPSS("Iris", iris.dict, iris.cat)
spssdata.SetDataToSPSS("Iris", iris, iris.cat)
spssdictionary.EndDataStep()

END PROGRAM.
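Building the category dictionary by hand gets tedious when a data frame has several factors. The following plain R sketch wraps the construction in a helper. The function name make_cat_dict is hypothetical, and the layout for more than one factor simply follows the description given earlier (a vector of names plus a list of lists); I have not verified it against the plug-in, so treat it as a starting point.

```r
# Hypothetical helper: build a category dictionary for every factor column
make_cat_dict <- function(df) {
  is_factor <- sapply(df, is.factor)
  factor_names <- names(df)[is_factor]
  # One inner list per factor: numeric codes and their labels
  dictionary <- lapply(factor_names, function(v) {
    list(levels = as.numeric(seq_along(levels(df[[v]]))),
         labels = levels(df[[v]]))
  })
  list(name = factor_names, dictionary = dictionary)
}

# For the iris data this yields the same structure that Listing 7 builds by hand
iris.cat2 <- make_cat_dict(iris)
str(iris.cat2)
```

Inspecting the result with str() before handing it to the plug-in is a quick way to confirm that the nesting is what you intended.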

Conclusion

The R integration package contains many functions to provide a seamless transfer from SPSS to R. For instance, SPSS allows greater flexibility in defining missing values than R does. The R integration package contains functions for managing missing values so that nothing is lost in passing from SPSS to R and back again. Another important feature is the ability to create SPSS extensions that use R. Menu items can be added to the Analyze menu that enable R functions to be run on the active data set without needing to write explicit code in a syntax file. In this way, you can make R functionality available to users who have no knowledge of R. The R integration package has a lot to offer data analysts who use both SPSS and R.


Download

Description                      Name                 Size
Sample R code for this article   R-code-examples.zip  33KB

Resources

Learn

  • See The Comprehensive R Archive Network, the main site for the R project and each R package.
  • Read Do I need to learn R? (Catherine Dalzell, developerWorks, September 2013) to learn why R is a valuable tool for data analytics that was expressly designed to reflect the way that statisticians think and work.
