Reading Case Data from IBM SPSS Statistics (R)

The spssdata.GetDataFromSPSS function reads case data from the IBM® SPSS® Statistics active dataset and, by default, stores it in an R data frame. You can retrieve the cases for all variables or for a selected subset of the variables in the active dataset. Variables are specified by name or by an index value representing position in the active dataset, starting with 0 for the first variable in file order.

Example: Retrieving Cases for All Variables

DATA LIST FREE /age (F4) income (F8.2) car (F8.2) employ (F4).
BEGIN DATA.
55  72  36.20  23
56  153 76.90  35
28  28  13.70  4
END DATA.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS()
print(casedata)
END PROGRAM.

Result

  age income  car employ 
1  55     72 36.2     23 
2  56    153 76.9     35 
3  28     28 13.7      4

Each column of the returned data frame contains the case data for a single variable from the active dataset. The column name is the variable name and can be used to extract the data for that variable, as in:

income <- casedata$income

Each row of the returned data frame contains the data for a single case. By default, the rows are labeled with consecutive integers. When calling GetDataFromSPSS, you can include the optional argument row.label to specify a variable from the active dataset whose case values will be the row labels of the resulting data frame.
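
The structure described above can be explored in plain R, outside IBM SPSS Statistics. The following sketch builds a stand-in data frame by hand, with values copied from the result above; it does not call GetDataFromSPSS. Using age as the source of the row labels is an arbitrary choice for illustration, and the sketch makes no claim about how row.label is implemented.

```r
# Stand-in for the data frame returned in the example above, built by hand.
casedata <- data.frame(age = c(55, 56, 28),
                       income = c(72, 153, 28),
                       car = c(36.2, 76.9, 13.7),
                       employ = c(23, 35, 4))

# Extract the data for one variable by its column name.
income <- casedata$income

# Default row labels are consecutive integers, stored as character strings.
default_labels <- rownames(casedata)

# Mimic labeled rows by using the values of one column as row names
# (illustration only; not necessarily what row.label does internally).
labeled <- casedata[, c("income", "car", "employ")]
rownames(labeled) <- casedata$age

# Rows can then be looked up by label instead of by position.
print(labeled["56", ])
```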

Example: Retrieving Cases for Selected Variables

DATA LIST FREE /age (F4) income (F8.2) car (F8.2) employ (F4).
BEGIN DATA.
55  72  36.20  23
56  153 76.90  35
28  28  13.70  4
END DATA.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(variables=c("age","income","employ"))
END PROGRAM.

The argument variables is an R vector specifying a subset of variables for which case data will be retrieved. In this example, the R function c() is used to create a character vector of variable names. The resulting R data frame (casedata) will contain the three columns labeled age, income, and employ.

You can use the TO keyword to specify a range of variables, as you can in IBM SPSS Statistics; for example, variables=c("age TO car"). If you prefer to work with variable index values (which represent position in the dataset, starting with 0 for the first variable in file order), you can specify a range of variables with an expression such as variables=c(0:2). The R code c(0:2) creates a vector consisting of the integers from 0 to 2 inclusive.
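
Because R vectors are indexed from 1 while the variable index values start at 0, it can help to see the mapping spelled out. The following plain R sketch uses a hand-typed vector of the variable names from the example; the names varnames, idx, and selected are illustrative only and are not part of the spssdata API.

```r
# Variable names in file order, copied from the example above.
varnames <- c("age", "income", "car", "employ")

# Zero-based index range, as it would be passed in the variables argument.
idx <- 0:2

# R vectors are one-based, so add 1 to look up the matching names.
selected <- varnames[idx + 1]
print(selected)

# c(0:2) and 0:2 produce the same integer vector.
print(identical(c(0:2), 0:2))
```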

Example: Retrieving Categorical Variables

The analogue of a categorical variable in IBM SPSS Statistics is a factor in R. You can specify that categorical variables are converted to factors, although by default they are not. To convert categorical variables to R factors, use the factorMode argument of the GetDataFromSPSS function.

DATA LIST FREE /id (F4) gender (A1) training (F1).
VARIABLE LABELS id 'Employee ID'
   /training 'Training Level'.
VARIABLE LEVEL id (SCALE)
   /gender (NOMINAL)
   /training (ORDINAL).
VALUE LABELS training 1 'Beginning' 2 'Intermediate' 3 'Advanced'
   /gender 'f' 'Female' 'm' 'Male'.
BEGIN DATA
18 m 1
37 f 2
10 f 3
22 m 2
END DATA.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(factorMode="labels")
casedata
END PROGRAM.
  • The value "labels" for factorMode, used in this example, specifies that categorical variables are converted to factors whose levels are the value labels of the variables. The alternative value "levels" specifies that categorical variables are converted to factors whose levels are the values of the variables. See the topic spssdata.GetDataFromSPSS Function (R) for more information.

Result

  id gender     training 
1 18   Male    Beginning 
2 37 Female Intermediate 
3 10 Female     Advanced 
4 22   Male Intermediate

Note: If you intend to write factors retrieved with factorMode="labels" to a new IBM SPSS Statistics dataset, special handling is required. See the topic Writing Results to a New IBM SPSS Statistics Dataset (R) for more information.
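
The difference between the two factorMode values can be pictured with ordinary R factors. The following sketch builds both kinds of factor by hand from the training values in the example; it is an analogy in plain R and does not show how GetDataFromSPSS constructs its factors.

```r
# Raw values of the training variable from the example above.
training <- c(1, 2, 3, 2)

# Analogue of factorMode="levels": the factor levels are the data values.
by_levels <- factor(training, levels = c(1, 2, 3))

# Analogue of factorMode="labels": the factor levels are the value labels.
by_labels <- factor(training, levels = c(1, 2, 3),
                    labels = c("Beginning", "Intermediate", "Advanced"))

print(levels(by_levels))
print(levels(by_labels))

# Both factors share the same underlying integer codes.
print(identical(as.integer(by_levels), as.integer(by_labels)))
```

With labels as levels, the original values 1, 2, and 3 are no longer visible in the factor itself, which suggests why writing such factors back to a dataset needs the special handling mentioned in the note.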

Example: Handling IBM SPSS Statistics Datetime Values

When retrieving values of IBM SPSS Statistics variables with date or datetime formats, you'll most likely want to convert the values to R date/time (POSIXt) objects. By default, such variables are not converted and are simply returned in the internal representation used by IBM SPSS Statistics (floating-point numbers representing a number of seconds, including fractional seconds, from an initial date and time). To convert variables with date or datetime formats to R date/time objects, use the rDate argument of the GetDataFromSPSS function.

DATA LIST FREE /bdate (ADATE10).
BEGIN DATA
05/02/2009
END DATA.
BEGIN PROGRAM R.
data <- spssdata.GetDataFromSPSS(rDate="POSIXct")
data
END PROGRAM.

Result

       bdate 
1 2009-05-02
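
If a value has already been retrieved without rDate, it can still be converted afterward in plain R. The following sketch assumes that the initial date and time is midnight on 14 October 1582, the origin that IBM SPSS Statistics documents for its date values. The raw value shown was worked out by hand for 2 May 2009 and is not actual program output; the name raw_bdate is illustrative only.

```r
# Hypothetical raw value: seconds from 14 October 1582 to 2 May 2009, 00:00.
raw_bdate <- 13460601600

# Convert by telling R which origin the seconds are counted from.
# The origin is an assumption stated in the text above.
bdate <- as.POSIXct(raw_bdate, origin = "1582-10-14", tz = "UTC")

# Display the date portion only.
print(format(bdate, "%Y-%m-%d"))
```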

Example: Missing Data

By default, missing values for numeric variables (user-missing and system-missing) are converted to the R NaN value, and user-missing values of string variables are converted to the R NA value.

DATA LIST LIST (',') /numVar (f) stringVar (a4).
BEGIN DATA
1,a
,b
3,,
9,d
END DATA.
MISSING VALUES numVar (9) stringVar (' ').
BEGIN PROGRAM R.
data <- spssdata.GetDataFromSPSS()
cat("Case data with missing values:\n")
print(data)
END PROGRAM.

Result

Case data with missing values:

  numVar stringVar 
1      1         a 
2    NaN         b 
3      3       <NA>
4    NaN         d 

Note: To convert missing values of numeric variables to the R NA value instead of NaN, set the missingValueToNA argument to TRUE, as in:

data <- spssdata.GetDataFromSPSS(missingValueToNA=TRUE)
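
The distinction between NaN and NA matters mainly when you test for missing values in R. The following plain R sketch uses a hand-built vector matching the numVar column above; the conversion step imitates the effect of requesting NA values and does not call GetDataFromSPSS.

```r
# Stand-in for the numVar column as returned by default (missing -> NaN).
num_var <- c(1, NaN, 3, NaN)

# is.na() is TRUE for both NaN and NA; is.nan() is TRUE only for NaN.
print(is.na(num_var))
print(is.nan(num_var))

# Imitate the NA form of the same column.
as_na <- num_var
as_na[is.nan(as_na)] <- NA
print(is.nan(as_na))

# Summary functions skip either kind of missing value when na.rm = TRUE.
print(mean(num_var, na.rm = TRUE))
print(mean(as_na, na.rm = TRUE))
```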

You can specify that user-missing values be treated as valid data by setting the optional argument keepUserMissing to TRUE, as shown in the following reworking of the previous example.

DATA LIST LIST (',') /numVar (f) stringVar (a4).
BEGIN DATA
1,a
,b
3,,
9,d
END DATA.
MISSING VALUES numVar (9) stringVar (' ').
BEGIN PROGRAM R.
data <- spssdata.GetDataFromSPSS(keepUserMissing=TRUE)
cat("Case data with user-missing values treated as valid:\n")
print(data)
END PROGRAM.

Result

Case data with user-missing values treated as valid:

  numVar stringVar 
1      1         a 
2    NaN         b 
3      3           
4      9         d 
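
When user-missing values are kept as valid data, you may still want to exclude them for a particular computation. The following plain R sketch works on a hand-built stand-in for the data frame above. The codes 9 and blank come from the MISSING VALUES command in the example; the names kept, num_clean, and str_clean are illustrative, and trimming whitespace is a precaution in case string values are padded.

```r
# Stand-in for the result above, built by hand.
kept <- data.frame(numVar = c(1, NaN, 3, 9),
                   stringVar = c("a", "b", "", "d"),
                   stringsAsFactors = FALSE)

# Recode the numeric user-missing code 9 to NA.
# %in% is used because it never returns NA, even for NaN elements.
num_clean <- kept$numVar
num_clean[num_clean %in% 9] <- NA

# Recode blank strings to NA.
str_clean <- kept$stringVar
str_clean[trimws(str_clean) == ""] <- NA

print(data.frame(numVar = num_clean, stringVar = str_clean))
```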

Example: Handling Data with Splits

When reading from IBM SPSS Statistics datasets with split groups, use the GetSplitDataFromSPSS function to retrieve each split separately, as shown in this example.

DATA LIST FREE /salary (F6) jobcat (F2).
BEGIN DATA
21450 1
45000 1
30000 2
30750 2
103750 3
72500 3
57000 3
END DATA.

SORT CASES BY jobcat.
SPLIT FILE BY jobcat.
BEGIN PROGRAM R.
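# Names of the split variables; empty if no split is in effect.
varnames <- spssdata.GetSplitVariableNames()
if(length(varnames) > 0)
{
   # Read one split group per iteration until the last one has been read.
   while (!spssdata.IsLastSplit()){
      data <- spssdata.GetSplitDataFromSPSS()
      cat("\n\nSplit variable values:")
      # Split variables are constant within a group, so the first row suffices.
      for (name in varnames) cat("\n",name,":",
                                 as.character(data[1,name]))
      cat("\nCases in Split: ",length(data[,1]))
   }
   # Close the data connection opened implicitly by GetSplitDataFromSPSS.
   spssdata.CloseDataConnection()
}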
END PROGRAM.

Result

Split variable values: 
 jobcat : 1 
Cases in Split:  2 
 
Split variable values: 
 jobcat : 2 
Cases in Split:  2 
 
Split variable values: 
 jobcat : 3 
Cases in Split:  3
  • The GetSplitVariableNames function returns the names of the split variables, if any, from the active dataset.
  • The GetSplitDataFromSPSS function retrieves the case data for the next split group from the active dataset, and returns it as an R data frame.
  • The IsLastSplit function returns TRUE if the current split group is the last one in the active dataset.
  • The CloseDataConnection function should be called when the necessary split groups have been read. In particular, GetSplitDataFromSPSS implicitly starts a data connection for reading from split files and this data connection must be closed with CloseDataConnection.
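
To picture what each call to GetSplitDataFromSPSS returns, the grouping can be imitated in plain R with the base function split. The following sketch uses a hand-built copy of the example data and does not use the spssdata functions; each element of groups plays the role of one split group's data frame.

```r
# Hand-built copy of the example data, already sorted by jobcat.
cases <- data.frame(salary = c(21450, 45000, 30000, 30750,
                               103750, 72500, 57000),
                    jobcat = c(1, 1, 2, 2, 3, 3, 3))

# One data frame per distinct value of the split variable.
groups <- split(cases, cases$jobcat)

# Report the split value and the number of cases in each group.
for (value in names(groups)) {
   cat("jobcat:", value, " cases:", nrow(groups[[value]]), "\n")
}

# The same counts as a named integer vector.
counts <- vapply(groups, nrow, integer(1))
```

Unlike this in-memory imitation, the split functions described above read one group at a time from the active dataset, which is why the data connection has to be closed explicitly.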