Reading Case Data from IBM SPSS Statistics (R)
The spssdata.GetDataFromSPSS function reads case data from the IBM® SPSS® Statistics active dataset and, by default, stores it in an R data frame. You can choose to retrieve the cases for all variables or for a selected subset of the variables in the active dataset. Variables are specified by name or by an index value representing position in the active dataset, starting with 0 for the first variable in file order.
Example: Retrieving Cases for All Variables
DATA LIST FREE /age (F4) income (F8.2) car (F8.2) employ (F4).
BEGIN DATA.
55 72 36.20 23
56 153 76.90 35
28 28 13.70 4
END DATA.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS()
print(casedata)
END PROGRAM.
Result
  age income  car employ
1  55     72 36.2     23
2  56    153 76.9     35
3  28     28 13.7      4
Each column of the returned data frame contains the case data for a single variable from the active dataset. The column name is the variable name and can be used to extract the data for that variable, as in:
income <- casedata$income
Each row of the returned data frame contains the data for a single case. By default, the rows are labeled with consecutive integers. When calling GetDataFromSPSS, you can include the optional argument row.label to specify a variable from the active dataset whose case values will be the row labels of the resulting data frame.
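The access patterns described above can be tried in plain R without a connection to IBM SPSS Statistics. The following is a minimal sketch that builds a stand-in data frame by hand with the same values as the example result. It does not call spssdata.GetDataFromSPSS, and the choice of age as the source of row labels is arbitrary; it only illustrates looking up a case by its label.

```r
# Hand-built stand-in for the data frame returned in the example above
# (an assumption for illustration; no connection to SPSS is made).
casedata <- data.frame(
  age    = c(55, 56, 28),
  income = c(72, 153, 28),
  car    = c(36.2, 76.9, 13.7),
  employ = c(23, 35, 4)
)

# Extract the data for one variable by column name.
income <- casedata$income

# Extract the data for one case by row position.
first_case <- casedata[1, ]

# Illustrate row labels: use the values of age as row names,
# then look up a case by its label instead of its position.
labeled <- casedata
rownames(labeled) <- casedata$age
income_for_56 <- labeled["56", "income"]

print(labeled)
```

Row names in R are stored as character strings, which is why the lookup uses "56" rather than the number 56.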
Example: Retrieving Cases for Selected Variables
DATA LIST FREE /age (F4) income (F8.2) car (F8.2) employ (F4).
BEGIN DATA.
55 72 36.20 23
56 153 76.90 35
28 28 13.70 4
END DATA.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(variables=c("age","income","employ"))
END PROGRAM.
The argument variables is an R vector specifying a subset of variables for which case data will be retrieved. In this example, the R function c() is used to create a character vector of variable names. The resulting R data frame (casedata) will contain three columns labeled age, income, and employ.
You can use the TO keyword to specify a range of variables, as you can in IBM SPSS Statistics; for example, variables=c("age TO car"). If you prefer to work with variable index values (index values represent position in the dataset, starting with 0 for the first variable in file order), you can specify a range of variables with an expression such as variables=c(0:2). The R code c(0:2) creates a vector consisting of the integers between 0 and 2 inclusive.
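Because the index values used here start at 0 while R's own vector indexing starts at 1, it can help to see the two side by side. The following plain-R sketch is illustrative only: spss_names is a hand-written vector of the example's variable names, not something returned by the plug-in.

```r
# Variable names in file order, copied by hand from the example dataset.
spss_names <- c("age", "income", "car", "employ")

# 0-based index values, in the form accepted by the variables argument.
idx <- c(0:2)

# R vectors are indexed from 1, so add 1 to find the matching names in R.
selected <- spss_names[idx + 1]
print(selected)

# The wrapping c() is optional: 0:2 is already an integer vector.
same_vector <- identical(c(0:2), 0:2)
```

Here the index values 0 through 2 correspond to the variables age, income, and car.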
Example: Retrieving Categorical Variables
The analogue of a categorical variable in IBM SPSS Statistics is a factor in R. You can specify that categorical variables are converted to factors, although by default they are not. To convert categorical variables to R factors, use the factorMode argument of the GetDataFromSPSS function.
DATA LIST FREE /id (F4) gender (A1) training (F1).
VARIABLE LABELS id 'Employee ID'
/training 'Training Level'.
VARIABLE LEVEL id (SCALE)
/gender (NOMINAL)
/training (ORDINAL).
VALUE LABELS training 1 'Beginning' 2 'Intermediate' 3 'Advanced'
/gender 'f' 'Female' 'm' 'Male'.
BEGIN DATA
18 m 1
37 f 2
10 f 3
22 m 2
END DATA.
BEGIN PROGRAM R.
casedata <- spssdata.GetDataFromSPSS(factorMode="labels")
casedata
END PROGRAM.
- The value "labels" for factorMode, used in this example, specifies that categorical variables are converted to factors whose levels are the value labels of the variables. The alternate value "levels" specifies that categorical variables are converted to factors whose levels are the values of the variables. See the topic spssdata.GetDataFromSPSS Function (R) for more information.
Result
  id gender     training
1 18   Male    Beginning
2 37 Female Intermediate
3 10 Female     Advanced
4 22   Male Intermediate
Note: If you intend to write factors retrieved with factorMode="labels" to a new IBM SPSS Statistics dataset, special handling is required. See the topic Writing Results to a New IBM SPSS Statistics Dataset (R) for more information.
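The difference between the two factorMode settings can be illustrated with base R's factor() function. This is a sketch under stated assumptions: the vectors are typed in by hand from the example data rather than read from SPSS, and treating the ordinal variable as an ordered factor is a choice made here for illustration, not something the text above confirms.

```r
# Raw values typed in by hand from the example data above.
training_values <- c(1, 2, 3, 2)
gender_values   <- c("m", "f", "f", "m")

# Levels are the value labels (analogous to factorMode="labels").
training_labels <- factor(training_values,
                          levels  = c(1, 2, 3),
                          labels  = c("Beginning", "Intermediate", "Advanced"),
                          ordered = TRUE)  # assumption: ordinal -> ordered

# Levels are the values themselves (analogous to factorMode="levels").
training_levels <- factor(training_values, levels = c(1, 2, 3), ordered = TRUE)

# A nominal string variable converted with its value labels.
gender_labels <- factor(gender_values,
                        levels = c("f", "m"),
                        labels = c("Female", "Male"))

print(levels(training_labels))
print(levels(training_levels))
print(table(gender_labels))
```

Factor levels are always stored as character strings, so the numeric values 1, 2, and 3 become the levels "1", "2", and "3" in the second factor.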
Example: Handling IBM SPSS Statistics Datetime Values
When retrieving values of IBM SPSS Statistics variables with date or datetime formats, you'll most likely want to convert the values to R date/time (POSIXt) objects. By default, such variables are not converted and are simply returned in the internal representation used by IBM SPSS Statistics (floating point numbers representing some number of seconds and fractional seconds from an initial date and time). To convert variables with date or datetime formats to R date/time objects, use the rDate argument of the GetDataFromSPSS function.
DATA LIST FREE /bdate (ADATE10).
BEGIN DATA
05/02/2009
END DATA.
BEGIN PROGRAM R.
data<-spssdata.GetDataFromSPSS(rDate="POSIXct")
data
END PROGRAM.
Result
       bdate
1 2009-05-02
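If date values have already been retrieved without conversion, the raw seconds can be converted in plain R. The sketch below assumes that the initial date and time used by IBM SPSS Statistics is midnight on October 14, 1582; the text above does not state the origin, so treat that value as an assumption to verify. The raw value is constructed by hand here rather than read from SPSS.

```r
# Assumption of this sketch: SPSS date values count seconds from
# midnight on October 14, 1582.
spss_origin <- as.Date("1582-10-14")

# Build the raw internal value for 05/02/2009 by hand.
days_since_origin <- as.numeric(as.Date("2009-05-02") - spss_origin)
spss_seconds <- days_since_origin * 86400   # 86400 seconds per day

# Convert the raw seconds to an R date/time object.
bdate <- as.POSIXct(spss_seconds, origin = "1582-10-14", tz = "UTC")

print(format(bdate, "%Y-%m-%d"))
```

Fixing the time zone to UTC keeps the result independent of the local time zone of the machine running R.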
Example: Missing Data
By default, missing values for numeric variables (user-missing and system-missing) are converted to the R NaN value and user-missing values of string variables are converted to the R NA value.
DATA LIST LIST (',') /numVar (f) stringVar (a4).
BEGIN DATA
1,a
,b
3,,
9,d
END DATA.
MISSING VALUES numVar (9) stringVar (' ').
BEGIN PROGRAM R.
data <- spssdata.GetDataFromSPSS()
cat("Case data with missing values:\n")
print(data)
END PROGRAM.
Result
Case data with missing values:
  numVar stringVar
1      1         a
2    NaN         b
3      3      <NA>
4    NaN         d
Note: You can specify that missing values of numeric variables be converted to the R NA value, with the missingValueToNA argument, as in:
data<-spssdata.GetDataFromSPSS(missingValueToNA=TRUE)
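The practical difference between NaN and NA in R is small but worth knowing when writing tests on retrieved data. The following plain-R sketch uses hand-typed vectors that mirror the numVar column under each setting; it does not call the plug-in.

```r
# numVar as returned by default, and as returned with missingValueToNA=TRUE
# (both typed in by hand for illustration).
num_default <- c(1, NaN, 3, NaN)
num_as_na   <- c(1, NA, 3, NA)

# is.na() is TRUE for both NaN and NA, so it finds missing values either way.
missing_default <- is.na(num_default)
missing_as_na   <- is.na(num_as_na)

# is.nan() is TRUE only for NaN, so it finds nothing once NA is used.
nan_in_na_version <- any(is.nan(num_as_na))

# Summary functions skip both kinds of missing value when na.rm = TRUE.
mean_default <- mean(num_default, na.rm = TRUE)
mean_as_na   <- mean(num_as_na, na.rm = TRUE)

print(missing_default)
print(mean_default)
```

Code that checks for missing values with is.na() behaves the same under either setting, whereas code that relies on is.nan() does not.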
You can specify that user-missing values be treated as valid data by setting the optional argument keepUserMissing to TRUE, as shown in the following reworking of the previous example.
DATA LIST LIST (',') /numVar (f) stringVar (a4).
BEGIN DATA
1,a
,b
3,,
9,d
END DATA.
MISSING VALUES numVar (9) stringVar (' ').
BEGIN PROGRAM R.
data <- spssdata.GetDataFromSPSS(keepUserMissing=TRUE)
cat("Case data with user-missing values treated as valid:\n")
print(data)
END PROGRAM.
Result
Case data with user-missing values treated as valid:
  numVar stringVar
1      1         a
2    NaN         b
3      3
4      9         d
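With user-missing values kept as valid data, the two kinds of missing value can be told apart in R. The sketch below works on a hand-built copy of the result above; the code 9 is the user-missing value declared in the example, and the blank string is written as a single space, which is an assumption about how the value is stored.

```r
# Hand-built copy of the result above (no connection to SPSS is made).
data <- data.frame(numVar    = c(1, NaN, 3, 9),
                   stringVar = c("a", "b", " ", "d"),
                   stringsAsFactors = FALSE)

# System-missing values arrive as NaN.
n_system_missing <- sum(is.nan(data$numVar))

# User-missing values arrive as the declared code (9 in this example).
# Comparing NaN with 9 gives NA, so na.rm = TRUE is needed when counting.
n_user_missing <- sum(data$numVar == 9, na.rm = TRUE)

# Keep only values that are neither system-missing nor user-missing.
valid <- data$numVar[!is.nan(data$numVar) & data$numVar != 9]

print(valid)
```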
Example: Handling Data with Splits
When reading from IBM SPSS Statistics datasets with split groups, use the GetSplitDataFromSPSS function to retrieve each split separately, as shown in this example.
DATA LIST FREE /salary (F6) jobcat (F2).
BEGIN DATA
21450 1
45000 1
30000 2
30750 2
103750 3
72500 3
57000 3
END DATA.
SORT CASES BY jobcat.
SPLIT FILE BY jobcat.
BEGIN PROGRAM R.
varnames <- spssdata.GetSplitVariableNames()
if(length(varnames) > 0)
{
   while (!spssdata.IsLastSplit()){
      data <- spssdata.GetSplitDataFromSPSS()
      cat("\n\nSplit variable values:")
      for (name in varnames) cat("\n",name,":",
                                 as.character(data[1,name]))
      cat("\nCases in Split: ",length(data[,1]))
   }
   spssdata.CloseDataConnection()
}
END PROGRAM.
Result
Split variable values:
jobcat : 1
Cases in Split: 2
Split variable values:
jobcat : 2
Cases in Split: 2
Split variable values:
jobcat : 3
Cases in Split: 3
- The GetSplitVariableNames function returns the names of the split variables, if any, from the active dataset.
- The GetSplitDataFromSPSS function retrieves the case data for the next split group from the active dataset, and returns it as an R data frame.
- The IsLastSplit function returns TRUE if the current split group is the last one in the active dataset.
- The CloseDataConnection function should be called when the necessary split groups have been read. In particular, GetSplitDataFromSPSS implicitly starts a data connection for reading from split files and this data connection must be closed with CloseDataConnection.
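For comparison, the same per-group summary can be sketched in plain R once all cases are in a single data frame. This uses base R's split() function, which is not part of the spssdata API and is shown only to illustrate what each split group contains. The data frame is typed in by hand from the example above.

```r
# Hand-built copy of the example data (no connection to SPSS is made).
cases <- data.frame(
  salary = c(21450, 45000, 30000, 30750, 103750, 72500, 57000),
  jobcat = c(1, 1, 2, 2, 3, 3, 3)
)

# Divide the data frame into one data frame per value of jobcat.
groups <- split(cases, cases$jobcat)

# Report the split value and the number of cases in each group.
for (key in names(groups)) {
  group <- groups[[key]]
  cat("\nSplit variable values:")
  cat("\n jobcat :", key)
  cat("\nCases in Split: ", nrow(group))
}
cat("\n")

# Number of cases per group, as a named vector.
split_sizes <- sapply(groups, nrow)
```

Reading split groups one at a time with GetSplitDataFromSPSS avoids holding every case in memory at once, which this in-memory approach does not.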