spss.Dataset Class (Python)
spss.Dataset(name,hidden,cvtDates). Provides the ability
to create new datasets, read from existing datasets, and modify existing
datasets. A Dataset object provides access to
the case data and variable information contained in a dataset, and
allows you to read from the dataset, add new cases, modify existing
cases, add new variables, and modify properties of existing variables.
An instance of the Dataset class can only be created
within a data step or StartProcedure-EndProcedure block,
and cannot be used outside of the data step or procedure block in
which it was created. Data steps are initiated with the spss.StartDataStep function.
You can also use the spss.DataStep class to implicitly
start and end a data step without the need to check for pending transformations.
See the topic spss.DataStep Class (Python) for
more information.
- The argument name is optional and specifies the name of an open dataset for which a Dataset object will be created. Note that this is the name as assigned by IBM® SPSS® Statistics or as specified with DATASET NAME. Specifying name="*" or omitting the argument will create a Dataset object for the active dataset. If the active dataset is unnamed, then a name will be automatically generated for it in the case that the Dataset object is created for the active dataset.
- If the Python data type None or the empty string '' is specified for name, then a new empty dataset is created. The name of the dataset is automatically generated and can be retrieved from the name property of the resulting Dataset object. The name cannot be changed from within the data step. To change the name, use the DATASET NAME command following spss.EndDataStep.
  A new dataset created with the Dataset class is not set to be the active dataset. To make the dataset the active one, use the spss.SetActive function.
- The optional argument hidden specifies whether the Data Editor window associated with the dataset is hidden. By default, it is displayed. Use hidden=True to hide the associated Data Editor window.
- The optional argument cvtDates specifies whether IBM SPSS Statistics variables with date or datetime formats are converted to Python datetime.datetime objects when reading data from IBM SPSS Statistics. The argument is a boolean: True to convert all variables with date or datetime formats, False otherwise. If cvtDates is omitted, then no conversions are performed.
  Note: Values of variables with date or datetime formats that are not converted with cvtDates are returned as integers representing the number of seconds from October 14, 1582.
- Instances of the Dataset class created within StartProcedure-EndProcedure blocks cannot be set as the active dataset.
- The Dataset class does not honor case filters specified with the FILTER or USE commands. If you need case filters to be honored, then consider using the Cursor class.
- For release 22 Fix Pack 1 and higher, the Dataset class supports caching. Caching typically improves performance when cases are modified in a random manner, and is specified with the cache property of a Dataset object.
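When cvtDates is not used, a date value arrives as a count of seconds from October 14, 1582. The following is a minimal sketch of converting such a value by hand in plain Python. The function name spss_seconds_to_datetime is hypothetical and not part of the spss module, and the sketch assumes the count starts at midnight on that date.

```python
import datetime

# Assumption: the reference point is midnight at the start of October 14, 1582.
SPSS_EPOCH = datetime.datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds):
    """Convert a raw date value (seconds from the reference date) to a datetime.

    Hypothetical helper for illustration. None is passed through unchanged
    so that missing values do not raise an error.
    """
    if seconds is None:
        return None
    return SPSS_EPOCH + datetime.timedelta(seconds=seconds)

# 152385 days after the reference date is January 1, 2000.
print(spss_seconds_to_datetime(152385 * 86400))
```

Specifying cvtDates=True when creating the Dataset object avoids the need for a manual conversion of this kind.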
The number of variables in the dataset associated with a Dataset instance
is available using the len function, as in:
len(datasetObj)
Note: Datasets that are not required outside of the data
step or procedure in which they were accessed or created should be
closed prior to ending the data step or procedure in order to free
the resources allocated to the dataset. This is accomplished by calling
the close method of the Dataset object.
Example: Creating a New Dataset
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset(name=None)
datasetObj.varlist.append('numvar',0)
datasetObj.varlist.append('strvar',1)
datasetObj.varlist['numvar'].label = 'Sample numeric variable'
datasetObj.varlist['strvar'].label = 'Sample string variable'
datasetObj.cases.append([1,'a'])
datasetObj.cases.append([2,'b'])
spss.EndDataStep()
END PROGRAM.
- You add variables to a dataset using the append (or insert) method of the VariableList object associated with the dataset. The VariableList object is accessed from the varlist property of the Dataset object, as in datasetObj.varlist. See the topic VariableList Class (Python) for more information.
- Variable properties, such as the variable label and measurement level, are set through properties of the associated Variable object, accessible from the VariableList object. For example, datasetObj.varlist['numvar'] accesses the Variable object associated with the variable numvar. See the topic Variable Class (Python) for more information.
- You add cases to a dataset using the append (or insert) method of the CaseList object associated with the dataset. The CaseList object is accessed from the cases property of the Dataset object, as in datasetObj.cases. See the topic CaseList Class (Python) for more information.
Example: Saving New Datasets
When creating new datasets that you intend to save, you'll want to keep track of the dataset names since the save operation is done outside of the associated data step.
DATA LIST FREE /dept (F2) empid (F4) salary (F6).
BEGIN DATA
7 57 57000
5 23 40200
3 62 21450
3 18 21900
5 21 45000
5 29 32100
7 38 36000
3 42 21900
7 11 27900
END DATA.
DATASET NAME saldata.
SORT CASES BY dept.
BEGIN PROGRAM.
import spss
with spss.DataStep():
    ds = spss.Dataset()
    # Create a new dataset for each value of the variable 'dept'
    newds = spss.Dataset(name=None)
    newds.varlist.append('dept')
    newds.varlist.append('empid')
    newds.varlist.append('salary')
    dept = ds.cases[0,0][0]
    dsNames = {newds.name:dept}
    for row in ds.cases:
        if (row[0] != dept):
            newds = spss.Dataset(name=None)
            newds.varlist.append('dept')
            newds.varlist.append('empid')
            newds.varlist.append('salary')
            dept = row[0]
            dsNames[newds.name] = dept
        newds.cases.append(row)
# Save the new datasets (this runs after the data step has ended)
for name,dept in dsNames.items():
    strdept = str(dept)
    spss.Submit(r"""
DATASET ACTIVATE %(name)s.
SAVE OUTFILE='/mydata/saldata_%(strdept)s.sav'.
""" %locals())
spss.Submit(r"""
DATASET ACTIVATE saldata.
DATASET CLOSE ALL.
""" %locals())
END PROGRAM.
- The code newds = spss.Dataset(name=None) creates a new dataset. The name of the dataset is available from the name property, as in newds.name. In this example, the names of the new datasets are stored to the Python dictionary dsNames.
- To save new datasets created with the Dataset class, use the SAVE command after the data step has ended (after calling spss.EndDataStep or, as in this example, after leaving the spss.DataStep block). In this example, DATASET ACTIVATE is used to activate each new dataset, using the dataset names stored in dsNames.
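The command syntax submitted for each new dataset can also be built by a small function, which makes it easy to inspect before it is run. The following is a minimal sketch in plain Python. The function name build_save_syntax, its out_dir parameter, and the dataset name xdset1 are hypothetical, and the float handling assumes that numeric case values may be returned as floats such as 7.0.

```python
def build_save_syntax(dataset_name, dept, out_dir="/mydata"):
    """Return command syntax that activates a dataset and saves it to a file.

    Hypothetical helper for illustration; it only builds a string.
    """
    # Assumption: a numeric value may arrive as a float (for example 7.0).
    # Whole numbers are rendered without the trailing ".0" in the file name.
    if isinstance(dept, float) and dept.is_integer():
        dept = int(dept)
    return (
        "DATASET ACTIVATE %s.\n"
        "SAVE OUTFILE='%s/saldata_%s.sav'.\n" % (dataset_name, out_dir, dept)
    )

# Show the syntax that would be submitted for one dataset.
print(build_save_syntax("xdset1", 7.0))
```

Inside a program block, the returned string would be passed to spss.Submit after the data step has ended, once for each entry in dsNames.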
Example: Modifying Case Values
DATA LIST FREE /cust (F2) amt (F5).
BEGIN DATA
210 4500
242 6900
370 32500
END DATA.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
for i in range(len(datasetObj.cases)):
    # Multiply the value of amt by 1.05 for each case
    datasetObj.cases[i,1] = 1.05*datasetObj.cases[i,1][0]
spss.EndDataStep()
END PROGRAM.
- The CaseList object, accessed from the cases property of a Dataset object, allows you to read or modify case data. To access the value for a given variable within a particular case you specify the case number and the index of the variable (index values represent position in the active dataset, starting with 0 for the first variable in file order, and case numbers start from 0). For example, datasetObj.cases[i,1] specifies the value of the variable with index 1 for case number i.
- When reading case values, results are returned as a list. In the present example we're accessing a single value within each case so the list has one element, which is why the value is read as datasetObj.cases[i,1][0].
See the topic CaseList Class (Python) for more information.
Example: Comparing Datasets
Dataset objects allow you to concurrently work
with the case data from multiple datasets. As a simple example, we'll
compare the cases in two datasets and indicate identical cases with
a new variable added to one of the datasets.
DATA LIST FREE /id (F2) salary (DOLLAR8) jobcat (F1).
BEGIN DATA
1 57000 3
3 40200 1
2 21450 1
END DATA.
SORT CASES BY id.
DATASET NAME empdata1.
DATA LIST FREE /id (F2) salary (DOLLAR8) jobcat (F1).
BEGIN DATA
3 41000 1
1 59280 3
2 21450 1
END DATA.
SORT CASES BY id.
DATASET NAME empdata2.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj1 = spss.Dataset(name="empdata1")
datasetObj2 = spss.Dataset(name="empdata2")
nvars = len(datasetObj1)
datasetObj2.varlist.append('match')
for i in range(len(datasetObj1.cases)):
    if datasetObj1.cases[i] == datasetObj2.cases[i,0:nvars]:
        datasetObj2.cases[i,nvars] = 1
    else:
        datasetObj2.cases[i,nvars] = 0
spss.EndDataStep()
END PROGRAM.
- The two datasets are first sorted by the variable id which is common to both datasets.
- Since DATA LIST creates unnamed datasets (the same is true for GET), the datasets are named using DATASET NAME so that you can refer to them when calling spss.Dataset.
- datasetObj1 and datasetObj2 are Dataset objects associated with the two datasets empdata1 and empdata2 to be compared.
- The code datasetObj1.cases[i] returns case number i from empdata1. The code datasetObj2.cases[i,0:nvars] returns the slice of case number i from empdata2 that includes the variables with indexes 0,1,...,nvars-1.
- The new variable match, added to empdata2, is set to 1 for cases that are identical and 0 otherwise.
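The comparison logic can be tried outside of IBM SPSS Statistics with ordinary Python lists standing in for the two case lists. The following is a minimal sketch under that assumption. The function name flag_matches is hypothetical, and the rows are the example data after sorting by id.

```python
def flag_matches(rows1, rows2, nvars):
    """Return 1 for each position where the first nvars values agree, else 0.

    Hypothetical stand-in for the loop over datasetObj1.cases and
    datasetObj2.cases; plain lists are used instead of Dataset objects.
    """
    flags = []
    for row1, row2 in zip(rows1, rows2):
        # Compare the whole first row with the leading slice of the second row.
        flags.append(1 if list(row1) == list(row2[:nvars]) else 0)
    return flags

# Example data (id, salary, jobcat), already sorted by id.
empdata1 = [[1, 57000, 3], [2, 21450, 1], [3, 40200, 1]]
empdata2 = [[1, 59280, 3], [2, 21450, 1], [3, 41000, 1]]

flags = flag_matches(empdata1, empdata2, nvars=3)
print(flags)
```

Only the case with id 2 is identical in both datasets, so it is the only one flagged with 1.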