spss.Dataset Class (Python)
spss.Dataset(name,hidden,cvtDates). Provides the ability to create new datasets, read from existing datasets, and modify existing datasets. A Dataset object provides access to the case data and variable information contained in a dataset, and allows you to read from the dataset, add new cases, modify existing cases, add new variables, and modify properties of existing variables.
An instance of the Dataset class can only be created within a data step or StartProcedure-EndProcedure block, and cannot be used outside of the data step or procedure block in which it was created. Data steps are initiated with the spss.StartDataStep function.
You can also use the spss.DataStep class to implicitly start and end a data step without the need to check for pending transformations. See the topic spss.DataStep Class (Python) for more information.
- The argument name is optional and specifies the name of an open dataset for which a Dataset object will be created. Note that this is the name as assigned by IBM® SPSS® Statistics or as specified with DATASET NAME. Specifying name="*" or omitting the argument will create a Dataset object for the active dataset. If the active dataset is unnamed, then a name is automatically generated for it when the Dataset object is created for the active dataset.
- If the Python data type None or the empty string '' is specified for name, then a new empty dataset is created. The name of the dataset is automatically generated and can be retrieved from the name property of the resulting Dataset object. The name cannot be changed from within the data step. To change the name, use the DATASET NAME command following spss.EndDataStep. A new dataset created with the Dataset class is not set to be the active dataset. To make the dataset the active one, use the spss.SetActive function.
- The optional argument hidden specifies whether the Data Editor window associated with the dataset is hidden; by default, it is displayed. Use hidden=True to hide the associated Data Editor window.
- The optional argument cvtDates specifies whether IBM SPSS Statistics variables with date or datetime formats are converted to Python datetime.datetime objects when reading data from IBM SPSS Statistics. The argument is a boolean: True to convert all variables with date or datetime formats, False otherwise. If cvtDates is omitted, then no conversions are performed.
  Note: Values of variables with date or datetime formats that are not converted with cvtDates are returned as integers representing the number of seconds from October 14, 1582.
- Instances of the Dataset class created within StartProcedure-EndProcedure blocks cannot be set as the active dataset.
- The Dataset class does not honor case filters specified with the FILTER or USE commands. If you need case filters to be honored, then consider using the Cursor class.
- For release 22 Fix Pack 1 and higher, the Dataset class supports caching. Caching typically improves performance when cases are modified in a random manner, and is specified with the cache property of a Dataset object.
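The note on cvtDates says that unconverted date values are returned as a count of seconds from October 14, 1582. As a minimal sketch, assuming the count starts at midnight on that date, such a value could be converted by hand with the standard datetime module. The name spss_seconds_to_datetime is hypothetical and is not part of the spss module.

```python
import datetime

# Assumed origin: midnight on October 14, 1582 (the date named in the note).
SPSS_EPOCH = datetime.datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds):
    """Convert an unconverted date value to datetime.datetime.

    Hypothetical helper. None is passed through unchanged so that
    values with nothing to convert do not raise an error.
    """
    if seconds is None:
        return None
    return SPSS_EPOCH + datetime.timedelta(seconds=seconds)

# One day (86400 seconds) after the origin.
print(spss_seconds_to_datetime(86400))
```

When every date variable should be converted, passing cvtDates=True to spss.Dataset avoids the need for a helper like this.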
The number of variables in the dataset associated with a Dataset instance is available using the len function, as in:
len(datasetObj)
Note: Datasets that are not required outside of the data step or procedure in which they were accessed or created should be closed prior to ending the data step or procedure in order to free the resources allocated to the dataset. This is accomplished by calling the close method of the Dataset object.
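One way to make sure close is always called is to wrap the object with contextlib.closing from the Python standard library. The sketch below is illustrative only: count_cases is a hypothetical helper, and it assumes only that the object passed in has a cases property that supports len and a close method, as described in this topic.

```python
from contextlib import closing

def count_cases(dataset):
    """Return the number of cases in a Dataset-like object, then close it.

    Hypothetical helper: `dataset` only needs a `cases` property that
    supports len() and a close() method.
    """
    # closing() calls dataset.close() on exit, even if an error is raised.
    with closing(dataset) as ds:
        return len(ds.cases)
```

Within a data step, this might be called as count_cases(spss.Dataset(name="empdata1")), where empdata1 stands for the name of any open dataset.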
Example: Creating a New Dataset
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset(name=None)
datasetObj.varlist.append('numvar',0)
datasetObj.varlist.append('strvar',1)
datasetObj.varlist['numvar'].label = 'Sample numeric variable'
datasetObj.varlist['strvar'].label = 'Sample string variable'
datasetObj.cases.append([1,'a'])
datasetObj.cases.append([2,'b'])
spss.EndDataStep()
END PROGRAM.
- You add variables to a dataset using the append (or insert) method of the VariableList object associated with the dataset. The VariableList object is accessed from the varlist property of the Dataset object, as in datasetObj.varlist. See the topic VariableList Class (Python) for more information.
- Variable properties, such as the variable label and measurement level, are set through properties of the associated Variable object, accessible from the VariableList object. For example, datasetObj.varlist['numvar'] accesses the Variable object associated with the variable numvar. See the topic Variable Class (Python) for more information.
- You add cases to a dataset using the append (or insert) method of the CaseList object associated with the dataset. The CaseList object is accessed from the cases property of the Dataset object, as in datasetObj.cases. See the topic CaseList Class (Python) for more information.
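The two append calls described above can be combined into a small loader for a new, empty dataset. This is a sketch rather than part of the spss module: add_variables_and_cases is a hypothetical name, and it assumes that the second argument to varlist.append is 0 for a numeric variable and a positive integer for a string variable of that width, which is how the example above uses it.

```python
def add_variables_and_cases(dataset, variables, rows):
    """Append variables, then cases, to a new Dataset-like object.

    variables: list of (name, type) pairs; type is assumed to be 0 for
               numeric and a positive integer for a string of that width.
    rows:      list of sequences, one value per variable, in variable order.
    """
    for name, vartype in variables:
        dataset.varlist.append(name, vartype)
    for row in rows:
        # Reject rows that cannot line up with the variables just added.
        if len(row) != len(variables):
            raise ValueError("row %r does not match the variable count" % (row,))
        dataset.cases.append(list(row))
    return dataset
```

Inside a data step, the example above would then reduce to a call such as add_variables_and_cases(spss.Dataset(name=None), [('numvar', 0), ('strvar', 1)], [[1, 'a'], [2, 'b']]).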
Example: Saving New Datasets
When creating new datasets that you intend to save, you'll want to keep track of the dataset names since the save operation is done outside of the associated data step.
DATA LIST FREE /dept (F2) empid (F4) salary (F6).
BEGIN DATA
7 57 57000
5 23 40200
3 62 21450
3 18 21900
5 21 45000
5 29 32100
7 38 36000
3 42 21900
7 11 27900
END DATA.
DATASET NAME saldata.
SORT CASES BY dept.
BEGIN PROGRAM.
import spss
with spss.DataStep():
    ds = spss.Dataset()
    # Create a new dataset for each value of the variable 'dept'
    newds = spss.Dataset(name=None)
    newds.varlist.append('dept')
    newds.varlist.append('empid')
    newds.varlist.append('salary')
    dept = ds.cases[0,0][0]
    dsNames = {newds.name:dept}
    for row in ds.cases:
        if (row[0] != dept):
            # A new value of dept starts a new dataset
            newds = spss.Dataset(name=None)
            newds.varlist.append('dept')
            newds.varlist.append('empid')
            newds.varlist.append('salary')
            dept = row[0]
            dsNames[newds.name] = dept
        newds.cases.append(row)
# Save the new datasets (outside of the data step)
# items() is used instead of iteritems() so the loop runs under Python 2 and 3
for name,dept in dsNames.items():
    strdept = str(dept)
    spss.Submit(r"""
DATASET ACTIVATE %(name)s.
SAVE OUTFILE='/mydata/saldata_%(strdept)s.sav'.
""" %locals())
spss.Submit(r"""
DATASET ACTIVATE saldata.
DATASET CLOSE ALL.
""" %locals())
END PROGRAM.
- The code newds = spss.Dataset(name=None) creates a new dataset. The name of the dataset is available from the name property, as in newds.name. In this example, the names of the new datasets are stored in the Python dictionary dsNames.
- To save new datasets created with the Dataset class, use the SAVE command after the data step has ended, that is, after calling spss.EndDataStep or after leaving the spss.DataStep block. In this example, DATASET ACTIVATE is used to activate each new dataset, using the dataset names stored in dsNames.
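Because the save step is plain command syntax, it can help to build the strings separately from submitting them, so they can be inspected first. The following is a minimal sketch: build_save_commands is a hypothetical helper, and the dataset names used in the call are made up for illustration (real names are generated automatically).

```python
def build_save_commands(ds_names, path_template):
    """Return one block of command syntax per new dataset.

    ds_names:      dict mapping dataset name -> value used in the file name
    path_template: file path containing one %s placeholder for that value
    """
    commands = []
    # Sort by dataset name so the output order is predictable.
    for name in sorted(ds_names):
        outfile = path_template % (ds_names[name],)
        commands.append(
            "DATASET ACTIVATE %s.\nSAVE OUTFILE='%s'." % (name, outfile)
        )
    return commands

# Hypothetical dataset names, paired with the dept value each one holds.
for cmd in build_save_commands({'xDataset1': 3, 'xDataset2': 5},
                               '/mydata/saldata_%s.sav'):
    print(cmd)
```

After the data step has ended, each string could be passed to spss.Submit in the same way as in the example above.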
Example: Modifying Case Values
DATA LIST FREE /cust (F2) amt (F5).
BEGIN DATA
210 4500
242 6900
370 32500
END DATA.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
for i in range(len(datasetObj.cases)):
    # Multiply the value of amt by 1.05 for each case
    datasetObj.cases[i,1] = 1.05*datasetObj.cases[i,1][0]
spss.EndDataStep()
END PROGRAM.
- The CaseList object, accessed from the cases property of a Dataset object, allows you to read or modify case data. To access the value for a given variable within a particular case you specify the case number and the index of the variable (index values represent position in the active dataset, starting with 0 for the first variable in file order, and case numbers start from 0). For example, datasetObj.cases[i,1] specifies the value of the variable with index 1 for case number i.
- When reading case values, results are returned as a list. In the present example we're accessing a single value within each case so the list has one element.
See the topic CaseList Class (Python) for more information.
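The read-modify-write pattern above can be wrapped in a function that takes the variable index and the multiplier as arguments. This is a sketch under one stated assumption: that a missing value is read back as the Python value None, in which case the multiplication in the example would raise a TypeError. The name scale_variable is hypothetical.

```python
def scale_variable(dataset, var_index, factor):
    """Multiply one numeric variable by `factor` in every case.

    Returns the number of cases changed. Cases whose value is None
    (assumed here to represent a missing value) are left untouched.
    """
    changed = 0
    for i in range(len(dataset.cases)):
        # Reading returns a list; a single variable gives a one-element list.
        value = dataset.cases[i, var_index][0]
        if value is None:
            continue
        dataset.cases[i, var_index] = factor * value
        changed += 1
    return changed
```

With the data from the example, scale_variable(datasetObj, 1, 1.05) would perform the same update to amt as the loop shown above.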
Example: Comparing Datasets
Dataset objects allow you to concurrently work with the case data from multiple datasets. As a simple example, we'll compare the cases in two datasets and indicate identical cases with a new variable added to one of the datasets.
DATA LIST FREE /id (F2) salary (DOLLAR8) jobcat (F1).
BEGIN DATA
1 57000 3
3 40200 1
2 21450 1
END DATA.
SORT CASES BY id.
DATASET NAME empdata1.
DATA LIST FREE /id (F2) salary (DOLLAR8) jobcat (F1).
BEGIN DATA
3 41000 1
1 59280 3
2 21450 1
END DATA.
SORT CASES BY id.
DATASET NAME empdata2.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj1 = spss.Dataset(name="empdata1")
datasetObj2 = spss.Dataset(name="empdata2")
nvars = len(datasetObj1)
datasetObj2.varlist.append('match')
for i in range(len(datasetObj1.cases)):
    if datasetObj1.cases[i] == datasetObj2.cases[i,0:nvars]:
        datasetObj2.cases[i,nvars] = 1
    else:
        datasetObj2.cases[i,nvars] = 0
spss.EndDataStep()
END PROGRAM.
- The two datasets are first sorted by the variable id which is common to both datasets.
- Since DATA LIST creates unnamed datasets (the same is true for GET), the datasets are named using DATASET NAME so that you can refer to them when calling spss.Dataset.
- datasetObj1 and datasetObj2 are Dataset objects associated with the two datasets empdata1 and empdata2 to be compared.
- The code datasetObj1.cases[i] returns case number i from empdata1. The code datasetObj2.cases[i,0:nvars] returns the slice of case number i from empdata2 that includes the variables with indexes 0,1,...,nvars-1.
- The new variable match, added to empdata2, is set to 1 for cases that are identical and 0 otherwise.
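The comparison itself does not depend on the spss module, so its logic can be checked on plain Python lists. The sketch below uses a hypothetical function, match_flags, and reproduces the case data from the example as it stands after sorting by id. It only illustrates the row-by-row comparison and is not a replacement for the data step code.

```python
def match_flags(rows1, rows2, nvars):
    """Return 1 where the first nvars values of paired rows agree, else 0."""
    return [1 if list(r1[:nvars]) == list(r2[:nvars]) else 0
            for r1, r2 in zip(rows1, rows2)]

# Case data from the example, after SORT CASES BY id.
empdata1 = [[1, 57000, 3], [2, 21450, 1], [3, 40200, 1]]
empdata2 = [[1, 59280, 3], [2, 21450, 1], [3, 41000, 1]]

print(match_flags(empdata1, empdata2, 3))  # [0, 1, 0]
```

Only the case with id 2 is identical in both datasets, so it is the only one flagged with 1. Note that zip stops at the shorter input, so this sketch assumes both datasets have the same number of cases, as the example does.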