IBM SPSS Statistics contains many tools for data management. This post discusses several different ways to solve an example data management problem using technology from different areas of the product. The purpose is to help you decide where to look when building a solution, taking into account the available functionality and other characteristics.
The problem to be solved here is to convert a text file of comma-separated id values such as postal codes into an SPSS dataset with one item per line. This might then be used as a lookup table. The input data for this example look like this. It is a text file with each line containing a comma-separated string of values. Some maximum line length or number of values is assumed to be known.
The goal is to produce a dataset where each value is a separate case.
Solution 1. The first approach is to read the text using GET DATA into a set of variables, say, x, y, z - as many as needed. Then use the VARSTOCASES command to restructure these cases into a new dataset. The syntax would simply be
VARSTOCASES /MAKE output FROM x y z /NULL=DROP.
Discussion. VARSTOCASES has many features for more complicated problems of this type such as repeating groups, nonvarying variables and id variables. The Restructure Data Wizard on the Data menu will walk you through this specification, but personally I find the syntax easier to figure out than the wizard in this case. CASESTOVARS does the reverse operation. NULL=DROP, which is the default setting, means here that if there were fewer than three values on an input line, the empty values would not contribute output cases.
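To make the restructuring concrete, here is a plain-Python sketch of what the command does conceptually. It is an illustration only, not how SPSS implements VARSTOCASES; the function name vars_to_cases and the sample rows are hypothetical.

```python
def vars_to_cases(rows, drop_null=True):
    """Turn each wide row (x, y, z, ...) into one output case per value.

    With drop_null=True, blank or missing values are skipped, which mirrors
    the effect of NULL=DROP described above.
    """
    cases = []
    for row in rows:
        for value in row:
            is_blank = value is None or str(value).strip() == ""
            if drop_null and is_blank:
                continue
            cases.append(value)
    return cases


# Hypothetical sample: the second input line had only two values.
wide = [("10001", "10002", "10003"), ("60601", "60602", "")]
long_cases = vars_to_cases(wide)
print(long_cases)
```

Passing drop_null=False keeps the blank entry as its own case, which is roughly what NULL=KEEP would do.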
So, game over, what could be easier? Common data manipulation problems often have a special command available for their solution. (AGGREGATE is another of my favorites.) You might not be so lucky as to have your problem fit exactly into an existing command, so it's worth looking at more general mechanisms. The next solution uses an INPUT PROGRAM.
Solution 2. Input programs are a powerful but little-known mechanism to build cases in a very general way. They take as input raw data described by one or more DATA LIST commands and produce an SPSS dataset that might have a quite different structure. Here is an input program to solve our problem. I've inserted it as an image in order to take advantage of the syntax coloring of the SPSS Syntax Editor introduced in Version 17.
Discussion. This program defines its input with a DATA LIST command for the CSV file. It loops over each input record, emitting the value of the input up to the next comma as a case containing a single variable named output. This continues until all the text in the input record has been consumed. Then it breaks and proceeds to the next input record.
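The looping logic just described can be mimicked in a few lines of plain Python. This is a sketch of the idea only, not the input program itself; the function name emit_cases and the sample records are hypothetical, and each yielded value stands in for a case written by END CASE.

```python
def emit_cases(record):
    """Yield one value per 'case' from a comma-separated record.

    Mirrors the loop described above: take the text up to the next comma,
    emit it, drop the consumed text, and stop when nothing is left.
    """
    remaining = record.strip()
    while remaining:                      # loop until the record is consumed
        pos = remaining.find(",")         # index of the next comma, -1 if none
        if pos == -1:
            value, remaining = remaining, ""
        else:
            value, remaining = remaining[:pos], remaining[pos + 1:]
        value = value.strip()
        if value:                         # skip empty fields, e.g. after a trailing comma
            yield value


# Hypothetical input records, one per line of the text file.
records = ["10001,10002,10003,", "60601,60602"]
output = [value for record in records for value in emit_cases(record)]
print(output)
```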
Input programs can deal with hierarchical input, repeating groups, and self-defining data formats, among other things. Why are they not better known? I'll offer several reasons.
- There is no user interface for input programs. For many users, if there is no GUI, the feature doesn't exist.
- The documentation in the Command Syntax Reference is not very good. It is scattered through the separate commands, so it is hard to grasp what input programs can do and how to use them.
- In the database era, data often come in rectangular database tables that are manipulated with SQL or in spreadsheets, so an input program is not needed. Those are the domain of the GET DATA command.
You usually need to experiment with input programs to figure out the solution. Once you get the pattern, though, you can easily solve other similar problems. One SPSS feature that you might not realize is actually an input program is the Make New Dataset with Cases custom dialog box available from the files section of this site. You specify in the dialog the number of variables and cases and a few other options, and it creates a dataset of random numbers. If you look into the dialog syntax, you will see that it generates an input program.
Solution 3. The next approach is to use Python programmability. Python being a very general purpose language, there are many variations that would work (hence the 1/2 in the title). I'll start by getting the data by using the csv module in the Python standard library and generating an SPSS dataset from that input. The csv module understands several common dialects of comma-separated files and allows you to control details of the formats, but we don't need that generality here.
The code loops over the lines in the csv file and then over the values within the line. The csv reader automatically splits the line into fields at the comma boundaries. It regards the line contents following the last comma as an additional empty field, so the code ignores that. The spss.Dataset class is used to generate cases containing the values extracted from the input file. This code would be wrapped in BEGIN PROGRAM/END PROGRAM if run in the regular SPSS syntax stream, but it could also be run in external mode directly from a Python environment. In that mode, which can be much faster than the usual internal mode, no SPSS user interface appears. It is much like using the StatisticsB module included with SPSS Statistics Server, but it runs locally.
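A minimal sketch of this approach follows. It assumes Python 3 and the spss module that comes with the SPSS Python integration; the function names iter_values and write_dataset, the default variable name, and the string width of 10 are illustrative choices, not taken from the original code. The parsing step is kept separate so that it can be tried without SPSS installed.

```python
import csv
import io


def iter_values(lines):
    """Yield each non-empty field from an iterable of CSV lines."""
    for row in csv.reader(lines):
        for field in row:
            field = field.strip()
            if field:          # ignore the empty field after a trailing comma
                yield field


def write_dataset(path, varname="output", width=10):
    """Create a new SPSS dataset with one string variable, one case per value.

    Needs the spss module, so the import is deferred until the function
    is actually called.
    """
    import spss

    with open(path, newline="") as f:
        spss.StartDataStep()
        try:
            ds = spss.Dataset(name=None)          # new, empty dataset
            ds.varlist.append(varname, width)     # type > 0: string of that width
            for value in iter_values(f):
                ds.cases.append([value])
        finally:
            spss.EndDataStep()


# The parsing half on a small in-memory sample (hypothetical values).
sample = io.StringIO("10001,10002,10003,\n60601,60602,\n")
parsed = list(iter_values(sample))
print(parsed)
```

Calling write_dataset with the path to the text file would then build the dataset, either inside BEGIN PROGRAM/END PROGRAM or in external mode.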
Alternatives with the Python approach would be to read the input from the active SPSS dataset and create a new one. The spssdata.Spssdata class could also be used to write the new dataset.
So, which is best? The answer, of course, depends. VARSTOCASES wins if it fits the problem. The choice between an input program and a Python program is partly a matter of taste and skill set, partly whether you want the help of an IDE in creating and debugging the code. Using direct SPSS syntax will generally be faster for passing cases, although for some problems the power of the Python language might win out on performance grounds. Performance improvements in the Dataset class are currently underway.
You can even use the input program within a Python program by just using spss.Submit to run that whole block of code. One reason to do that would be for convenient parameterization of the code while getting the speed advantage of native SPSS case passing.
At least the filename and perhaps the output variable name would vary if you use this code repeatedly. There are three general ways to parameterize it. The traditional route is the SPSS macro language. For both VARSTOCASES and the input program, you could define a macro that substitutes the parameters into the syntax. Running it just means calling the macro, except that you have to load the file containing the macro explicitly in each session.
With Python, you might turn this program into a function of two parameters and then reference those parameters in the code. You would just import the function definition (no location information required) and call it in a short program. You might, in fact, collect a bunch of utility functions in a single module and just import it once in the job. Even without creating a function, though, you could use multiline triple-quoted Python literals and named parameters in the command string to produce readable and flexible code. The Programming and Data Management book downloadable from this site in the Articles section illustrates this idea.
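As a sketch of the triple-quoted, named-parameter idea, the following builds the GET DATA and VARSTOCASES syntax from a template and hands it to spss.Submit. The function names, the generated variable names v1, v2, ..., the string width, and the file path are all hypothetical, and the GET DATA subcommands shown are one plausible specification for a simple delimited file rather than the only one.

```python
SYNTAX_TEMPLATE = """\
GET DATA /TYPE=TXT
  /FILE="%(filename)s"
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /VARIABLES=%(varspecs)s.
VARSTOCASES /MAKE %(outvar)s FROM %(varnames)s /NULL=DROP.
"""


def build_syntax(filename, outvar="output", nvars=3, width=10):
    """Fill the template with the file name and output variable name."""
    names = ["v%d" % i for i in range(1, nvars + 1)]
    params = {
        "filename": filename,
        "outvar": outvar,
        "varspecs": " ".join("%s A%d" % (n, width) for n in names),
        "varnames": " ".join(names),
    }
    return SYNTAX_TEMPLATE % params


def run(filename, outvar="output", nvars=3, width=10):
    """Submit the generated syntax; requires the spss module."""
    import spss
    spss.Submit(build_syntax(filename, outvar, nvars, width))


# Inspect the generated syntax without running it (hypothetical path).
syntax = build_syntax("c:/temp/postal.csv", outvar="zipcode")
print(syntax)
```

Keeping the string-building step separate from the call to spss.Submit makes it easy to check the generated syntax before running it.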
Finally, you could create a custom dialog box using the Custom Dialog Builder and just put the code in whichever form you chose in that dialog. Here's my little dialog box. You could make your own in a few minutes.
If this isn't enough choices, you could also turn the program into an extension command with traditional SPSS syntax. You can read about that in earlier posts in this blog and on this site. That tidies up the log and makes the program available without any explicit load required.
If you have complex file types including nesting, grouped records, mixed record types, and repeating data, you might want to take a look at Appendix C in the Command Syntax Reference, entitled Defining Complex Files.
In summary, you have many tools in SPSS Statistics for problems such as this. Choosing the right tool for the problem and for your skill set will get the job done with the minimum amount of effort. Don't hesitate to venture into areas of SPSS technology that might be new to you.