We are sometimes asked why we use Python as the main programmability and scripting language in IBM SPSS Statistics. Here's a quote from an InfoWorld article dated October 17, 2011, entitled "From PHP to Perl: What's hot, what's not in scripting languages".
Hot scripting language: Python
In a sense, the tipping point for Python came when the housing market crashed. For those stuck trying to decode bond market prospectuses to figure out who got paid what when the bankruptcy dominoes were done falling, one thing became clear: English is a weaselly language, and some weaselly folks revel in its ambiguities to profit from complicated derivatives. There was one smart group that offered a compelling solution: Force every bond to come with a Python program that defined who got paid and when they were paid. While they may have been a bit too hopeful about the power of computer languages, the proposal spoke to Python's growing acceptance in the greater world of smart people who aren't computer geeks. Relatively easy to pick up, Python is becoming increasingly popular in economic research, science departments, and biology labs.
I'm not so sure about the bond market idea(!), but Python excels in its combination of clarity, flexibility, and power, and statistics and data analysis are a case in point. SPSS began embedding Python with version 14, way back in 2005. As of October, the TIOBE index shows Python as the 8th most popular programming language overall, FWIW. While this is only a partial view of language usage, it confirms again that Python is a class A language.
Recently this problem was posed on the SPSSX-L listserv (linked on the SPSS Community site): count the number of distinct values in a set of variables for each case. This led to a lively discussion of alternative solutions. Most used traditional syntax; I used Python programmability.
Datasets sometimes need to be restructured between wide form, where there are multiple measurements on a concept expressed as different variables within a case, and long form, where each repeated measurement is a separate case. Commonly, repeated measures, for example, test scores at different points of time for a subject, are stored in wide form, and IBM SPSS Statistics procedures that focus on repeated measures designs tend to require this form. Most data transformations and simple summary statistics, however, are intended for long form. Thus an easy way to convert between these forms is important.
IBM SPSS Statistics provides the commands VARSTOCASES and CASESTOVARS to convert between these forms. These are used by the Restructure Data Wizard, which appears on the Data menu.
SPSS Statistics has a number of transformation functions that work across variables within a case, such as mean, median, and sum, and the COUNT command that counts occurrences of a particular value, but there is no built-in function for counting distinct values. By converting these data from wide form to long form, the problem can be solved with traditional syntax.
The traditional syntax solution, then, has these steps (mainly worked out by David Marso); a syntax sketch follows the list. The syntax can all be generated using the GUI.
- Save the dataset if there have been changes.
- Use VARSTOCASES to replace the active dataset with one containing an id variable (from the original data or generated) and one variable representing the variables over which the calculation will be done. Call that new variable Z. The number of records is M * NV, where M is the number of IDs and NV is the number of variables over which the count is required.
- Use AGGREGATE with the ID variable and Z as the break variables. Use N as the statistic. Now the number of records in this new dataset is M * average number of distinct values.
- Activate the new dataset and use AGGREGATE again on it, breaking just on ID, and use N as the statistic. This results in one record per ID with the count of distinct values. The dataset is M records.
- Get the original dataset, and use MATCH FILES to add this value back to it.
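Here is a sketch of those steps in syntax, assuming an id variable id and variables x1, x2, and x3 (the names are hypothetical, and a dataset copy stands in for saving; GUI-generated code will differ in details):

DATASET NAME original.
DATASET COPY scratch.
DATASET ACTIVATE scratch.
VARSTOCASES /MAKE Z FROM x1 x2 x3 /KEEP=id.
* One record per (id, value) combination.
AGGREGATE OUTFILE=* /BREAK=id Z /n=N.
* One record per id, carrying the count of distinct values.
AGGREGATE OUTFILE=* /BREAK=id /distinct=N.
DATASET ACTIVATE original.
* Assumes the original data are sorted by id.
MATCH FILES /FILE=* /TABLE=scratch /BY id.
DATASET CLOSE scratch.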
This works, but it requires quite a few data passes, and it takes some study to understand the code.
The programmability solution is much simpler. It uses the SPSSINC TRANS extension command along with a two-line Python program to arrive at the same result. Here is the Python program followed by the command syntax.
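Reconstructed from the explanation below, the program and command amount to something like this (the variable names x1, x2, x3, the result name ndistinct, and the TYPE=0 numeric-result keyword are assumptions, not the original listing):

begin program.
def countThem(*args):
    return len(set(args))
end program.

SPSSINC TRANS RESULT=ndistinct TYPE=0
  /FORMULA "countThem(x1, x2, x3)".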
The program explanation:
- The "*args" signature in the countThem function means that args will be a list of the arguments (variable values in this case) passed to the function when it is called, so the function can handle any number of variables.
- set(args) creates a set from that list. Since a set can only contain the same item once, it will contain only the distinct values. If I pass it the list [1,2,1], the set will contain two members: 1 and 2.
- The length of the set, returned by the len function, is the number of members and hence the return value is the number of distinct values. This includes None, if there were any SYSMIS values. That could easily be excluded if they should not be counted.
- The program defined in the begin program block remains available throughout the session even though the program block has terminated, so once it is defined, it can be called elsewhere in that session.
The SPSSINC TRANS explanation:
- SPSSINC TRANS is an extension command implemented in Python that applies the code in the FORMULA subcommand to the cases in the active dataset and stores the result in the RESULT variable. It fetches the variable values referenced in the formula and calls whatever function was specified.
- Since countThem is called without any module qualifier, the extension command first tries to find that function in the items that have been defined in begin program blocks. If not found there, it looks for a function built in to Python, e.g., min, max, sum, ... If you have a function defined in some other module, you can reference it as modulename.funcname, and the command will load that module and then call the indicated function.
- The entire formula is quoted so that it will not be digested by the Statistics parser but rather passed as is to the command. Variable names are case sensitive.
- SPSSINC TRANS can return more than one value, hence creating multiple variables, and it has a number of other features that can be found in its syntax or dialog box help (Transform>Programmability Transformation).
So which is better? The traditional syntax does not require any programmability knowledge and works on quite elderly versions of Statistics, but it is rather complicated and takes many data passes. The Python approach requires a little knowledge of programmability and requires at least version 17 (2008) and the Python plugin, but, given that, it is easy to read and takes only one data pass. That data pass, however, will be slower than a native data pass.
The Python Essentials and the SPSSINC TRANS extension command can be downloaded from the SPSS Community site (www.ibm.com/developerworks/spssdevcentral if you are not reading this on the site).
IBM SPSS Statistics Version 18 introduced a new variable property: role. The role can be Input, Target, Both, None, Split, or Partition. This new metadata comes from IBM SPSS Modeler and is useful in abstracting and generalizing jobs.
Roles are normally set by the user. Currently, these simply make initial settings in some dialog boxes. But if the roles are set correctly, it becomes possible to automate and raise the level of abstraction of repetitive tasks. For example, you might need to produce a standard set of analyses/reports across a variety of datasets that have a similar structure but vary in the exact variables they contain or other details. By abstracting the logic of a job to use roles, measurement levels, custom attributes, and other variable properties, you can reduce the number of versions of a job that need to be developed and maintained.
This can save time and reduce the number of errors. I have seen customer sites where there are huge numbers of job files - syntax, templates, macros, scripts, etc. - that are very similar but duplicated and modified, because the variables coming in are a little different or the coding of variables is a little different. Once you build a big set of jobs like this, making improvements or bug fixes becomes a nightmare, not to mention the extra time it takes to do things this way.
The long-standing macro facility provides some possibilities for abstraction, but it is static and can't use the metadata in a dataset. In contrast, the SPSSINC SELECT VARIABLES command allows you to define sets of variables based on the metadata rather than just a hard-coded list of names. It can use explicit names, patterns in names (all the variable names that contain AGE), measurement level, type (numeric vs string), custom attributes, and, finally, role, to define sets of variables that can be used in the job. These sets are embodied in, yes, macros. Of course, you could write your own code to use this sort of information, but SELECT VARIABLES can do a lot of this without the need to learn programmability.
And it has a dialog box interface as shown here. For example, suppose you have a mostly standard questionnaire that is used in many studies, but it has a few custom questions that vary from study to study, or some variables are sometimes omitted. You need to produce tabulations and estimate similar models for these studies. By intelligent use of the metadata, including role, you can perhaps have one master job rather than dozens. This leaves the analyst or researcher free to focus on the brain work part of the job rather than the tedious mechanical and error-prone parts.
If you have a data supplier who collects and prepares your datasets, you can instruct them on what roles and custom attributes should be defined. Then your analysis syntax can at least in part be based on these properties.
Custom attributes, first introduced in SPSS version 14, can also hold metadata such as questionnaire text, interviewer instructions, measurement units, or anything else that is useful in documenting your data or in programmatically manipulating it. In syntax, these can be created with the VARIABLE ATTRIBUTE command. (There is also a DATAFILE ATTRIBUTE command.) Roles can be defined with the VARIABLE ROLE command. Attributes and roles can also be defined in the Data Editor or the Define Variable Properties dialog. They all persist with the saved data. These can be used in Modeler, too.
In summary, it's all about generalization and automation. Role is just one more attribute that can be used in this effort. SPSSINC SELECT VARIABLES can be obtained from the SPSS Community site and requires the Python programmability plug-in/essentials.
I recently posted a new extension command, SPSSINC PROCESS FILES (along with SPSSINC SPLIT DATASET). The command applies a syntax file to a set of files. Recently someone asked for a way to search across the case data in many data files to find a particular ID value. This can be done with SPSSINC PROCESS FILES, but it occurred to me that a few enhancements to the command would simplify the process. This new version is the result. Another extension command, GATHERMD, can collect variable names and labels across data files into a single dataset, making it easy to search that metadata, but it does not look at the case data.
Note: Updated again to reflect more improvements to the case data searching user interface and syntax.
PROCESS FILES requires, at a minimum, an input specification, perhaps a wildcard expression, to select the files to process; a syntax file to apply to them; and maybe some output specifications. The awkwardness in using this for search is the need to edit the syntax file every time you have a different search specification. (File names and such are already provided for by file handles and macros.) My thought was to fix this by adding a way for users to define macro expressions using PROCESS FILES. There is now a new subcommand, MACRODEFS, with a set of keywords to define macro expressions, and there is an accompanying subdialog box for the command.
Using this feature, you could create a single syntax file that finds the cases according to specified criteria and lists the results. Macro parameters are used to pass the search criteria to the syntax file. A search syntax file is now included in the package. You can look at that file in the package or by using the dialog box help, so I won't list it here. To use it, you would run a command like this.
SPSSINC PROCESS FILES INPUTDATA="c:\spss18\samples\english\e*.sav"
SYNTAX="some location\searchfiles.sps" CONTINUEONERROR=YES
VIEWERFILE= "c:/temp/outputfiles/searchresults.spv" CLOSEDATA=YES
/MACRODEFS ITEMS PARM1="ID >= 100 and ID <110" PARM2="ID" PARM3="educ salary".
- PARM1 defines the cases you want to find.
- PARM2 names the case id variable for display purposes.
- PARM3 names any other variables that should be listed.
You need to list at least one variable in either PARM2 or PARM3.
This is quite general: the expression defining the cases you are looking for can be any SPSS logical expression, and if variables don't exist in a file being searched or there are no matches in the cases, everything is quietly tidied up for you. If you specify a single Viewer file for all the output as in the example above, the result is a file with a table listing the matching cases for each input data file.
Update: Macro parameters can be very useful, but for searching in particular, a simpler user interface would be useful. I have added a keyword SEARCH=YES|NO, and if you choose YES, you omit the syntax file, and the command takes care of the details. To make this easier to use, I have created a search-specific dialog box that generates the appropriate syntax. It is simplified a lot to remove options that are probably not particularly useful for searching, and it labels the required fields with their purpose rather than using terms like PARM1. This is pretty easy to do, so if you have a specialized task to do, you could adapt the dialog box text accordingly. As part of the search improvements, I have eliminated the separate syntax file for searching, since this is now built into the command.
I created SPSSINC PROCESS FILES originally having in mind examples of batch processing of similar sets of files. Realizing that it could also provide a cross-file data search was a serendipitous extra.
The new version is now available on SPSS Developer Central. Feedback is welcome. As with the earlier version, the command requires at least SPSS Statistics Version 17 and the Python programmability plugin.
Traditional SPSS syntax jobs refer to specific variable names and are written with assumptions about what the variables mean. That is often fine, but you can sometimes accommodate variations in names or roles by using the properties and attributes of variables instead of relying on the names. A new extension command, SPSSINC SELECT VARIABLES, can help with this.
To start with the most basic approach, suppose you would like to create a job that produces summary univariate statistics for any dataset without knowing what the variables will be. The CODEBOOK command will do very condensed statistics, but let's suppose you want more detail. The measurement level of a variable is a good place to start in determining what statistics would be appropriate, but there is no syntax that says, for example, run FREQUENCIES on all categorical variables and DESCRIPTIVES or EXAMINE on all scale (continuous, ratio) variables. The analyst can easily do this interactively (the dialog box variable lists can even be sorted by measurement level), but it is hard to build a production job with this flexibility.
Using SPSSINC SELECT VARIABLES, you can easily define macros that list the variables for selected measurement levels. The macros can then be used in the appropriate commands. Here's what the job might look like.
SPSSINC SELECT VARIABLES MACRONAME = "!catvars"
/PROPERTIES LEVEL=NOMINAL ORDINAL.
FREQUENCIES !catvars /BARCHART.
SPSSINC SELECT VARIABLES MACRONAME = "!scalevars"
/PROPERTIES LEVEL=SCALE.
DESCRIPTIVES !scalevars.
Simple! If you wanted to select only variables whose names end with "Education", you could add PATTERN=".*education" to the selection commands above.
The SPSSINC SELECT VARIABLES command has lots of other ways of selecting variables, and it has a dialog box interface as well. I'll only write about one additional dimension of this command, which touches on an exciting way to raise the level of generality.
Since Version 14, SPSS has allowed users to create custom attributes for variables and files. They can be anything you want, and they are saved with the data just as variable labels, missing value codes and other metainformation are. Useful examples might include units (currency, distance, weight), sources, roles (predictor, dependent, id), question text, privacy (confidential, semi-public, public).
Let's assume that the datasets to be fed to your job have a set of attributes that have already been assigned by the data preparation team. You can use this information to affect the macro definitions generated by SPSSINC SELECT VARIABLES. As one example, perhaps you want to exclude variables from your statistical summary if they don't have a useful role for analysis. You could change the syntax above to reflect this. Just redoing the categorical variable example, you would write.
SPSSINC SELECT VARIABLES MACRONAME="!interestingCatVars"
/PROPERTIES LEVEL=NOMINAL ORDINAL
/ATTRVALUES NAME=role VALUE="dependent" "predictor".
SPSSINC SELECT VARIABLES is an extension command available from SPSS Developer Central in the Downloads section. It requires at least Version 17 and the Python programmability plugin. Check it out: no need to learn the macro language or Python programmability to start taking advantage of this. If you are building automation jobs where you don't even know the names of the datasets that might arrive (but you do have some idea of the structure!), check out the SPSSINC PROCESS FILES extension command, also available from Developer Central.
The IDLE IDE (Integrated Development Environment) is installed with the Python language and can be used with SPSS, but I am often asked to recommend a better one. I'll tell you what I do and offer some tips on using it with IBM SPSS Statistics. This is not an in-depth review of any of these editors.
IDLE has the virtue of always being present if Python is installed, and Python scripting (not programmability) within Statistics uses it by default, but there are much better choices available. Many of the alternatives are free, and many are multi-platform. I will focus on Windows, though, since that's what I know best. If you want to follow up on any of these, you can find them easily via Google, so I won't add URLs. Any Python IDE or even the plain Python command line can drive Statistics in external mode, where your Python program invokes and drives Statistics without the usual user interface. Using one in internal mode is tougher.
In the free and simple category, I use Pythonwin, which as the name suggests is Windows only. It comes with the win32 COM extensions, but you don't have to get into those to use it. As you can see from the image below, it's a pretty plain window, but it provides syntax coloring, pop-up help, syntax checking and other niceties. The output from any commands you run appears right in the interactive window, but you can open a program window and run a complete program. That separates the input and output.
The Pythonwin IDE
Pythonwin is quite easy to learn, and it's free. It gives you lots more help than the Statistics syntax window, which knows about Statistics syntax but not about Python syntax. I have used this in training classes, and attendees pick it up quite quickly. It has a debugger, but it's hard to use and not especially powerful. It's best for trying out bits of code and running small programs. You can develop a program or Python script with it and then transfer it to the Statistics syntax window later.
If you are going to do a lot of Python work or you want to debug code running in internal mode, it is worth getting a more powerful tool. Some people like Pydev, the Eclipse plug-in for Python, but I find that way too busy and complicated for my needs. Stani's Python Editor (SPE) is another popular free editor.
I do not claim to know all the available IDEs, but what I like best of what I have tried is Wing IDE. It's a commercial product but inexpensive. I chose it originally several years ago because it was tops in a comparative review of a number of IDEs and was used by some Python gurus I knew. Its interface is a lot more complicated than IDLE or Pythonwin, because it has many tool and debugging windows, but I have used it in teaching, and students catch on more quickly than I originally expected. It separates the output into its own window and provides much deeper help on command completion than Pythonwin. Here's a picture. It's scrunched down, but you should be able to get the general idea.
The user interface is pretty configurable. I usually have many modules open at the same time and the Debug window open on a second monitor.
Wing really shines when it comes to debugging, but there are a number of settings and a little extra code that really amp up what you can do. If you are running Statistics in external mode, the obvious debugging features don't require any special settings. Debugging in internal mode, where you start from the Statistics syntax window and invoke a saved Python module, is a really powerful way to work, but you need to do a few special things to make it work. The setup is that you are starting from the syntax window (or an INSERTed syntax file), and the Python code you want to debug is in a saved module that is open in Wing.
First, you need to adjust some settings in Wing (some of this information is available from the Wing help material.)
- Go to Edit>Preferences>Debugger>External/Remote. Check the box Enable Passive Listen
- In Debugger I/O, you may want to tinker with the settings for Debug I/O Encoding and Debug Shell Encoding if you are working with characters beyond 7-bit ASCII.
- Open the file wingdbstub.py in the editor. You will find it in the Wing installation directory. It's heavily commented to guide you through the settings, but all you need to do is to find the kEmbedded line and change it to read
kEmbedded = 1
- Make sure that the line that sets the value of WINGHOME points to where you have installed WING. (This is usually automatically done by the install process.)
- Save wingdbstub.py and copy it also to wherever you keep your Python modules.
The steps above are a one-time thing. The next part is what you do to your code whenever you want to debug. I always insert this code into a new module and just toggle it on or off as needed. You can do that using the Source>Comment Region Toggle menu item. (The exact menu wording varies with different versions of Wing.)
Insert the following code someplace where it will be executed each time you want to debug the code. That implies that you should not just put it at the top level of a module: top-level code runs only once, when the module is imported, but you need this to run each time you debug something.
try:
    import wingdbstub                     # fails harmlessly if Wing is not installed
    if wingdbstub.debugger != None:
        # Make sure the debugger is listening; call reconstructed per
        # wingdbstub's comments on embedded use - check your Wing version.
        wingdbstub.debugger.StartDebug()
except ImportError:
    pass
This code uses a try block so that if you forget to comment it out before distributing a module, your code will still work even though the users don't have Wing. The rest of it ensures that the debugger is listening and can grab control when a breakpoint is encountered. (This is more complicated than the usual debugging situation because of the way Statistics and programmability/scripting are architected.)
Once this block of code has been executed, if a breakpoint has been set, the program will pause at that point and give control to the editor. You can examine or change variables, step through the code from that point or do any other useful debugging operations. Just click Debug to continue execution to the next breakpoint or the end of the code.
This facility is incredibly useful, but there are a few things you can't do this way. If you change code that has already been imported in your Statistics session and rerun the module, the old code will still be used, because import is a no-op if repeated. You can use the Python reload function to get a new version, but that isn't entirely reliable. And any existing variables are not automatically replaced unless the relevant code is reexecuted.
It's safer to restart Statistics after you change the code (and don't forget to save the file in the editor!). Because of the way Python scripts (not programs) are executed, you might not have to restart Statistics for those, but if your changes don't seem to be working, it might be because you need to restart.
Finally, if you are not sure whether the debugger is hooked in, watch the icon at the left of the Wing status bar. It will change color when the debugger is active.
Debugger not watching
You may be thinking that having adopted a better IDE, you would like to use it in place of IDLE when you create or edit a script from within Statistics by using File>New>Script or File>Open>Script. How well this works depends on the IDE, but it is possible.
The IDE configuration is specified in the file clientscriptingcfg.ini, which lives in your Statistics installation directory. First, save a copy of the file in case of trouble. Then open the standard version in an editor that understands Unicode and Unix-style line endings. You might be surprised to learn that WordPad on Windows works.
In the section marked [Python], change the EDITOR_PATH line to point to your chosen IDE. Mine looks like this
EDITOR_PATH=C:\Wing IDE 3.2\bin\wing.exe
The second step is to change the EDITOR_ARGS line to match what your chosen IDE expects.
Save the file and restart Statistics, and your IDE should come up from the menus. For WordPad, be sure to save the file as a Unicode Text Document, not as RTF. For the Open action, the file to be edited will not be opened automatically, but you can just open it from the editor directly. If you are using Wing, it will remember the file the next time (along with any other files you opened), so this isn't too inconvenient.
You can't do this for programs, so you would have to edit them independently in the IDE.
In summary, using a good IDE will help you get the most out of programmability and maximize your productivity.
IBM SPSS Statistics contains many tools for data management. This post discusses several ways to solve an example data management problem using technology from different areas of the product. The purpose is to help you decide where to look when building a solution, considering the available functionality and other characteristics.
The problem to be solved here is to convert a text file of comma-separated id values such as postal codes into an SPSS dataset with one item per line. This might then be used as a lookup table.
The input data for this example look like this. It is a text file with each line containing a comma-separated string of values. Some maximum line length or number of values is assumed to be known.
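For concreteness, the input lines might look like this (hypothetical values):

10001,10040,10093
10002,10115
10003,10118,10121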
The goal is to produce a dataset where each value is a separate case.
Solution 1. The first approach is to read the text using GET DATA into a set of variables, say, x, y, z - as many as needed. Then use the VARSTOCASES command to restructure these cases into a new dataset. The syntax would simply be
VARSTOCASES /MAKE output FROM x y z /NULL=DROP.
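For completeness, the GET DATA step feeding it might look like this (the file name and variable formats are assumptions):

GET DATA /TYPE=TXT /FILE="values.csv"
  /DELIMITERS="," /ARRANGEMENT=DELIMITED
  /VARIABLES=x F8.0 y F8.0 z F8.0.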
Discussion. VARSTOCASES has many features for more complicated problems of this type such as repeating groups, nonvarying variables and id variables. The restructure data wizard on the Data menu will walk you through this specification, but personally I find the syntax easier to figure out than the wizard in this case. CASESTOVARS does the reverse operation. NULL=DROP means here that if there were fewer than three values on an input line, the empty values would not contribute output cases.
So, game over, what could be easier? Common data manipulation problems often have a special command available for their solution. (AGGREGATE is another of my favorites.) You might not be so lucky as to have your problem fit exactly into an existing command, so it's worth looking at more general mechanisms. The next solution uses an INPUT PROGRAM.
Solution 2. Input programs are a powerful but little-known mechanism to build cases in a very general way. They take as input raw data described by one or more DATA LIST commands and produce an SPSS dataset that might have a quite different structure. Here is an input program to solve our problem. I've inserted it as an image in order to take advantage of the syntax coloring of the SPSS Syntax Editor introduced in Version 17.
An Input Program to Restructure Variables into Cases
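A sketch of such an input program, reconstructed from the discussion below (the file name, record width, and value width are assumptions):

INPUT PROGRAM.
DATA LIST FILE="values.csv" /rec 1-255 (A).
STRING output (A10).
LOOP.
  COMPUTE #pos = INDEX(rec, ",").
  DO IF #pos > 0.
    COMPUTE output = SUBSTR(rec, 1, #pos - 1).
    COMPUTE rec = SUBSTR(rec, #pos + 1).
    END CASE.
  ELSE IF rec NE " ".
    COMPUTE output = rec.
    END CASE.
    BREAK.
  ELSE.
    BREAK.
  END IF.
END LOOP.
END INPUT PROGRAM.
* rec is still in the dataset; drop it afterward if it is not wanted.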
Discussion. This program defines its input with a Data List command for the csv file. It loops over each input record emitting the value of the input up to the next comma as a case containing a single variable named output. This continues for the input record until all the text has been consumed. Then it breaks and proceeds to the next input record.
Input programs can deal with hierarchical input, repeating groups, and self-defining data formats, among other things. Why are they not better known? I'll offer several reasons.
- There is no user interface for input programs. For many users, if there is no GUI, the feature doesn't exist.
- The documentation in the Command Syntax Reference is not very good. It is scattered through the separate commands, so it is hard to grasp what input programs can do and how to use them.
- In the database era, data often come in rectangular database tables that are manipulated with SQL or in spreadsheets, so an input program is not needed. Those are the domain of the GET DATA command.
You usually need to experiment with input programs to figure out the solution. Once you get the pattern, though, you can easily solve other similar problems. One SPSS feature that you might not realize is actually an input program is the Make New Dataset with Cases custom dialog box available from the files section of this site. You specify in the dialog the number of variables and cases and a few other options, and it creates a dataset of random numbers. If you look into the dialog syntax, you will see that it generates an input program.
Solution 3. The next approach is to use Python programmability. Python being a very general purpose language, there are many variations that would work (hence the 1/2 in the title). I'll start by getting the data by using the csv module in the Python standard library and generating an SPSS dataset from that input. The csv module understands several common dialects of comma-separated files and allows you to control details of the formats, but we don't need that generality here.
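Here is a sketch of that code, assuming the input file is values.csv and the new variable is a 10-character string named output:

import csv
import spss

spss.StartDataStep()
ds = spss.Dataset(name=None)        # a new, empty dataset with an auto-generated name
ds.varlist.append("output", 10)     # one 10-byte string variable
f = open("values.csv", "rb")
for row in csv.reader(f):
    for value in row:
        if value != "":             # ignore the empty field after a trailing comma
            ds.cases.append([value])
f.close()
spss.EndDataStep()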
The code loops over the lines in the csv file and then over the values within the line. The csv reader automatically splits the line into fields at the comma boundaries. It regards the line contents following the last comma as an additional empty field, so the code ignores that. The spss.Dataset class is used to generate cases containing the values extracted from the input file. This code would be wrapped in BEGIN PROGRAM/END PROGRAM if run in the regular SPSS syntax stream, but it could also be run in external mode directly from a Python environment. In that mode, which can be much faster than the usual internal mode, no SPSS user interface appears. It is much like using the StatisticsB module included with SPSS Statistics Server, but it runs locally.
Alternatives with the Python approach would be to read the input from the active SPSS dataset and create a new one. The spssdata.Spssdata class could also be used to write the new dataset.
So, which is best? The answer, of course, depends. VARSTOCASES wins if it fits the problem. The choice between an input program and a Python program is partly a matter of taste and skill set, partly whether you want the help of an IDE in creating and debugging the code. Using direct SPSS syntax will generally be faster for passing cases, although for some problems the power of the Python language might win out on performance grounds. Performance improvements in the Dataset class are currently underway.
You can even use the input program within a Python program by just using spss.Submit to run that whole block of code. One reason to do that would be for convenient parameterization of the code while getting the speed advantage of native SPSS case passing.
At least the filename and perhaps the output variable name would vary if you use this code repeatedly. There are three general ways to parameterize it. The traditional route is the SPSS macro language: for both VARSTOCASES and the input program, you could define a macro that substitutes the parameters into the syntax. Running it just means calling the macro, except that you have to load the file containing the macro explicitly in each session.
With Python, you might turn this program into a function of two parameters and then reference those parameters in the code. You would just import the function definition (no location information required) and call it in a short program. You might, in fact, collect a bunch of utility functions in a single module and just import it once in the job. Even without creating a function, though, you could use multiline triple-quoted Python literals and named parameters in the command string to produce readable and flexible code. The Programming and Data Management book, downloadable from this site in the Articles section, illustrates this idea.
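For instance, here is a sketch of the function approach for the VARSTOCASES version (the names are hypothetical), using a triple-quoted literal with named parameters:

import spss

def tolong(filespec, varname="output"):
    """Read a comma-separated text file and restructure its values to long form."""
    spss.Submit("""
GET DATA /TYPE=TXT /FILE="%(filespec)s"
  /DELIMITERS="," /ARRANGEMENT=DELIMITED
  /VARIABLES=x F8.0 y F8.0 z F8.0.
VARSTOCASES /MAKE %(varname)s FROM x y z /NULL=DROP.
""" % locals())

# After importing it, a call is one line:
# tolong("values.csv", "postalcode")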
Finally, you could create a custom dialog box using the Custom Dialog Builder and just put the code in whichever form you chose in that dialog. Here's my little dialog box. You could make your own in a few minutes.
If this isn't enough choices, you could also turn the program into an extension command with traditional SPSS syntax. You can read about that in earlier posts in this blog and on this site. That tidies up the log and makes the program available without any explicit load required.
If you have complex file types including nesting, grouped records, mixed record types, and repeating data, you might want to take a look at Appendix C in the Command Syntax Reference, entitled Defining Complex Files.
In summary, you have many tools in SPSS Statistics for problems such as this. Choosing the right tool for the problem and for your skill set will get the job done with the minimum amount of effort. Don't hesitate to venture into areas of SPSS technology that might be new to you.
IBM SPSS Statistics provides several mechanisms for looping in transformations or over groups within a procedure. The new extension commands expand the looping capabilities.
Using the standard capabilities, you can create loops in several ways.
- Transformations can contain general loops using LOOP, and you can loop over variables with DO REPEAT. Since transformations implicitly run within the case processing loop, you cannot include procedures within these loops.
- SPLIT FILES lets procedures iterate over contiguous subgroups of the case data. The procedure's Viewer output either combines the output for all the groups into a single table or produces a set of separate tables and charts organized by group.
- Python programmability allows Pythonistas to iterate over collections of files or variables, among other things. It is very general but requires Python knowledge.
What none of these methods allows you to do is to apply a set of commands to a collection of inputs and organize the entire set of output by group using regular SPSS syntax. For example, you might want to process a set of data files, run several transformations and procedures, and save all the output for each input file to a separate document. You might also want to save all the transformed datasets. These new commands address problems like this.
In order to generalize the split file idea, you first use SPSSINC SPLIT DATASET to make a directory of datasets, one per split value. The native way to do such an operation is to use the XSAVE transformation command with appropriate DO IF conditions for each group. This works well, but it has three problems. First, you have to have an exhaustive list of all the split values. Second, you have to write a lot of code. Third, the number of XSAVE commands that you can use in a single transformation block is limited. Prior to version 18, the limit was ten. For newer versions the limit is 64. So you have to count up and divide your code into separate blocks. Once you have all this working, if a new split value appears, you have to revise the code - if you notice the new value.
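That pattern looks like this (two split values and hypothetical names shown; a real job needs one XSAVE branch per value):

DO IF group = 1.
  XSAVE OUTFILE="group1.sav".
ELSE IF group = 2.
  XSAVE OUTFILE="group2.sav".
END IF.
EXECUTE.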
SPSSINC SPLIT DATASET eliminates these problems. It figures out what split values occur and generates the requisite syntax, taking into account the XSAVE limit. It lets you choose whether to name the outputs by variable values, labels, or sequential numbers, and it can produce a listing of the files created for use in later processing. No risk of unnoticed new values. And the data do not need to be sorted by the split values.
SPSSINC PROCESS FILES addresses the other side of the problem. It accepts an input specification that could be something like a file wildcard, e.g., /mydata/*.sav, or a file that lists the files to process, such as produced by SPSSINC SPLIT DATASET. Then it applies the contents of a syntax file to each input, i.e., it loops over the inputs. It defines file handles and macros representing the input and output parameters for each file processed, so you can refer to these explicitly in the iterated syntax. It can write an individual Viewer output file for each input file, or it can produce a single Viewer file with all the output. These automatic files get names based on the input file names, but, of course, you can do other things in the syntax file. It can also produce a log listing all the actions taken and whether any serious errors occurred.
SPSSINC PROCESS FILES is not limited to SAV files or even data. It's up to you what you want to loop over.
SPSSINC PROCESS FILES solves the long-standing request for a way to put procedures inside loops. With these new tools, you can now easily create general transformation loops, loops over variables, procedure loops over groups, or entire job loops over arbitrary inputs whether or not you are a Python person.
These commands can be downloaded from SPSS Developer Central (www.spss.com/devcentral). They require at least IBM SPSS Statistics Version 17 and the Python programmability plug-in.
P.S. Both of these commands have dialog box interfaces as well as standard SPSS-style syntax.
I hope you will find these useful.
Python programs written to run in BEGIN PROGRAM blocks are easy to write and add lots of functionality to IBM SPSS Statistics. Many users have learned to create these. More users, though, do not know the Python language.
The extension command mechanism provides a way for users of traditional SPSS syntax to run Python programs written by someone else without needing any knowledge of Python. But the program must have been written as an extension command. While creating extension commands isn't hard, it does require some extra knowledge and work. (An article on how to do this is available on Developer Central.)
I have posted a new extension command, SPSSINC PROGRAM, that allows ordinary Python programs to be run with traditional syntax without the author having created an extension command: easy on the author and easy on the user.
Often someone writes and shares a Python program for use via BEGIN PROGRAM that requires some input parameters. The BEGIN PROGRAM syntax does not allow for parameters, so the user must modify the program itself to specify these. If you know Python, this is not a problem, but many users are uncomfortable doing that, since the Python language is quite different from the traditional SPSS language.
A Python programmer would typically define a function or class with the requisite parameters and just modify the function call. An SPSS user might not know how to do that.
SPSSINC PROGRAM solves this problem. (Of course, extension commands solve this problem more generally, but they take extra work to create.) Suppose I have a program saved as mymodule.mypgm.py that does something to a pair of SPSS variables, and I want the variable names to be parameters. Using SPSSINC PROGRAM, I would write
SPSSINC PROGRAM mymodule.mypgm firstvar secondvar.
firstvar and secondvar would be the parameter values passed to the program. SPSSINC PROGRAM ensures that mymodule.mypgm is called, makes the parameters available to the program, and handles various error conditions.
Instead of being passed as function arguments, the parameters are set up as if they were a command line. The author of the program would access them in the traditional Python way via sys.argv, with the first element being the module and program name. It's just as if the program were being run from a command shell, except that the parameter values have been passed through the SPSS Universal Parser. Comments in the implementation module detail the (small) differences this can produce in what the program sees.
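A sketch of what such a program might contain (the DESCRIPTIVES command is just an illustrative payload):

import sys
import spss

# sys.argv[0] holds the module and program name; the rest are the user's parameters.
firstvar, secondvar = sys.argv[1], sys.argv[2]
spss.Submit("DESCRIPTIVES VARIABLES=%s %s." % (firstvar, secondvar))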
So using SPSSINC PROGRAM is very easy on the Python programmer while still letting the user of the program work in the style he or she is comfortable with. The package also includes a dialog box built with the Custom Dialog Builder where the user can enter the program name and any parameters.
I have posted to SPSS Developer Central a new Python-based extension command, SPSSINC SUMMARY TTEST, that does t tests when you have only the summary statistics from the samples rather than the full data. Besides being useful in its own right, it illustrates some useful techniques for programmability computations where you need both scalar computations and some SPSS transformation functions.
This command, which is implemented in Python and includes a dialog box interface built with the Custom Dialog Builder, takes as input the counts, means, and standard deviations of the samples and produces several pivot tables with the t test results, confidence intervals, and equal variance test. The output includes asymptotic and exact results for both equal variance and unequal variance cases.
The computations are based on standard formulas, but there are a few tactical issues to work out. First, the formulas require only standard algebra except that values from the t and F distribution and inverse distribution functions are required. Those are not available from the Python standard library, although they are available in some third-party Python libraries.
These are, of course, readily available in the IBM SPSS Statistics transformation system. In order to tap the SPSS functions, it is necessary to write a small SPSS dataset with the input values, run some transformation commands on that dataset, and retrieve the values.
The dataset tasks are done most easily with the spss Dataset class. That, however, has to run within an spss DataStep. But the Submit API used to run the transformations cannot be used within a DataStep. Furthermore, the output of the procedure consists of pivot tables, and those can only be produced within a StartProcedure...EndProcedure block. And StartProcedure cannot be called when a Dataset is active.
So here's the drill (a code sketch follows the list):
- Calculate all the scalar quantities needed for the distribution functions
- Start a DataStep and use the Dataset class to populate a tiny dataset
- End the DataStep and Submit the necessary COMPUTE commands
- Start a DataStep and use the Dataset class to retrieve the results
- End the DataStep and start a StartProcedure block
- Produce the pivot tables and close the procedure block
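Condensed into code, the sequence might look like this (the variable names, values, and the CDF.T computation are illustrative, not the command's actual internals):

import spss

# Step 1: scalar work in plain Python, e.g., the t statistic and degrees of freedom.
t, df = 2.31, 24.0                       # hypothetical values

# Steps 2-3: populate a tiny dataset, then Submit the COMPUTE commands.
spss.StartDataStep()
ds = spss.Dataset(name=None)             # new dataset with an auto-generated name
for v in ("t", "df", "sig"):
    ds.varlist.append(v, 0)              # numeric variables
ds.cases.append([t, df, None])
dsname = ds.name
spss.EndDataStep()

spss.Submit("""DATASET ACTIVATE %s.
COMPUTE sig = 2 * (1 - CDF.T(ABS(t), df)).
EXECUTE.""" % dsname)

# Steps 4-5: retrieve the computed value, then open a procedure block.
spss.StartDataStep()
ds = spss.Dataset(name=dsname)
sig = ds.cases[0][2]                     # third variable of the first case
spss.EndDataStep()

# Step 6: produce the pivot tables and close the block.
spss.StartProcedure("Summary T Test")
# ... build pivot tables here, e.g., with spss.BasePivotTable ...
spss.EndProcedure()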
Once you think your way through the constraints, all this takes only a very small amount of code. You can look at the source to see the details. Although I didn't use it for this example, the spssaux3.py module available from Developer Central includes a function, getComputations, that simplifies the task of getting computational results from SPSS into your Python code. It takes as input a set of values and a set of commands and returns a sequence of results.
There is one other interesting issue with implementing this command. The command syntax allows for carrying out a sequence of t tests, so all the input parameters can be lists. The intermediate calculated values are, therefore, also sequences of values. The most straightforward way to process these would simply be to loop the whole process described above. While that would work, I wanted to avoid creating and destroying many datasets, and, more importantly, I wanted all the output to appear as a single procedure in the SPSS Viewer. That means doing all the preliminary calculations, then generating an SPSS dataset with one row for each test, and then iterating over all those rows to produce the output in a single StartProcedure block.
That gets rather tedious, because you have to first initialize a whole bunch of lists for the intermediate variables, and all the formulas then require subscripts everywhere. Ugly. Instead, I took advantage of Python's dynamic and flexible class structure.
First, I defined a completely empty class.
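In Python, that is just this (nothing more is needed):

class C(object):
    pass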
Useless, right? But Python allows you to add variables (attributes) to class instances dynamically, so in my loop I could just write
c.var1 = ...
where c is an instance of class C. No tedious list initialization.
Now, to deal with the list nature of the computations, in my outer loop I also assigned a variable to stand for the list element. So the code starts with this.
c = [C() for i in range(numtests)]   # reconstructed: one empty C instance per test
for i in range(numtests):
    d = c[i]
Then the computational lines look like this.
d.diff = mean1[i] - mean2[i]
so no subscripting is required on the intermediate results. (I could have packaged up the inputs in the same way but left them subscripted since that relates directly to the inputs).
Nowhere was it necessary to write lengthy definition or initialization code, and the computations are clearer than if they were littered all over with subscripts.
I'd like to acknowledge the assistance of Marta García-Granero with the statistical computations and the original inspiration for producing this procedure.
By the way, the test for equal variance is not the Levene test, because that test requires the absolute deviations from the mean, which cannot be computed from the summary statistics.