The IDLE IDE (Integrated Development Environment) is installed with the Python language and can be used with SPSS, but I am often asked to recommend a better one. I'll tell you what I do and offer some tips on using it with IBM SPSS Statistics. This is not an in-depth review of any of these editors.
IDLE has the virtue of always being present if Python is installed, and Python scripting (not programmability) within Statistics uses it by default, but there are much better choices available. Many of the alternatives are free, and many are multi-platform. I will focus on Windows, though, since that's what I know best. If you want to follow up on any of these, you can find them easily via Google, so I won't add URLs. Any Python IDE or even the plain Python command line can drive Statistics in external mode, where your Python program invokes and drives Statistics without the usual user interface. Using one in internal mode is tougher.
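To give a flavor of external mode, here is a minimal sketch you could run from any Python prompt or IDE; it assumes the Statistics Python integration modules are on your Python path, and the data file location is hypothetical.

import spss

spss.SetOutput("on")    # show text-style output in the console
# Statistics starts in the background on the first call; no user interface appears
spss.Submit(r"""
GET FILE="c:/examples/cars.sav".
FREQUENCIES VARIABLES=origin.
""")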
In the free and simple category, I use Pythonwin, which as the name suggests is Windows only. It comes with the win32 COM extensions, but you don't have to get into those to use it. As you can see from the image below, it's a pretty plain window, but it provides syntax coloring, pop-up help, syntax checking and other niceties. The output from any commands you run appears right in the interactive window, but you can open a program window and run a complete program. That separates the input and output.
The Pythonwin IDE
Pythonwin is quite easy to learn, and it's free. It gives you lots more help than the Statistics syntax window, which knows about Statistics syntax but not about Python syntax. I have used this in training classes, and attendees pick it up quite quickly. It has a debugger, but it's hard to use and not especially powerful. It's best for trying out bits of code and running small programs. You can develop a program or Python script with it and then transfer it to the Statistics syntax window later.
If you are going to do a lot of Python work or you want to debug code running in internal mode, it is worth getting a more powerful tool. Some people like Pydev, the Eclipse plug-in for Python, but I find that way too busy and complicated for my needs. Stani's Python Editor (SPE) is another popular free choice.
I do not claim to know all the available IDEs, but what I like best of what I have tried is Wing IDE. It's a commercial product but inexpensive. I chose it originally several years ago because it was tops in a comparative review of a number of IDEs and was used by some Python gurus I knew. Its interface is a lot more complicated than IDLE or Pythonwin, because it has many tool and debugging windows, but I have used it in teaching, and students catch on more quickly than I originally expected. It separates the output into its own window, and provides much deeper help on command completion than Pythonwin. Here's a picture. It's scrunched down, but you should be able to get the general idea.
The user interface is pretty configurable. I usually have many modules open at the same time and the Debug window open on a second monitor.
Wing really shines when it comes to debugging, but there are a number of settings and a little extra code that really amp up what you can do. If you are running Statistics in external mode, the obvious debugging features don't require any special settings. Debugging in internal mode, where you start from the Statistics syntax window and invoke a saved Python module, is a really powerful way to work, but you need to do a few special things to make it work. The setup is that you are starting from the syntax window (or an INSERTed syntax file), and the Python code you want to debug is in a saved module that is open in Wing.
First, you need to adjust some settings in Wing. (Some of this information is available from the Wing help material.)
- Go to Edit>Preferences>Debugger>External/Remote. Check the box Enable Passive Listen
- In Debugger I/O, you may want to tinker with the settings for Debug I/O Encoding and Debug Shell Encoding if you are working with characters beyond 7-bit ASCII.
- Open the file wingdbstub.py in the editor. You will find it in the Wing installation directory. It's heavily commented to guide you through the settings, but all you need to do is to find the kEmbedded line and change it to read
kEmbedded = 1
- Make sure that the line that sets the value of WINGHOME points to where you have installed Wing. (This is usually done automatically by the install process.)
- Save wingdbstub.py and copy it also to wherever you keep your Python modules.
The steps above are a one-time thing. The next part is what you do to your code whenever you want to debug. I always insert this code into a new module and just toggle it on or off as needed. You can do that using the Source>Comment Region Toggle menu item. (The exact menu wording varies with different versions of Wing.)
Insert the following code someplace where it will be executed each time you want to debug the code. That means you should not just put it at the top of a module: top-level code runs only once, when the module is first imported, but you need this code to run each time you debug something.
try:
    import wingdbstub
    if wingdbstub.debugger != None:
        wingdbstub.debugger.StartDebug()
except ImportError:
    pass
This code uses a try block so that if you forget to comment it out before distributing a module, your code will still work even though the users don't have Wing. The rest of it ensures that the debugger is listening and can grab control when a breakpoint is encountered. (This is more complicated than the usual debugging situation because of the way Statistics and programmability/scripting are architected.)
Once this block of code has been executed, if a breakpoint has been set, the program will pause at that point and give control to the editor. You can examine or change variables, step through the code from that point or do any other useful debugging operations. Just click Debug to continue execution to the next breakpoint or the end of the code.
This facility is incredibly useful, but there are a few things you can't do this way. If you change code that has already been imported in your Statistics session and rerun the module, the old code will still be used, because import is a no-op if repeated. You can use the Python reload function to get a new version, but that isn't entirely reliable. And any existing variables are not automatically replaced unless the relevant code is re-executed.
It's safer to restart Statistics after you change the code (and don't forget to save the file in the editor!). Because of the way Python scripts (not programs) are executed, you might not have to restart Statistics for those, but if your changes don't seem to be working, it might be because you need to restart.
Finally, if you are not sure whether the debugger is hooked in, watch the icon at the left of the Wing status bar. It will change color when the debugger is active.
Debugger not watching
You may be thinking that having adopted a better IDE, you would like to use it in place of IDLE when you create or edit a script from within Statistics by using File>New>Script or File>Open>Script. How well this works depends on the IDE, but it is possible.
The IDE configuration is specified in the file clientscriptingcfg.ini, which lives in your Statistics installation directory. First, save a copy of the file in case of trouble. Then open the standard version in an editor that understands Unicode and Unix-style line endings. You might be surprised to learn that Wordpad on Windows works.
In the section marked [Python], change the EDITOR_PATH line to point to your chosen IDE. Mine looks like this
EDITOR_PATH=C:\Wing IDE 3.2\bin\wing.exe
The second step is to change the EDITOR_ARGS line to read
Save the file and restart Statistics, and your IDE should come up from the menus. For Wordpad, be sure to save it as a Unicode Text Document, not as RTF. For the Open action, the file to be edited will not be opened, but you can just open it from the editor directly. If you are using Wing, it will remember the file the next time (along with any other files you opened), so this isn't too inconvenient.
You can't do this for programs, so you would have to edit them independently in the IDE.
In summary, using a good IDE will help you get the most out of programmability and maximize your productivity.
IBM SPSS Statistics contains many tools for data management. This post discusses several ways to solve an example data management problem using technology from different areas of the product. The purpose is to help you decide where to look when crafting a solution, considering the available functionality and other characteristics.
The problem to be solved here is to convert a text file of comma-separated id values such as postal codes into an SPSS dataset with one item per line. This might then be used as a lookup table.
The input data for this example are a text file in which each line contains a comma-separated string of values; a line might look like 10001,10023,10101, (note the trailing comma). Some maximum line length or number of values is assumed to be known.
The goal is to produce a dataset where each value is a separate case.
Solution 1. The first approach is to read the text using GET DATA into a set of variables, say, x, y, z - as many as needed. Then use the VARSTOCASES command to restructure these cases into a new dataset. The syntax would simply be
VARSTOCASES /MAKE output FROM x y z /NULL=DROP.
Discussion. VARSTOCASES has many features for more complicated problems of this type such as repeating groups, nonvarying variables and id variables. The restructure data wizard on the Data menu will walk you through this specification, but personally I find the syntax easier to figure out than the wizard in this case. CASESTOVARS does the reverse operation. NULL=DROP means here that if there were fewer than three values on an input line, the empty values would not contribute output cases.
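For concreteness, here is a sketch of the whole Solution 1 job; the file location, the number of variables, and the string widths are my assumptions for illustration:

GET DATA /TYPE=TXT /FILE="c:/temp/codes.txt"
  /DELIMITERS="," /ARRANGEMENT=DELIMITED
  /VARIABLES=x A8 y A8 z A8.
VARSTOCASES /MAKE output FROM x y z /NULL=DROP.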
So, game over, what could be easier? Common data manipulation problems often have a special command available for their solution. (AGGREGATE is another of my favorites.) You might not be so lucky as to have your problem fit exactly into an existing command, so it's worth looking at more general mechanisms. The next solution uses an INPUT PROGRAM.
Solution 2. Input programs are a powerful but little-known mechanism to build cases in a very general way. They take as input raw data described by one or more DATA LIST commands and produce an SPSS dataset that might have a quite different structure. Here is an input program to solve our problem. I've inserted it as an image in order to take advantage of the syntax coloring of the SPSS Syntax Editor introduced in Version 17.
An Input Program to Restructure Variables into Cases
Discussion. This program defines its input with a Data List command for the csv file. It loops over each input record, emitting the text up to the next comma as a case containing a single variable named output. This continues until all the text in the record has been consumed; then it breaks and proceeds to the next input record.
Input programs can deal with hierarchical input, repeating groups, and self-defining data formats, among other things. Why are they not better known? I'll offer several reasons.
- There is no user interface for input programs. For many users, if there is no GUI, the feature doesn't exist.
- The documentation in the Command Syntax Reference is not very good. It is scattered through the separate commands, so it is hard to grasp what input programs can do and how to use them.
- In the database era, data often come in rectangular database tables that are manipulated with SQL or in spreadsheets, so an input program is not needed. Those are the domain of the GET DATA command.
You usually need to experiment with input programs to figure out the solution. Once you get the pattern, though, you can easily solve other similar problems. One SPSS feature that you might not realize is actually an input program is the Make New Dataset with Cases custom dialog box available from the files section of this site. You specify in the dialog the number of variables and cases and a few other options, and it creates a dataset of random numbers. If you look into the dialog syntax, you will see that it generates an input program.
Solution 3. The next approach is to use Python programmability. Python being a very general purpose language, there are many variations that would work (hence the 1/2 in the title). I'll start by getting the data by using the csv module in the Python standard library and generating an SPSS dataset from that input. The csv module understands several common dialects of comma-separated files and allows you to control details of the formats, but we don't need that generality here.
The code loops over the lines in the csv file and then over the values within the line. The csv reader automatically splits the line into fields at the comma boundaries. It regards the line contents following the last comma as an additional empty field, so the code ignores that. The spss.Dataset class is used to generate cases containing the values extracted from the input file. This code would be wrapped in BEGIN PROGRAM/END PROGRAM if run in the regular SPSS syntax stream, but it could also be run in external mode directly from a Python environment. In that mode, which can be much faster than the usual internal mode, no SPSS user interface appears. It is much like using the StatisticsB module included with SPSS Statistics Server, but it runs locally.
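Here is a sketch of what such code might look like; the file name and the 10-byte variable width are my assumptions, not the original post's code:

import csv
import spss

f = open("c:/temp/codes.txt", "rb")     # hypothetical input file
spss.StartDataStep()
ds = spss.Dataset(name=None)            # create a new, empty dataset
dsname = ds.name
ds.varlist.append("output", 10)         # one string variable of width 10
for fields in csv.reader(f):
    for value in fields:
        if value:                       # skip the empty field after the trailing comma
            ds.cases.append([value])
spss.EndDataStep()
f.close()
spss.Submit("DATASET ACTIVATE %s." % dsname)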
Alternatives with the Python approach would be to read the input from the active SPSS dataset and create a new one. The spssdata.Spssdata class could also be used to write the new dataset.
So, which is best? The answer, of course, depends. VARSTOCASES wins if it fits the problem. The choice between an input program and a Python program is partly a matter of taste and skill set, partly whether you want the help of an IDE in creating and debugging the code. Using direct SPSS syntax will generally be faster for passing cases, although for some problems the power of the Python language might win out on performance grounds. Performance improvements in the Dataset class are currently underway.
You can even use the input program within a Python program by just using spss.Submit to run that whole block of code. One reason to do that would be for convenient parameterization of the code while getting the speed advantage of native SPSS case passing.
At least the filename and perhaps the output variable name would vary if you use this code repeatedly. There are three general ways to parameterize it. The traditional route is the SPSS macro language: for both VARSTOCASES and the input program, you could define a macro that substitutes the parameters into the syntax. Running it just means calling the macro, except that you have to explicitly load the file containing the macro in each session.
With Python, you might turn this program into a function of two parameters and then reference those parameters in the code. You would just import the function definition (no location information required) and call it in a short program. You might, in fact, collect a bunch of utility functions in a single module and just import it once in the job. Even without creating a function, though, you could use multiline triple-quoted Python literals and named parameters in the command string to produce readable and flexible code. The Programming and Data Management book, downloadable from this site in the Articles section, illustrates this idea.
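As a sketch of that function approach, here is an input program of the kind described in Solution 2 wrapped in a two-parameter Python function and run with spss.Submit. The function name, file layout, and string widths are illustrative, and the input program body is my reconstruction of the pattern, not the original image:

import spss

def codes_to_cases(filespec, varname="output"):
    # substitute the file name and output variable name into the syntax
    spss.Submit(r"""
INPUT PROGRAM.
DATA LIST FILE="%(file)s" /record 1-200 (A).
STRING %(var)s (A8) #rec (A200).
COMPUTE #rec = record.
LOOP.
COMPUTE #comma = CHAR.INDEX(#rec, ",").
DO IF #comma > 0.
COMPUTE %(var)s = CHAR.SUBSTR(#rec, 1, #comma - 1).
COMPUTE #rec = CHAR.SUBSTR(#rec, #comma + 1).
ELSE.
COMPUTE %(var)s = #rec.
COMPUTE #rec = " ".
END IF.
END CASE.
END LOOP IF #rec = " ".
END INPUT PROGRAM.
EXECUTE.
DELETE VARIABLES record.
""" % {"file": filespec, "var": varname})

codes_to_cases("c:/temp/codes.txt", "output")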
Finally, you could create a custom dialog box using the Custom Dialog Builder and just put the code in whichever form you chose in that dialog. Here's my little dialog box. You could make your own in a few minutes.
If this isn't enough choices, you could also turn the program into an extension command with traditional SPSS syntax. You can read about that in earlier posts in this blog and on this site. That tidies up the log and makes the program available without any explicit load required.
If you have complex file types including nesting, grouped records, mixed record types, and repeating data, you might want to take a look at Appendix C in the Command Syntax Reference, entitled Defining Complex Files.
In summary, you have many tools in SPSS Statistics for problems such as this. Choosing the right tool for the problem and for your skill set will get the job done with the minimum amount of effort. Don't hesitate to venture into areas of SPSS technology that might be new to you.
IBM SPSS Statistics provides several mechanisms for looping in transformations or over groups within a procedure. The new extension commands expand the looping capabilities.
Using the standard capabilities, you can create loops in several ways.
- Transformations can contain general loops using LOOP, and you can loop over variables with DO REPEAT; a sketch follows this list. Since transformations implicitly run within the case processing loop, you cannot include procedures within these loops.
- SPLIT FILE lets procedures iterate over contiguous subgroups of the case data. The procedure's Viewer output either combines the output for all the groups into a single table or produces a set of separate tables and charts organized by group.
- Python programmability allows Pythonistas to iterate over collections of files or variables, among other things. It is very general but requires Python knowledge.
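To make the first item concrete, here are minimal LOOP and DO REPEAT transformation sketches; the variable names are illustrative:

COMPUTE total = 0.
LOOP #i = 1 TO 5.
COMPUTE total = total + #i.
END LOOP.

DO REPEAT v = x1 x2 x3.
COMPUTE v = v * 100.
END REPEAT.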
What none of these methods allows you to do is to apply a set of commands to a collection of inputs and organize the entire set of output by group using regular SPSS syntax. For example, you might want to process a set of data files, run several transformations and procedures, and save all the output for each input file to a separate document. You might also want to save all the transformed datasets. These new commands address problems like this.
In order to generalize the split file idea, you first use SPSSINC SPLIT DATASET to make a directory of datasets, one per split value. The native way to do such an operation is to use the XSAVE transformation command with appropriate DO IF conditions for each group. This works well, but it has three problems. First, you have to have an exhaustive list of all the split values. Second, you have to write a lot of code. Third, the number of XSAVE commands that you can use in a single transformation block is limited. Prior to version 18, the limit was ten. For newer versions the limit is 64. So you have to count up and divide your code into separate blocks. Once you have all this working, if a new split value appears, you have to revise the code - if you notice the new value.
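For example, for a variable region with known values 1 and 2, the native pattern looks roughly like this (file names hypothetical), with one branch per value:

DO IF (region = 1).
XSAVE OUTFILE="c:/splits/region1.sav".
ELSE IF (region = 2).
XSAVE OUTFILE="c:/splits/region2.sav".
END IF.
EXECUTE.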
SPSSINC SPLIT DATASET eliminates these problems. It figures out what split values occur and generates the requisite syntax, taking into account the XSAVE limit. It lets you choose whether to name the outputs by variable values, labels, or sequential numbers, and it can produce a listing of the files created for use in later processing. No risk of unnoticed new values. And the data do not need to be sorted by the split values.
SPSSINC PROCESS FILES addresses the other side of the problem. It accepts an input specification that could be something like a file wildcard, e.g., /mydata/*.sav, or a file that lists the files to process such as produced by SPSSINC SPLIT DATASET. Then it applies the contents of a syntax file to each input, i.e., it loops over the inputs. It defines file handles and macros representing the input and output parameters for each file processed, so you can refer to these explicitly in the iterated syntax. It can write an individual Viewer output file for each input file, or it can produce a single Viewer file with all the output. These automatic files get names based on the input file names, but, of course, you can do other things in the syntax file. It can also produce a log listing all the actions taken and whether any serious errors occurred.
SPSSINC PROCESS FILES is not limited to SAV files or even data. It's up to you what you want to loop over.
SPSSINC PROCESS FILES solves the long-standing request for a way to put procedures inside loops. With these new tools, you can now easily create general transformation loops, loops over variables, procedure loops over groups, or entire job loops over arbitrary inputs whether or not you are a Python person.
These commands can be downloaded from SPSS Developer Central (www.spss.com/devcentral). They require at least IBM SPSS Statistics Version 17 and the Python programmability plug-in.
p.s. Both of these commands have dialog box interfaces as well as standard SPSS-style syntax.
I hope you will find these useful.
I have posted to SPSS Developer Central a new Python-based extension command, SPSSINC SUMMARY TTEST, that does t tests when you have only the summary statistics from the samples rather than the full data. Besides being useful in its own right, it illustrates some useful techniques in doing programmability computations where you need both scalar computations and some SPSS transformation functions.
This command, which is implemented in Python and includes a dialog box interface built with the Custom Dialog Builder, takes as input the counts, means, and standard deviations of the samples and produces several pivot tables with the t test results, confidence intervals, and equal variance test. The output includes asymptotic and exact results for both equal variance and unequal variance cases.
The computations are based on standard formulas, but there are a few tactical issues to work out. First, the formulas require only standard algebra, except that values from the t and F distribution and inverse distribution functions are needed. Those are not available from the Python standard library, although they are available in some third-party Python libraries.
These are, of course, readily available in the IBM SPSS Statistics transformation system. In order to tap the SPSS functions, it is necessary to write a small SPSS dataset with the input values, run some transformation commands on that dataset, and retrieve the values.
The dataset tasks are done most easily with the spss Dataset class. That, however, has to run within an spss DataStep. But the Submit api used to run the transformations cannot be used within a DataStep. Furthermore, the output of the procedure consists of pivot tables, and those can only be produced within a StartProcedure...EndProcedure block. And StartProcedure cannot be called when a Dataset is active.
So here's the drill (sketched in code after the list):
- Calculate all the scalar quantities needed for the distribution functions
- Start a DataStep and use the Dataset class to populate a tiny dataset
- End the DataStep and Submit the necessary COMPUTE commands
- Start a DataStep and use the Dataset class to retrieve the results
- End the DataStep and start a StartProcedure block
- Produce the pivot tables and close the procedure block
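Here is a minimal sketch of that sequence; the names and the single IDF.T computation are my illustration, not the actual SPSSINC SUMMARY TTEST code:

import spss

# 2. start a DataStep and populate a tiny dataset
spss.StartDataStep()
ds = spss.Dataset(name=None)
dsname = ds.name
ds.varlist.append("df", 0)              # numeric variables (type 0)
ds.varlist.append("tcrit", 0)
ds.cases.append([25, None])             # None becomes sysmis
spss.EndDataStep()

# 3. end the DataStep and Submit the COMPUTE commands
spss.Submit("""DATASET ACTIVATE %s.
COMPUTE tcrit = IDF.T(0.975, df).
EXECUTE.""" % dsname)

# 4./5. retrieve the result, then leave the DataStep
spss.StartDataStep()
ds = spss.Dataset(name=dsname)
tcrit = ds.cases[0][1]
spss.EndDataStep()

# 6. produce a pivot table inside a procedure block
spss.StartProcedure("Demo")
table = spss.BasePivotTable("Critical Values", "CriticalValues")
table.SimplePivotTable(rowlabels=["t(25), two-sided .05"],
                       collabels=["value"], cells=[tcrit])
spss.EndProcedure()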
Once you think your way through the constraints, all this takes only a very small amount of code. You can look at the source to see the details. Although I didn't use it for this example, the spssaux3.py module available from Developer Central includes a function, getComputations, that simplifies the task of getting computational results from SPSS into your Python code. It takes as input a set of values and a set of commands and returns a sequence of results.
There is one other interesting issue with implementing this command. The command syntax allows for carrying out a sequence of t tests, so all the input parameters can be lists. The intermediate calculated values are, therefore, also sequences of values. The most straightforward way to process these would simply be to loop the whole process described above. While that would work, I wanted to avoid creating and destroying many datasets, and, more importantly, I wanted all the output to appear as a single procedure in the SPSS Viewer. That means doing all the preliminary calculations; then generating an SPSS dataset with one row for each test; and then iterating over all those rows to produce the output in a single StartProcedure block.
That gets rather tedious, because you have to first initialize a whole bunch of lists for the intermediate variables, and all the formulas then require subscripts everywhere. Ugly. Instead, I took advantage of Python's dynamic and flexible class structure.
First, I defined a completely empty class.

class C(object):
    pass
Useless, right? But Python allows you to add variables (attributes) to class instances dynamically, so in my loop I could just write
c.var1 = ...
where c is an instance of class C. No tedious list initialization.
Now to deal with the list nature of the computations, in my outer loop I also assigned a variable to stand for the list element. So the code starts with this.
c = [C() for i in range(numtests)]
for i in range(numtests):
    d = c[i]
Then the computational lines look like this.
    d.diff = mean1[i] - mean2[i]
so no subscripting is required on the intermediate results. (I could have packaged up the inputs in the same way but left them subscripted since that relates directly to the inputs).
Nowhere was it necessary to write lengthy definition or initialization code, and the computations are clearer than if they were littered all over with subscripts.
I'd like to acknowledge the assistance of Marta García-Granero with the statistical computations and the original inspiration for producing this procedure.
By the way, the test for equal variance is not the Levene test, because that test requires the absolute deviations from the mean, which cannot be computed from the summary statistics.
Well, now that SPSS is part of IBM, I suppose we will need a new name for Developer Central and this blog. In the meantime, though, here is news about recent items on Developer Central.
- With the release of Version 18 of IBM SPSS Statistics and the Developer product, easy-to-install versions of the Python and R materials are posted. In particular, look for the R Essentials link on the main page or from the Plugins page. It installs the R Plugin, the correct version of R, and a bunch of example R integrations as bundles. It's much easier to get going with this now.
- The spssaux2 module has a new function, setMacrosFromData, that reads columns in the active dataset and creates macro definitions from them.
- The SPSSINC TRANS extension command is available. It makes it easy to run a standard or user-written Python function to transform the case data just about the way you would if the function were built in to the regular SPSS transformation system. (I've written about that in an earlier posting here.)
- In connection with this, the extendedTransforms.py module has several new functions and some tweaks to make it work well with SPSSINC TRANS. The new functions are mode and multmode for calculating the mode across variables within a case, packDummies and extractDummies for packing values of dummy variables into bits of a single variable and extracting them out later, and translatechar for mapping a set of characters in a string into another set of characters.
- Albert-Jan Roskam has contributed a module named state.py that will prune unused variables out of a dataset.
- The SPSSINC MERGE TABLES and SPSSINC CENSOR TABLES extension commands now run much faster. (I profiled them and found a few scripting apis that were unexpectedly slow and changed the code to minimize their use.) If you have V18 the speedup is even greater, taking advantage of some improvements in that version. MERGE TABLES also has some convenience tweaks that make it easier to merge significance tests from Custom Tables into the main table. V18 has another way to do this built into the CTABLES command, but some people prefer the older style of the tables.
- The SPSSINC TRANSLATE OUTPUT command is available. It provides a framework for translating the outline and pivot table text for languages where we don't already provide a translation.
- The customstylefunctions module included with the SPSSINC MODIFY TABLES extension command has some new custom functions. You can now move around rows, columns, and layers by using moveRowDimension, transpose, moveLayersToColumns, moveLayersToRows, moveColumnsToLayers, moveRowsToLayers, moveColumnsToRows, and moveRowsToColumns.
- The Programming and Data Management book and the article on writing extension commands have been updated for version 18.
All of this material is free once you have the SPSS base product.
In my last post I wrote about my new extension command, SPSSINC TRANS. That command makes it very easy to apply Python functions to the case data by handling all the data passing, variable creation, etc., so you just have to write one line of Python code to call the function.
I have now posted a substantial rework of the initial beta version. As the saying goes, plan to throw one away: you will anyway. The difficult part of designing and implementing this command was getting the Python function expression through the SPSS Universal Parser, which doesn't speak Python, and then taking it apart and setting up the requisite connections with the data.
My first version was based on regular expressions to extract the parameters and PASW Statistics variable names. That worked well enough for what I originally had in mind, although the regular expressions were a bit complicated. But as I explored the sorts of functions that would be useful with this facility, the problem got more complicated.
- I wanted to support functions that did not have named parameters for everything. The original implementation required function parameters to be specified in the style parm=variable. But many of the built-in Python functions only accept positional arguments.
- I wanted to support lists as parameters so that a bunch of variables could be passed in as a single parameter.
- I wanted to support other more complicated expressions as parameters.
As I thought this over, I realized that instead of my trying to parse Python code, I should let Python do it. Python has a compile function that can compile an expression such as a function call. This is then evaluated using the eval function. Just what I needed. So I ripped out all the original code that sort of parsed the function call expression and used compile to set it up.
It took me a little while to get the hang of how to use compile and what it produces - not the best documentation you might find. The issues were how to know what to import to make the function call valid and how to figure out which parameter values needed to be satisfied by Statistics variables. And a little bit of error handling code to help the user when something isn't right.
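The core of the idea fits in a few lines. Here is a sketch of my own, not the actual SPSSINC TRANS source: compile the expression, inspect the names it references, and evaluate it once those names are bound.

expr = "max(x, 2 * y)"                  # a hypothetical function expression
code = compile(expr, "<string>", "eval")
print code.co_names                     # ('max', 'x', 'y'): the names to satisfy
# functions get resolved via imports; the remaining names come from Statistics variables
print eval(code, {"max": max}, {"x": 3, "y": 7})    # prints 14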
Got all that worked out, so now the command is much more general, and the implementation code is shorter and more robust. Should have thought of this the first time. And because function parameters can now be more general expressions, I axed the ASINTEGER subcommand in favor of just using int(x) in the parameter expression if that capability was needed, which would be infrequent anyway.
Because the code has to pass through the Universal Parser, there are still some expressions that will not work, but you can quote the entire function call expression and be protected from that if needed.
The new version, still considered a beta, is now posted to Developer Central.
So, once again, my hat is off to the Python designers: just about everything in the language is open to use in ordinary program code. Just a bit more work on the documentation, please.
One of my frustrations with programmability is the learning barrier. On the one hand, the resources and capabilities available through Python programmability combined with the plug-in for PASW Statistics are tremendous. Often a problem can be solved in a few lines of Python code that would take a page of code in PASW Statistics or be practically impossible. On the other, the Python language is very different from the Statistics syntax language, so many of those who would benefit from the capabilities Python brings to Statistics are frustrated. Python is an easy language to learn, but it's a programming language, and many of our users are not programmers.
Starting with version 16, we began to broaden the circle of people who could use programmability with the introduction of extension commands. These give traditional-style syntax to Python or R programs. That makes functionality packaged this way accessible to users who don't know Python.
Starting with version 17, we broadened the circle further with the Custom Dialog Builder, which makes it easy to create a point-and-click interface for programs, extension commands, or even traditional syntax.
With version 18, we have greatly simplified installation of these features, which has always been the hardest part, through the Python and R Essentials installers and the ability to build bundles that gather all the dependencies and make it really easy to install a new module or command.
All of these tools are available to users, not just SPSS staff. Anyone with the skills can produce something useful and package it for themselves or consumers who might benefit but don't have the skills or time to create it themselves.
This is great for packages, but there is a lot more useful Python code that isn't packaged this way and remains inaccessible to users who would benefit from it. One thing I have really wanted us to do is to make it possible to call Python functions in the SPSS transformation language just like the built-in transformation functions.
That's an architecturally hard problem that I hope we can solve. But the latest module I have created and posted to Developer Central, SPSSINC_TRANS.py, goes a long way towards doing this. This new extension command, which also has a dialog box interface, makes it easy for nonprogrammers to apply almost any Python function or class to the cases in the active dataset.
In order to produce a transformation of variables in the active dataset using the Python plugin, you have to write code to create the variable definition, access the cases, do the calculations, and save the result to the active dataset.
With this new command, all you have to do is call a function that does the calculation. The rest is handled for you. Yes, you still have to know about the function, its parameters, and how to write the one line of code that calls it, but that's all. You might need to persuade a producer to write a Python function for you, but any consumer can then use it.
There are many very useful functions in the extendedTransforms.py module available from Developer Central. I'll illustrate the usage of the SPSSINC TRANS extension command with the datetimetostr function in that module. datetimetostr makes it easy to create a string with a date and/or time in almost any format. Statistics already has many date/time formats, but users are always asking for others. They could be built out of transformation syntax in most cases with enough effort, but datetimetostr makes this really easy.
The function takes two parameters: a pattern that describes the format you want and a Statistics date/time value, and it produces a string according to the pattern. The pattern notation is described in the extendedTransforms module as part of the function documentation, but here's an example.
My pattern is %A, %B %d, %Y. That means
- day name (%A) followed by a comma and a blank
- month name (%B) followed by a space
- day-of-month (%d) followed by a comma and a blank
- four-digit year (%Y).
This would turn the date value 02/09/1955 (a date value as formatted by Statistics using mm/dd/yyyy) into Wednesday, February 09, 1955. (These names can be localized, too.)
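The pattern codes mirror Python's own strftime directives, so you can preview a pattern at any Python prompt; a quick check of the example above (this is plain Python, separate from the extension command):

import datetime
# %A day name, %B month name, %d day of month, %Y four-digit year
print datetime.date(1955, 2, 9).strftime("%A, %B %d, %Y")
# prints: Wednesday, February 09, 1955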
Here is the command to apply this function to a variable named bdate, creating a new string variable named datestring:
SPSSINC TRANS result = datestring type=30
extendedTransforms.datetimetostr(value=bdate, pattern='''%A, %B %d, %Y''').
That specifies to create a variable named datestring as a string of length 30 and to call the datetimetostr function passing in the value of the Statistics variable bdate for each case, storing the output in datestring. (Yes, I know the triple quotes for the pattern parameter value are a little weird, but it's necessary in order to accommodate the way that literals work in both Statistics and Python.)
So, once the user has discovered this function, learned the parameter names, and read about the pattern language, that's the end of the learning task.
If you download and install this new command, you can read the details by running
SPSSINC TRANS /HELP.
Or you can use the dialog and the dialog help that go with it. The dialog isn't nearly as elaborate as the Compute dialog, but it should be enough to get you going.
So, producers and consumers, what do you think? Post your comments here.
p.s. At this writing, the new command is a beta version. Use it with caution, and please report any bugs you find.
Although SPSS translates the user interface and output to many languages, there is often a need for some output in other languages. The translator package provides tools for user-created translations.
This package, which can be downloaded from SPSS Developer Central, provides code that can read a set of translation definition files and replace text in pivot tables, headings, titles, and outline contents with the matching translation. Details of this process are explained in the documentation in the package.
What I want to write about here is the interesting situation of straddling extension commands and autoscripts that arises in this package.
My first approach was based on autoscripts. Autoscripts are Python or Basic scripts that are triggered by events such as the creation of a pivot table. The Base autoscript, if there is one, is called for every event, and other autoscript files can be attached to particular table types (see Edit>Options>Scripts for details). Python autoscripts, though, are called by using import rather than by invoking a specific function in the script file.
That's fine, but this package also provides an extension command named SPSSINC TRANSLATE OUTPUT. The extension command provides syntax that lets the job stream translate selected output when the syntax is executed rather than when a table is created. This is much easier to install compared to type-specific autoscripts. It also has the advantage of less overhead in many cases. An autoscript has to be loaded afresh each time the triggering event occurs. That means reloading the translation dictionaries each time. (The package documentation discusses how to minimize this startup cost.) The syntax version, though, can translate the entire output or the selected types in one invocation. It only loads the translation dictionaries once per command, although it will load incremental dictionaries as required when table types are encountered.
So mostly the same code needs to work for both autoscripting and the extension command. Since the extension command passes control to Python by calling the Run method in the module while the autoscript executes by using import, making both work is a little tricky.
Importing in Python actually means executing the contents of the imported module. Code inside a function or class would get defined but not executed by import, as def and class are executable statements that define their objects. There is code needed for processing the parsed input for the extension command form, but we want to minimize the overhead of that when running as an autoscript.
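The distinction is easy to see with a toy module (purely illustrative):

# mymodule.py
def f():
    print "f was called"    # defined at import time, executed only when f() is called

print "module imported"     # top-level code: runs when the module is first imported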
The solution is simple. The file SPSSINC_TRANSLATE_OUTPUT.py has the Run method required for handling the parsed input and processes the syntax. It is not used when an autoscript is run. Run then calls code in translator.py, which is also the module used for autoscripting and has most of the code for this package.
When translator.py is imported, its last line, a call to dotrans, is executed, because it is outside any function definition. This starts the translation process.
When the extension command is run, it calls dotrans but does not set the value of the importing parameter; dotrans itself has a default value of False for this parameter. Combining this with the relevant api allows dotrans to figure out what to do. When imported as an autoscript, dotrans gets called and does the translation. When imported by the Run method in SPSSINC_TRANSLATE_OUTPUT.py, the translation is not executed until the Run method has set up the proper parameters and called dotrans.
There are other ways to solve this problem, but this solution shows some of the advantages of being able to use the same Python module both as an autoscript and as a collection of callable functions and classes.
A common need with SPSS Statistics is to produce some statistical results and then use them for further processing. We sometimes call that "output as input". This is very straightforward if you can do it with AGGREGATE or procedures that produce results as SPSS datasets, but it is possible to do this for anything that SPSS Statistics can put in a pivot table.
One way to retrieve output for reuse is to write a Basic or Python script. The output would be produced as pivot table(s) in the Viewer. Then the script would search through the Viewer document and find the desired table. Using something like the GetValueAt api of the Datacells class (Python) or the ValueAt property of the Data Cells object (Basic), a program can retrieve cell values. The script might be kicked off via a SCRIPT command.
This works, but it can be tricky to program, and it is roundabout and inefficient. And in distributed mode with SPSS Statistics Server, this solution is unavailable.
Back in SPSS version 12 we introduced the Output Management System (OMS), but many users have only a vague understanding of this powerful mechanism. It provides a much easier and more efficient solution to grabbing output. Combining OMS and programmability, the output could still be processed by a Python or Basic script, but it could also be retrieved by a Python or .NET program - instead of a Python script - by using the XML workspace. This allows for better synchronization, and it works in either local or distributed mode.
OMS is a listener for the output. It is not built in to particular procedures. In fact, the procedure does not even know that something is listening. Rather, you start OMS listening for particular objects. When an object of interest comes along, OMS keeps a copy. When you stop the listener, the captured objects are written to memory, to a dataset, or to a file and are available to your SPSS syntax or programmability code.
You tell OMS what to listen for by selecting the types of objects, most often TABLES, and, if desired, the particular types of tables you want, such as a crosstabs table. You stop the listener with the OMSEND command. The OMS command specifies what the output format should be - including XML, SAV, HTML, text, Excel, Word, and PDF - and where to write it.
If you write the output to a dataset, then you can activate that dataset and apply standard SPSS commands to it. You can also access the dataset with Python programmability, whether or not it is activated, using the Dataset class in the spss module.
A more general mechanism is to have OMS write to the XML workspace. This is an in-memory structure that can be read by Python or .NET code. The OMS command assigns a name to the workspace item it creates. Then the program code can retrieve all or a selected part of that item using the GetXmlUtf16 Python api. (Similarly for .NET.) You write an XPath expression to say which part of the xml you want.
XML and XPath are very powerful but can be a bit intimidating, so we have provided some Python helper functions to make it easy. In the spssaux module, which is installed with the Python plug-in, there is a function, createXmlOutput, that takes care of the OMS wrapper and writing to the workspace. All you give it is the command syntax you want and the identifiers for the type of table you are interested in.
Correspondingly, getValuesFromXmlWorkspace can retrieve specific information from the workspace item created by the first function. You use the visible properties of the table to determine what is to be retrieved. And then you are off to the races.
Here is an example. Let's say you want to run a regression and do something if the R Square statistic is too small. The R Square is in the Model Summary table. Here's an example of that table.
So the task is to run the regression and retrieve the second column of this table. Here is a little Python program to do this. It expects that the cars.sav data file shipped with SPSS Statistics is the active dataset.
import spss, spssaux
cmd="""REGRESSION /DEPENDENT mpg
/METHOD=ENTER horse weight."""
tag, errorlevel = \
    spssaux.createXmlOutput(cmd, omsid='Regression', subtype='Model Summary')
Rsquare = spssaux.getValuesFromXmlWorkspace(tag, 'Model Summary',
    colCategory="R Square", cellAttrib="number")[0]
if Rsquare < .7:
    print "R Square is too small:", Rsquare    # or whatever action is appropriate
Let's walk through this code.
- The cmd= line is the syntax to run to create the output we want to harvest. It could be more than one command.
- The createXmlOutput call runs the command, specifying that we are interested in the Model Summary table of the Regression command. It returns two values: a tag to use when retrieving output, and an error code, which is ignored in this example.
- The getValuesFromXmlWorkspace call uses the tag and the OMS table subtype along with specifying the part of the table we want. Looking at the example table, we see a column label that can be used for retrieval. That column will have its value stored as both text and a number in the xml, so we specify that we want the number form. The function returns a list of the things that matched. Here we take the first and only element.
We know what to retrieve just by looking at the labels in the table. In this example, just identifying the column is enough, but you can also specify row labels. Some tables are too complicated for this approach, but a great many things can be done using this simple model. The spssaux module also has a createDatasetOutput function that works in a similar way but creates an SPSS dataset instead of xml. Values would be retrieved from that dataset with the Dataset class or a cursor object.
Note that this table was not retrieved from the Viewer. It was captured by the OMS listener and placed in the workspace, from which the Python code extracted it.
Inside these Python functions, OMS commands and XPath expressions were generated, but you don't need to learn those technologies in order to benefit from them.
In my last post, I wrote about the productivity achievable with Python, telling the story of creating the SPSSINC TURF extension command and dialog box. Well, when the cat's away, the mice will play. This post is about scalability and optimizing the TURF algorithm.
The TURF algorithm is computationally explosive. It has to compute a number of set unions that grows very rapidly as the problem size grows, and then harvest the best of them. Apart from the number of cases, which affects the time required to compute a set union and the amount of memory required, the size is determined mainly by the number of variables, N, and the maximum number of variables that can be combined (e.g., best three variables), the depth.
If we hold the depth constant at 10, i.e., find the best combination of up to 10 variables, the number of unions required grows like this as we increase the number of variables.
Looking in another dimension, fixing the number of variables at 24 and varying the depth, the union count grows like this.
48 variables and a depth of 24 would require 156,861,290,196,829 unions!
This clearly can get out of hand with what seem to be reasonable problems! I added code to precalculate and report the number of unions required and syntax to set a limit on problem size to make the user better informed, but that is not enough.
In calculating the set unions, I was careful to make the code pretty efficient. That works well. But I found that as the number of sets got into the millions, the algorithm stalled and eventually exhausted the machine memory and failed.
Some experimentation showed that the set calculations completed in a reasonable amount of time, but finding the best combinations was very slow. A little optimization of that part of the Python code sped it up by a factor of 4. But I could see that what was killing the code was my strategy of first accumulating the reach count for all the sets and then picking out the best ones. Even though each reach statistic saved added only a small object to the list, the number of such objects was coming to dominate memory and time usage.
Handling the result list had initially seemed to me to be a trivial part of the process, but it clearly is not as the size grows. So I needed to change the code to keep reach counts only for combinations that have a chance of making the final list. Each new count needs to be checked against the counts already kept: the code should discard it if it is dominated by the others, and it should use it to replace the worst count in the list if it is better.
Doing this efficiently requires an entirely different data structure for keeping the list of counts. A heap is a data structure that has two useful properties for this problem. First, the smallest element is always at the head of the list, and, second, elements can be added or deleted quickly while maintaining the heap property.
Python provides a heap data structure in the heapq module in the standard library. For this problem, I actually needed to keep a list of heaps, one for each combination size, but I could use the heapq functions for each one. One other problem is that heap items are ordinarily a single number, and I needed to keep something a little more complicated (the count and a list of the variables that produced it). Because Python does not use strong types, I could easily create an object that acted like an integer for comparison purposes but held all the information I needed. A Python motto is, "if it walks like a duck and talks like a duck, it's a duck".
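Here is a sketch of that duck-typed heap item and the keep-only-the-best logic; the names are mine, not the actual TURF source:

import heapq

class Reach(object):
    """Compares like its count but carries the winning variable combination."""
    def __init__(self, count, variables):
        self.count = count
        self.variables = variables
    def __lt__(self, other):            # heapq only needs less-than
        return self.count < other.count

def keep_best(heap, item, k):
    """Keep only the k largest items; the smallest survivor is always heap[0]."""
    if len(heap) < k:
        heapq.heappush(heap, item)
    elif heap[0] < item:                # beats the current worst
        heapq.heapreplace(heap, item)

best = []                               # one such heap per combination size
keep_best(best, Reach(1200, ("v1", "v3", "v7")), 10)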
With these changes, the result management parts of the algorithm now run in a constant and small amount of memory and the harvesting of the best combinations is very fast. A test problem that previously died after consuming 1.5GB of memory now runs in 30MB - and finishes. Of course, constructing and counting all those set unions can still take a long time, but that's the nature of the problem. I am not going to try that problem requiring 156 trillion sets.
There are several points to this story.
- You may find a need to optimize in places you didn't expect: testing and measuring is important.
- Producing an initial version and putting it out for real world use can quickly flush out the places that need work. Being able to respond quickly and outside annual product release cycles as we can on Developer Central is a great help. It can change how one builds software.
- Python provides a rich set of technology that can be tapped without the need to invent or reinvent the basics.
You can download the SPSSINC TURF module from the Files section of this site. It requires at least SPSS Statistics Version 17 and the Python plug-in. And it's free.