In my last post, I wrote about the productivity achievable with Python, telling the story of creating the SPSSINC TURF extension command and dialog box. Well, when the cat's away, the mice will play. This post is about scalability and optimizing the TURF algorithm.
The TURF algorithm is computationally explosive. It has to compute a number of set unions that grows very rapidly as the problem size grows and then harvest the best. Apart from the number of cases, which affects the time required to compute a set union and the amount of memory required, the size is determined mainly by the number of variables, N, and the maximum number of variables that can be combined (e.g., best three variables), which I will call the depth.
If we hold the depth constant at 10, i.e., find the best combination of up to 10 variables, the number of unions required grows like this as we increase the number of variables.
Looking in another dimension, fixing the number of variables at 24 and varying the depth, the union count grows like this.
48 variables and a depth of 24 would require 156,861,290,196,829 unions!
This clearly can get out of hand with what seem to be reasonable problems! I added code to precalculate and report the number of unions required and syntax to set a limit on problem size to make the user better informed, but that is not enough.
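Since the union count is just a sum of binomial coefficients, the precalculation is cheap. Here is a sketch of that computation (my own code, not necessarily what the extension uses):

def n_unions(nvars, depth):
    """Set unions TURF must evaluate: C(nvars, 2) + ... + C(nvars, depth).
    Single variables need no union, so the count starts at pairs."""
    total, c = 0, nvars                  # c holds C(nvars, 1)
    for k in range(2, depth + 1):
        c = c * (nvars - k + 1) // k     # running binomial coefficient C(nvars, k)
        total += c
    return total

print(n_unions(48, 24))                  # 156861290196829, the number quoted above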
In calculating the set unions, I was careful to make the code pretty efficient. That works well. But I found that as the number of sets got into the millions, the algorithm stalled and eventually exhausted the machine memory and failed.
Some experimentation showed that the set calculations completed in a reasonable amount of time, but finding the best combinations was very slow. A little optimization of that part of the Python code sped it up by a factor of 4. But I could see that what was killing the code was my strategy of first accumulating the reach count for all the sets and then picking out the best ones. Even though each saved reach statistic added only a small object to the list, the number of such objects was coming to dominate memory and time usage.
Handling the result list had initially seemed to me to be a trivial part of the process, but it clearly is not as the size grows. So I needed to change the code to keep reach counts only for combinations that have a chance of making the final list. Each new count needs to be checked against the counts already computed. The code should discard the new count if it is dominated by the others, and it should replace the worst count already in the list if the new one is better.
Doing this efficiently requires an entirely different data structure for keeping the list of counts. A heap is a data structure that has two useful properties for this problem. First, the smallest element is always at the head of the list, and, second, elements can be added or deleted quickly while maintaining the heap property.
Python provides a heap data structure in the heapq module in the standard library. For this problem, I actually needed to keep a list of heaps, one for each different combination size, but I could use the heapq functions for each one. One other problem is that heap items are typically a single number, and I needed to keep something a little more complicated (the count and a list of the variables that produced it). Because Python typing is dynamic, I could easily create an object that acted like a number for comparison purposes but held all the information I needed. A Python motto is, "if it walks like a duck and quacks like a duck, it's a duck".
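Here is a sketch of the idea; the names are mine, not the extension's. A tiny class orders by its count, which is all heapq needs, while also carrying the variable combination along:

import heapq

class ReachItem(object):
    """Orders like a number but carries the variable combination along."""
    def __init__(self, count, variables):
        self.count = count
        self.variables = variables
    def __lt__(self, other):             # all that heapq needs for comparisons
        return self.count < other.count

best = []                                # the heap for one combination size
KEEP = 10                                # how many top combinations to retain

def consider(count, variables):
    if len(best) < KEEP:
        heapq.heappush(best, ReachItem(count, variables))
    elif best[0].count < count:          # beats the current worst? swap it in
        heapq.heapreplace(best, ReachItem(count, variables))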
With these changes, the result management parts of the algorithm now run in a constant and small amount of memory and the harvesting of the best combinations is very fast. A test problem that previously died after consuming 1.5GB of memory now runs in 30MB - and finishes. Of course, constructing and counting all those set unions can still take a long time, but that's the nature of the problem. I am not going to try that problem requiring 156 trillion sets.
There are several points to this story.
- You may find a need to optimize in places you didn't expect: testing and measuring are important.
- Producing an initial version and putting it out for real world use can quickly flush out the places that need work. Being able to respond quickly and outside annual product release cycles as we can on Developer Central is a great help. It can change how one builds software.
- Python provides a rich set of technologies that can be tapped without the need to invent or reinvent the basics.
You can download the SPSSINC TURF module from the Files section of this site. It requires at least SPSS Statistics Version 17 and the Python plug-in. And it's free.
I have posted to SPSS Developer Central a new Python-based extension command, SPSSINC SUMMARY TTEST, that does t tests when you have only the summary statistics from the samples rather than the full data. Besides being useful in its own right, it illustrates some useful techniques in doing programmability computations where you need both scalar computations and some SPSS transformation functions.
This command, which is implemented in Python and includes a dialog box interface built with the Custom Dialog Builder, takes as input the counts, means, and standard deviations of the samples and produces several pivot tables with the t test results, confidence intervals, and equal variance test. The output includes asymptotic and exact results for both equal variance and unequal variance cases.
The computations are based on standard formulas, but there are a few tactical issues to work out. First, the formulas require only standard algebra, except that values from the t and F distribution and inverse distribution functions are needed. Those are not available from the Python standard library, although they are available in some third-party Python libraries.
These are, of course, readily available in the IBM SPSS Statistics transformation system. In order to tap the SPSS functions, it is necessary to write a small SPSS dataset with the input values, run some transformation commands on that dataset, and retrieve the values.
The dataset tasks are done most easily with the spss Dataset class. That, however, has to run within an spss DataStep. But the Submit api used to run the transformations cannot be used within a DataStep. Furthermore, the output of the procedure consists of pivot tables, and those can only be produced within a StartProcedure...EndProcedure block. And StartProcedure cannot be called when a Dataset is active.
So here's the drill (sketched in code after the list):
- Calculate all the scalar quantities needed for the distribution functions
- Start a DataStep and use the Dataset class to populate a tiny dataset
- End the DataStep and Submit the necessary COMPUTE commands
- Start a DataStep and use the Dataset class to retrieve the results
- End the DataStep and start a StartProcedure block
- Produce the pivot tables and close the procedure block
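In code, the flow looks roughly like this. This is only a sketch, not the command's actual source; the dataset contents, the distribution function, and the output table are stand-ins:

import spss

# build a tiny dataset holding the values the distribution functions need
spss.StartDataStep()
ds = spss.Dataset(name=None)           # a new, empty dataset
ds.varlist.append("t", 0)              # one numeric variable
ds.cases.append([2.3])
dsname = ds.name
spss.EndDataStep()

# let SPSS transformations compute the distribution values
spss.Submit("""DATASET ACTIVATE %s.
COMPUTE sig = 2 * (1 - CDF.T(ABS(t), 10)).
EXECUTE.""" % dsname)

# retrieve the result in a second DataStep
spss.StartDataStep()
ds = spss.Dataset(name=dsname)
sig = ds.cases[0][-1]                  # sig was appended as the last variable
spss.EndDataStep()

# produce pivot tables inside a procedure block
spss.StartProcedure("SPSSINC SUMMARY TTEST")
table = spss.BasePivotTable("T Test", "TTest")
table.SimplePivotTable(rowlabels=["Sig. (2-tailed)"],
                       collabels=["Value"], cells=[sig])
spss.EndProcedure()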
Once you think your way through the constraints, all this takes only a very small amount of code. You can look at the source to see the details. Although I didn't use it for this example, the spssaux3.py module available from Developer Central includes a function, getComputations, that simplifies the task of getting computational results from SPSS into your Python code. It takes as input a set of values and a set of commands and returns a sequence of results.
There is one other interesting issue with implementing this command. The command syntax allows for carrying out a sequence of t tests, so all the input parameters can be lists. The intermediate calculated values are, therefore, also sequences of values. The most straightforward way to process these would simply be to loop the whole process described above. While that would work, I wanted to avoid creating and destroying many datasets, and, more importantly, I wanted all the output to appear as a single procedure in the SPSS Viewer. That means doing all the preliminary calculations, then generating an SPSS dataset with one row for each test, and then iterating over all those rows to produce the output in a single StartProcedure block.
That gets rather tedious, because you have to first initialize a whole bunch of lists for the intermediate variables, and all the formulas then require subscripts everywhere. Ugly. Instead, I took advantage of Python's dynamic and flexible class structure.
First, I defined a completely empty class.
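Something like this:

class C(object):
    pass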
Useless, right? But Python allows you to add variables (attributes) to class instances dynamically, so in my loop I could just write
c.var1 = ...
where c is an instance of class C. No tedious list initialization.
Now, to deal with the list nature of the computations, in my outer loop I also assigned a variable to stand for the list element. So the code starts with this.
c = [C() for i in range(numtests)]
for i in range(numtests):
    d = c[i]
Then the computational lines look like this.
d.diff = mean1[i] - mean2[i]
so no subscripting is required on the intermediate results. (I could have packaged up the inputs in the same way but left them subscripted since that relates directly to the inputs).
Nowhere was it necessary to write lengthy definition or initialization code, and the computations are clearer than if they were littered all over with subscripts.
I'd like to acknowledge the assistance of Marta García-Granero with the statistical computations and the original inspiration for producing this procedure.
By the way, the test for equal variance is not the Levene test, because that test requires the absolute deviations from the mean, which cannot be computed from the summary statistics.
Transformation commands in IBM SPSS Statistics fall basically into two categories: those that are executed when the flow logic gets to them, and those that are executed immediately when read, i.e., before other transformations are executed.
Transformation commands that define variable properties such as VARIABLE LABELS, VALUE LABELS, and MISSING VALUES are generally executed as soon as they are read. That means that they may be executed out of order compared to the execution of the job stream, and they can't be controlled by transformation logic such as DO IF or DO REPEAT. So you can't define properties conditionally, and you can't iterate over them.
Usually you can ignore this behavior, but here's a case where it matters. Suppose you want to create syntax that will eliminate all the variable labels in a dataset (never mind why you might want to do this - this is just an example). The first try might be this do repeat loop.
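It would look something like this, where firstvar and lastvar stand for the actual first and last variable names in the file:

DO REPEAT v = firstvar TO lastvar.
VARIABLE LABELS v ''.
END REPEAT.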
But you get this puzzling error message.
Error # 4530. Command name: variable label
>This command is not allowed inside the DO REPEAT/ END REPEAT facility.
That's because metadata commands are not subject to the flow control of a loop. In addition, this is a little ugly, because you had to know the names of the first and last variables (in file order) in order even to write the loop command. So this syntax isn't general.
The second try works, but it's way too ugly to contemplate. I couldn't ever bear to write out the whole command.
It works, because you can separate variable specifications with /, and if there is no label in a section, the label is empty, i.e., removed.
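In miniature, with hypothetical names, it is just this, extended to every variable in the file:

VARIABLE LABELS v1 /v2 /v3.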
The full command is too painful to write, but it offers a clue. If we could only generate that slash-separated list automatically, we could feed it to this command. That's where SPSSINC SELECT VARIABLES comes in. This is an extension command available from the SPSS Community (www.ibm.com/developerworks/spssdevcentral if you aren't already reading this from the site), and it requires the Python Essentials. This command allows you to define an SPSS macro consisting of a list of variables that meet various criteria. It could be an explicit list, but you can also filter on variable type, patterns in names, custom attributes, and/or measurement level. With no filtering, it selects all the variables in the dataset. The macro text replaces the macro name reference when it is encountered in the syntax.
So we're almost there. If I run
spssinc select variables macroname = "!all". (note the quotes)
I get a list of all the variables that I could feed to the VARIABLE LABELS command. But that won't quite work, because I need those slashes to separate the items in the list. This will do the trick.
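Assuming the command's separator option (run SPSSINC SELECT VARIABLES /HELP to confirm the exact keyword), that is something like:

SPSSINC SELECT VARIABLES MACRONAME="!all" SEPARATOR="/".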
(It is conventional to start macro names with "!" so that they won't collide with variable names or syntax constructs.)
Now all I have to do is this.
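VARIABLE LABELS !all.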
This gives me syntax that will work for any data file regardless of the variable names.
In order to use this command, you have to install the extension command (and the Essentials if you don't already have that), but then you can forget that it is an extension. It works just like the built-in commands. The more you can generalize your syntax, the less work you have to do, and the less the chance of errors creeping in.
A final note: this example is entirely serendipitous. I created this command for selecting subsets of the variables for automating analysis tasks. Here it is being used to select all the variables. And I thought the only useful separators between the selected variables would be a blank for ordinary variable lists and + for CTABLES expressions, but here / is required.
Custom dialog boxes (created with the Custom Dialog Builder) can be installed interactively from the IBM SPSS Statistics Utilities menu (Install Custom Dialog in Version 17 or Custom Dialogs > Install Custom Dialog in Version 18). This works fine, but if you have a lot of dialogs to install or need to do this on many computers, it gets rather tedious. Fortunately you can automate this process with a simple bat file. Here's how.
A custom dialog in Version 17 is an spd file. Open that in Winzip or any application that can read zip format. There will be two or more files there: one has extension .xml, another .properties. There may also be htm and image files, and if there are translations, there will be multiple .properties files.
I'll assume that the target system has SPSS installed in directory c:\x.
There will be a directory named c:\x\ext\lib.
In that directory, make a subdirectory whose name is the root filename you see in the zip. If the zip file has files named dialog1.properties etc., that would be a subdirectory named dialog1.
Copy all of the files to that new directory.
The next time SPSS is started, the dialog will show up.
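Assuming the locations above, the bat file might be no more than this; c:\x and dialog1 are placeholders, and the spd contents are assumed to be already extracted alongside the bat file:

mkdir "c:\x\ext\lib\dialog1"
copy dialog1.* "c:\x\ext\lib\dialog1"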
If the dialog is implemented as a Python or R extension command, those files will not be in the spd but need to be copied to the extensions subdirectory of the SPSS installation or other appropriate location. They would typically be an xml file defining the syntax and one or more Python or R files.
If the target system is running Vista and SPSS is installed under Program Files, you might need admin permissions to run your bat file.
Version 18 can use another format known as spe that bundles all the implementation files together along with some descriptive information, but you can similarly extract and copy the files. It will include an spd file as above along with the implementation files.
Note that if you are using distributed mode (IBM SPSS Server), the implementation files will be needed on the Server computer.
IBM SPSS Statistics Version 18 introduced a new variable property: role. The role can be Input, Target, Both, None, Split, or Partition. This new metadata comes from IBM SPSS Modeler and is useful in abstracting and generalizing jobs.
Roles are normally set by the user. Currently, these simply make initial settings in some dialog boxes. But if the roles are set correctly, it becomes possible to automate and raise the level of abstraction of repetitive tasks. For example, you might need to produce a standard set of analyses/reports across a variety of datasets that have a similar structure but vary in the exact variables they contain or other details. By abstracting the logic of a job to use roles, measurement levels, custom attributes, and other variable properties, you can reduce the number of versions of a job that need to be developed and maintained. This can save time and reduce the number of errors.
I have seen customer sites where there are huge numbers of job files - syntax, templates, macros, scripts, etc. - that are very similar but duplicated and modified, because the variables coming in are a little different or the coding of variables is a little different. Once you build a big set of jobs like this, making improvements or bug fixes becomes a nightmare, not to mention the extra time it takes to do things this way.
The long-standing macro facility provides some possibilities for abstraction, but it is static and can't use the metadata in a dataset. In contrast, the SPSSINC SELECT VARIABLES command allows you to define sets of variables based on the metadata rather than just a hard-coded list of names. It can use explicit names, patterns in names (all the variable names that contain AGE), measurement level, type (numeric vs string), custom attributes, and, finally, role, to define sets of variables that can be used in the job. These sets are embodied in, yes, macros. Of course, you could write your own code to use this sort of information, but SELECT VARIABLES can do a lot of this without the need to learn programmability.
And it has a dialog box interface, as shown here.
For example, suppose you have a mostly standard questionnaire that is used in many studies, but it has a few custom questions that vary from study to study, or some variables are sometimes omitted. You need to produce tabulations and estimate similar models for these studies. By intelligent use of the metadata, including role, you can perhaps have one master job rather than dozens. This leaves the analyst or researcher free to focus on the brain work part of the job rather than the tedious, mechanical, and error-prone parts.
If you have a data supplier who collects and prepares your datasets, you can instruct them on what roles and custom attributes should be defined. Then your analysis syntax can at least in part be based on these properties.
Custom attributes, first introduced in SPSS version 14, can also hold metadata such as questionnaire text, interviewer instructions, measurement units, or anything else that is useful in documenting your data or in programmatically manipulating it. In syntax, these can be created with the VARIABLE ATTRIBUTE command. (There is also a DATAFILE ATTRIBUTE command.) Roles can be defined with the VARIABLE ROLE command. Attributes and roles can also be defined in the Data Editor or the Define Variable Properties dialog. They all persist with the saved data. These can be used in Modeler, too.
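For example (the variable names and attribute are hypothetical):

VARIABLE ROLE /TARGET satisfaction /INPUT age income region.
VARIABLE ATTRIBUTE VARIABLES=age ATTRIBUTE=QuestionText('What is your age?').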
In summary, it's all about generalization and automation. Role is just one more attribute that can be used in this effort. SPSSINC SELECT VARIABLES can be obtained from the SPSS Community and requires the Python programmability plug-in/Essentials.
Case-control matching is a popular technique used to pair records in the "case" sample with similar records in a typically much larger "control" sample based on a set of key variables. This post discusses the FUZZY extension command for SPSS Statistics that implements this technique and some recent enhancements to it.
In this discussion I will refer to records rather than what SPSS usually calls cases in order to avoid confusion with case as in case control.
FUZZY takes two datasets as input (the demander and supplier datasets), matches the records according to a set of BY variables, and provides various ways of writing the output. It does not have a dialog box interface, but running FUZZY /HELP displays the complete syntax. Matches can be required to be exact on all variables, or a tolerance or "fuzz" factor, which could be zero, can be specified for each matching variable. (String matches can only have fuzz 0.)
Matches are not always possible: missing values or blank strings in a BY variable preclude matching. There might not be a close enough record in the supplier dataset to pair with a demander record. You might run out of eligible supplier records before all the demander records are paired. Unpaired output is set to the system-missing value or, for strings, to blank.
FUZZY proceeds by finding for each demander record all of the supplier records that are close enough on the BY variables. This requires a lot of comparisons! It then proceeds through the demander records and picks a supplier record at random from all those eligible for that record. (You can request multiple supplier records for each demander if needed.) No attempt is made to find the closest eligible record, since there is no measure of closeness across the set of BY variables and for other reasons.
If sampling from the suppliers is done without replacement, which is the default behavior, then using a supplier record makes it unavailable for matching in later demander records and could result in a later record having no match. With fuzzy matches, it could be that both records would have been matched if the order were reversed.
While one would generally want to specify an exact match for categorical variables, at least those with nominal measurement level, continuous variables such as income or age might require some fuzz. New output from FUZZY can help to diagnose which BY specifications cause a record to go unmatched. Here is a table produced by FUZZY that shows how the BY criteria restrict the matches.
In this example, we are matching two datasets about vehicles on the variables origin and cylinder. We require an exact match on origin, but allow a difference of up to 3 in the number of cylinders as shown in the first column.
The first row records the number of comparisons between demander and supplier records testing for an exact match: comparing each demander record to each supplier record, 95% of the comparisons were not exact matches, and only 5% matched exactly.
Next, the table shows that, among the comparisons remaining after removing the 5% that matched exactly, 85% did not match on origin. Then, considering only the records where there was an exact match on origin, 75% of the comparisons did not match on cylinder. Each row of the table is based on the comparisons that passed, i.e., were within tolerance, on all of the preceding rows.
This table, which precedes the actual pairing step, can be useful in finding the variables that are most important in preventing matches. You may need to increase the tolerance, or you may just be out of luck if there are insufficient matches. The table does not tell you how successful the actual pairing step will be, because supplier records will be used up as the pairing pass proceeds, but it gives insight into how the variables filter the matches. The tries and rejection columns are only produced when you specify fuzz, but you could set all the fuzz values to 0 to see the results with exact matching.
The next table shows the distribution of eligible matches for the pairing pass (this example is based on a very small dataset). It shows how many eligible records there were for each demander record in the pairing pass. It shows that there were two demander records for which there were zero eligible supplier records, three where there was only one, and one where there were two to ten eligibles. This gives you a good idea of how rich the supplier dataset is in matchables, but it doesn't say anything about which variables have the biggest effects on pairing.
If you are lucky, all your demander records will find a match, but if they don't, what can you do? Recall that the pairing stage is first come, first served. With fuzzy matching, reordering the demander records might work better. While FUZZY can't find an optimal order, one more output feature can help you improve the results. Specifying DRAWPOOLSIZE=varname will add a variable to the demander dataset that records how many eligible supplier records there were for each demander record. You can then study the characteristics of the demander records where suppliers are scarce to see where the supplier dataset is too thin. A good start, though, to improving the pairing percentage is to sort the demander dataset by the newly created pool size variable. That puts the demander records with the fewest eligible suppliers first in line for a match when you rerun the process and will generally reduce the number of unmatched records, if a reduction is possible.
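For example, if you specified DRAWPOOLSIZE=poolsize, preparing for the rerun is just:

SORT CASES BY poolsize (A).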
There are statistical issues regarding how you choose the BY variables that are not addressed here. Searching the web for something like "case control matching" will turn up numerous references.
FUZZY is available from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) and requires the Python Essentials for your version of SPSS Statistics.
The SPSS Statistics Custom tables (CTABLES) procedure can test equality of column proportions and column means. The test results are displayed either in a separate table or, using APA style, in the main table. Users commonly want to highlight significant differences, which is not possible using built-in functionality. This article explains how to do this.
Setup: In order to use the approach described here, you need the Python Essentials, including the SPSSINC MERGE TABLES and SPSSINC MODIFY TABLES extension commands. If these are not already installed, they can be obtained via the SPSS Community website. Details of what is installed by default vary across versions of Statistics. If your installation of SPSSINC MODIFY TABLES does not have the custom function described below, get a newer version from the website.
Here is the syntax and output from CTABLES testing equality of column proportions. The example uses the employee data.sav file shipped with Statistics. Our goal is to show the significant differences in the main table and highlight them.
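The test syntax was along these lines, using the educational level and employment category variables from that file (ALPHA is shown at its default):

CTABLES /TABLE educ BY jobcat
  /SIGTEST TYPE=PROP ALPHA=0.05.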
You might first ask why we did not start with the single table output in APA style, since we want to show significance in the main table. There are two reasons: first, in my experience, nobody likes the APA style, and, second, it won't help us to color the significant cells anyway.
If you have Statistics version 22 or later, you might think that the new OUTPUT MODIFY command or Table Style subdialog could handle this task, but it can't, because the significance values do not occur in the test table with their own column. While OUTPUT MODIFY is a fabulous tool for unconditional or conditional formatting of tables and is very easy to use, this problem is too complicated for it.
We should, before proceeding further, pause to clarify what we mean by coloring significant cells. Significance is a property of pairs of cells, not of individual cells. In the test table above, the columns are lettered A to C, and the key of a column with a significantly smaller column proportion is shown in the category of the column with the larger proportion. Without the letter codes, you would know, for example, that educational level 15 is significantly larger for the Clerical category than for some other job categories, but you would not know which ones. So we can't just replace the letter codes with a color without a loss of information.
We can, however, color cells corresponding to the significance codes while retaining the specific information about which categories they refer to. There are two steps to the process. The first step is to merge the test table into the main table. In order to do that, we need to use the SPSSINC MERGE TABLES extension command. MERGE TABLES works by joining columns (or rows) of tables based on the column labels, possibly adjusting for extra terms appearing in only one. In this example, we will join the tables based on the job category labels - Clerical, Custodial, and Manager, ignoring the Count and letter codes that also appear. Fortunately the default settings in MERGE TABLES do this automatically.
Using the MERGE TABLES dialog box under Utilities > Merge Viewer Tables, the dialog would look like this.
We didn't specify table types, so the command uses the most recent pair of tables in the Viewer. We are merging by column based on the Count statistic. In this case, that is the only statistic, but if the table also contained, say, percentages, specifying Count would single out the Count columns for the merge. MERGE TABLES has many other features for more complex merging. Check out the help if your tables are too complex for a simple merge.
Here is the result of the merge.
As you can see, the letter codes from the secondary tables now appear in the corresponding cells in the main table, and the caption from the secondary table has been attached to the main table. Now we want to go further and highlight the cells that have a letter code. This is where the SPSSINC MODIFY TABLES (Utilities > Modify Table Appearance) extension command comes in.
MODIFY TABLES can apply conditional or unconditional styling to table cells and labels. It has built-in syntax for selecting cells and bolding, foreground and background coloring, and other properties, but here we need to go a little further. We need to select the cells that contain a letter code and then apply styling. MODIFY TABLES allows for small plug-in functions to extend its capabilities, and, fortunately, there is a plug-in function included with the command to do exactly that. Here is what the MODIFY TABLES dialog looks like for this task.
We specified that all cells should be selected - note the odd notation - and that the action is to apply the function shown in the subdialog. In this example we accept the default background color of yellow, but you can specify red, blue, and green color values between 0 and 255 as parameters to get different colors.
To see the full set of functions in the customstylefunctions.py module included with the command, open that file in a plain text editor. The comments describe how to use each function, so you don't need to know Python in order to use them. Note, though, that the function and parameter names are case sensitive.
You can write your own custom functions, which are usually very short, if you know a little bit of Python.
To use a different background color, say, red, the custom function text would be
customstylefunctions.colorIfEndsWithAtoZLetter(r=255, g=0, b=0)
Finally, here is the result. All the cells that have a letter in the value now have a yellow background. We could have done something similar with bolding the cell text or applying a different styling.
In summary, by applying the SPSSINC MERGE TABLES and SPSSINC MODIFY TABLES to the CTABLES output containing the test results in a separate table, we have highlighted the significant differences while preserving the details from the secondary table.
p.s. If you are wondering about writing your own custom functions, this is the entire code for the function described here other than the standard function signature. First it calculates the color based on the parameters specified. Then, as it is fed the table data cells, it checks the value for a letter code and sets the background color if one is found.
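As a rough sketch, the logic amounts to the following; the names, signature, and color encoding here are mine, so see customstylefunctions.py for the real thing:

import string

def colorIfEndsWithAtoZLetter(datacells, row, col, value, r=255, g=255, b=0):
    """Set a cell's background color if its formatted value ends with a letter code."""
    color = r * 65536 + g * 256 + b       # pack RGB into a single integer
    stripped = value.rstrip()
    if stripped and stripped[-1] in string.ascii_uppercase:
        datacells.SetBackgroundColorAt(row, col, color)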
The SPSS Statistics catalog of table types has over 500 entries (and all Custom Tables count as just one), but sometimes the table you need is not available out of the box. This post is about a technique for combining portions of two tables to get the one you really wanted by using multiple procedures, OMS, and the STATS TABLE CALC extension command.
Suppose you have data such as the following consisting of five Likert scale measures and a category variable. (Realistically you would have many more cases).
You are interested in displaying a table of means for each group such as the following, but you would like to include the category significance level for each variable in that table.
Here is the table as it would be produced by CTABLES. It shows the mean score for each variable by category, but you need an ANOVA calculation to get the p values.
It's easy to use ONEWAY to calculate the ANOVA statistics. It produces this output.
What you want, though, is the numbers in the last column of that table added as a new column in the custom table. Enter STATS TABLE CALC. This extension command allows you to do calculations with the cells of a table in the Viewer as input and put the results in that table. If you have at least Statistics version 21, these can be in new rows or columns. In earlier versions of Statistics you can replace values in existing cells. TABLE CALC works like post computes in Custom Tables, but it can be used with almost any kind of table, and it has a much more powerful expression language.
Our problem here, though, is that we want the calculations to take values from the ANOVA table and add them to the Custom Table. TABLE CALC can do this by plugging in a small custom function. Here is how you could specify this using the dialog box for TABLE CALC, which is on the Utilities menu once this extension command is installed. I will explain the custom function below.
In the dialog I specified the table to modify by using its OMS subtype and other parameters. Specifying the target column as -1 means to insert it after the last column. The key here is the Formula field, which specifies that the values to be inserted are obtained by calling the function addsig in the module getsigs.py. The parameter "arg" being passed to that function will be automatically filled with information the function needs to do its job.
Before we run TABLE CALC, though, we need to capture the ANOVA table. We do that by using OMS, the Output Management System, to write that table into the XMLWORKSPACE. The sequence of commands is thus OMS, ONEWAY, OMSEND, CTABLES, and finally STATS TABLE CALC. Here is the OMS and ONEWAY code.
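It looks something like this; the variable names are placeholders, and 'anova' is the workspace tag that the function below relies on:

OMS /SELECT TABLES
  /IF COMMANDS=['Oneway'] SUBTYPES=['ANOVA']
  /DESTINATION FORMAT=OXML XMLWORKSPACE='anova' VIEWER=NO.
ONEWAY q1 q2 q3 q4 q5 BY group.
OMSEND.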
Now here is the magical Python function addsig I wrote and saved in the file getsigs.py.
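In outline, it amounts to something like this (a sketch based on the description below; the XPath, in particular, is approximate):

import spss

def addsig(arg):
    """Return the ANOVA significance value for the current row of the table."""
    if "sigs" not in arg:
        # First call for this table: pull the Sig. column out of the XML
        # workspace, using the tag given on the OMS command above.
        xpath = "//category[@text='Sig.']//cell/@number"
        arg["sigs"] = spss.EvaluateXPath("anova", "/outputTree", xpath)
    return float(arg["sigs"][arg["roworcol"]])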
The addsig function is called for each row in the table as the column is being added. arg is a dictionary with various useful variables in it. Details can be displayed by running STATS TABLE CALC /HELP. The code checks to see whether this is the first call for this table. If so, it retrieves the significance column of the anova table from the xmlworkspace where it was written by OMS. Note that it uses the xmlworkspace tag specified in the OMS command to find the table. That list of values is saved in arg so that it will be available each time addsig is called.
The function returns the saved significance value referring to the row (arg["roworcol"]) of the current call. That becomes the cell value for the new cell.
Here is what the table looks like after TABLE CALC completes. TABLE CALC has also set the format to show four decimals in the significance column.
This could be packaged in a custom dialog box with the variables and category variable as controls, but here is the full set of syntax used for this example.
STATS TABLE CALC is included in the current version of the Python Essentials, or it can be downloaded from the Extension Commands Collection on the SPSS Community site. The Python Essentials are either included with the Statistics installation or available through the same site.
A common need with SPSS Statistics is to produce some statistical results and then use them for further processing. We sometimes call that "output as input". This is very straightforward if you can do it with AGGREGATE or procedures that produce results as SPSS datasets, but it is possible to do this for anything that SPSS Statistics can put in a pivot table.
One way to retrieve output for reuse is to write a Basic or Python script. The output would be produced as pivot table(s) in the Viewer. Then the script would search through the Viewer document and find the desired table. Using something like the GetValueAt api of the Datacells class (Python) or the ValueAt property of the Data Cells object (Basic), a program can retrieve cell values. The script might be kicked off via a SCRIPT command.
This works, but it can be tricky to program, and it is roundabout and inefficient. And in distributed mode with SPSS Statistics Server, this solution is unavailable.
Back in SPSS version 12 we introduced the Output Management System (OMS), but many users have only a vague understanding of this powerful mechanism. It provides a much easier and more efficient solution to grabbing output. Combining OMS and programmability, the output could still be processed by a Python or Basic script, but it could also be retrieved by a Python or .NET program - instead of a Python script - by using the XML workspace. This allows for better synchronization, and it works in either local or distributed mode.
OMS is a listener for the output. It is not built in to particular procedures. In fact, the procedure does not even know that something is listening. Rather, you start OMS listening for particular objects. When an object of interest comes along, OMS keeps a copy. When you stop the listener, the captured objects are written to memory, to a dataset, or to a file and are available to your SPSS syntax or programmability code.
You tell OMS what to listen for by selecting the types of objects, most often TABLES, and, if desired, the particular types of tables you want, such as a crosstabs table. You stop the listener with the OMSEND command. The OMS command specifies what the output format should be - including XML, SAV, HTML, text, Excel, Word, and PDF - and where to write it.
If you write the output to a dataset, then you can activate that dataset and apply standard SPSS commands to it. You can also access the dataset with Python programmability, whether or not it is activated, using the Dataset class in the spss module.
A more general mechanism is to have OMS write to the XML workspace. This is an in-memory structure that can be read by Python or .NET code. The OMS command assigns a name to the workspace item it creates. Then the program code can retrieve all or a selected part of that item using the GetXmlUtf16 Python api. (Similarly for .NET.) You write an XPath expression to say which part of the xml you want.
XML and XPath are very powerful but can be a bit intimidating, so we have provided some Python helper functions to make it easy. In the spssaux module, which is installed with the Python plug-in, there is a function createXmlOutput that takes care of the OMS wrapper and writing to the workspace. All you give it is the command syntax you want and the identifiers for the type of table you are interested in.
Correspondingly, getValuesFromXmlWorkspace can retrieve specific information from the workspace item created by the first function. You use the visible properties of the table to determine what is to be retrieved. And then you are off to the races.
Here is an example. Let's say you want to run a regression and do something if the R Square statistic is too small. The R Square is in the Model Summary table. Here's an example of that table.
So the task is to run the regression and retrieve the second column of this table. Here is a little Python program to do this. It expects that the cars.sav data file shipped with SPSS Statistics is the active dataset.
import spss, spssaux

cmd = """REGRESSION /DEPENDENT mpg
  /METHOD=ENTER horse weight."""
tag, errorlevel = \
    spssaux.createXmlOutput(cmd, omsid='Regression', subtype='Model Summary')
Rsquare = spssaux.getValuesFromXmlWorkspace(tag, 'Model Summary',
    colCategory="R Square", cellAttrib="number")[0]
if float(Rsquare) < .7:
    pass   # take whatever action is appropriate for a poor fit
Let's walk through this code.
- The cmd= line is the syntax to run to create the output we want to harvest. It could be more than one command.
- The createXmlOutput call runs the command, specifying that we are interested in the Model Summary table of the Regression command. It returns two values: a tag to use when retrieving output, and an error code, which is ignored in this example.
- The getValuesFromXmlWorkspace call uses the tag and the OMS table subtype along with specifying the part of the table we want. Looking at the example table, we see a column label that can be used for retrieval. That column will have its value stored as both text and a number in the xml, so we specify that we want the number form. The function returns a list of the things that matched. Here we take the first and only element.
We know what to retrieve just by looking at the labels in the table. In this example, just identifying the column is enough, but you can also specify row labels. Some tables are too complicated for this approach, but a great many things can be done using this simple model. The spssaux module also has a createDatasetOutput function that works in a similar way but creates an SPSS dataset instead of xml. Values would be retrieved from that dataset with the Dataset class or a cursor object.
Note that this table was not retrieved from the Viewer. It was captured by the OMS listener and placed in the workspace, from which the Python code extracted it.
Inside these Python functions, OMS commands and XPath expressions were generated, but you don't need to learn those technologies in order to benefit from them.
IBM SPSS Statistics contains many tools for data management. This post discusses several different ways to solve an example data management problem using technology from different areas of the product. The purpose of this post is to help you decide where to look when creating a problem solution considering the available functionality and other characteristics.
The problem to be solved here is to convert a text file of comma-separated id values such as postal codes into an SPSS dataset with one item per line. This might then be used as a lookup table.
The input data for this example look like this. It is a text file with each line containing a comma-separated string of values. Some maximum line length or number of values is assumed to be known.
The goal is to produce a dataset where each value is a separate case.
Solution 1. The first approach is to read the text using GET DATA into a set of variables, say, x, y, z - as many as needed. Then use the VARSTOCASES command to restructure these cases into a new dataset. The syntax would simply be
VARSTOCASES /MAKE output FROM x y z /NULL=DROP.
Discussion. VARSTOCASES has many features for more complicated problems of this type such as repeating groups, nonvarying variables and id variables. The restructure data wizard on the Data menu will walk you through this specification, but personally I find the syntax easier to figure out than the wizard in this case. CASESTOVARS does the reverse operation. NULL=DROP means here that if there were fewer than three values on an input line, the empty values would not contribute output cases.
So, game over, what could be easier? Common data manipulation problems often have a special command available for their solution. (AGGREGATE is another of my favorites) You might not be so lucky as to have your problem fit exactly into an existing command, so it's worth looking at more general mechanisms. The next solution uses an INPUT PROGRAM.
Solution 2. Input programs are a powerful but little-known mechanism to build cases in a very general way. They take as input raw data described by one or more DATA LIST commands and produce an SPSS dataset that might have a quite different structure. Here is an input program to solve our problem. I've inserted it as an image in order to take advantage of the syntax coloring of the SPSS Syntax Editor introduced in Version 17.
An Input Program to Restructure Variables into Cases
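A sketch of such an input program follows; the file name, field width, and exact string functions are assumptions:

INPUT PROGRAM.
DATA LIST FILE='values.csv' /record 1-100 (A).
STRING output (A10).
LOOP.
  COMPUTE #comma = INDEX(record, ',').
  DO IF #comma > 0.
    COMPUTE output = SUBSTR(record, 1, #comma - 1).
    COMPUTE record = SUBSTR(record, #comma + 1).
  ELSE.
    COMPUTE output = record.
  END IF.
  END CASE.
END LOOP IF #comma = 0.
END INPUT PROGRAM.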
Discussion. This program defines its input with a Data List command for the csv file. It loops over each input record emitting the value of the input up to the next comma as a case containing a single variable named output. This continues for the input record until all the text has been consumed. Then it breaks and proceeds to the next input record.
Input programs can deal with hierarchical input, repeating groups, and self-defining data formats, among other things. Why are they not better known? I'll offer several reasons.
- There is no user interface for input programs. For many users, if there is no gui, the feature doesn't exist.
- The documentation in the Command Syntax Reference is not very good. It is scattered through the separate commands, so it is hard to grasp what input programs can do and how to use them.
- In the database era, data often come in rectangular database tables that are manipulated with SQL or in spreadsheets, so an input program is not needed. Those are the domain of the GET DATA command.
You usually need to experiment with input programs to figure out the solution. Once you get the pattern, though, you can easily solve other similar problems. One SPSS feature that you might not realize is actually an input program is the Make New Dataset with Cases custom dialog box available from the files section of this site. You specify in the dialog the number of variables and cases and a few other options, and it creates a dataset of random numbers. If you look into the dialog syntax, you will see that it generates an input program.
Solution 3. The next approach is to use Python programmability. Python being a very general purpose language, there are many variations that would work (hence the 1/2 in the title). I'll start by getting the data by using the csv module in the Python standard library and generating an SPSS dataset from that input. The csv module understands several common dialects of comma-separated files and allows you to control details of the formats, but we don't need that generality here.
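In outline, the code is something like this; the file name and variable width are assumptions:

import csv
import spss

spss.StartDataStep()
ds = spss.Dataset(name=None)            # create a new, empty dataset
ds.varlist.append("output", 10)         # one string variable of width 10
f = open("values.csv")
for row in csv.reader(f):
    for value in row:
        if value:                       # skip the empty field after the last comma
            ds.cases.append([value])
f.close()
spss.EndDataStep()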
The code loops over the lines in the csv file and then over the values within the line. The csv reader automatically splits the line into fields at the comma boundaries. It regards the line contents following the last comma as an additional empty field, so the code ignores that. The spss.Dataset class is used to generate cases containing the values extracted from the input file. This code would be wrapped in BEGIN PROGRAM/END PROGRAM if run in the regular SPSS syntax stream, but it could also be run in external mode directly from a Python environment. In that mode, which can be much faster than the usual internal mode, no SPSS user interface appears. It is much like using the StatisticsB module included with SPSS Statistics Server, but it runs locally.
Alternatives with the Python approach would be to read the input from the active SPSS dataset and create a new one. The spssdata.Spssdata class could also be used to write the new dataset.
So, which is best? The answer, of course, depends. VARSTOCASES wins if it fits the problem. The choice between an input program and a Python program is partly a matter of taste and skill set, partly whether you want the help of an IDE in creating and debugging the code. Using direct SPSS syntax will generally be faster for passing cases, although for some problems the power of the Python language might win out on performance grounds. Performance improvements in the Dataset class are currently underway.
You can even use the input program within a Python program by just using spss.Submit to run that whole block of code. One reason to do that would be for convenient parameterization of the code while getting the speed advantage of native SPSS case passing.
At least the filename and perhaps the output variable name would vary if you use this code repeatedly. There are three general ways to parameterize it. The traditional route is the SPSS macro language. For both VARSTOCASES and the input program, you could define a macro that substitutes the parameters into the syntax. Running it just means calling the macro, except that you have to load the file containing the macro explicitly for each session.
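For the VARSTOCASES version, such a macro might look like this (the names and formats are hypothetical):

DEFINE !tolong (file = !TOKENS(1) / outname = !TOKENS(1))
GET DATA /TYPE=TXT /FILE=!file /DELIMITERS="," /VARIABLES=x A8 y A8 z A8.
VARSTOCASES /MAKE !outname FROM x y z /NULL=DROP.
!ENDDEFINE.
!tolong file = "values.csv" outname = output.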
With Python, you might turn this program into a function of two parameters and then reference those parameters in the code. You would just import the function definition (no location information required) and call it in a short program. You might, in fact, collect a bunch of utility functions in a single module and just import it once in the job. Even without creating a function, though, you could use multiline triple-quoted Python literals and named parameters in the command string to produce readable and flexible code. The Programming and Data Management book, downloadable from the Articles section of this site, illustrates this idea.
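With the function approach, that might look like this (again, the names are hypothetical):

import spss

def tolong(filename, outname):
    """Read a csv file of values and restructure it to one value per case."""
    spss.Submit("""GET DATA /TYPE=TXT /FILE="%(filename)s"
  /DELIMITERS="," /VARIABLES=x A8 y A8 z A8.
VARSTOCASES /MAKE %(outname)s FROM x y z /NULL=DROP.""" % locals())

tolong("values.csv", "output")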
Finally, you could create a custom dialog box using the Custom Dialog Builder and just put the code in whichever form you chose in that dialog. Here's my little dialog box. You could make your own in a few minutes.
If this isn't enough choices, you could also turn the program into an extension command with traditional SPSS syntax. You can read about that in earlier posts in this blog and on this site. That tidies up the log and makes the program available without any explicit load required.
If you have complex file types including nesting, grouped records, mixed record types, and repeating data, you might want to take a look at Appendix C in the Command Syntax Reference, entitled Defining Complex Files.
In summary, you have many tools in SPSS Statistics for problems such as this. Choosing the right tool for the problem and for your skill set will get the job done with the minimum amount of effort. Don't hesitate to venture into areas of SPSS technology that might be new to you.