Modified by JonPeck
SPSS Statistics was, as far as I know, the first commercial software to deliver an integration with the R statistical language. It first appeared in SPSS 16, over six years ago at this writing, complementing the Python language integration that first appeared in SPSS 14. This post reviews the rationale for the feature and how it has developed.
R has become the computational language of the statistics profession. It's the way a new statistical algorithm is first published, and the R library contains a vast collection of statistical functions. The R jungle contains many gems, but it has drawbacks, too. This isn't the place to discuss all the good and the bad, but suffice it to say that using R directly imposes a style of doing statistical analysis based on a programming model that does not always suit an analyst, and the output from an R package is usually not in a format suitable for publication. And, while there are various point-and-click interfaces that can be added on for some R packages, serious usage requires the user to learn the R language, which is not easy.
SPSS Statistics and, as of Version 16, SPSS Modeler, bring to bear the ease of use of these products and their output presentation capabilities, allowing a user to work with these products while still tapping the power and packages of R. While a user can write programs in the R language that run within the Statistics or Modeler program, typically the SPSS user takes advantage of R packages that have already been integrated using the published APIs and tools for this purpose, without the need to learn R or deal with it directly. R packages can extend the statistical capabilities of these products without sacrificing the benefits of SPSS software. The R connection requires an extra installation step (R itself and, via the SPSS Community website, the R Essentials), but all the pieces for this are free. Statistics and Modeler can be a great way to deploy the functionality of R.
Organizations and individuals can do their own, private integrations of R packages, but the SPSS Community site provides a means of sharing integrations with everyone. Instructions for sharing are on the front page of the site. For SPSS Statistics, you can start here to see what has been shared. With Statistics version 22 or later, you can also download and install package integrations from the Utilities menu within Statistics without even visiting the site.
The image also shows extensions implemented in Python.
As of this writing, there are 25 R packages that have been integrated by IBM and 10 contributed by users. Package integrations generally include a dialog box interface produced by the Statistics Custom Dialog Builder and traditional SPSS syntax for the package. They produce their output as SPSS pivot tables and R graphic images that appear in the Statistics Viewer along with other output produced by native SPSS commands. For packages that are included in the R Essentials, the dialog and output are usually translated into all the languages that Statistics itself provides.
For Modeler 16, an adaptation of the Custom Dialog Builder is included, and nodes can build models and provide code to be used with those models for scoring. Using the new Hadoop integration, the scoring can be performed on the Hadoop cluster with big performance benefits. As in Statistics, the user of R-based nodes sees the same behavior that comes with native nodes.
Producing a package integration for Statistics is usually easy for someone who knows the R language. It can be as simple as adding a line to fetch the data. Usually the integration will convert plain text R output to one or more pivot tables, and it may create new datasets. That takes a little longer, but it is still typically a few hours to a few days. The APIs for all this are covered in detail in the help installed with the R plug-in: Help > Programmability > R Plug-In. Since the package integrations are usually distributed in source form, they can serve as examples. Integration creators can also provide translations, but not many are prepared to handle this. The SPSS forums are a good place to ask questions about this technology.
There is a white paper that discusses the benefits and technology of using R with Statistics and provides more details.
In sum, the R integrations for Statistics and Modeler allow access to the large R library but package it in a form that fits in with the native capabilities of these products. It's a win for everyone, and it's all free.
The SPSS Statistics Custom Tables (CTABLES) procedure can test equality of column proportions and column means. The test results are displayed either in a separate table or, using APA style, in the main table. Users commonly want to highlight significant differences, which is not possible using the built-in functionality alone. This article explains how to do this.
Setup: In order to use the approach described here, you need the Python Essentials, including the SPSSINC MERGE TABLES and SPSSINC MODIFY TABLES extension commands. If these are not already installed, they can be obtained via the SPSS Community website. Details of what is installed by default vary across versions of Statistics. If your installation of SPSSINC MODIFY TABLES does not have the custom function described below, get a newer version from the website.
Here is the syntax and output from CTABLES testing equality of column proportions. The example uses the employee data.sav file shipped with Statistics. Our goal is to show the significant differences in the main table and highlight them.
You might first ask why we did not start with the single table output in APA style, since we want to show significance in the main table. There are two reasons: first, in my experience, nobody likes the APA style, and, second, it won't help us to color the significant cells anyway.
If you have Statistics version 22 or later, you might think that the new OUTPUT MODIFY command or Table Style subdialog could handle this task, but it can't, because the significance values do not occur in the test table with their own column. While OUTPUT MODIFY is a fabulous tool for unconditional or conditional formatting of tables and is very easy to use, this problem is too complicated for it.
We should, before proceeding further, pause to clarify what we mean by coloring significant cells. Significance is a property of pairs of cells, not of individual cells. In the test table above, the columns are lettered A to C, and the key of the row with a significantly smaller column proportion is shown in the category of the larger proportion. Without the letter codes, you would know, for example, that educational level 15 is significantly larger for the Clerical category than for some other job categories, but you would not know which ones. So we can't just replace the letter codes with a color without a loss of information.
We can, however, color cells corresponding to the significance codes while retaining the specific information about which categories they refer to. There are two steps to the process. The first step is to merge the test table into the main table. In order to do that, we need to use the SPSSINC MERGE TABLES extension command. MERGE TABLES works by joining columns (or rows) of tables based on the column labels, possibly adjusting for extra terms appearing in only one. In this example, we will join the tables based on the job category labels - Clerical, Custodial, and Manager, ignoring the Count and letter codes that also appear. Fortunately the default settings in MERGE TABLES do this automatically.
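For intuition, the label-based join that MERGE TABLES performs can be sketched in plain Python. The dicts below are hypothetical stand-ins for pivot-table columns keyed by their labels; the real command operates on Viewer tables through the SPSS APIs.

```python
# Hypothetical stand-ins for pivot-table columns keyed by their labels.
main_table = {"Clerical Count": [80, 10],
              "Custodial Count": [0, 3],
              "Manager Count": [0, 0]}
test_table = {"Clerical": ["", "B"],
              "Custodial": ["", ""],
              "Manager": ["", ""]}

def merge_by_label(main, secondary):
    """Append each secondary cell to the main column whose label contains
    the secondary column's label, ignoring extra terms such as 'Count'."""
    merged = {}
    for main_label, main_cells in main.items():
        for sec_label, sec_cells in secondary.items():
            if sec_label in main_label:  # labels match apart from extra terms
                merged[main_label] = [f"{m} {s}".strip() if s else str(m)
                                      for m, s in zip(main_cells, sec_cells)]
                break
        else:
            merged[main_label] = [str(m) for m in main_cells]
    return merged

result = merge_by_label(main_table, test_table)
print(result["Clerical Count"])  # ['80', '10 B'] - letter code attached
```

The default settings of MERGE TABLES handle exactly this case: matching on the category labels while ignoring the statistic name and letter codes that also appear.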
Using the MERGE TABLES dialog box under Utilities > Merge Viewer Tables, the dialog would look like this.
We didn't specify table types, so the command uses the most recent pair of tables in the Viewer. We are merging by column based on the Count statistic. In this case, that is the only statistic, but if the table also contained, say, percentages, specifying Count would single out the Count columns for the merge. MERGE TABLES has many other features for more complex merging. Check out the help if your tables are too complex for a simple merge.
Here is the result of the merge.
As you can see, the letter codes from the secondary tables now appear in the corresponding cells in the main table, and the caption from the secondary table has been attached to the main table. Now we want to go further and highlight the cells that have a letter code. This is where the SPSSINC MODIFY TABLES (Utilities > Modify Table Appearance) extension command comes in.
MODIFY TABLES can apply conditional or unconditional styling to table cells and labels. It has built-in syntax for selecting cells and bolding, foreground and background coloring, and other properties, but here we need to go a little further. We need to select the cells that contain a letter code and then apply styling. MODIFY TABLES allows for small plug-in functions to extend its capabilities, and, fortunately, there is a plug-in function included with the command to do exactly that. Here is what the MODIFY TABLES dialog looks like for this task.
We specified that all cells should be selected - note the odd notation - and that the action is to apply the function shown in the subdialog. In this example we accept the default background color of yellow, but you can specify red, blue, and green color values between 0 and 255 as parameters to get different colors.
To see the full set of functions in the customstylefunctions.py module included with the command, open that file in a plain text editor. The comments describe how to use each function, so you don't need to know Python in order to use them. Note, though, that the function and parameter names are case sensitive.
You can write your own custom functions, which are usually very short, if you know a little bit of Python.
To use a different background color, say, red, the custom function text would be
customstylefunctions.colorIfEndsWithAtoZLetter(r=255, g=0, b=0)
Finally, here is the result. All the cells that have a letter in the value now have a yellow background. We could have done something similar with bolding the cell text or applying a different styling.
In summary, by applying the SPSSINC MERGE TABLES and SPSSINC MODIFY TABLES to the CTABLES output containing the test results in a separate table, we have highlighted the significant differences while preserving the details from the secondary table.
p.s. If you are wondering about writing your own custom functions, this is the entire code for the function described here other than the standard function signature. First it calculates the color based on the parameters specified. Then, as it is fed the table data cells, it checks the value for a letter code and sets the background color if one is found.
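That logic can be sketched in ordinary Python. This is not the shipped code: the real function receives pivot-table cell objects and sets the background color through the Statistics scripting APIs, while the hypothetical version below simply returns an RGB tuple for a plain string value.

```python
def color_if_ends_with_letter(value, r=255, g=255, b=0):
    """Return the (r, g, b) background to apply if the cell value ends
    with an uppercase letter code, or None to leave the cell alone."""
    text = str(value).strip()
    if text and text[-1].isalpha() and text[-1].isupper():
        return (r, g, b)
    return None

print(color_if_ends_with_letter("80 B"))              # (255, 255, 0): highlight
print(color_if_ends_with_letter("80"))                # None: no letter code
print(color_if_ends_with_letter("10 C", r=255, g=0))  # (255, 0, 0): red instead
```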
Suppose you want to generate all two-way crosstabs for a set of variables in your dataset but without creating any of the redundant transposed tables (V2 BY V1 and V1 by V2). You could, of course, write a lot of CROSSTABS commands, but that would be very tedious and error prone. You could also write an SPSS Statistics macro that looped over all the variable pairs. Instead, we will explore using the SPSSINC PROGRAM extension command and generalizing the solution.
It's easy to write a program with a list of variables that generates all the necessary CROSSTABS syntax, but editing an embedded variable list is problematic: it requires a bit of Python knowledge on the part of the user, and it makes the program specific to a particular usage. The SPSSINC PROGRAM command is designed to allow arguments such as a list of variable names to be passed into a function using the standard Python command line mechanism to retrieve the arguments. Here is the first version of the program followed by a usage example. These examples ignore the 20-table limit in CROSSTABS and do not check for bad usage such as specifying only one input variable, as that is beside the point here, but such checks could easily be added.
The program defines a function named manytabs. It retrieves its arguments from the list sys.argv. Because the first item in the argument list is the name of the program, the variable names start with the second item. To run the program, the user runs SPSSINC PROGRAM with the first parameter being the name of the program to run, followed by two or more variable names. The author could have created an extension command for this program, but even without that extra, though modest, effort it operates essentially like a built-in command.
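A minimal sketch of the pattern in plain Python; the real program would pass each generated command to spss.Submit rather than printing it, and the variable names here are illustrative.

```python
def manytabs(argv):
    """Generate CROSSTABS syntax for every unordered pair of variables.
    argv[0] is the program name, as on a standard command line, so the
    variable names start at argv[1]."""
    variables = argv[1:]
    return [f"CROSSTABS {row} BY {col}."
            for i, row in enumerate(variables)
            for col in variables[i + 1:]]

# Simulating the argument list SPSSINC PROGRAM supplies through sys.argv:
cmds = manytabs(["manytabs", "jobcat", "gender", "minority"])
for c in cmds:
    print(c)
# CROSSTABS jobcat BY gender.
# CROSSTABS jobcat BY minority.
# CROSSTABS gender BY minority.
```

Note that only unordered pairs are generated, which is what avoids the redundant transposed tables.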
In this example, the program is in-line in the syntax window. However, the program could be a function in a Python module, in which case the user would include the module name in the first parameter, e.g., mylibrary.manytabs. Although this program does nothing that couldn't be done with an SPSS macro, it has the advantage that if it is in an external module, there is no need to explicitly load the module in order to run it. As long as the module is on the Python search path, it will be loaded automatically when needed.
To run this program, the user has to list all the variables to be crosstabbed. We can make it smarter by making it use variable metadata such as patterns in the variable names or the variable measurement level to select the variables. In the second version, the program automatically crosstabs all of the variables of specified measurement levels. The command parameters are now measurement levels instead of variable names.
This time the variable list is built by filtering the set of variables by their measurement levels. (In a real version, we would issue a message and stop if no variables matched the specified types rather than mysteriously producing no output or a puzzling message.) The program is now more general in a certain respect, but it is no more complicated than it was before. This version could not be duplicated by a reasonable SPSS macro.
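A sketch of the second version, with the error check mentioned above added. The metadata list is a hypothetical stand-in for the SPSS variable dictionary, which the real program would read through the spss module.

```python
def manytabs_by_level(argv, metadata):
    """Crosstab every pair of variables whose measurement level is among
    the levels named in argv[1:]."""
    levels = set(argv[1:])
    variables = [name for name, level in metadata if level in levels]
    if not variables:
        # Fail loudly instead of mysteriously producing no output.
        raise ValueError("no variables match the requested measurement levels")
    return [f"CROSSTABS {row} BY {col}."
            for i, row in enumerate(variables)
            for col in variables[i + 1:]]

# Hypothetical (name, measurement level) pairs for the active dataset:
metadata = [("jobcat", "nominal"), ("salary", "scale"), ("gender", "nominal")]
print(manytabs_by_level(["manytabs", "nominal"], metadata))
# ['CROSSTABS jobcat BY gender.']
```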
If the program arguments are more complex, say a combination of variable names or types and other crosstab specifications, the author could use keywords and standard library Python tools such as the argparse module to interpret the request. The extension command mechanism would be a better solution if the parameters are complex, however.
The SPSSINC PROGRAM extension command is available in the Extension Commands collection of the SPSS Community website.
In summary, this simple mechanism allows you to use Python programs as SPSS Statistics commands that take parameters using standard Python command line conventions with no extra work.
The SPSS Statistics catalog of table types has over 500 entries (and all Custom Tables count as just one), but sometimes the table you need is not available out of the box. This post is about a technique for combining portions of two tables to get the one you really wanted by using multiple procedures, OMS, and the STATS TABLE CALC extension command.
Suppose you have data such as the following consisting of five Likert scale measures and a category variable. (Realistically you would have many more cases).
You are interested in displaying a table of means for each group such as the following, but you would like to include the category significance level for each variable in that table.
Here is the table as it would be produced by CTABLES. It shows the mean score for each variable by category, but you need an ANOVA calculation to get the p-values.
It's easy to use ONEWAY to calculate the ANOVA statistics. It produces this output.
What you want, though, is the numbers in the last column of that table added as a new column in the custom table. Enter STATS TABLE CALC. This extension command allows you to do calculations with the cells of a table in the Viewer as input and put the results in that table. If you have at least Statistics version 21, these can be in new rows or columns. In earlier versions of Statistics, you can replace values in existing cells. TABLE CALC works like post computes in Custom Tables, but it can be used with almost any kind of table, and it has a much more powerful expression language.
Our problem here, though, is that we want the calculations to take values from the ANOVA table and add them to the Custom Table. TABLE CALC can do this by plugging in a small custom function. Here is how you could specify this using the dialog box for TABLE CALC, which is on the Utilities menu once this extension command is installed. I will explain the custom function below.
In the dialog I specified the table to modify by using its OMS subtype and other parameters. Specifying the target column as -1 means to insert it after the last column. The key here is the Formula field, which specifies that the values to be inserted are obtained by calling the function addsig in the module getsigs.py. The parameter "arg" being passed to that function will be automatically filled with information the function needs to do its job.
Before we run TABLE CALC, though, we need to capture the ANOVA table. We do that by using OMS, the Output Management System, to write that table into the XMLWORKSPACE. The sequence of commands is thus OMS, ONEWAY, OMSEND, CTABLES, and finally STATS TABLE CALC. Here is the OMS and ONEWAY code.
Now here is the magical Python function addsig I wrote and saved in the file getsigs.py.
The addsig function is called for each row in the table as the column is being added. arg is a dictionary with various useful variables in it. Details can be displayed by running STATS TABLE CALC /HELP. The code checks to see whether this is the first call for this table. If so, it retrieves the significance column of the ANOVA table from the XML workspace where it was written by OMS. Note that it uses the XMLWORKSPACE tag specified in the OMS command to find the table. That list of values is saved in arg so that it will be available each time addsig is called.
The function returns the saved significance value referring to the row (arg["roworcol"]) of the current call. That becomes the cell value for the new cell.
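The caching pattern can be sketched without SPSS at all. The significance values and arg keys below are illustrative; the real function pulls the column out of the XML workspace with an XPath query rather than using a module-level list.

```python
ANOVA_SIG = [0.012, 0.340, 0.007, 0.051, 0.200]  # stand-in for the OMS table

def addsig(arg):
    """Per-row callback: cache the significance column on the first call,
    then index the cache by the row of the current call."""
    if "sigs" not in arg:        # first call for this table?
        # The real function retrieves this from the XML workspace here.
        arg["sigs"] = ANOVA_SIG
    return arg["sigs"][arg["roworcol"]]

arg = {}
new_column = []
for row in range(len(ANOVA_SIG)):
    arg["roworcol"] = row
    new_column.append(addsig(arg))
print(new_column)  # [0.012, 0.34, 0.007, 0.051, 0.2]
```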
Here is what the table looks like after TABLE CALC completes. TABLE CALC has also set the format to show four decimals in the significance column.
This could be packaged in a custom dialog box with the variables and category variable as controls, but here is the full set of syntax used for this example.
STATS TABLE CALC is included in the current version of the Python Essentials, but it can be downloaded from the Extension Commands Collection on the SPSS Community site. The Python Essentials are either included with the Statistics installation or available through the same site.
The SPSS Community website has a wealth of resources for users of SPSS Statistics and other SPSS products. In particular, there are many extension commands you can download and use for free. Extension commands typically consist of a dialog box interface, traditional SPSS syntax, and an implementation in Python or R. Once installed, they are nearly indistinguishable from built-in functionality. The biggest problem we hear about with these is that people (1) don't know they exist, or (2) can't find one they are looking for. The newly released Statistics version 22 has a solution.
In Statistics 22, you can use Utilities > Extension Bundles > Download and Install Extension Bundles to do exactly that without visiting the website. That menu item goes to the Extension Commands collection on the website, finds everything available, and displays a list of the available items. This is a portion of what you might see:
At this writing, there are 85 items in the list. You can sort this list by clicking on any heading, and you can search for keywords and text by using the box at the top. Items not developed by IBM are marked with an asterisk. (See the SPSS Community home page on how you can contribute your own items.)
On the right side of the dialog, the version number of the item on the website and, if it is installed, the installed version number are shown along with a selection checkbox. The Prerequisites column shows whether you already have any required plug-ins installed and meet the minimum version requirement.
You can check the boxes for the updates or new items you want. Clicking Select all updates checks the boxes for any installed item that is not the latest version.
Version numbers have three parts. The first part is incremented for a major change, the second for a minor change, and the third for a bug fix.
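The practical point is that these three parts compare numerically, part by part, not as text; a short Python sketch makes that concrete.

```python
def parse_version(v):
    """Split a three-part version string into a comparable integer tuple."""
    return tuple(int(part) for part in v.split("."))

# Tuples compare part by part, so 1.2.10 sorts after 1.2.9 as it should;
# a plain string comparison gets this wrong.
assert parse_version("1.2.10") > parse_version("1.2.9")
assert "1.2.10" < "1.2.9"  # lexical comparison is misleading
```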
The last step is to choose whether to download and install the selected items or just download them. If you are installing or updating for a computer that does not have Internet access, for a user who does not have sufficient permissions to update, or to keep a shared collection of extensions, choose the Download but do not install button and later use Utilities > Extension Bundles > Install Local Extension Bundle on the target computer.
If you are installing an R-based extension that requires packages beyond what is installed with Base R, the installer will attempt to download and install those packages at the same time. If it cannot because of insufficient permissions or other problems, you will need to install those packages manually later even if the extension command install is successful. You can see a list of any required packages for an installed extension via Utilities > Extension Bundles > View Installed Extension Bundles and clicking on the descriptive text. If the summary line in the Downloads dialog indicates that there are additional installation requirements, look for a file with installation instructions in the directory where the extension is installed.
This new mechanism for extension commands makes it much easier to find and maintain items from the SPSS Community (www.ibm.com/developerworks/spssdevcentral) Extension Commands collection. All of these items, however, continue to be available directly from the website.
We are sometimes asked why we use Python as the main programmability and scripting language in IBM SPSS Statistics. Here's a quote from an InfoWorld Developer World article dated October 17, 2011.
The article is entitled From PHP to Perl: What's hot, what's not in scripting languages.
Hot scripting language: Python
In a sense, the tipping point for Python came when the housing market crashed. For those stuck trying to decode bond market prospectuses to figure out who got paid what when the bankruptcy dominoes were done falling, one thing became clear: English is a weaselly language, and some weaselly folks revel in its ambiguities to profit from complicated derivatives. There was one smart group that offered a compelling solution: force every bond to come with a Python program that defined who got paid and when they were paid. While they may have been a bit too hopeful about the power of computer languages, the proposal spoke to Python's growing acceptance in the greater world of smart people who aren't computer geeks. Relatively easy to pick up, Python is becoming increasingly popular in economic research, science departments, and biology labs.
I'm not so sure about the bond market idea(!), but Python excels in its combination of clarity, flexibility, and power. Statistics and data analysis are a case in point: SPSS began embedding Python with version 14 back in 2005. As of October, the TIOBE index shows Python as the 8th most popular programming language overall, FWIW. While this is only a partial view of language usage, it confirms again that Python is a class A language.
Case-control matching is a popular technique used to pair records in the "case" sample with similar records in a typically much larger "control" sample based on a set of key variables. This post discusses the FUZZY extension command for SPSS Statistics that implements this technique and some recent enhancements to it.
In this discussion I will refer to records rather than what SPSS usually calls cases in order to avoid confusion with case as in case control.
FUZZY takes two datasets as input (the demander and supplier datasets), matches the records according to a set of BY variables, and provides various ways of writing the output. It does not have a dialog box interface, but running FUZZY /HELP displays the complete syntax. Matches can be required to be exact on all variables, or a tolerance or "fuzz" factor, which could be zero, can be specified for each matching variable. (String matches can only have fuzz 0.)
Matches are not always possible: missing values or blank strings in a BY variable preclude matching. There might not be a close enough record in the supplier dataset to pair with a demander record. You might run out of eligible supplier records before the demander records are all paired. Unpaired output is set to the system missing value or, for strings, to blank.
FUZZY proceeds by finding for each demander record all of the supplier records that are close enough on the BY variables. This requires a lot of comparisons! It then proceeds through the demander records and picks a supplier record at random from all those eligible for that record. (You can request multiple supplier records for each demander if needed.) No attempt is made to find the closest eligible record, since there is no measure of closeness across the set of BY variables and for other reasons.
If sampling from the suppliers is done without replacement, which is the default behavior, then using a supplier record makes it unavailable for matching in later demander records and could result in a later record having no match. With fuzzy matches, it could be that both records would have been matched if the order were reversed.
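The two-stage logic described above can be sketched in plain Python. All names and data here are illustrative, not FUZZY's actual implementation.

```python
import random

def within_fuzz(demander, supplier, fuzz):
    """True if every BY variable matches within its tolerance;
    strings must match exactly (fuzz 0)."""
    for key, tol in fuzz.items():
        d, s = demander[key], supplier[key]
        if isinstance(d, str):
            if d != s:
                return False
        elif abs(d - s) > tol:
            return False
    return True

demanders = [{"origin": "US", "cylinders": 8},
             {"origin": "Europe", "cylinders": 4}]
suppliers = [{"origin": "US", "cylinders": 6},
             {"origin": "US", "cylinders": 4},
             {"origin": "Europe", "cylinders": 5}]
fuzz = {"origin": 0, "cylinders": 3}   # exact on origin, +/-3 on cylinders

random.seed(1)
used, pairs = set(), []
for d in demanders:
    eligible = [i for i, s in enumerate(suppliers)
                if i not in used and within_fuzz(d, s, fuzz)]
    if eligible:
        pick = random.choice(eligible)  # random among eligibles, not closest
        used.add(pick)                  # without replacement
        pairs.append((d, suppliers[pick]))
print(len(pairs))  # 2: both demanders found a match here
```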
While one would generally want to specify an exact match for categorical variables, at least those with nominal measurement level, continuous variables such as income or age might require some fuzz. New output from FUZZY can help to diagnose which BY specifications cause a record to go unmatched. Here is a table produced by FUZZY that shows how the BY criteria restrict the matches.
In this example, we are matching two datasets about vehicles on the variables origin and cylinder. We require an exact match on origin, but allow a difference of up to 3 in the number of cylinders as shown in the first column.
The first row records the number of comparisons between demander and supplier records testing for an exact match. Comparing each demander record to each supplier record, 95% of these comparisons were not exact matches; only 5% matched exactly.
Next, the table shows that among comparisons after removing the 5% that matched exactly, 85% did not match on origin. Then, considering only the records where there was an exact match on origin, 75% of the comparisons did not match on cylinder. Each row of the table is based on the comparisons that passed, i.e., were within tolerance, on all of the preceding rows.
This table, which precedes the actual pairing step, can be useful in finding the variables that are most important in preventing matches. You may need to increase the tolerance, or you may just be out of luck if there are insufficient matches. The table does not tell you how successful the actual pairing step will be, because supplier records will be used up as the pairing pass proceeds, but it gives insight into how the variables filter the matches. The tries and rejection columns are only produced when you specify fuzz, but you could set all the fuzz values to 0 to see the results with exact matching.
The next table shows the distribution of eligible matches for the pairing pass (this example is based on a very small dataset). It shows how many eligible records there were for each demander record in the pairing pass. It shows that there were two demander records for which there were zero eligible supplier records, three where there was only one, and one where there were two to ten eligibles. This gives you a good idea of how rich the supplier dataset is in matchables, but it doesn't say anything about which variables have the biggest effects on pairing.
If you are lucky, all your demander records will find a match, but if they don't, what can you do? Recall that the pairing stage is first come, first served. With fuzzy matching, reordering the demander cases might work better. While FUZZY can't find an optimal order, one more output feature can help you improve the results. Specifying DRAWPOOLSIZE=varname will add a variable to the demander dataset that records how many eligible supplier records there were for each demander record. You can then study the characteristics of the demander records where suppliers are scarce to see where the supplier dataset is too thin. A good start, though, toward improving the pairing percentage is to sort the demander dataset by the newly created pool-size variable. That puts the demander records with the fewest eligible matches first in line when you rerun the process and will generally reduce the number of unmatched cases if a reduction is possible.
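A toy example shows why sorting by pool size helps. The data are hypothetical, and the pick is made deterministic here to keep the illustration simple, whereas FUZZY picks at random.

```python
def pair(order, eligibles):
    """Greedy first-come pairing without replacement; eligibles maps each
    demander to the set of supplier indices within tolerance."""
    used, matched = set(), {}
    for d in order:
        pool = sorted(eligibles[d] - used)
        if pool:
            matched[d] = pool[0]   # deterministic pick for illustration
            used.add(pool[0])
    return matched

eligibles = {"A": {1, 2}, "B": {1}}      # B has the thinner pool

print(len(pair(["A", "B"], eligibles)))  # 1: A takes supplier 1, B is left out

by_pool = sorted(eligibles, key=lambda d: len(eligibles[d]))
print(len(pair(by_pool, eligibles)))     # 2: B goes first and both match
```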
There are statistical issues regarding how you choose the BY variables that are not addressed here. Searching the web for something like "case control matching" will turn up numerous references.
FUZZY is available from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) and requires the Python Essentials for your version of SPSS Statistics.
This is a quick note for people who produce a lot of tables with SPSS Statistics or use a lot of scripting code for formatting.
Statistics Version 20 was released on Tuesday, August 16, 2011. One of the big improvements is in the speed of table production. Five times faster is typical! If you use scripting or tools such as SPSSINC MODIFY TABLES to modify the formatting or do other table operations, these will also run much faster. There is no need to use the (unscriptable) lightweight tables from Version 19 in order to get fast table production.
In Version 20, fast tables are the default option. Only if you need to produce spv files that can be read by older versions of Statistics do you need to change this option. Fast tables support the full formatting and scripting capabilities of the previous table format.
Of course, there are other important improvements in Version 20 such as mapping as a base feature, but I am particularly pleased with this improvement and congratulate the team that worked hard to make this happen.
Users of traditional SPSS Statistics syntax are used to using the macro facility to parameterize blocks of syntax so that it is more flexible and can be varied without having to duplicate and edit the code. However the GGRAPH command, which provides deep access to the capabilities of the graphics engine, specifies the graph using GPL, the graphics specification language of SPSS. And GPL does not work with macro. How, then, can GPL code be parameterized? This post explains how to do this and, in the process, how to build a library of your own graphics specifications that removes the chart definition details from your syntax stream.
First, let's look at the syntax for a bar chart as generated by the Chart Builder. These examples use the employee data.sav file shipped with the software.
- The GGRAPH command is a standard Statistics command and follows all the normal rules for syntax. Macro can be used with it.
- The GPL block contains the actual chart specifications as indicated by the GGRAPH GRAPHSPEC subcommand. As you can see, it looks different from traditional syntax, and it follows different rules. GPL syntax is explained in detail in the Help under the GPL topic. You can do many things with it, but using macro is not one of them.
- This syntax is completely specific to the specification for this chart. To change the title, say, would require manual editing of the GUIDE statement with text.title (or generating a new command with the Chart Builder). Not very good for production work.
If you can't use macro to generalize this, what can you do? I'll show you how to use Python programmability not only to replace macro but to build a library of chart definitions that can be shared among different syntax streams. First, let's see how we could parameterize the title of the chart. (A real problem would usually involve more, but the idea is the same.) Here is the first version.
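A sketch of that first version follows. (The GPL body is abbreviated; the full chart specification from the pasted syntax goes inside the triple-quoted string, and the variable names are illustrative.)

```
BEGIN PROGRAM.
import spss

thetitle = """Bar Chart of Employment Category"""

cmd = r"""GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  COMMENT: parameterized bar chart
  SOURCE: s = userSource(id("graphdataset"))
  DATA: jobcat = col(source(s), name("jobcat"), unit.category())
  DATA: COUNT = col(source(s), name("COUNT"))
  GUIDE: text.title(label("%(thetitle)s"))
  ELEMENT: interval(position(jobcat*COUNT))
END GPL.""" % locals()

spss.Submit(cmd)
END PROGRAM.
```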
- The entire GGRAPH command and the GPL code are assigned to the variable cmd inside the BEGIN PROGRAM block. The text of the command is the same as before except for the title line. In that line, in place of the title, we have the notation %(thetitle)s. That means to insert the value of the variable thetitle there. It's just like macro substitution except that it works! (I also added a COMMENT line to the GPL.) The substitution is triggered by that notation, and the values to substitute come from the % locals() expression at the end of the string.
- The value of thetitle is set at the top of the program block. The value is enclosed in triple quotes, so it could be multiple lines or text that contained quote characters.
- The last line of this program uses the spss.Submit function to run the command whose syntax is in cmd.
Using this mechanism, we have generalized the command to allow for any title text. A real problem would usually have more than one substitution parameter, but the logic is the same. Refer to the parameter by name in the appropriate part of the GPL and assign a name at the top. You might also need a little code to parameterize the axis labeling based on variable labels. That's easy to do, but I won't explain that here.
This mechanism requires that you install the Python Essentials, available for free from this site.
So now we have solved the problem of generalizing the GPL, but having generalized this command, we might want to use it in other job streams. Duplicating the code is always a bad idea. Python lets us remove the code from the job stream and just refer to it. It's something like the Statistics INSERT command, but it is more flexible.
Here is the third version of the code where the GGRAPH and GPL code has been removed from the job stream.
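With the chart definition moved into a library, the job stream shrinks to something like this. (chartlib and mybarchart are the names used in the sketches here; any names would do.)

```
BEGIN PROGRAM.
import chartlib
chartlib.mybarchart("Bar Chart of Employment Category")
END PROGRAM.
```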
- Now in the program code, we import a library named chartlib and then call a function in that library passing in the title. chartlib could contain many functions that define different sorts of charts (or do other things). Now improvements can be made once in chartlib and used by all the job streams that import it.
- The import statement did not say where to find chartlib. Python has an elaborate strategy for finding imported modules. Refer to the Python documentation for the full story, but for now, we will just put chartlib.py in the extensions subdirectory of the SPSS Statistics installation. Python will find it there.
What remains is to see what the chartlib module looks like. Here it is. It looks almost identical to our first parameterized version, except that the chart code has been moved inside a function named mybarchart that has one parameter for the title. Everything is indented under the function declaration, which starts with def. The line immediately after the declaration is the docstring, which should be used to document the function.
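A sketch of chartlib.py along those lines. (The GPL body is abbreviated again; it is the same specification as before, with the %(thetitle)s placeholder in the title line.)

```
# chartlib.py - place in the extensions subdirectory of the Statistics installation
import spss

def mybarchart(thetitle):
    """Run a bar chart titled with the thetitle argument."""
    cmd = r"""GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s = userSource(id("graphdataset"))
  DATA: jobcat = col(source(s), name("jobcat"), unit.category())
  DATA: COUNT = col(source(s), name("COUNT"))
  GUIDE: text.title(label("%(thetitle)s"))
  ELEMENT: interval(position(jobcat*COUNT))
END GPL.""" % locals()
    spss.Submit(cmd)
```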
By putting this code inside a function, we open the door to defining many functions in this same module and selecting the one we want in the Statistics syntax stream, passing in any desired parameters.
Summarizing, by parameterizing the GPL code and moving it into a function in our library module, we have generalized the code and made it easy to maintain and share across different job streams. Although this posting is motivated by the need to parameterize GPL, these techniques can be used with any Statistics code.
There is one more thing we could do to completely hide the Python code in the job stream. We could use the SPSSINC PROGRAM extension command to provide standard Statistics code for passing the parameters and invoking the relevant function. I'll leave that for another time, but you can get that extension command from this site and read about it in the module you download.
Recently this problem was posed on the SPSSX-L listserv (linked on the SPSS Community site): Count the number of distinct values in a set of variables for each case. This led to a lively discussion of alternative solutions. Most used traditional syntax. I used Python programmability.
Datasets sometimes need to be restructured between wide form, where there are multiple measurements on a concept expressed as different variables within a case, and long form, where each repeated measurement is a separate case. Commonly, repeated measures, for example, test scores at different points of time for a subject, are stored in wide form, and IBM SPSS Statistics procedures that focus on repeated measures designs tend to require this form. Most data transformations and simple summary statistics, however, are intended for long form. Thus an easy way to convert between these forms is important.
IBM SPSS Statistics provides the commands VARSTOCASES and CASESTOVARS to convert between these forms. These are used by the Restructure Data Wizard, which appears on the Data menu.
SPSS Statistics has a number of transformation functions that work across variables within a case, such as mean, median, and sum, and the COUNT command that counts occurrences of a particular value, but there is no built-in function for counting distinct values. By converting these data from wide form to long form, the problem can be solved with traditional syntax.
The traditional syntax solution, then, has these steps (mainly worked out by David Marso). The syntax can all be generated using the GUI.
- Save the dataset if there have been changes.
- Use VARSTOCASES to replace the active dataset with one containing an id variable (from the original data or generated) and one variable representing the variables over which the calculation will be done. Call that new variable Z. The number of records is M * NV, where M is the number of IDs and NV is the number of variables over which the count is required.
- Use AGGREGATE with the ID variable and Z as the break variables. Use N as the statistic. Now the number of records in this new dataset is M * average number of distinct values.
- Activate the new dataset and use AGGREGATE again on it breaking just on ID, and use N as the statistic. This results in one record per ID with the count of distinct values. The resulting dataset has M records.
- Get the original dataset, and use MATCH FILES to add this value back to it.
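The steps above can be sketched in syntax like this. (The variable names v1 TO v3 and id are illustrative, and the data are assumed to already contain the id variable.)

```
DATASET NAME orig.
DATASET COPY work.
DATASET ACTIVATE work.
* Step 2: one case per (id, variable) combination.
VARSTOCASES /MAKE Z FROM v1 v2 v3.
* Step 3: one case per (id, distinct value of Z).
AGGREGATE /OUTFILE=* /BREAK=id Z /n=N.
* Step 4: one case per id with the count of distinct values.
AGGREGATE /OUTFILE=* /BREAK=id /ndistinct=N.
DATASET NAME counts.
* Step 5: add the count back to the original data.
DATASET ACTIVATE orig.
SORT CASES BY id.
MATCH FILES /FILE=* /TABLE=counts /BY id.
EXECUTE.
```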
This works, but it is quite a few data passes, and it takes some study to understand the code.
The programmability solution is much simpler. It uses the SPSSINC TRANS extension command along with a two-line Python program to arrive at the same result. Here is the Python program followed by the command syntax, and then an explanation of each.
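The two-line program is just a function built on Python's set type. A minimal sketch with a quick check of its logic; in Statistics it would sit inside a BEGIN PROGRAM ... END PROGRAM block.

```python
# The two-line program: count the distinct values among the arguments.
def countThem(*args):
    return len(set(args))

# Quick check of the logic outside Statistics:
print(countThem(1, 2, 1))     # 2 distinct values: 1 and 2
print(countThem(5, 5, 5, 5))  # 1 distinct value
```

In the syntax stream it would then be invoked along the lines of SPSSINC TRANS RESULT=ndistinct /FORMULA "countThem(v1, v2, v3)". (ndistinct and v1, v2, v3 are placeholder names.)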
- The "*args" signature in the countThem function means that args will be a list of the arguments (variable values in this case) passed to the function when it is called, so the function can handle any number of variables.
- set(args) creates a set from that list. Since a set can only contain the same item once, it will contain only the distinct values. If I pass it the list [1,2,1], the set will contain two members: 1 and 2.
- The length of the set, returned by the len function, is the number of members and hence the return value is the number of distinct values. This includes None, if there were any SYSMIS values. That could easily be excluded if they should not be counted.
- The program defined in the begin program block remains available throughout the session even though the program block has terminated, so once it is defined, it can be called elsewhere in that session.
- SPSSINC TRANS is an extension command implemented in Python that applies the code in the FORMULA subcommand to the cases in the active dataset and stores the result in the RESULT variable. It fetches the variable values referenced in the formula and calls whatever function was specified.
- Since countThem is called without any module qualifier, the extension command first tries to find that function in the items that have been defined in begin program blocks. If not found there, it looks for a function built in to Python, e.g., min, max, sum, ... If you have a function defined in some other module, you can reference it as modulename.funcname, and the command will load that module and then call the indicated function.
- The entire formula is quoted so that it will not be digested by the Statistics parser but rather passed as is to the command. Variable names are case sensitive.
- SPSSINC TRANS can return more than one value, hence creating multiple variables, and it has a number of other features that can be found in its syntax or dialog box help (Transform>Programmability Transformation).
So which is better? The traditional syntax does not require any programmability knowledge and works on quite elderly versions of Statistics, but it is rather complicated and takes many data passes. The Python approach requires a little knowledge of programmability and requires at least version 17 (2008) and the Python plugin, but, given that, it is easy to read and takes only one data pass. That data pass, however, will be slower than a native data pass.
The Python Essentials and the SPSSINC TRANS extension command can be downloaded from the SPSS Community site (www.ibm.com/developerworks/spssdevcentral if you are not reading this on the site).