Case-control matching is a popular technique used to pair records in the "case" sample with similar records in a typically much larger "control" sample based on a set of key variables. This post discusses the FUZZY extension command for SPSS Statistics that implements this technique and some recent enhancements to it.
In this discussion I will refer to records rather than what SPSS usually calls cases in order to avoid confusion with case as in case control.
FUZZY takes two datasets as input (the demander and supplier datasets), matches the records according to a set of BY variables, and provides various ways of writing the output. It does not have a dialog box interface, but running FUZZY /HELP displays the complete syntax. Matches can be required to be exact on all variables, or a tolerance or "fuzz" factor, which could be zero, can be specified for each matching variable. (String matches can only have fuzz 0.)
Matches are not always possible: missing values or blank strings in a BY variable preclude matching. There might not be a close enough record in the supplier dataset to pair with a demander record, or you might run out of eligible supplier records before the demander records are all paired. Unpaired output is set to the system missing value or, for strings, to blank.
FUZZY proceeds by finding for each demander record all of the supplier records that are close enough on the BY variables. This requires a lot of comparisons! It then proceeds through the demander records and picks a supplier record at random from all those eligible for that record. (You can request multiple supplier records for each demander if needed.) No attempt is made to find the closest eligible record, since there is no measure of closeness across the set of BY variables and for other reasons.
If sampling from the suppliers is done without replacement, which is the default behavior, then using a supplier record makes it unavailable for matching in later demander records and could result in a later record having no match. With fuzzy matches, it could be that both records would have been matched if the order were reversed.
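The eligibility test and the without-replacement pairing pass can be sketched in plain Python. This is only an illustration of the logic described above, not FUZZY's actual implementation; records are represented as tuples of BY values, and the fuzz values are per-variable tolerances.

```python
import random

def eligible(demander, supplier, fuzz):
    # A supplier qualifies if every BY variable is within its tolerance.
    return all(abs(d - s) <= f for d, s, f in zip(demander, supplier, fuzz))

def pair(demanders, suppliers, fuzz, seed=0):
    rng = random.Random(seed)
    available = list(range(len(suppliers)))
    matches = []
    for d in demanders:
        pool = [i for i in available if eligible(d, suppliers[i], fuzz)]
        if pool:
            choice = rng.choice(pool)   # pick at random, not "closest"
            available.remove(choice)    # sampling without replacement
            matches.append(choice)
        else:
            matches.append(None)        # no eligible supplier remains
    return matches
```

Note how a supplier consumed by an earlier demander record can leave a later record unmatched, which is exactly the order-dependence discussed above.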
While one would generally want to specify an exact match for categorical variables, at least those with nominal measurement level, continuous variables such as income or age might require some fuzz. New output from FUZZY can help to diagnose which BY specifications cause a record to go unmatched. Here is a table produced by FUZZY that shows how the BY criteria restrict the matches.
In this example, we are matching two datasets about vehicles on the variables origin and cylinder. We require an exact match on origin, but allow a difference of up to 3 in the number of cylinders as shown in the first column.
The first row records the number of comparisons between demander and supplier records testing for an exact match: comparing each demander record to each supplier record, 95% of the comparisons were not exact matches, i.e., only 5% matched exactly.
Next, the table shows that, among the comparisons remaining after removing the 5% that matched exactly, 85% did not match on origin. Then, considering only the records where there was an exact match on origin, 75% of the comparisons did not match on cylinder. Each row of the table is based on the comparisons that passed, i.e., were within tolerance, on all of the preceding rows.
This table, which precedes the actual pairing step, can be useful in finding the variables that are most important in preventing matches. You may need to increase the tolerance, or you may just be out of luck if there are insufficient matches. The table does not tell you how successful the actual pairing step will be, because supplier records will be used up as the pairing pass proceeds, but it gives insight into how the variables filter the matches. The tries and rejection columns are only produced when you specify fuzz, but you could set all the fuzz values to 0 to see the results with exact matching.
The next table shows the distribution of eligible matches for the pairing pass (this example is based on a very small dataset). It shows how many eligible records there were for each demander record in the pairing pass. It shows that there were two demander records for which there were zero eligible supplier records, three where there was only one, and one where there were two to ten eligibles. This gives you a good idea of how rich the supplier dataset is in matchables, but it doesn't say anything about which variables have the biggest effects on pairing.
If you are lucky, all your demander records will find a match, but if they don't, what can you do? Recall that the pairing stage is first come, first served. With fuzzy matching, reordering the demander cases might work better. While FUZZY can't find an optimal order, one more output feature can help you improve the results. Specifying DRAWPOOLSIZE=varname will add a variable to the demander dataset that records how many eligible supplier records there were for each demander record. You can then study the characteristics of the demander records where suppliers are scarce to see where the supplier dataset is too thin. A good start toward improving the pairing percentage, though, is to sort the demander dataset by the newly created pool-size variable. That puts the hardest-to-match demander records first in line for a match when you rerun the process and will generally reduce the number of unmatched cases if a reduction is possible.
There are statistical issues regarding how you choose the BY variables that are not addressed here. Searching the web for something like "case control matching" will turn up numerous references.
FUZZY is available from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) and requires the Python Essentials for your version of SPSS Statistics.
This is a quick note for people who produce a lot of tables with SPSS Statistics or use a lot of scripting code for formatting.
Statistics Version 20 was released on Tuesday, August 16, 2011. One of the big improvements is in the speed of table production. Five times faster is typical! If you use scripting or tools such as SPSSINC MODIFY TABLES to modify the formatting or do other table operations, these will also run much faster. There is no need to use the (unscriptable) lightweight tables from Version 19 in order to get fast table production.
In Version 20, fast tables are the default option. Only if you need to produce spv files that can be read by older versions of Statistics do you need to change this option. Fast tables support the full formatting and scripting capabilities of the previous table format.
Of course, there are other important improvements in Version 20 such as mapping as a base feature, but I am particularly pleased with this improvement and congratulate the team that worked hard to make this happen.
Recently this problem was posed on the SPSSX-L listserv (linked on the SPSS Community site): count the number of distinct values in a set of variables for each case. This led to a lively discussion of alternative solutions. Most used traditional syntax. I used Python programmability.
Datasets sometimes need to be restructured between wide form, where there are multiple measurements on a concept expressed as different variables within a case, and long form, where each repeated measurement is a separate case. Commonly, repeated measures, for example, test scores at different points of time for a subject, are stored in wide form, and IBM SPSS Statistics procedures that focus on repeated measures designs tend to require this form. Most data transformations and simple summary statistics, however, are intended for long form. Thus an easy way to convert between these forms is important.
IBM SPSS Statistics provides the commands VARSTOCASES and CASESTOVARS to convert between these forms. These are used by the Restructure Data Wizard, which appears on the Data menu.
SPSS Statistics has a number of transformation functions that work across variables within a case, such as mean, median, and sum, and the COUNT command that counts occurrences of a particular value, but there is no built-in function for counting distinct values. By converting these data from wide form to long form, the problem can be solved with traditional syntax.
The traditional syntax solution, then, has these steps (mainly worked out by David Marso). The syntax can all be generated using the GUI.
- Save the dataset if there have been changes.
- Use VARSTOCASES to replace the active dataset with one containing an id variable (from the original data or generated) and one variable representing the variables over which the calculation will be done. Call that new variable Z. The number of records is M * NV, where M is the number of IDs and NV is the number of variables over which the count is required.
- Use AGGREGATE with the ID variable and Z as the break variables. Use N as the statistic. Now the number of records in this new dataset is M * average number of distinct values.
- Activate the new dataset and use AGGREGATE again on it, breaking just on ID, with N as the statistic. This results in one record per ID with the count of distinct values. The resulting dataset has M records.
- Get the original dataset, and use MATCH FILES to add this value back to it.
This works, but it is quite a few data passes, and it takes some study to understand the code.
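The logic of those aggregation steps can be sketched in plain Python. This is not SPSS syntax, just an illustration of why the two AGGREGATE passes yield the distinct-value count; the sample data are made up.

```python
# Wide form: one row per id, with the repeated measurements as separate variables.
wide = {"id1": [1, 2, 1], "id2": [3, 3, 3]}

# VARSTOCASES step: melt to long form -- one (id, Z) record per measurement,
# giving M * NV records.
long_form = [(ident, z) for ident, values in wide.items() for z in values]

# First AGGREGATE (break on id and Z): one record per distinct (id, Z) pair.
distinct_pairs = set(long_form)

# Second AGGREGATE (break on id alone, N statistic): count records per id.
counts = {}
for ident, _ in distinct_pairs:
    counts[ident] = counts.get(ident, 0) + 1
```

Here id1 has the values 1, 2, 1, so its count of distinct values is 2, while id2 has only the value 3, giving a count of 1.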
The programmability solution is much simpler. It uses the SPSSINC TRANS extension command along with a two-line Python program to arrive at the same result. Here is the Python program followed by the command syntax, with explanations of each part.
- The "*args" signature in the countThem function means that args will be a list of the arguments (variable values in this case) passed to the function when it is called, so the function can handle any number of variables.
- set(args) creates a set from that list. Since a set can only contain the same item once, it will contain only the distinct values. If I pass it the list [1,2,1], the set will contain two members: 1 and 2.
- The length of the set, returned by the len function, is the number of members and hence the return value is the number of distinct values. This includes None, if there were any SYSMIS values. That could easily be excluded if they should not be counted.
- The program defined in the begin program block remains available throughout the session even though the program block has terminated, so once it is defined, it can be called elsewhere in that session.
- SPSSINC TRANS is an extension command implemented in Python that applies the code in the FORMULA subcommand to the cases in the active dataset and stores the result in the RESULT variable. It fetches the variable values referenced in the formula and calls whatever function was specified.
- Since countThem is called without any module qualifier, the extension command first tries to find that function in the items that have been defined in begin program blocks. If not found there, it looks for a function built in to Python, e.g., min, max, sum, ... If you have a function defined in some other module, you can reference it as modulename.funcname, and the command will load that module and then call the indicated function.
- The entire formula is quoted so that it will not be digested by the Statistics parser but rather passed as is to the command. Variable names are case sensitive.
- SPSSINC TRANS can return more than one value, hence creating multiple variables, and it has a number of other features that can be found in its syntax or dialog box help (Transform>Programmability Transformation).
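Putting the pieces above together, the two-line program looks like this. The SPSSINC TRANS invocation in the comment is illustrative; the variable names v1, v2, v3 and the result name ncount are placeholders for whatever your dataset uses.

```python
def countThem(*args):
    """Return the number of distinct values among the arguments."""
    return len(set(args))

# Inside Statistics, it would be applied with something like:
# BEGIN PROGRAM. ... END PROGRAM.  (defining countThem as above), then
# SPSSINC TRANS RESULT=ncount
#   /FORMULA "countThem(v1, v2, v3)".
```

Note that system-missing values arrive as None, which counts as one distinct value unless you filter it out.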
So which is better? The traditional syntax does not require any programmability knowledge and works on quite elderly versions of Statistics, but it is rather complicated and takes many data passes. The Python approach requires a little knowledge of programmability and requires at least version 17 (2008) and the Python plugin, but, given that, it is easy to read and takes only one data pass. That data pass, however, will be slower than a native data pass.
The Python Essentials and the SPSSINC TRANS extension command can be downloaded from the SPSS Community site (www.ibm.com/developerworks/spssdevcentral if you are not reading this on the site).
We are sometimes asked why we use Python as the main programmability and scripting language in IBM SPSS Statistics. Here's a quote from an InfoWorld Developer World article dated October 17, 2011, entitled "From PHP to Perl: What's hot, what's not in scripting languages."
Hot scripting language: Python
In a sense, the tipping point for Python came when the housing market crashed. Among those stuck trying to decode bond market prospectuses to figure out who got paid what when the bankruptcy dominoes were done falling, one thing became clear: English is a weaselly language, and some weaselly folks revel in its ambiguities to profit from complicated derivatives. There was one smart group that offered a compelling solution: force every bond to come with a Python program that defined who got paid and when they were paid. While they may have been a bit too hopeful about the power of computer languages, the proposal spoke to Python's growing acceptance in the greater world of smart people who aren't computer geeks. Relatively easy to pick up, Python is becoming increasingly popular in economic research, science departments, and biology labs.
I'm not so sure about the bond market idea(!), but Python excels in its combination of clarity, flexibility, and power, and statistics and data analysis are a case in point. SPSS began embedding Python with version 14 way back in 2005. As of October, the TIOBE index shows Python as the 8th most popular programming language overall, FWIW. While this is only a partial view of language usage, it confirms again that Python is a class A language.
IBM SPSS Statistics provides several mechanisms for looping in transformations or over groups within a procedure. The new extension commands expand the looping capabilities.
Using the standard capabilities, you can create loops in several ways.
- Transformations can contain general loops using LOOP, and you can loop over variables with DO REPEAT. Since transformations implicitly run within the case processing loop, you cannot include procedures within these loops.
- SPLIT FILES lets procedures iterate over contiguous subgroups of the case data. The procedure's Viewer output either combines the output for all the groups into a single table or produces a set of separate tables and charts organized by group.
- Python programmability allows Pythonistas to iterate over collections of files or variables, among other things. It is very general but requires Python knowledge.
What none of these methods allows you to do is to apply a set of commands to a collection of inputs and organize the entire set of output by group using regular SPSS syntax. For example, you might want to process a set of data files, run several transformations and procedures, and save all the output for each input file to a separate document. You might also want to save all the transformed datasets. These new commands address problems like this.
In order to generalize the split file idea, you first use SPSSINC SPLIT DATASET to make a directory of datasets, one per split value. The native way to do such an operation is to use the XSAVE transformation command with appropriate DO IF conditions for each group. This works well, but it has three problems. First, you have to have an exhaustive list of all the split values. Second, you have to write a lot of code. Third, the number of XSAVE commands that you can use in a single transformation block is limited. Prior to version 18, the limit was ten. For newer versions the limit is 64. So you have to count up and divide your code into separate blocks. Once you have all this working, if a new split value appears, you have to revise the code - if you notice the new value.
SPSSINC SPLIT DATASET eliminates these problems. It figures out what split values occur and generates the requisite syntax, taking into account the XSAVE limit. It lets you choose whether to name the outputs by variable values, labels, or sequential numbers, and it can produce a listing of the files created for use in later processing. No risk of unnoticed new values. And the data do not need to be sorted by the split values.
SPSSINC PROCESS FILES addresses the other side of the problem. It accepts an input specification that could be something like a file wildcard, e.g., /mydata/*.sav, or a file that lists the files to process such as produced by SPSSINC SPLIT DATASET. Then it applies the contents of a syntax file to each input, i.e., it loops over the inputs. It defines file handles and macros representing the input and output parameters for each file processed, so you can refer to these explicitly in the iterated syntax. It can write an individual Viewer output file for each input file, or it can produce a single Viewer file with all the output. These automatic files get names based on the input file names, but, of course, you can do other things in the syntax file. It can also produce a log listing all the actions taken and whether any serious errors occurred.
SPSSINC PROCESS FILES is not limited to SAV files or even data. It's up to you what you want to loop over.
SPSSINC PROCESS FILES solves the long-standing request for a way to put procedures inside loops. With these new tools, you can now easily create general transformation loops, loops over variables, procedure loops over groups, or entire job loops over arbitrary inputs whether or not you are a Python person.
These commands can be downloaded from SPSS Developer Central (www.spss.com/devcentral). They require at least IBM SPSS Statistics Version 17 and the Python programmability plug-in.
p.s. Both of these commands have dialog box interfaces as well as standard SPSS-style syntax.
I hope you will find these useful.
Users of traditional SPSS Statistics syntax are used to using the macro facility to parameterize blocks of syntax so that it is more flexible and can be varied without having to duplicate and edit the code. However the GGRAPH command, which provides deep access to the capabilities of the graphics engine, specifies the graph using GPL, the graphics specification language of SPSS. And GPL does not work with macro. How, then, can GPL code be parameterized? This post explains how to do this and, in the process, how to build a library of your own graphics specifications that removes the chart definition details from your syntax stream.
First, let's look at the syntax for a bar chart as generated by the Chart Builder. These examples use the employee data.sav file shipped with the software.
- The GGRAPH command is a standard Statistics command and follows all the normal rules for syntax. Macro can be used with it.
- The GPL block contains the actual chart specifications as indicated by the GGRAPH GRAPHSPEC subcommand. As you can see, it looks different from traditional syntax, and it follows different rules. GPL syntax is explained in detail in the Help under the GPL topic. You can do many things with it, but using macro is not one of them.
- This syntax is completely specific to the specification for this chart. To change the title, say, would require manual editing of the GUIDE statement with text.title (or generating a new command with the Chart Builder). Not very good for production work.
If you can't use macro to generalize this, what can you do? I'll show you how to use Python programmability not only to replace macro but to build a library of chart definitions that can be shared among different syntax streams. First, let's see how we could parameterize the title of the chart. (Real problems will want to do more, but the idea is the same.) Here is the first version.
- The entire GGRAPH command and the GPL code are assigned to the variable cmd inside the BEGIN PROGRAM block. The text of the command is the same as before except for the title line. In that line, in place of the title, we have the notation %(thetitle)s, which means to insert the value of the variable thetitle there. It's just like macro substitution except that it works! (I also added a COMMENT line to the GPL.) The substitution is triggered by that notation, and the values to substitute come from the % locals() expression at the end of the string.
- The value of thetitle is set at the top of the program block. The value is enclosed in triple quotes, so it could be multiple lines or text that contained quote characters.
- The last line of this program uses the spss.Submit function to run the command whose syntax is in cmd.
Using this mechanism, we have generalized the command to allow for any title text. A real problem would usually have more than one substitution parameter, but the logic is the same. Refer to the parameter by name in the appropriate part of the GPL and assign a name at the top. You might also need a little code to parameterize the axis labeling based on variable labels. That's easy to do, but I won't explain that here.
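The mechanism looks like this in outline. The GGRAPH/GPL body here is deliberately abbreviated; a real command would carry the full specification generated by the Chart Builder, and the dataset and variable names are only examples.

```python
thetitle = """Employees by Job Category"""

# %(thetitle)s is replaced by the value of thetitle via the % locals() at the end.
cmd = """GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  COMMENT Parameterized bar chart.
  GUIDE: text.title(label("%(thetitle)s"))
END GPL.""" % locals()

# Inside Statistics, spss.Submit(cmd) would then run the command.
```

Any number of parameters can be substituted this way; each %(name)s in the string picks up the variable of that name.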
This mechanism requires that you install the Python Essentials available (for free) from this site.
So now we have solved the problem of generalizing the GPL, but having generalized this command, we might want to use it in other job streams. Duplicating the code is always a bad idea. Python lets us remove the code from the job stream and just refer to it. It's something like the Statistics INSERT command, but it is more flexible.
Here is the third version of the code where the GGRAPH and GPL code has been removed from the job stream.
- Now in the program code, we import a library named chartlib and then call a function in that library passing in the title. chartlib could contain many functions that define different sorts of charts (or do other things). Now improvements can be made once in chartlib and used by all the job streams that import it.
- The import statement did not say where to find chartlib. Python has an elaborate strategy for finding imported modules. Refer to the Python documentation for the full story, but for now, we will just put chartlib.py in the extensions subdirectory of the SPSS Statistics installation. Python will find it there.
What remains is to see what the chartlib module looks like. Here it is. It looks almost identical to our first parameterized version, except that the chart code has been moved inside a function named mybarchart that has one parameter for the title. Everything is indented under the function declaration, which starts with def. The line after the declaration is the docstring, which should be used to document the function.
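A sketch of such a chartlib module follows. As before, the GGRAPH/GPL body is abbreviated and the dataset and variable names are illustrative; inside Statistics the function would submit the command rather than just return it.

```python
# chartlib.py -- a sketch of a chart definition library
def mybarchart(thetitle):
    """Build a parameterized bar chart command with the given title."""
    cmd = """GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  GUIDE: text.title(label("%(thetitle)s"))
END GPL.""" % locals()
    # import spss; spss.Submit(cmd)   # run it when inside SPSS Statistics
    return cmd
```

The syntax stream then needs only `import chartlib` and a one-line call such as `chartlib.mybarchart("Employees by Job Category")` inside a BEGIN PROGRAM block.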
By putting this code inside a function, we open the door to defining many functions in this same module and selecting the one we want in the Statistics syntax stream, passing in any desired parameters.
Summarizing, by parameterizing the GPL code and moving it into a function in our library module, we have generalized the code and made it easy to maintain and share across different job streams. Although this posting is motivated by the need to parameterize GPL, these techniques can be used with any Statistics code.
There is one more thing we could do to completely hide the Python code in the job stream. We could use the SPSSINC PROGRAM extension command to provide standard Statistics code for passing the parameters and invoking the relevant function. I'll leave that for another time, but you can get that extension command from this site and read about it in the module you download.
Modified by JonPeck
The SPSS Statistics Custom tables (CTABLES) procedure can test equality of column proportions and column means. The test results are displayed either in a separate table or, using APA style, in the main table. Users commonly want to highlight significant differences, which is not possible using built-in functionality. This article explains how to do this.
Setup: In order to use the approach described here, you need the Python Essentials, including the SPSSINC MERGE TABLES and SPSSINC MODIFY TABLES extension commands. If these are not already installed, they can be obtained via the SPSS Community website. Details of what is installed by default vary across versions of Statistics. If your installation of SPSSINC MODIFY TABLES does not have the custom function described below, get a newer version from the website.
Here is the syntax and output from CTABLES testing equality of column proportions. The example uses the employee data.sav file shipped with Statistics. Our goal is to show the significant differences in the main table and highlight them.
You might first ask why we did not start with the single table output in APA style, since we want to show significance in the main table. There are two reasons: first, in my experience, nobody likes the APA style, and, second, it won't help us to color the significant cells anyway.
If you have Statistics version 22 or later, you might think that the new OUTPUT MODIFY command or Table Style subdialog could handle this task, but it can't, because the significance values do not occur in the test table with their own column. While OUTPUT MODIFY is a fabulous tool for unconditional or conditional formatting of tables and is very easy to use, this problem is too complicated for it.
We should, before proceeding further, pause to clarify what we mean by coloring significant cells. Significance is a property of pairs of cells, not of individual cells. In the test table above, the columns are lettered A to C, and the key of the column with a significantly smaller column proportion is shown in the category of the larger proportion. Without the letter codes, you would know, for example, that educational level 15 is significantly larger for the Clerical category than for some other job categories, but you would not know which ones. So we can't just replace the letter codes with a color without a loss of information.
We can, however, color cells corresponding to the significance codes while retaining the specific information about which categories they refer to. There are two steps to the process. The first step is to merge the test table into the main table. In order to do that, we need to use the SPSSINC MERGE TABLES extension command. MERGE TABLES works by joining columns (or rows) of tables based on the column labels, possibly adjusting for extra terms appearing in only one. In this example, we will join the tables based on the job category labels - Clerical, Custodial, and Manager, ignoring the Count and letter codes that also appear. Fortunately the default settings in MERGE TABLES do this automatically.
Using the MERGE TABLES dialog box under Utilities > Merge Viewer Tables, the dialog would look like this.
We didn't specify table types, so the command uses the most recent pair of tables in the Viewer. We are merging by column based on the Count statistic. In this case, that is the only statistic, but if the table also contained, say, percentages, specifying Count would single out the Count columns for the merge. MERGE TABLES has many other features for more complex merging. Check out the help if your tables are too complex for a simple merge.
Here is the result of the merge.
As you can see, the letter codes from the secondary tables now appear in the corresponding cells in the main table, and the caption from the secondary table has been attached to the main table. Now we want to go further and highlight the cells that have a letter code. This is where the SPSSINC MODIFY TABLES (Utilities > Modify Table Appearance) extension command comes in.
MODIFY TABLES can apply conditional or unconditional styling to table cells and labels. It has built-in syntax for selecting cells and bolding, foreground and background coloring, and other properties, but here we need to go a little further. We need to select the cells that contain a letter code and then apply styling. MODIFY TABLES allows for small plug-in functions to extend its capabilities, and, fortunately, there is a plug-in function included with the command to do exactly that. Here is what the MODIFY TABLES dialog looks like for this task.
We specified that all cells should be selected - note the odd notation - and that the action is to apply the function shown in the subdialog. In this example we accept the default background color of yellow, but you can specify red, blue, and green color values between 0 and 255 as parameters to get different colors.
To see the full set of functions in the customstylefunctions.py module included with the command, open that file in a plain text editor. The comments describe how to use each function, so you don't need to know Python in order to use them. Note, though, that the function and parameter names are case sensitive.
You can write your own custom functions, which are usually very short, if you know a little bit of Python.
To use a different background color, say, red, the custom function text would be
customstylefunctions.colorIfEndsWithAtoZLetter(r=255, g=0, b=0)
Finally, here is the result. All the cells that have a letter in the value now have a yellow background. We could have done something similar with bolding the cell text or applying a different styling.
In summary, by applying the SPSSINC MERGE TABLES and SPSSINC MODIFY TABLES to the CTABLES output containing the test results in a separate table, we have highlighted the significant differences while preserving the details from the secondary table.
p.s. If you are wondering about writing your own custom functions, this is the entire code for the function described here other than the standard function signature. First it calculates the color based on the parameters specified. Then, as it is fed the table data cells, it checks the value for a letter code and sets the background color if one is found.
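The heart of that test can be sketched in a few lines of plain Python. This is only the letter-detection idea behind colorIfEndsWithAtoZLetter, not the actual plug-in code, which also has to interact with MODIFY TABLES' cell objects to set the background color.

```python
import re

# A cell is flagged if its formatted value ends with a column key letter,
# e.g. "27 A B" from the merged table, but not a plain count like "154".
def has_letter_code(celltext):
    return bool(re.search(r"\b[A-Z]\s*$", str(celltext)))
```

In the real function, cells for which this test is true would then have their background color set to the r, g, b values passed as parameters.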
Transformation commands in IBM SPSS Statistics fall basically into two categories: those that are executed when the flow logic gets to them, and those that are executed immediately when read, i.e., before other transformations are executed.
Transformation commands that define variable properties such as VARIABLE LABELS, VALUE LABELS, and MISSING VALUES are generally executed as soon as they are read. That means that they may be executed out of order compared to the execution of the job stream, and they can't be controlled by transformation logic such as DO IF or DO REPEAT. So you can't define properties conditionally, and you can't iterate over them.
Usually you can ignore this behavior, but here's a case where it matters. Suppose you want to create syntax that will eliminate all the variable labels in a dataset (never mind why you might want to do this - this is just an example). The first try might be this DO REPEAT loop.
But you get this puzzling error message.
Error # 4530. Command name: variable label
>This command is not allowed inside the DO REPEAT/ END REPEAT facility.
That's because metadata commands are not subject to the flow control of a loop. In addition, this is a little ugly, because you had to know the names of the first and last variables (in file order) in order even to write the loop command. So this syntax isn't general.
The second try works, but it's way too ugly to contemplate. I couldn't ever bear to write out the whole command.
It works, because you can separate variable specifications with /, and if there is no label in a section, the label is empty, i.e., removed.
This is too painful to write, but it offers a clue. If we could only generate that slash-separated list automatically, we could feed it to this command. That's where SPSSINC SELECT VARIABLES comes in. This is an extension command available from the SPSS Community (www.ibm.com/developerworks/spssdevcentral if you aren't already reading this from the site) and requires the Python Essentials. This command allows you to define an SPSS macro consisting of a list of variables that meet various criteria. It could be an explicit list, but you could filter on variable type, patterns in names, custom attributes, and/or measurement level. With no filtering, it selects all the variables in the dataset. The macro text replaces the macro name reference when it is encountered in the syntax.
So we're almost there. If I run
spssinc select variables macroname = "!all". (note the quotes)
I get a list of all the variables that I could feed to the VARIABLE LABELS command. But that won't quite work, because I need those slashes to separate the items in the list. This will do the trick.
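Something along these lines should work; I am writing the separator keyword from memory, so treat its exact spelling as an assumption and confirm it with SPSSINC SELECT VARIABLES /HELP:

```
spssinc select variables macroname = "!all"
  /options separator = "/".
```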
(It is conventional to start macro names with "!" so that they won't collide with variable names or syntax constructs.)
Now all I have to do is this.
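That is, generate the macro with the slash separator and then use it in VARIABLE LABELS (again, the separator keyword spelling is from memory; verify it with the command's /HELP output):

```
spssinc select variables macroname = "!all"
  /options separator = "/".
variable labels !all.
```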
This gives me syntax that will work for any data file regardless of the variable names.
In order to use this command, you have to install the extension command (and the Essentials if you don't already have that), but then you can forget that it is an extension. It works just like the built-in commands. The more you can generalize your syntax, the less work you have to do, and the less the chance of errors creeping in.
A final note: this example is entirely serendipitous. I created this command for selecting subsets of variables to automate analysis tasks. Here it is being used to select all the variables. And I thought the only useful separators between the selected variables would be a blank for ordinary variable lists and + for CTABLES expressions, but here / is required.
SPSS Statistics was, as far as I know, the first commercial software to deliver an integration with the R statistical language. It first appeared in SPSS 16, over six years ago at this writing, complementing the Python language integration that first appeared in SPSS 14. This post reviews the rationale and developments in this feature.
R has become the computational language of the statistics profession. It's the way a new statistical algorithm is first published, and the R library contains a vast collection of statistical functions. The R jungle contains many gems, but it has drawbacks, too. This isn't the place to discuss all the good and the bad, but suffice it to say that using R directly imposes a style of doing statistical analysis based on a programming model that does not always suit an analyst, and the output from an R package is usually not in a format suitable for publication. And, while there are various point-and-click interfaces that can be added on for some R packages, serious usage requires the user to learn the R language, which is not easy.
SPSS Statistics and, as of Version 16, SPSS Modeler bring to bear the ease of use of these products and their output presentation capabilities, allowing a user to work with these products while still tapping the power and packages of R. While a user can write programs in the R language that run within the Statistics or Modeler program, typically the SPSS user takes advantage of R packages that have already been integrated using the published APIs and tools for this purpose, without the need to learn R or deal with it directly. R packages can extend the statistical capabilities of these products without sacrificing the benefits of SPSS software. The R connection requires an extra installation step (R itself and, via the SPSS Community website, the R Essentials), but all the pieces for this are free. Statistics and Modeler can be a great way to deploy the functionality of R.
Organizations and individuals can do their own, private integrations of R packages, but the SPSS Community site provides a means of sharing integrations with everyone. Instructions for sharing are on the front page of the site. For SPSS Statistics, you can start here to see what has been shared. With Statistics version 22 or later, you can also download and install package integrations from the Utilities menu within Statistics without even visiting the site.
The image also shows extensions implemented in Python.
As of this writing, there are 25 R packages that have been integrated by IBM and 10 contributed by users. Package integrations generally include a dialog box interface produced by the Statistics Custom Dialog Builder and traditional SPSS syntax for the package. They produce their output as SPSS pivot tables and R graphic images that appear in the Statistics Viewer along with other output produced by native SPSS commands. For packages that are included in the R Essentials, the dialog and output are usually translated into all the languages that Statistics itself provides.
For Modeler 16, an adaptation of the Custom Dialog Builder is included, and nodes can build models and provide code to be used with those models for scoring. Using the new Hadoop integration, the scoring can be performed on the Hadoop cluster with big performance benefits. As in Statistics, the user of R-based nodes sees the same behavior that comes with native nodes.
Producing a package integration for Statistics is usually easy for someone who knows the R language. It can be as simple as adding a line to fetch the data. Usually the integration will convert plain text R output to one or more pivot tables, and it may create new datasets. That takes a little longer, but it is still typically a few hours to a few days. The APIs for all this are covered in detail in the help installed with the R plug-in: Help > Programmability > R Plug-In. Since the package integrations are usually distributed in source form, they can serve as examples. Integration creators can also do translations, but not many are prepared to handle this. The SPSS forums are a good place to ask questions about this technology.
There is a white paper that discusses the benefits and technology of using R with Statistics and provides more details.
In sum, the R integrations for Statistics and Modeler allow access to the large R library but package it in a form that fits in with the native capabilities of these products. It's a win for everyone, and it's all free.
As I mentioned last time, some production users spend a lot of time editing and formatting the text in the Viewer outline and object titles. This was a surprise, but I suppose it shouldn't have been. If the Viewer contents are to be read as a document, whether in native format or as exported to PDF or other formats, these items are important.
We think about tables and charts so much that it is easy to forget the outline and titles. So I created the extension command SPSSINC MODIFY OUTPUT for Version 17 to simplify automating this task. It comes with a dialog box interface that appears on the Utilities menu. The command lets you do these kinds of things:
- Select the items to operate on by their type (headings, titles, etc.) or OMS subtype (tables), and by the text of the outline or item title
- Change the text, incorporating the old text or replacing it
- Apply HTML or RTF formatting (right-hand pane only, not all object types)
- Sequence number the items using numbers, letters, or Roman numerals
- Hide selected items
- Insert page breaks
- Apply a custom Python function to selected items.
The goal of this command and SPSSINC MODIFY TABLES and TEXT is to clean up the output and make it into a presentation document in an automated way without forcing you off to Excel or another application to do this.
Titles and headings can be plain text or well-formed simple HTML or RTF.
The dialog box for this command looks like this.
This dialog generates the SPSSINC MODIFY OUTPUT command.
It's not the most beautiful dialog I've ever built, but it was done pretty easily with the Custom Dialog Builder, and it offers most of the functionality available in syntax. When you select objects based on their text, the text can be matched by literal equality, by its start or end, or by a regular expression. The replacement text can incorporate the matched text, possibly with added formatting, or it can replace the text entirely, and it can position a sequence number. For example, the following replacement text might be used.
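Reconstructed from the explanation that follows (the exact punctuation between the number and the title is my guess), the replacement text would be something like:

```
\\0. <i>\\1</i>
```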
What does that mean? The HTML directives say to make "\\1", whatever that is, italic and to put "\\0" in front of it as plain text. \\1 stands for the original text of the item. In this example we are also numbering the items, and \\0 refers to the current sequence number. If the original title is "Means", this specification might produce
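With uppercase Roman numeral numbering, the first such item would come out roughly as (with "Means" rendered in italics):

```
I. Means
```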
The command made the text italic and prefixed a sequence number.
In this case I chose uppercase Roman numerals for the sequence number style. I could have chosen lowercase Roman numerals, upper- or lowercase letters, or just numbers.
The details of what you can do with various kinds of output objects can be found in the dialog box help. A little experimentation will go a long way, too.
If you are working interactively, it's probably not worth the trouble to use this command. You can do most of these actions interactively, but if you are building production jobs, automation is critical, and this command can help to eliminate the drudgery and error-prone editing that might otherwise have to be done by hand.
Since all these changes take place downstream from the output seen by OMS (the Output Management System), any OMS captures will not reflect them even with PDF and the other document formats now available with OMS. But now that we have the OUTPUT EXPORT command, you can create your output, apply the formatting and hiding actions available with this set of commands, and then use OUTPUT EXPORT to export the visible items to PDF and other formats.
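For example, a sketch of exporting just the visible items to PDF with OUTPUT EXPORT (the file name is a placeholder, and the keyword list here is abbreviated; see the Command Syntax Reference for the full set):

```
output export
  /contents export = visible
  /pdf documentfile = 'report.pdf'.
```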
Implementing this feature makes heavy use of the extension command mechanism begun in Version 16, and these features from Version 17: the integration of Python programmability and Python scripting, and the Custom Dialog Builder. MODIFY TABLES and MODIFY OUTPUT were not the easiest features to create, but using them can save you a lot of work and eliminate a large percentage of the situations where you needed to write a script.
You can download this extension command, with its dialog box interface, from Developer Central. It requires the Python plug-in and at least Version 17.