This is a quick note for people who produce a lot of tables with SPSS Statistics or use a lot of scripting code for formatting.
Statistics Version 20 was released on Tuesday, August 16, 2011. One of the big improvements is in the speed of table production. Five times faster is typical! If you use scripting or tools such as SPSSINC MODIFY TABLES to modify the formatting or do other table operations, these will also run much faster. There is no need to use the (unscriptable) lightweight tables from Version 19 in order to get fast table production.
In Version 20, fast tables are the default option. Only if you need to produce spv files that can be read by older versions of Statistics do you need to change this option. Fast tables support the full formatting and scripting capabilities of the previous table format.
Of course, there are other important improvements in Version 20 such as mapping as a base feature, but I am particularly pleased with this improvement and congratulate the team that worked hard to make this happen.
Case-control matching is a popular technique used to pair records in the "case" sample with similar records in a typically much larger "control" sample based on a set of key variables. This post discusses the FUZZY extension command for SPSS Statistics that implements this technique and some recent enhancements to it.
In this discussion I will refer to records rather than what SPSS usually calls cases, in order to avoid confusion with "case" as in case-control.
FUZZY takes two datasets as input (the demander and supplier datasets), matches the records according to a set of BY variables, and provides various ways of writing the output. It does not have a dialog box interface, but running FUZZY /HELP displays the complete syntax. Matches can be required to be exact on all variables, or a tolerance or "fuzz" factor, which could be zero, can be specified for each matching variable. (String matches can only have fuzz 0.)
Matches are not always possible: missing values or blank strings in a BY variable preclude matching. There might not be a close enough record in the supplier dataset to pair with a demander record. You might run out of eligible supplier records before all of the demander records are paired. Unpaired output is set to the system-missing value or, for strings, to blank.
FUZZY proceeds by finding for each demander record all of the supplier records that are close enough on the BY variables. This requires a lot of comparisons! It then proceeds through the demander records and picks a supplier record at random from all those eligible for that record. (You can request multiple supplier records for each demander if needed.) No attempt is made to find the closest eligible record, since there is no measure of closeness across the set of BY variables and for other reasons.
If sampling from the suppliers is done without replacement, which is the default behavior, then using a supplier record makes it unavailable for matching in later demander records and could result in a later record having no match. With fuzzy matches, it could be that both records would have been matched if the order were reversed.
While one would generally want to specify an exact match for categorical variables, at least those with nominal measurement level, continuous variables such as income or age might require some fuzz. New output from FUZZY can help to diagnose which BY specifications cause a record to go unmatched. Here is a table produced by FUZZY that shows how the BY criteria restrict the matches.
In this example, we are matching two datasets about vehicles on the variables origin and cylinder. We require an exact match on origin, but allow a difference of up to 3 in the number of cylinders as shown in the first column.
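A minimal sketch of a command that would produce such a run, assuming the two files are open as datasets named demanders and suppliers and the supplier id variable is called id. Those names, and any keyword spellings not mentioned above, are assumptions; FUZZY /HELP shows the authoritative syntax.

FUZZY DEMANDERDS=demanders SUPPLIERDS=suppliers
  BY=origin cylinder FUZZ=0 3
  SUPPLIERID=id NEWDEMANDERIDVARS=matchedid.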
The first row records the number of comparisons between demander and supplier records testing for an exact match; 95% of these comparisons were not exact matches, i.e., comparing each demander record to each supplier record, only 5% matched exactly.
Next, the table shows that among comparisons after removing the 5% that matched exactly, 85% did not match on origin. Then, considering only the records where there was an exact match on origin, 75% of the comparisons did not match on cylinder. Each row of the table is based on the comparisons that passed, i.e., were within tolerance, on all of the preceding rows.
This table, which precedes the actual pairing step, can be useful in finding the variables that are most important in preventing matches. You may need to increase the tolerance, or you may just be out of luck if there are insufficient matches. The table does not tell you how successful the actual pairing step will be, because supplier records will be used up as the pairing pass proceeds, but it gives insight into how the variables filter the matches. The tries and rejection columns are only produced when you specify fuzz, but you could set all the fuzz values to 0 to see the results with exact matching.
The next table shows the distribution of eligible matches for the pairing pass (this example is based on a very small dataset). It shows how many eligible records there were for each demander record in the pairing pass. It shows that there were two demander records for which there were zero eligible supplier records, three where there was only one, and one where there were two to ten eligibles. This gives you a good idea of how rich the supplier dataset is in matchables, but it doesn't say anything about which variables have the biggest effects on pairing.
If you are lucky, all your demander records will find a match, but if they don't, what can you do? Recall that the pairing stage is first come, first served. With fuzzy matching, reordering the demander records might work better. While FUZZY can't find an optimal order, one more output feature can help you improve the results. Specifying DRAWPOOLSIZE=varname will add a variable to the demander dataset that records how many eligible supplier records there were for each demander record. You can then study the characteristics of the demander records where suppliers are scarce to see where the supplier dataset is too thin. A good start toward improving the pairing percentage, though, is to sort the demander dataset by the newly created poolsize variable. That puts the demander records with the fewest eligible suppliers first in line for a match when you rerun the process and will generally reduce the number of unmatched records, when a reduction is possible.
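A sketch of that rerun strategy, reusing the hypothetical names from the earlier sketch (the DRAWPOOLSIZE keyword comes from the description above; the other names and keyword spellings are assumptions):

FUZZY DEMANDERDS=demanders SUPPLIERDS=suppliers
  BY=origin cylinder FUZZ=0 3
  SUPPLIERID=id DRAWPOOLSIZE=poolsize.
DATASET ACTIVATE demanders.
SORT CASES BY poolsize (A).
* Rerun the matching: the scarcest demander records now go first.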
There are statistical issues regarding how you choose the BY variables that are not addressed here. Searching the web for something like "case control matching" will turn up numerous references.
FUZZY is available from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) and requires the Python Essentials for your version of SPSS Statistics.
Programmability vs. Traditional Syntax in IBM SPSS Statistics: Counting the Number of Distinct Values in a Case
Recently this problem was posed on the SPSSX-L listserv (linked on the SPSS Community site): count the number of distinct values in a set of variables for each case. This led to a lively discussion of alternative solutions. Most used traditional syntax. I used Python programmability.
Datasets sometimes need to be restructured between wide form, where there are multiple measurements on a concept expressed as different variables within a case, and long form, where each repeated measurement is a separate case. Commonly, repeated measures, for example, test scores at different points of time for a subject, are stored in wide form, and IBM SPSS Statistics procedures that focus on repeated measures designs tend to require this form. Most data transformations and simple summary statistics, however, are intended for long form. Thus an easy way to convert between these forms is important.
IBM SPSS Statistics provides the commands CASESTOVARS and VARSTOCASES to convert between these forms. These are used by the Restructure Data Wizard, which appears on the Data menu.
SPSS Statistics has a number of transformation functions that work across variables within a case, such as mean, median, and sum, and the COUNT command that counts occurrences of a particular value, but there is no built-in function for counting distinct values. By converting these data from wide form to long form, the problem can be solved with traditional syntax.
The traditional syntax solution, then (mainly worked out by David Marso), restructures the data to long form, counts each value only the first time it occurs within a case, and merges the count back onto the wide data. The syntax can all be generated using the GUI.
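A minimal sketch of that approach, assuming a numeric id variable and score variables v1 TO v5 (all of these names are hypothetical):

DATASET NAME wide.
DATASET COPY long.
DATASET ACTIVATE long.
VARSTOCASES /MAKE value FROM v1 TO v5 /KEEP=id /NULL=DROP.
SORT CASES BY id value.
* Flag each value the first time it occurs within a case.
COMPUTE firstseen = 1.
IF (id = LAG(id) AND value = LAG(value)) firstseen = 0.
DATASET DECLARE counts.
AGGREGATE /OUTFILE=counts /BREAK=id /ndistinct=SUM(firstseen).
* Merge the distinct-value count back onto the original wide dataset.
DATASET ACTIVATE wide.
SORT CASES BY id.
MATCH FILES /FILE=* /TABLE=counts /BY id.
EXECUTE.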
The programmability solution is much simpler. It uses the SPSSINC TRANS extension command along with a two-line Python program to arrive at the same result. Here is the Python program followed by the command syntax.
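Since the original listing is not reproduced here, the following is a sketch in the same spirit. The variable names (v1 TO v5), the module name (distinctcount.py), and the exact SPSSINC TRANS keywords are assumptions from memory; check the command's help if they differ.

# distinctcount.py - save it somewhere on the Python search path.
def ndistinct(*values):
    # system-missing values arrive as None; don't count them
    return len(set(v for v in values if v is not None))

SPSSINC TRANS RESULT=ndistinct TYPE=0
  /INITIAL "import distinctcount"
  /FORMULA "distinctcount.ndistinct(v1, v2, v3, v4, v5)".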
The function receives the values of the listed variables for the current case, builds a Python set of the non-missing values (duplicates disappear automatically), and returns the size of that set. SPSSINC TRANS evaluates the formula once per case and stores the result in the new variable.
The Python Essentials and the SPSSINC TRANS extension command can be downloaded from the SPSS Community site (www.ibm.com/developerworks/spssdevcentral if you are not reading this on the site).
We are sometimes asked why we use Python as the main programmability and scripting language in IBM SPSS Statistics. Here's a quote from an InfoWorld Developer World article dated October 17, 2011.
The article is entitled "From PHP to Perl: What's hot, what's not in scripting languages."
Hot scripting language: Python
I'm not so sure about the bond market idea(!), but Python excels in its combination of clarity, flexibility, and power. And, case in point, statistics and data analysis. SPSS began embedding Python with version 14 way back in 2005. As of October, the TIOBE index shows Python as the 8th most popular programming language overall FWIW. While this is only a partial view of language usage, it confirms again that Python is a class A language.
Users of traditional SPSS Statistics syntax are used to using the macro facility to parameterize blocks of syntax so that it is more flexible and can be varied without having to duplicate and edit the code. However the GGRAPH command, which provides deep access to the capabilities of the graphics engine, specifies the graph using GPL, the graphics specification language of SPSS. And GPL does not work with macro. How, then, can GPL code be parameterized? This post explains how to do this and, in the process, how to build a library of your own graphics specifications that removes the chart definition details from your syntax stream.
First, let's look at the syntax for a bar chart as generated by the Chart Builder. These examples use the employee data.sav file shipped with the software.
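The generated syntax looks roughly like the following. This is a reconstruction from memory rather than a paste of Chart Builder output, so details such as the MISSING keywords may differ, but the structure is what matters here.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: jobcat=col(source(s), name("jobcat"), unit.category())
  DATA: COUNT=col(source(s), name("COUNT"))
  GUIDE: axis(dim(1), label("Employment Category"))
  GUIDE: axis(dim(2), label("Count"))
  GUIDE: text.title(label("Count of Employment Category"))
  ELEMENT: interval(position(jobcat*COUNT), shape.interior(shape.square))
END GPL.

Notice that the chart title is hard-coded inside the GPL, and macro substitution does not reach inside the BEGIN GPL block.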
If you can't use macro to generalize this, what can you do? I'll show you how to use Python programmability not only to replace macro but to build a library of chart definitions that can be shared among different syntax streams. First, let's see how we could parameterize the title of the chart. (Real applications will need to do more, but the idea is the same.) Here is the first version.
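A sketch of that first version: wrap the generated syntax in a Python program block and substitute the title with ordinary Python string formatting (BEGIN PROGRAM and spss.Submit are standard; the particular title is just an example).

BEGIN PROGRAM.
import spss
title = "Employees by Job Category"    # the parameter we want to vary
spss.Submit(r"""
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: jobcat=col(source(s), name("jobcat"), unit.category())
  DATA: COUNT=col(source(s), name("COUNT"))
  GUIDE: axis(dim(1), label("Employment Category"))
  GUIDE: axis(dim(2), label("Count"))
  GUIDE: text.title(label("%(title)s"))
  ELEMENT: interval(position(jobcat*COUNT))
END GPL.
""" % locals())
END PROGRAM.

Changing the chart title is now a one-line edit to the Python variable, and the same trick works for any other piece of the GGRAPH or GPL text.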
This mechanism requires that you install the Python Essentials available (for free) from this site.
So now we have solved the problem of generalizing the GPL, but having generalized this command, we might want to use it in other job streams. Duplicating the code is always a bad idea. Python lets us remove the code from the job stream and just refer to it. It's something like the Statistics INSERT command, but it is more flexible.
Here is the third version of the code where the GGRAPH and GPL code has been removed from the job stream.
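Here is a sketch of that arrangement. The module and function names (mychartlib, barchart) are hypothetical; the module simply holds the GGRAPH/GPL template from the previous version and substitutes the parameters before submitting it.

# mychartlib.py - saved somewhere on the Python search path.
import spss

GGRAPH_TEMPLATE = r"""
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: jobcat=col(source(s), name("jobcat"), unit.category())
  DATA: COUNT=col(source(s), name("COUNT"))
  GUIDE: text.title(label("%(title)s"))
  ELEMENT: interval(position(jobcat*COUNT))
END GPL.
"""

def barchart(title):
    # fill in the parameters and run the chart
    spss.Submit(GGRAPH_TEMPLATE % {"title": title})

The job stream then contains only this:

BEGIN PROGRAM.
import mychartlib
mychartlib.barchart(title="Employees by Job Category")
END PROGRAM.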
By putting this code inside a function, we open the door to defining many functions in this same module and selecting the one we want in the Statistics syntax stream, passing in any desired parameters.
Summarizing, by parameterizing the GPL code and moving it into a function in our library module, we have generalized the code and made it easy to maintain and share across different job streams. Although this posting is motivated by the need to parameterize GPL, these techniques can be used with any Statistics code.
There is one more thing we could do to completely hide the Python code in the job stream. We could use the SPSSINC PROGRAM extension command to provide standard Statistics code for passing the parameters and invoking the relevant function. I'll leave that for another time, but you can get that extension command from this site and read about it in the module you download.
IBM SPSS Statistics provides several mechanisms for looping in transformations or over groups within a procedure, and the new extension commands expand the looping capabilities. Using the standard capabilities, you can create loops in several ways, for example, LOOP/END LOOP and DO REPEAT in transformations, SPLIT FILE for repeating a procedure over groups, and the looping constructs of the macro facility.
What none of these methods allows you to do is to apply a set of commands to a collection of inputs and organize the entire set of output by group using regular SPSS syntax. For example, you might want to process a set of data files, run several transformations and procedures, and save all the output for each input file to a separate document. You might also want to save all the transformed datasets. These new commands address problems like this.
In order to generalize the split file idea, you first use SPSSINC SPLIT DATASET to make a directory of datasets, one per split value. The native way to do such an operation is to use the XSAVE transformation command with appropriate DO IF conditions for each group. This works well, but it has three problems. First, you have to have an exhaustive list of all the split values. Second, you have to write a lot of code. Third, the number of XSAVE commands that you can use in a single transformation block is limited. Prior to version 18, the limit was ten. For newer versions the limit is 64. So you have to count up and divide your code into separate blocks. Once you have all this working, if a new split value appears, you have to revise the code - if you notice the new value.
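For concreteness, the native approach looks like this with a hypothetical split variable region taking the values 1, 2, and 3:

DO IF (region = 1).
  XSAVE OUTFILE='c:/splits/region1.sav'.
ELSE IF (region = 2).
  XSAVE OUTFILE='c:/splits/region2.sav'.
ELSE IF (region = 3).
  XSAVE OUTFILE='c:/splits/region3.sav'.
END IF.
EXECUTE.

Every distinct value of region needs its own branch, and the whole block has to be rewritten whenever the values change.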
SPSSINC SPLIT DATASET eliminates these problems. It figures out what split values occur and generates the requisite syntax, taking into account the XSAVE limit. It lets you choose whether to name the outputs by variable values, labels, or sequential numbers, and it can produce a listing of the files created for use in later processing. No risk of unnoticed new values. And the data do not need to be sorted by the split values.
SPSSINC PROCESS FILES addresses the other side of the problem. It accepts an input specification that could be something like a file wildcard, e.g., /mydata/*.sav, or a file that lists the files to process such as produced by SPSSINC SPLIT DATASET. Then it applies the contents of a syntax file to each input, i.e., it loops over the inputs. It defines file handles and macros representing the input and output parameters for each file processed, so you can refer to these explicitly in the iterated syntax. It can write an individual Viewer output file for each input file, or it can produce a single Viewer file with all the output. These automatic files get names based on the input file names, but, of course, you can do other things in the syntax file. It can also produce a log listing all the actions taken and whether any serious errors occurred.
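To give a feel for the shape of such a job, here is a sketch. The keyword names are from memory and may not match the installed command exactly (the dialog box or the command's help shows the real ones), and the file paths are made up.

SPSSINC PROCESS FILES INPUTDATA="c:/mydata/*.sav"
  SYNTAX="c:/myjobs/monthly_report.sps"
  VIEWERFILE="c:/myoutput/"
  CONTINUEONERROR=YES.

The syntax file c:/myjobs/monthly_report.sps is ordinary Statistics syntax; it is run once per matching input file, using the file handles and macros that PROCESS FILES defines for that file.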
SPSSINC PROCESS FILES is not limited to SAV files or even data. It's up to you what you want to loop over.
SPSSINC PROCESS FILES solves the long-standing request for a way to put procedures inside loops. With these new tools, you can now easily create general transformation loops, loops over variables, procedure loops over groups, or entire job loops over arbitrary inputs whether or not you are a Python person.
These commands can be downloaded from SPSS Developer Central (www.spss.com/devcentral). They require at least IBM SPSS Statistics Version 17 and the Python programmability plug-in.
P.S. Both of these commands have dialog box interfaces as well as standard SPSS-style syntax.
I hope you will find these useful.
Transformation commands in IBM SPSS Statistics fall basically into two categories: those that are executed when the flow logic gets to them, and those that are executed immediately when read, i.e., before other transformations are executed.
Transformation commands that define variable properties such as VARIABLE LABELS, VALUE LABELS, and MISSING VALUES are generally executed as soon as they are read. That means that they may be executed out of order compared to the execution of the job stream, and they can't be controlled by transformation logic such as DO IF or DO REPEAT. So you can't define properties conditionally, and you can't iterate over them.
Usually you can ignore this behavior, but here's a case where it matters. Suppose you want to create syntax that will eliminate all the variable labels in a dataset (never mind why you might want to do this - this is just an example). The first try might be this do repeat loop.
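Something like this, where id and salary stand in for whatever the first and last variables in your file happen to be:

DO REPEAT v = id TO salary.
  VARIABLE LABELS v ''.
END REPEAT.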
But you get this puzzling error message.
Error # 4530. Command name: variable label
>This command is not allowed inside the DO REPEAT/ END REPEAT facility.
That's because metadata commands are not subject to the flow control of a loop. In addition, this is a little ugly, because you had to know the names of the first and last variables (in file order) in order even to write the loop command. So this syntax isn't general.
The second try works, but it's way too ugly to contemplate. I couldn't ever bear to write out the whole command.
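It would look something like this, with one slash-separated entry for every variable in the file (the names here are just placeholders), continuing on through the very last variable:

VARIABLE LABELS id / gender / bdate / educ / jobcat / salary.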
It works, because you can separate variable specifications with /, and if there is no label in a section, the label is empty, i.e., removed.
This is too painful to write, but it offers a clue. If we could only generate that slash-separated list automatically, we could feed it to this command. That's where SPSSINC SELECT VARIABLES comes in. This is an extension command available from the SPSS Community (www.ibm.com/developerworks/spssdevcentral if you aren't already reading this from the site) and requires the Python Essentials. This command allows you to define an SPSS macro consisting of a list of variables that meet various criteria. It could be an explicit list, but you could filter on variable type, patterns in names, custom attributes, and/or measurement level. With no filtering, it would select all the variables in the dataset. The macro text replaces the macro name reference when it is encountered in the syntax.
So we're almost there. If I run
spssinc select variables macroname = "!all". (note the quotes)
I get a list of all the variables that I could feed to the VARIABLE LABELS command. But that won't quite work, because I need those slashes to separate the items in the list. This will do the trick.
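Something along these lines, asking the command to join the selected names with a slash. The exact placement of the separator keyword is my recollection and may differ; the dialog box exposes it.

SPSSINC SELECT VARIABLES MACRONAME="!all" /OPTIONS SEPARATOR="/".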
(It is conventional to start macro names with "!" so that they won't collide with variable names or syntax constructs.)
Now all I have to do is this.
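That is, feed the macro to the label command:

VARIABLE LABELS !all.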
This gives me syntax that will work for any data file regardless of the variable names.
In order to use this command, you have to install the extension command (and the Essentials if you don't already have that), but then you can forget that it is an extension. It works just like the built-in commands. The more you can generalize your syntax, the less work you have to do, and the less the chance of errors creeping in.
A final note: this example is entirely serendipitous. I created this command for selecting subsets of the variables for automating analysis tasks. Here it is being used to select all the variables. And I thought the only useful separators between the selected variables would be a blank for ordinary variable lists and + for CTABLES expressions, but here / is required.
Sitting on my desk chair, which is an exercise ball, and reading lengthy documents is not like curling up in an armchair by the fireplace with a good book. My daughter gave me a Kindle for Christmas, so I've been experimenting with a better way to read computerized documents.
Newer versions of the Kindle can handle PDF documents. (IIRC, the original Kindle could not.) It is easy to get a PDF such as the Programming and Data Management book onto the Kindle. Download it to your computer, attach the Kindle, and copy it to the documents directory just as you would do any file copy.
Then open the document on the Kindle, get out your magnifying glass, and you are all set :-( Resizing the page to fit the Kindle screen does not produce a happy reading experience. But if you enlarge it via Zoom, you have to scroll the whole page on every line. The trick is to use the Aa key and rotate the page. You then see short pages, but you see whole lines, and, at least to my eyes, the text is reasonably readable.
Much nicer than reading on my computer screen.
The Scripting, Python plugin, and R documents are all PDFs, as is the Command Syntax Reference, so you can have your whole SPSS Statistics documentation arsenal at your fingertips. If you don't know where these files live, open them from the Help menu and use Acrobat to locate them.
I hope this is a useful tip, but I can't help you with the fireplace.
IBM SPSS Statistics Version 18 introduced a new variable property: role. The role can be Input, Target, Both, None, Split, or Partition. This new metadata comes from IBM SPSS Modeler and is useful in abstracting and generalizing jobs.
Roles are normally set by the user. Currently, these simply make initial settings in some dialog boxes. But if the roles are set correctly, it becomes possible to automate and raise the level of abstraction of repetitive tasks. For example, you might need to produce a standard set of analyses/reports across a variety of datasets that have a similar structure but vary in the exact variables they contain or other details. By abstracting the logic of a job to use roles, measurement levels, custom attributes and other variable properties, you can reduce the number of versions of a job that need to be developed and maintained. This can save time and reduce the number of errors.
I have seen customer sites where there are huge numbers of job files - syntax, templates, macros, scripts, etc - that are very similar but duplicated and modified, because the variables coming in are a little different or the coding of variables is a little different. Once you build a big set of jobs like this, making improvements or bug fixes becomes a nightmare, not to mention the extra time it takes to do things this way.
The long-standing macro facility provides some possibilities for abstraction, but it is static and can't use the metadata in a dataset. In contrast, the SPSSINC SELECT VARIABLES command allows you to define sets of variables based on the metadata rather than just a hard-coded list of names. It can use explicit names, patterns in names (all the variable names that contain AGE), measurement level, type (numeric vs string), custom attributes, and, finally, role, to define sets of variables that can be used in the job. These sets are embodied in, yes, macros. Of course, you could write your own code to use this sort of information, but SELECT VARIABLES can do a lot of this without the need to learn programmability. And it has a dialog box interface as shown here.
For example, suppose you have a mostly standard questionnaire that is used in many studies, but it has a few custom questions that vary from study to study, or some variables are sometimes omitted. You need to produce tabulations and estimate similar models for these studies. By intelligent use of the metadata, including role, you can perhaps have one master job rather than dozens. This leaves the analyst or researcher free to focus on the brain work part of the job rather than the tedious mechanical and error prone parts. If you have a data supplier who collects and prepares your datasets, you can instruct them on what roles and custom attributes should be defined. Then your analysis syntax can at least in part be based on these properties.
Custom attributes, first introduced in SPSS version 14, can also hold metadata such as questionnaire text, interviewer instructions, measurement units, or anything else that is useful in documenting your data or in programmatically manipulating it. In syntax, these can be created with the VARIABLE ATTRIBUTE command. (There is also a DATAFILE ATTRIBUTE command.) Roles can be defined with the VARIABLE ROLE command. Attributes and roles can also be defined in the Data Editor or the Define Variable Properties dialog. They all persist with the saved data. These can be used in Modeler, too.
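For example (the variable names and attribute values here are hypothetical):

VARIABLE ROLE
  /INPUT age income educ
  /TARGET churned
  /NONE custid.
VARIABLE ATTRIBUTE VARIABLES=income
  ATTRIBUTE=QuestionText('What was your total household income last year?')
  Units('USD, annual').
DATAFILE ATTRIBUTE ATTRIBUTE=Study('Customer satisfaction, wave 3').

Syntax that later selects, say, all the Input variables or all variables carrying a Units attribute can then be written once and reused across datasets.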
In summary, it's all about generalization and automation. Role is just one more attribute that can be used in this effort.
SPSSINC SELECT VARIABLES can be obtained from the SPSS Community and requires the Python Programmability plugin/essentials.
In English, we use many different words to describe the same basic objects. In one survey, researchers Dieth and Orton explored which words were used for the place where a farmer might keep his cow, depending on where the speaker resided in England. The results include words like byre, shippon, mistall, cow-stable, cow-house, cow-shed, neat-house or beast-house. We see the same situation in visualization, where a two-dimensional chart with data displayed as a collection of points, using one variable for the horizontal axis and one for the vertical, is variously called a scatterplot, a scatter diagram, a scatter graph, a 2D dotplot or even a star field.

There have been a number of attempts to form taxonomies, or categorizations, of visualizations. Most software packages for creating graphics, such as Microsoft Excel, focus on the type of graphical element used to display the data and then sub-classify from that. This has one immediate problem in that plots with multiple elements are hard to classify (should we classify a chart with bars and points as a bar chart with points added, or instead as a point chart with bars added?). Other authors have started with the dimensionality of the data (one-dimensional, two-dimensional, etc.) and used that as a basic classification criterion, but that has similar problems.
Visualizations are too numerous, too diverse and too exciting to fit well into a taxonomy that divides and subdivides. In contrast to the evolution of animals and plants, which did occur essentially in a tree-like manner, with branches splitting and sub-splitting, information visualization techniques have been invented more by a compositional approach. We take a polar coordinate system, combine it with bars, and achieve a Rose diagram. We put a network in 3D. We add texture, shape and size mappings to all the above. We split it into panels. This is why a traditional taxonomy of information visualization is doomed to be unsatisfying. It is based on a false analogy with biology and denies the basic process by which visualizations have been created: composition.
Within SPSS we have adopted a different approach: looking at charts and visualizations as a language in which we compose "parts of speech" into sentences. This approach was pioneered by Leland Wilkinson in his book The Grammar of Graphics. Consider natural language grammars. A sentence is defined by a number of elements which are connected together using simple rules. A well-formed sentence has a certain structure, but within that structure, you are free to use a wide variety of nouns, verbs, adjectives and the like. In the same way, a visualization can be defined by a collection of "parts of graphical speech", so a well-formed visualization will have a structure, but within that structure you are free to substitute a variety of different items for each part of speech. In a language, we can make nonsensical sentences that are well-formed. In the same way, under the graphical grammar, we can define visualizations that are well-formed, but also nonsensical. One reason not to ban such seeming nonsense is that you never know how language is going to change to make something meaningful. A chart that a designer might see no use for today becomes valuable in a unique situation, or for some particular data. "The tasty aged phone whistles a pink" might be meaningless, but "the sweet young thing sings the blues" is a useful statement, and grammatically similar. In our grammar-based approach, we have a set of different "parts of speech" that we compose: coordinate systems, graphical elements such as bars and points, aesthetic mappings such as texture, shape and size, guides, and faceting into panels.
The core concept behind our approach is that you should be able to take a chart and modify the language to replace one part by a similar part, and have a well-defined and potentially useful result. The result is a system where the limits of what you can display are based neither on how well you can do graphical programming, nor on how well the computer program you use has implemented a particular feature, but simply on combining well-known parts into novel systems.