A common need with SPSS Statistics is to produce some statistical results and then use them for further processing. We sometimes call that "output as input". This is very straightforward if you can do it with AGGREGATE or procedures that produce results as SPSS datasets, but it is possible to do this for anything that SPSS Statistics can put in a pivot table.
One way to retrieve output for reuse is to write a Basic or Python script. The output would be produced as pivot table(s) in the Viewer. Then the script would search through the Viewer document and find the desired table. Using something like the GetValueAt api of the Datacells class (Python) or the ValueAt property of the Data Cells object (Basic), a program can retrieve cell values. The script might be kicked off via a SCRIPT command.
This works, but it can be tricky to program, and it is roundabout and inefficient. And in distributed mode with SPSS Statistics Server, this solution is unavailable.
Back in SPSS version 12 we introduced the Output Management System (OMS), but many users have only a vague understanding of this powerful mechanism. It provides a much easier and more efficient solution for grabbing output. Combining OMS and programmability, the output could still be processed by a Python or Basic script, but it could also be retrieved by a Python or .NET program - instead of a Python script - by using the XML workspace. This allows for better synchronization, and it works in either local or distributed mode.
OMS is a listener for the output. It is not built into particular procedures. In fact, the procedure does not even know that something is listening. Rather, you start OMS listening for particular objects. When an object of interest comes along, OMS keeps a copy. When you stop the listener, the captured objects are written to memory, to a dataset, or to a file and are available to your SPSS syntax or programmability code.
You tell OMS what to listen for by selecting the types of objects, most often TABLES, and, if desired, the particular types of tables you want, such as a crosstabs table. You stop the listener with the OMSEND command. The OMS command specifies the output format - XML, SAV, HTML, text, Excel, Word, or PDF - and where to write it.
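As a concrete sketch, a listener that captures frequency tables into an SPSS data file might look like the following (the variable name and output path here are illustrative, not from any particular job):

```
OMS
  /SELECT TABLES
  /IF COMMANDS=['Frequencies'] SUBTYPES=['Frequencies']
  /DESTINATION FORMAT=SAV OUTFILE='c:/temp/freqs.sav'.
FREQUENCIES VARIABLES=jobcat.
OMSEND.
```

After OMSEND, the contents of the captured tables are ordinary data in freqs.sav, ready for further processing.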
If you write the output to a dataset, then you can activate that dataset and apply standard SPSS commands to it. You can also access the dataset with Python programmability, whether or not it is activated, using the Dataset class in the spss module.
A more general mechanism is to have OMS write to the XML workspace. This is an in-memory structure that can be read by Python or .NET code. The OMS command assigns a name to the workspace item it creates. Then the program code can retrieve all or a selected part of that item using the GetXmlUtf16 Python api. (Similarly for .NET.) You write an XPath expression to say which part of the xml you want.
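To give a flavor of the XPath step, here is a sketch against a mock XML fragment. The real OMS output schema is richer than this, and GetXmlUtf16 would supply the actual text; the element and attribute names below are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# mock stand-in for what GetXmlUtf16 might return for a table;
# the real OMS schema differs, but the XPath idea is the same
mockxml = """<pivotTable>
  <cell label="R" number="0.841"/>
  <cell label="R Square" number="0.707"/>
</pivotTable>"""

root = ET.fromstring(mockxml)
# XPath: find the cell whose label attribute is "R Square"
cell = root.find(".//cell[@label='R Square']")
rsquare = float(cell.get("number"))
```

The same pattern - locate a node by a visible label, then pull a typed attribute - is what the helper functions described below generate for you.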
XML and XPath are very powerful but can be a bit intimidating, so we have provided some Python helper functions to make it easy. In the spssaux module, which is installed with the Python plug-in, there is a function createXmlOutput that takes care of the OMS wrapper and writing to the workspace. All you give it is the command syntax you want and the identifiers for the type of table you are interested in.
Correspondingly, getValuesFromXmlWorkspace can retrieve specific information from the workspace item created by the first function. You use the visible properties of the table to determine what is to be retrieved. And then you are off to the races.
Here is an example. Let's say you want to run a regression and do something if the R Square statistic is too small. The R Square is in the Model Summary table. Here's an example of that table.
So the task is to run the regression and retrieve the second column of this table.Â Here is a little Python program to do this.Â It expects that the cars.sav data file shipped with SPSS Statistics is the active dataset.
import spss, spssaux
cmd="""REGRESSION /DEPENDENT mpg
/METHOD=ENTER horse weight."""
tag, errorlevel = spssaux.createXmlOutput(cmd,
    omsid='Regression', subtype='Model Summary')
Rsquare = spssaux.getValuesFromXmlWorkspace(tag, 'Model Summary',
    colCategory="R Square", cellAttrib="number")[0]
if Rsquare < .7:
    pass    # take whatever action is appropriate for a poor fit
Let's walk through this code.
- The cmd= line is the syntax to run to create the output we want to harvest. It could be more than one command.
- The createXmlOutput call runs the command, specifying that we are interested in the Model Summary table of the Regression command. It returns two values: a tag to use when retrieving output, and an error code, which is ignored in this example.
- The getValuesFromXmlWorkspace call uses the tag and the OMS table subtype along with specifying the part of the table we want. Looking at the example table, we see a column label that can be used for retrieval. That column will have its value stored as both text and a number in the xml, so we specify that we want the number form. The function returns a list of the things that matched. Here we take the first and only element.
We know what to retrieve just by looking at the labels in the table.Â In this example, just identifying the column is enough, but you can also specify row labels.Â Some tables are too complicated for this approach, but a great many things can be done using this simple model.Â The spssaux module also has a createDatasetOutput function that works in a similar way but creates an SPSS dataset instead of xml.Â Values would be retrieved from that dataset with the Dataset class or a cursor object.
Note that this table was not retrieved from the Viewer.Â It was captured by the OMS listener and placed in the workspace, from which the Python code extracted it.
Inside these Python functions, OMS commands and XPath expressions were generated, but you don't need to learn those technologies in order to benefit from them.
In my last post, I wrote about the productivity achievable with Python, telling the story of creating the SPSSINC TURF extension command and dialog box. Well, when the cat's away, the mice will play. This post is about scalability and optimizing the TURF algorithm.
The TURF algorithm is computationally explosive. It has to compute a number of set unions that grows very rapidly as the problem size grows and then harvest the best results. Apart from the number of cases, which affects the time required to compute a set union and the amount of memory required, the size is determined mainly by the number of variables, N, and the maximum number of variables that can be combined, called the depth - e.g., the best three variables.
If we hold the depth constant at 10, i.e., find the best combination of up to 10 variables, the number of unions required grows like this as we increase the number of variables.
Looking in another dimension, fixing the number of variables at 24 and varying the depth, the union count grows like this.
48 variables and a depth of 24 would require 156,861,290,196,829 unions!
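To make the growth concrete, the union count is just the number of combinations of 2 through depth variables drawn from the N variables (a single variable needs no union - a counting rule inferred from the figures above). A short sketch reproduces the quoted number:

```python
from math import comb

def union_count(nvars, depth):
    """Number of set unions TURF must form: every combination
    of 2 through depth variables drawn from nvars variables."""
    return sum(comb(nvars, k) for k in range(2, depth + 1))

union_count(24, 3)    # 2300 unions: a modest problem
union_count(48, 24)   # 156,861,290,196,829 - the figure quoted above
```

The sum is dominated by the largest binomial coefficients, which is why adding a few variables or one level of depth can multiply the work many times over.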
This clearly can get out of hand with what seem to be reasonable problems! I added code to precalculate and report the number of unions required and syntax to set a limit on problem size to make the user better informed, but that is not enough.
In calculating the set unions, I was careful to make the code pretty efficient. That works well. But I found that as the number of sets got into the millions, the algorithm stalled and eventually exhausted the machine memory and failed.
Some experimentation showed that the set calculations completed in a reasonable amount of time, but finding the best combinations was very slow. A little optimization of that part of the Python code sped it up by a factor of 4. But I could see that what was killing the code was my strategy of first accumulating the reach count for all the sets and then picking out the best ones. Even though each reach statistic saved added only a small object to the list, the number of such objects was coming to dominate memory and time usage.
Handling the result list had initially seemed to me to be a trivial part of the process, but it clearly is not as the size grows. So I needed to change the code to keep reach counts only for combinations that have a chance at making the final list. Each new count needs to be checked against the counts already computed: the code discards the new count if it is dominated by the others and replaces the worst count in the list if the new one is better.
Doing this efficiently requires an entirely different data structure for keeping the list of counts. A heap is a data structure that has two useful properties for this problem. First, the smallest element is always at the head of the list, and, second, elements can be added or deleted quickly while maintaining the heap property.
Python provides a heap data structure in the heapq module in the standard library. For this problem, I actually needed to keep a list of heaps, one for each different number of combinations, but I could use the heapq functions for each one. One other problem is that heap items are ordinarily single numbers, and I needed to keep something a little more complicated (the count and a list of the variables that produced it). Because Python uses dynamic ("duck") typing, I could easily create an object that acted like a number for comparison purposes but held all the information I needed. A Python motto is, "if it walks like a duck and talks like a duck, it's a duck".
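The idea can be sketched like this (the names here are invented for illustration, not the actual TURF code): a small class that compares on its reach count, kept in a fixed-size heap so only the current best k combinations ever survive.

```python
import heapq

class ReachItem:
    """Holds a reach count plus the variables that produced it,
    but compares like a number, so heapq can order it."""
    def __init__(self, count, variables):
        self.count = count
        self.variables = variables
    def __lt__(self, other):
        return self.count < other.count

def keep_best(items, k):
    """Keep only the k best reach counts; the heap's smallest
    element is always at the head, so replacement is cheap."""
    heap = []
    for item in items:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif heap[0] < item:          # better than the current worst
            heapq.heapreplace(heap, item)
    return sorted(heap, key=lambda it: it.count, reverse=True)

best = keep_best(
    [ReachItem(c, v) for c, v in
     [(50, ['a']), (80, ['a', 'b']), (65, ['c']), (90, ['a', 'c'])]],
    k=2)
# best now holds only the 90- and 80-count items
```

Because discarded items are never stored, memory use stays bounded by k no matter how many combinations stream through.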
With these changes, the result management parts of the algorithm now run in a constant and small amount of memory and the harvesting of the best combinations is very fast. A test problem that previously died after consuming 1.5GB of memory now runs in 30MB - and finishes. Of course, constructing and counting all those set unions can still take a long time, but that's the nature of the problem. I am not going to try that problem requiring 156 trillion sets.
There are several points to this story.
- You may find a need to optimize in places you didn't expect: testing and measuring are important.
- Producing an initial version and putting it out for real world use can quickly flush out the places that need work. Being able to respond quickly and outside annual product release cycles as we can on Developer Central is a great help. It can change how one builds software.
- Python provides a rich set of technology that can be tapped without the need to invent or reinvent the basics.
You can download the SPSSINC TURF module from the Files section of this site. It requires at least SPSS Statistics Version 17 and the Python plug-in. And it's free.
One of the main benefits of programmability is the ability to extend and automate SPSS Statistics capabilities. I'd like to tell you the story of a recent extension effort: the SPSSINC TURF command and dialog.
TURF analysis is Total Unduplicated Reach and Frequency. It is a common technique in market research. Suppose you have a survey about sports viewing popularity. It asks about football, soccer, baseball, basketball, hockey, and other sports. You would like to know how to reach the most viewers with no more than three sports.
You could tabulate with FREQUENCIES the positive responses to each sport. But this doesn't answer the question, because the audiences will overlap. You would like to know the highest reach of combinations of up to three sports eliminating the overlap.
Calculating the TURF requires finding the set union for all combinations up to a certain size of positive responses to the sports and then presenting the best of those combinations. That is a computationally demanding task that grows explosively as the number of questions increases, but it is conceptually simple.
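A toy illustration of that computation, with made-up response sets (the case numbers that answered positively for each sport):

```python
from itertools import combinations

# each sport maps to the set of respondents who watch it (toy data)
viewers = {
    'football':   {1, 2, 3, 4, 5},
    'soccer':     {4, 5, 6},
    'baseball':   {1, 2, 6, 7},
    'basketball': {7, 8},
}

def best_reach(viewers, max_size):
    """Best unduplicated reach over all combinations of up to
    max_size sports: the size of the union of their viewer sets."""
    best = (0, ())
    for k in range(1, max_size + 1):
        for combo in combinations(viewers, k):
            reach = len(set().union(*(viewers[s] for s in combo)))
            best = max(best, (reach, combo))
    return best

reach, combo = best_reach(viewers, 3)
# three well-chosen sports cover all 8 respondents here
```

The inner loop is exactly the combinatorial explosion discussed later: the number of unions grows with both the number of questions and the maximum combination size.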
SPSS Statistics does not have a built-in way to do this, so I set out to create an extension command implemented in Python for it: SPSSINC TURF. First, I decided to work with transposed data and the built-in set algebra capabilities of Python. I pass the question dataset and create a set for each question listing the case numbers that have positive responses. That's just a few lines of code.
The trickier part was figuring out how to manage all the set union calculations. It's a set of tree structures for which a little bit of recursion boils the work down to a few lines of code. My first try was getting clumsy, so I went out for a bike ride for a few hours and came back with the algorithm worked out in my head. I believe in the left-brain, right-brain approach: study something intensively; then relax or do something different, and things are much clearer when you return to the subject.
Putting this together, I finished the code, but I was worried that this task would be so computationally demanding that it would be too slow to be useful. As it turned out, though, the approach I took, heavily leveraging Python sets and some other features, runs amazingly fast. And although the sets have to fit in memory, it seems to handle pretty large problems.
I went on to create a dialog box interface using the Version 17 Custom Dialog Builder and extension command syntax using the extension mechanism, which requires a small xml file to define the syntax and uses our extension.py module to handle that interface.
So, what sort of effort did this take? Less than one day, including the bike ride. How much more productive could you be? Taking advantage of the combination of Python and SPSS together along with the CDB and other tools reduced this task to about 225 lines of code plus the dialog and xml.
I posted this to SPSS Developer Central, where it can be downloaded for free. It is written for SPSS Statistics 17, but it will work with version 16 (not including the dialog) with a small change documented in the readme file. One competing product that does this as a main feature sells for a 4-figure price.
The original version posted had a subset of the features I had thought about doing. I wanted to see what interest there might be. Within a few days I had received and implemented a few enhancement requests. By getting the first version out to the world, it was easier to see what additional features users might want. Again, higher productivity by not implementing things that would probably not be used. But maybe I'll do more later.
This experience is typical of many programmability projects, in my experience. Big results for small amounts of work. Of course, I've done this a lot, so I know all the tools and how to approach a problem. Programmability definitely requires an investment in learning the technology, but it's hard to beat the ROI.
In my last blog post I shared some templates that added functionality -- specifically, maps. That is one use of templates; allowing custom features or new and relatively 'untested' features to be used without needing a lot of new user controls, syntax or whatever. A second use is more prosaic, but can be a real time-saver: custom styles.
Styled Mosaic-like Plot
The figure above (click to show full-size version) is an example of how a template can contain not only structural information, but also style choices. The range of possible style options available through templates is wider than that available through standard GUI choices, and the above figure shows some gradient details, as well as re-arranging the legend to be in the middle of the chart and making room for it there.
If you download the template from the zip file linked at the bottom of this post, you can use the following SPSS 17 syntax to use it:
For your data, all you need do is change the data mappings (make sure the BinaryGroup variable really is a categorical variable with two values -- a flag, in other words). Enjoy.
As I mentioned last time, some production users spend a lot of time editing and formatting the text in the Viewer outline and object titles. This was a surprise, but I suppose it shouldn't have been. If the Viewer contents are to be read as a document - whether in native format or as exported to PDF or other formats - these items are important.
We think about tables and charts so much that it is easy to forget the outline and titles. So I created the extension command SPSSINC MODIFY OUTPUT for Version 17 to simplify automating this task. It comes with a dialog box interface that appears on the Utilities menu. The command lets you do these kinds of things:
- Select the items to operate on by their type (headings, titles, etc) or OMS subtype (tables) and the text of the outline or item title
- Change the text, incorporating the old text or replacing it
- Apply html or rtf formatting (right-hand pane only, not all object types)
- Sequence number the items using numbers, letters, or roman numerals
- Hide selected items
- Insert page breaks
- Apply a custom Python function to selected items.
The goal of this command and SPSSINC MODIFY TABLES and TEXT is to clean up the output and make it into a presentation document in an automated way without forcing you off to Excel or another application to do this.
Titles and headings can be plain text or well-formed simple html or rtf.
The dialog box for this command looks like this.
This dialog generates the SPSSINC MODIFY OUTPUT command.
It's not the most beautiful dialog I've ever built, but it was done pretty easily with the Custom Dialog Builder and offers most of the functionality available in syntax. When you select objects based on their text, the text can be matched by literal equality, its start, its end, or a regular expression. The replacement text can incorporate the matched text, possibly with added formatting, or replace it entirely, and it can position a sequence number. For example, the following replacement text might be used.
What does that mean? The html directives say to make "\\1", whatever that is, italic and to put "\\0" in front of it as plain text. \\1 stands for the original text of the item. In this example we are also numbering the items, and \\0 refers to the current sequence number. If the original title is "Means", this specification might produce
The command made the text italic and prefixed a sequence number.
In this case I chose upper case roman numerals for the sequence number style. I could have chosen lower case roman numerals, upper- or lower-case letters, or just numbers.
The details of what you can do with various kinds of output objects can be found in the dialog box help. A little experimentation will go a long way, too.
If you are working interactively, it's probably not worth the trouble to use this command. You can do most of these actions interactively, but if you are building production jobs, automation is critical, and this command can help to eliminate the drudgery and error-prone editing that might otherwise have to be done by hand.
Since all these changes take place downstream from the output seen by OMS (the Output Management System), any OMS captures will not reflect them even with PDF and the other document formats now available with OMS. But now that we have the OUTPUT EXPORT command, you can create your output, apply the formatting and hiding actions available with this set of commands, and then use OUTPUT EXPORT to export the visible items to PDF and other formats.
Implementing this feature makes heavy use of the extension command mechanism begun in Version 16, and these features from Version 17: the integration of Python programmability and Python scripting, and the Custom Dialog Builder. MODIFY TABLES and MODIFY OUTPUT were not the easiest features to create, but using them can save you a lot of work and eliminate a large percentage of the situations where you needed to write a script.
You can download this extension command, with dialog box interface, from Developer Central. It needs the Python Plug In and at least Version 17.
US States Average Summer Temperature
For the VizML visualization system used in SPSS products, maps are simply another element that can be used within the grammatical formulation.
Although most people consider a map a very different entity from a bar chart, all that really differs between a bar chart and a map of areas like the one included here is that instead of representing a row of data by a bar, we use a polygon (or set of polygons) on a map. Otherwise their properties ought to be the same -- we can apply color, patterns, labels, transparency. We can set a summary statistic when there are multiple values for each polygon to reflect min, max, mean, median, range, or any of the regular sets of items. We can flip, transpose and panel the charts. Essentially, from the grammatical point of view, if you can do it to a bar chart, you can do it to a map. The only limitation is that whereas the sizes of the bars can be set or determined by data, the map polygons cannot, so setting sizes on the map polygons has no effect.
The above chart can be created within SPSS Statistics and Clementine, using the Graphboard Template feature. The template was created in Viz Designer, but you don't need that tool simply to use a template. Following is a chart showing cell-phone ownership on a per-capita basis for countries throughout the world.
At the end of this post is a zip file containing the templates used to build these two charts. Unzip it on your local machine and then, within the Graphboard Template Chooser dialog, click the "Manage" button to import them. Do that once from any application, and they will be installed and ready to use in all your graphboard-enabled SPSS applications. All the templates only need a variable with the names in them (Illinois, Alaska, etc. or Germany, France, etc.) -- color, labels and transparency can be attached using the optional dialog.
If you have a copy of Viz Designer, you can modify and enhance these templates in many ways. If you don't and you're an XML whiz, you could even try opening the templates in your favorite XML editor and modifying them directly. The worst that'll happen is a useless template that'll fail to load, or draw strange results. Let us know how you get on if you try that method out ...
Cell Phone Ownership Per Person
And here are the templates, ready to download and install ...
"Pretty" understates what formatting is really about. Making an object more pleasing to the eye is nice, but the point is really about communicating better with the reader of the output. Formatting the output appropriately can improve that process. In previous posts, I have written about ways to format pivot tables: the most important textual output from SPSS. In this post, I will discuss a way to create and format text blocks.
Inexplicably, SPSS has never had a good, built-in, production way of writing a block of text to the Viewer. Syntax can be echoed to a log file, including the COMMENT command, but logs are full of other things and don't work well for communicating with readers, especially if they are not conversant with SPSS syntax. You can, of course, interactively insert a text block in the Viewer, but that's a non-starter for production work. There are scripting apis for text blocks, too, but they require a knowledge of script programming.
For this reason, I created the TEXT extension command. (Recall that extension commands are user-defined commands that can be installed and are then available like regular commands.) TEXT takes a list of literals and creates a new text block object in the Viewer with each literal as a line. You can control the outline text and title information for the block as well. That works with Version 16 or later. With Version 17, though, you can present this text better by adding formatting.
To do that, all you have to do is to create the text as html or rtf and feed it to the TEXT command. Here's an example.
TEXT "<html>The following table is "
"<font size=6><b><i>confidential</i></b> </font>.<br> "
"It must not be shared outside the company</html>"
Here's what the resulting text block looks like.
A Formatted Text Block
The principle is that if the text looks like html or rtf, it will be treated as formatted text; otherwise, it will be considered to be plain. For html, this means starting with <html> in lower case. This example specifies the "confidential" text to be size 6 (html sizes go from 1 to 7), bold, and italic. It also inserts a line break. A web search on html will produce many pages to explain the html markup language. One good introduction is Introduction to Html.
The text in this example is split into several strings. Because it is formatted, the strings are joined without any added spaces or line breaks. Html directives such as <br> would be used to break lines if needed.
Basic html is simple to write, and there are plenty of free html editors you can use, or just use a plain text editor, but keep the html simple. Only basic formatting is supported. Don't use Microsoft Word to create the html: it creates very complex and hard-to-process html. For rtf, you might use a program like WordPad to create the rtf and then copy it into your syntax, but rtf is quite complex, and it makes your syntax very hard to read, so I recommend sticking with html.
The TEXT extension command can be downloaded from SPSS Developer Central. It requires the Python programmability Plug-In. The same download works for Versions 16 and 17, but with Version 16, only plain text can be inserted.
Formatting the Outline and Titles
Rather to my surprise, I learned that some users spend a lot of time structuring and formatting the Viewer outline and headings and titles. I created the extension command SPSSINC MODIFY OUTPUT to eliminate the complexity and drudgery in that work.
It's not well known, but title and heading objects in the Viewer can be formatted with html or rtf in the same way as I just described for text blocks: if the text looks like html or rtf, it is treated that way. The text generated by SPSS procedures, though, doesn't include such formatting. SPSSINC MODIFY OUTPUT can upgrade the text, and it can do other things to output objects. More on that in my next post.
In a previous post, I discussed the SPSSINC MODIFY TABLES command and dialog that can easily automate pivot table formatting. I skipped over things like making totals bold, because they are a piece of cake. (But just in case you were wondering, here's an example for totals in rows of a Custom Table. It also makes the background yellow.)
SPSSINC MODIFY TABLES SUBTYPE="Custom Table"
SELECT = "Total" DIMENSION=ROWS
/STYLES BACKGROUNDCOLOR=255 255 88 TEXTSTYLE = BOLD.
But there are some things that are a little harder.
For example, if you are old enough to remember the mainframe, you might want every other row green (or maybe to switch between green and white after every three rows). And a user recently requested a way to suppress all the sections except the last of each Regression output table when doing stepping. For the first request, you could just list all the odd-numbered rows in the SELECT statement. But that's tedious, error prone, and you might encounter a table with more rows than you listed. For the second request, you need to know which output block is the last in order to hide the earlier rows.
SPSSINC MODIFY TABLES, and the companion command SPSSINC MODIFY OUTPUT accommodate this sort of thing with custom plug-in functions. Prior to Version 17, you could have written a fairly long program to do a task like this, but now you need only a few lines of Python code. A custom plug-in function gets called for all the selected rows or columns in the command, and all it has to do is to apply its logic to the cells in question. The customstylefunctions.py file installed with MODIFY TABLES has a number of little functions like this that are both examples and sometimes useful along with details on how to write such functions. Here's the code for striping odd-numbered rows. Blue is the new green, so it actually sets a blue background.
def stripeOddRows(obj, i, j, numrows, numcols, section, more):
    """Color background of odd number rows for data and labels"""
    if i % 2 == 1:
        obj.SetBackgroundColorAt(i, j, RGB((0, 0, 200)))
The logic of this code is simple: is this an odd-numbered row? If so, set its background to blue. "%" is the modulus operator in Python, so i % 2 is 1 for every other row. obj is the pivot table object, and i and j are the row and column numbers.
To use it with a Custom Tables table, you could write
SPSSINC MODIFY TABLES SUBTYPE="Custom Table"
SELECT = "<<ALL>>" DIMENSION=ROWS
/STYLES CUSTOMFUNCTION="customstylefunctions.stripeOddRows".
Okay, some of you think the new green is red. You could change the color specification above to RGB((200,0,0)), but then you've got another function to maintain, and soon someone is going to ask for yellow. The solution to this is to give the function some color parameters, and let the user choose when he or she writes the syntax. Less work for the author, and more flexibility for the user.
Here's the code for that function. It's only a little more complicated.
def stripeOddRows2(obj, i, j, numrows, numcols, section, more, custom):
    """stripe odd rows with color parameters
    extra parameters are r, g, b"""
    if i % 2 == 1:
        # retrieve the three parameters with defaults and calculate
        # the color value the first time; add it to the dictionary
        if custom.get("_first", True):
            custom["_color"] = RGB((custom.get('r', 0), custom.get('g', 0),
                custom.get('b', 200)))
            custom["_first"] = False
        obj.SetBackgroundColorAt(i, j, custom["_color"])
Notice the extra parameter named custom. It gets passed a Python dictionary with the user-specified parameters in it. The defaults, set with get, are 0,0,200 - blue, in case the user didn't supply them.
This time the user specifies these in the CUSTOMFUNCTION keyword. For example,
CUSTOMFUNCTION="customstylefunctions.stripeOddRows2(r=255, g=255, b=88)"
The function code calculates and stores the RGB value the first time it is called in the command. In this case, doing the calculation every time would not take any noticeable time, but there might be a saving in a more complicated example.
The point of this is that MODIFY TABLES, by providing this plug-in capability, can greatly extend its functionality while eliminating most of the code you would have had to write before. And it doesn't have to be formatting code. You could do anything you want with the table contents.
In English, we use many different words to describe the same basic objects. In one survey, researchers Dieth and Orton explored which words were used for the place where a farmer might keep his cow, depending on where the speaker resided in England. The results include words like byre, shippon, mistall, cow-stable, cow-house, cow-shed, neat-house or beast-house. We see the same situation in visualization, where a two-dimensional chart with data displayed as a collection of points, using one variable for the horizontal axis and one for the vertical, is variously called a scatterplot, a scatter diagram, a scatter graph, a 2D dotplot or even a star field.
There have been a number of attempts to form taxonomies, or categorizations, of visualizations. Most software packages for creating graphics, such as Microsoft Excel, focus on the type of graphical element used to display the data and then sub-classify from that. This has one immediate problem in that plots with multiple elements are hard to classify (should we classify a chart with bars and points as a bar chart with point additions, or instead as a point chart with bars added?). Other authors have started with the dimensionality of the data (one-dimensional, two-dimensional, etc.) and used that as a basic classification criterion, but that has similar problems.
Visualizations are too numerous, too diverse and too exciting to fit well into a taxonomy that divides and subdivides. In contrast to the evolution of animals and plants, which did occur essentially in a tree-like manner, with branches splitting and sub-splitting, information visualization techniques have been invented more by a compositional approach. We take a polar coordinate system, combine it with bars, and achieve a Rose diagram. We put a network in 3D. We add texture, shape and size mappings to all the above. We split it into panels. This is why a traditional taxonomy of information visualization is doomed to be unsatisfying. It is based on a false analogy with biology and denies the basic process by which visualizations have been created: composition.
Within SPSS we have adopted a different approach: looking at charts and visualizations as a language in which we compose "parts of speech" into sentences. This approach was pioneered by Leland Wilkinson in his book The Grammar of Graphics. Consider natural language grammars. A sentence is defined by a number of elements which are connected together using simple rules. A well-formed sentence has a certain structure, but within that structure, you are free to use a wide variety of nouns, verbs, adjectives and the like. In the same way, a visualization can be defined by a collection of "parts of graphical speech", so a well-formed visualization will have a structure, but within that structure you are free to substitute a variety of different items for each part of speech. In a language, we can make nonsensical sentences that are well-formed. In the same way, under the graphical grammar, we can define visualizations that are well-formed, but also nonsensical. One reason not to ban such seeming nonsense is that you never know how language is going to change to make something meaningful. A chart that a designer might see no use for today becomes valuable in a unique situation, or for some particular data. "The tasty aged phone whistles a pink" might be meaningless, but "the sweet young thing sings the blues" is a useful statement, and grammatically similar. In our grammar-based approach, we have a set of different "parts of speech" that we compose:
- data – the variables that are to be used.
- coordinates – the basic system in which data will be displayed, together with any transformations of the coordinate system, like polarization, reflection, etc.
- elements – the graphic glyphs used to represent data: points, lines, areas, …
- statistics – mathematical and statistical functions used to modify the data as it is drawn into the coordinate frame.
- aesthetics – mappings from data to graphical attributes like color, shape, size, …
- faceting – dividing a graphic into multiple smaller graphics, also known as paneling, trellising, …
- guides – axes, legends and other items that annotate the main graphic.
- interactivity – methods for allowing users to interact with the graphic: drill-down, zooming, tooltips, …
- styles – decorations that do not affect the graphic's basic structure but modify its final appearance: fonts, default colors, padding and margins, …
The core concept behind our approach is that you should be able to take a chart, modify the language to replace one part with a similar part, and get a well-defined and potentially useful result. The outcome is a system where the limits of what you can display are based neither on how well you can do graphical programming nor on how well the computer program you use has implemented a feature, but simply on combining well-known parts into novel systems.
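As a concrete sketch of this composition, here is what a simple bar chart looks like in GPL, the graphics language behind SPSS's GGRAPH command (the variable names region and sales are invented for the example). Each statement supplies one "part of speech": data, coordinates, guides, and an element.

GGRAPH
  /GRAPHDATASET NAME="gdata" VARIABLES=region sales
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s = userSource(id("gdata"))
  DATA: region = col(source(s), name("region"), unit.category())
  DATA: sales = col(source(s), name("sales"))
  COORD: rect(dim(1,2))
  GUIDE: axis(dim(1), label("Region"))
  GUIDE: axis(dim(2), label("Sales"))
  ELEMENT: interval(position(region*sales))
END GPL.

Replacing the COORD line with COORD: polar() turns the same specification into the Rose diagram mentioned above, with nothing else changed: exactly the substitution of one part of speech for another.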
I have heard from users that they often need to enhance the formatting of SPSS Statistics pivot tables beyond what can be done with tableLooks. Until recently there have been three ways to do this: tedious manual formatting, exporting to Excel and (tediously) formatting there, and writing a fairly complicated Basic or - starting with version 16 - Python script. Now there's a fourth way that is very powerful and much easier to use. And it doesn't require any programming knowledge.
SPSSINC MODIFY TABLES is an extension command with a custom dialog box interface that is available from Developer Central for Version 17. I'll explain some of its features with an example.
Here is a crosstab table (from the Crosstabs case study) showing customer satisfaction with different stores, controlling for whether or not customers had contact with a store employee.
The original crosstab with a tableLook applied
TableLooks are good for standard formatting and can be applied automatically, but they apply to whole areas of tables and are static: they cannot respond to the values in individual cells.
The table cells contain counts and residuals. There is a story in this table, but it takes time to find. To bring it out, we could highlight the large residuals. You can do that manually, cell by cell, but that is tedious and error-prone. The SPSSINC MODIFY TABLES command can automate it.
SPSSINC MODIFY TABLES can select cells based on the row or column label text, the indexes, and/or the cell values, and it can apply most of the formatting devices that you could use interactively, even hiding unwanted rows or columns. Here is an example that looks for the cell residuals and creates a red background if the residual is large.
SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation'
DIMENSION=ROWS SELECT='Std. Residual'
/STYLES APPLYTO="abs(x) > 2" TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0.
I would run this right after the CROSSTABS syntax. It selects tables with OMS subtype "Crosstabulation" in that output, looks in the rows for the label "Std. Residual", and makes the text bold and the background red (colors are specified as Red Green Blue values between 0 and 255). The APPLYTO keyword restricts the styling to cells whose value is greater than 2 in absolute value.
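For orientation, the kind of CROSSTABS run that produces this table might look like the following; the variable names satisfaction, store, and contact are placeholders standing in for the case-study variables. The SRESID keyword is what adds the "Std. Residual" rows that the SELECT keyword above matches.

CROSSTABS
  /TABLES=satisfaction BY store BY contact
  /CELLS=COUNT SRESID.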
Here's the result.
The large residuals are bold and red
The report has been turned into a message: there are issues with Store 2.
We can go further with this command. I'll discuss that in another post, but before I go, I want to mention that this extension command has a custom dialog box available on the Utilities menu after you install it.
You can get the command and dialog interface from the Downloads section of SPSS Developer Central (linked at the top of this site).
I would be interested in hearing about particular formatting requirements you have for tables. I'm betting that most of them can be handled by this command.