In English, we use many different words to describe the same basic objects. In one survey, researchers Dieth and Orton explored which words were used for the place where a farmer might keep his cow, depending on where the speaker resided in England. The results include words like byre, shippon, mistall, cow-stable, cow-house, cow-shed, neat-house or beast-house. We see the same situation in visualization, where a two-dimensional chart with data displayed as a collection of points, using one variable for the horizontal axis and one for the vertical, is variously called a scatterplot, a scatter diagram, a scatter graph, a 2D dotplot or even a star field.
There have been a number of attempts to form taxonomies, or categorizations, of visualizations. Most software packages for creating graphics, such as Microsoft Excel, focus on the type of graphical element used to display the data and then sub-classify from that. This has one immediate problem in that plots with multiple elements are hard to classify (should we classify a chart with bars and points as a bar chart with point additions, or as a point chart with bars added?). Other authors have started with the dimensionality of the data (one-dimensional, two-dimensional, etc.) and used that as a basic classification criterion, but that has similar problems.
Visualizations are too numerous, too diverse and too exciting to fit well into a taxonomy that divides and subdivides. In contrast to the evolution of animals and plants, which did occur essentially in a tree-like manner, with branches splitting and sub-splitting, information visualization techniques have been invented more by a compositional approach. We take a polar coordinate system, combine it with bars, and achieve a Rose diagram. We put a network in 3D. We add texture, shape and size mappings to all the above. We split it into panels. This is why a traditional taxonomy of information visualization is doomed to be unsatisfying. It is based on a false analogy with biology and denies the basic process by which visualizations have been created: composition.
Within SPSS we have adopted a different approach – looking at charts and visualizations as a language in which we compose “parts of speech” into sentences. This approach was pioneered by Leland Wilkinson in his book The Grammar of Graphics. Consider natural language grammars. A sentence is defined by a number of elements which are connected together using simple rules. A well-formed sentence has a certain structure, but within that structure, you are free to use a wide variety of nouns, verbs, adjectives and the like. In the same way, a visualization can be defined by a collection of “parts of graphical speech”, so a well-formed visualization will have a structure, but within that structure you are free to substitute a variety of different items for each part of speech. In a language, we can make nonsensical sentences that are well-formed. In the same way, under the graphical grammar, we can define visualizations that are well-formed, but also nonsensical. One reason not to ban such seeming nonsense is that you never know how language is going to change to make something meaningful. A chart that a designer might see no use for today becomes valuable in a unique situation, or for some particular data. “The tasty aged phone whistles a pink” might be meaningless, but “the sweet young thing sings the blues” is a useful statement, and grammatically similar. In our grammar-based approach, we have a set of different “parts of speech” that we compose:
- data – the variables that are to be used.
- coordinates – the basic system into which data will be displayed, together with any transformations of the coordinate systems, like polarization, reflection, etc.
- elements – the graphic glyphs used to represent data; points, lines, areas, …
- statistics – mathematical and statistical functions used to modify the data as it is drawn into the coordinate frame.
- aesthetics – mappings from data to graphical attributes like color, shape, size, …
- faceting – dividing up a graphic into multiple smaller graphics, also known as paneling, trellis, …
- guides – axes, legends and other items that annotate the main graphic.
- interactivity – methods for allowing users to interact with the graphics; drilldown, zooming, tooltips, …
- styles – decorations for the graphic that do not affect its basic structure, but modify the final appearance; fonts, default colors, padding and margins, …
The core concept behind our approach is that you should be able to take a chart and modify the language to replace one part by a similar part, and have a well-defined and potentially useful result. The result is a system where the limits of what you can display depend neither on how well you can do graphical programming nor on how well the computer program you use has implemented a feature, but simply on combining well-known parts into novel systems.
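To make the idea of swapping one part for a similar part concrete, here is a minimal Python sketch. It is not the SPSS implementation or its chart language: the `ChartSpec` class, its field names, and the `describe` helper are all hypothetical, invented only to illustrate composition.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical container for a few 'parts of graphical speech'."""
    data: tuple            # names of the variables used
    coordinates: str       # e.g. "rectangular" or "polar"
    element: str           # e.g. "point", "line", "bar"
    faceting: str = ""     # variable used to split into panels, if any


def describe(spec):
    """Render the spec as a short readable phrase."""
    text = (f"{spec.element} element in {spec.coordinates} coordinates "
            f"over {', '.join(spec.data)}")
    if spec.faceting:
        text += f", paneled by {spec.faceting}"
    return text


# Start from an ordinary bar chart (variable names are made up).
bar_chart = ChartSpec(data=("month", "sales"),
                      coordinates="rectangular", element="bar")

# Replace one part at a time; every result is still a well-formed spec.
rose_diagram = replace(bar_chart, coordinates="polar")
paneled_rose = replace(rose_diagram, faceting="region")

for spec in (bar_chart, rose_diagram, paneled_rose):
    print(describe(spec))
```

Each new chart comes from changing a single part of an existing one, which is the sense in which charts are composed rather than picked from a fixed catalog.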
I have heard from users that they often need to enhance the formatting of SPSS Statistics pivot tables beyond what can be done with tableLooks. Until recently there have been three ways to do this: tedious manual formatting, exporting to Excel and (tediously) formatting there, and writing a fairly complicated Basic or - starting with version 16 - Python script. Now there's a fourth way that is very powerful and much easier to use. And it doesn't require any programming knowledge.
SPSSINC MODIFY TABLES is an extension command with a custom dialog box interface that is available from Developer Central for Version 17. I'll explain some of its features with an example.
Here is a crosstab table (from the Crosstabs case study) showing customer satisfaction with different stores, controlling for whether the customer had contact with a store employee.
The original crosstab with a tableLook applied
TableLooks are good for static formatting and can be applied automatically, but they apply to whole areas of tables and are static.
The table cells contain counts and residuals. There is a story in this table that takes time to find. To bring out that story, we could highlight the large residuals. You can do that manually cell by cell, but that is tedious and error prone. The SPSSINC MODIFY TABLES command can automate this.
SPSSINC MODIFY TABLES can select cells based on the row or column label text, the indexes, and/or the cell values, and it can apply most of the formatting devices that you could use interactively, even hiding unwanted rows or columns. Here is an example that looks for the cell residuals and creates a red background if the residual is large.
SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation'
DIMENSION=ROWS SELECT='Std. Residual'
/STYLES TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0
APPLYTO="abs(x) > 2".
I would run this right after the CROSSTABS syntax. It selects tables of OMS type "Crosstabulation" in that output, looks in the rows for the label "Std. Residual", and makes the text bold and the background red (colors are given as red, green, and blue values between 0 and 255). The APPLYTO keyword specifies that this happens only if the cell value is greater than 2 in absolute value.
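The selection rule itself is simple enough to sketch in plain Python. This is only an illustration of the "greater than 2 in absolute value" test, not how the command is implemented, and the residual values and labels below are invented for the example.

```python
# Invented standardized residuals, keyed by (store, satisfaction level).
residuals = {
    ("Store 1", "Negative"): -0.4,
    ("Store 1", "Positive"): 0.3,
    ("Store 2", "Negative"): 2.6,
    ("Store 2", "Positive"): -2.3,
    ("Store 3", "Negative"): 1.1,
    ("Store 3", "Positive"): -0.9,
}


def is_large(x, threshold=2.0):
    """True when the residual exceeds the threshold in absolute value."""
    return abs(x) > threshold


# Only these cells would receive the bold text and red background.
flagged = {cell: value for cell, value in residuals.items() if is_large(value)}

for (store, level), value in flagged.items():
    print(store, level, value)
```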
Here's the result.
The large residuals are bold and red
The report has been turned into a message: there are issues with Store 2.
We can go further with this command. I'll discuss that in another post, but before I go, I want to mention that this extension command has a custom dialog box available on the Utilities menu after you install it.
You can get the command and dialog interface from the Downloads section of SPSS Developer Central (linked at the top of this site).
I would be interested in hearing about particular formatting requirements you have for tables. I'm betting that most of them can be handled by this command.
On January 6, the open source statistical language R made the New York Times.
The Times said:
For statisticians, however, R is particularly useful because it contains a number of built-in mechanisms for organizing data, running calculations on the information and creating graphical representations of data sets.
“I think it addresses a niche market for high-end data analysts that want free, readily available code,” said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
SPSS has a different attitude.
Starting with Version 16, SPSS offers a free plug-in that lets users run R code within SPSS with full access to the active SPSS Statistics data and write its output to the SPSS Statistics Viewer. With Version 17, we began creating dialog box interfaces and SPSS-style syntax for R packages we thought would be interesting to SPSS users. You can see the current list in the SPSS Developer Central Downloads section. We use the same tools for this that we make available to any user, so the R connection is completely open.
We see R as complementary more than competitive to SPSS Statistics. R is a powerful and flexible programming language for statistics, and there are many, many procedures (packages of functions) for statistics and graphics in it. But it is not an easy language to learn. It helps to know the C language, but it's still a substantial effort. (Bob Muenchen's new book R for SAS and SPSS Users from Springer is a good text.)
We see the SPSS-R connection as a way for users to take advantage of the large number of R packages without the pain part of R. And with the ability to create SPSS pivot tables in the Viewer from R programs, you can get good looking output. Since R is limited to in-memory data, SPSS can select out the data needed for an R analysis and thereby reduce the memory requirement.
As a programming language, R takes a different attitude to communication with the user than SPSS. Running a certain package in R often produces this error message.
Error in optim(rho, f, control = control, hessian = TRUE, method = "BFGS") :
initial value in 'vmmin' is not finite
That is exactly what a programmer needs to know, but it leaves the user clueless. Here is what you see in SPSS using our integration of that package.
Error: SPSSINC HETCOR command was unable to compute the correlations due to data conditions.
This is usually due to some variables being too far from a bivariate normal distribution.
This may leave the programmer clueless, but it's a big hint for the user.
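The general pattern of catching a low-level error and re-raising it with a message aimed at the analyst can be sketched in a few lines of Python. This is an assumption-laden illustration, not the actual SPSSINC HETCOR code: `fit_correlations`, `run_hetcor`, and `UserFacingError` are hypothetical names, and the failing routine is a stand-in.

```python
class UserFacingError(Exception):
    """Error whose message is written for the analyst, not the programmer."""


def fit_correlations(data):
    # Hypothetical stand-in for a low-level numerical routine that fails.
    raise ValueError("initial value in 'vmmin' is not finite")


def run_hetcor(data):
    """Call the low-level routine and translate its failure for the user."""
    try:
        return fit_correlations(data)
    except ValueError as err:
        # Keep the original error attached for anyone who needs to debug.
        raise UserFacingError(
            "Unable to compute the correlations due to data conditions. "
            "This is usually due to some variables being too far from "
            "a bivariate normal distribution."
        ) from err


try:
    run_hetcor([1.0, 2.0, 3.0])
except UserFacingError as err:
    print("Error:", err)
    print("Underlying cause:", err.__cause__)
```

Chaining with `from err` keeps the programmer's message available while the user sees the hint.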
By bringing R and SPSS together, you get the best of both worlds: a large collection of statistical and graphical tools from R with the ease of use, data handling, and output presentation of SPSS Statistics. And, by the way, there are already some user contributions of R package integrations with SPSS available on Developer Central. If you create others of general interest, we invite you to contribute them, too.
Programmability and scripting involve skills that lots of SPSS users don't have. While some users might drop everything and plunge into this new world, others have a day job. What do programmability and scripting do for them?
There will always be a core group of users who can write programs and scripts for their own use. There will be a bigger group who could run programs and scripts written by others. Still bigger is the group who know only SPSS traditional syntax, and biggest of all is the group of point and click users. Being a point and click user should not exclude anyone from the benefits of these new technologies. I don't subscribe to the school that believes that mastering syntax and programming languages should be the price of admission to statistical work.
Starting with SPSS 16 and continuing with SPSS Statistics 17, we have added solutions to let everyone benefit from the new technology. In version 16, we implemented extension commands. They let anyone create SPSS-style syntax that is implemented using programmability and scripting. Well, it's entirely open, but you do have to write a little XML to define the syntax. That's scary to many, but if you look at a few of the extension command examples on Developer Central (www.spss.com/devcentral), you'll see that it isn't that tough. Extension-command syntax written by the user gets checked by the SPSS Universal Parser and, if it conforms, it is passed to a Run function implemented in Python or R. More about the details of that another time.
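The check-then-dispatch flow can be pictured with a toy Python sketch. It is not the SPSS Universal Parser or the real extension API: the `SPEC` table, the `parse` function, and the keywords are hypothetical, and real extension commands define their syntax in XML rather than in a Python dictionary.

```python
# Hypothetical syntax definition: keyword name -> converter for its value.
SPEC = {
    "VARIABLES": lambda text: text.split(),
    "THRESHOLD": float,
}


def parse(command_text, spec):
    """Check /KEY=VALUE subcommands against the spec and convert the values."""
    args = {}
    for part in command_text.split("/"):
        part = part.strip()
        if not part:
            continue
        keyword, sep, value = part.partition("=")
        keyword = keyword.strip().upper()
        if not sep or keyword not in spec:
            raise SyntaxError(f"unrecognized subcommand: {part!r}")
        args[keyword.lower()] = spec[keyword](value.strip())
    return args


def Run(args):
    """Stand-in for the implementation; it sees only validated arguments."""
    return (f"analyzing {len(args['variables'])} variables "
            f"with threshold {args['threshold']}")


# Syntax that conforms is parsed and handed to Run.
result = Run(parse("/VARIABLES=age income /THRESHOLD=2", SPEC))
print(result)
```

The point of the split is that the implementation never sees malformed syntax; anything that does not conform is rejected before `Run` is called.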
With Version 17, we added a dead-easy dialog box tool, the Custom Dialog Builder, so an author can create SPSS-style dialog boxes to generate the requisite syntax. The syntax could be an extension command, or it could be built-in SPSS syntax, or a Python program, or an R program. Using this, an author could make his or her own version of SPSS commands, perhaps with different defaults or different options, or he or she could extend SPSS using programs and scripts. These dialogs go on the standard SPSS menus, and they can have help that is accessed from the dialog Help button.
The further an author goes in using these features, the bigger the audience that can use the author's functionality.
My favorite example of this is the extension command SPSSINC MODIFY TABLES, aka Utilities>Modify Table Appearance. This makes it easy to do complex formatting of pivot table output without writing any scripts. Until now, you had to write a script to automate fancy formatting beyond the static tableLooks. That's not in the skill set of many SPSS users, and even for those who know how, there's a lot of tedious coding involved. This new command and dialog can eliminate all that. Those scary programmability and scripting technologies make it easier to do your work - and that makes you more productive.
We give all this functionality away with the Base product. Are we crazy?
Once we decided on the strategy of embedding an existing programming language, we had to pick one.
- It had to be licensable on acceptable terms, preferably open source.
- It had to be well suited to the functionality we wanted to provide.
- It had to be easy to learn but not limiting.
- It had to meet various technical requirements for embedding.
- It had to be available on all our target platforms - both client and server.
- It should be reasonably familiar to the SPSS audience or in the domain of statistical analysis.
So we picked -- R. As it turned out, though, R didn't really meet the first three criteria. Licensing was a problem (overcome in 16 when we integrated it in an architecturally different way). It was well suited for statistical calculation, but it wasn't a good fit for interacting with and controlling SPSS, and it certainly wasn't easy to learn.
Python, though, met all of the criteria except familiarity. There is an index of popularity of programming languages called the TIOBE index. It tracks the top 100(!) languages, and Python is currently number 7 on the list. If you restrict the choices to scripting languages, it is number 3 and is classified in the A group. Furthermore, Python has been gaining popularity in the scientific and statistical computing community. There is a large open source support community around it, and there are many third-party libraries in a wide range of domains.
The other obvious choice would have been Visual Basic. VB is second among scripting languages on the TIOBE index, and it is familiar to many SPSS users through the SaxBasic scripting facility, but it isn't very cross platform, and, frankly, it just isn't a great language. But we did decide to add .NET support on Windows, particularly for Visual Studio developers, and that enables VB.NET.
My personal opinion is that Python is a truly great language. It's easy to learn, very flexible, and very expressive. It has a coherence and generality rare in languages, due to its control by one very gifted language designer working with an active and mature open source community. It has been around for almost 20 years. And there are excellent IDEs, both commercial and free.
The downside is that Python is very different from traditional SPSS syntax. They come from different motivations and purposes, and it can be jarring to go back and forth between them. I've done consulting and training, though, with many users and organizations by now, and those who have been willing to invest some effort in mastering this technology have been very successful and satisfied. With SPSS, the product, opened up in this way, users can control and extend it in ways never before possible.
Many in the SPSS user community, though, have yet to make this investment. Python is a programming language, and SPSS users tend not to be programmers. For non-programmers and those who want a single application language, we created the extension command mechanism. Next time I will write about that.
When we started, SPSS was a little ahead of the curve for once, but we think that Python has passed the tipping point and has become widely accepted.
Starting with SPSS version 14, we put a lot of effort into adding several programming languages into SPSS. We've kept this up now through four SPSS versions. What's the point? After all, SPSS already had a rich command language familiar to hordes of users, and it also had SaxBasic scripting. And why did we come up with something so unfamiliar to SPSS users? I'll sketch some of the major motivations here. Tell us what you think about what we have done by commenting on this (and future) posts.
First, though, rest assured that SPSS syntax - even the ugly macro language - and Basic scripting are not going away. In fact, in version 16, where Python scripting (as opposed to Python programs) was introduced, a huge amount of work went into reimplementing the Basic/COM scripting interfaces all over again in the new architecture. If my memory is correct, there were 310 APIs (application program interfaces) that had to be reimplemented, in addition to creating all the new Python ones.
In designing programmability back in version 14, we started down the road of enhancing the SPSS command language to add more programming features. The SPSS language lacked many important characteristics of modern programming languages, and its style was not what younger programmers expected. It worked well for statistical procedures and data transformations, but many useful things were difficult or impossible to accomplish using it. Some of these gaps could be worked around using SaxBasic, but as a front-end scripting language originally intended mainly as a way to manipulate objects in the Viewer, that was never a great solution.
It was hard to write jobs, other than transformation programs, that were truly general rather than building in a lot of assumptions about the input variable names and so on. And it was very hard for a job stream to react to results or characteristics of the data and apply logic to decide what to do next. You might want, for example, to open an arbitrary dataset, inspect the metadata such as variable measurement levels or look for patterns in the variable names, and carry out some analyses automatically based on that. Or you might want to inspect the output from a procedure such as REGRESSION and take some action if the fit is unsatisfactory, or report outlier cases to some agent for review.
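As a rough illustration of the first scenario, here is a small Python sketch that chooses a procedure for each variable from its measurement level. It is an assumption-laden example rather than production code: `plan_analyses` and the sample metadata are hypothetical, and the metadata is passed in as plain tuples so the sketch runs outside SPSS. Inside SPSS, the names and levels would typically come from the `spss` module (for example `spss.GetVariableName` and `spss.GetVariableMeasurementLevel`), and the generated commands could be run with `spss.Submit`.

```python
def plan_analyses(variables):
    """Choose a procedure for each variable from its measurement level.

    `variables` is a list of (name, level) pairs, where level is one of
    "nominal", "ordinal", or "scale".
    """
    categorical = [name for name, level in variables
                   if level in ("nominal", "ordinal")]
    scale = [name for name, level in variables if level == "scale"]

    commands = []
    if categorical:
        commands.append("FREQUENCIES VARIABLES=" + " ".join(categorical) + ".")
    if scale:
        commands.append("DESCRIPTIVES VARIABLES=" + " ".join(scale) + ".")
    return commands


# Invented metadata for an arbitrary dataset.
metadata = [("region", "nominal"), ("rating", "ordinal"), ("income", "scale")]
commands = plan_analyses(metadata)

for command in commands:
    print(command)
```

Nothing in the job names a specific variable in advance; the logic reacts to whatever dataset it is given.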
We decided after working on the SPSS command language for a while that this was the wrong approach. There were a lot of good, portable, and embeddable programming languages around already. We realized that by adopting one (or more) of those, we could make a lot more progress and offer a lot more functionality in a modern style. And it had the extra advantage that huge libraries of useful code written in those languages could be used immediately within SPSS.
We, therefore, gave up on extending the traditional syntax and put our effort into embedding these languages. That meant both accepting such code in the SPSS input stream and creating a set of apis that allowed that code to communicate with and control SPSS. Hence BEGIN PROGRAM and END PROGRAM and all that followed.
The result of all this, IMO, is the greatest leap forward that the SPSS product has taken in the last 25 years. The great thing about it is when a user asks, "Can I do ...?", the answer is almost always "Yes!" even if it isn't something we had already thought of. The downside is having to learn a programming language whose conventions and structures are wildly different from traditional SPSS syntax. Furthermore, lots of SPSS users are really not programmers and don't want to be, so they struggle to use all this new power. The extension command mechanism introduced in version 16 and extended in 17 was designed to solve that problem.
Next time I'll write about why we picked the languages we did.