How to Add Clustering to a Graph (GPL)

Clustering involves changes to the COORD statement and the ELEMENT statement. The following steps use the GPL shown in The Basics (GPL) as a "baseline" for the changes.

  1. Before modifying the COORD and ELEMENT statements, you need to define an additional categorical variable that will be used for clustering. This is specified by a DATA statement (note the unit.category() function):
    DATA: gender=col(source(s), name("gender"), unit.category())
  2. Now you will modify the COORD statement. If, like the baseline graph, the GPL does not already include a COORD statement, you first need to add one:
    COORD: rect(dim(1,2))

    This makes the default coordinate system explicit.

  3. Next add the cluster function to the coordinate system and specify the clustering dimension. In a 2-D coordinate system, this is the third dimension:
    COORD: rect(dim(1,2), cluster(3))
  4. Now add the clustering variable to the algebra in the ELEMENT statement. The variable goes in the third position, which corresponds to the clustering dimension specified by the cluster function in the COORD statement:
    ELEMENT: interval(position(summary.mean(jobcat*salary*gender)))

    Note that this algebra looks similar to the algebra for faceting. Without the cluster function added in the previous step, the resulting graph would be faceted. The cluster function essentially collapses the faceting into one axis. Instead of a facet for each gender category, there is a cluster on the x axis for each category.

  5. Because clustering changes the dimensions, update the GUIDE statement for the axis so that it refers to the clustering dimension, dim(3):
    GUIDE: axis(dim(3), label("Gender"))
  6. With these changes, the chart is clustered, but there is no way to distinguish the bars in each cluster. You need to add an aesthetic to distinguish the bars:
    ELEMENT: interval(position(summary.mean(jobcat*salary*gender)), color(jobcat))

The complete GPL, embedded in the GGRAPH command that supplies the data, looks like the following.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat gender salary
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: jobcat=col(source(s), name("jobcat"), unit.category())
  DATA: gender=col(source(s), name("gender"), unit.category())
  DATA: salary=col(source(s), name("salary"))
  COORD: rect(dim(1,2), cluster(3))
  SCALE: linear(dim(2), include(0))
  GUIDE: axis(dim(2), label("Mean Salary"))
  GUIDE: axis(dim(3), label("Gender"))
  ELEMENT: interval(position(summary.mean(jobcat*salary*gender)), color(jobcat))
END GPL.

Following is the graph created from the GPL.

Figure 1. Clustered bar chart
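
For comparison with the note in step 4, the following sketch shows the faceted counterpart of the chart. It is the same GGRAPH block with the cluster function removed from the COORD statement and the color aesthetic dropped. It assumes the same active dataset with the jobcat, gender, and salary variables. The dim(1) axis label is an illustrative choice, and the COMMENT statement is an annotation only.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat gender salary
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: jobcat=col(source(s), name("jobcat"), unit.category())
  DATA: gender=col(source(s), name("gender"), unit.category())
  DATA: salary=col(source(s), name("salary"))
  COMMENT: No cluster function so the third algebra position is a facet
  COORD: rect(dim(1,2))
  SCALE: linear(dim(2), include(0))
  GUIDE: axis(dim(1), label("Job Category"))
  GUIDE: axis(dim(2), label("Mean Salary"))
  GUIDE: axis(dim(3), label("Gender"))
  ELEMENT: interval(position(summary.mean(jobcat*salary*gender)))
END GPL.

This should produce one panel per gender category, with a bar for each job category inside each panel, rather than one cluster per gender category on a single x axis.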

Legend Label

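The graph includes a legend, but it has no label by default. To label the legend, add a GUIDE statement for the color aesthetic. Because color is mapped to jobcat in the ELEMENT statement, the label should describe the job category:

GUIDE: legend(aesthetic(aesthetic.color), label("Job Category"))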
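
As a variation, the roles of the two categorical variables can be swapped so that the bars are clustered by job category and colored by gender. The sketch below assumes the same dataset and variable names as the complete GPL above. In it, jobcat moves to the third position of the algebra (the clustering dimension), and gender becomes both the first dimension and the color aesthetic, so a legend label of "Gender" is appropriate. The label strings are illustrative, and the COMMENT statement is an annotation only.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=jobcat gender salary
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: jobcat=col(source(s), name("jobcat"), unit.category())
  DATA: gender=col(source(s), name("gender"), unit.category())
  DATA: salary=col(source(s), name("salary"))
  COORD: rect(dim(1,2), cluster(3))
  SCALE: linear(dim(2), include(0))
  GUIDE: axis(dim(2), label("Mean Salary"))
  COMMENT: jobcat is now the clustering dimension
  GUIDE: axis(dim(3), label("Job Category"))
  GUIDE: legend(aesthetic(aesthetic.color), label("Gender"))
  ELEMENT: interval(position(summary.mean(gender*salary*jobcat)), color(gender))
END GPL.

Which variable goes in the third position is a presentation choice: the clustering variable defines the groups along the x axis, while the color variable distinguishes the bars within each group.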