Data visualization with R: How to get and show meaningful metrics for a scrum team
Using data visualizations as a communication tool is already well established in some fields and departments, but it's a relatively recent development in the world of software engineering. At your company, the finance department likely uses data visualizations to represent fiscal information. Take a look at the quarterly earnings report for almost any publicly traded company: it's full of charts to show revenue by quarter, year-over-year earnings, or a plethora of other historic financial data. All are designed to show lots of data points—potentially pages and pages of data points—in a single and easily digestible graphic.
Figure 1 shows a bar chart in Google's quarterly earnings report from 2007.
Figure 1. Google's quarterly earning report, 2007
Compare the bar chart in Figure 1 to a subset of the data it is based upon, in a tabular format, in Figure 2.
Figure 2. Google's quarterly earnings in tabular form
The bar chart is immensely more readable. You can clearly see by the shape that earnings are up and have been steadily going up each quarter. With the color coding, you can see the sources of the earnings. The annotations show both the precise numbers that the color coding represents and the year-to-year percentages.
With the tabular data, you have to read labels on the left, line up the data on the right with those labels, do your own aggregation and comparison, and draw your own conclusions. You have to work to take in the tabular data. In addition, there's a very real possibility that you either don't understand the data and create your own incorrect conclusions, or you tune out completely because it's too much effort to take in the information.
Disciplines other than finance use visualizations to communicate dense amounts of data. Your operations department might use charts to communicate server uptime. The customer support department uses graphs to show call volume. It's time for engineering and web development to get on board with data visualization. We have a huge amount of relevant data, and it's important for us to:
- Be aware of and use our own data to refine and improve what we do.
- Communicate to our stakeholders to demonstrate successes, validate resource needs, or plan tactical roadmaps for the coming year.
Taking advantage of data visualization
I came upon the data visualization revelation several years ago after a good bit of introspection and analysis. Most of the issues I dealt with daily could be traced back to a single root cause: communication. Bugs were a result of the lack of communication about requirements. The same goes for contention over velocity and even what features were completed in a given sprint.
By using data visualizations, I could make my working life considerably less stressful by communicating the salient data that would give the necessary context for a given situation.
I jumped right in and began creating team health reports for all of my stakeholders. The reports covered a range of topics, such as the number of bugs in our backlog over time, number of bugs per product, production incidents by product, even things like what day and time the most code check-ins occurred. (The original intent was to make sure no meetings were scheduled during that time, but the teams were most active checking in because that time was already meeting-free.)
A big part of my team health reports involved a deep dive into the defects the team created because that was one of the main performance indicators of quality of the products we built.
Data visualization is the art and practice of gathering, analyzing, and graphically representing empirical information. Data visualizations are sometimes called information graphics or just charts and graphs. Whatever you call it, the goal of visualizing data is to tell the story in the data. Telling the story is predicated on understanding the data at a deep level and gathering insight from comparisons of data points in the numbers.
I use R as both an analytics tool and as the medium to generate my data visualizations. R is an open source environment and language designed for statistical computing. You can get R at The R Project for Statistical Computing.
The time series chart
The most useful way to visualize bugs is with a time series chart. A time series chart is a graph that compares changes in values over time. These charts are generally read left to right, with the x-axis representing a measure of time and the y-axis representing the range of values. When constructing my team health report, I used time series charts to represent the state of our defect backlog over time.
Tracking defects over time lets you identify spikes in issues. You can also see larger patterns in workflows, especially when you include more granular details like bug criticality and cross-referencing data such as dates for events (for example, the start and end of iteration). You begin to expose trends, such as:
- When during an iteration do bugs get opened?
- When are most of your blocker bugs opened?
- What iterations produced the highest number of bugs?
This kind of self-evaluation and reflection allows you to identify and focus on your blind spots or areas for improvement. It also allows you to recognize victories in a larger scope that you might miss when viewing the daily numbers without context.
For instance, my organization set a group goal of achieving a certain bug number by the end of the year. The target number was a percent of the total bugs open at the beginning of the year. My peers, the management staff, and I coached the developers on the goal. We created process improvements and won hearts and minds in this goal. At the end of the year, the number of bugs that remained open was about the same as when we had started the year. We were confused and concerned. But, when I summed the daily numbers, it was clear that we had achieved something larger than we had anticipated. We actually opened 33% fewer bugs overall compared to the previous year. This was huge, and we would have easily missed it if we weren't looking at the data with a critical eye to the bigger picture.
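The year-over-year comparison described above can be sketched in a few lines of R. The counts below are hypothetical, chosen only so that the arithmetic works out to roughly a one-third reduction; they are not the author's data, and the column names are assumptions.

```r
# Hypothetical counts of newly opened bugs across two years
# (illustrative numbers only, not real data).
opened <- data.frame(
  Date  = as.Date(c("2013-03-01", "2013-07-15", "2013-11-20",
                    "2014-02-10", "2014-06-05", "2014-10-30")),
  Count = c(120, 100, 80, 90, 61, 50)
)

# Sum the counts within each calendar year.
openedByYear <- tapply(opened$Count, format(opened$Date, "%Y"), sum)

# Percent change from 2013 to 2014; a negative value means fewer bugs opened.
previous  <- as.numeric(openedByYear["2013"])
current   <- as.numeric(openedByYear["2014"])
pctChange <- (current - previous) / previous * 100

print(openedByYear)
print(round(pctChange))
```

The point of the sketch is that the end-of-year backlog alone hides this figure; it only appears once the daily numbers are summed per year.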
Gathering bug data with R
Let's walk through an example to show how you could start to visualize defects over time. Assuming that your defect data is exported to a flat file named allbugs.csv, use the read.table function to import it into a data frame named bugs, then order the bug data by date, as in Listing 1.
Listing 1. Ordering bug data by date
bugExport <- "/Applications/MAMP/htdocs/allbugs.csv"
bugs <- read.table(bugExport, header=TRUE, sep=",")
bugs <- bugs[order(as.Date(bugs$Date,"%m-%d-%Y")),]
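If you don't have an export at hand, the same steps can be tried against a few inline rows. This is a minimal sketch: the rows are made up, and the two column names (Date and Severity) are assumptions based on the fields the article uses later.

```r
# Hypothetical sample rows standing in for allbugs.csv
# (column names Date and Severity are assumed).
csvText <- "Date,Severity
01-09-2014,Blocker
01-04-2014,Minor
01-08-2014,Minor
01-04-2014,Minor
01-08-2014,Blocker"

# read.table() can parse inline text through its 'text' argument.
bugs <- read.table(text = csvText, header = TRUE, sep = ",",
                   stringsAsFactors = FALSE)

# Parse the month-day-year strings so that the sort is chronological
# rather than alphabetical.
parsedDates <- as.Date(bugs$Date, "%m-%d-%Y")
bugs <- bugs[order(parsedDates), ]

print(bugs)
```

Sorting on the parsed dates matters once the data spans more than one year, because month-first strings do not sort chronologically as plain text.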
The next step is to calculate the total bug count by date, which shows how many new bugs are opened each day. Passing bugs$Date into the table() function builds a data structure of counts for each date in the bugs data frame:
totalbugsByDate <- table(bugs$Date)
The structure of totalbugsByDate looks like Listing 2.
Listing 2. Structure of totalbugsByDate
> totalbugsByDate
2014-01-04 2014-01-08 2014-01-09 2014-01-10 2014-01-14
         2          4          5          3          1
You can plot this data to get an idea of how many bugs are opened each day. Pass the totalbugsByDate variable into R's plot function, as follows:
plot(totalbugsByDate, type="l", main="New Bugs by Date", col="red", ylab="Bugs")
This creates the chart in Figure 3, which shows how many new bugs are opened each day.
Figure 3. New bugs by date
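Plotting the table directly spaces the dates evenly, even when several days pass between them. As a possible variation (not part of the original walkthrough), you can convert the table's labels to Date objects so the x-axis reflects calendar distance. The sample dates below are hypothetical.

```r
# Hypothetical bug-open dates in month-day-year format.
bugDates <- c("01-04-2014", "01-04-2014", "01-08-2014", "01-08-2014",
              "01-08-2014", "01-08-2014", "01-09-2014", "01-14-2014")
totalbugsByDate <- table(bugDates)

# Convert the table's labels to Date objects and its counts to a plain vector.
days   <- as.Date(names(totalbugsByDate), "%m-%d-%Y")
counts <- as.vector(totalbugsByDate)

# With Date values on the x-axis, points are spaced by calendar distance.
plot(days, counts, type = "l", col = "red",
     main = "New Bugs by Date", xlab = "", ylab = "Bugs")
```

Whether even spacing or calendar spacing reads better depends on the story you want the chart to tell; calendar spacing makes quiet stretches visible.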
Now that you have a count of how many bugs are generated each day, you can get a cumulative sum using the cumsum() function. It takes the new bugs opened each day and creates a running sum, updating the total each day. This lets you generate a trend line for the cumulative count of bugs over time, as in Listing 3.
Listing 3. Trend line for count of bugs over time
> runningTotalBugs <- cumsum(totalbugsByDate)
> runningTotalBugs
01-04-2014 01-08-2014 01-09-2014 01-10-2014 01-14-2014 01-16-2014
         2          6         11         14         15         17
Creating the time series chart
Listing 3 provides exactly what's needed to plot how your bug backlog grows each day. Pass runningTotalBugs to the plot() function. Set the type to l to signify that you're creating a line chart, and name the chart Cumulative Defects Over Time. In the plot function, also turn the axes off so that you can draw custom axes with the dates as the x-axis labels.
To draw custom axes, use the axis() function. The first parameter in the axis function is a number that tells R where to draw the axis: 1 corresponds to the x-axis at the bottom of the chart, 2 corresponds to the left, 3 corresponds to the top, and 4 corresponds to the right of the chart. Listing 4 shows the code.
Listing 4. The plot and axis functions
plot(runningTotalBugs, type="l", xlab="", ylab="", pch=15, lty=1, col="red", main="Cumulative Defects Over Time", axes=FALSE)
axis(1, at=1:length(runningTotalBugs), labels=row.names(totalbugsByDate))
axis(2, las=1, at=10*0:max(runningTotalBugs))
The code in Listing 4 creates the time series chart shown in Figure 4.
Figure 4. Cumulative defects over time
The time series chart shows the progressively increasing bug backlog by date.
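As a self-contained variation on the cumulative chart, the sketch below rebuilds the running total from daily counts and lets pretty() choose the y-axis tick positions. The daily counts are chosen so that the running totals match Listing 3. Using pretty() is a substitution of mine, not the article's method.

```r
# Daily counts of new bugs, chosen so the running totals match Listing 3.
totalbugsByDate <- c("01-04-2014" = 2, "01-08-2014" = 4, "01-09-2014" = 5,
                     "01-10-2014" = 3, "01-14-2014" = 1, "01-16-2014" = 2)

# cumsum() keeps the names, so the dates remain available as labels.
runningTotalBugs <- cumsum(totalbugsByDate)

# pretty() picks a small set of round tick positions covering the data range.
yTicks <- pretty(c(0, max(runningTotalBugs)))

plot(runningTotalBugs, type = "l", col = "red", xlab = "", ylab = "",
     main = "Cumulative Defects Over Time", axes = FALSE,
     ylim = range(yTicks))
axis(1, at = seq_along(runningTotalBugs), labels = names(runningTotalBugs))
axis(2, las = 1, at = yTicks)
```

Setting ylim from the tick range keeps the line inside the labeled part of the axis, which can help when the backlog grows well beyond a few dozen bugs.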
Let's look at the criticality of the bugs, which shows not just when the bugs are opened, but when the most severe bugs are being opened. When you exported the bug data, you included the Severity field. Severity indicates the level of criticality of each bug. Organizations might have their own classification of severity, but typically there are at least the following categories:
- Blocker: Severe bugs that prevent the launch of a body of work. These are generally broken functions or missing sections of a widely used feature. They could also be discrepancies with contractually or legally binding features such as closed captioning or digital rights protection.
- Critical: Bugs that are severe but not so damaging that they gate a release. These could be broken functionality of less used features. The scope of accessibility, or how widely used a feature is, is generally a determining factor when classifying a bug as a blocker or critical.
- Minor: Bugs with minimal impact that might not even be noticeable to a user.
To break out the bugs by severity, call the table function, just as you did to break out bugs by date, but this time add the Severity column, as in Listing 5.
Listing 5. Calling the table function
bugsbySeverity <- table(factor(bugs$Date),bugs$Severity)
This creates a data structure that looks like Listing 6.
Listing 6. Data structure
           Blocker Critical Minor
01-04-2014       0        0     2
01-08-2014       1        0     3
01-09-2014       3        0     2
01-10-2014       1        0     2
01-14-2014       0        0     1
01-16-2014       1        0     1
01-22-2014       1        0     1
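To see how the two-way table relates to the earlier per-date counts, here is a small sketch using hypothetical rows that reproduce the first two dates of Listing 6. Declaring Severity as a factor with all three levels is an assumption; it is what keeps the empty Critical column in the output.

```r
# Hypothetical bug rows reproducing the first two dates of Listing 6.
bugs <- data.frame(
  Date = c("01-04-2014", "01-04-2014", "01-08-2014",
           "01-08-2014", "01-08-2014", "01-08-2014"),
  Severity = factor(c("Minor", "Minor", "Blocker", "Minor", "Minor", "Minor"),
                    levels = c("Blocker", "Critical", "Minor"))
)

# Cross-tabulate: one row per date, one column per severity level.
bugsbySeverity <- table(factor(bugs$Date), bugs$Severity)
print(bugsbySeverity)

# Row sums recover the per-date totals; column sums give totals per severity.
perDate     <- rowSums(bugsbySeverity)
perSeverity <- colSums(bugsbySeverity)
print(perDate)
print(perSeverity)
```

Checking that the row sums equal table(bugs$Date) is a quick sanity test that no bugs were dropped when the severity dimension was added.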
You can then plot this data object. Use the plot function to create a chart for one of the columns, then use the lines() function to draw lines on the chart for the remaining columns, as in Listing 7.
Listing 7. Plotting new bugs by severity and date
bugsbySeverity <- table(factor(bugs$Date), bugs$Severity)
# Draw the third column (Minor) first, then overlay the other severities.
plot(bugsbySeverity[,3], type="l", xlab="", ylab="", pch=15, lty=1, col="orange", main="New Bugs by Severity and Date", axes=FALSE)
lines(bugsbySeverity[,1], type="l", col="red", lty=1)
lines(bugsbySeverity[,2], type="l", col="yellow", lty=1)
# Label the x-axis with the dates and the y-axis with whole bug counts.
axis(1, at=1:nrow(bugsbySeverity), labels=rownames(bugsbySeverity))
axis(2, las=1, at=0:max(bugsbySeverity[,3]))
legend("topleft", inset=.01, title="Legend", colnames(bugsbySeverity), lty=c(1,1,1), col=c("red", "yellow", "orange"))
This code produces the chart in Figure 5.
Figure 5. New bugs by severity and date
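An alternative to one plot call plus several lines() calls is matplot(), which draws every column of a matrix at once and scales the y-axis to the whole matrix. The sketch below is a substitution for illustration, not the article's approach; the counts are copied from Listing 6.

```r
# Counts per date and severity, copied from the rows shown in Listing 6.
bugsbySeverity <- matrix(
  c(0, 0, 2,
    1, 0, 3,
    3, 0, 2,
    1, 0, 2,
    0, 0, 1,
    1, 0, 1,
    1, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("01-04-2014", "01-08-2014", "01-09-2014", "01-10-2014",
      "01-14-2014", "01-16-2014", "01-22-2014"),
    c("Blocker", "Critical", "Minor")
  )
)

# One color per column, in column order.
severityColors <- c("red", "yellow", "orange")

# matplot() draws one line per column and sizes the y-axis for all of them.
matplot(bugsbySeverity, type = "l", lty = 1, col = severityColors,
        xlab = "", ylab = "", main = "New Bugs by Severity and Date",
        axes = FALSE)
axis(1, at = seq_len(nrow(bugsbySeverity)), labels = rownames(bugsbySeverity))
axis(2, las = 1, at = 0:max(bugsbySeverity))
legend("topleft", inset = 0.01, title = "Legend",
       legend = colnames(bugsbySeverity), lty = 1, col = severityColors)
```

Because the y-range comes from the full matrix, a spike in blockers cannot be clipped just because the first plotted column happened to be smaller.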
The chart in Figure 5 is great, but what if you want to see the cumulative bugs by severity? Using the R code in Listing 7, instead of plotting out the columns, you can simply plot the cumulative sum of each column, as in Listing 8.
Listing 8. Plotting the cumulative sum of each column
# Plot the running total of the third column (Minor), then overlay the others.
plot(cumsum(bugsbySeverity[,3]), type="l", xlab="", ylab="", pch=15, lty=1, col="orange", main="Running Total of Bugs by Severity", axes=FALSE)
lines(cumsum(bugsbySeverity[,1]), type="l", col="red", lty=1)
lines(cumsum(bugsbySeverity[,2]), type="l", col="yellow", lty=1)
# Label the x-axis with the dates and the y-axis with whole bug counts.
axis(1, at=1:nrow(bugsbySeverity), labels=rownames(bugsbySeverity))
axis(2, las=1, at=0:max(cumsum(bugsbySeverity[,3])))
legend("topleft", inset=.01, title="Legend", colnames(bugsbySeverity), lty=c(1,1,1), col=c("red", "yellow", "orange"))
The code in Listing 8 produces the chart in Figure 6.
Figure 6. Running total of bugs by severity
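Rather than calling cumsum() on each column separately, the running totals for all severities can be computed in one step with apply(). This is a minimal sketch using the first four rows of Listing 6; the variable name runningBySeverity is hypothetical.

```r
# First four rows of Listing 6: counts per date and severity.
bugsbySeverity <- matrix(
  c(0, 0, 2,
    1, 0, 3,
    3, 0, 2,
    1, 0, 2),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("01-04-2014", "01-08-2014", "01-09-2014", "01-10-2014"),
    c("Blocker", "Critical", "Minor")
  )
)

# Apply cumsum() down each column (MARGIN = 2) to get running totals.
runningBySeverity <- apply(bugsbySeverity, 2, cumsum)
print(runningBySeverity)

# The last row of the running totals equals the overall count per severity.
finalTotals <- runningBySeverity[nrow(runningBySeverity), ]
print(finalTotals)
```

The resulting matrix has the same shape as the input, so it can be handed to the same plotting code used for the daily counts.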
This article outlined how to gain insight from studying your data and how to represent that data visually in interesting ways, by using R. Imagine the possibilities when looking at other organizational data such as production incidents or performance data. Just remember that the point is not to visualize for the sake of visualizing, but to use the charts:
- As a communication tool with your team and your stakeholders.
- To track progress.
- To highlight accomplishments and areas that need focus.