Monday 28 May 2012

Statistical Graphing with R

I have been collecting some metrics from the 8 agile teams that I work with and planned to create some nice visual reports to help identify software development practices that could be improved.

The data was initially in an Excel spreadsheet, so I used its built-in graphing capabilities which had been good enough for previous projects. I quickly found problems that made me want to look for a better solution;

  • Difficult to create multiple graphs with aligned x-axes.
  • Every time I wanted to use the same graph with different data, I’d have to either replace the data or tweak every graph.

So I tried Gnumeric, which has some nice graphing features, but its lack of pivot tables puts it out of the running. Libre Office Calc was much the same as Excel. I tried some Linux graphical plotting applications, the nicest of which was QTIPlot, which solved the common X-axis problem, but its vector output was poor.

Eventually I looked at statistical computing environments and settled on the R Project, a programming language for statistical analysis and graphing. It solved all the problems I was having with GUI based tools and in the process introduced me to a new way of explore my data.

R has a shell, so you can load some data into a variable and then start playing with it. You can easily filter data, apply matrix transformations and feed data through mapping functions.
Once you have the data in the shape you’re interested in, you can run it through one of the built-in functions, which give you lots of standard graph types, or delve into The Comprehensive R Archive Network (CRAN) which is a massive library of user contributed functions for graphing and analysis.

Once you have settled on the transformations and graphs you want to write out, you can write a script that outputs to various file formats, including PDF, SVG and Postscript. At the end of each iteration, I run my script that takes the latest data as a CSV file and outputs all the graphs I need for my report. When I think of new graphs I’d like to include, I add them to my script. It’s easy, reproducible, massively flexible and oh yes… it’s open source too.

Here’s a quick distribution graph (took maybe five minutes)…

This distribution of story estimates (in man-days) shows that;

  • Even numbers are more popular than odd numbers
  • 9, 11, 17 and 19 are never chosen, presumably due to rounding up or down to 10 and 20.
  • There is a general preference for nice small stories.

All this was created with the following R script;

data <- read.table("data.csv", header=TRUE, sep=",")

small_quotes <- data$Original.Estimate[
  data$Original.Estimate <= 576000 & 
  data$Original.Estimate > 0] / 28800

  small_quotes, col=rainbow(40,0.5),
  main=paste("Histogram of ", length(small_quotes),
    "Quote Sizes <= 20 units"),
  breaks=20, xlab="Estimate", ylab="Frequency"

Have fun setting your data free!