When I started this experiment in blogging, I wrote that I am not a natural blogger. I am not the affable, chatty web presence who on a daily or weekly basis shares one's thoughts. I have learned since then what kind of blogger I am. My style is to write little essays that might take weeks to prepare, given the priorities of my day job. I have also found I do enjoy writing the blog as it gives me a chance to share some issue that is top of mind. So here goes:
A few weeks ago, my good colleague displayed a chart in one of his PowerPoint decks entitled something like "How to Understand Murray." The chart was an explanation of probability distributions. It was both flattering and a bit of a wakeup call. As Arthur mentioned and the readers of this blog know, much, if not all, of my writing assumes an understanding of probability and probability distributions (aka probability densities). My experience in discussions with folks from our industry is that most of them have vague memories from some stat class in college and so can generally follow the discussions, but most could use a refresher. I could simply refer readers to a good Wikipedia article, but, instead, let me given a domain-specific example.
Let's go with a topic I wrote about in a previous entry: time to ship. For explanation purpose, let's take a fictional example. Suppose you are starting a project expected to ship in 110 days. That said, we cannot be 100% certain of being ready to ship on exactly that day, no sooner, no later. In fact, it is very unlikely we will exactly hit that day. Maybe we will be ready the day before, or maybe the day before that. Since being ready on or any day before day 110 is success, we can sum up the probabilities on being ready on any of those days to get the probability we really care about. All that said, the probabilities for each of the days matter because we need them in order to get the sum. The set of probabilities for each of those days is the probability distribution or density that is our topic.
Let's look at a simple example.
In this example of a triangular distribution there is 0% probability that the product will be ready before day 91 and we are 100% certain we will be ready before day 120. We think the days become more probable and reach a peak as we approach day 110 and fall off after that. This graph then shows the probability, day by day, of being ready on exactly that day. You might notice that the peak is less than 0.07 (actually 0.06666...). This makes sense since we are assuming that the project may be complete on any of 30 different days and so the densities would be in the neighborhood of 1/30 = 0.03333. In our case, some are above and some below.
These distributions are the basis of calculating the likelihood of outcomes. The principle is very simple: The probability of being ready within some range of dates is the sum of the probabilities of being ready on exactly one of those dates, i.e. , we add up the density values for those days. As I explained above, if we want to compute the probability of being ready on or before day 110, we would add up all of the densities for days 90 to 110 to get 0.7. Using the same reasoning the probability of being ready on some day before day 120 is the sum of all the densities which comes to exactly 1.0, which was one of our going in assumptions. In fact the property that the sum of densities for all possible outcomes equals 1 is a defining property of distributions. Those who want to try this out at home could use this spreadsheet.. For example, can you find on what day, being ready on or before that day is an even bet?
For most development efforts, the overall state of the program (some would say 'health') is characterized by the shape of the distribution, This shape changes every day. Every action the team takes changes the shape. So, one key goal of development analytics would be to track the shape of the distribution throughout the lifecycle, a daunting task. More on this (probably ) in future postings.