Until now, this blog has been a series of essays on the theoretical considerations underlying the analytics of development. With this entry, I want to start shifting the emphasis to the practicalities of building analytic tools. Going from theory to practice raises all kinds of issues: data content and formats, robustness of algorithms, reinforcing agile practices, and so on. To start that discussion, let's begin with an epic on how an analytic tool for agile teams might work:
A lead of an agile team, call her Shirley, has been asked to deliver a mobile application, with a specified set of features, in time for the next world games, which is one year away. Understanding that the future is uncertain, Shirley treats the time to complete as a random variable. Before committing to the project, she needs an initial distribution of the time to complete the project. With such a distribution, she has a view of the probability of achieving the goal. That probability is the area under the distribution curve that lies to the left of the target date in Figure 1.
Figure 1. Probability distribution of delivering Shirley’s mobile app project
Fortunately she has a tool called 'ARaVar' to help her build and maintain this distribution. This tool is federated with her OSLC agile project environment, Agilista (a fictional product). To use ARaVar, the team estimates the level of effort required for each feature using planning poker. In particular, for each feature’s level of effort the leadership team agrees on three values to enter in Agilista:
- The low (best case) – Assumes all the stars align and the feature comes together easily to meet requirements.
- The high (worst case) – Assumes Mr. Murphy S. Law and Ms. May Hem unexpectedly join the team and inject unforeseen challenges and obstacles.
- The nominal (most likely) – Assumes level of effort has the expected mix of good fortune and bad luck.
Behind the scenes, ARaVar finds these inputs in Agilista and uses them to define triangular probability distributions. In particular, ARaVar interprets these effort inputs as saying:
- There is zero probability that the level of effort will be less than the best case.
- There is zero probability that the level of effort will be greater than the worst case.
- The level of effort is most probable at the expected (most likely) case.
So ARaVar sets the distributions to be zero below the low value and above the high value, with a peak at the expected case. Figure 2 shows the resulting triangular distribution, which is zero at the low and high values and has its peak (at the expected case) set so that the total area under the distribution is one.
Figure 2: Typical triangular distribution for each feature.
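As a minimal sketch of the construction just described (not ARaVar's actual code, which is not shown here), the density can be written directly from the three inputs. The function names and the sample values 3, 5, and 13 are assumptions made for illustration.

```python
def triangular_pdf(x, low, mode, high):
    """Density of a triangular distribution on [low, high] that peaks at mode."""
    if not (low <= mode <= high) or low == high:
        raise ValueError("need low <= mode <= high and low < high")
    if x < low or x > high:
        return 0.0                      # zero probability outside best/worst case
    peak = 2.0 / (high - low)           # the height that makes the total area 1
    if x < mode:
        return peak * (x - low) / (mode - low)
    if x > mode:
        return peak * (high - x) / (high - mode)
    return peak


def area(low, mode, high, steps=100_000):
    """Midpoint-rule check that the density integrates to about 1."""
    width = (high - low) / steps
    return width * sum(
        triangular_pdf(low + (i + 0.5) * width, low, mode, high)
        for i in range(steps)
    )


# Hypothetical feature estimate: best case 3, most likely 5, worst case 13.
print(triangular_pdf(5, 3, 5, 13), area(3, 5, 13))
```

The peak height 2 / (high - low) follows from the area of a triangle, so the only inputs needed are the three planning-poker values.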
In the parlance of Bayesian reasoning, this technique provides the subject matter experts a means of arriving at an honest prior, based on current information and informed belief. If the difference between the low and high of the distribution of a feature is large, then the team is expressing its uncertainty about the effort required to deliver the feature. This gives Shirley the opportunity to focus the team on resolving the uncertainties early, progressively de-risking the project.
With this prior estimate in place, Shirley has an idea of how likely it is she can make the commitment and she negotiates the content. What-if analysis in ARaVar provides her with the capability to compute the impact of adding, changing, or dropping one or more features from the program. Luckily, she does find that one of the relatively uncertain features is more of a nice-to-have than a must-have and adds considerably more risk than value. So she negotiates that feature out of scope in exchange for a firmer commitment to an earlier delivery in 11 months, as illustrated in Figure 3.
Figure 3: The negotiated delivery commitment: earlier and more predictable.
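The kind of what-if analysis described here can be sketched with a small Monte Carlo simulation. Everything in this sketch is an assumption for illustration: the feature names, the effort numbers, the unit (team-weeks, worked through one after another), and the function name `prob_on_time`. It is not how the tool itself is implemented.

```python
import random

# Hypothetical features: (name, low, most likely, high) effort in team-weeks.
FEATURES = [
    ("login", 4, 6, 10),
    ("live scores", 8, 12, 20),
    ("video replay", 6, 14, 40),   # wide spread: the team is very uncertain
    ("notifications", 3, 5, 8),
    ("offline mode", 5, 8, 14),
]


def prob_on_time(features, target_weeks, trials=20_000, seed=1):
    """Estimate P(total effort <= target) by sampling each feature's triangle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # random.triangular takes (low, high, mode), in that order.
        total = sum(rng.triangular(low, high, mode)
                    for _, low, mode, high in features)
        if total <= target_weeks:
            hits += 1
    return hits / trials


full_scope = prob_on_time(FEATURES, target_weeks=52)
reduced = [f for f in FEATURES if f[0] != "video replay"]
reduced_scope = prob_on_time(reduced, target_weeks=48)   # roughly 11 months
print(f"full scope, 52 weeks:    {full_scope:.2f}")
print(f"reduced scope, 48 weeks: {reduced_scope:.2f}")
```

With these made-up numbers, dropping the single most uncertain feature raises the odds of delivering even against the earlier date, which is the shape of the trade Shirley negotiates.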
So Shirley now is in a good place. She has agreement on the scope of the project between her team and her stakeholders. She feels her team has a good chance of delivering on time.
In the Agile fashion, work proceeds by establishing work items to deliver the features. These work items are scheduled for iterations/sprints on an ongoing basis. As the team completes work items, they not only have less work to complete, but also have a track record of the actual time it takes the team to complete work (called team velocity). From a Bayesian perspective, these constitute important evidence of how well the project is actually executing. ARaVar queries Agilista for the completion status of the features, the work item burndown history, and updated effort-to-complete estimates for the remaining features. ARaVar uses modern predictive algorithms to update the time-to-complete distribution.
With these ongoing predictions, Shirley can discuss with her team and external stakeholders whether the odds of meeting the commitment are improving (as they should) or degrading. If the latter is the case, she can use ARaVar to predict the impact of managing content (decommitting features) or adjusting resources. For example, the tool revealed that one feature was very much at risk. In discussion with the stakeholders, it was decided that this feature was necessary and that more resources should be focused on it for the next sprint. Some staff were assigned to the team for just that sprint. With ARaVar, all stakeholders can have a more honest and trustworthy discussion on how best to proceed.
ARaVar does not yet exist, but it is not a dream. IBM Rational and Research are now in the process of developing such a tool for a possible delivery next year. We are calling the project AnDes (for Analytics of Development). AnDes uses state-of-the-art learning algorithms. We do have working versions federated with Rational Team Concert (we showed a preview at last year’s Rational Innovate). In addition to automating the data collection, we are exploring how it can be applied across a wide range of projects:
- Large to small
- Innovative to complex
- Fully or partially agile.
We are looking for design partners now! Interested? Please let me know at firstname.lastname@example.org.
In my previous couple of blog entries, I used triangular distributions for examples. For many who suffered through (or maybe enjoyed) their stat classes (what are the odds?), this might be a surprising choice. They were taught the default choice would be a Gaussian distribution. Those more attuned to modern business analytics are likely to be familiar with triangular distributions. In this entry, I'll briefly explain the reasoning behind each of them.
First, as you hopefully recall, both are distributions associated with random variables. (Those who don't recall might benefit from the series of tutorials at The Khan Academy site.) Each is a non-negative function with integral (area under the curve) one. (There are fancier mathematical definitions, but no matter.) Each describes the likelihood of each of a set of possible outcomes of some random variable. The difference in shape between Gaussian (aka Normal) and triangular distributions reflects the nature and use of the random variables.
Briefly, normal distributions often arise as the histogram of a set of measurements. They have some central value (called the mean) and some dispersion (called standard deviation) around the mean. Anyone who took a stat class studied these distributions. They show up in many contexts:
- The distribution that results from tabulating the histogram of repeated but imprecise measurements of some quantity, and then dividing the counts by the total number of measurements, is often assumed to be normal. The mean of the distribution is the estimator of the actual value.
Statisticians like the normal distribution for several reasons. First, it is easy to parameterize. If you know the mean, mu (μ), and the standard deviation, sigma (σ), you have completely characterized the distribution. For example, the likelihood of a measurement occurring is often characterized as being within some number of σ's from the mean. Figure 1 shows how this works.
The likelihood of a value falling in a range is given by the area under the curve. For example, the probability of a value of the normally distributed random variable falling within one standard deviation of the mean is 68.2%.
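The "within some number of σ's" probabilities can be computed from the standard library's error function. This is a small sketch; the helper name `prob_within` is my own.

```python
import math


def prob_within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma of the mean: {prob_within(k):.4f}")
```

Note that the result does not depend on μ or σ themselves, only on how many standard deviations wide the range is, which is why the two parameters characterize the distribution so conveniently.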
Normal distributions have one really cool feature called the Central Limit Theorem, which states that under remarkably general conditions, the sum of a large set of independent random variables will be close to normal. Notice, in the previous blog entry, when we added two triangular random variables, the sum appeared smooth and in fact started to look normal.
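One rough way to see the Central Limit Theorem at work is to sum more and more independent triangular variables and watch the lopsidedness (skewness) of the sum shrink toward zero, the value for a normal distribution. The triangle parameters (1, 6, 7), the trial counts, and the helper names below are assumptions for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: about 0 for a symmetric (e.g. normal) distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def skew_of_sum(n_terms, trials=20_000, seed=0, low=1.0, mode=6.0, high=7.0):
    """Skewness of the sum of n_terms independent triangular variables."""
    rng = random.Random(seed)
    sums = [
        sum(rng.triangular(low, high, mode) for _ in range(n_terms))
        for _ in range(trials)
    ]
    return skewness(sums)


for n in (1, 2, 10, 30):
    print(f"terms summed: {n:2d}  skewness: {skew_of_sum(n):+.3f}")
```

A single lopsided triangle is visibly skewed, while the sum of a few dozen of them is close to symmetric.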
All that said, I do have a pet peeve. Normal distributions are overused. Most things in nature and economics are not normally distributed. For example, as documented in Wikipedia, these phenomena are nowhere near normal, but are closer to a Pareto distribution:
- The sizes of human settlements (few cities, many hamlets/villages)
- File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)
- Hard disk drive error rates
- The values of oil reserves in oil fields (a few large fields, many small fields)
- The length distribution of jobs assigned to supercomputers (a few large ones, many small ones)
- The standardized price returns on individual stocks
- Maximum one-day rainfalls
- Sizes of sand particles
- Sizes of meteorites
- Areas burnt in forest fires
- Severity of large casualty losses for certain lines of business such as general liability, commercial auto, and workers compensation.
Getting back to our topic, let's turn to triangular distributions. They are not used to describe a set of measured outcomes from an experiment. They are used to describe what we know or believe about some unknown random variable.
For example, the sales of a new product one year after delivery generally cannot be determined by measuring the sales of a bunch of new products. As pointed out by Douglas Hubbard, treating the future sales as a single fixed value is unreasonable (although all too common). What is more reasonable is setting the low (L), high (H), and most likely (E) values of the future sales. As I wrote in an earlier entry, these are the values that specify a triangular distribution. That is, a triangular distribution is zero below a given low value, L, and above the high value, H, and peaks at the expected value E. The distribution is then described by a triangular curve scaled so that the total area is 1. Here is the distribution for L = 1, E = 6, and H = 7.
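For the example values L = 1, E = 6, and H = 7, the key numbers can be worked out in a few lines. This is a sketch using the standard closed-form formulas for a triangular distribution; the function name `triangular_cdf` is my own.

```python
def triangular_cdf(x, low, mode, high):
    """P(X <= x) for a triangular distribution with the given low, mode, high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))


L, E, H = 1, 6, 7
peak_height = 2 / (H - L)        # triangle with base H - L and area 1
mean = (L + E + H) / 3           # mean of a triangular distribution
print(f"peak height at E: {peak_height:.3f}")
print(f"mean:             {mean:.3f}")
print(f"P(X <= E):        {triangular_cdf(E, L, E, H):.3f}")
```

Because the peak sits close to the high value, the mean falls below the most likely value E. There is no assumption of symmetry.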
Some would argue there is a 'real' distribution of the future sales random variable and it is unlikely to be triangular. My response is that, for all practical purposes, it does not matter. The triangular distribution is a good-enough approximation to whatever the real distribution might be. By 'good enough' I mean that triangular distributions may be used to support decision making: they are a big improvement over using single values. They are also practical, as they are easy to specify and there is no assumption of symmetry. No wonder they are common in business analytics.
To wrap up, normal distributions are occasionally useful to describe outcomes of measurements, while triangular distributions are useful for giving rough estimates of one's belief in the likelihood of outcomes based on the evidence on hand. More generally, normal distributions are useful in frequentist statistics and triangular distributions in Bayesian statistics. See this Wikipedia article for a discussion of the kinds of statistics.
Much of what we do in development analytics is more Bayesian than frequentist. I hope to write more about that in the near future.
Modified by email@example.com
An ongoing theme of this blog is that development processes differ from other business processes in that there is a wide range of uncertainty inherent in the efforts. It follows that tracking and steering development efforts entails ongoing predicting, from the evolving project information, when a project is likely to meet its goals.
Late last year, Nate Silver, author of the FiveThirtyEight blog and well-known predictor of elections, published The Signal and the Noise, a text for the intelligent layperson on how prediction works. I was impressed by the book, as it explained the principles behind the sort of Bayesian analytics we need for development analytics without any explicit math. However, I felt that folks in our field would greatly benefit from having the mathematical blanks filled in. So I decided to write a series of papers introducing the topics to folks who had some statistics and maybe some calculus in college, but not a solid background in prediction principles.
The first in the series is now online: Filling in the blanks: The math behind Nate Silver's "The Signal and the Noise" Part 1. It presents the very basics of Bayesian analysis.
I hope you all find it useful and especially hope you find it interesting.
Folks who have heard me present will recognize the following
discussion as a variation of what I have used as an example to explain the
importance of variance in software and system estimates. Imagine this time you
are a development organization manager given the following artificial
opportunity. You can agree to the following deal: Have your teams, at your own
expense, develop some applications, each meeting a given set of requirements. The
client really wants the applications, will accept them if they are acceptable, and
is perfectly willing to be consulted throughout the projects. Here is the catch: if
you deliver the projects on time in 12 months, you will receive $1M per
application. If you are a day late, you get nothing. You have to decide whether to take the deal.
Let's suppose you take the projects to your estimators and
they tell you the estimated time to complete is 11 months and the estimated
cost to complete is $750K for each of the projects. So you stand to make an
estimated $250K per project. So you staff up and take on three
projects, looking forward to your bonus. Was this a good deal?
Those who have read The
Flaw of Averages by Sam Savage (illustrated by Jeff Danziger) already know the answer.
Those who haven’t read the book should. This book nicely captures the sort
of statistical reasoning that underlies IBM Rational’s approach to business
analytics and optimizations (found in the RTC agile planner and the ROI
calculations in Focal Point). Some key rules:
- Uncertain quantities are captured by curves
called distributions (e.g. the bell shaped curve of normal distributions).
- Most distributions for uncertain quantities are
not normal, bell shaped curves, i.e. normal distributions are abnormal.
- Calculating with averages in any case yields the
wrong answer with business critical effects. Rather one should calculate with
the distributions. This is done with Monte Carlo methods.
Back to the example: The time to complete is an uncertain
quantity and so must be described by a distribution. Often, the estimate
returned by the estimator is the mean of that distribution. The distribution
may be pretty wide and so may look like Figure 1 of the attached document. (I
have had bad luck trying to embed figures in the blog, so I have put the figures in this attachment.) Note that 40% of the distribution lies beyond 12 months.
Assuming the $750K cost to complete estimate is dead on,
let's apply some simple high school probability to get the distribution of
profit (See Figure 2):
- The chance of succeeding at all three projects and getting $3M in revenue is (0.6)^3 = 0.216.
- The chance of succeeding at exactly two projects and getting $2M in revenue is 3(0.6)^2(0.4) = 0.432.
- The chance of succeeding at exactly one project and getting $1M in revenue is 3(0.6)(0.4)^2 = 0.288.
- The chance you will fail at all three projects, yielding no revenue, is (0.4)^3 = 0.064.
The weighted average of the distribution of revenues is
(0.216)($3M) + (0.432)($2M) + (0.288)($1M) + (0.064)($0) = $1.8M.
So the likely outcome of your (3)$750K = $2.25M expense is an expected loss of $450K.
But wait, it is worse. The distribution is probably not
normal. Programs are more likely to be late than early and so are skewed to the
right. In this case the average (i.e. the mean) is less than the 50% point. So,
as shown in Figure 3, it is possible to have the estimate of 11 months and the
likelihood of failure is 50%. The revenue distribution is given in Figure 4. In
this case, the weighted average of the distribution of revenues is
(0.125)($3M) + (0.375)($2M) + (0.375)($1M) + (0.125)($0) = $1.5M.
In this case the expected loss is $750K.
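The two weighted averages above are binomial calculations and are easy to check in code. This sketch assumes, as the arithmetic in the text does, that the three projects succeed or fail independently; the function names are my own.

```python
from math import comb


def revenue_distribution(p_success, projects=3, payoff=1_000_000):
    """Map each possible revenue to its binomial probability."""
    return {
        k * payoff: comb(projects, k) * p_success**k * (1 - p_success) ** (projects - k)
        for k in range(projects + 1)
    }


def expected_revenue(p_success, projects=3, payoff=1_000_000):
    """Probability-weighted average of the revenue distribution."""
    dist = revenue_distribution(p_success, projects, payoff)
    return sum(revenue * prob for revenue, prob in dist.items())


COST = 3 * 750_000   # three projects at the estimated $750K each

for p in (0.6, 0.5):
    revenue = expected_revenue(p)
    print(f"P(on time) = {p}: expected revenue ${revenue:,.0f}, "
          f"expected profit ${revenue - COST:,.0f}")
```

The expected revenue is simply 3 × p × $1M, so every ten points of on-time probability lost costs $300K of expected revenue against a fixed $2.25M expense.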
But wait, it is still worse. The cost to complete is also
uncertain. To keep things as simple as possible, let's suppose the cost to
complete for each of the projects is described by three values: the best case is
$700K, the likely case is $750K, and the worst case is $1M. Computing the
expected profit in this case requires using these values as parameters for a
triangular distribution (see Figure 5) and then applying Monte Carlo methods
to get the distribution that describes the profit. The result
is shown in Figure 6. Briefly, in this case:
The most likely outcome is a loss of $945K
There is a 90% certainty of losing at least
There is a 10% chance of losing more than $1.1M
So taking this deal is at best career limiting!
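A Monte Carlo calculation of this kind can be sketched in a few lines. The assumptions here are mine: three independent projects, a 50% chance of each being on time (the skewed case above), costs drawn independently from the triangular distribution with the three values given, and the particular trial count and seed. It is not the model behind Figure 6 and will not necessarily reproduce its exact numbers.

```python
import random


def simulate_profit(trials=50_000, p_on_time=0.5, projects=3, seed=42):
    """Sample total profit over all projects, one value per trial."""
    rng = random.Random(seed)
    profits = []
    for _ in range(trials):
        profit = 0.0
        for _ in range(projects):
            revenue = 1_000_000 if rng.random() < p_on_time else 0
            # random.triangular takes (low, high, mode), in that order.
            cost = rng.triangular(700_000, 1_000_000, 750_000)
            profit += revenue - cost
        profits.append(profit)
    return profits


def summarize(profits):
    """Mean, 10th and 90th percentiles, and the chance of losing money."""
    ordered = sorted(profits)
    n = len(ordered)
    return {
        "mean": sum(ordered) / n,
        "p10": ordered[int(0.10 * n)],
        "p90": ordered[int(0.90 * n)],
        "prob_loss": sum(p < 0 for p in ordered) / n,
    }


for name, value in summarize(simulate_profit()).items():
    print(f"{name}: {value:,.3f}")
```

Under these assumptions the deal loses money unless all three projects land on time, because even two successes bring in $2M against at least $2.1M of cost.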
Notice by ignoring the rules, one is tempted to make a bad
deal. Applying each of the rules with more discipline shows how bad the deal
is. The moral of all this is that making business decisions based on
calculations of averages can lead to disastrous outcomes.
This moral needs to be taken to heart by our industry. Far
too often, managers, when faced with funding projects or making business
commitments insist, “Just give me the number.” What they need is a distribution;
the number they are given is likely to be an average. Decisions based on the
number will likely go sour. No wonder the software and system business outcomes
rarely delight their stakeholders. The good news is that there are robust,
proven techniques to avoid the flaw of averages.
First, some personal disclosure: In the late 1980’s, I
worked for a while at Shell Research, developing seismic modeling and data
imaging algorithms. (See
.) While there I received training on oil exploration. I
remain awed by the passion, expertise, daring, and discipline of the engineers,
scientists, technicians, and skilled laborers who take responsibility for
providing the hydrocarbons we completely rely on.
Oil exploration is remarkably costly and risky. Even then in
the late 1980’s, it was not uncommon to spend $1B on an exploration well,
hoping to find oil based on the seismic data only to find it dry. At Shell, I was on a team that developed, for the time, a
highly compute-intensive algorithm for imaging seismic data captured in complex
subsurfaces. They literally bought
us a Cray since running the algorithm might make a marginal difference in the
success rate of exploration drilling.
Hence, I am not an oil industry basher, far from it. So I
have been watching the BP, Deepwater Horizon gulf catastrophe with great
interest. In this entry, I will share what I have gleaned from various news
sources. (I have found the Wall Street Coverage very credible). So here is my
Recall, the blowout occurred shortly after capping an
exploration well (a well drilled solely to confirm the presence of an oil
reservoir). The depth of the well, reportedly 18,000 ft, is no big deal. The depth
of the water, 5000 ft, is far from the record of around 8000 ft. So the well itself was routine for
the industry. So what happened?
BP had drilled many of these wells. In fact, ironically the
blowout occurred while BP executives were celebrating their safety record. However,
over the last few years they have become profitable by building a very
cost-conscious culture. Such a culture is likely to cut corners, repeatedly
taking small risks in business operations. Such behavior may be rational if you
believe the total liability is bounded. There is reason to believe BP’s
liability is ‘capped’ at $75M. This culture seemed to be at work on the oil
I bet that BP managers routinely
made the same decisions for years with no adverse outcomes. They were probably
rewarded for this behavior. Such a culture makes such disasters inevitable over
time. A great case study of such
cultures is found in Diane
Vaughan’s The Challenger Launch Decision:
Risky Technology, Culture, and Deviance at NASA. She studies the NASA
culture that led to the decision to launch the shuttle that exploded on launch killing
‘the first teacher in space’ while tens of thousands of school children watched
on television. The managers overruled the engineers who advised them that the
temperature was out of spec for a launch. As she explains, the managers had
gotten away with taking similar risks in the past and had decided to bow to
political pressures and approve the launch.
So what they were thinking is something like, “These risks
are no big deal and the savings matter.”
A key moral is that over time the unlikely becomes
inevitable. Further, experience and past data lead to exactly the wrong
behavior: they reward the risk taking, not the caution. Many have made exactly
the same point about behavior of the financial firms during the financial
Internalizing this moral and acting accordingly is key to
our industry. Increasingly, we will be building life critical, economic critical
systems that are very complex, will operate over long periods, and whose
failure could be catastrophic. There is no turning away from this
inevitability. So, we all need to understand that cost savings must be balanced
by a clear understanding of the overall risks of failure, their consequences
and the real return of investment in failure avoidance. This of course takes some math. In
particular, thinking about averages is not useful. That is the topic of my next
First, I am pleased that many saw the humor in the April Fools posting. That said, I wonder if there will ever be quantum project management. Also, I fear this blog lacks humor. I will do what I can, but there is only so much that can be done to spice up the topic of analytics for software and system organizations.
So, back to the serious stuff.
But first, a joke that I believe dates back to vaudeville: Onstage, there is a streetlight. Under the streetlight, there is a man crawling around on hands and knees. A policeman walks up and asks what he is doing. The man says he is looking for his keys. The policeman asks if he is sure he lost them here. The man answers, "No, in fact I lost them down the street." The policeman asks why he is looking under the light. The man answers, "The light is brighter here."
OK, not so funny. So what's the point? A while back, I was discussing a client's measurement program with a colleague (who will remain nameless and I hope is reading the blog). I pointed out it would not serve any purpose. My colleague answered, "Well, at least they are measuring something." I retorted, "First, you need to figure out what you need to measure, then figure out how to do the analysis and get the data." We left it at that. More generally, software and system organizations often measure what is easy, not what they need. They look where the light is brightest. We still have the question of how to specify the needed measures, analytics, and data collection program.
In an earlier entry, I proposed some measurement principles. While these principles are sound for assessing a measurement and analytics program, they do not provide operational guidance for defining the set of measures, associated analytics, and data. What is also needed is the analytics version of a requirements analysis. Last Friday two colleagues (Clay Williams and Peri Tarr, who I believe do read the blog) introduced me to the Goal Question Metric (GQM) method. This method has been extended in various ways, such as GQM+Strategies. I have seen the method applied. It looks much like functional decomposition, and so it is a requirements analysis technique for analytics solutions. I think it should be extended to include identification of the data sources. So we would have GQMAD (not kidding), my spin on the main idea:
- Goals - what is the organization trying to achieve
- Questions - how would we know quantitatively that the goals are being met?
- Measures that provide answers to the questions.
- Analytics that realize the measures
- Data that feeds the analytics.
Taking such an approach is a far cry from looking where the light is brightest. Note after building out the GQMAD requirements, one still needs to design how the data is collected and staged, how the analysis is executed, and measures displayed in order to answer the questions. So design and development of the analytics solution remains after the GQMAD process is carried out.
For my waterfallphobic friends, I share the concern. Building an analytics solution this way should be more iterative than is described above. Probably something like the Unified Process can be applied using GQMAD as a good requirements practice.
Anyone out there with GQM experience they would like to share?
As I mentioned in the first posting, I am still getting the hang of blogging. I guess one use of blogs is to share what is on my mind while staying in the neighborhood of the topic of analytics. So, I have been putting a lot of thought into Toyota's dilemma about how to deal with the reports of dangerous acceleration in their cars. The recent reports of Prius incidents (see this article in the New York Times) confirmed some of my earlier suspicions, hence this blog entry.
First, I need to come clean; all I know is from news accounts. I have had no contact with Toyota or any IBMers working with Toyota. Further, I need to say that the opinions here, which in my opinion are not controversial, are my own and do not reflect any IBM position.
So what do we know:
- The two models with reported acceleration problems, Camry and Prius, have bus interfaces from the pedal to the electronic control unit (ECU) that manages the acceleration (see the Electronic Design News article).
- There are confirmed cases where the floor mat is not the cause (see the NYT article in the previous link).
- Toyota has millions of Camrys on the road, and the Prius has been the largest selling hybrid.
- Toyota has stated they cannot reproduce the problem.
So, here is how it seems to me. The models with the problem are those that have embedded software. Further, the incidence of the reports is consistent with the failure rate of software that meets usual software quality standards. It is well known that the time and cost of testing go up dramatically as the defect density gets low. Further, the defect may involve interactions between the ECUs. From an analytic perspective, the combined state space of the ECUs is very large, and as defects get removed, the bug states become sparse. So the likelihood of a set of interactions getting the integrated ECUs into a bug state is very low, and such a state is unlikely to be found in conventional testing. So there may well be some latent defect in the embedded code that they have not found after whatever amount of testing they find affordable, particularly given the pressures of getting into the market. In conventional software, when the cost of finding the next defect is too high, the code is released in beta so that a crowd of volunteers can further exercise the code, putting in the hours to find the next set of defects. One cannot beta test an automobile.
Now, say there is one chance in a million miles of driving of the latent defects manifesting. They may be impossible to find with standard testing and will inevitably happen every so often to drivers. This is the standard insight that with large volumes unlikely events become inevitable. So with Toyota's large sales, they may be the victim of their success.
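The arithmetic behind "unlikely events become inevitable" is short. In this sketch the one-in-a-million-miles rate comes from the paragraph above, while the fleet size, the annual mileage, and the assumption that each mile is an independent trial are hypothetical.

```python
import math


def prob_at_least_one(p_per_mile, miles):
    """P(at least one event) = 1 - (1 - p)^miles, computed stably."""
    return -math.expm1(miles * math.log1p(-p_per_mile))


def expected_events(p_per_mile, miles):
    """Average number of events over the given mileage."""
    return p_per_mile * miles


P_PER_MILE = 1e-6                      # one chance in a million miles
test_miles = 100_000                   # hypothetical test-track mileage
fleet_miles = 1_000_000 * 10_000       # hypothetical: 1M cars, 10K miles a year

print(f"test program: P(at least one) = {prob_at_least_one(P_PER_MILE, test_miles):.3f}")
print(f"fleet, 1 yr:  P(at least one) = {prob_at_least_one(P_PER_MILE, fleet_miles):.6f}")
print(f"fleet, 1 yr:  expected events = {expected_events(P_PER_MILE, fleet_miles):,.0f}")
```

With these numbers, a test program is more likely than not to miss the defect entirely, while the fleet as a whole is all but certain to hit it many times a year.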
The avionics community has developed a discipline around safety-critical software. There are design and model testing methods to validate that the embedded software is good enough to stake people's lives on the code running correctly. (There is a good article in the latest Communications of the ACM on model checking for avionics.) It seems Toyota and the entire auto industry need to adopt these safety-critical disciplines going forward. The cost of these practices is overshadowed by the costs of the highly publicized incidents, the suits, and other liability.
As readers of this occasional blog know, this blog has been less of a 'web log' and more a series of small essays on the topic of development analytics. I have decided to start writing less formal entries more frequently and have realized I would be comfortable doing that on my own web site, murraycantor.com.
I want to be entirely clear. IBM has in no way looked over my shoulder in the writing of the blog and has been very generous in providing me a forum. Nevertheless, I will be freer sharing my opinions when there is no opportunity for confusing my often idiosyncratic opinions with those of the company.
So check out the new blog at www.murraycantor.com/blog. I hope to write something at least twice a month, maybe more often.
Meet you there and thanks!
Since the last entry introducing the concept of liability, I have had the opportunity to discuss it on several occasions with colleagues at IBM. In the course of these discussions I formulated what seems to be a useful way to explain the idea. In particular, I presented this idea at the Managing Technical Debt Workshop held on October 9. The following is a preview of what I will present as a lightning talk at the Cutter Consortium Summit next week.
Imagine an insurance agent comes into your office with the following offer: "Our company will indemnify your code against the following risks:
Excess support costs (above some deductible)
The policy will only cost $X a year." You realize that code insurance is much like auto liability insurance. In the auto case, the insurance protects you financially against the possible unfortunate outcome of driving the car; in the code case, the insurance protects you against some unfortunate outcome of running the code. This leads to the definition:
'Technical Liability' is the financial risk exposure over the life of the code.
(Thanks to my colleague, Walker Royce, for this crisp definition.)
Note auto insurance and code insurance have some significant differences.
The context for driving - city streets, highways, parking lots, ... - is more limited than the range of contexts in which code can operate. Software is truly everywhere, from being embedded in an avionics system to Angry Birds on a smartphone.
The risk for auto insurance is spread among a small number of large, relatively homogeneous populations: young drivers, safe drivers, high-risk drivers, etc. So rates can be computed from population experience. We have no such insurance markets for software.
Generally, firms faced with assuming a liability have a choice: Either they buy a policy indemnifying them against the risk or they self-insure. When they self insure, it is often reported in the annual reports.
If you ship software, you are assuming a liability. As far as I know, code insurance is either rare or nonexistent. If it did exist, the cost of the policy would be charged against the financial value of the code. So we are left with self-insuring.
Here is the main point. In order to truly assess the economic value of the code, one should, as best one can, estimate the technical liability and a fair price, X, for the indemnification. Even a rough estimate of X is better than ignoring the liabilities assumed by shipping code.
So how to estimate X? My first observation should be no surprise to readers of this blog. Since technical liability involves the future, there is a range of outcomes of future exposure, each of which has some probability. Technical liability has a probability distribution and so is a random variable. X is a statistic (perhaps the mean) of the distribution.
As suggested above, code liabilities come in flavors: there are exposures resulting from security, reliability, integrity, and so on. Each of these flavors is characterized by its own random variable. The overall liability is the sum of the liabilities that apply to the particular code. As I mentioned in a previous entry, this sum of random variables is also a random variable, and its distribution can be found using Monte Carlo simulation.
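To make the summing step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three flavors, their yearly event probabilities, and the low / most likely / high costs are made-up numbers, and modeling each flavor as "an event that may or may not happen, with a triangular cost if it does" is just one simple choice, not a recommended model.

```python
import random
import statistics

# Hypothetical liability flavors. Each entry is:
#   (probability the event occurs in a year, low cost, most likely cost, high cost)
# All numbers are invented for illustration.
FLAVORS = {
    "security":    (0.05, 50_000, 200_000, 2_000_000),
    "reliability": (0.30, 5_000, 20_000, 100_000),
    "integrity":   (0.10, 10_000, 40_000, 300_000),
}

def simulate_total_liability(flavors, trials=20_000, seed=1):
    """Return one simulated total annual liability per trial."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for p_event, low, mode, high in flavors.values():
            if rng.random() < p_event:
                # random.triangular takes its arguments as (low, high, mode)
                total += rng.triangular(low, high, mode)
        totals.append(total)
    return totals

totals = simulate_total_liability(FLAVORS)
x_mean = statistics.mean(totals)                 # one candidate for X
x_p95 = statistics.quantiles(totals, n=20)[-1]   # 95th percentile of exposure
print(f"mean liability: {x_mean:,.0f}")
print(f"95th percentile: {x_p95:,.0f}")
```

The list `totals` is a sample from the distribution of overall liability. The mean is one candidate for X; a high percentile shows how bad a bad year could be, which the mean alone hides.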
Now, reasoning about code liability is not unprecedented. Car manufacturers estimate warranty exposure, and telephone switch manufacturers reason about the economic value of going from .99999 reliable to .999999 reliable. There are Bayesian models of the likelihood of a security breach. To estimate technical liability, we need to agree upon the taxonomy of flavors of liability, not a daunting task, and then assemble good enough models of each into an overall framework.
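As a rough illustration of the switch example, the sketch below reads ".99999 reliable" as availability, that is, the fraction of the year the system is up. That reading, the function names, and the cost per minute of outage are my assumptions, not figures from any manufacturer.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def expected_downtime_minutes(availability):
    """Expected minutes per year the system is unavailable."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def annual_value_of_improvement(before, after, cost_per_minute):
    """Expected yearly outage cost avoided by raising availability."""
    saved = expected_downtime_minutes(before) - expected_downtime_minutes(after)
    return saved * cost_per_minute

five_nines = expected_downtime_minutes(0.99999)    # a bit over 5 minutes a year
six_nines = expected_downtime_minutes(0.999999)    # roughly half a minute a year

# Hypothetical cost of one minute of outage: 10,000
value = annual_value_of_improvement(0.99999, 0.999999, cost_per_minute=10_000)
print(f"value of the extra nine: {value:,.0f} per year")
```

The point of the arithmetic is that the extra nine is worth paying for only if the engineering cost of getting there is less than the expected outage cost it removes.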
Over the last couple of years I have been more or less following the technical debt community's discussion on what exactly technical debt is. Some argue that technical debt is limited to what it would cost to address deficiencies such as those found by code inspection tools such as Sonar. Other writers such as Chris Stirling introduce aspects or kinds of technical debt: quality debt, design debt, ....
My interpretation of the Ward Cunningham metaphor on incurring debt by shipping is broader, including the wide range of after-delivery costs. This entry continues that discussion and suggests one path forward.
I argued that technical debt should reflect the fact that the very act of shipping software incurs all sorts of possible liabilities, any one of which may incur some future cost:
- Future service costs
- Executives getting on planes to deal with critical situations
- Fines resulting from privacy violations
- Loss of business from failing a compliance audit
- Loss of intellectual capital due to security flaws
The nature of the liabilities varies from domain to domain. Shipping the next rev of a mobile game like Angry Birds entails much less liability than the next rev of avionics software for a commercial jet.
The cost of fixing the code may be the least of it and underestimates the assumed liabilities. Reasoning about whether these liabilities outweigh the benefits of shipping the code is key to the ship decision.
Since I wrote that entry I have been watching the technical debt space and see that I may be in the minority, but not alone, with this perspective. Some people argue that technical debt is solely the cost of addressing shortfalls in the code. Others adopt a broader definition. In fact, in a conversation I had with Capers Jones, a long-time expert in software measurement, he shared a conversation he had with Ward discussing the same points. I have seen others make a distinction between software debt and technical debt. I have decided not to weigh in on this argument, but suggest we call all of the liabilities (wait for it ...) technical liability.
There is a key difference between standardly-defined technical debt and technical liability: Technical debt involves code quality and can be determined. The liabilities involve possible future events and so entail predictions of the future. Some might even consider technical debt knowable and technical liability unknowable.
Readers of this blog know where I am going. Technical liability, unlike the more limited technical debt, involves a range of future possibilities, and so each of the components of liability should be specified as a random variable with a probability distribution. The security violation might or might not occur. But if it does, the possible expense could sink the company. Reasoning about the risk takes some advanced techniques, like those used to set the price of an insurance policy.
Finally, to make the economic decision of whether it makes sense to ship a piece of software, one needs to balance the value expected from the ship against the assumed liabilities. Note that the future value is also a random variable. In that case the decision to ship should be based on the techniques found here. I will elaborate the reasoning behind technical liability in a future blog (promise).
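One way to picture the balancing act is to simulate both sides and look at the distribution of the difference. The sketch below is only that, a sketch: the triangular value estimate, the single security-style liability, and all of the numbers are hypothetical, and it is not a substitute for the techniques referred to above.

```python
import random

def simulate_net_value(trials=20_000, seed=7):
    """Simulate (value of shipping - liability incurred), one draw per trial."""
    rng = random.Random(seed)
    nets = []
    for _ in range(trials):
        # Hypothetical value of the ship: low 100k, most likely 400k, high 900k.
        # random.triangular takes its arguments as (low, high, mode).
        value = rng.triangular(100_000, 900_000, 400_000)
        # Hypothetical liability: a 5% chance of a costly event.
        liability = 0.0
        if rng.random() < 0.05:
            liability = rng.triangular(50_000, 2_000_000, 200_000)
        nets.append(value - liability)
    return nets

nets = simulate_net_value()
p_positive = sum(n > 0 for n in nets) / len(nets)
expected_net = sum(nets) / len(nets)
print(f"probability the ship pays off: {p_positive:.3f}")
print(f"expected net value: {expected_net:,.0f}")
```

Both numbers matter to the decision. A ship can have a healthy expected net value and still carry a small chance of a loss large enough to worry about, which is exactly the case the insurance analogy is meant to capture.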
In summary then, technical liability gives a more complete picture of the economics of shipping a piece of code than technical debt, but it requires more sophisticated analysis.