Yesterday, I was at the Conference on Systems Engineering Research (CSER), held this year at Stevens Institute. I sat through a talk that stimulated my curmudgeon tendencies. In the spirit of hopefully generating some controversy, I will not hold back.
The talk was about an expert-system-based engineering risk management system. Essentially, the authors got a set of experts together to identify categories of risk (people, delivery, product ...), the risks within each category, and a method for scoring each risk's level and its consequence and then summing the products of the levels and the consequences. The result is a total amount of risk per category. Looking at the output is supposed to give you insight into the overall program risk and the contributing risks.
My problem is that I cannot parse that last sentence. In fact, I do not understand terms like "program risk" and, say, "people risk". There may be a clash of cultures here; to many, those terms seem perfectly reasonable.
My argument starts here: One can ask 'What is my risk of going over budget?' or 'What is my risk of missing the delivery date?' Questions of this sort are answered using standard business analytics. See, for example, Mun's text on risk analysis, which defines risk as statistical uncertainty in a quantity that matters. For example, 'time to complete' is a quantity that clearly matters to a project. The uncertainty in making the date can be measured as the variance (or standard deviation) of the estimate of the time-to-complete. (Note, for the math aware, time-to-complete is what statisticians call a continuous random variable.) So the answer to the question 'What is my schedule risk?' has an unambiguous, quantified answer. 'What is my people risk?' has no such answer. In fact, 'people risk' is not a concept defined in business analytics.
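To make this concrete, here is a minimal sketch of schedule risk as the standard deviation of time-to-complete. The task estimates and the choice of triangular distributions are invented purely for illustration; nothing here comes from the talk being discussed.

```python
# Sketch: "schedule risk" as statistical uncertainty in time-to-complete.
# Assumes (hypothetically) the project is three tasks, each with a
# (min, most-likely, max) duration estimate in weeks.
import random

def simulate_time_to_complete(n_trials=100_000, seed=42):
    rng = random.Random(seed)
    tasks = [(4, 6, 12), (2, 3, 8), (5, 8, 20)]  # invented estimates
    samples = []
    for _ in range(n_trials):
        # random.triangular takes (low, high, mode)
        samples.append(sum(rng.triangular(lo, hi, mode) for lo, mode, hi in tasks))
    mean = sum(samples) / n_trials
    var = sum((x - mean) ** 2 for x in samples) / n_trials
    return mean, var ** 0.5  # schedule risk = standard deviation

mean, risk = simulate_time_to_complete()
print(f"expected completion: {mean:.1f} weeks; schedule risk (std dev): {risk:.1f} weeks")
```

The point is that 'schedule risk' comes out as a number with units (weeks), not a score.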
Of course, it does make sense to ask what contributes to the schedule risk. One might fear that the inability to staff the project contributes to the schedule risk. Fair enough. In my mind, that does not make staffing a 'risk', but rather a schedule risk factor.
I am not sure why I am so adamant about this, but I am. It could be that I believe that the less precise use and measurement of risk is holding our industry back.
Anyone want to comment on or defend the so-called risk management practice underlying the talk I found so annoying?
First, some personal disclosure: In the late 1980s, I worked for a while at Shell Research, developing seismic modeling and data imaging algorithms. (See this link.) While there I received training in oil exploration. I remain awed by the passion, expertise, daring, and discipline of the engineers, scientists, technicians, and skilled laborers who take responsibility for providing the hydrocarbons we completely rely on.
Oil exploration is remarkably costly and risky. Even then, in the late 1980s, it was not uncommon to spend $1B on an exploration well, hoping to find oil based on the seismic data, only to find the well dry. At Shell, I was on a team that developed a, for the time, highly compute-intensive algorithm for imaging seismic data captured in complex subsurfaces. They literally bought us a Cray because running the algorithm might make a marginal difference in the success rate of exploration drilling.
Hence, I am not an oil industry basher, far from it. So I have been watching the BP Deepwater Horizon gulf catastrophe with great interest. In this entry, I will share what I have gleaned from various news sources. (I have found the Wall Street Journal's coverage very credible.) So here is my net:
Recall, the blowout occurred shortly after capping an exploration well (a well drilled solely to confirm the presence of an oil reservoir). The depth of the well, reported at 18,000 ft, is no big deal. The depth of the water, 5,000 ft, is well short of the record of around 8,000 ft. So the well itself was routine for the industry. So what happened?
BP had drilled many of these wells. In fact, ironically, the blowout occurred while BP executives were celebrating their safety record. However, over the last few years BP has become profitable by building a very cost-conscious culture. Such a culture is likely to cut corners, repeatedly taking small risks in business operations. Such behavior may be rational if you believe the total liability is bounded, and there is reason to believe BP’s liability is ‘capped’ at $75M. This culture seemed to be at work on the oil platform.
I bet that BP managers routinely made the same decisions for years with no adverse outcomes. They were probably rewarded for this behavior. Such a culture makes such disasters inevitable over time. A great case study of such a culture is found in Diane Vaughan’s The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. She studies the NASA culture that led to the decision to launch the shuttle that broke apart shortly after launch, killing ‘the first teacher in space’ while tens of thousands of schoolchildren watched on television. The managers overruled the engineers who advised them that the temperature was out of spec for a launch. As she explains, the managers had gotten away with taking similar risks in the past and decided to bow to political pressure and approve the launch.
So what they were thinking is something like, “These risks are no big deal and the savings matter.”
A key moral is that over time the unlikely becomes inevitable. Further, experience and past data lead to exactly the wrong behavior: they reward the risk taking, not the caution. Many have made exactly the same point about the behavior of financial firms during the financial meltdown.
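The arithmetic behind "the unlikely becomes inevitable" is worth spelling out. With an assumed, purely illustrative per-operation failure probability, the chance of at least one disaster grows relentlessly with repetition:

```python
# If each corner-cutting operation carries a small (hypothetical) chance of
# catastrophic failure, P(at least one failure in n operations) = 1 - (1 - p)^n.
p_failure = 0.001  # invented per-operation probability, for illustration only

for n_operations in (100, 1_000, 5_000):
    p_at_least_one = 1 - (1 - p_failure) ** n_operations
    print(f"{n_operations:>5} operations -> P(at least one disaster) = {p_at_least_one:.1%}")
```

At 1,000 repetitions the "one in a thousand" event is already more likely than not; at 5,000 it is nearly certain. Meanwhile every individual manager's experience says the gamble always pays off.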
Internalizing this moral and acting accordingly is key to our industry. Increasingly, we will be building life-critical, economically critical systems that are very complex, will operate over long periods, and whose failure could be catastrophic. There is no turning away from this inevitability. So, we all need to understand that cost savings must be balanced by a clear understanding of the overall risks of failure, their consequences, and the real return on investment in failure avoidance. This of course takes some math. In particular, thinking about averages is not useful. That is the topic of my next entry.
Since the last entry introducing the concept of liability, I have had the opportunity to discuss it on several occasions with colleagues at IBM. In the course of these discussions I formulated what seems to be a useful way to explain the idea. In particular, I presented it at the Managing Technical Debt Workshop held on October 9. The following is a preview of what I will present as a lightning talk at the Cutter Consortium Summit next week.
Imagine an insurance agent comes into your office with the following offer: "Our company will indemnify your code against the following risks:
The policy will only cost $X a year." You realize that code insurance is much like auto liability insurance: in the auto case, the insurance protects you financially against possible unfortunate outcomes of driving the car; in the code case, it protects you against unfortunate outcomes of running the code. This leads to the definition:
'Technical Liability' is the financial risk exposure over the life of the code.
(Thanks to my colleague, Walker Royce, for this crisp definition.)
Note that auto insurance and code insurance have some significant differences.
Generally, firms faced with assuming a liability have a choice: either they buy a policy indemnifying them against the risk, or they self-insure. When they self-insure, it is often reported in their annual reports.
If you ship software, you are assuming a liability. As far as I know, code insurance is either rare or nonexistent. If it existed, the cost of the policy would be charged against the financial value of the code. So we are left with self-insuring.
Here is the main point. In order to truly assess the economic value of the code, one should, as best one can, estimate the technical liability and a fair price, X, for the indemnification. Even a rough estimate of X is better than ignoring the liabilities assumed by shipping code.
So how to estimate X? My first observation should be no surprise to readers of this blog. Since technical liability involves the future, there is a range of outcomes of future exposure, each of which has some probability. Technical liability therefore has a probability distribution and so is a random variable. X is a statistic (perhaps the mean) of that distribution.
As suggested above, code liability comes in flavors: there are exposures resulting from security, reliability, integrity, and so on. Each flavor is characterized by its own random variable. The overall liability is the sum of the liabilities that apply to the particular code. As I mentioned in a previous entry, this sum of random variables is itself a random variable, whose distribution can be found using Monte Carlo simulation.
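The simulation is straightforward to sketch. Everything below is invented for illustration: the choice of lognormal exposure distributions and every parameter are assumptions, not a claim about any real codebase.

```python
# Sketch: estimating technical liability X by Monte Carlo. Each "flavor" of
# liability (security, reliability, integrity) gets a hypothetical lognormal
# annual dollar exposure; the overall liability is the sum per trial.
import random

def estimate_liability(n_trials=100_000, seed=7):
    rng = random.Random(seed)
    # (mu, sigma) of ln(exposure in $) per flavor -- invented numbers
    flavors = {"security": (9.0, 1.5), "reliability": (10.0, 1.0), "integrity": (8.0, 2.0)}
    totals = []
    for _ in range(n_trials):
        totals.append(sum(rng.lognormvariate(mu, sigma) for mu, sigma in flavors.values()))
    totals.sort()
    mean = sum(totals) / n_trials          # one candidate statistic for X
    p95 = totals[int(0.95 * n_trials)]     # a tail statistic, if X should be conservative
    return mean, p95

mean, p95 = estimate_liability()
print(f"mean annual liability ~= ${mean:,.0f}; 95th percentile ~= ${p95:,.0f}")
```

Whether X should be the mean or a tail percentile is itself a business decision; a heavy-tailed liability can make the two differ enormously.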
Now, reasoning about code liability is not unprecedented. Car manufacturers estimate warranty exposure; telephone switch manufacturers reason about the economic value of going from .99999 to .999999 reliability. There are Bayesian models of the likelihood of a security breach. To estimate technical liability, we need to agree upon the taxonomy of flavors of liability, not a daunting task, and then assemble good-enough models of each into an overall framework.
It should not be a surprise that I have been following the BP oil spill with much interest. In fact, as I started typing this entry, I was watching the grilling of the BP CEO, Tony Hayward, by Congress. Rep. Stupak is focusing on BP’s risk management.
Some of you have read my earlier posting on my thoughts about the BP decision process that led to the Deepwater Horizon blowout. So far, information uncovered since that posting is remarkably consistent with my earlier suppositions. In this entry I would like to step back a bit and discuss what broader lessons might be learned from the incident. While it is all too easy to fall into BP bashing, I would rather use this moment to reflect more deeply on risk taking and creating value. (BTW, some of you might know that my signature slogan is ‘Take risks, add value’.)
In our industry we create value primarily through the efficient delivery of innovation. Delivering innovation, by definition, requires investing in efforts without full initial knowledge of the effort required or the value of the delivery. This incomplete information results in uncertainties in the cost, effort, and schedule of the projects and in the value of the delivered software and system, i.e. cost, schedule, and value risk.
Deciding to drill an oil well also entails investing in an effort with uncertain costs and value. In this case, the structure of the subsurface and the productivity of the well cannot be known with certainty before drilling. As I pointed out in an earlier blog, a good definition of risk is uncertainty in some quantifiable measure that matters to the business. So in both our industry and oil drilling we deliberately assume risk to deliver value.
So, what can we learn from the BP incident? Briefly, one creates value by genuinely managing risk. One creates the semblance of value for a while by ignoring risk.
Assuming risk, investing in uncertain projects, provides the opportunity for creating value. That value is actually realized by investing in activities that reduce the risk. The model that shows the relationship is described in this entry. So, reducing risk has economic value, but reducing risk takes investment. In the end, the quality of risk management is measured with a return-on-investment calculation. This in turn requires a means to quantify, and in fact monetize, risk.
I wonder what risk management approach was followed by BP. A recent Wall Street Journal article suggested they used a risk-map approach: building a diagram with one axis a score of the ‘likelihood of the risk’ and the other a score of the ‘severity of a failure’. With this method, they would score the risk of a blowout as very low (based on past history) with a very high consequence, so such a risk needs to be ‘mitigated’. (Some actually multiply the scores to get some absolute risk measure.) Their mitigation was the installation of a blowout preventer. They could then confidently report that they had executed their risk management plan. Note these scores are at best notionally quantified and not monetized.
Paraphrasing my good colleague Grady Booch (speaking of certain architecture frameworks), risk maps are the semblance of risk management. As Douglas Hubbard points out in The Failure of Risk Management (and as I ranted earlier in this blog), this sort of risk management is not only common but dangerous: it is a common business failure mode that leads to bad outcomes. Hubbard also points out that useful risk management entails quantification and calculation using probability distributions and Monte Carlo analysis. I would add that since risk management in the end is about business outcomes, risks need to be monetized as well as quantified. I am willing to bet a good bottle of wine that BP did no such thing. Any takers? In BP's case, the common failure mode was over-reliance on the preventers, even though several studies show they are far from ‘failsafe’.
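The contrast between the two styles can be made concrete with a toy calculation. Every number below is invented; this is not a claim about BP's actual probabilities or losses, only an illustration of what monetizing buys you.

```python
# Toy contrast (all numbers invented) between a score-map risk and a monetized one.
import random

# Score-map style: likelihood score x severity score gives a unitless number.
likelihood_score, severity_score = 1, 5          # "very low" x "very high"
risk_score = likelihood_score * severity_score   # = 5, but 5 of what?

# Monetized style: simulate the annual net of the cheaper, riskier design.
rng = random.Random(0)
p_blowout = 0.002                    # hypothetical annual blowout probability
loss_if_blowout = 40e9               # hypothetical catastrophic loss, $
savings_from_corner_cutting = 50e6   # hypothetical annual savings, $

n = 200_000
net = [savings_from_corner_cutting - (loss_if_blowout if rng.random() < p_blowout else 0)
       for _ in range(n)]
expected_net = sum(net) / n
print(f"risk score = {risk_score} (unitless)")
print(f"expected annual net of the cheap design ~= ${expected_net / 1e6:,.1f}M")
```

With these invented numbers the cheap design has a negative expected value: the $50M of annual savings is swamped by the expected loss. The score of 5 could never tell you that, because it has no units to compare against the savings.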
Further, it appears BP assumed risk by consistently taking the cheaper, if riskier, design and procedure alternatives, the ones with greater uncertainty in the outcome, even when the cost of an undesired, if unlikely, outcome was possibly catastrophic. The laundry list of such decisions is long; some are outlined in Congressman Waxman’s letter to Tony Hayward. The CEOs of Shell and Exxon testified before Congress that their companies would have used different, more costly designs and followed more rigorous procedures. According to congressional and journalistic reports, this behavior is BP standard operating procedure. So BP assumed risk by drilling wells but did not invest in reducing that risk.
For quite a while they got away with assuming but not really reducing risk, and appeared to be creating value as reflected in stock value and dividends to investors. BP management raised the stock price from around $40/share in 2003 to a peak of around $74/share prior to the Deepwater Horizon incident. At this writing the stock is trading at $32/share and the current dividend has been cancelled. Investors might rightly wonder if there is another latent disaster, and so discount the apparent future profitability by the likelihood of unknown liabilities. The total loss of stockholder value is over $100B, which is in the ballpark of BP's eventual liability. So, whatever approach BP used to manage risk, it failed.
BTW, some may recognize this same pattern in the management of financial firms that participated in the subprime mortgage market. In that case, they ‘mitigated risk’ by relying on the ratings agencies. Those who actually built monetized models of the risk realized there was a great opportunity to bet against the subprime mortgage market and made huge fortunes (see, e.g., The Big Short: Inside the Doomsday Machine by Michael Lewis).
Readers of the blog will notice a recurrent theme in some of the postings. It is essential that we assume and manage risk. To repeat a favorite quote, “One cannot manage what one does not measure.” The risk-map and scoring methods, while common, are insufficient to the needs of our industry; they do not measure, nor really manage, risk. We as a discipline need to step up to quantifying, monetizing, and working off risk in order to succeed as drivers of innovation. We need to step up to the mathematical approach found in the texts of Hubbard and Savage (see this posting).
I came to this same realization probably a decade ago. I held off at first because I did not have a deep enough understanding of how to proceed, and I knew I would encounter great skepticism. I tested the waters in 2005 and posted my first paper on the subject in 2006. I indeed received a great deal of skepticism and resistance, but enough acceptance to go forward. I have learned some important lessons from all of that. In my next blog, I will share my experiences of bringing more mathematical thinking to risk management for SSD.