In this installment of the Quality busters series, I introduce some new quality tools: failure mode and effects analysis, and risk assessment. The Quality busters series discusses aspects of operational and non-functional requirements, highlighting the fact that solutions involve making tradeoffs between often conflicting goals.
The SHEEP application had a scheduling service component that ensured that price changes, such as holiday promotions and sales, took effect when advertised. The sales managers would enter the sales and promotions planned for each year well in advance. Then, when the time for the advertising campaign arrived, the scheduling service would automatically update the prices.
Everything appeared to be working fine until the day the vice president of sales assembled the SHEEP team and reported some bad news. "We are receiving a lot of complaints lately that the cash register price and the advertised price in states in the Mountain Time Zone are not matching. We are getting calls from consumer advocates in that region asking if we are using deceptive advertising. One state is even looking into legal action."
The SHEEP team members said that they would look into it. As they investigated the problem, they could not find any artifacts in the log files or system monitoring that suggested a problem. The database showed that the Mountain Time updates were being read.
Only after interviewing the operations team did they discover that one operator was routinely killing the scheduling service process in order to perform data backups. The operator would do this after the program had read the update information but while it was still in the very long process of updating the products. When the process was killed, the product updates were rolled back by the database, but the update information had already been flagged as processed and the change committed to the database. Because of the way the process had been killed, it did not leave any error message in the log file.
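The root of this failure was that the "processed" flag was committed separately from the product updates it described. A minimal sketch of the alternative, in which the flag and the updates share one transaction, might look like the following. The table names, column names, and the `apply_price_change` function are hypothetical and are not taken from the SHEEP application; SQLite stands in for whatever database the application used. A killed process would not raise an exception, so the sketch simulates the interruption with one; the point is only that uncommitted work, including the flag, is rolled back together.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL);
        CREATE TABLE price_changes (id INTEGER PRIMARY KEY, processed INTEGER DEFAULT 0);
        CREATE TABLE price_change_items (change_id INTEGER, product_id INTEGER, new_price REAL);
        INSERT INTO products VALUES (1, 10.0), (2, 20.0);
        INSERT INTO price_changes (id) VALUES (100);
        INSERT INTO price_change_items VALUES (100, 1, 8.0), (100, 2, 16.0);
        """
    )
    return conn


def apply_price_change(conn, change_id, fail_midway=False):
    """Apply one scheduled price change and flag it as processed, atomically."""
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        items = conn.execute(
            "SELECT product_id, new_price FROM price_change_items "
            "WHERE change_id = ? ORDER BY product_id",
            (change_id,),
        ).fetchall()
        for index, (product_id, new_price) in enumerate(items):
            if fail_midway and index == 1:
                # Stand-in for the process being interrupted partway through.
                raise RuntimeError("simulated interruption")
            conn.execute(
                "UPDATE products SET price = ? WHERE id = ?",
                (new_price, product_id),
            )
        # The flag is set last, inside the same transaction as the updates,
        # so it can never be committed without them.
        conn.execute(
            "UPDATE price_changes SET processed = 1 WHERE id = ?",
            (change_id,),
        )
```

With this arrangement, an interrupted run leaves the change unflagged, so the next run of the scheduling service picks it up again instead of silently skipping it.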
As a result of this event, the SHEEP team learned several things:
- Some process failures, especially those resulting from human intervention, do not leave a trace -- the process simply goes away.
- A process failure can have a severe impact on customers (for example, overcharging, affecting credit rating, satisfaction, and confidence) and the company (such as anger from customers, investigations from consumer advocates and government authorities, class action suits).
- The SHEEP team did not know the costs associated with application failures.
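One way to make a vanished process leave a trace is to record the start of a run before doing any work and to clear that record only on normal completion. The sketch below illustrates the idea with a marker file; the function names, the file name, and the choice of a JSON file are all assumptions made for illustration, not a description of what the SHEEP team built.

```python
import json
import os
import tempfile


def start_run(marker_path, job_name):
    """Record that a job has started, before it does any work."""
    with open(marker_path, "w", encoding="utf-8") as f:
        json.dump({"job": job_name, "pid": os.getpid()}, f)


def finish_run(marker_path):
    """Remove the marker only after the job has completed normally."""
    os.remove(marker_path)


def find_interrupted_run(marker_path):
    """Return the leftover marker from a run that never finished, or None."""
    if not os.path.exists(marker_path):
        return None
    with open(marker_path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "price_update.running")  # hypothetical name
        start_run(marker, "mountain-time price update")
        # If the process is killed here, finish_run() is never called,
        # and the next startup finds the marker still in place.
        print(find_interrupted_run(marker))
```

A check of `find_interrupted_run` at startup, or by a monitoring job, turns a silent disappearance into something operations can see and investigate.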
Early in my career, I worked as a logistics engineer. As such, I had to assess the safety, reliability, and maintainability of my company's products. Part of the safety assessment was determining the risks associated with using the products, which were laser systems. A laser system can damage eyes if the beam is viewed directly. The heat exchanger can cause burns. The power supply can shock a person. The list goes on.
The risks associated with these electro-mechanical devices were easier to enumerate than those for a complex business software application. But this experience helped me understand that every object -- whether a laser system or a business software package or even a pencil -- has risks associated with its use. The safe use of an object comes from understanding the risks and taking the necessary actions to minimize the probability of the risky event happening.
This is not an attitude that is taught in computer science courses or developed by working on narrowly focused application tasks. Many developers don't begin to think about risks associated with software until a problem occurs. But the software architect should think about application risks from the beginning of a project. Such a mindset is vital for creating high-quality software.
The first step in minimizing software risks is to determine the failures that might occur. It is impossible to determine all possible combinations of failure events in software; if you could, you would simply code for them. Instead, determine failures at the component, subsystem, service, or process level, not at the lower level of modules, classes, or functions.
A powerful, formal method for identifying and ranking failures is failure mode and effects analysis (FMEA). A related method is failure modes, effects, and criticality analysis (FMECA). Other techniques, such as root cause analysis and fault tree analysis, are helpful as well.
The FMEA methodology is a way to identify potential failures, to assess the risks of each, to rank the risks in terms of importance, and to identify corrective actions to address the most serious risks. Space and time do not permit me to offer a tutorial on FMEA in this short column. I encourage the interested reader to research the many published guidelines and standards for FMEAs and FMECAs, some of which are listed in Resources.
The basic FMEA procedure consists of the following steps:
- Assemble the project team. Include representatives from the software development team, the end-users, and the business sponsors.
- Establish the ground rules for the project. Indicate the level of detail, how the analysis will be performed, and how the results will be collected and published.
- Identify the software components to study. With software applications, it is often difficult to study the entire application, so study a subset. Which components to leave out is usually easy to determine. For example, you might leave a logging component out of the analysis.
- Gather information about the components being studied. The architecture, design, and implementation of the components must be available. Because some members of the project team will not understand source code, it is important to keep the information in a form that all of the team can understand.
- Identify the functions, failures, effects, and causes. For each component studied, build a functionality list. Next, brainstorm about possible failures that can occur. It is important to remember that failures occur by different modes: spontaneous, externally induced, time-dependent, resource-dependent, and usage- (wear-out) induced. Then, for each failure, determine the effects of that failure on other components and users. Finally, for each failure, determine the likely causes of the failure.
- Evaluate the risk associated with each failure/effect combination. The risk of each failure is described by its severity (none to dangerous), its probability (unlikely to very likely), and its detectability (easily detected to impossible to predict or detect).
- Prioritize the risks. At this point, each failure now has a risk associated with it. The business community can help prioritize the risks and provide the resources to perform corrective action on the most critical risks.
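To keep the results in a form the whole team can read, the output of the identification step can be captured in a simple worksheet. The following sketch shows one possible shape for such a worksheet. The class and field names are hypothetical, the five failure modes are the ones listed above, and the sample entry is drawn from the SHEEP incident described earlier.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureMode(Enum):
    """The five modes by which failures occur, as listed in the procedure."""
    SPONTANEOUS = "spontaneous"
    EXTERNALLY_INDUCED = "externally induced"
    TIME_DEPENDENT = "time-dependent"
    RESOURCE_DEPENDENT = "resource-dependent"
    USAGE_INDUCED = "usage-induced (wear-out)"


@dataclass
class FailureEntry:
    """One row of a hypothetical FMEA worksheet."""
    component: str
    function: str
    failure: str
    mode: FailureMode
    effects: list = field(default_factory=list)
    causes: list = field(default_factory=list)


def worksheet_by_component(entries):
    """Group worksheet rows by the component they describe."""
    sheet = {}
    for entry in entries:
        sheet.setdefault(entry.component, []).append(entry)
    return sheet


entries = [
    FailureEntry(
        component="Scheduling service",
        function="Apply scheduled price changes when advertised",
        failure="Process killed while updating products",
        mode=FailureMode.EXTERNALLY_INDUCED,
        effects=["Register price does not match advertised price"],
        causes=["Operator stops the process to run data backups"],
    ),
]

for component, rows in worksheet_by_component(entries).items():
    for row in rows:
        print(f"{component}: {row.failure} ({row.mode.value})")
```

A spreadsheet serves the same purpose; what matters is that each failure is recorded with its mode, effects, and likely causes before the risk ratings are assigned.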
Make an FMEA part of any new application or major change to an existing application. When you perform an FMEA, you minimize future problems and risks.
For some systems, assess risks in terms of physical harm. For example, problems with software running a spacecraft or a nuclear power plant might result in death. For most commercial business systems, however, the risks are generally assessed in terms of financial, social, or criminal harm. For example, a failure might corrupt financial information, which might result in a person writing a check with insufficient funds, finally resulting in criminal action.
For business applications, the risk assessment must consider the effects of errors both on external entities (customers, vendors, and distributors) and internal entities (the company itself). If an application is not accessible, what is the cost to the company as a result of lost sales? If a supply chain component is not working, what is the cost to the company for rescheduling shipment or excessive inventory?
Some of the potential costs of errors are not monetary. Some costs are non-quantifiable, such as the effect of errors on customer service and satisfaction. Also, quantifiable costs might not have a direct monetary equivalence, such as productivity lost when users perform other tasks (or do nothing) while waiting for the application to resume operations.
Not only are the costs associated with a risk important, but so is the probability that the risk will occur. Some failures are more likely than others. Part of the risk assessment is to estimate the probability that a risk will occur.
Once you know the severity (usually expressed as cost) and the probability of each risk, rank the risks in order of importance. You can calculate this ranking in several ways; the various FMEA procedures discuss these methods. In general, high-cost risks that are very likely to occur are of highest importance, whereas low-cost risks that are unlikely to happen are of lowest importance.
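As one illustration of such a ranking, the sketch below orders risks by expected cost, that is, the estimated cost multiplied by the estimated probability. This is only one of the possible calculations, and the risk names, costs, and probabilities are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    cost: float         # estimated cost if the failure occurs (hypothetical figure)
    probability: float  # estimated chance of occurrence, from 0.0 to 1.0

    @property
    def expected_cost(self):
        """Cost weighted by how likely the failure is."""
        return self.cost * self.probability


def rank_risks(risks):
    """Return the risks ordered from highest to lowest expected cost."""
    return sorted(risks, key=lambda r: r.expected_cost, reverse=True)


risks = [
    Risk("Report formatting glitch", cost=500, probability=0.10),
    Risk("Scheduled price change silently lost", cost=100_000, probability=0.50),
    Risk("Data center outage", cost=2_000_000, probability=0.01),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.expected_cost:,.0f}")
```

In this made-up data, the likely mid-cost failure outranks the rare high-cost one. A purely multiplicative ranking can therefore understate rare but catastrophic risks, which is one reason the published FMEA procedures offer more than one ranking method.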
One approach to ranking risks is to establish a risk priority number (RPN). This is calculated as follows:
- Rate the severity of each failure effect (1=low, 10=high).
- Rate the likelihood of each failure (1=unlikely, 10=very likely).
- Rate the likelihood of prior detection of the failure (1=easy to detect, 10=unpredictable).
- Calculate the RPN by multiplying together the three ratings.
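The calculation above is small enough to sketch directly. In the following example the function name is an assumption, and the ratings given to the two failures are invented for illustration rather than taken from an actual study.

```python
def risk_priority_number(severity, likelihood, detection):
    """Multiply the three 1-to-10 ratings to get the RPN (1 to 1000)."""
    ratings = (
        ("severity", severity),
        ("likelihood", likelihood),
        ("detection", detection),
    )
    for name, rating in ratings:
        if not 1 <= rating <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {rating}")
    return severity * likelihood * detection


# Hypothetical ratings: a severe, fairly likely failure that is hard to detect...
print(risk_priority_number(severity=9, likelihood=6, detection=9))  # 486
# ...compared with a minor, unlikely failure that is easy to detect.
print(risk_priority_number(severity=2, likelihood=3, detection=2))  # 12
```

Because the detection scale runs from 1 (easy to detect) to 10 (unpredictable), a failure that leaves no trace raises the RPN even when its severity and likelihood are unchanged.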
Once you know the risks and rate their importance, it is time for the project team to act. Where possible, make changes to the application or the operational environment to address the most important risks.
In some cases, you can minimize the risk through software change. In other cases, minimize the risk through procedural changes that affect how the software is used. A combination of approaches might be necessary.
Just remember the old proverb: When the number one problem is solved, the number two problem becomes the number one problem. You might not address all of the risks identified by an FMEA study. Achieve a balance between what you can solve, what you can afford, and how important it is to mitigate the risk. Obviously, you must solve any problem that might cause physical, financial, or criminal harm. But you can tolerate a failure that is simply an inconvenience and handle it in the future. Businesses, users, and developers working together as a team should be able to determine which risks are important enough to expend resources to address.
When you create an application, it is important to keep the risks in mind. As the architect, you want to ask questions such as these:
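The balancing act described above can be sketched as a simple triage: risks that could cause harm are always scheduled, and the remaining budget goes to the other risks in priority order. The function name, the dictionary keys, and all of the figures below are hypothetical. A real decision would come from the team's judgment, not a formula.

```python
def plan_mitigations(risks, budget):
    """Split risks into those to fix now and those to defer.

    Each risk is a dict with the keys "name", "rpn", "fix_cost", and
    "causes_harm" (all hypothetical). Returns (to_fix, deferred, remaining).
    """
    to_fix, deferred = [], []
    remaining = budget
    # Harmful risks come first; within each group, higher RPN comes first.
    ordered = sorted(risks, key=lambda r: (not r["causes_harm"], -r["rpn"]))
    for risk in ordered:
        if risk["causes_harm"]:
            # Not optional: scheduled even if it exhausts the budget.
            to_fix.append(risk["name"])
            remaining -= risk["fix_cost"]
        elif risk["fix_cost"] <= remaining:
            to_fix.append(risk["name"])
            remaining -= risk["fix_cost"]
        else:
            deferred.append(risk["name"])
    return to_fix, deferred, remaining


risks = [
    {"name": "Price change lost", "rpn": 486, "fix_cost": 40, "causes_harm": True},
    {"name": "Slow nightly report", "rpn": 120, "fix_cost": 30, "causes_harm": False},
    {"name": "Stale cache after deploy", "rpn": 200, "fix_cost": 50, "causes_harm": False},
]

print(plan_mitigations(risks, budget=100))
```

A negative remaining budget in the result signals that the mandatory fixes alone cost more than was allotted, which is a conversation to have with the business sponsors.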
- Are the ways in which the application can fail understood?
- If a component fails, are the effects of that failure understood?
- Can a component failure cause harm, whether physical, financial, social, or criminal?
- Is the frequency of failure understood?
- Are the failure modes known?
- Are the costs to users, customers, partner businesses, and the company itself for various failures and operational downtime known?
- Does the department know about and apply failure analysis methodologies that can contribute to higher quality software?
The FMEA methodologies and corresponding risk assessments can contribute to improved software architectures and processes, resulting in higher reliability, increased safety, better customer satisfaction, and reduced costs. You can formally evaluate risks during the design phase so you can address them early in the application lifecycle. You can even use the results of an FMEA study to assist operations with troubleshooting activities and as a training tool for new programmers.
Applying FMEA as part of the software development lifecycle is often a quality requirement under standards such as ISO 9001, QS 9000, and ISO/TS 16949, and under Six Sigma practices.
- Read the author's other articles in the Quality busters series on developerWorks.
- Visit the Forum on Risks to the Public in Computers and Related Systems (a moderated USENET equivalent is comp.risks). This project of the Association for Computing Machinery's Committee on Computer and Public Policy lists various public failures (from simple embarrassments to deadly mistakes) of computer systems. Find a collection of these at the moderator's Web site.
- Try these helpful Web resources about FMEA:
- Explore this partial list of software vendors who sell FMEA tools. The author does not endorse or recommend any products and lists these only to illustrate that tools are available for FMEA work.
- In Failure Mode and Effect Analysis: FMEA from Theory to Execution by D.H. Stamatis, get a complete guide to FMEA from concepts to tools and methods (American Society for Quality, 1995).
- The United States military uses FMEA techniques extensively and has a standardized approach to analysis work. This can be found in MIL-STD-1629A, "Procedures for performing a failure mode, effects, and criticality analysis." Download PDF versions of this and other military standards.
- Read Software Reliability: Measurement, Prediction, Application by John Musa et al. (McGraw-Hill, 1987). This classic book covers principles of software reliability that haven't changed.
- Visit these valuable resources on the IBM developerWorks site:
- The developerWorks Web Architecture zone specializes in articles covering various Web-based solutions.

Michael Russell has a bachelor's degree in physics and a master's degree in computer science. He was a logistics engineer, a technical services manager, and a certified IT architect at IBM for nearly 14 years. He is currently a Web application architect for a resort company in Orlando. He has experience in Windows, UNIX, and OS/400 environments. He uses Web technology for entertainment through his own company, Vicki Fox Productions (http://www.VickiFox.com).