Quality busters: Not measuring the risks

Assess the impact of application failures on customers and business

Michael Russell (MikeRussell@VickiFox.com), Application Architect, Vicki Fox Productions, Inc.
Michael Russell has a bachelor's degree in physics and a master's degree in computer science. He was a logistics engineer, a technical services manager, and a certified IT architect at IBM for nearly 14 years. He is currently a Web application architect for a resort company in Orlando. He has experience in Windows, UNIX, and OS/400 environments. He uses Web technology for entertainment through his own company, Vicki Fox Productions (http://www.VickiFox.com).

Summary:  Members of the information industry often do not think about safety and risk. They consider these factors to be the domain of life-and-death environments like spacecraft software, nuclear power plant control systems, and medical equipment. But even business software can have safety concerns. An improperly processed financial transaction might cause long-lasting harm to a customer or to the business itself. In this article, I introduce failure modes and effects analysis (FMEA) and risk assessments as important tools for business software architects.


Date:  15 Dec 2004
Level:  Introductory


In this installment of the Quality busters series, I introduce some new quality tools: failure modes and effects analysis (FMEA) and risk assessment. The Quality busters series discusses aspects of operational and non-functional requirements, highlighting the fact that solutions involve making tradeoffs between often conflicting goals.

The customers are revolting

The SHEEP application had a scheduling service component that ensured that price changes, such as holiday promotions and sales, took effect when advertised. The sales managers would enter the sales and promotions planned for each year well in advance. Then, when the time for the advertising campaign arrived, the scheduling service would automatically update the prices.

Everything appeared to be working fine until the day the vice president of sales assembled the SHEEP team and reported some bad news. "We are receiving a lot of complaints lately that the cash register price and the advertised price in states in the Mountain Time Zone are not matching. We are getting calls from consumer advocates in that region asking if we are using deceptive advertising. One state is even looking into legal action."

The SHEEP team members said that they would look into it. As they investigated the problem, they could not find any artifacts in the log files or system monitoring that suggested a problem. The database showed that the Mountain Time updates were being read.

Only after interviewing the operations team did they discover that one operator was routinely killing the scheduling service process in order to perform data backups. The operator would do this after the program had read the update information but while it was still in the very long process of updating the products. When the process was killed, the product updates were rolled back by the database, but the update information had already been flagged as processed and the change committed to the database. Because of the way the process had been killed, it did not leave any error message in the log file.

As a result of this event, the SHEEP team learned several things:

  • Some process failures, especially those resulting from human intervention, do not leave a trace -- the process simply goes away.
  • A process failure can have a severe impact on customers (for example, overcharges, damaged credit ratings, and lost satisfaction and confidence) and on the company (for example, angry customers, investigations by consumer advocates and government authorities, and class action suits).
  • The SHEEP team did not know the costs associated with application failures.

Changing attitudes

Early in my career, I worked as a logistics engineer. As such, I had to assess the safety, reliability, and maintainability of my company's products. Part of the safety assessment was determining the risks associated with using the products, which were laser systems. A laser system can damage eyes if the beam is viewed directly. The heat exchanger can cause burns. The power supply can shock a person. The list goes on.

The range of risks associated with these electro-mechanical devices was easier to enumerate than the range for a complex business software application. But this experience helped me understand that every object -- whether a laser system or a business software package or even a pencil -- has risks associated with its use. The safe use of an object comes from understanding the risks and taking the necessary actions to minimize the probability of the risky event happening.

This is not an attitude that is taught in computer science courses or developed by working on narrowly focused application tasks. Many developers don't begin to think about risks associated with software until a problem occurs. But the software architect should think about application risks from the beginning of a project. Such a mindset is vital for creating high-quality software.


Determine failures

The first step in minimizing software risks is to determine the failures that might occur. It is impossible to determine all possible combinations of failure events in software; if it were possible, you could simply code for them. Instead, determine failures at the component, subsystem, service, or process level, not at the lower level of modules, classes, or functions.

A powerful and formal method for identifying and ranking failures is failure modes and effects analysis (FMEA). A related method is failure modes, effects, and criticality analysis (FMECA). Other techniques, such as root cause analysis and fault tree analysis, are helpful as well.

The FMEA methodology is a way to identify potential failures, to assess the risks of each, to rank the risks in terms of importance, and to identify corrective actions to address the most serious risks. Space and time do not permit me to offer a tutorial on FMEA in this short column. I encourage the interested reader to research the many published guidelines and standards for FMEAs and FMECAs, some of which are listed in Resources.

The basic FMEA procedure consists of the following steps:

  1. Assemble the project team. Include representatives from the software development team, the end-users, and the business sponsors.
  2. Establish the ground rules for the project. Indicate the level of detail, how the analysis will be performed, and how the results will be collected and published.
  3. Identify the software components to study. With software applications, it is often difficult to study the entire application, so study a subset. Usually the subset is easy to determine. For example, you might leave a logging component out of the analysis.
  4. Gather information about the components being studied. The architecture, design, and implementation of the components must be available. Because some members of the project team will not understand source code, it is important to keep the information in a form that all of the team can understand.
  5. Identify the functions, failures, effects, and causes. For each component studied, build a functionality list. Next, brainstorm about possible failures that can occur. It is important to remember that failures occur by different modes: spontaneous, externally induced, time-dependent, resource-dependent, and usage- (wear-out) induced. Then, for each failure, determine the effects of that failure on other components and users. Finally, for each failure, determine the likely causes of the failure.
  6. Evaluate the risk associated with each failure/effect combination. The risk of each failure is described by its severity (none to dangerous), its probability (unlikely to very likely), and its detectability (easily detected to impossible to predict or detect).
  7. Prioritize the risks. At this point, each failure now has a risk associated with it. The business community can help prioritize the risks and provide the resources to perform corrective action on the most critical risks.
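
The information gathered in step 5 can be kept in a simple structured worksheet. The following is a minimal sketch, not a standard FMEA format: the class names (FailureMode, Failure, ComponentAnalysis), the helper incomplete_failures, and the sample entries are all assumptions made for illustration. Only the five failure modes and the function/failure/effect/cause structure come from the procedure above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FailureMode(Enum):
    """The five failure modes named in step 5."""
    SPONTANEOUS = "spontaneous"
    EXTERNALLY_INDUCED = "externally induced"
    TIME_DEPENDENT = "time-dependent"
    RESOURCE_DEPENDENT = "resource-dependent"
    USAGE_INDUCED = "usage-induced (wear-out)"


@dataclass
class Failure:
    """One possible failure of a component, with its effects and causes."""
    description: str
    mode: FailureMode
    effects: List[str] = field(default_factory=list)
    causes: List[str] = field(default_factory=list)


@dataclass
class ComponentAnalysis:
    """Worksheet for one component under study (hypothetical layout)."""
    component: str
    functions: List[str] = field(default_factory=list)
    failures: List[Failure] = field(default_factory=list)

    def incomplete_failures(self) -> List[Failure]:
        """Return failures that still lack an effect or a cause."""
        return [f for f in self.failures if not f.effects or not f.causes]


# Illustrative entries only, loosely based on the scheduling service story.
scheduler = ComponentAnalysis(
    component="Scheduling service",
    functions=["Apply scheduled price changes when advertised"],
    failures=[
        Failure(
            description="Process killed while updating products",
            mode=FailureMode.EXTERNALLY_INDUCED,
            effects=["Register price does not match advertised price"],
            causes=["Operator stops the process to run backups"],
        ),
        Failure(
            description="Update run exceeds its time window",
            mode=FailureMode.TIME_DEPENDENT,
        ),
    ],
)

for failure in scheduler.incomplete_failures():
    print(f"Needs effects or causes: {failure.description}")
```

A record like this stays readable for team members who do not work with source code, and the completeness check shows which failures the team still has to discuss before moving on to step 6.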

Make an FMEA part of any new application or major change to an existing application. When you perform an FMEA, you reduce future problems and risks.


Assess risks

For some systems, assess risks in terms of physical harm. For example, problems with software running a spacecraft or a nuclear power plant might result in death. For most commercial business systems, however, the risks are generally assessed in terms of financial, social, or criminal harm. For example, a failure might corrupt financial information, which might result in a person writing a check with insufficient funds, finally resulting in criminal action.

For business applications, the risk assessment must consider the effects of errors both on external entities (customers, vendors, and distributors) and internal entities (the company itself). If an application is not accessible, what is the cost to the company as a result of lost sales? If a supply chain component is not working, what is the cost to the company for rescheduling shipment or excessive inventory?

Some of the potential costs of errors are not monetary. Some costs are non-quantifiable, such as the effect of errors on customer service and satisfaction. Also, quantifiable costs might not have a direct monetary equivalence, such as productivity lost when users perform other tasks (or do nothing) while waiting for the application to resume operations.

Not only are the costs associated with a risk important, but so is the probability that the risk will occur. Some failures are more likely than others. Part of the risk assessment is to estimate the probability that a risk will occur.

Once you know the severity (usually expressed as cost) and the probability of each risk, rank the risks in order of importance. You can calculate this ranking in several ways; the various FMEA procedures discuss these methods. In general, high-cost risks that are very likely to occur are of highest importance, whereas low-cost risks that are unlikely to happen are of lowest importance.

One approach to ranking risks is to establish a risk priority number (RPN). This is calculated as follows:

  • Rate the severity of each failure effect (1=low, 10=high).
  • Rate the likelihood of each failure (1=unlikely, 10=very likely).
  • Rate how hard the failure is to detect before it occurs (1=easy to detect, 10=unpredictable).
  • Calculate the RPN by multiplying together the three ratings.
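
The RPN calculation above can be sketched in a few lines. This is an illustrative example only: the FailureRating class, the rank_by_rpn function, and the three sample ratings are assumptions invented to show the arithmetic, not data from a real FMEA study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRating:
    """One failure/effect combination with its three ratings (1 to 10)."""
    name: str
    severity: int     # 1 = low, 10 = high
    likelihood: int   # 1 = unlikely, 10 = very likely
    detection: int    # 1 = easy to detect, 10 = unpredictable

    def __post_init__(self) -> None:
        # Reject ratings outside the 1-10 scale used in the list above.
        for field_name in ("severity", "likelihood", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(
                    f"{field_name} must be between 1 and 10, got {value}"
                )

    @property
    def rpn(self) -> int:
        """Risk priority number: the product of the three ratings."""
        return self.severity * self.likelihood * self.detection


def rank_by_rpn(ratings):
    """Return the ratings sorted from highest RPN to lowest."""
    return sorted(ratings, key=lambda r: r.rpn, reverse=True)


# Hypothetical ratings, chosen only to illustrate the calculation.
ratings = [
    FailureRating("Log disk fills up", severity=3, likelihood=5, detection=1),
    FailureRating("Scheduler killed mid-update", severity=8, likelihood=6, detection=9),
    FailureRating("Price file arrives late", severity=5, likelihood=4, detection=2),
]

for rating in rank_by_rpn(ratings):
    print(f"{rating.rpn:4d}  {rating.name}")
```

With these sample numbers, the killed scheduler ranks first with an RPN of 432 (8 x 6 x 9), largely because it is both severe and hard to detect. Because the RPN is a product of ordinal ratings, it is best read as a ranking aid rather than as a measurement.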

Apply changes

Once you know the risks and rate their importance, it is time for the project team to act. Where possible, make changes to the application or the operational environment to address the most important risks.

In some cases, you can minimize the risk through software change. In other cases, minimize the risk through procedural changes that affect how the software is used. A combination of approaches might be necessary.
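
As an example of a software change, consider the scheduling service failure described earlier: the update information was flagged as processed and committed before the product updates finished. One possible fix is to put the flag and the product update in the same database transaction, so that an interruption discards both. The sketch below uses Python's sqlite3 module; the table names, column names, and the apply_price_change function are hypothetical, and the interruption is simulated with an exception rather than a killed process.

```python
import sqlite3


def apply_price_change(conn, change_id, fail_midway=False):
    """Apply one scheduled price change and flag it as processed atomically.

    Returns True if the change was applied, False if it was skipped or
    interrupted. The fail_midway flag exists only to simulate a failure.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            row = conn.execute(
                "SELECT product_id, new_price FROM price_changes "
                "WHERE id = ? AND processed = 0",
                (change_id,),
            ).fetchone()
            if row is None:
                return False  # already processed or unknown
            product_id, new_price = row
            conn.execute(
                "UPDATE price_changes SET processed = 1 WHERE id = ?",
                (change_id,),
            )
            if fail_midway:
                raise RuntimeError("simulated interruption")
            conn.execute(
                "UPDATE products SET price = ? WHERE id = ?",
                (new_price, product_id),
            )
        return True
    except RuntimeError:
        return False


# Hypothetical schema and data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE price_changes (
        id INTEGER PRIMARY KEY,
        product_id INTEGER,
        new_price REAL,
        processed INTEGER DEFAULT 0
    );
    INSERT INTO products VALUES (1, 10.0);
    INSERT INTO price_changes VALUES (1, 1, 8.0, 0);
""")

apply_price_change(conn, 1, fail_midway=True)  # interrupted: flag is rolled back
apply_price_change(conn, 1)                    # rerun finds the change again
print(conn.execute("SELECT price FROM products WHERE id = 1").fetchone()[0])
```

Because the flag and the price are committed together, the interrupted run leaves the change unprocessed and the next run picks it up. A procedural change, such as giving operators a supported way to pause the service before backups, could complement the code change.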

Just remember the old proverb: When the number one problem is solved, the number two problem becomes the number one problem. You might not address all of the risks identified by an FMEA study. Achieve a balance between what you can solve, what you can afford, and how important it is to mitigate the risk. Obviously, you must solve any problem that might cause physical, financial, or criminal harm. But you can tolerate a failure that is simply an inconvenience and handle it in the future. Businesses, users, and developers working together as a team should be able to determine which risks are important enough to expend resources to address.


Considerations

When you create an application, it is important to keep the risks in mind. As an architect, these are some of the questions you want to ask:

  • Are the ways in which the application can fail understood?
  • If a component fails, are the effects of that failure understood?
  • Can a component failure cause harm, whether physical, financial, social, or criminal?
  • Is the frequency of failure understood?
  • Are the failure modes known?
  • Are the costs to users, customers, partner businesses, and the company itself for various failures and operational downtime known?
  • Does the department know about and apply failure analysis methodologies that can contribute to higher quality software?

In summary

The FMEA methodologies and corresponding risk assessments can contribute to improved software architectures and processes, resulting in higher reliability, increased safety, better customer satisfaction, and reduced costs. You can formally evaluate risks during the design phase so you can address them early in the application lifecycle. You can even use the results of an FMEA study to assist operations with troubleshooting activities and as a training tool for new programmers.

Applying FMEA as part of the software development lifecycle is often expected under quality standards such as ISO 9001, QS 9000, and ISO/TS 16949, and it is a common Six Sigma practice.


Resources
