In manufacturing facilities, understanding the potential causes of system failures is crucial to preventing them. Fault tree analysis (FTA) offers one approach to root cause analysis, identifying and analyzing the root of asset issues before equipment breaks down.
Fault tree analysis is a deductive, top-down approach to determining the cause of a specific undesired event within a complex system. It involves breaking down the root cause of a failure into its contributing factors and representing it through a graphical model called a fault tree, which helps managers and engineers identify potential failure modes—and the probability of each failure mode—for safety and reliability analyses.
First developed in the early 1960s by Bell Laboratories to help the U.S. Air Force understand potential flaws in the Minuteman missile system, FTA has since been widely used across industries including aerospace, nuclear power, chemicals and automotive.
Maintenance managers might utilize fault tree analysis to:
As manufacturing environments grow more complex, effective risk management tools like FTA become increasingly important. Incorporating fault tree analysis into your safety analyses and reliability engineering practices can help your organization gain deeper insight into potential causes of system failure, improving overall performance and reducing the likelihood of costly and potentially catastrophic incidents.
Performing a fault tree analysis is a complex process that involves seven key steps.
Before running your analysis, you should clearly define the undesired event you want to analyze. This event should be specific and measurable, like a component failure or a system malfunction. It’s also important to define the event in clear, consistent terms, since it will serve as the starting point for your fault tree diagram.
Once you define the undesired event, you should start to identify the factors and events that could contribute to its occurrence. Contributing factors tend to fall into two broad categories: basic events and intermediate events.
Basic events, which cannot be broken down into simpler events, are the most fundamental events in a fault tree and represent the lowest level of events you can analyze. A basic event in a fault tree for a car accident, for example, might be "tire blows out."
Intermediate events are located between the lower-level basic events and the top event (the primary undesired event being analyzed). Intermediate events are caused by other events in the fault tree and, in turn, cause other events. They represent higher-level events that can be analyzed further. Using the same car accident as an example, an intermediate event in the fault tree might be "driver loses control of the vehicle," which a tire blowout could cause and which could in turn cause the accident.
Be sure to consider both internal and external events, like component failures, human error and environmental conditions. At this stage of the analysis, you may need to consult subject matter experts and review historical data, incident reports and maintenance records.
Using standard gate symbols and event symbols, construct a graphical representation of the relationships between the undesired (or output) event and its contributing factors (also called input events). The fault tree should be organized hierarchically, with the undesired event at the top and the contributing factors branching out below it.
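The hierarchy described above can be sketched in code before it is drawn. The following is a minimal illustration, not a standard FTA data format: the `Node` class, the `render` helper and the coolant pump scenario are all hypothetical names and events chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One event in the tree. `gate` stays None for a basic event."""
    name: str
    gate: Optional[str] = None              # "AND" or "OR" when the event has inputs
    inputs: List["Node"] = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as indented text lines, top event first."""
    label = node.name if node.gate is None else f"{node.name} [{node.gate}]"
    lines = ["  " * depth + label]
    for child in node.inputs:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical example: the undesired (top) event sits at the root,
# intermediate events carry a gate, and basic events are the leaves.
tree = Node("Pump fails to deliver coolant", "OR", [
    Node("Motor stops", "OR", [
        Node("Power supply lost"),
        Node("Motor winding burns out"),
    ]),
    Node("Both inlet valves stuck closed", "AND", [
        Node("Valve A stuck closed"),
        Node("Valve B stuck closed"),
    ]),
])

print("\n".join(render(tree)))
```

Printing the tree as indented text is only a stand-in for the graphical diagram, but it makes the top-down structure easy to review with subject matter experts.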
Laying out basic events is fairly straightforward, since basic events have no contributing events beneath them. Including intermediate events, however, is a bit more complex, as intermediate events require Boolean logic gates that indicate the relationships between top-level, intermediate and basic events.
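There are two main types of logic gates used in fault trees: AND gates and OR gates.
AND gates: Use an AND gate when all of the input events must occur for the output event to happen. If, for instance, a system fails only when both a primary component and its backup fail, an AND gate would be used to connect the events.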
OR gates: Use an OR gate when any one of the input events is sufficient to cause the output event. In other words, the output event will happen if at least one of the input events connected to the OR gate happens. If, for instance, a system failure could result from either a component failure or an operator error, an OR gate would be used to connect the events.
Though less commonly used, NOT gates, XOR gates, K/N gates and INHIBIT gates can also help identify specific relationships between input and output events.
NOT gates: NOT gates represent the inverse of an input event. If the input event does not occur, the output event will occur. These gates are less common in fault tree analysis, since they model the absence of an event or the occurrence of a complementary event.
XOR gates (Exclusive OR gates): Use an XOR gate when exactly one of the input events must occur for the output event to happen. If none or more than one of the input events occur, the output event will not happen.
K/N gates: K/N gates, also known as voting gates or threshold gates, are used when a specific number of the input events (K) out of all the possible input events (N) must occur for the output event to happen. K/N gates can help you illustrate more complex relationships in a fault tree analysis.
INHIBIT gates: Like an AND gate, an INHIBIT gate indicates that an output event will occur only if the input event and a conditional event (a condition or restriction that can apply to any gate) both occur.
Fault trees can also include undeveloped events, which are events that are not broken down further because they aren’t fully understood or haven’t been fully analyzed.
Using the various available gates will help you create a comprehensive fault tree that captures the complex interactions between the various events and factors that precipitated the undesired event.
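To make the gate definitions concrete, each gate can be read as a small Boolean function of its inputs. The sketch below is illustrative only; the function names are hypothetical, and the K/N gate is assumed to mean "at least K of the N inputs occur," which is a common reading of a voting gate.

```python
def and_gate(inputs):
    """Output occurs only if every input event occurs."""
    return all(inputs)


def or_gate(inputs):
    """Output occurs if at least one input event occurs."""
    return any(inputs)


def not_gate(input_event):
    """Output occurs only if the input event does not occur."""
    return not input_event


def xor_gate(inputs):
    """Output occurs if exactly one input event occurs."""
    return sum(bool(x) for x in inputs) == 1


def k_of_n_gate(k, inputs):
    """Output occurs if at least k of the inputs occur (assumed threshold reading)."""
    return sum(bool(x) for x in inputs) >= k


def inhibit_gate(input_event, condition):
    """Output occurs if the input event occurs while the condition holds."""
    return bool(input_event) and bool(condition)


# Hypothetical usage: two of three redundant sensors have failed.
sensor_failures = [True, False, True]
print(k_of_n_gate(2, sensor_failures))   # True: the 2-of-3 threshold is met
print(and_gate(sensor_failures))         # False: not every sensor failed
```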
It's important to remember that building a fault tree is an iterative process, so you will continue to break down contributing events into their basic sub-events until the events cannot be parsed out any further. As you get new information and/or system conditions change, you may need to make several adjustments to refine the fault tree.
In order to quantify the risks associated with the undesired event, you will need to gather failure data (from historical records, industry databases, expert opinions, etc.) for the basic events in the fault tree. The failure data should be expressed as failure probabilities or failure rates, depending on the type of analysis you’re conducting.
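When the data arrives as a failure rate rather than a probability, one common way to convert it is to assume a constant failure rate (an exponential model), which gives a failure probability of 1 - exp(-rate × time) over a given exposure time. That modeling assumption is not stated in the text above, and the function name and numbers below are hypothetical.

```python
import math


def failure_probability(rate_per_hour, hours):
    """Probability of at least one failure within `hours`,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-rate_per_hour * hours)


# Hypothetical component: 1 failure per 10,000 operating hours,
# evaluated over a 1,000-hour mission.
p = failure_probability(1e-4, 1000)
print(round(p, 4))  # 0.0952
```

If a component wears out or has a high early failure rate, a constant rate is a poor fit and a different distribution would be needed.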
Once you construct the fault tree and gather the failure data, you will perform the analysis to determine how likely the undesired event is and to identify the most critical contributing factors. You can use either a qualitative or a quantitative analysis method.
A qualitative analysis focuses on understanding the structure of the fault tree, the relationships between events, and the identification of critical paths and minimal cut sets (the smallest combinations of basic events that, occurring together, are sufficient to cause the undesired event). Qualitative analysis can help prioritize remedial actions and identify areas for further investigation.
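For a tree built only from AND and OR gates, minimal cut sets can be found by expanding the gates from the top down and then discarding any set that contains a smaller one. The sketch below assumes a hypothetical representation in which a basic event is a string and any other event is a `(gate, inputs)` tuple; dedicated FTA tools use more efficient algorithms.

```python
from itertools import product


def cut_sets(node):
    """Return every cut set (a frozenset of basic events) that causes `node`."""
    if isinstance(node, str):                      # basic event
        return [frozenset([node])]
    gate, inputs = node
    child_sets = [cut_sets(child) for child in inputs]
    if gate == "OR":
        # A cut set of any single input is enough on its own.
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":
        # Need one cut set from every input, merged together.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unsupported gate: {gate}")


def minimal_cut_sets(node):
    """Keep only cut sets that do not strictly contain another cut set."""
    sets = set(cut_sets(node))
    minimal = [cs for cs in sets if not any(other < cs for other in sets)]
    return sorted(minimal, key=lambda cs: (len(cs), sorted(cs)))


# Hypothetical tree: the top event occurs if power is lost, if both valves
# stick, or if power is lost while valve A is stuck.
top = ("OR", [
    "power lost",
    ("AND", ["valve A stuck", "valve B stuck"]),
    ("AND", ["power lost", "valve A stuck"]),
])

for cs in minimal_cut_sets(top):
    print(sorted(cs))
```

In this example the third branch drops out because "power lost" alone is already sufficient. A single-event cut set like that one marks a single point of failure, which is often the first place to look for remedial action.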
A quantitative methodology, on the other hand, involves calculating the probability of the undesired event occurring based on the failure probabilities of the basic events. Quantitative analysis can help inform risk management decisions and evaluate the effectiveness of proposed improvements.
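As a rough illustration of the quantitative calculation, basic event probabilities can be combined gate by gate: an AND gate multiplies its input probabilities, and an OR gate takes one minus the product of the complements. This sketch assumes the basic events are statistically independent and that no basic event appears more than once in the tree; the tuple representation, event names and probabilities are hypothetical.

```python
import math


def top_event_probability(node, basic_probs):
    """Probability of `node`, assuming independent, non-repeated basic events."""
    if isinstance(node, str):                      # basic event
        return basic_probs[node]
    gate, inputs = node
    probs = [top_event_probability(child, basic_probs) for child in inputs]
    if gate == "AND":
        return math.prod(probs)                    # all inputs must occur
    if gate == "OR":
        return 1.0 - math.prod(1.0 - p for p in probs)  # at least one input occurs
    raise ValueError(f"unsupported gate: {gate}")


# Hypothetical tree and failure probabilities.
top = ("OR", [
    "power lost",
    ("AND", ["valve A stuck", "valve B stuck"]),
])
basic_probs = {"power lost": 0.01, "valve A stuck": 0.1, "valve B stuck": 0.1}

print(round(top_event_probability(top, basic_probs), 4))  # 0.0199
```

When the same basic event feeds several branches, this gate-by-gate shortcut can misstate the result, and the probability is usually computed from the minimal cut sets instead.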
After performing the analysis, it’s time to interpret your results and communicate any relevant information to the necessary stakeholders.
It is important to remember that the results of a fault tree analysis are dependent on the quality of the input data and the assumptions made during the analysis. As such, you should view the results as a starting point for further investigation and validation, rather than a definitive conclusion.
Based on the findings of the fault tree analysis, you will implement preventative measures and/or improvements to eliminate or decrease the likelihood of an undesired event. Be sure to monitor the performance of these improvements and continually update the fault tree to reflect any changes in system design, operating conditions or component performance, so that your tree remains accurate—and therefore useful—to your organization.
FTA provides a visual depiction of contributing factors and events that can lead to a system failure, making it easier to understand complex interactions between system components.
FTA allows you to calculate the probability of a failure event occurring, enabling better risk management and decision-making and helping teams be proactive about corrective actions.
Since you can only analyze one output event at a time, fault tree analysis helps teams stay organized as they assess system levels and work through effects analyses methodically.
Unlike some other failure analysis methods, such as failure mode and effects analysis (FMEA), FTA accounts for human error, which can help teams understand whether issues are related to deviations from standard operating procedure.
FTA identifies which failures are likeliest to occur, helping teams decide which issues require urgent attention.