Businesses rely every day on various systems and pieces of equipment to keep their operations running smoothly. But all systems inevitably require upkeep. It could be intangible software, like an IT service network that has accumulated enough bugs to break an important feature, sending developers scrambling for a fix. Or it could be a piece of physical equipment, like an ice cream machine in a fast food restaurant with a broken o-ring.
Eventually, everything breaks down, from multi-site IT systems down to individual light bulbs. Unplanned downtime can have catastrophic consequences, and it’s up to facility maintenance engineers and technicians to plan ahead so that failures can be rectified swiftly. The goal is to minimize downtime and reduce the costs associated with lost productivity, lost revenue and customer dissatisfaction.
Downtime can be minimized in many ways. For example, businesses can aim to reduce the amount of time it takes to repair a piece of equipment by having sufficient replacement parts accessible to technicians on-site. Or, they can observe repair processes to find faster ways to perform repairs or quicker ways to notify technicians. Even further, they can make investments in better-performing tools with longer lifespans to reduce the number of repairs needed.
But to improve the reliability of systems and components, we first must be able to measure it. Mean time to repair (MTTR), also known as mean time to recovery, and mean time between failures (MTBF) are two failure metrics commonly used to measure the reliability of systems or products within the field of facilities maintenance. While these acronyms are related, they have different meanings and are used to answer different questions.
First, let’s review MTBF.
What is mean time between failures (MTBF)?
MTBF is a key performance indicator (KPI) that represents the average time between two consecutive failures of a system or product. MTBF is a measure of reliability, and it is commonly used in the context of warranties, maintenance planning and product development. Note that MTBF, which refers to repairable items, is not to be confused with the closely related term mean time to failure (MTTF), which refers to assets that are non-repairable and need to be replaced rather than repaired.
The MTBF calculation uses the following formula:
MTBF = Total operating time / Number of failures over a given period
So, for example, if a product is used for 1,000 hours and it fails 3 times during that period, the MTBF would be: 1000 hours / 3 failures = 333.3 hours
This means that on average, the product can be expected to run for about 333.3 hours of use between failures. It is an average, not a prediction of when any single failure will occur.
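For readers who track these numbers in a script or spreadsheet export, the calculation can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not a standard implementation; the function name `mtbf` and its parameter names are our own assumptions.

```python
def mtbf(total_operating_hours: float, failure_count: int) -> float:
    """Mean time between failures: operating time divided by failure count.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if failure_count <= 0:
        # With no recorded failures, the ratio is undefined.
        raise ValueError("MTBF needs at least one recorded failure")
    return total_operating_hours / failure_count


# The example from the text: 1,000 operating hours and 3 failures.
print(round(mtbf(1000, 3), 1))  # 333.3
```

Raising an error when there are no failures is a design choice for this sketch; a reporting tool might instead show the result as "not yet measurable."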
MTBF is useful in determining the expected lifetime of a product and can help manufacturers plan for maintenance or replacement. However, it does not take into account how much time it takes to repair a product after it fails, which can be an important consideration in some applications.
That’s where MTTR comes in.
What is mean time to repair (MTTR)?
MTTR is the average time it takes to repair a system or product after it has failed. MTTR is used to measure the maintainability of a system or product, meaning how quickly it can be restored to service. MTTR typically includes the time it takes to notify maintenance teams, allow equipment to cool down for repair, fix the issue, reassemble any relevant equipment or systems and test before restarting production.
The goal of MTTR is to minimize the downtime caused by failures and reduce the costs associated with repairs.
Here’s how to calculate MTTR:
MTTR = Total downtime / Total number of failures over a specific time
For example, if a system failed 5 times over the last year, resulting in 10 total hours of downtime (including repair time), the MTTR would be: 10 hours / 5 failures = 2 hours
This means that on average, it takes two hours to repair the system after a failure occurs.
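In practice, downtime is usually logged per incident rather than as a single total. The sketch below shows one way to compute MTTR from such a log. The function name `mttr` and the individual outage durations are hypothetical; the durations were chosen only so that they add up to the 10 hours across 5 failures used in the example.

```python
from typing import Sequence


def mttr(downtime_hours: Sequence[float]) -> float:
    """Mean time to repair: total downtime divided by the number of failures.

    Hypothetical helper; each entry is the downtime of one failure, in hours.
    """
    if not downtime_hours:
        # Without any failures there is nothing to average.
        raise ValueError("MTTR needs at least one recorded failure")
    return sum(downtime_hours) / len(downtime_hours)


# Illustrative outage log: 5 failures totalling 10 hours of downtime.
outages = [1.5, 3.0, 2.0, 0.5, 3.0]
print(mttr(outages))  # 2.0
```

Keeping the per-incident durations, rather than only the total, also makes it easy to spot the unusually long repairs that pull the average up.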
MTTR is useful in determining the efficiency of maintenance operations and can help identify areas where improvements can be made.
Differences between MTBF and MTTR
Mean time between failures (MTBF) and mean time to repair (MTTR) answer different questions and have different applications. MTBF and MTTR exist in a family of KPIs that include mean time to respond, mean time to detect (MTTD) and mean time to acknowledge (MTTA), among others.
MTBF is a measure of how long a system or product is expected to operate before it fails, and it is used to plan for maintenance or replacement. MTTR is a measure of how long it takes to repair a system or product after it fails, and it is used to minimize downtime and reduce repair costs.
MTBF does not take into account the period of time it takes to repair a product after it fails, while MTTR does not take into account the total time between failures.
How MTBF and MTTR work together
Across many use cases, both metrics may be used in tandem to get a more complete picture of the overall maintainability of a system or product. For example, in a manufacturing plant, MTBF might be used to determine the expected lifetime of a machine and plan for replacement, while MTTR might be used to optimize maintenance schedules for that machine and maximize total uptime. In the context of software development, MTBF might be used to measure the stability of a system and plan for updates or bug fixes, while MTTR might be used to optimize the development process and reduce the time it takes to fix issues.
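One common way to combine the two metrics, which this article does not otherwise cover, is steady-state availability: the share of time a system is expected to be up, calculated as MTBF / (MTBF + MTTR). The sketch below is an illustration under that assumption. The function name `availability` is our own, and the inputs reuse the two separate examples above purely for demonstration, as if they described the same asset.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR).

    Hypothetical helper; assumes both values are in hours and that
    MTBF counts operating time only (repair time excluded).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative inputs: MTBF of 1000/3 hours and MTTR of 2 hours.
uptime_share = availability(1000 / 3, 2)
print(f"{uptime_share:.1%}")  # 99.4%
```

The formula makes the trade-off visible: availability rises either when failures become less frequent (higher MTBF) or when repairs become faster (lower MTTR).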
Manage assets to improve MTBF and MTTR
Improving MTBF and MTTR to reduce downtime can be a complex process that involves identifying and addressing the root causes of system failures, optimizing maintenance operations and implementing improvements in design and manufacturing processes.
Today, large organizations use Computerized Maintenance Management Systems (CMMSs) to help them manage their maintenance processes. A CMMS typically offers features like work order management, preventative maintenance scheduling, inventory management, asset management and reporting.
IBM Maximo is enterprise asset management software that includes comprehensive CMMS capabilities. Maximo is a single, integrated cloud-based platform that uses artificial intelligence (AI), IoT and analytics to optimize performance, extend the lifecycle of assets and reduce the costs of outages. A related tool, IBM Instana Observability, offers full-stack observability, with the goal of helping users optimize and democratize incident prevention.
Both of these products will give you the visibility into your assets and operations that you’ll need to make smarter, data-driven decisions, ultimately resulting in fewer breakdowns and less downtime.