Businesses rely every day on various systems and pieces of equipment to keep their operations running smoothly. But all systems inevitably require upkeep. It could be intangible software, like an IT service network that has accumulated enough bugs to break an important feature, sending developers scrambling for a fix. Or it could be a piece of physical equipment, like an ice cream machine in a fast food restaurant with a broken o-ring.

Eventually, everything breaks down, from multi-site IT systems down to individual light bulbs. Unplanned downtime can have catastrophic consequences, and it’s up to facility maintenance engineers and technicians to plan ahead so that swift measures are taken to rectify a failure. The goal is to minimize downtime, reducing the costs associated with lost productivity, revenue or customer dissatisfaction.

Downtime can be minimized in many ways. For example, businesses can aim to reduce the amount of time it takes to repair a piece of equipment by having sufficient replacement parts accessible to technicians on-site. Or, they can observe repair processes to find faster ways to perform repairs or quicker ways to notify technicians. Even further, they can make investments in better-performing tools with longer lifespans to reduce the number of repairs needed.

But in order to understand how to improve the reliability of systems and components, we first must be able to measure their reliability. Mean time to repair (MTTR)—also known as mean time to recovery—and mean time between failures (MTBF) are two failure metrics commonly used to measure the reliability of systems or products within the field of facilities maintenance. While these acronyms are related, they have different meanings and are used to answer different questions.

First, let’s review MTBF. 

What is mean time between failures (MTBF)?

MTBF is a key performance indicator (KPI) that represents the average time between two consecutive failures of a system or product. MTBF is a measure of reliability, and it is commonly used in the context of warranties, maintenance planning and product development. Note that MTBF, which applies to repairable items, is not to be confused with the closely related term mean time to failure (MTTF), which applies to non-repairable assets that must be replaced rather than repaired.

The MTBF calculation uses the following formula:

MTBF = Total operating time / Number of failures over a given period

So, for example, if a product is used for 1,000 hours and it fails 3 times during that period, the MTBF would be: 1000 hours / 3 failures = 333.3 hours

This means that, on average, the product can be expected to run for about 333.3 hours of use between failures.
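The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation; the function name `mean_time_between_failures` and its parameter names are assumptions made for this example.

```python
def mean_time_between_failures(total_operating_hours: float, failure_count: int) -> float:
    """Return MTBF in hours: total operating time divided by the number of failures.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if failure_count <= 0:
        # With no recorded failures, the ratio is undefined for the period.
        raise ValueError("failure_count must be a positive integer")
    return total_operating_hours / failure_count


# The worked example from the text: 1,000 operating hours and 3 failures.
mtbf_hours = mean_time_between_failures(1000, 3)
print(f"MTBF: {mtbf_hours:.1f} hours")  # prints "MTBF: 333.3 hours"
```

Raising an error when no failures were recorded is one possible design choice; a reporting tool might instead show the MTBF as "not yet measurable" for that period.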

MTBF is useful in determining the expected lifetime of a product and can help manufacturers plan for maintenance or replacement. However, it does not take into account how much time it takes to repair a product after it fails, which can be an important consideration in some applications. 

That’s where MTTR comes in. 

What is mean time to repair (MTTR)? 

MTTR is the average time it takes to repair a system or product after it has failed. It is a measure of maintainability: how quickly a failed system or product can be returned to service. MTTR typically includes the time it takes to notify maintenance teams, allow equipment to cool down for repair, fix the issue, reassemble any relevant equipment or systems and test before restarting production.

The goal of MTTR is to minimize the downtime caused by failures and reduce the costs associated with repairs. 

Here’s how to calculate MTTR:

MTTR = Total downtime / Number of failures over a given period

For example, if a system failed 5 times over the last year, resulting in 10 total hours of downtime (including repair time), the MTTR would be: 10 hours / 5 failures = 2 hours

This means that on average, it takes two hours to repair the system after a failure occurs.
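As a rough sketch, the same calculation can be run over a log of individual incidents. The function name `mean_time_to_repair` and the per-incident durations below are hypothetical; the durations were chosen only so that they add up to the 10 hours across 5 failures used in the example.

```python
from typing import Sequence


def mean_time_to_repair(downtime_hours_per_failure: Sequence[float]) -> float:
    """Return MTTR in hours: total downtime divided by the number of failures.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if not downtime_hours_per_failure:
        # No failures were recorded, so there is nothing to average.
        raise ValueError("at least one failure is required to compute MTTR")
    return sum(downtime_hours_per_failure) / len(downtime_hours_per_failure)


# Made-up downtime per incident, in hours: 5 failures totalling 10 hours.
incident_downtimes = [1.5, 2.5, 3.0, 1.0, 2.0]
mttr_hours = mean_time_to_repair(incident_downtimes)
print(f"MTTR: {mttr_hours:.1f} hours")  # prints "MTTR: 2.0 hours"
```

Keeping the per-incident durations, rather than only the total, also makes it easy to spot the outliers that pull the average up.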

MTTR is useful in determining the efficiency of maintenance operations and can help identify areas where improvements can be made. 

Differences between MTBF and MTTR

Mean time between failures (MTBF) and mean time to repair (MTTR) answer different questions and have different applications. MTBF and MTTR exist in a family of KPIs that include mean time to respond, mean time to detect (MTTD) and mean time to acknowledge (MTTA), among others.

MTBF is a measure of how long a system or product is expected to operate before it fails, and it is used to plan for maintenance or replacement. MTTR is a measure of how long it takes to repair a system or product after it fails, and it is used to minimize downtime and reduce repair costs.

MTBF does not take into account the period of time it takes to repair a product after it fails, while MTTR does not take into account the total time between failures. 

How MTBF and MTTR work together

Across many use cases, both metrics may be used in tandem to get a more complete picture of the overall maintainability of a system or product. For example, in a manufacturing plant, MTBF might be used to determine the expected lifetime of a machine and plan for replacement, while MTTR might be used to optimize maintenance schedules for that machine and maximize total uptime. In the context of software development, MTBF might be used to measure the stability of a system and plan for updates or bug fixes, while MTTR might be used to optimize the development process and reduce the time it takes to fix issues.
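One common way to combine the two metrics, which the text above does not spell out, is steady-state availability: MTBF / (MTBF + MTTR). The sketch below assumes MTBF is measured as operating time between failures, as in the formula earlier in this article. The function name and all input values are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return the expected fraction of time a system is up: MTBF / (MTBF + MTTR).

    Hypothetical helper for illustration; assumes MTBF counts operating time only.
    """
    if mtbf_hours < 0 or mttr_hours < 0 or (mtbf_hours + mttr_hours) == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up figures: a machine that runs 200 hours between failures
# and takes 2 hours to repair each time.
baseline = availability(mtbf_hours=200.0, mttr_hours=2.0)

# Two ways to improve: repair twice as fast, or fail half as often.
faster_repairs = availability(mtbf_hours=200.0, mttr_hours=1.0)
fewer_failures = availability(mtbf_hours=400.0, mttr_hours=2.0)

print(f"Baseline:       {baseline:.2%}")        # 99.01%
print(f"Faster repairs: {faster_repairs:.2%}")  # 99.50%
print(f"Fewer failures: {fewer_failures:.2%}")  # 99.50%
```

Under these assumptions, halving MTTR and doubling MTBF produce the same availability, which is one reason to track both metrics rather than either one alone.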

Manage assets to improve MTBF and MTTR

Improving MTBF and MTTR to reduce downtime can be a complex process that involves identifying and addressing the root causes of system failures, optimizing maintenance operations and implementing improvements in design and manufacturing processes.

Today, large organizations use Computerized Maintenance Management Systems (CMMSs) to help them manage their maintenance processes. A CMMS typically offers features like work order management, preventative maintenance scheduling, inventory management, asset management and reporting. 

IBM Maximo is enterprise asset management software that includes comprehensive CMMS capabilities. Maximo is a single, integrated cloud-based platform that uses artificial intelligence (AI), IoT and analytics to optimize performance, extend the lifecycle of assets and reduce the costs of outages. A related tool, IBM Instana Observability, offers full-stack observability, with the goal of helping users optimize and democratize incident prevention. 

Both of these products will give you the visibility into your assets and operations that you’ll need to make smarter, data-driven decisions, ultimately resulting in fewer breakdowns and less downtime.

Learn more about IBM Maximo Application Suite

Get started with IBM Instana Observability

