Businesses rely every day on various systems and pieces of equipment to keep their operations running smoothly. But all systems inevitably require upkeep. It could be intangible software, like an IT service network that has accumulated enough bugs to break an important feature, sending developers scrambling for a fix. Or it could be a piece of physical equipment, like an ice cream machine in a fast food restaurant with a broken o-ring.

Eventually, everything breaks down, from multi-site IT systems down to individual light bulbs. Unplanned downtime can have catastrophic consequences, and it’s up to facility maintenance engineers and technicians to plan ahead so that a failure can be rectified swiftly. The goal is to minimize downtime, reducing the costs associated with lost productivity, lost revenue and customer dissatisfaction.

Downtime can be minimized in many ways. For example, businesses can aim to reduce the amount of time it takes to repair a piece of equipment by having sufficient replacement parts accessible to technicians on-site. Or, they can observe repair processes to find faster ways to perform repairs or quicker ways to notify technicians. Even further, they can make investments in better-performing tools with longer lifespans to reduce the number of repairs needed.

But in order to understand how to improve the reliability of systems and components, we first must be able to measure their reliability. Mean time to repair (MTTR)—also known as mean time to recovery—and mean time between failures (MTBF) are two failure metrics commonly used to measure the reliability of systems or products within the field of facilities maintenance. While these acronyms are related, they have different meanings and are used to answer different questions.

First, let’s review MTBF. 

What is mean time between failures (MTBF)?

MTBF is a key performance indicator (KPI) that represents the average time between two consecutive failures of a system or product. MTBF is a measure of reliability, and it is commonly used in the context of warranties, maintenance planning and product development. Note that MTBF, which applies to repairable items, is not to be confused with the closely related term mean time to failure (MTTF), which applies to non-repairable assets that must be replaced rather than repaired.

The MTBF calculation uses the following formula:

MTBF = Total operating time / Number of failures over a given period

So, for example, if a product is used for 1,000 hours and it fails 3 times during that period, the MTBF would be: 1,000 hours / 3 failures ≈ 333.3 hours

This means that on average, the product can be expected to run for about 333.3 hours between one failure and the next.
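The MTBF formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `mean_time_between_failures` and its parameter names are hypothetical, and the inputs are the figures from the example (1,000 operating hours, 3 failures).

```python
def mean_time_between_failures(total_operating_hours: float, failure_count: int) -> float:
    """Return MTBF in hours: total operating time divided by number of failures."""
    # Hypothetical guard: with zero failures the ratio is undefined.
    if failure_count <= 0:
        raise ValueError("MTBF is undefined when no failures were recorded")
    return total_operating_hours / failure_count


# Figures from the example in the text: 1,000 operating hours, 3 failures.
mtbf_hours = mean_time_between_failures(1000, 3)
print(f"MTBF: {mtbf_hours:.1f} hours")  # prints "MTBF: 333.3 hours"
```

The guard for zero failures is a design choice made for this sketch; a period without failures gives no basis for an average, so raising an error is one reasonable way to signal that.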

MTBF is useful in determining the expected lifetime of a product and can help manufacturers plan for maintenance or replacement. However, it does not take into account how much time it takes to repair a product after it fails, which can be an important consideration in some applications. 

That’s where MTTR comes in. 

What is mean time to repair (MTTR)? 

MTTR is the average time it takes to repair a system or product after it has failed. It measures maintainability, that is, how quickly a failed system or product can be restored to service. MTTR typically includes the time it takes to notify maintenance teams, allow equipment to cool down for repair, fix the issue, reassemble any relevant equipment or systems and test before restarting production.

The goal of MTTR is to minimize the downtime caused by failures and reduce the costs associated with repairs. 

Here’s how to calculate MTTR:

MTTR = Total downtime / Total number of failures over a specific time

For example, if a system failed 5 times over the last year, resulting in 10 total hours of downtime (including repair time), the MTTR would be: 10 hours / 5 failures = 2 hours

This means that on average, it takes two hours to repair the system after a failure occurs.
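The same calculation can be written as a short Python sketch. The function name `mean_time_to_repair` and its parameters are hypothetical, chosen only for illustration; the inputs are the figures from the example (10 hours of downtime across 5 failures).

```python
def mean_time_to_repair(total_downtime_hours: float, failure_count: int) -> float:
    """Return MTTR in hours: total downtime divided by number of failures."""
    # Hypothetical guard: with zero failures there is nothing to average.
    if failure_count <= 0:
        raise ValueError("MTTR is undefined when no failures were recorded")
    return total_downtime_hours / failure_count


# Figures from the example in the text: 10 hours of downtime, 5 failures.
mttr_hours = mean_time_to_repair(10, 5)
print(f"MTTR: {mttr_hours:.1f} hours")  # prints "MTTR: 2.0 hours"
```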

MTTR is useful in determining the efficiency of maintenance operations and can help identify areas where improvements can be made. 

Differences between MTBF and MTTR

Mean time between failures (MTBF) and mean time to repair (MTTR) answer different questions and have different applications. MTBF and MTTR exist in a family of KPIs that include mean time to respond, mean time to detect (MTTD) and mean time to acknowledge (MTTA), among others.

MTBF is a measure of how long a system or product is expected to operate before it fails, and it is used to plan for maintenance or replacement. MTTR is a measure of how long it takes to repair a system or product after it fails, and it is used to minimize downtime and reduce repair costs.

MTBF does not take into account the period of time it takes to repair a product after it fails, while MTTR does not take into account the total time between failures. 

How MTBF and MTTR work together

Across many use cases, both metrics may be used in tandem to get a more complete picture of the overall maintainability of a system or product. For example, in a manufacturing plant, MTBF might be used to determine the expected lifetime of a machine and plan for replacement, while MTTR might be used to optimize maintenance schedules for that machine and maximize total uptime. In the context of software development, MTBF might be used to measure the stability of a system and plan for updates or bug fixes, while MTTR might be used to optimize the development process and reduce the time it takes to fix issues.
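As a rough illustration of using the two metrics in tandem, the Python sketch below derives both from a single outage log. Everything here is an assumption made for the example: the log format, the 1,000-hour observation window, the sample data and the `summarize` helper are all hypothetical. The sketch also computes steady-state availability as MTBF / (MTBF + MTTR), a widely used ratio that this article does not itself cover.

```python
from typing import List, Tuple

# Hypothetical outage log: (failure_hour, restored_hour), measured from the
# start of a 1,000-hour observation window. The data is illustrative only.
OBSERVATION_HOURS = 1000.0
OUTAGES: List[Tuple[float, float]] = [(100.0, 102.0), (400.0, 403.0), (800.0, 801.0)]


def summarize(outages: List[Tuple[float, float]], observation_hours: float) -> Tuple[float, float, float]:
    """Return (MTBF, MTTR, availability) for a non-empty outage log."""
    if not outages:
        raise ValueError("at least one failure is needed to compute the averages")
    failure_count = len(outages)
    # Downtime is the sum of each outage's duration.
    downtime = sum(restored - failed for failed, restored in outages)
    # Operating time is whatever part of the window was not spent down.
    operating_time = observation_hours - downtime
    mtbf = operating_time / failure_count
    mttr = downtime / failure_count
    # Common steady-state availability ratio (an addition, not from the article).
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


mtbf, mttr, availability = summarize(OUTAGES, OBSERVATION_HOURS)
print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.1f} h, availability: {availability:.1%}")
```

With this sample log, the 6 hours of downtime across 3 failures leave 994 operating hours, so a longer MTBF or a shorter MTTR would each push the availability figure higher.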

Manage assets to improve MTBF and MTTR

Improving MTBF and MTTR to reduce downtime can be a complex process that involves identifying and addressing the root causes of system failures, optimizing maintenance operations and implementing improvements in design and manufacturing processes.

Today, large organizations use Computerized Maintenance Management Systems (CMMSs) to help them manage their maintenance processes. A CMMS typically offers features like work order management, preventative maintenance scheduling, inventory management, asset management and reporting. 

IBM Maximo is enterprise asset management software that includes comprehensive CMMS capabilities. Maximo is a single, integrated cloud-based platform that uses artificial intelligence (AI), IoT and analytics to optimize performance, extend the lifecycle of assets and reduce the costs of outages. A related tool, IBM Instana Observability, offers full-stack observability, with the goal of helping users optimize and democratize incident prevention. 

Both of these products will give you the visibility into your assets and operations that you’ll need to make smarter, data-driven decisions, ultimately resulting in fewer breakdowns and less downtime.

Learn more about IBM Maximo Application Suite

Get started with IBM Instana Observability

