There are many things to consider when building any type of infrastructure. Whether you're building a software application or the underlying infrastructure, one critical part of the design is the failure domain.

Failure domains are regions or components of the infrastructure that carry a potential for failure. These boundaries can be physical or logical, and each brings its own risks and challenges to architect around.

What are failure domains?

Here is a simple example: if you're running a web application with a single Apache server and a MySQL database, each on its own server, you have a few failure domains to account for in the infrastructure:

  • Web server: Running a single instance of your web server is a rather obvious single point of failure.
  • Database server: A single database instance is another single point of failure; if it goes down, the application cannot reach its data.
  • Network: While we were smart to separate the web and database roles onto separate servers, this also introduces the network as a new point of failure.

These are fairly simple to see when we look at how our application environment is composed. So, what should we do?

Don’t hesitate, mitigate

Mitigation is the reduction of risk by some form of action or design. Let’s break down some simple mitigation strategies to help our example application.

Web server

We should be adding additional web servers to handle the requests, which will provide redundancy and resiliency. This means adding a load balancer into the application infrastructure to accept inbound connections and distribute the requests across the new web server farm.
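As a sketch of that load balancer, an HAProxy configuration distributing inbound requests across the web farm might look like the following (hostnames, addresses and ports are hypothetical):

```haproxy
# Accept inbound HTTP connections and spread them across the farm
frontend web_front
    bind *:80
    default_backend web_farm

# Round-robin across three web servers; "check" enables health checks
# so a failed server is taken out of rotation automatically
backend web_farm
    balance roundrobin
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
    server web3 10.0.0.13:80 check
```

With health checks in place, losing a single web server degrades capacity rather than taking the application down.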

Database server

Just like we did with our web server, we should be creating a horizontally scalable database architecture to allow for failures of certain nodes. This ensures data availability in the event of a localized outage. Luckily, MySQL-compatible databases such as MariaDB support this pattern: with Galera Cluster, MariaDB can run as a multi-node, synchronously replicated installation.
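One common way to run a MySQL-compatible database across multiple nodes is MariaDB Galera Cluster. A minimal per-node configuration sketch might look like this (cluster name and addresses are hypothetical):

```ini
# galera.cnf sketch for one MariaDB Galera node
[galera]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = "app_cluster"
# All cluster members; each node joins via this list
wsrep_cluster_address    = "gcomm://10.0.0.21,10.0.0.22,10.0.0.23"
# Settings Galera requires for synchronous replication
binlog_format            = row
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
```

Each node holds a full copy of the data, so the loss of a single database server no longer takes the application's data with it.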

Network

Since the network is a key component, it is also a key risk. We can add multiple network cards to the server and attach the uplink ports to multiple switches. This lets us withstand a top-of-rack switch outage, a single port outage, or even something as seemingly simple as a cable failure.
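The multi-NIC setup described above can be expressed as a bonded interface on the server. A minimal Netplan sketch in active-backup mode, with the two NICs cabled to separate switches, might look like this (interface names and addresses are hypothetical):

```yaml
# Active-backup bond: traffic fails over to the second NIC
# if the first NIC, its cable, or its switch fails
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100   # link check every 100 ms
      addresses: [10.0.0.11/24]
```

Active-backup works with any pair of switches; if both switches support link aggregation across the pair, an LACP mode could use both links at once.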

At the networking layer, we can have our network engineer ensure that failsafe designs are in place to prevent routing and switch issues, and that multiple uplinks to the external network provider give better resiliency for network connectivity.

Sounds like we have a few good solutions in hand. This is where we have to pause and think about the impact of our proposed solutions.

How mitigation can introduce risk and complexity

We added a mitigation strategy for some of our components, but this doesn't mean that the problem is solved. Have you ever heard this joke? "I had a problem that I decided to fix with regular expressions. Now I have two problems."

Adding a few extra web servers looked easy when we put it on the idea list. One catch with web farms is that multiple servers writing concurrently may need a queuing or coordination mechanism in front of the database. So, although we fixed the issue of a single point of failure, we introduced complexity that may not be accounted for in the application design.
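To make the write-queuing idea concrete, here is a minimal sketch in Python. Multiple web servers enqueue write operations instead of writing to the database directly, and a single worker applies them in order; the in-memory dict stands in for the real database, and all names are illustrative:

```python
import queue
import threading

# Shared queue of (key, value) write operations and a stand-in "database"
write_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()
database: dict[str, str] = {}

def db_writer() -> None:
    """Drain the queue, applying each write to the database in order."""
    while True:
        key, value = write_queue.get()
        database[key] = value
        write_queue.task_done()

# One writer thread owns all database writes; web servers only enqueue.
threading.Thread(target=db_writer, daemon=True).start()

# Two "web servers" submitting writes; the queue serializes them safely.
write_queue.put(("user:1", "alice"))
write_queue.put(("user:2", "bob"))
write_queue.join()  # block until every queued write has been applied
print(database["user:1"])  # -> alice
```

In a real deployment, the in-process queue would typically be replaced by a durable broker (e.g., a message-queue service) so that writes survive the failure of any single web server.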

This is a key reason that we focus on some DevOps concepts and the importance of having the infrastructure and application teams fully engaged when making architecture decisions.

Widening the domain

Looking back at our mitigation strategy, we added new servers and load balancers; let's assume that we've also gone the extra distance to add a message-queuing infrastructure to ensure data integrity.

It would seem like we are done, right? Not quite. If we widen the failure domain a little bit to something like a regional power outage or network outage, we suddenly have a new set of problems.

We can easily get into what many call "analysis paralysis." This is where we spend so much time looking for the ultimate solution that we continually find reasons not to proceed. Hopefully, we also embrace agile and lean processes, so we can proceed iteratively and continually revisit a backlog of deficiencies and features that includes failure-domain mitigation.

When you look at your application or server designs, you may also see that extending outside of a geographical region for redundancy is a potential solution. Perhaps bursting to the cloud or to multiple clouds.

The point of our example was to highlight that we should be acutely aware of failure domains and scenarios as we architect our solution. Nobody wants to get caught out when the outage occurs and have to say, “I didn’t think of that.”

Failure domains and IBM Turbonomic

Failure domains should always be a top consideration when building any type of infrastructure. With proper identification and implementation of mitigation strategies, we can minimize the risk of downtime and ensure that our applications remain available. However, it’s crucial to recognize that mitigation can introduce new risks and complexities, and expanding our scope to include larger-scale outages is essential.

IBM Turbonomic offers a unique approach to identifying failure domains and mitigating risk. By leveraging AI-powered automation, IBM Turbonomic continuously analyzes a user's infrastructure, including application demand, resource supply, and potential risks and vulnerabilities. By generating automatable actions in real time, it lets users take a proactive approach to failure domains and resolve them without the need for manual intervention. With IBM Turbonomic, organizations can ensure the resilience and availability of their infrastructure while minimizing the risks and complexities associated with failure domains.

Get started with IBM Turbonomic.
