January 20, 2023 By Dina Henderson 4 min read

There are many things to take into consideration when building any type of infrastructure. Whether you’re building a software application or the underlying infrastructure, there is one critical part of the design that deserves attention—failure domains.

Failure domains are regions or components of the infrastructure where a failure can occur. These regions can be bounded physically or logically, and each carries its own risks and challenges to architect around.

What are failure domains?

Here is a simple example: if you’re running a web application with a single Apache web server and a single MySQL database, each on its own server, you have a few failure domains to account for in the infrastructure:

  • Web server: Running a single instance of your web server is a rather obvious single point of failure.
  • Database server: A single instance is another single point of failure; if it goes down, the application cannot reach its data.
  • Network: While we were smart to separate the role of web and database server, this also introduces the network as a new point of failure.
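To make these domains concrete, a minimal health probe can check each one at the network level. This is only a sketch; the hostnames and ports below are placeholders for the example environment:

```python
import socket

# Hypothetical hosts for the example environment; replace with real endpoints.
FAILURE_DOMAINS = {
    "web server": ("web01.example.com", 80),
    "database server": ("db01.example.com", 3306),
}

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds.

    A False result can reflect any of the failure domains above: the target
    process being down, the host being down, or the network path failing.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, (host, port) in FAILURE_DOMAINS.items():
    print(f"{name}: {'up' if is_reachable(host, port) else 'DOWN'}")
```

Note that a simple TCP probe cannot distinguish which domain failed; it only tells you that the path from the prober to the service is broken somewhere.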

These are fairly simple to see when we look at how our application environment is composed. So, what should we do?

Don’t hesitate, mitigate

Mitigation is the reduction of risk by some form of action or design. Let’s break down some simple mitigation strategies to help our example application.

Web server

We should add web servers to handle the requests, which will provide redundancy and resiliency. This means adding a load balancer to the application infrastructure to accept inbound connections and distribute the requests across the new web server farm.
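As one illustration of this pattern, a reverse proxy such as nginx can spread requests across the farm and route around a failed backend. The hostnames below are placeholders, and the thresholds are example values only:

```nginx
# Hypothetical upstream pool; hostnames are placeholders.
upstream web_farm {
    server web01.example.com:80 max_fails=3 fail_timeout=30s;
    server web02.example.com:80 max_fails=3 fail_timeout=30s;
    server web03.example.com:80 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://web_farm;
        # On an error or timeout, retry the request against the next backend.
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```

Of course, a single load balancer is itself a new single point of failure, which is exactly the kind of ripple effect discussed below.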

Database server

Just like we did with our web server, we should create a horizontally scalable database architecture that can tolerate the failure of individual nodes. This ensures data availability in the event of a localized outage. Luckily, MySQL-compatible databases such as MariaDB support multi-node deployments, for example through Galera Cluster replication.
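As a rough illustration of what such a multi-node setup involves, here is a sketch of a Galera-style configuration for a MariaDB node. All names and paths are placeholders, and the provider path varies by distribution:

```ini
# /etc/mysql/conf.d/galera.cnf -- illustrative values only
[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=example_cluster
wsrep_cluster_address=gcomm://db01.example.com,db02.example.com,db03.example.com
wsrep_node_name=db01
wsrep_node_address=db01.example.com
```

Each node carries a copy of the data, so losing one node leaves the cluster serving reads and writes from the survivors.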


Network

Since the network is a key component, it is also a key risk. We can add multiple network cards to the server and attach the uplink ports to multiple switches. This will enable us to withstand both a top-of-rack switch outage and a single port outage (or even something as seemingly simple as a cable failure).
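The multi-NIC setup described above is commonly implemented as a bonded interface. Here is a sketch assuming a Linux host managed with netplan; the interface names and address are placeholders:

```yaml
# /etc/netplan/01-bond.yaml -- illustrative values only
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      addresses: [192.0.2.10/24]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100
        primary: eno1
```

With each physical NIC cabled to a different top-of-rack switch, active-backup mode fails traffic over to the second link if the primary NIC, cable or switch fails.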

At the networking layer, we can have our network engineer ensure that the necessary failsafe designs are in place to prevent routing and switch issues, and that multiple uplinks to the external network provider exist for better network resiliency.

Sounds like we have a few good solutions in hand. This is where we have to pause and think about the impact of our proposed solutions.

How mitigation can introduce risk and complexity

We added a mitigation strategy for some of our components, but this doesn’t mean that the problem is solved. Have you ever heard this joke? “I had a problem that I decided to use Regex statements to fix. Now I have two problems.”

Adding a few extra web servers looked easy when we put it on the idea list. One thing about web farms is that they often assume a queuing or coordination mechanism in front of the database for write operations; otherwise, concurrent writers can conflict. So, although we fixed the issue of a single point of failure, we introduced complexity that may not be accounted for in the application design.
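The write-queuing idea can be sketched in a few lines: many producers (the web servers) enqueue writes, and a single consumer applies them in order. This is a minimal in-process sketch; the names are hypothetical, and a real deployment would use a message broker and the actual database driver rather than a dict:

```python
import queue
import threading

write_queue = queue.Queue()
database = {}  # stand-in for the real database

def db_writer():
    """Single consumer applies queued writes in order, so concurrent
    writers on the web farm never race each other at the database."""
    while True:
        op = write_queue.get()
        if op is None:  # shutdown sentinel
            write_queue.task_done()
            break
        key, value = op
        database[key] = value
        write_queue.task_done()

worker = threading.Thread(target=db_writer)
worker.start()

# Any number of web servers can enqueue writes concurrently.
write_queue.put(("user:1", {"name": "Ada"}))
write_queue.put(("user:2", {"name": "Lin"}))

write_queue.put(None)  # signal shutdown
worker.join()          # all queued writes are applied once the worker exits
```

The trade-off is visible even in this toy version: the queue decouples the web farm from the database, but it is one more component that the application design has to account for.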

This is a key reason that we focus on some DevOps concepts and the importance of having the infrastructure and application teams fully engaged when making architecture decisions.

Widening the domain

If we look at our mitigation strategy, we added new servers and load balancers, and let’s assume that we’ve also gone the extra distance to add a message queuing infrastructure to ensure data integrity.

It would seem like we are done, right? Not quite. If we widen the failure domain a little bit to something like a regional power outage or network outage, we suddenly have a new set of problems.

We can easily get into what many call “Analysis Paralysis.” This is where we spend so much time looking for the ultimate solution that we continually find reasons not to proceed. Hopefully, we also love agile and lean processes so we can proceed in an iterative fashion and continually revisit to attend to deficiencies and a feature backlog that can include failure domain mitigation.

When you look at your application or server designs, you may also see that extending outside of a geographical region for redundancy is a potential solution. Perhaps bursting to the cloud or to multiple clouds.

The point of our example was to highlight that we should be acutely aware of failure domains and scenarios as we architect our solution. Nobody wants to get caught out when the outage occurs and have to say, “I didn’t think of that.”

Failure domains and IBM Turbonomic

Failure domains should always be a top consideration when building any type of infrastructure. With proper identification and implementation of mitigation strategies, we can minimize the risk of downtime and ensure that our applications remain available. However, it’s crucial to recognize that mitigation can introduce new risks and complexities, and expanding our scope to include larger-scale outages is essential.

IBM Turbonomic offers a unique approach to identifying failure domains and mitigating risk. By leveraging AI-powered automation, IBM Turbonomic continuously analyzes a user’s infrastructure, including application demand, resource supply, and potential risks and vulnerabilities. Because it generates automatable actions in real time, users can take a proactive approach to failure domains and resolve them without the need for manual intervention. With IBM Turbonomic, organizations can ensure the resilience and availability of their infrastructure while minimizing the risks and complexities associated with failure domains.

Get started with IBM Turbonomic.
