Thinking like an architect: Understanding failure domains

There are many things to take into consideration when building any type of infrastructure. Whether you’re building a software application or the underlying infrastructure, there is one critical part of our design—failure domains.

Failure domains are regions or components of the infrastructure that contain a potential for failure. These regions can be physical or logical boundaries, and each has its own risks and challenges for which to architect.

What are failure domains?

Here is a simple example to look at: If you’re running a web application with a single Apache server and a MySQL database on two servers, you have a few failure domains to account for on the infrastructure:

Web server: Running a single instance of your web server is a rather obvious single point of failure.
Database server: A single instance risks loss when the application is potentially unable to attach to the database.
Network: While we were smart to separate the role of web and database server, this also introduces the network as a new point of failure.

These are fairly simple to see when we look at how our application environment is comprised. So, what should we do?

Don’t hesitate, mitigate

Mitigation is the reduction of risk by some form of action or design. Let’s break down some simple mitigation strategies to help our example application.

Web server

We should be adding additional web servers to handle the requests, which will provide redundancy and resiliency. This means adding a load balancer into the application infrastructure to accept inbound connections and distribute the requests across the new web server farm.

Database server

Just like we did with our web server, we should be creating a horizontally scalable database architecture to allow for failures of certain nodes. This ensures data availability in the event of a localized outage. Luckily, MySQL can be deployed in this way using MariaDB, which is a distributed relational database to allow for multi-node installations.

Network

Since the network is a key component, it is also a key risk. We can add multiple network cards to the server and attach the uplink ports to multiple switches. This will enable us to withstand both a top-of-rack switch outage and a single port outage (or even something as seemingly simple as a cable failure).

At the networking layer, we can have our network engineer ensure that the necessary failsafe designs are in place to prevent routing issues, switch issues and multiple uplinks to the external network provider for better resiliency for network connectivity.

Sounds like we have a few good solutions in hand. This is where have to pause and think about the impact of our proposed solutions.

How mitigation can introduce risk and complexity

We added a mitigation strategy for some of our components, but this doesn’t mean that the problem is solved. Have you ever heard this joke? “I had a problem that I decided to use Regex statements to fix. Now I have two problems.”

Adding a few extra web servers looked easy when we put it on the idea list. One thing about web farms is that they assume you have a queuing system into the database when you are doing write functions. So, although we fixed the issue of a single point of failure, we introduced complexity that may not be accounted for in the application design.

This is a key reason that we focus on some DevOps concepts and the importance of having the infrastructure and application teams fully engaged when making architecture decisions.

Widening the domain

If we look at our mitigation strategy, we added new servers and load balancers, and let’s assume that we’ve also gone the extra distance to add a message queuing infrastructure to endure data integrity.

It would seem like we are done, right? Not quite. If we widen the failure domain a little bit to something like a regional power outage or network outage, we suddenly have a new set of problems.

We can easily get into what many call “Analysis Paralysis.” This is where we spend so much time looking for the ultimate solution that we continually find reasons not to proceed. Hopefully, we also love agile and lean processes so we can proceed in an iterative fashion and continually revisit to attend to deficiencies and a feature backlog that can include failure domain mitigation.

When you look at your application or server designs, you may also see that extending outside of a geographical region for redundancy is a potential solution. Perhaps bursting to the cloud or to multiple clouds.

The point of our example was to highlight that we should be acutely aware of failure domains and scenarios as we architect our solution. Nobody wants to get caught out when the outage occurs and have to say, “I didn’t think of that.”

Failure domains and IBM Turbonomic

Failure domains should always be a top consideration when building any type of infrastructure. With proper identification and implementation of mitigation strategies, we can minimize the risk of downtime and ensure that our applications remain available. However, it’s crucial to recognize that mitigation can introduce new risks and complexities, and expanding our scope to include larger-scale outages is essential.

IBM Turbonomic offers a unique approach to identifying failure domains and mitigating risk. By leveraging AI-powered automation, IBM Turbonomic continuously analyzes a user’s infrastructure, including application demand, resource supply, and potential risks and vulnerabilities. By generating automatable actions in real-time, users can take a proactive approach to failure domains and resolve them without the need for manual intervention. With IBM Turbonomic, organizations can ensure the resilience and availability of their infrastructure, while minimizing the risk and complexities associated with failure domains.

Author

Dina Henderson

Product Marketing Manager

IBM Blog

Get started with IBM Turbonomic.