Manage failure points in cloud application design

Four techniques to ensure the reliability of your cloud applications

As the ubiquity of web-based applications increases, reliability of those applications becomes an increasingly critical requirement. In this article, the author examines what application reliability really means in a cloud-based world and explores a range of approaches — reduction, management, detection, and avoidance — to improving the uptime of your cloud applications.


Peter Bell, CTO, PowWow

Peter Bell is the CTO of PowWow, a lean startup in New York City. He presents internationally and writes extensively on cloud computing, domain-specific languages, agile architecture, NoSQL, and requirements and estimating. He has presented at a range of conferences, including DLD Conference, OOPSLA, Code Generation, Practical Product Lines, the British Computer Society Software Practices Advancement conference, UberConf, the Rich Web Experience, and the No Fluff Just Stuff tour. He has been published in IEEE Software, Dr. Dobb's, IBM developerWorks, Information Week, Methods & Tools, Mashed Code, NFJS the Magazine, JSMag, and GroovyMag. He is also a regular instructor at General Assembly, a campus for technology, design, and entrepreneurship in New York.

24 February 2012


What does reliability mean?

There are many definitions of reliability for software applications. I would argue that the most important practical definition is this:
The users of the system should be able to use that system to achieve their objectives.

Strictly speaking, it doesn't matter if your app is down as long as nobody is trying to use it. And provided the app appears to work from the user's perspective, it doesn't matter if a subsystem they are unaware of happens to be offline at that time.

Equally, uptime is of no value if users can't perform expected operations because of errors in the application code.

These are important distinctions because they keep our focus on what really matters: Whether users are able to get value from the system, not whether any given subsystem happens to be nominally operational.

Now let's look more closely at four categories of techniques that can help you design strong, reliable web-based applications:

  • Reducing single points of failure.
  • Managing how failovers will behave.
  • Detecting more than just an app's heartbeat.
  • Proactively avoiding failures through design, development, testing, and deployment strategies.

Reducing single points of failure

The first way you can increase reliability is to reduce the number of single points of failure, or remove them entirely. Cloud-based architectures make this a simpler process because they provide more abstract units of computing. With cloud computing, you don't need to worry about things like putting two network interface cards into each server and ensuring that the connections and cables are high quality and unlikely to be accidentally disconnected. Neither do you have to concern yourself with redundant array of independent disks (RAID) levels or with deciding which brands of hard disk are likely to have an acceptable mean time between failures.

The potential single points of failure vary depending on your application, but common areas to consider are:

  • Domain Name System (DNS) servers
  • Web servers
  • Database servers
  • File servers
  • Load balancers
  • Other specialized servers
  • Third-party dependencies

Figure 1 outlines these common points of failure.

Figure 1. Common points of failure

DNS servers

Many organizations that use cloud computing services to host their applications still use separate DNS servers. It is important to make sure that you or your DNS service provider has multiple DNS servers in different data centers containing the DNS records for your application. It doesn't matter how robust your web hosting solution is if users can't resolve the DNS to access your app. Review how your DNS servers are configured to make sure that you don't have an easily and inexpensively fixable single point of failure.

Web servers

There are a couple of key architectural best practices to consider when you want to have multiple web servers and the ability to fail over between them. The first is that you should not use the local file system on your web servers to persist any important information. For anyone used to writing applications for a single web server, this is the most common surprise. You have to assume that any given server could fail at any time, so if a user wants to upload a file or you want to log information, write it to a separate file system or database accessible by all of the web servers. In this way, a user can upload an image to one server and still access it from any of the other servers if the original server goes down.

Second, you have to think about how you want to persist session-based information. For small amounts of data, you could use cookie-based storage, but be wary of storing more than an ID in a cookie because cookies are not secure from tampering by users and they have a maximum size of 4KB. More commonly, you will take one of two approaches. You can use local session storage on each web server, with a load balancer using sticky sessions so that users within a session are consistently returned to the same web server. Or you can store all session state on a separate session server that uses an in-memory cache or a key-value data store to persist the session information. Storing session state on each web server and using sticky sessions is generally more performant because it reduces the number of network trips required to fulfill a request; but if a web server goes down, all the users on that server will lose their sessions. When reliability is critical, it's often worth using a separate server for session storage.
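
As a minimal sketch of the cookie option, the following keeps only a session ID in the cookie and signs it with an HMAC so that any web server can detect tampering. The names SECRET_KEY, sign_session_id, and verify_cookie are illustrative assumptions, not part of any particular framework.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical secret: in practice it would come from configuration and be
# shared by every web server so that any of them can verify a cookie.
SECRET_KEY = b"replace-with-a-real-secret"


def sign_session_id(session_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a cookie value of the form '<session_id>.<hex signature>'."""
    signature = hmac.new(key, session_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{session_id}.{signature}"


def verify_cookie(cookie_value: str, key: bytes = SECRET_KEY) -> Optional[str]:
    """Return the session ID if the signature matches, otherwise None."""
    session_id, separator, signature = cookie_value.rpartition(".")
    if not separator:
        return None  # no signature present at all
    expected = hmac.new(key, session_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return session_id
    return None
```

The session data itself would still live in a shared store keyed by this ID; the signature only guarantees that the ID was issued by your own servers.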

Database servers

It's important to understand the requirements for your database server and pick an appropriate technology and scaling strategy. For example, if you have a content management system with many reads and relatively infrequent writes, you might be able to live with a single point of failure for writing to the database. In such a case, you might choose to implement a relational database with master-slave replication, in which there is only a single master node but the slaves can continue to serve up reads even if the master goes down. If both read and write reliability are important, you might want to consider the wide range of NoSQL data stores that are making it increasingly easy to distribute persistence across multiple nodes to improve both reliability and scalability.
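
The read-heavy case can be sketched as a small router that sends writes to the single master (called primary in the code) and reads to any healthy replica. This is only an illustration under assumed names (Node, ReplicatedDatabase, NoHealthyNodeError); a real database driver or proxy would handle health checks and connections for you.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool = True


class NoHealthyNodeError(RuntimeError):
    """Raised when no node can serve the requested operation."""


class ReplicatedDatabase:
    def __init__(self, primary: Node, replicas: List[Node]) -> None:
        self.primary = primary
        self.replicas = replicas

    def node_for_write(self) -> Node:
        # Writes still have a single point of failure: the primary.
        if not self.primary.healthy:
            raise NoHealthyNodeError("primary is down; writes are unavailable")
        return self.primary

    def node_for_read(self) -> Node:
        # Reads can be served by any healthy replica, or by the primary
        # as a last resort, so they survive the loss of the primary.
        candidates = [node for node in self.replicas if node.healthy]
        if candidates:
            return random.choice(candidates)
        if self.primary.healthy:
            return self.primary
        raise NoHealthyNodeError("no healthy node available for reads")
```

With the primary marked unhealthy, node_for_read still returns a replica while node_for_write raises, which is exactly the trade-off described for a content management system.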

With databases, it is also important to think about consistency requirements. Many "web-scale" applications are now trading immediate consistency between nodes for the greater reliability and performance of eventually consistent systems. With an eventually consistent system, there is no guarantee that different database nodes will respond with the same data at the same time. Obviously, for a banking application, seeing different account balances from different nodes at the same time would not be acceptable, but it is surprising how many systems can accept eventual consistency of data in return for better reliability and performance.
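
To make the idea concrete, here is a toy model of eventual consistency, not a real data store: a write lands on one node immediately and reaches the other nodes only when replication runs. The class and method names are assumptions chosen for the illustration.

```python
from typing import Dict, List, Optional, Tuple


class EventuallyConsistentStore:
    """Toy model: writes land on one node and reach the others later."""

    def __init__(self, node_count: int = 2) -> None:
        self.nodes: List[Dict[str, int]] = [{} for _ in range(node_count)]
        # Queued updates of (target node index, key, value) not yet applied.
        self.pending: List[Tuple[int, str, int]] = []

    def write(self, key: str, value: int, node: int = 0) -> None:
        self.nodes[node][key] = value
        for other in range(len(self.nodes)):
            if other != node:
                self.pending.append((other, key, value))

    def read(self, key: str, node: int) -> Optional[int]:
        return self.nodes[node].get(key)

    def replicate(self) -> None:
        """Apply all queued updates, as a background process eventually would."""
        for target, key, value in self.pending:
            self.nodes[target][key] = value
        self.pending.clear()
```

Between write and replicate, two readers asking different nodes for the same key get different answers. Whether that window is acceptable is the question to ask of each piece of data in your application.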

File servers

As mentioned earlier, when building reliable systems, you probably won't want to store important information on individual web servers. Instead, all of your web servers will use a separate shared file server for storing and retrieving static files. Make sure that whatever file server you use does not itself become a single point of failure, whether you use a cloud-based file system that inherently provides redundancy and failover or roll your own with load balancing and multiple file servers connecting to one or more synchronized file systems.

Load balancers

It is important to realize that load balancers might themselves be a single point of failure. Take the time to learn how load balancing is implemented in your cloud architecture, and make sure that if a single load-balancing node goes down, it will not take down the entire application.

Other specialized servers

One architectural best practice is to minimize coupling among subsystems in your application, making it easier to scale and/or re-architect any given subsystem without affecting the other subsystems. As your application grows, you often find yourself adding one or more servers for each of those subsystems. Common specialized servers include mail servers for sending out email campaigns, processing servers for doing things like video encoding or image manipulation, and reporting servers for providing richer reporting capabilities without overloading your production database servers.

The most important thing when architecting such systems is to determine how important their uptime is to the perceived uptime of the system. If you have an enterprise messaging app that uses email for delivery of a substantial proportion of your messages, the mail server is clearly going to be critical for the perception of uptime, so you're going to have to provide failover across multiple mail servers. If your mail server just sends occasional "forgotten password" or "status" emails, you might be able to accept the potential downtime of just provisioning a single mail server.

Third-party dependencies

Often, the successful functioning of your application will depend on access to third-party services. If you provide the ability to log in via Facebook, then Facebook's OAuth servers are a point of failure for your system. If you provide searches of LexisNexis data, its portal is a failure point for your app. This is one of the areas where you need to look at how critical the third-party service is to your users' experience, how real-time the interaction needs to be, and how likely the service is to go down when deciding how best to manage the risks associated with the potential point of failure.

For example, depending on a third-party site to provide authentication is a big risk. However, if that third party is Facebook, it might be a risk you can afford to take; such a large organization is less likely to have downtime than a smaller, less capable provider. If that is too much of a risk, make sure that your users can also log in using alternative credentials such as an email address and password, so they can still access the app even if Facebook's OAuth servers are unavailable (see Figure 2).

Figure 2. More common points of failure

At the other extreme, integrating with a third-party system to send emails may well be less risky: because emails are inherently asynchronous, small amounts of downtime could go unnoticed. Longer downtimes, up to and including a vendor going unexpectedly out of business, are still a possibility. If you want to mitigate that risk, make sure that you have a plan for replacing the third-party system with another provider or even with an in-house solution within an acceptable amount of time.

Managing failure

It is important to think through how your failovers will work and test them to ensure that you get the outcomes you expect. Provided that you don't use sticky sessions, failover of the web tier is usually seamless if you use a well-designed load balancer. Database failover is something you will have to learn about for the persistence store you choose to implement, but most databases have good documentation on exactly what will happen when a node fails. Managing failover between interoperating subsystems is something you will need to plan for carefully. For example, if a web server sends a request to a mail server that accepts the request, but then fails before it sends the message, what will happen?

The simplest possible solution to such issues is to use a database as a shared blackboard between the various subsystems. Each subsystem updates the status of a job just before it works on it and just after it completes it. Timestamps make it easy to find "lost" jobs that got stuck in processing and pick them up on a regular basis, handing them off to another machine.
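
A minimal sketch of such a blackboard, using SQLite purely for illustration: each worker stamps a job when it starts and when it finishes, and a periodic sweep returns to the queue any job that has been in processing for too long. The table layout, function names, and the five-minute threshold are all assumptions.

```python
import sqlite3
import time
from typing import List, Optional

STUCK_AFTER_SECONDS = 300.0  # assumed threshold; tune to your longest normal job


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               started_at REAL,
               finished_at REAL)"""
    )


def start_job(conn: sqlite3.Connection, job_id: int, now: Optional[float] = None) -> None:
    """Record that a worker has picked up the job, just before working on it."""
    now = time.time() if now is None else now
    conn.execute(
        "UPDATE jobs SET status = 'processing', started_at = ? WHERE id = ?",
        (now, job_id),
    )
    conn.commit()


def finish_job(conn: sqlite3.Connection, job_id: int, now: Optional[float] = None) -> None:
    """Record completion, just after the work is done."""
    now = time.time() if now is None else now
    conn.execute(
        "UPDATE jobs SET status = 'done', finished_at = ? WHERE id = ?",
        (now, job_id),
    )
    conn.commit()


def requeue_stuck_jobs(
    conn: sqlite3.Connection,
    now: Optional[float] = None,
    stuck_after: float = STUCK_AFTER_SECONDS,
) -> List[int]:
    """Return 'lost' jobs to the pending state so another machine can take them."""
    now = time.time() if now is None else now
    cutoff = now - stuck_after
    rows = conn.execute(
        "SELECT id FROM jobs WHERE status = 'processing' AND started_at < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    stuck_ids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE jobs SET status = 'pending', started_at = NULL WHERE id = ?",
        [(job_id,) for job_id in stuck_ids],
    )
    conn.commit()
    return stuck_ids
```

Note that a requeued job may already have been partly processed by the failed machine, so the work itself should be safe to repeat.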

Another approach that is becoming the default these days is to use some kind of message queue that handles passing messages between subsystems and provides certain guarantees about the delivery of the messages (see Figure 3).

Figure 3. Failover without local storage

A wide range of message queues is available. Just make sure you understand exactly how the implementation you select works in the case of a message being received but not processed; and make sure that the message queue itself has well-understood behavior when the computer on which it is running fails.
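
One common answer to the "received but not processed" case is acknowledgement with a visibility timeout: a message handed to a consumer is hidden rather than deleted, and reappears if the consumer does not acknowledge it in time. The in-memory AckQueue below is a hypothetical sketch of that behavior, not the API of any real message queue product.

```python
import itertools
from collections import deque
from typing import Any, Deque, Dict, Optional, Tuple


class AckQueue:
    """In-memory illustration of at-least-once delivery."""

    def __init__(self, visibility_timeout: float = 30.0) -> None:
        self.visibility_timeout = visibility_timeout
        self._ids = itertools.count(1)
        self._ready: Deque[Tuple[int, Any]] = deque()
        # message id -> (body, time at which it becomes visible again)
        self._in_flight: Dict[int, Tuple[Any, float]] = {}

    def send(self, body: Any) -> int:
        message_id = next(self._ids)
        self._ready.append((message_id, body))
        return message_id

    def receive(self, now: float) -> Optional[Tuple[int, Any]]:
        """Hand out the next message, hiding it until it is acked or times out."""
        self._redeliver_expired(now)
        if not self._ready:
            return None
        message_id, body = self._ready.popleft()
        self._in_flight[message_id] = (body, now + self.visibility_timeout)
        return message_id, body

    def ack(self, message_id: int) -> None:
        """Confirm successful processing; the message is gone for good."""
        self._in_flight.pop(message_id, None)

    def _redeliver_expired(self, now: float) -> None:
        expired = sorted(
            message_id
            for message_id, (_, deadline) in self._in_flight.items()
            if deadline <= now
        )
        for message_id in expired:
            body, _ = self._in_flight.pop(message_id)
            self._ready.append((message_id, body))
```

If a mail server receives a message and crashes before calling ack, the message becomes available again after the timeout and another server can send it. The cost is that a message may occasionally be processed twice, so consumers should tolerate duplicates.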

Detecting failure

If you care about reliability, it is important to monitor and track the uptime of all of the elements of your system. You will want to have some kind of basic "heartbeat" monitoring on all of your servers to ensure that they are responsive, but you will also want to make sure that you have richer monitoring to validate that the mail server is actually sending emails successfully and that the file server can successfully respond with a known file when you ask for it.

It is especially important to think through the monitoring requests that you send to web servers to make sure they test the system comprehensively. Don't test that the home page loads if it is a static page and the rest of the site is dynamic, pulling information from a database. If you have an application that requires user authentication validated through your enterprise identity management system and then pulls data from your enterprise resource planning (ERP) system, make sure that your monitoring script logs in as a test user that you have added to the ERP system and that it confirms that data from the ERP system is being displayed correctly in the request response that the web server sends back. It's essential that you have this kind of end-to-end monitoring in addition to subsystem monitoring to ensure that the application is really working correctly.
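
The shape of such an end-to-end check might look like the sketch below. The log_in and fetch_page callables stand in for whatever HTTP client and login flow your application uses, and the test-user credentials and expected text are placeholders; all of these names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def end_to_end_check(
    log_in: Callable[[str, str], str],
    fetch_page: Callable[[str], str],
    expected_text: str,
) -> CheckResult:
    """Log in as a test user, request a dynamic page, and confirm that
    known back-end data appears in the response."""
    try:
        # Placeholder credentials for a dedicated monitoring account.
        session = log_in("monitor-test-user", "monitor-test-password")
        body = fetch_page(session)
    except Exception as exc:  # any failure along the path counts as "down"
        return CheckResult("end-to-end", False, f"request failed: {exc}")
    if expected_text not in body:
        return CheckResult("end-to-end", False, "expected data missing from response")
    return CheckResult("end-to-end", True)
```

A page that loads but lacks the expected data is reported as a failure, which is the case a simple heartbeat or static home page check would miss.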

Some companies such as Etsy are now taking these "end-to-end" tests even further. Etsy has well-known internal business metrics for how its site should be performing. If the application starts to perform outside of the expected bands in terms of business metrics (such as total sales within a given time period), the engineering team is notified so that they can figure out what is happening and fix any potential issues. Always be thinking about what really matters in your application, how you can track it, and how you can automatically notify your team if the app stops performing within expected bounds.
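
A band check of this kind can be very small. The sketch below is not Etsy's implementation; ExpectedBand, metric_alert, and the example metric name are assumptions, and the band limits would come from your own historical data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedBand:
    low: float
    high: float


def metric_alert(name: str, value: float, band: ExpectedBand) -> Optional[str]:
    """Return an alert message if the value falls outside the band, else None."""
    if value < band.low:
        return f"{name} is below the expected band: {value} < {band.low}"
    if value > band.high:
        return f"{name} is above the expected band: {value} > {band.high}"
    return None
```

For example, metric_alert("sales_per_hour", 20.0, ExpectedBand(100.0, 500.0)) returns a message that could be forwarded to the engineering team, while a value inside the band returns None.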

Also, look at automated tracking of things, like the number of bug reports, so if you get a spike in a short period of time your system can automatically notify the team that there may be an issue with the application.
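
One simple way to detect such a spike is a sliding time window: count the reports received in the last N seconds and flag when the count exceeds a threshold. The SpikeDetector class and its parameters are illustrative assumptions.

```python
from collections import deque
from typing import Deque


class SpikeDetector:
    """Flags when more than `threshold` events arrive within `window_seconds`."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one bug report; return True if the team should be notified."""
        self._events.append(timestamp)
        # Drop reports that have fallen out of the window.
        while self._events and self._events[0] <= timestamp - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.threshold
```

The sketch assumes timestamps arrive in increasing order, which holds when reports are recorded as they come in.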

Avoiding failure

Some failures are unavoidable, but in a well-designed, fault-tolerant system, a substantial proportion of any downtime is often the result of errors in the application code. It doesn't matter whether the app is up and responsive: If some of your users can't complete the tasks they want to complete using your application, then for them the site is effectively down. As a result, it's also important to have a good strategy for designing, writing, testing, and deploying code to maximize the functional uptime of your application.

Test-driven development

One of the most effective ways to ensure that your applications are well designed and provide the functionality you expect is to have your development team use test-driven development (TDD). This process both confirms correctness of the code and substantially improves the suppleness and quality of the design.

Continuous integration

One of the most important maxims in software development is "if it hurts, do it more often." If it is really difficult to run integration tests on a monthly basis, commit to running them weekly, automating the process until eventually it just requires a single click and you can run them every time a developer checks code into your version control system.

Continuous deployment

One of the most important technical books of 2010 was Continuous Delivery by Jez Humble and Dave Farley. It explains how to create a robust, well-designed deployment pipeline; if you are interested in increasing the reliability of your applications, this book is a must-read. Some of the concepts include blue-green deployments (where you have two almost identical production environments that you can use for zero-downtime releases and rollbacks) and canary deployments (which allow you to roll out new code to a subset of users to test before rolling it out to your entire user base). With feature toggles, you can even deliver functionality to programmatically chosen users, allowing you to easily test features with various groups of users.
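
A feature toggle for a percentage rollout can be sketched with a stable hash, so that a given user always falls in the same bucket and stays in the rollout as the percentage grows. The function name in_rollout and the feature and user identifiers are hypothetical; this is one possible approach, not the one described in the book.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user is among the `percent` of users who get the feature."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash feature and user together so each feature gets its own split of users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return bucket < percent
```

Raising percent from 5 to 25 to 100 widens the audience without ever removing the feature from a user who already had it, and setting it back to 0 acts as an immediate off switch.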

In conclusion

It is easier than ever to create highly reliable applications using cloud-based infrastructure, but it is important to think through all of the points of failure and ensure that you have a strategy for handling failure and that no important data gets lost during failovers. It is also essential that you think about approaches like TDD, continuous integration, and continuous delivery to ensure that your applications consistently work as expected. It doesn't matter whether your servers are up if your users can't complete the tasks they need to because of bugs in your production code.
