What does reliability mean?
There are many varied definitions of reliability for software applications.
I would argue that the most important practical definition is this: the users of the system should be able to use it to achieve their objectives.
Strictly speaking, it doesn't matter if your app is down as long as nobody is trying to use it. And provided the app appears to work from the user's perspective, it doesn't matter if a subsystem they are unaware of happens to be offline at that time.
Equally, uptime is of no value if users can't perform expected operations because of errors in the application code.
These are important distinctions because they keep our focus on what really matters: whether users are able to get value from the system, not whether any given subsystem happens to be nominally operational.
Now let's look more closely at four categories of techniques that can help you design strong, reliable web-based applications:
- Reducing single points of failure.
- Managing how failovers will behave.
- Detecting more than just an app's heartbeat.
- Proactively avoiding failures through design, development, testing, and deployment strategies.
Reducing single points of failure
The first way you can increase reliability is to reduce the number of single points of failure, or remove them altogether. Cloud-based architectures make this a simpler process since they provide more abstract units of computing. With cloud computing, you don't need to worry about things like putting two network interface cards into each server and ensuring that the connections and cables are high quality and unlikely to be accidentally disconnected. Neither do you have to concern yourself with redundant array of independent disks (RAID) levels or with deciding which brands of hard disk are likely to have an acceptable mean time between failures.
The potential single points of failure vary depending on your application, but common areas to consider are:
- Domain Name System (DNS) servers
- Web servers
- Database servers
- File servers
- Load balancers
- Other specialized servers
- Third-party dependencies
Figure 1 outlines these common points of failure.
Figure 1. Common points of failure
Many organizations that use cloud computing services to host their applications still use separate DNS servers. It is important to make sure that you or your DNS service provider has multiple DNS servers in different data centers containing the DNS records for your application. It doesn't matter how robust your web hosting solution is if users can't resolve the DNS to access your app. Review how your DNS servers are configured to make sure that you don't have an easily and inexpensively fixable single point of failure.
There are a couple of key architectural best practices that you need to consider when you want to have multiple web servers and the ability to fail over between them. The first is that you're not going to want to use the local file system on your web servers for persisting any important information. For anyone used to writing applications for a single web server, this is the most common surprise. You have to assume that any given server could fail at any time, so if a user wants to upload a file or you want to log information, do it to a separate file system or database accessible by all of the web servers. In this way, a user can upload an image to one server but be able to access it from any of the other servers if the original server goes down.
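The idea can be sketched as follows. This is a minimal illustration, not a real storage layer: `SharedStore` is a hypothetical stand-in for whatever shared file system or database you use (an in-memory dictionary here, only so the example runs on its own), and `WebServer`, `handle_upload`, and `handle_download` are made-up names rather than part of any framework.

```python
class SharedStore:
    """Stand-in for a shared file system or database reachable by every web server."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class WebServer:
    """A web server that keeps no important state on its own local disk."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store
        self.alive = True

    def handle_upload(self, filename: str, data: bytes) -> None:
        self._require_alive()
        self.store.put(filename, data)  # persisted outside this server

    def handle_download(self, filename: str) -> bytes:
        self._require_alive()
        return self.store.get(filename)

    def _require_alive(self) -> None:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")


store = SharedStore()
server_a = WebServer("web-a", store)
server_b = WebServer("web-b", store)

server_a.handle_upload("avatar.png", b"fake image bytes")
server_a.alive = False  # the server that accepted the upload fails
image = server_b.handle_download("avatar.png")  # another server can still serve it
```

Because neither server holds the file itself, losing `web-a` costs capacity but not data. The shared store then becomes the component whose own redundancy you need to check.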
It's important to understand the requirements for your database server and pick an appropriate technology and scaling strategy. For example, if you have a content management system with many reads and relatively infrequent writes, you might be able to live with a single point of failure for writing to the database. In such a case, you might choose to implement a relational database with master-slave replication, in which there is only a single master node but the slaves can continue to serve up reads even if the master goes down. If both read and write reliability are important, you might want to consider the wide range of NoSQL data stores that are making it increasingly easy to distribute persistence across multiple nodes to improve both reliability and scalability.
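To make the read-heavy case concrete, here is a minimal sketch of routing writes to a single master and reads to its copies. Everything in it is an assumption for illustration: `Node`, `ReplicatedDatabase`, and `NodeDown` are hypothetical names, the nodes are in-memory dictionaries, and replication is done synchronously inside `write`, which real databases usually do differently.

```python
class NodeDown(Exception):
    """Raised when a database node cannot be reached."""


class Node:
    """One in-memory database node (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.rows = {}

    def read(self, key):
        if not self.alive:
            raise NodeDown(self.name)
        return self.rows.get(key)

    def write(self, key, value):
        if not self.alive:
            raise NodeDown(self.name)
        self.rows[key] = value


class ReplicatedDatabase:
    """Writes go to the single master; reads are served by any live replica."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def write(self, key, value):
        self.master.write(key, value)  # raises NodeDown if the master is offline
        for replica in self.replicas:  # simplified, synchronous replication
            if replica.alive:
                replica.write(key, value)

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except NodeDown:
                continue  # try the next replica
        raise NodeDown("no replica available")


db = ReplicatedDatabase(Node("master"), [Node("replica-1"), Node("replica-2")])
db.write("page:home", "Welcome")

db.master.alive = False        # the single point of failure for writes goes down
page = db.read("page:home")    # reads are still served by a replica
try:
    db.write("page:home", "Updated")
    write_failed = False
except NodeDown:
    write_failed = True        # writes are unavailable until the master returns
```

For a content management system, this trade-off may be acceptable: visitors keep reading pages while editors wait for the master to come back.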
With databases, it is also important to think about consistency requirements. Many "web-scale" applications are now trading immediate consistency between nodes for the greater reliability and performance of eventually consistent systems. With an eventually consistent system, there is no guarantee that different database nodes will respond with the same data at the same time. Obviously, for a banking application, seeing different account balances from different nodes at the same time would not be acceptable, but it is surprising how many systems can accept eventual consistency of data in return for better reliability and performance.
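A toy simulation can make the "different nodes, different answers" window easier to picture. This sketch assumes a hypothetical `EventuallyConsistentStore` in which a write lands on one node immediately and reaches the others only when `replicate` runs. It ignores conflict resolution and ordering, which real eventually consistent stores must handle.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes apply to one node now and to the others later."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]
        self.pending = deque()  # undelivered (target node, key, value) updates

    def write(self, node_index, key, value):
        self.nodes[node_index][key] = value
        for other in range(len(self.nodes)):
            if other != node_index:
                self.pending.append((other, key, value))

    def read(self, node_index, key):
        return self.nodes[node_index].get(key)

    def replicate(self):
        """Deliver every pending update, after which the nodes agree."""
        while self.pending:
            target, key, value = self.pending.popleft()
            self.nodes[target][key] = value


store = EventuallyConsistentStore(node_count=2)
store.write(0, "likes", 10)

stale = store.read(1, "likes")  # node 1 has not seen the write yet
store.replicate()
fresh = store.read(1, "likes")  # now both nodes return the same value
```

A "likes" counter that is briefly out of date on one node is usually harmless, whereas the same window on an account balance would not be.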
As mentioned earlier, when building reliable systems, you probably won't want to store important information on individual web servers. Instead, all of your web servers will use a separate shared file server for storing and retrieving static files. Make sure that whatever file server you use does not introduce a single point of failure, whether you use a cloud-based file system that inherently provides redundancy and failover or roll your own with load balancing and multiple file servers connecting to one or more synchronized file systems.
It is important to realize that load balancers might themselves be a single point of failure. Take the time to learn how load balancing is implemented in your cloud architecture, and make sure that if a single load-balancing node goes down, it will not take down the entire application.
Other specialized servers
One architectural best practice is to minimize coupling among subsystems in your application, making it easier to scale and/or re-architect any given subsystem without affecting the other subsystems. As your application grows, you often find yourself adding one or more servers for each of those subsystems. Common specialized servers include mail servers for sending out email campaigns, processing servers for doing things like video encoding or image manipulation, and reporting servers for providing richer reporting capabilities without overloading your production database servers.
The most important thing when architecting such systems is to determine how important their uptime is to the perceived uptime of the system. If you have an enterprise messaging app that uses email for delivery of a substantial proportion of your messages, the mail server is clearly going to be critical for the perception of uptime, so you're going to have to provide failover across multiple mail servers. If your mail server just sends occasional "forgotten password" or "status" emails, you might be able to accept the potential downtime of just provisioning a single mail server.
Often, the successful functioning of your application will depend on access to third-party services. If you provide the ability to log in via Facebook, then Facebook's OAuth servers are a point of failure for your system. If you provide searches of LexisNexis data, its portal is a failure point for your app. When deciding how best to manage the risks associated with such a potential point of failure, look at how critical the third-party service is to your users' experience, how real-time the interaction needs to be, and how likely the service is to go down.
For example, depending on a third-party site to provide authentication is a big risk. However, if that third party is Facebook, it might be a risk you can afford to take; such a large organization is less likely to have downtime than a smaller, less capable provider. If that is too much of a risk, make sure that your users can also log in using alternative credentials such as an email address and password, so they can still access the app even if Facebook's OAuth servers are unavailable (see Figure 2).
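One way to picture the alternative-credentials idea is a login path that falls back to a local check when the third-party provider cannot be reached. This is a rough sketch under stated assumptions: `ProviderUnavailable`, `login_with_fallback`, `local_login`, and the account data are all hypothetical, and no real OAuth library is called.

```python
from typing import Callable


class ProviderUnavailable(Exception):
    """Raised when the third-party identity provider cannot be reached."""


def login_with_fallback(third_party_login: Callable[[], str],
                        local_login: Callable[[], str]) -> str:
    """Return the user id, using local credentials if the provider is down."""
    try:
        return third_party_login()
    except ProviderUnavailable:
        return local_login()


# Plain-text passwords only keep the sketch short; store salted hashes in practice.
LOCAL_ACCOUNTS = {"ann@example.com": "s3cret"}


def local_login(email: str, password: str) -> str:
    if LOCAL_ACCOUNTS.get(email) != password:
        raise PermissionError("bad credentials")
    return email


def failing_oauth_login() -> str:
    # Simulates the provider's OAuth servers being unavailable.
    raise ProviderUnavailable("OAuth servers did not respond")


user = login_with_fallback(
    failing_oauth_login,
    lambda: local_login("ann@example.com", "s3cret"),
)
```

Only an outage triggers the fallback. A rejected login (`PermissionError` here) still fails, so the fallback does not weaken authentication.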
Figure 2. More common points of failure
At the other extreme, integrating with a third-party system to send emails may well be less risky: email is inherently asynchronous, so small amounts of downtime could go unnoticed. Longer downtimes, up to and including a vendor going unexpectedly out of business, are still a possibility. If you want to mitigate risk, make sure that you have a plan for replacing the third-party system with another provider or even with an in-house solution within an acceptable amount of time.
Managing how failovers will behave
It is important to think through how your failovers will work and test them to ensure that you get the outcomes you expect. Provided that you don't use sticky sessions, failover of the web tier is usually seamless if you use a well-designed load balancer. Database failover is something you will have to learn about for the persistence store you choose to implement, but most databases have good documentation on exactly what will happen when a node fails. Managing failover between interoperating subsystems is something you will need to plan for carefully. For example, if a web server sends a request to a mail server that accepts the request, but then fails before it sends the message, what will happen?
The simplest possible solution to such issues is to use a database as a shared blackboard between the various subsystems. Each subsystem updates the status of a job just before it works on it and just after it completes it. Timestamps make it easy to find "lost" jobs that got stuck in processing and pick them up on a regular basis, handing them off to another machine.
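The blackboard approach can be sketched with an in-memory SQLite database. The table layout, the status names, and the five-minute `STALE_AFTER_SECONDS` threshold are assumptions chosen for illustration; timestamps are passed in explicitly so the example is deterministic.

```python
import sqlite3

STALE_AFTER_SECONDS = 300  # assumed: a job stuck in 'processing' this long is "lost"


def create_jobs_table(conn):
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " updated_at REAL NOT NULL)"
    )


def add_job(conn, payload, now):
    cursor = conn.execute(
        "INSERT INTO jobs (payload, updated_at) VALUES (?, ?)", (payload, now)
    )
    return cursor.lastrowid


def mark(conn, job_id, status, now):
    """Record a status change just before or just after working on a job."""
    conn.execute(
        "UPDATE jobs SET status = ?, updated_at = ? WHERE id = ?",
        (status, now, job_id),
    )


def requeue_lost_jobs(conn, now):
    """Return stuck jobs to 'pending' so another machine can pick them up."""
    cursor = conn.execute(
        "UPDATE jobs SET status = 'pending', updated_at = ? "
        "WHERE status = 'processing' AND updated_at < ?",
        (now, now - STALE_AFTER_SECONDS),
    )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
create_jobs_table(conn)
job_id = add_job(conn, "send welcome email", now=0.0)
mark(conn, job_id, "processing", now=10.0)      # a worker claims the job, then crashes
requeued = requeue_lost_jobs(conn, now=1000.0)  # the periodic sweep finds it
status = conn.execute(
    "SELECT status FROM jobs WHERE id = ?", (job_id,)
).fetchone()[0]
```

A requeued job may already have been partly processed, so with this scheme the work itself should be safe to repeat.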
Another approach that is becoming the default these days is to use some kind of message queue that handles passing messages between subsystems and provides certain guarantees about the delivery of the messages (see Figure 3).
Figure 3. Failover without local storage
A wide range of message queues is available. Just make sure you understand exactly how the implementation you select works in the case of a message being received but not processed; and make sure that the message queue itself has well-understood behavior when the computer on which it is running fails.
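The "received but not processed" case is worth seeing in miniature. The sketch below is not any particular product's API: `AckQueue` is a hypothetical in-memory queue that hides a delivered message for `visibility_timeout` seconds and redelivers it if no acknowledgment arrives, which is one common way queues provide at-least-once delivery.

```python
import itertools
from collections import deque


class AckQueue:
    """In-memory sketch of at-least-once delivery with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._ready = deque()   # (message id, body) waiting to be delivered
        self._in_flight = {}    # message id -> (body, redelivery deadline)
        self._ids = itertools.count(1)

    def send(self, body):
        self._ready.append((next(self._ids), body))

    def receive(self, now):
        # Messages whose consumer never acknowledged them become visible again.
        for message_id, (body, deadline) in list(self._in_flight.items()):
            if now >= deadline:
                del self._in_flight[message_id]
                self._ready.append((message_id, body))
        if not self._ready:
            return None
        message_id, body = self._ready.popleft()
        self._in_flight[message_id] = (body, now + self.visibility_timeout)
        return message_id, body

    def ack(self, message_id):
        """Confirm that processing finished; the message is gone for good."""
        self._in_flight.pop(message_id, None)


queue = AckQueue(visibility_timeout=30)
queue.send("email user 42")

first = queue.receive(now=0)     # consumer A gets the message, then crashes
hidden = queue.receive(now=10)   # nothing to deliver while it is in flight
second = queue.receive(now=31)   # timeout passed: redelivered to consumer B
queue.ack(second[0])             # consumer B finishes and acknowledges
empty = queue.receive(now=100)   # the message is not delivered again
```

Note that this queue lives in one process's memory, so it is itself a single point of failure. That is exactly the behavior to check in whichever real queue you select.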
Detecting more than just an app's heartbeat
If you care about reliability, it is important to monitor and track the uptime of all of the elements of your system. You will want to have some kind of basic "heartbeat" monitoring on all of your servers to ensure that they are responsive, but you will also want to make sure that you have richer monitoring to validate that the mail server is actually sending emails successfully and that the file server can successfully respond with a known file when you ask for it.
It is especially important to think through the monitoring requests that you send to web servers to make sure they test the system comprehensively. Don't test that the home page loads if it is a static page and the rest of the site is dynamic, pulling information from a database. If you have an application that requires user authentication validated through your enterprise identity management system and then pulls data from your enterprise resource planning (ERP) system, make sure that your monitoring script logs in as a test user that you have added to the ERP system and that it confirms that data from the ERP system is being displayed correctly in the request response that the web server sends back. It's essential that you have this kind of end-to-end monitoring in addition to subsystem monitoring to ensure that the application is really working correctly.
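As a rough sketch of such an end-to-end check, the function below logs in as a test user, fetches a page, and confirms that a known record appears in the response. The `log_in` and `fetch_orders_page` callables, the test user name, and the `PO-1001` record are all hypothetical; in a real monitor they would wrap HTTP requests to your own application.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    detail: str


def end_to_end_check(log_in: Callable[[str, str], str],
                     fetch_orders_page: Callable[[str], str],
                     expected_text: str) -> CheckResult:
    """Log in as the test user and verify that known ERP data is displayed."""
    try:
        session = log_in("monitor-test-user", "not-a-real-password")
        page = fetch_orders_page(session)
    except Exception as error:  # any failure along the path counts as "down"
        return CheckResult(False, f"request failed: {error}")
    if expected_text not in page:
        return CheckResult(False, "page loaded but expected ERP data is missing")
    return CheckResult(True, "ok")


# Fakes standing in for the real application, so the sketch runs on its own.
def fake_log_in(username: str, password: str) -> str:
    return "session-token"


def healthy_page(session: str) -> str:
    return "<h1>Orders</h1><li>PO-1001</li>"


def erp_down_page(session: str) -> str:
    return "<h1>Orders</h1><p>No data available</p>"  # served, but empty


healthy = end_to_end_check(fake_log_in, healthy_page, "PO-1001")
degraded = end_to_end_check(fake_log_in, erp_down_page, "PO-1001")
```

The second case is the one a heartbeat misses: the web server answers normally, yet the user cannot see their data.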
Some companies such as Etsy are now taking these "end-to-end" tests even further. Etsy has well-known internal business metrics for how its site should be performing. If the application starts to perform outside of the expected bands in terms of business metrics (such as total sales within a given time period), the engineering team is notified so that they can figure out what is happening and fix any potential issues. Always be thinking about what really matters in your application, how you can track it, and how you can automatically notify your team if the app stops performing within expected bounds.
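A band check of this kind can be very small. The sketch below is only an illustration of the general idea, not a description of Etsy's system: `check_metric`, the metric name, and the band values are invented, and `notify` is whatever callable pages your team.

```python
def check_metric(name, observed, band, notify):
    """Notify the team if a business metric falls outside its expected band."""
    low, high = band
    if observed < low or observed > high:
        notify(f"{name} = {observed} is outside the expected band {low}..{high}")
        return False
    return True


alerts = []  # stands in for a paging or chat integration

normal = check_metric("sales per hour", 120, (80, 200), alerts.append)
too_low = check_metric("sales per hour", 12, (80, 200), alerts.append)
```

In practice, the expected band would likely vary by time of day and season rather than being a fixed pair of numbers.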
Also consider automatically tracking signals such as the number of bug reports, so that if you get a spike in a short period of time, your system can notify the team that there may be an issue with the application.
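Spike detection can be approximated with a sliding time window. In this sketch, `SpikeDetector`, the ten-minute window, and the threshold of three reports are assumptions for illustration; timestamps are plain numbers of seconds so the example is deterministic.

```python
from collections import deque


class SpikeDetector:
    """Flags a spike when enough events arrive within a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one bug report; return True if the window now holds a spike."""
        self._timestamps.append(timestamp)
        # Drop reports that have aged out of the window.
        while self._timestamps[0] <= timestamp - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


detector = SpikeDetector(window_seconds=600, threshold=3)
results = [detector.record(t) for t in (0, 1000, 1100, 1200, 1250)]
```

Here the isolated report at time 0 has aged out before the later cluster arrives, so only the third and fourth reports of that cluster trip the detector.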
Proactively avoiding failures
Some failures are unavoidable, but often, with a well-designed, fault-tolerant system, a substantial proportion of any downtime is the result of errors in the application code. It doesn't matter whether the app is up and responsive: If some of your users can't complete the tasks they want to using your application, then for them the site is effectively down. As a result, it's also important to have a good strategy for designing, writing, testing, and deploying code to maximize the functional uptime of your application.
One of the most effective ways to ensure that your applications are well designed and provide the functionality you expect is to have your development team use test-driven development (TDD). This process both confirms correctness of the code and substantially improves the suppleness and quality of the design.
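For readers new to TDD, here is a very small illustration using Python's standard `unittest` module. The `parse_quantity` function and its rules are hypothetical; in a test-first workflow, the two test methods would be written before the function body existed and would fail until it was implemented.

```python
import unittest


def parse_quantity(text: str) -> int:
    """Parse a positive whole-number quantity from user input."""
    value = int(text.strip())
    if value <= 0:
        raise ValueError("quantity must be positive")
    return value


class ParseQuantityTest(unittest.TestCase):
    def test_accepts_surrounding_whitespace(self):
        self.assertEqual(parse_quantity(" 3 "), 3)

    def test_rejects_zero(self):
        with self.assertRaises(ValueError):
            parse_quantity("0")
```

Saved to a file, these tests can be run with `python -m unittest`, and they stay in place as a regression check every time the code changes.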
One of the most important maxims in software development is "if it hurts, do it more often." If it is really difficult to run integration tests on a monthly basis, commit to running them weekly, automating the process until eventually it just requires a single click and you can run them every time a developer checks code into your version control system.
One of the most important technical books of 2010 was Continuous Delivery by Jez Humble and Dave Farley. It explains how to create a robust, well-designed deployment pipeline; if you are interested in increasing the reliability of your applications, this book is a must read. Some of the concepts include blue-green deployments (where you have two almost identical production environments that you can use for zero-downtime releases and rollbacks) and canary deployments (which allow you to roll out new code to a subset of users to test before rolling it out to your entire user base). With feature toggles, you can even deliver functionality to programmatically chosen users, allowing you to easily test features with various groups of users.
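One simple way to choose users programmatically is to hash the user ID into a stable bucket and enable the feature for buckets below a rollout percentage. This is a sketch of that general technique, not something taken from the book: `rollout_bucket`, `is_enabled`, and the `new-checkout` feature name are made up for illustration.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for the given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(feature: str, user_id: str, percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(feature, user_id) < percent


users = [f"user-{n}" for n in range(1000)]

# Canary: roughly 10 percent of users see the new code path first.
canary = [u for u in users if is_enabled("new-checkout", u, 10)]

# Widening the rollout keeps every canary user enabled.
wider = [u for u in users if is_enabled("new-checkout", u, 50)]
```

Because the bucket depends only on the feature name and user ID, a given user gets the same answer on every server and every request, and raising the percentage only ever adds users.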
It is easier than ever to create highly reliable applications using cloud-based infrastructure, but it is important to think through all of the points of failure and ensure both that you have a strategy for handling failure and that no important data gets lost during failovers. It is also essential that you think about approaches like TDD, continuous integration, and continuous delivery to ensure that your applications consistently work as expected. It doesn't matter whether your servers are up if your users can't complete the tasks they need to because of bugs in your production code.
- Wikipedia has a good article introducing the CAP theorem, which shows the importance of eventual consistency for delivering highly available systems.
- Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation by Jez Humble and Dave Farley (Addison-Wesley, 2010) provides a comprehensive overview of proven approaches to reducing the risk of deploying to production on a regular basis.
- Martin Fowler writes on his Bliki about blue-green deployments as well as feature toggles and canary releases.
- The Etsy Code as Craft blog provides a wide range of articles of interest to anyone responsible for designing reliable systems.