It would be great if we could build our systems in isolation, with no need to consider the outside world, and happily ignore these simple truths:
- Back-end systems (databases and legacy applications) are often unreliable.
- Hardware fails.
- Customers will use your Web site in ways you can’t imagine, and often in numbers you can’t anticipate.
- Software processes (like upgrades) often fail in unexpected ways.
- Code that you have come to rely on (open source or vendor software) can sometimes change in a completely unexpected way.
But the fact is we live in the real world, and whenever you design the architecture for a large-scale system, you step out onto a dangerous road, with yet another potential hazard waiting just around the bend. There is an old software architecture aphorism that says "bubbles never crash, but real systems do," meaning that designs work fine on paper, but actual running software frequently doesn’t. This is because our systems function in the real world and not in a realm of pure abstraction. So, just as drivers need to keep their eyes open, watch for road hazards, and compensate for other drivers doing dangerous things, system architects also need to plan for the possibility of disaster.
This series of articles advocates the practice of defensive development, a way of looking at system design, software process design, and system development in such a way that you won’t be caught unawares when things go wrong. Taking the driving analogy further, you could make these parallel definitions:
- Defensive driving: Driving to save lives, time, and money, in spite of the conditions around you and the actions of others (Wikipedia).
- Defensive development: Developing software to avoid downtime and prevent errors, in spite of the conditions around you and the actions of others.
To understand defensive development in practice, consider these three subject areas that together form the defensive software development process:
- Defensive architecture
- Defensive design
- Defensive programming
Beginning with the notion of defensive architecture, this article draws examples from IBM® WebSphere® Application Server, IBM WebSphere MQ, and other products to help you examine the principles of this element of defensive development.
Principles of defensive architecture
Whether you’re talking about systems based on WebSphere Application Server or one of the many products that are based on WebSphere Application Server (such as IBM WebSphere Portal, IBM WebSphere ESB, IBM WebSphere Process Server, and so on), one of the first things you do is determine the software architecture of your system. For the purpose of this article, we’ll define this as the way in which you map the desired features of your final system onto the features of the products you are considering, and then how these are mapped to the physical instances of the processes that execute at run time. This comprises both infrastructure architecture and application architecture, and both will be touched upon here. One of the final outputs of this process is usually a system topology that shows the different components of your solution and illustrates how those components map onto the actual physical machines (or virtual machines such as LPARs) in your production environment. Having defined that, you come to the first of three principles of defensive architecture:
1. Given that real systems (hardware and software) fail, avoid single points of failure
In theory, this first principle should be simple. One of the final outcomes of any architecture process is some type of visual graph (such as a UML deployment diagram) that represents the system topology. Any “singleton” on this graph could be a single point of failure. Detecting these singletons leads most of us to search for singleton nodes, but arcs can also be single points of failure.
Consider the following hypothetical section of such a diagram:
- Figure 1 shows a simple architecture that takes a user’s Web requests through a load balancer to IBM HTTP Server (IHS) and the WebSphere Application Server plug-in, then to a cluster of WebSphere Application Server instances, and finally to a back-end system. In this case, you can be comfortable that since the load balancer, IHS, WebSphere Application Server, and the back end are clustered, there are no single points of failure in the nodes of this diagram.
Figure 1. Hypothetical system design
- Now, look at the arcs. Unlike HTTP -- and despite the way it is drawn on the diagram -- WebSphere MQ is actually a subsystem, not a protocol. There is an additional piece of software (the queue manager) that sits between the sender and the receiver of a message that is not shown in this typical diagram. There are well-established clustering mechanisms for WebSphere MQ which are necessary for the arc shown in Figure 1 to be resilient.
- When you redraw the diagram to show all the components, as in Figure 2, the problem becomes clearer. Although this might seem to be a contrived example, it is in fact a common one. Misleading (or incomplete) diagrams are responsible for many architectural errors and omissions -- not just in the context of resilience, but also in capacity planning and many other areas.
Figure 2. Redrawn diagram to show all components
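One way to make the node-and-arc check mechanical is to write the topology down as data and flag every element that has fewer than two instances. The following is a minimal sketch in Python, not part of any WebSphere tooling; the `Element` type, the `find_single_points_of_failure` helper, and the component names and instance counts are all hypothetical and chosen only to resemble Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A node or an arc in the topology, with its number of redundant instances."""
    name: str
    kind: str        # "node" or "arc"
    instances: int


def find_single_points_of_failure(elements):
    """Return the names of every node or arc backed by fewer than two instances."""
    return [e.name for e in elements if e.instances < 2]


# Hypothetical topology: the queue manager that carries the MQ arc is listed
# explicitly instead of being hidden inside a line on the diagram.
topology = [
    Element("load balancer", "node", 2),
    Element("IHS + plug-in", "node", 2),
    Element("application server cluster", "node", 4),
    Element("HTTP link (load balancer -> IHS)", "arc", 2),
    Element("MQ queue manager (app server -> back end)", "arc", 1),
    Element("back-end system", "node", 2),
]

spofs = find_single_points_of_failure(topology)
print(spofs)
```

The value of the exercise lies less in the code than in being forced to list every arc as a component with its own instance count.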
Many architects focus on adding resiliency on the front end (HTTP) and forget to address the communication to their back-end systems. When dealing with software components, there are several ways you can address some of the more common single points of failure. For example:
- WebSphere MQ: Employ MQ clustering
- WebSphere internal JMS: Run the SIBus messaging engine in a clustered configuration
- WebSphere Application Server: Cluster your application servers and don’t forget to enable session persistence if your applications need it.
- Database: Use a hardware high availability (HA) solution, like HACMP, or use a solution like IBM DB2® clustering or Oracle RAC, and make sure that WebSphere Application Server (especially your data source configuration) is set up appropriately to handle clustering.
It is also important to consider aspects of your overall system design that are not represented in a topology diagram. Remember that the map is not the terrain. For example, when evaluating a set of servers, remember to consider whether they are all using the same power source. In other words, would it be possible for a power failure in one datacenter -- or worse, in one rack of blades -- to bring down all of your servers? If you are using a virtualization technology, such as LPARs on pSeries® servers, are all of the LPARs on a single server? If so, then you have HA on paper, but not in the real world. You must have a mapping from logical resilience to physical resilience, and you must be able to illustrate this by showing (in some diagram) not only the components of your architecture, but their physical placement as well.
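The mapping from logical to physical resilience can also be checked with a few lines of code once the placement of each cluster member is recorded. The sketch below assumes a simple record per member; the attribute names (`site`, `rack`, `power_feed`, `physical_server`), the `shared_fate` helper, and the sample values are hypothetical.

```python
def shared_fate(members, attributes=("site", "rack", "power_feed", "physical_server")):
    """Return {attribute: value} for every attribute on which all members agree.

    A cluster whose members all share one value for an attribute can be taken
    down by a single failure of that physical resource.
    """
    shared = {}
    for attr in attributes:
        values = {m[attr] for m in members}
        if len(values) == 1:
            shared[attr] = values.pop()
    return shared


# Hypothetical placement of two clustered application servers running on LPARs.
app_cluster = [
    {"name": "appsrv1", "site": "dc-east", "rack": "r12",
     "power_feed": "feed-A", "physical_server": "p570-01"},
    {"name": "appsrv2", "site": "dc-east", "rack": "r14",
     "power_feed": "feed-A", "physical_server": "p570-02"},
]

risks = shared_fate(app_cluster)
print(risks)
```

In this made-up placement the two servers sit on different physical machines and racks, yet they still share a site and a power feed, which is exactly the kind of "HA on paper" gap described above.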
You also need to consider the resiliency of your network, as there are a number of scenarios that can adversely affect your network’s reliability, especially where a WAN is concerned. For example, it would do you little good to have multiple Internet connections available to you if:
- They all route through the same satellite that goes out of commission because of a solar flare.
- They are all carried over the same cable or cable conduit -- which is physically broken by either the anchor from a passing ship or a farmer’s plow.
- All Internet links are slowed down due to heavily increased traffic caused by an Internet worm or virus.
Perhaps these examples are extreme -- they are all true stories of WANs gone bad, by the way -- but the point is that even though perfect resiliency is impossible, there is still much you can guard against. The question then becomes: how far do you go in addressing system resiliency? Or if you prefer: how far can you afford to go? You must carefully weigh all the risks involved and determine which ones to address to achieve the greatest benefit for the cost.
The more resilience you build into your system through solutions like clustering and physical backups, the more expensive your overall system will be to construct and maintain. You must account for the cost of planning, provisioning, developing, and testing your system. Of course, you should spend your budget wisely; you don’t want to plug an expensive mouse hole when the barn door is open, or spend all your HA money on one architecture layer and ignore the others. For instance, it’s generally not cost effective to implement a clustered application server solution for reliability if your database or back-end connection could be a single point of failure that will render all of the servers useless.
When developing a system architecture, record all of the points of failure as you identify them. Then, for each point of failure, write a short description about how you can handle the failure (or simply record the fact that this failure is one you’ve decided to live with). This list will be invaluable as you build and test your system. It will also enable you to prioritize the potential fixes. When prioritizing, consider three factors:
- The likelihood of the particular failure.
- The cost (or pain) of a particular failure scenario occurring.
- The cost of adding the necessary resilience.
Remember to include not only nodes and arcs on your topology diagrams, but groups of components too, such as those sharing a network, a power source, or an entire site.
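The list of failure points and the three prioritization factors can be kept in a small risk register. The article names the factors but not how to combine them, so the scoring below (expected loss divided by the cost of adding resilience) is only one possible assumption. The `FailurePoint` type and every number are hypothetical, and the mitigation cost is assumed to be greater than zero.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    likelihood: float        # estimated chance of the failure in a year (0..1)
    failure_cost: float      # estimated cost (or pain) of one occurrence
    mitigation_cost: float   # estimated cost of adding resilience (must be > 0)
    plan: str                # how the failure is handled, or "accepted"

    @property
    def expected_loss(self):
        return self.likelihood * self.failure_cost

    @property
    def benefit_ratio(self):
        """Expected loss avoided per unit of money spent on resilience."""
        return self.expected_loss / self.mitigation_cost


# Hypothetical register entries; the figures are illustrative only.
register = [
    FailurePoint("shared rack power feed", 0.05, 300_000, 50_000, "accepted"),
    FailurePoint("single MQ queue manager", 0.30, 200_000, 20_000, "MQ clustering"),
    FailurePoint("database server", 0.10, 500_000, 100_000, "HA database"),
]

# Highest benefit for the cost first.
ranked = sorted(register, key=lambda fp: fp.benefit_ratio, reverse=True)
for fp in ranked:
    print(f"{fp.name}: expected loss {fp.expected_loss:,.0f}, "
          f"ratio {fp.benefit_ratio:.2f}, plan: {fp.plan}")
```

A ratio like this is a conversation starter, not a verdict; a failure you have decided to live with still belongs in the register so that the decision is visible during testing.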
Now, on to the next principle of defensive architecture:
2. Be able to recover smoothly from the failure of any machine or process
Simply providing redundancies isn’t sufficient. You must ensure that your customers can tolerate the transition to your backup solution. Thus, having processes (preferably automatic) in place to deal with failure is an important part of reducing downtime and achieving your Service Level Agreements. As you consider your processes, ask these questions for each component in your architecture:
Will the failover between redundant components take place automatically? This is the single most important question to answer. If failover is automatic, how long does it take for the failover system to switch in? Some automatic failover mechanisms (like the HTTP routing in the WebSphere HTTP Server plug-in) engage very quickly when an application server fails (consider the ConnectTimeout parameter). For other parts of your architecture, such as the failover of a database server, the time may be longer.
If the failover of a component is not automatic, what manual steps must be taken? Can you partially automate the steps to make the process more efficient and less susceptible to errors? For example, can you use scripting, either at the OS level or at the product level (such as using WebSphere Application Server’s built-in administrative scripting feature) to reduce the time it takes to fail over? HACMP might also be useful for such components and can be configured either to perform lazy failover (which takes a little more time) or quick failover, depending on the requirements for that component and the cost of providing the necessary infrastructure and administration.
Upon a failure, will you lose any in-flight user activities? If so, is that acceptable to the business? If a user activity does not have a significant or measurable dollar value (for example, a user is registering on a site that is ad-supported), then perhaps losing a small percentage might be tolerable. However, if the dollar value to your business is particularly high for each activity, then you might need to take extraordinary measures to ensure no activities are lost. For example, you might need to take steps to preserve and recover from transaction logs, or you might want to consider putting intermediate transaction state in recoverable storage (like a database, or a store-and-forward messaging system like WebSphere MQ). Likewise, such stringent requirements could also drive the need to provide more highly available infrastructure components or links; WebSphere MQ, for example, can be configured in a highly available configuration given enough memory, networking capability, and so on.
How will the system perform in “backup” mode? Can you implement load-shedding mechanisms? For example, consider the simple problem where you have two machines in a site, each running at over 50% utilization. If one fails, then the entire site will likely suffer an outage since the remaining machine will be overloaded, unless you plan for shedding some of the load. Likewise (though not ideal), in many disaster recovery solutions, the backup site does not have the full capacity of the main site, thus requiring the overall solution to include some sort of load-shedding mechanism. (For example, you could use IBM WebSphere Virtual Enterprise to prioritize the application workload when capacity is constrained.) This can be done in several ways; you can either eliminate some users through restricting your Web site to allow only certain types of transaction (for example, on a brokerage site you could allow users to trade but not run analytics tools), or through rejecting certain users (for example, enable external users but not internal ones).
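The arithmetic behind the two-machine example is worth spelling out. The sketch below assumes identical machines, an even redistribution of the failed machine's load, and a hypothetical brokerage workload split by transaction type; the function names and the 60% starting utilization are illustrative assumptions, not figures from a real system.

```python
def surviving_utilization(utilizations, failed_index):
    """Utilization of each surviving machine if the failed machine's load is
    spread evenly across the survivors (assumes identical machines)."""
    survivors = [u for i, u in enumerate(utilizations) if i != failed_index]
    if not survivors:
        raise ValueError("no surviving machines")
    extra = utilizations[failed_index] / len(survivors)
    return [u + extra for u in survivors]


def fraction_to_shed(utilization, ceiling=1.0):
    """Fraction of offered load that must be rejected to stay at or below ceiling."""
    if utilization <= ceiling:
        return 0.0
    return 1.0 - ceiling / utilization


def classes_to_disable(workload, shed_fraction):
    """Disable the least important transaction types until enough load is shed.

    workload is a list of (name, share_of_total_load), most important first.
    """
    disabled, shed = [], 0.0
    for name, share in reversed(workload):
        if shed >= shed_fraction:
            break
        disabled.append(name)
        shed += share
    return disabled


# Two machines at 60% each: the survivor would need 120% of its capacity.
after_failure = surviving_utilization([0.6, 0.6], failed_index=0)
needed = fraction_to_shed(after_failure[0])

# Hypothetical brokerage workload, most important transaction type first.
workload = [("trade", 0.5), ("quotes", 0.3), ("analytics", 0.2)]
print(after_failure, round(needed, 3), classes_to_disable(workload, needed))
```

With these assumed numbers, about one sixth of the load has to go, and turning off the analytics tools alone is enough to keep trading available.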
How will the system perform during the transition? For example, it is possible that repeated operations trying to establish communication with a dead server could overload other parts (such as the DNS, or the deployment manager in WebSphere Application Server).
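One common way to keep repeated connection attempts from hammering a dead server, and whatever sits in front of it, is a circuit breaker. The article does not name this technique; it is offered here as one possible approach. The class below is a minimal single-threaded sketch, and `max_failures`, `reset_after`, and the injectable `clock` are hypothetical parameters chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period.

    After max_failures consecutive failures the breaker opens and calls fail
    fast, without touching the network, until reset_after seconds have passed.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call to failed server")
            # Cool-down elapsed: let one trial call through; a single
            # failure of that trial re-opens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Demo with a fake clock and a server that is always down.
attempts = []


def dead_server():
    attempts.append(1)
    raise ConnectionError("server is down")


fake_now = [0.0]
breaker = CircuitBreaker(max_failures=3, reset_after=30.0, clock=lambda: fake_now[0])

for _ in range(10):
    try:
        breaker.call(dead_server)
    except (ConnectionError, RuntimeError):
        pass

print(len(attempts))  # only the first 3 of 10 calls reached the dead server
```

A production version would need locking for concurrent callers and probably per-server state, but even this small version shows how the transition period can be kept from generating its own load.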
Simple component failure is not the only issue to consider. If the failure is a systemic one, then failing over to a duplicate component will not necessarily help the problem; if the root cause of the problem is not addressed, then the backup will fail for the same reasons the primary failed. For example, if an application server locks up under load because there are not enough database resources to fulfill outstanding requests, then killing the original application server and bringing up another application server will not solve the problem -- it may, in fact, amplify it. In general, you have to plan for these kinds of systemic resource issues and identify them at a higher level with adequate load testing.
Finally, it’s important that you have a "Plan C" for components that are deemed unreliable. In theory, it should be possible to work around any component that fails, but in practice, you likely only have the time and budget to work around those that are the most unreliable. Here are some suggestions for dealing with these cases:
If you have a component that fails (or seems to) on a regular basis, you might be able to get by implementing a process that reboots or restarts that component on a timed basis. For example, if your basic problem is a slow memory leak or a slow resource leak of another sort, then an effective workaround would be to reboot each server in a cluster on a regular schedule. (This doesn’t solve the problem but does mitigate the symptoms.) Likewise, you can take advantage of capabilities such as WebSphere Virtual Enterprise health policies for addressing specific issues (such as memory leaks or hung threads) to restart an application server that gets into a troubled state.
Your Plan C solution might include coding around an unreliable component with a manual "shunt" or bypass. For example, consider a hotel chain with several linked Web sites that all use the same central reservations engine. The reservations engine is less reliable than the rest of the site and prevents users from performing other tasks while it is down. One solution is to isolate the affected parts of the Web site with a ServletFilter and turn them off whenever problems are encountered. Users trying to make reservations can be redirected to use the phone reservation hotline, and other users are still able to complete other online activities.
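The hotel example uses a ServletFilter in a Java Web application. The sketch below shows the same shunt idea as a Python WSGI middleware, purely as an analog in a different stack. The `Shunt` class, the `/reservations` prefix, the message text, and the `Retry-After` value are hypothetical.

```python
class Shunt:
    """WSGI middleware that bypasses selected URL prefixes while a flag is set."""

    def __init__(self, app, prefixes, message):
        self.app = app
        self.prefixes = tuple(prefixes)
        self.message = message
        self.enabled = False   # flipped by an operator or a health check

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if self.enabled and path.startswith(self.prefixes):
            # Shunt active: answer here and never touch the unreliable engine.
            body = self.message.encode("utf-8")
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "600"),
            ])
            return [body]
        return self.app(environ, start_response)


def site(environ, start_response):
    """Stand-in for the rest of the Web site."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"ok"]


app = Shunt(site, ["/reservations"],
            "Online reservations are unavailable. Please call the reservation hotline.")


def get(path):
    """Call the WSGI app directly and return (status, body) for the demo."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return captured["status"], body


app.enabled = True
print(get("/reservations/new")[0])
print(get("/loyalty/points")[0])
```

Keeping the switch in one place means the bypass can be turned on and off without redeploying the application, which matters most when the reservations engine is failing at an inconvenient time.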
Your "Plan C" might involve manual entry or re-entry of data. If an automated part of your system is unavailable, it is important that any critical data entered by your users is captured either in a separate audit log (not part of the application error log) or in some other non-volatile storage so that it can be replayed or re-entered when necessary. An example here is a Web site that enables the entry of high-value sales information. The back-end system is often overloaded and prevents the sales staff from entering their quarterly data on time. A solution could be to separate the data entry part of the application from the processing part by introducing a WebSphere MQ queue between the two halves of the system. When the back-end processing starts to become overloaded, the queue persistently holds the data being entered into the front end of the system. As the back end recovers and takes up the slack, it can pull the previously entered data off the queue. Alternate methods of dealing with manual data entry or re-entry might require special interfaces for reading logs or for re-entering the data, so you need to plan for that as part of your development process.
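The decoupling described in the sales-entry example can be illustrated without any messaging product. The sketch below uses SQLite from the Python standard library as a stand-in for a persistent queue; it is not WebSphere MQ and offers none of its guarantees. The `DurableQueue` class, the table layout, and the sample records are hypothetical.

```python
import json
import sqlite3


class DurableQueue:
    """A tiny persistent FIFO queue backed by SQLite (a stand-in for a real message queue)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def put(self, record):
        """Front end: persist the user's entry before acknowledging it."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute("INSERT INTO entries (payload) VALUES (?)",
                            (json.dumps(record),))

    def process_next(self, handler):
        """Back end: handle the oldest entry; delete it only if the handler succeeds."""
        row = self.db.execute(
            "SELECT id, payload FROM entries ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return False
        entry_id, payload = row
        handler(json.loads(payload))   # an exception here leaves the entry queued
        with self.db:
            self.db.execute("DELETE FROM entries WHERE id = ?", (entry_id,))
        return True

    def depth(self):
        return self.db.execute("SELECT COUNT(*) FROM entries").fetchone()[0]


# ":memory:" keeps the demo self-contained; use a file path so entries survive a restart.
queue = DurableQueue(":memory:")
queue.put({"rep": "alice", "quarter": "Q3", "amount": 125000})
queue.put({"rep": "bob", "quarter": "Q3", "amount": 98000})


def overloaded_back_end(record):
    raise TimeoutError("back end is overloaded")


processed = []
try:
    queue.process_next(overloaded_back_end)
except TimeoutError:
    pass                               # the entry stays on the queue
while queue.process_next(processed.append):
    pass                               # back end has recovered: drain the queue
print(queue.depth(), [r["rep"] for r in processed])
```

The important property is the ordering of steps: the entry is stored before the user is told it was accepted, and it is removed only after the back end has processed it.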
Knowing what can possibly go wrong, and planning for recovery when it does, leads us to the final principle of defensive architecture:
3. If you don’t test it, it won’t work
A complete and well thought-out test plan is as important an architectural asset as any of the system deployment and topology diagrams discussed earlier. It adds to these documents by telling you not just how the system should work but how you know that it does work. To build your test plan, you need to consider how to test the failure of each individual component, as well as each group of components. For help with this, you can refer back to the list of failure points you captured earlier.
With testing, even more so than with architecture, you can’t cover all the bases. This makes this third principle particularly difficult to follow. The keys to success in testing include:
- Efficiency: Make sure you test as much as possible at the lowest level possible. This is generally accomplished in unit testing. You will never be able to achieve complete coverage of all failure paths. Instead, aim for the broadest coverage at the lowest level, and execute fewer full test cases if need be. Remember, unit test scaffolding is expensive, but the setup and debugging of full failure testing is much more so.
- Prioritization: Choose your testing based on the idea of "preventing the most pain at the least cost." You need to consider at least three factors:
  - The likelihood of a particular failure.
  - The cost (or pain) of a particular failure scenario not working.
  - The cost of testing that failure.
- Coverage: Test as many different types of failures as you can, then use the information you gathered to help you decide where to concentrate and test further. Try to test different error paths in each testing iteration. Development teams should learn from their mistakes.
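Testing a failure path at the lowest level usually means injecting the failure instead of waiting for it. The sketch below uses Python's `unittest` and `unittest.mock` to simulate a back end that stops responding. The `ReservationService` class and its caching fallback are hypothetical and exist only to give the tests something to exercise.

```python
import unittest
from unittest import mock


class ReservationService:
    """Hypothetical component: falls back to the last known answer when the back end is unreachable."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def availability(self, hotel_id):
        try:
            rooms = self.backend.rooms_available(hotel_id)
        except ConnectionError:
            # Failure path: serve the last known value, or None if there is none.
            return self.cache.get(hotel_id)
        self.cache[hotel_id] = rooms
        return rooms


class FailurePathTest(unittest.TestCase):
    def test_uses_cached_value_when_backend_is_down(self):
        backend = mock.Mock()
        backend.rooms_available.return_value = 7
        service = ReservationService(backend)
        self.assertEqual(service.availability("H1"), 7)

        # Inject the failure: every further call raises ConnectionError.
        backend.rooms_available.side_effect = ConnectionError("back end down")
        self.assertEqual(service.availability("H1"), 7)

    def test_returns_none_without_cached_value(self):
        backend = mock.Mock()
        backend.rooms_available.side_effect = ConnectionError("back end down")
        self.assertIsNone(ReservationService(backend).availability("H2"))


# Run the suite explicitly so the example also works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FailurePathTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these are cheap to run on every build, which leaves the expensive full failover tests for the scenarios that cannot be simulated at this level.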
Be sure to make a detailed list of all the cases you decided not to test. This list should encompass at least each function, what the risk is for failure, and how likely it would be to affect other components of the runtime (for example, could this function cause a complete site outage). This will be an important input to your final risk analysis and can play a major role in how applications are placed in the shared runtime.
Next, you need to consider performance testing and capacity planning for the error paths. Failure scenarios should be tested under heavy load. In practice, the most painful failures happen during times of peak load. In addition, the process of failing over (redistributing load) can add significant extra load to the system. Many operations plan only for the reduced capacity of the steady-state "failed-over" system, but not for the transition period. When conducting performance testing, be careful of over-reliance on rules like the Pareto principle (the 80/20 rule). Consider not only the most common use cases for performance testing, but high-value use cases as well. Likewise, when formulating your tests, be sure to test boundary cases also.
Just as important as deciding what to test is deciding when to test. You want to make sure that your system test plan leaves adequate time to correct (and re-test) problems before production. In general, you should begin performance and resilience testing no later than one third of the way through the project, even though this means you may be testing only a limited subset of the project functionality. Repeating these tests periodically will enable you to ensure that defensive measures can be implemented, and will help ensure a smooth implementation.
Finally, as you test, you should analyze your results, identify any suspect components, and develop pre-emptive coping strategies. This includes monitoring for signs of impending failure as well as determining if coping strategies like rebooting leaky processes will be effective.
This article examined some common software architecture problems, discussed some principles of defensive architecture, and reviewed some sample solutions for these kinds of occurrences. Continuing this exploration of defensive development, Part 2 of this series will examine issues around defensive design.
Many thanks to Keys Botzum, Tom Alcott, and Alex Polozoff for their helpful contributions, comments, and insights into this article.