When bad things happen to good systems
Software solution resiliency refers to the ability of a solution to absorb the impact of a problem in one or more parts of a system, while continuing to provide an acceptable service level to the business. Two significant components in this definition of resiliency are the impact of the problem and the service level that is considered acceptable when that problem occurs.
In an ideal world, a resilient system would deal with any problem without negative impact, but even with proper design and testing it's likely that some users and requests will be impacted by a failure. If a machine hosting the system (or a system component) crashes, any requests that are “in flight” on that machine are moved to another machine - as transparently as possible to the users. More profound in impact is a catastrophic failure in a data center, which results in all the work that was being processed by that data center being continued by another data center - again as transparently as possible to the users, although in the event of a catastrophic outage you should be prepared for a significant impact. The goal in this case is to minimize the duration and impact. For example, the service level in the event of a machine failure might be the resumption of service within a few minutes, whereas for a data center failure it might be within a few hours.
What this example shows is that the expected service level depends on the nature of the problem. A widespread catastrophic failure in a data center usually has a more negative impact than a machine failure.
Another class of problems relevant to solution resiliency is what happens when a resource is available, but has a performance problem. You might ask the question: Is this resource critical to all requests or to a subset of requests? Another question might be: Is it a non-critical resource?
For a distributed software solution to be resilient, all prerequisite layers supporting this solution must be resilient. These layers are typically the networks, firewalls, load balancers, network switches, operating systems, installed software (such as IBM WebSphere Application Server), and the hosted software solution that might be distributed over a large number of hosts.
This article provides some general resiliency guidelines. Future articles will focus on the WebSphere software solution resiliency in a specific data center. These articles will not address the problems associated with widespread catastrophic failures and the mitigation that results in a backup data center assuming the work of the failed data center.
Software solution resiliency tuning must lead to a solution that is less impacted by environmental changes. For example, if a WebSphere Application Server cluster member fails, the rest of the cluster members will be able to process requests as if no failure had occurred. The impact of the cluster member failure is only temporary.
This section provides key solution resiliency guidelines:
1. Develop resiliency test cases
The objective of resiliency test cases is to discover how the solution environment will behave when problems occur in the solution environment. For each solution resource identified, a description of a problem to introduce with that resource must be provided in the resiliency test case.
For example, one problem might be to shut the database down gracefully, another is to crash the database, and another is to introduce a network outage between the solution and the database. These are all possible real-world problems.
To come up with meaningful resiliency test cases, a good understanding of the following is critical:
- Solution operational model: The solution resiliency expert must understand the operational model of the solution well enough to describe the various components and their interactions. A set of sequence or flow diagrams describing, at an operation level, the end-to-end request/response flows is highly recommended. The purpose of these diagrams is to identify, document, and understand the various request/response flows. All resources required by the solution must be identified at the component interaction level.
- Solution non-functional requirements: Although the solution non-functional requirements can comprise a long list of requirements, response time, throughput and availability are the most critical.
No software is perfect; things will go wrong. So, you must ask what can realistically go wrong when your software is used in a production environment. For example:
- What happens to the non-functional requirements or service level agreements of a solution flow if one or more resources are unavailable or slow down?
- What happens to the non-functional requirements of the flows if one flow sharing resources with the other flows is having a performance problem?
- Is the availability of a particular resource critical to the business function implemented by a particular solution flow?
- If a resource that is critical to the business function becomes unavailable, and later becomes available, then how would the solution behave? Would the solution still have a problem with that resource?
Questions such as these will drive the development of your resiliency test cases. The test cases that you build should address these areas for each resource:
- Directly-affected requests
  - Non-functional requirements
- Indirectly-affected requests
  - Non-functional requirements
- Directly-affected requests
- Resiliency test description
- Observed solution behavior
- Resulting problems (if any)
- Change to fix resulting problems
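The areas listed above can be captured as one record per resource and problem. The following is a minimal sketch, not a prescribed format: the class name, field names, and the `needs_tuning` helper are all hypothetical and simply mirror the list items.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResiliencyTestCase:
    """One resiliency test case for one resource (field names are illustrative)."""
    resource: str                  # resource the problem is introduced with
    test_description: str          # the problem to introduce
    directly_affected_requests: List[str] = field(default_factory=list)
    indirectly_affected_requests: List[str] = field(default_factory=list)
    non_functional_requirements: List[str] = field(default_factory=list)
    observed_behavior: str = ""    # filled in after the run
    resulting_problems: List[str] = field(default_factory=list)
    changes_to_fix: List[str] = field(default_factory=list)

    def needs_tuning(self) -> bool:
        """True when problems were observed and no fix has been recorded yet."""
        return bool(self.resulting_problems) and not self.changes_to_fix


# Hypothetical example: crashing the database behind flow 1.
case = ResiliencyTestCase(
    resource="database",
    test_description="Crash the database while flow 1 is under load",
    directly_affected_requests=["flow 1"],
    non_functional_requirements=["RT = 0.5 s", "TPT = 1000/hour"],
)
```

Keeping the observed behavior and the fix in the same record as the fault description makes it easier to compare reruns of the same test case.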
Table 1 shows examples for solution high level request flows and corresponding non-functional requirements:
Table 1. High-level flows and corresponding non-functional requirements
| No. | Flow | Non-functional requirements |
|---|---|---|
| 1 | User > FW > WS > AS > DB | RT = 0.5 s; TPT = 1000/hour; Availability = critical |
| 2 | User > WS > AS > DB | RT = 1 s; TPT = 100/hour; Availability = non-critical |
| 3 | User > FW > WS > AS > DB | RT = 5 s; TPT = 200/hour; Availability = critical |
The Flow column shows a request flowing through different products: a firewall (FW), a web server (WS), an application server (AS), and a database (DB). RT is the response time and TPT is the throughput. At this level of granularity, tuning for resiliency is often not completely possible, but you can address some resiliency questions.
For example, you need to ensure that each product listed in the flow is clustered so that if one member of a cluster fails, the other cluster members can assume the work of the failed member. Further, if one member of a cluster fails, what will happen to the non-functional requirements? Will throughputs or response times suffer to the extent that they will be unacceptable? The solution should employ clusters at every level in such a way that if a cluster member fails, the non-functional requirements are still met.
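The cluster sizing question above can be approximated with simple arithmetic before any test is run. This is a rough sketch that assumes load spreads evenly across members and that each member has a fixed, known capacity; the function name and the per-member capacity of 400 requests/hour are hypothetical, while the 1000/hour requirement comes from flow 1 in Table 1.

```python
def survives_member_failure(required_tpt_per_hour: float,
                            members: int,
                            per_member_capacity_per_hour: float,
                            failed: int = 1) -> bool:
    """Return True if the remaining members can still carry the required throughput.

    Assumes even load distribution and linear capacity, which a real
    cluster may not provide; a resiliency test should confirm the result.
    """
    remaining = members - failed
    if remaining <= 0:
        return False
    return remaining * per_member_capacity_per_hour >= required_tpt_per_hour


# Flow 1 requires 1000 requests/hour; 400/hour per member is an assumed figure.
print(survives_member_failure(1000, members=3, per_member_capacity_per_hour=400))  # False: 2 * 400 < 1000
print(survives_member_failure(1000, members=4, per_member_capacity_per_hour=400))  # True: 3 * 400 >= 1000
```

A calculation like this only indicates whether a cluster is plausibly sized; response times under the redistributed load still have to be measured.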
To address the solution resiliency more accurately, you need to provide a more granular level of details of the resources used by the solution. Table 2 shows example resources required within WebSphere Application Server to support the solution flow 1 shown in Table 1. Because the solution flow examples shown in Table 1 require additional products, a similar level of granularity is required for those products as well. Since these products are installed on operating systems (physical or virtual), a number of resources at the operating system level must also be identified. For example, the number of files that are open concurrently by a process executed with a given user ID is a resource at the operating system level. CPU utilization and memory consumption are other operating system level resources.
Table 2. Solution flow: Resources
Once resources are identified accurately for a particular solution request flow, there must be a way to measure the resource usage during the execution of the test case. Documenting the various resource metrics and the tools to use to collect these metrics in the test case is important. Table 3 shows WebSphere resource metrics associated with each resource that must be monitored as part of the overall health monitoring of this particular solution flow.
Table 3. Solution flow: Resource metrics
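Collecting resource metrics during a test run can be as simple as polling a set of probes at a fixed interval. The sketch below is generic and not tied to any monitoring tool: `sample_metrics`, `summarize`, and the probe name `pool_in_use` are hypothetical, and each probe is assumed to be a zero-argument function that returns the current value of one metric.

```python
import time
from typing import Callable, Dict, List


def sample_metrics(probes: Dict[str, Callable[[], float]],
                   samples: int,
                   interval_s: float = 0.0) -> Dict[str, List[float]]:
    """Call every probe `samples` times and record the readings per metric."""
    readings: Dict[str, List[float]] = {name: [] for name in probes}
    for i in range(samples):
        for name, probe in probes.items():
            readings[name].append(probe())
        if interval_s > 0 and i < samples - 1:
            time.sleep(interval_s)  # wait between sampling rounds
    return readings


def summarize(readings: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Reduce each metric's readings to min, max, and average."""
    return {
        name: {"min": min(values), "max": max(values), "avg": sum(values) / len(values)}
        for name, values in readings.items()
        if values
    }


# Hypothetical probe that replays canned values instead of querying a real server.
canned = iter([3, 5, 4])
readings = sample_metrics({"pool_in_use": lambda: next(canned)}, samples=3)
summary = summarize(readings)
```

In a real test case, each probe would wrap whichever tool the subject matter expert names for that metric.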
To document this level of granularity, the solution resiliency expert relies on the following subject matter experts:
- Solution architects are critical to provide the various architectural aspects (such as architectural overview, operational model, architectural decisions) of the solution.
- Application developers are critical to provide the various technical details of the request/response flows.
- Network administrators are critical to provide the network resources and corresponding metrics.
- System administrators identify the operating system level resources and corresponding metrics that reflect the health of those resources supporting the solution.
- Product subject matter experts, such as a database administrator, identify the resources within a particular product required by each request/response flow and corresponding metrics that reflect the health of that particular part of the solution.
Clients often do not have an end-to-end view of the critical metrics of resources required for a given solution. Even those who have monitoring tools that support the end-to-end monitoring view still require a specification of the right metrics to track. Usually, tools have the ability to monitor a set of default metrics, such as memory and CPU utilization. However, such monitoring is often insufficient. Each test case should also name the subject matter experts who will report the health status of their product while the test case is executed.
2. Run a resiliency test case
No resiliency tuning can be performed accurately without running test cases that are focused on the solution environment behavior when a problem is introduced with a resource. To develop confidence in the results obtained from running a test case, it is highly recommended that the test case is rerun with the same conditions.
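One way to decide whether a rerun confirms the first run is to compare the measured metrics of both runs within a tolerance. This is a sketch only: the function name, the metric names, and the 10% relative tolerance are assumptions, not values from this article.

```python
from typing import Dict


def runs_agree(first: Dict[str, float],
               second: Dict[str, float],
               tolerance: float = 0.10) -> bool:
    """Return True if both runs measured the same metrics and every pair of
    values differs by no more than `tolerance`, relative to the larger value."""
    if first.keys() != second.keys():
        return False
    for name, a in first.items():
        b = second[name]
        baseline = max(abs(a), abs(b))
        if baseline == 0:
            continue  # both readings are zero, so they agree
        if abs(a - b) / baseline > tolerance:
            return False
    return True


# Hypothetical measurements from two runs of the same test case.
run_1 = {"rt_s": 0.50, "tpt_per_hour": 1000}
run_2 = {"rt_s": 0.52, "tpt_per_hour": 980}
print(runs_agree(run_1, run_2))  # True: both metrics differ by less than 10%
```

If the runs disagree, the test conditions are probably not being reproduced faithfully, and that should be resolved before any tuning decision is based on the results.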
3. Monitor solution behavior
Although resiliency testing is not focused on solution performance, monitoring solution performance in addition to solution resiliency is critical. This is to ensure that resiliency tuning is not degrading performance to unacceptable levels. So, as each resiliency test case is performed, the solution non-functional requirements and the various resource metrics must be monitored. This monitoring is critical for two main reasons: to decide whether the solution passed the test from both the resiliency and the performance perspectives, and to identify root causes of possible performance or resiliency problems that could result from the problem introduced with a resource.
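The pass/fail decision from the performance perspective can be sketched as a comparison of measured values against a flow's non-functional requirements. The sketch assumes that the RT value in Table 1 is an upper bound and the TPT value a lower bound; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Nfr:
    """Non-functional requirements of one flow (interpretation is an assumption)."""
    max_response_time_s: float      # RT treated as an upper bound
    min_throughput_per_hour: float  # TPT treated as a lower bound


def evaluate(nfr: Nfr, measured_rt_s: float, measured_tpt_per_hour: float) -> List[str]:
    """Return a list of violated requirements; an empty list means the run passed."""
    violations = []
    if measured_rt_s > nfr.max_response_time_s:
        violations.append(
            f"response time {measured_rt_s} s exceeds {nfr.max_response_time_s} s")
    if measured_tpt_per_hour < nfr.min_throughput_per_hour:
        violations.append(
            f"throughput {measured_tpt_per_hour}/hour is below {nfr.min_throughput_per_hour}/hour")
    return violations


# Flow 1 from Table 1: RT = 0.5 s, TPT = 1000/hour. The measurements are made up.
flow_1 = Nfr(max_response_time_s=0.5, min_throughput_per_hour=1000)
print(evaluate(flow_1, measured_rt_s=0.4, measured_tpt_per_hour=1200))  # []
```

Returning the list of violations rather than a single boolean keeps the evidence needed for root cause analysis.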
4. Identify possible resiliency problems
While monitoring the solution behavior when running a resiliency test case, one or more problems might be discovered as a result of the original problem introduced with a resource. There might be problems discovered in the solution while the resource problem is still present. These problems might or might not be acceptable to the business.
However, there could still be problems in the solution even after the resource problem is resolved. Usually, these problems must be resolved. For example, a database outage is introduced to discover how the solution behaves. While the outage lasts, the solution cannot provide any value to the business. The business should understand this, because the database is unavailable. However, once the database outage is resolved, the solution is expected to resume normal operation. If the solution still has problems after the database outage is resolved, the solution is not resilient to this environmental change and that is usually unacceptable to the business. The following problem categories might be applicable:
- Software solution defect: It is not uncommon for resiliency test cases to uncover software defects in the solution environment. Defects can be in any part of the environment such as the software product, operating system, a network product, or the application code itself. The resiliency test cases might uncover a defect that had already been resolved via a product patch or another release of the application code. It is highly recommended that the solution environment be at the latest software level before running test cases.
- Configuration problems: The default configuration used by software products does not usually provide the optimal results.
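The database outage example suggests a simple check: once the introduced problem is resolved, poll the solution until it is healthy again or a recovery deadline passes. The following is a minimal sketch under assumptions; `wait_for_recovery` and the `is_healthy` callback are hypothetical, and the health check itself (for example, a test request through the affected flow) is left to the test case author.

```python
import time
from typing import Callable


def wait_for_recovery(is_healthy: Callable[[], bool],
                      timeout_s: float,
                      poll_interval_s: float = 0.5,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `is_healthy` until it returns True or `timeout_s` elapses.

    Returns True if the solution recovered in time, False otherwise.
    `clock` and `sleep` are injectable so the function can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Hypothetical health check that succeeds on the third poll.
answers = iter([False, False, True])
recovered = wait_for_recovery(lambda: next(answers), timeout_s=5, sleep=lambda s: None)
```

A result of False after the resource problem has been resolved points to one of the problem categories above rather than to the resource itself.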
5. Identify resiliency parameters
When running resiliency test cases, it is common to find that the solution will not provide an acceptable service level. For example, a solution hang or a slow recovery might occur even after the problem introduced into a resource is resolved. Certain resiliency parameters can be identified based on the solution behavior.
6. Apply resiliency parameters
When applying resiliency parameters, it is important to keep in mind the guidelines provided by the request queuing funnel. This “funnel” drives the configuration so that the number of requests is smallest at the target system. Figure 1 shows an example request queuing funnel where requests go through a number of defined WebSphere Application Server resources.
Figure 1. Request queuing funnel
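The funnel guideline can be checked mechanically once the concurrency limit of each stage is known. In this sketch the stage names and sizes are hypothetical and are not taken from Figure 1; stages are listed from the entry point to the target system, and equal sizes at adjacent stages are treated as acceptable, which is an assumption.

```python
from typing import List, Tuple


def funnel_violations(stages: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) pairs where the downstream stage allows
    more concurrent requests than the stage in front of it."""
    violations = []
    for (up_name, up_size), (down_name, down_size) in zip(stages, stages[1:]):
        if down_size > up_size:
            violations.append((up_name, down_name))
    return violations


# Hypothetical limits, ordered from the entry point to the target system.
stages = [
    ("web server threads", 200),
    ("web container threads", 50),
    ("connection pool", 30),
]
print(funnel_violations(stages))  # []
```

A non-empty result marks the places where a downstream resource could receive more concurrent work than the stage in front of it is configured to pass along.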
7. Repeat the resiliency process
After identifying resiliency parameters and applying them in the solution environment, a rerun of the resiliency test case is critical. Otherwise, there will be no confirmation that these resiliency parameters work. An iteration of these steps (Figure 2) must be performed as many times as required to ensure the solution passes the resiliency test.
Figure 2. Solution resiliency tuning iterative process
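The iterative process can be sketched as a loop that reruns the test case after every change. All names here are hypothetical: `run_test` is assumed to return a dictionary with a `passed` flag, `identify_parameters` derives the parameters to change from that result, and `apply_parameters` applies them to the environment. The cap on the number of runs is an added safeguard, not part of the process described in this article.

```python
from typing import Callable, Optional


def tune_until_pass(run_test: Callable[[], dict],
                    identify_parameters: Callable[[dict], dict],
                    apply_parameters: Callable[[dict], None],
                    max_runs: int = 5) -> Optional[int]:
    """Run, tune, and rerun until the test passes.

    Returns the number of the run that passed, or None if the test
    still fails after `max_runs` runs.
    """
    for run_number in range(1, max_runs + 1):
        result = run_test()
        if result["passed"]:
            return run_number
        if run_number < max_runs:
            # Only apply a change when a rerun will follow to confirm it.
            apply_parameters(identify_parameters(result))
    return None


# Toy environment: the test passes once a hypothetical timeout reaches 2.
env = {"timeout": 0}
passed_on = tune_until_pass(
    run_test=lambda: {"passed": env["timeout"] >= 2},
    identify_parameters=lambda result: {"timeout": env["timeout"] + 1},
    apply_parameters=env.update,
)
```

Because a change is never applied without a following rerun, every parameter that remains in the environment has been confirmed by at least one test run.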
For mission-critical applications, it is imperative that resiliency testing is performed before the applications are put in production. If applications are not tested and optimized for resiliency, the consequences can be severe, impacting customer satisfaction levels, and perhaps even resulting in business loss or legal repercussions. The consequences of disregarding these risks will almost always outweigh the cost and challenge of up-front resiliency testing.
The author thanks Thomas Alcott, Kevin Grigorenko, Alexandre Polozoff, Peter Bahrs and Kyle Brown for their reviews.