This article discusses how characteristics of autonomic computing can affect the way we build software products, IT infrastructure architectures, and application architectures. Typical enterprise architectures do not always give optimum performance at 100% availability. At times, service exploiters that are dependent on the availability of your services could get into trouble.
With the help of design patterns, you can build autonomic resiliency into an application to protect, optimize, and reconfigure itself. In this article we discuss:
- A short-circuit pattern used to circumvent a problem
- A service availability pattern
- Applying a service availability pattern to a sample application
Autonomic computing takes its inspiration from the human body, and specifically from the way the nervous system regulates it. It's really all about how the human brain predicts events, and how:
- It sends signals to the body parts to react to certain predictions to self-prevent any damage
- The human body is self-healing
- The body adjusts to changes in the environment
Do any of those principles apply to IT, applications, or software? IBM Research (see Resources) started studying autonomic computing based on the principle of building computer systems that regulate themselves similar to how our nervous system regulates and protects our bodies.
The fundamental, resilient features of the autonomic computing architecture for application development, often called self-CHOP, are:
- Self-configuring
- Self-healing
- Self-optimizing
- Self-protecting
Autonomic application resilient design patterns aren't new. Until recently, most were basic patterns supporting simple and obvious goals, or for handling a simple problem. We'll describe how you can use a few autonomic application resiliency design patterns to reduce total cost of ownership. First we'll review a simpler resilient design pattern, then we'll discuss more complicated design patterns.
Typically, an autonomic application resilient design pattern involves four steps to build the logic into the applications. You also need software products that supply the autonomic behavior and application resiliency for the dynamic, run-time architectural issues in an enterprise architecture. The four steps, called a MAPE loop, are:
- Monitor
  - With most enterprise architectures, it's common to have monitoring tools that capture events, alert the administration team, and help build metrics for application or service availability.
- Analyze
  - In autonomic computing it's important to analyze the data gathered by the monitoring tools to understand the type of event and the root cause of the issue.
- Plan
  - The knowledge gathered in the analysis step needs to be converted into a plan: a resiliency pattern whose logic handles an event that could otherwise cause undesired results.
- Execute
  - When a monitor triggers an event and the analysis shows that the event is serious, logic in the application code must act on the plan, following the predefined policy definitions that map events to actions.
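The four steps above can be sketched in a few lines of code. The following is a minimal illustration in Python, not a product implementation: the `Event` shape, the 2000 ms threshold, the `short_circuit` policy name, and the component names are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    """A monitoring observation (hypothetical shape, for illustration only)."""
    component: str
    response_ms: float


def monitor(samples: Dict[str, float]) -> List[Event]:
    # Monitor: turn raw measurements into events.
    return [Event(name, ms) for name, ms in samples.items()]


def analyze(event: Event, threshold_ms: float = 2000.0) -> Optional[str]:
    # Analyze: classify the event; None means there is nothing to act on.
    # The threshold is an assumed value, not one taken from the article.
    return "degraded" if event.response_ms > threshold_ms else None


def plan(symptom: str, policies: Dict[str, str]) -> Optional[str]:
    # Plan: look up the predefined policy for this symptom.
    return policies.get(symptom)


def execute(action: str, component: str,
            actions: Dict[str, Callable[[str], str]]) -> str:
    # Execute: run the action that the policy selected.
    return actions[action](component)


def mape_cycle(samples, policies, actions) -> List[str]:
    """Run one pass of the loop and return the actions that were taken."""
    results = []
    for event in monitor(samples):
        symptom = analyze(event)
        if symptom is None:
            continue
        action = plan(symptom, policies)
        if action is not None:
            results.append(execute(action, event.component, actions))
    return results


# Hypothetical policy definitions: symptom -> action name -> behavior.
POLICIES = {"degraded": "short_circuit"}
ACTIONS = {"short_circuit": lambda component: f"stop calling {component}"}

if __name__ == "__main__":
    print(mape_cycle({"ldap": 4500.0, "db2": 120.0}, POLICIES, ACTIONS))
```

Keeping the policies in a table separate from the code that acts on them is the design point here: the plan can change without touching the monitor or the execute logic.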
A short-circuit pattern
When applications call their own services, or services that share infrastructure components, a single component that slows down or stops responding can result in a deadlock situation. The deadlock then aggravates the problem, and can eventually result in an outage of all applications that depend on the shared components.
Figure 1 below shows a scenario of an enterprise architecture with applications causing deadlocks at the enterprise level, potentially causing outages. In Scenario 1 in the figure, all applications are healthy in terms of performance, and all shared components are healthy. Application A is a Web application that also makes calls to Service Provider B, which uses the same front-end proxy infrastructure.
In Scenario 2 in Figure 1, Service Provider B slows down, resulting in an increased number of threads on the proxy servers. Because Application A also makes calls to Service Provider B, and those calls go through the same proxy front end, Application A requests add to the number of threads waiting for a response on the proxy servers. It's similar to www.abcd.com code making calls to www.abcd.com code over HTTP. This causes a sudden increase in the threads waiting to be processed or waiting in the proxy server's queues, aggravating the issue to an outage condition for all applications, including Application B, as shown in Scenario 3.
Figure 1. Applications causing deadlocks

An application making service calls needs better control over service components in the architecture, and more options for reacting to run-time status changes such as performance degradation or outages. Relying on TCP-level timeouts is often not an ideal choice, depending on the load an application gets and how reliable an external service is.
In MAPE terms, a thread timeout is the monitor mechanism for external service calls. The expiry of the timeout is the "event," and the kill-thread logic is the predefined mechanism that executes the plan. The timeout value itself can be thought of as the policy definition for this design pattern.
Figure 2 shows a possible workaround for the deadlock situation. A service call is made using a thread, putting governance and predefined rules such as timeouts on the thread, which then gives much better control over the service call. You need to weigh the architectural advantages versus the drawbacks of initiating threads from a servlet. At times, the former could be a lot more valuable and justify the intentional use of such anti-patterns.
Figure 2. Sequence diagram for short circuit pattern
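As a rough illustration of the sequence in Figure 2, the sketch below runs a service call on a worker thread and stops waiting once a timeout expires. It is written in Python for brevity rather than as servlet code, and the function names, pool size, and timeout values are assumptions. Note one difference from the kill-thread logic described above: Python cannot forcibly stop a running thread, so the caller abandons the slow call instead, and the bounded pool limits how many abandoned calls can pile up.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A bounded pool caps how many service calls can be in flight at once.
# The size is an arbitrary choice for this example.
_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_short_circuit(service_call, timeout_s, fallback):
    """Run service_call on a worker thread; give up after timeout_s seconds.

    The timeout is the policy. When it expires, the caller returns the
    fallback immediately instead of waiting on the slow service.
    """
    future = _POOL.submit(service_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet from running.
        # A call that is already running continues in the background, but the
        # caller's thread is freed right away.
        future.cancel()
        return fallback


def slow_service():
    # Stand-in for an external service that has slowed down.
    time.sleep(0.5)
    return "real response"


def fast_service():
    # Stand-in for a healthy external service.
    return "real response"


if __name__ == "__main__":
    print(call_with_short_circuit(fast_service, 1.0, "service unavailable"))
    print(call_with_short_circuit(slow_service, 0.05, "service unavailable"))
```

The request thread is never blocked for longer than the timeout it chose, which is the control that a TCP-level timeout alone does not give you.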

A service availability pattern
Service availability patterns aren't new. You can see their applications in various places in our daily lives. We'll look at one scenario where this pattern helps, then review how the advantages can be applied to IT architectures.
Figure 3 shows a common daily scenario for those who drive. Traffic is monitored by different tools. After analyzing the data, a traffic controller normally broadcasts the traffic congestion areas and warns about delays. Drivers can then change their plans to choose alternate routes to help themselves and to smooth an uneven traffic situation.
Figure 3. Traffic control system

IT applications get lots of "traffic" in the form of Web transactions or application requests from end users, or from other applications exploiting Web services. For various reasons, applications don't always give good response times; they slow down. Because of costs, it's very common for application architectures to share infrastructure components within the enterprise architecture.
As explained in A short-circuit pattern, one heavily used application or service having degraded performance issues or outages can affect the entire enterprise architecture. Let's apply the traffic controller pattern to one such situation in an IT enterprise architecture, and find a solution for a much healthier application infrastructure during outages or degraded performance situations.
Enterprise applications typically have many external interfaces, within or outside the data center, such as directory calls, database calls, MQ calls, or SOAP-based Web services calls. These applications have no knowledge of the run-time availability or performance of the external interfaces, so most of these interfaces normally have timeout settings. But sometimes the timeouts come too late to prevent an application outage, because too many threads in the Web application servers are waiting to get timed out.
For example, an application could be making a URL connection to another servlet to receive an XML document as the response. Another application makes a SOAP request over HTTP to retrieve a SOAP response. The only timeout controls available in this scenario are the TCP-level timeout value and the default HTTP return codes returned by the service's Web application servers. Now imagine what would happen if the server was available and reachable but was responding very slowly. For a heavily used application with a lot of concurrent users, too many threads would be waiting in the application server's Web container, either for a response or for a timeout. This can often have a cascading effect on the shared front-end servers, as shown in Figure 1.
It's regular practice for an enterprise to set up monitoring tools, such as Tivoli® products. As shown in Figure 4, these tools monitor the application infrastructure, report events such as degraded performance or outages, and send alert messages to the administration team so that it can take action immediately.
Figure 4. Service availability pattern

Imagine that a Web service or an LDAP server is facing degraded performance. Applications that are using it could get into trouble. Based on alerts given by monitoring tools, administrators and application support teams take appropriate actions; in a worst-case scenario, a recycle of the applications might be needed. Thinking from the autonomic computing perspective, because the monitoring tools have already detected the failure, why not let them hook up with the application server infrastructure through a mediator called a Service Availability Broadcast System? The Service Availability System simply analyzes data from the monitoring tools and passes messages on to the interested applications, which are subscribers to a given service availability topic. For example, an application that uses LDAP heavily might be interested in knowing about any events related to LDAP availability so it can react to them more gracefully.
The use case in Figure 5 shows the flow of how applications could subscribe to the availability event notifications of a target application, service, or an infrastructure component.
Figure 5. Use case for service availability pattern

Traditional monitoring tools monitor and post any availability or performance-related events to the Service Availability System. Applications can subscribe to a particular target infrastructure component, such as a Web service, LDAP server, or DB2® server, or even to the network availability between two geographies, which could also impact availability. Based on the availability data it receives from the traditional monitoring tools, the Service Availability System can then broadcast event notifications to the subscriber applications that are interested in the status of that enterprise component.
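The subscribe-and-broadcast flow described above can be sketched as a small in-process publish/subscribe mediator. This is only an illustration in Python: the class names, the topic strings, and the status values ("available", "degraded") are assumptions, and a real deployment would more likely use messaging middleware between the monitoring tools and the applications.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Callback = Callable[[str, str], None]


class ServiceAvailabilityBroadcaster:
    """Hypothetical mediator between monitoring tools and applications."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        # An application registers interest in one component, such as "ldap".
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, status: str) -> int:
        # Called by a monitoring tool; returns how many subscribers were told.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, status)
        return len(callbacks)


class Application:
    """A subscriber that remembers the last known status of each component."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status: Dict[str, str] = {}

    def on_availability_event(self, topic: str, status: str) -> None:
        self.status[topic] = status

    def can_call(self, topic: str) -> bool:
        # With no event received, the component is assumed to be available.
        return self.status.get(topic, "available") == "available"


if __name__ == "__main__":
    broadcaster = ServiceAvailabilityBroadcaster()
    app_a = Application("Application A")
    broadcaster.subscribe("ldap", app_a.on_availability_event)
    broadcaster.publish("ldap", "degraded")
    print(app_a.can_call("ldap"))
```

Because the application keeps only the last known status, checking `can_call` before an external call costs a dictionary lookup rather than a network round trip.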
This pattern applies very well to a Web services model, too. Figure 6 shows the services triangle with the interaction between service providers, service consumer, and the dynamic discovery and invocation of services.
Figure 6. Service availability pattern for Web services model

The SOAP Fault specification only handles the cases of known application issues; Listing 1 shows a sample.
Listing 1. SOAP Fault specification
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <soap:Fault>
      <faultcode>-1000</faultcode>
      <faultstring>Database is unavailable.</faultstring>
      <detail/>
    </soap:Fault>
  </soap:Body>
</soap:Envelope>
But the question remains of what happens if the Web application server hosting this Web service has performance issues. Even a SOAP Fault response has to travel through the same troubled infrastructure, so it does not help the situation much.
A service availability pattern can help dynamic services not only discover the service providers, but also find their current run-time status. The consumer then knows the availability of the service provider at that moment, and can decide whether to call it or to search the service registry/UDDI for another service.
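The consumer-side decision can be sketched as follows. This Python fragment is an assumption-laden illustration: the registry is modeled as a plain list of endpoints already returned by a lookup, the `.example` URLs are placeholders, and the status values are invented for this example rather than taken from UDDI or any monitoring product.

```python
from typing import Dict, List, Optional


def choose_provider(candidates: List[str],
                    status: Dict[str, str]) -> Optional[str]:
    """Return the first candidate endpoint that is currently usable.

    candidates: endpoints found in the service registry, in preference order.
    status: last known run-time status per endpoint, as broadcast by the
            availability system. Endpoints with no reported status are
            treated as available, because no event has been raised for them.
    """
    for endpoint in candidates:
        if status.get(endpoint, "available") == "available":
            return endpoint
    return None  # nothing usable: fail fast instead of queuing a doomed call


if __name__ == "__main__":
    # Placeholder endpoints, standing in for the result of a registry lookup.
    providers = [
        "https://provider-a.example/quote",
        "https://provider-b.example/quote",
    ]
    known_status = {"https://provider-a.example/quote": "degraded"}
    print(choose_provider(providers, known_status))
```

Returning `None` when every provider is degraded lets the caller show a graceful message right away, instead of holding a thread until a timeout.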
Applying a service availability pattern
In the following example, Application A receives over seven million requests a day, with a throughput of over 250 servlet requests per second during peak hours. Most of the application transactions need to make LDAP calls with varied levels of complexity in the search filter. During LDAP performance issues or outages, even an optimally set timeout value (one that still allows for the searches that legitimately take longer) leaves too many Web container threads waiting. The threads that are waiting either to get timed out by the LDAP server or to get a response from it have a cascading effect on the front-end IP sprayers and caching proxy servers. In this infrastructure, caching proxy servers are shared across the entire portfolio of applications, so the waiting threads almost cause an outage for all applications supported by the shared caching proxy infrastructure.
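A back-of-the-envelope calculation shows why the waiting threads pile up so quickly. By Little's Law, the number of requests in flight is roughly the arrival rate multiplied by the time each request holds a thread. The request volumes below come from the example above; the 0.2-second healthy response time and the 30-second timeout are assumed values chosen only for illustration, and the calculation simplifies by treating every request as an LDAP call.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(requests_per_day: float) -> float:
    """Average requests per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY


def threads_in_flight(rate_per_s: float, hold_time_s: float) -> float:
    """Little's Law: concurrent requests = arrival rate x time in system."""
    return rate_per_s * hold_time_s


if __name__ == "__main__":
    peak_rate = 250          # servlet requests per second, from the example
    healthy_hold_s = 0.2     # assumed response time when LDAP is healthy
    timeout_s = 30           # assumed timeout while LDAP is not responding

    print(round(average_rate(7_000_000)))            # about 81 requests/s
    print(threads_in_flight(peak_rate, healthy_hold_s))
    print(threads_in_flight(peak_rate, timeout_s))   # 7500 waiting threads
```

Under these assumptions, the same peak load that needs about 50 threads when LDAP is healthy would need 7,500 once every call waits for the full timeout, which is far more than a Web container or a shared proxy tier is normally sized for.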
Figure 7 shows how a service availability pattern was applied to yield a much healthier Application A during LDAP issues, and also protect caching proxy servers or front-end edge infrastructure.
Figure 7. Example of service availability pattern

Application A makes many concurrent calls. A custom monitoring tool for the LDAP cluster was already in place for the infrastructure. The LDAP/ED availability tool gathers information from the LDAP monitoring tool and captures any availability events. As soon as the tool detects LDAP issues, it notifies the application by updating an application property file. The application is scheduled to pick up any changes to the property file, so it learns about all LDAP availability events. After an outage or performance event is noted, the application simply doesn't try to make any LDAP calls. It already knows the request would time out anyway, so it provides a graceful error message to the end-user transaction saying that the LDAP infrastructure is having issues and to please try again later.
As soon as the availability status changes, the ED availability tool interprets it based on the LDAP monitor tool data it gets and notifies the application in a similar way. After the application knows the LDAP cluster is again available and performing well, it will start making LDAP calls and service is restored. This sample outlines how to get better event handling in typical Service-Oriented Architectures (SOA).
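The property-file mechanism can be sketched as a small gate in front of the LDAP calls. This is an illustrative Python sketch, not the code used by Application A: the `ldap.status` key, its values, the class name, and the error text are assumptions, and the scheduling of `refresh()` (for example, from a timer) is left out.

```python
from typing import Callable, Dict, List, Union


def read_properties(path: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


class LdapGate:
    """Skips LDAP calls while the property file reports a problem."""

    ERROR_MESSAGE = ("The LDAP infrastructure is having issues. "
                     "Please try again later.")

    def __init__(self, path: str, key: str = "ldap.status") -> None:
        self.path = path
        self.key = key          # hypothetical property name
        self.available = True   # optimistic until told otherwise

    def refresh(self) -> None:
        # Intended to be called on a schedule to pick up file changes.
        try:
            props = read_properties(self.path)
        except OSError:
            return  # file unreadable: keep the last known state
        self.available = props.get(self.key, "available") == "available"

    def search(self, ldap_search: Callable[[str], List[str]],
               query: str) -> Union[List[str], str]:
        if not self.available:
            # No LDAP call is made, so no thread waits on a timeout.
            return self.ERROR_MESSAGE
        return ldap_search(query)
```

In use, the availability tool would write `ldap.status=down` to the file, the next scheduled `refresh()` would flip the gate, and `search()` would return the graceful message until the tool writes `ldap.status=available` again.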
Resources
Learn
- Optimize resource usage and reduce costs series: Read the other parts of this series.
- IBM Research Autonomic Computing Overview: Read the IBM perspective on the state of information technology, the vision, research focus, and business focus related to autonomic computing.
- "Increase stability and responsiveness by short-circuiting code" (developerWorks, October 2004): Read about a homegrown short-circuit pattern that ensures threaded execution and completion of a process in a fixed window of time.
- "Symptoms deep dive, Part 1: The autonomic computing symptoms format" (developerWorks, October 2005): Read this article that introduces the autonomic computing symptoms architecture and format, and details symptoms, including such information as how symptoms are represented, how to identify them, the advantages of using a standard symptom representation, and how to adopt them as part of your systems management strategy.
- Graceful service degradation (or, how to know your payment is late): In this paper, Alexandr Andoni and Jessica Staddon introduce the concept of service degradation that alerts users that their service is about to be revoked. (From the Proceedings of the 6th ACM Conference on Electronic Commerce (Vancouver, BC, Canada, June 05 - 08, 2005). EC '05. ACM Press, New York, NY, 9-18.)
- SOAP Version 1.2 Part 1: Messaging Framework SOAP Faults: Use this W3C Recommendation for SOAP Faults.
- Browse for books on these and other technical topics at the Safari bookstore.
Get products and technologies
- IBM trial products for download: Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.
- WebSphere Studio Application Developer: Download a trial version.
Discuss
- Join the discussion: Drop in on the "Autonomic computing: an insider's perspective" discussion forum.
- Blog: Dave Bartlett, IBM VP of Autonomic Computing, shares his perspective.
- developerWorks blogs: Get involved in the developerWorks community.

Murali Narasimhadevara is a Senior IT architect with the IBM CIO office. Murali is also the Senior Webmaster for the IBM intranet, and has been helping develop it for the past eight years. He has extensive experience in building and managing high volume Web sites, application and Web server administration with a focus in WebSphere, performance/capacity planning, and enterprise application design. His areas of interest are in autonomic and utility computing for managing Web infrastructures.

Mahi R. Inampudi is the lead IT architect for IBM's On Demand Workplace expertise location system (BluePages). Other responsibilities include the architecture and solution design for several of IBM's internal offerings and collaborating with the CIO office and IBM Research helping design applications using the latest SOA methods. Recent interests include leveraging newer technologies, such as WebSphere Extended Deployment, the Rational product suite, and IBM's intraGrid architecture.




