An SOA links together many disparate services into one cohesive interoperating environment. Some or all of these services might be invoked synchronously during the course of a single transaction, creating a chain of interconnected service components. Each service within the chain is dependent on the downstream services; that is, a service can't proceed until it receives a response from the downstream services that it invokes. As unpredictable conditions arise (such as sudden high latency of the bus connecting the interoperating services), services may become blocked and fail. The failures propagate up the chain, causing other services to fail and possibly decreasing the stability of the overall SOA deployment.
How significantly a nonresponsive service disrupts the entire SOA infrastructure depends on the resiliency of the SOA. Resiliency, defined as the continued availability and performance of a service despite negative changes in its environment, is critical to maintaining a healthy SOA. This article explores short-term solutions designed to increase the resiliency of SOAs that are susceptible to forming chains of synchronously interconnected service components.
Short-term solutions can be readily applied to an existing SOA framework with little or no change to the overall architecture. These solutions involve configuration tuning and other subtle optimizations that require little restructuring of the SOA. At worst, the required restructuring is manageable and can be implemented quickly.
The primary goals of the short-term solutions are:
- Improve the overall stability of the SOA infrastructure by reducing its fragility.
- Improve tolerance of exceptional conditions in such an infrastructure.
- Improve performance and reduce millions-of-instructions-per-second (MIPS) usage.
- Improve manageability through increased serviceability.
This article focuses on two short-term solutions:
- Collocation of tightly coupled SOA services
- Implementing aggressive timers on SOA service invocations
Short-term solution 1: collocation of tightly coupled SOA services
Collocation essentially means the deployment of SOA services or applications on the same physical system. The term tightly coupled describes the relationship between two or more services that are invoked synchronously, meaning the calling service (the client) is blocked on the service invocation and can't continue until it receives a response from the service it invoked. Several tightly coupled services might participate in a single transaction, creating a chain of synchronously interconnected components within the SOA.
The short-term solution discussed in this section involves co-deploying (where possible) tightly coupled business applications that act as components in a chain of synchronously interconnected components. The focus is on business applications that are deployed to separate servers but share a great deal of synchronous intercommunication (that is, they're tightly coupled). These business applications might benefit greatly from redeployment to the same server, which in turn might translate to benefits for the overall health and stability of the SOA.
One of the obvious reasons for co-deploying tightly coupled business applications is the measurable performance improvement. Co-deployed applications can take advantage of intraprocess communication protocols, which typically facilitate direct communication between the applications. Intraprocess communication avoids all the overhead of a remote service invocation, including serialization, encryption, traversing the network stack, and network latency.
A second reason for co-deployment of tightly coupled business applications is the reduced stress on resources, specifically task- or thread-level resources. A synchronous service invocation over local intraprocess communication protocols typically places much less stress on the local server's resources than an invocation made over remote communication protocols. This is illustrated most clearly with an example, covered in the following section.
Problem scenario: heavy workload versus limited resources
Consider the situation in Figure 1 where business application A deployed in server 1 synchronously invokes service B, which is hosted by business application B and deployed in server 2. The two servers are connected physically by some service bus (typically a network connection).
Figure 1. Application servers hosting different but synchronously dependent applications connected through some communication channel (such as HTTP or RMI)
An external client invokes application A in server 1. The work request is dispatched to a managed task within server 1. Because the communication between application A and service B is synchronous (for example, Remote Method Invocation over Internet Inter-ORB Protocol, or RMI/IIOP), the managed task in server 1 that's dispatching the work against application A must block until it receives a response from service B. While blocked, the managed task can perform no other work.
This may become a problem if the hosting server (server 1) is under heavy load and has only a limited number of managed task resources available. As more and more of its managed tasks become blocked on synchronous remote service calls, the capacity of the server to dispatch work diminishes, and thus the overall service rate of the server decreases.
If the service rate drops below the arrival rate of new work, the server falls behind and new work is queued up until a managed task becomes available to dispatch it. If all the managed tasks become blocked, the server can't perform any work at all and appears hung. If the condition persists, the queued work and dispatched work begin to time out and fail.
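The relationship between blocked managed tasks, service rate, and queue growth can be sketched numerically. The following is a minimal illustration rather than a model taken from this article: the function names and all figures (pool size, service time, arrival rate) are hypothetical, and it assumes each unblocked task completes work at a fixed rate.

```python
def service_rate(total_tasks, blocked_tasks, avg_service_time_s):
    """Requests per second the server can complete with its unblocked tasks."""
    available = max(total_tasks - blocked_tasks, 0)
    return available / avg_service_time_s


def backlog_after(seconds, arrival_rate, rate):
    """Queued requests after `seconds`, assuming an initially empty queue."""
    return max(arrival_rate - rate, 0) * seconds


# Hypothetical server: 100 managed tasks, 1-second requests, 60 arrivals/second.
for blocked in (0, 30, 50, 100):
    rate = service_rate(100, blocked, 1.0)
    queued = backlog_after(10, 60, rate)
    print(f"blocked={blocked:3d} rate={rate:5.1f}/s queued after 10s={queued:5.0f}")
```

With these hypothetical numbers, the queue stays empty while 30 tasks are blocked, grows by 10 requests per second once 50 are blocked, and grows at the full arrival rate when every task is blocked.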
Figure 2. External services can dramatically reduce the service rate of the server by blocking execution threads for extended periods of time
As server 1's overall service rate suffers, so does the service rate of all services hosted by server 1. Although the actual problem may be isolated to the faulty or high-latency communication between application A and service B, the effects of the problem permeate the entire server. The managed task resources are a shared resource—all applications hosted by server 1 share the same set of managed tasks. As the managed tasks are consumed and blocked by application A, other applications are unable to dispatch. This might cause an otherwise healthy service to fail, as it can't get enough resources to execute.
Solution: co-deploy tightly coupled applications
Co-deploying the two tightly coupled applications within the same server alleviates the stress on resources. If collocated, application A can invoke service B using local intraprocess protocols, which essentially allow for direct invocation. Remote communication is avoided; therefore, so are the problems associated with remote latency. Managed task resources are not blocked, allowing the tasks to execute to completion more efficiently.
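One way to picture the difference is a caller that depends only on a service interface and is handed either a local or a remote implementation at deployment time. The sketch below is illustrative only: `ServiceB`, `LocalServiceB`, `RemoteServiceB`, and the simulated latency are hypothetical names and values, and a `time.sleep` call stands in for serialization and network delay.

```python
import time
from abc import ABC, abstractmethod


class ServiceB(ABC):
    """Hypothetical interface that application A programs against."""

    @abstractmethod
    def lookup(self, key: str) -> str: ...


class LocalServiceB(ServiceB):
    """Collocated deployment: a direct in-process call."""

    def lookup(self, key: str) -> str:
        return key.upper()


class RemoteServiceB(ServiceB):
    """Separate deployment: the caller blocks for the remote round trip."""

    def __init__(self, simulated_latency_s: float):
        self.simulated_latency_s = simulated_latency_s

    def lookup(self, key: str) -> str:
        time.sleep(self.simulated_latency_s)  # stand-in for serialization + network
        return key.upper()


def application_a(service_b: ServiceB, key: str) -> str:
    """Application A is unaware of where service B is deployed."""
    return "A:" + service_b.lookup(key)


def timed(service_b: ServiceB) -> float:
    """Seconds the calling task spends blocked in one invocation."""
    start = time.perf_counter()
    application_a(service_b, "order-1")
    return time.perf_counter() - start
```

Because application A sees only the interface, moving service B into the same server changes the deployment wiring, not the calling code.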
With fewer blocked managed tasks, the server can maintain a higher service rate and is more likely to avoid queue growth. Avoiding queue growth prevents dispatch timeouts, which in turn prevents the cascading failures that affect other components within the SOA. The result is a more stable and more resilient SOA.
Figure 3. The collocation of synchronously dependent services
As illustrated here, faulty communication or poor performance between an application and the service it invokes synchronously can have negative effects on the entire server and, in turn, all of the applications and services hosted by that server. As the services at server 1 are affected, so are all the components within the SOA that use those services. Therefore, the problems developing on server 1 could ripple across the entire SOA, disrupting the SOA on a large scale. As a short-term solution, collocation of tightly coupled business applications and services can mitigate some of the instability that might develop in the SOA environment.
Short-term solution 2: implementing aggressive timers on service invocations
A second short-term solution that can improve the stability of the SOA infrastructure, specifically in the area of tightly coupled services that communicate synchronously, is to apply aggressive timers to govern the service invocations. In situations where the tightly coupled services can't be collocated, aggressive service-invocation timers can be applied to alleviate the negative effects of a nonresponsive service.
An aggressive service-invocation timer is set to the reasonable expected response time of the service, with some consideration for transient delays. In practice, many timers are set far longer than the average response time of the service. For example, a service that typically dispatches and completes requests within one to two seconds might have a dispatch timeout of 300 seconds (five minutes). That's an example of a nonaggressive timer.
The reasoning is as follows: if the service typically responds within two seconds but a particular invocation hasn't responded in 30 seconds, it's safe to assume that the service is suffering some condition that has rendered it nonresponsive. It's unlikely the service will respond given extra time, so waiting an additional 270 seconds before timing out the request is probably unnecessary. A nonaggressive timer such as the 300-second example gives the service every opportunity to respond, but it gives no consideration to the effect the lengthy delay might have on other components synchronously involved in the transaction, nor to the effect of consumed and blocked shared resources on other services hosted within the local server.
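One possible way to pick an aggressive value is to derive it from observed response times. The following sketch is an assumption, not a rule from this article: it takes a high percentile of recent samples and multiplies it by a safety margin for transient delays. The function name, the 99th percentile, and the margin of 3 are all hypothetical choices.

```python
import math


def aggressive_timeout(samples_s, percentile=0.99, margin=3.0):
    """Return a timeout: the given percentile of observed times, times a margin."""
    if not samples_s:
        raise ValueError("need at least one response-time sample")
    ordered = sorted(samples_s)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` of all samples at or below it.
    rank = max(math.ceil(percentile * len(ordered)), 1)
    return ordered[rank - 1] * margin


# Hypothetical samples for a service that usually answers in one to two seconds.
samples = [1.0, 1.2, 1.1, 1.8, 2.0, 1.5, 1.3, 1.9, 1.4, 1.6]
print(aggressive_timeout(samples))
```

For these samples the result is 6 seconds, which still tolerates transient delays but is far shorter than a 300-second timeout. The right percentile and margin depend on the service and should be tuned against real measurements.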
Problem scenario: blocked managed-task resources
The problem with nonaggressive timers is that they take too long to react to exceptional conditions in the environment that might cause a service to become suddenly, temporarily nonresponsive (for example, a broken network connection). Within a virtualized infrastructure with shared resources, such as an application server, the managed task that invoked the service is blocked until the nonaggressive invocation timer expires. Only then is the service request terminated and the blocked managed task released.
While blocked, the managed task continues to hold any shared resources it acquired during the course of its dispatch (for example, storage resources, task-level resources, and mutex locks). The shared resources held by the blocked managed task are unavailable to other tasks, which can inhibit the processing of other dispatched work within the server and can possibly lead to a decrease in the overall service rate of the application server.
Furthermore, depending on the nature of the communication failure, other managed tasks within the application server may experience the same nonresponsiveness from the service. Under heavy load and with limited task resources, it's possible that all available managed tasks within the server will become blocked on the nonresponsive service. With all managed task resources consumed, the application server can't process more incoming work.
Figure 4. Nonresponsive services can block all available managed tasks within the application server
In effect, the application server has itself become nonresponsive, just like the service that its threads are trying to invoke. And just as the nonresponsiveness of the downstream service caused the application server to become nonresponsive, the nonresponsiveness of the application server can have similar adverse effects on other upstream components in the chain of synchronously interconnected components.
The application server remains nonresponsive until an invocation timer expires, terminates the service invocation, and resumes a blocked managed task, allowing the managed task to complete its dispatch and accept new work. The less aggressively the invocation timer is set, the longer the application server remains nonresponsive. The longer the application server remains nonresponsive, the more likely other components within the SOA that invoke services against the application server will be similarly affected. This could lead to a disruption and destabilization of the entire SOA.
Figure 5. Nonresponsive services can cause issues to upstream callers, specifically by blocking their worker threads and decreasing the service rate of their corresponding servers
As illustrated in Figure 5, the negative effects of a single nonresponsive service combined with nonaggressive timers could cascade throughout the SOA, with extremely harmful consequences for the SOA's stability and availability.
Solution: implement aggressive invocation timers
The solution to this problem is to implement aggressive invocation timers on synchronously invoked services. Appropriately aggressive timers alleviate the stress on shared resources within the application server (specifically managed task resources), as managed tasks are not blocked for an unreasonably lengthy period of time on a service that will likely never respond.
The aggressive timer aborts the service invocation as soon as it appears to be nonresponsive, allowing the managed task to complete its dispatch and accept new work. With more managed tasks available to process work, the application server maintains a higher service rate, thereby relieving the stress on the entire chain of synchronously interconnected components within the SOA.
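At the code level, an invocation timer can be as simple as a timeout on the underlying network call. The sketch below is a minimal illustration using plain sockets, not the timer mechanism of any particular application server: `invoke_with_timeout` is a hypothetical helper, and the "service" in the demonstration is a local socket that accepts connections but never replies. Note that the timeout applies to each socket operation individually rather than to the invocation as a whole.

```python
import socket


def invoke_with_timeout(host, port, request, timeout_s):
    """Send a request and wait at most `timeout_s` seconds per socket operation.

    Returns the reply bytes, or None if the service did not respond in time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(request)
            return conn.recv(4096)
    except socket.timeout:
        # Abort the invocation so the calling task is free to do other work.
        return None


if __name__ == "__main__":
    # Hypothetical nonresponsive service: listens but never sends a reply.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    host, port = server.getsockname()
    print(invoke_with_timeout(host, port, b"ping", timeout_s=0.5))
    server.close()
```

The caller regains control after roughly the timeout value instead of waiting indefinitely. As the surrounding text notes, the request itself still fails; the benefit is that the calling task and the resources it holds are released sooner.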
Figure 6. Aggressive timeout values can alleviate the impacts of nonresponsive services by releasing consumed resources sooner
The queuing theory models in Figures 7 and 8 illustrate quantitatively how aggressive timers increase the service rate (or throughput) of the application server. The model assumes an average service response time of one second, with a 99% probability that the service will complete successfully. There is a 1% probability of the service being nonresponsive. In the nonresponsive cases, the managed task invoking the service remains blocked until the service timer expires. Between the two figures, the service timeout value is varied from 60 seconds in Figure 7 (nonaggressive) to 20 seconds in Figure 8 (aggressive).
Figure 7. Simple quantitative model where the timeout value is set to 60 seconds and the effective throughput is 63 requests per second
With a 60-second timeout value, the model computes an average service time of 1.59 seconds per request, which translates to a throughput of 63 requests per second in an application server containing 100 worker threads. Figure 8 shows how implementing a more aggressive service timer improves the service rate and throughput of the server.
Figure 8. Simple quantitative model where the timeout value is changed to 20 seconds and the effective throughput is increased to 84 requests per second
With the more aggressive 20-second timeout value, the model computes an average service time of 1.19 seconds per request, which translates to a throughput of 84 requests per second. The model's calculations demonstrate how the aggressive timer in Figure 8 results in a significant improvement in the overall service rate and throughput of the application server.
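The 1.59-second and 1.19-second values are consistent with a simple expected-value calculation. The sketch below is a reconstruction under that assumption, not the authors' model itself: the function names are hypothetical, and it assumes every worker thread is always busy.

```python
def mean_service_time(normal_time_s, timeout_s, p_nonresponsive):
    """Expected time a worker thread spends on one request."""
    return (1 - p_nonresponsive) * normal_time_s + p_nonresponsive * timeout_s


def throughput(worker_threads, service_time_s):
    """Requests per second when every worker thread is kept busy."""
    return worker_threads / service_time_s


# Inputs from the text: 1-second responses, 1% nonresponsive, 100 worker threads.
for timeout_s in (60, 20):
    t = mean_service_time(1.0, timeout_s, 0.01)
    print(f"timeout={timeout_s}s mean={t:.2f}s throughput={throughput(100, t):.0f}/s")
```

With these inputs the sketch reproduces the values in Figures 7 and 8: 1.59 seconds and 63 requests per second at a 60-second timeout, and 1.19 seconds and 84 requests per second at a 20-second timeout. Even though only 1% of requests hang, the timeout value dominates the average time each thread is occupied.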
It's important to note that the implementation of aggressive timers doesn't correct or prevent the transaction failure. In fact, one consequence of the aggressive timer is that it might abort a service request and fail a transaction that otherwise would have completed given more time. But the more important consideration here is the overall health of the application server and, more broadly, the health of the SOA. As described above, aggressively set timers can help mitigate the negative effects that cascade across the SOA when a service becomes nonresponsive. Ultimately, well-implemented aggressive timers result in a more resilient SOA.
The purpose of this article was to introduce you to short-term, immediately applicable solutions for specific performance and availability problems that may arise with the use of tightly coupled, synchronously interconnected components within the SOA. The article illustrated how a single nonresponsive service can have cascading effects throughout the SOA, devastating the overall health of the SOA and possibly leading to destabilization and a disruption of service.
The solutions described here (collocation of tightly coupled business applications and the application of aggressive timers) are considered short-term because they can be applied readily to an existing SOA with very little restructuring or redesign of the SOA framework. Upcoming articles in this series present long-term solutions that are more comprehensive and require more planning, design, or restructuring effort.
These short-term solutions are vital for their immediate stabilizing effect on an SOA that might already be suffering from the problems described in this article. The solutions also apply to SOA infrastructures that are susceptible to forming such chains of synchronously interconnected components. The ability of these solutions to stabilize the SOA and avoid the disruption of SOA services leads to the increased resiliency of the SOA.
Snehal Antani works for the SOA Technology Practice within IBM Software Services for WebSphere (ISSW) and is the technical lead for IBM WebSphere Extended Deployment. He comes from a development background, working on several products, including IBM WebSphere Application Server for z/OS, IBM WebSphere Extended Deployment-Distributed, and IBM WebSphere Extended Deployment for z/OS, and has helped bring to production some of IBM's largest WebSphere Distributed and z/OS customers around the world. He has disclosed several patents and technical publications in the domains of enterprise application infrastructure and grid computing. He earned a BS in computer science from Purdue University and will complete his MS in computer science from Rensselaer Polytechnic Institute (RPI) in Troy, NY with a thesis in the area of quantifying and improving the resiliency of middleware infrastructures.
Rob Alderman works for the IBM WebSphere Application Server for z/OS development group. He is the technical team lead of the WebSphere for z/OS runtime development team. His development focus is primarily in the area of WebSphere for z/OS run time, where the WebSphere Application Server code interacts with and exploits the native operating system services available from z/OS. Rob earned a dual BS degree in computer systems engineering and computer science from Rensselaer Polytechnic Institute (RPI) in Troy, NY.