
Build a resilient SOA infrastructure, Part 2: Short-term solutions for issues involving tightly coupled SOA components

Improve SOA resiliency through aggressive timers and collocation of tightly coupled components

Snehal Antani (antani@us.ibm.com), ISSW-SOA Technology Practice, WebSphere XD Technical Lead, IBM
Snehal Antani works for the SOA Technology Practice within IBM Software Services for WebSphere (ISSW) and is the technical lead for IBM WebSphere Extended Deployment. He comes from a development background, working on several products, including IBM WebSphere Application Server for z/OS, IBM WebSphere Extended Deployment-Distributed, and IBM WebSphere Extended Deployment for z/OS, and has helped bring to production some of IBM's largest WebSphere Distributed and z/OS customers around the world. He has disclosed several patents and technical publications in the domains of enterprise application infrastructure and grid computing. He earned a BS in computer science from Purdue University and will complete his MS in computer science from Rensselaer Polytechnic Institute (RPI) in Troy, NY with a thesis in the area of quantifying and improving the resiliency of middleware infrastructures.
Robert G. Alderman (ralder@us.ibm.com), Software Developer, IBM
Rob Alderman works for the IBM WebSphere Application Server for z/OS development group. He is the technical team lead of the WebSphere for z/OS runtime development team. His development focus is primarily in the area of WebSphere for z/OS run time, where the WebSphere Application Server code interacts with and exploits the native operating system services available from z/OS. Rob earned a dual BS degree in computer systems engineering and computer science from Rensselaer Polytechnic Institute (RPI) in Troy, NY.

Summary:  This article, Part 2 in a series on building a resilient Service-Oriented Architecture (SOA) infrastructure, focuses on short-term solutions to problems associated with the use of synchronously interconnected SOA components across servers and tiers. The solutions presented here were chosen because they mitigate the negative impact of these problems, thereby increasing the resiliency of the SOA.


Date:  01 Nov 2007
Level:  Intermediate

Introduction

An SOA links together many disparate services into one cohesive interoperating environment. Some or all of these services might be invoked synchronously during the course of a single transaction, creating a chain of interconnected service components. Each service within the chain is dependent on the downstream services—that is, a service can't proceed until it receives a response from the downstream services that are invoked. As unpredictable conditions arise (such as sudden high latency of the bus connecting the interoperating services), services may become blocked and fail. The failures are propagated up the chain causing other services to fail and possibly decreasing the stability of the overall SOA deployment.

How significantly a nonresponsive service disrupts the entire SOA infrastructure depends on the resiliency of the SOA. Resiliency, defined as the continued availability and performance of a service despite negative changes in its environment, is critical to maintaining a healthy SOA. This article explores short-term solutions designed to increase the resiliency of SOAs that are susceptible to forming chains of synchronously interconnected service components.

Short-term solutions

Short-term solutions are solutions that can be readily applied to an existing SOA framework with little or no change to the overall architecture. These solutions involve configuration tuning and other subtle optimizations that require little restructuring of the SOA. At worst, the required restructuring is manageable, and quickly and easily implemented.

The primary goals of the short-term solutions are:

  • Improve the overall stability of the SOA infrastructure by reducing its fragility.
  • Improve tolerance of exceptional conditions in such an infrastructure.
  • Improve performance and reduce millions-of-instructions-per-second (MIPS) usage.
  • Improve manageability through increased serviceability.

This article focuses on two short-term solutions:

  1. Collocation of tightly coupled SOA services
  2. Implementing aggressive timers on SOA service invocations

Short-term solution 1: collocation of tightly coupled SOA services

Collocation essentially means the deployment of SOA services or applications on the same physical system. The term tightly coupled describes the relationship between two or more services that are invoked synchronously, meaning the calling service (the client) is blocked on the service invocation and can't continue until it receives a response from the service it invoked. Several tightly coupled services might participate in a single transaction, creating a chain of synchronously interconnected components within the SOA.

The short-term solution discussed in this section involves co-deploying (where possible) tightly coupled business applications that act as components in a chain of synchronously interconnected components. The focus is on business applications that are deployed to separate servers but share a great deal of synchronous intercommunication (that is, they're tightly coupled). These business applications might benefit greatly from redeployment to the same server, which in turn might translate to benefits for the overall health and stability of the SOA.


Benefits of collocation

One of the obvious reasons for co-deploying tightly coupled business applications is the measurable performance improvement. Co-deployed applications can take advantage of intraprocess communication protocols, which typically facilitate direct communication between the applications. Intraprocess communication avoids all the overhead of a remote service invocation, including serialization, encryption, traversing the network stack, and network latency.

A second reason for co-deployment of tightly coupled business applications is the reduced stress on resources, specifically task- or thread-level resources. A synchronous service invocation over local intraprocess communication protocols typically places much less stress on the local server's resources than an invocation made over remote communication protocols. This is illustrated most clearly with an example, covered in the following section.


Problem scenario: heavy workload versus limited resources

Consider the situation in Figure 1 where business application A deployed in server 1 synchronously invokes service B, which is hosted by business application B and deployed in server 2. The two servers are connected physically by some service bus (typically a network connection).


Figure 1. Application servers hosting different but synchronously dependent applications connected through some communication channel (such as HTTP or RMI)

An external client invokes application A in server 1. The work request is dispatched to a managed task within server 1. Because the communication between application A and service B is synchronous (for example, Remote Method Invocation over Internet Inter-ORB Protocol, or RMI/IIOP), the managed task in server 1 that is dispatching the work against application A must block until it receives a response from service B. While blocked, the managed task can perform no other work.

This may become a problem if the hosting server (server 1) is under heavy load and has only a limited number of managed task resources available. As more and more of its managed tasks become blocked on synchronous remote service calls, the capacity of the server to dispatch work diminishes, and thus the overall service rate of the server decreases.

If the service rate drops below the arrival rate of new work, the server falls behind and new work is queued up until a managed task becomes available to dispatch it. If all the managed tasks become blocked, the server can't perform any work at all and appears hung. If the condition persists, the queued work and dispatched work begin to time out and fail.
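The relationship between blocked tasks, service rate, and queue growth can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch under simplifying assumptions: a fixed pool of managed tasks, each unblocked task completing one request per fixed service time, and blocked tasks contributing nothing. The function names and the sample figures (100 tasks, 60 arrivals per second) are illustrative and do not come from the article.

```python
def effective_service_rate(total_tasks: int, blocked_tasks: int, service_time_s: float) -> float:
    """Requests per second the server can still complete.

    Assumes every unblocked task finishes one request per service_time_s
    and that blocked tasks complete no work at all.
    """
    if not 0 <= blocked_tasks <= total_tasks:
        raise ValueError("blocked_tasks must be between 0 and total_tasks")
    if service_time_s <= 0:
        raise ValueError("service_time_s must be positive")
    return (total_tasks - blocked_tasks) / service_time_s


def queue_growth_per_second(arrival_rate: float, service_rate: float) -> float:
    """Net requests added to the queue each second (0 if the server keeps up)."""
    return max(0.0, arrival_rate - service_rate)


if __name__ == "__main__":
    # Hypothetical server: 100 managed tasks, 1-second requests, 60 arrivals/second.
    for blocked in (0, 30, 50, 100):
        rate = effective_service_rate(100, blocked, 1.0)
        growth = queue_growth_per_second(60.0, rate)
        print(f"blocked={blocked:3d} service_rate={rate:6.1f}/s queue_growth={growth:5.1f}/s")
```

In this sketch the queue starts to grow as soon as more than 40 of the 100 tasks are blocked, and with all 100 blocked the service rate is zero, which is the point at which the server appears hung.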


Figure 2. External services can dramatically reduce the service rate of the server by blocking execution threads for extended periods of time

As server 1's overall service rate suffers, so does the service rate of all services hosted by server 1. Although the actual problem may be isolated to the faulty or high-latency communication between application A and service B, the effects of the problem permeate the entire server. The managed task resources are a shared resource—all applications hosted by server 1 share the same set of managed tasks. As the managed tasks are consumed and blocked by application A, other applications are unable to dispatch. This might cause an otherwise healthy service to fail, as it can't get enough resources to execute.


Solution: co-deploy tightly coupled applications

Co-deploying the two tightly coupled applications within the same server alleviates the stress on resources. If collocated, application A can invoke service B using local intraprocess protocols, which essentially allow for direct invocation. Remote communication is avoided, and with it the problems associated with remote latency. Managed tasks no longer block while waiting on a remote response, so they can execute to completion more efficiently.
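To illustrate the difference between the two bindings, the sketch below has application A depend only on an interface for service B, so that the same calling code can be given either a direct in-process reference or a remote-style stub. Everything here is hypothetical: the class and method names are invented for illustration, and the stub merely counts marshalling steps with `pickle` instead of crossing a real network.

```python
import pickle
from typing import Protocol


class ServiceB(Protocol):
    """The interface application A depends on."""

    def lookup(self, key: str) -> dict: ...


class LocalServiceB:
    """Service B running in the same process as its caller."""

    def lookup(self, key: str) -> dict:
        return {"key": key, "value": key.upper()}


class RemoteStubB:
    """Stand-in for a remote proxy: marshals arguments and results.

    A real stub would also traverse the network stack and block the
    calling task while it waits; this sketch only counts marshalling steps.
    """

    def __init__(self, target: ServiceB) -> None:
        self._target = target
        self.marshal_steps = 0

    def lookup(self, key: str) -> dict:
        request = pickle.dumps(key)  # serialize the argument
        self.marshal_steps += 1
        result = self._target.lookup(pickle.loads(request))
        response = pickle.dumps(result)  # serialize the reply
        self.marshal_steps += 1
        return pickle.loads(response)


def application_a(service_b: ServiceB, key: str) -> str:
    """Application A's logic is identical whichever binding it is given."""
    return service_b.lookup(key)["value"]


if __name__ == "__main__":
    local = LocalServiceB()
    stub = RemoteStubB(local)
    print(application_a(local, "order-42"))  # direct invocation, no marshalling
    print(application_a(stub, "order-42"), "marshal steps:", stub.marshal_steps)
```

Keeping the caller written against the interface is one way to make a later redeployment to the same server a configuration change instead of a code change.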

With fewer blocked managed tasks, the server can maintain a higher service rate and is more likely to avoid queue growth. Avoiding queue growth prevents dispatch timeouts, which in turn prevents the cascading failures that affect other components within the SOA. The result is a more stable and more resilient SOA.


Figure 3. The collocation of synchronously dependent services

As illustrated here, faulty communication or poor performance between an application and the service it invokes synchronously can have negative effects on the entire server and, in turn, all of the applications and services hosted by that server. As the services at server 1 are affected, so are all the components within the SOA that use those services. Therefore, the problems developing on server 1 could ripple across the entire SOA, disrupting the SOA on a large scale. As a short-term solution, collocation of tightly coupled business applications and services can mitigate some of the instability that might develop in the SOA environment.


Short-term solution 2: implementing aggressive timers on service invocations

A second short-term solution that can improve the stability of the SOA infrastructure, specifically in the area of tightly coupled services that communicate synchronously, is to apply aggressive timers to govern the service invocations. In situations where the tightly coupled services can't be collocated, aggressive service-invocation timers can be applied to alleviate the negative effects of a nonresponsive service.

An aggressive service-invocation timer is set close to the reasonably expected response time of the service, with some allowance for transient delays. In practice, many timers are set far longer than the average response time of the service. For example, a service that typically dispatches and completes requests within one to two seconds might have a dispatch timeout of 300 seconds (five minutes). That's an example of a nonaggressive timer.

The idea is that if the service typically responds within two seconds but a particular invocation hasn't responded in 30 seconds, it's reasonable to assume that the service is suffering some condition that has rendered it nonresponsive. It's unlikely that the service will respond given extra time; therefore, waiting an additional 270 seconds before timing out the request is probably unnecessary. A nonaggressive timer such as the 300-second example gives the service every opportunity to respond. However, it gives no consideration to the effect the lengthy delay might have on other components synchronously involved in the transaction, nor to the effect of consumed and blocked shared resources on other services hosted within the local server.
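One way to arrive at such a value is to derive it from measured response times. The sketch below is an assumption, not a rule stated in this article: it takes a high percentile of observed response times (nearest-rank method) and multiplies it by a headroom factor to allow for transient delays. The function name, the 99th-percentile default, and the headroom factor of 2 are all illustrative choices.

```python
import math
from typing import Sequence


def suggest_timeout(response_times_s: Sequence[float],
                    percentile: float = 0.99,
                    headroom: float = 2.0) -> float:
    """Suggest an invocation timeout from observed response times.

    Uses the nearest-rank percentile of the samples, scaled by a headroom
    factor. Both defaults are illustrative assumptions.
    """
    if not response_times_s:
        raise ValueError("at least one response-time sample is required")
    if not 0.0 < percentile <= 1.0:
        raise ValueError("percentile must be in (0, 1]")
    ordered = sorted(response_times_s)
    rank = max(1, math.ceil(percentile * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1] * headroom


if __name__ == "__main__":
    # Hypothetical measurements: most calls take 1 second, a few take 2 seconds.
    samples = [1.0] * 95 + [2.0] * 5
    print(f"suggested timeout: {suggest_timeout(samples):.1f} s")
```

For a service with the one-to-two-second profile described above, this yields a timeout of a few seconds, two orders of magnitude shorter than a 300-second default.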


Problem scenario: blocked managed-task resources

The problem with nonaggressive timers is that they take too long to react to exceptional conditions in the environment that might cause a service to become suddenly, temporarily nonresponsive (for example, a broken network connection). Within a virtualized shared-resource infrastructure, such as an application server, the managed task that invoked the service stays blocked until the nonaggressive invocation timer expires. Only then is the service request terminated and the blocked managed task released.

While blocked, the managed task continues to hold any shared resources it acquired during the course of its dispatch (for example, storage resources, task-level resources, and mutex locks). The shared resources held by the blocked managed task are unavailable to other tasks, which can inhibit the processing of other dispatched work within the server and can possibly lead to a decrease in the overall service rate of the application server.

Furthermore, depending on the nature of the communication failure, other managed tasks within the application server can be experiencing the same nonresponsiveness of the service. Under heavy load and limited task resources, it's possible that all available managed tasks within the server will become blocked on the nonresponsive service. With all managed task resources consumed, the application server can't process more incoming work.


Figure 4. Nonresponsive services can block all available managed tasks within the application server

In effect, the application server has itself become nonresponsive, just like the service that its threads are trying to invoke. And just as the nonresponsiveness of the downstream service caused the application server to become nonresponsive, the nonresponsiveness of the application server can have similar adverse effects on other upstream components in the chain of synchronously interconnected components.

The application server remains nonresponsive until an invocation timer expires, terminates the service invocation, and resumes a blocked managed task, allowing the managed task to complete its dispatch and accept new work. The less aggressively the invocation timer is set, the longer the application server remains nonresponsive. The longer the application server remains nonresponsive, the more likely other components within the SOA that invoke services against the application server will be similarly affected. This could lead to a disruption and destabilization of the entire SOA.


Figure 5. Nonresponsive services can cause issues to upstream callers, specifically by blocking their worker threads and decreasing the service rate of their corresponding servers

As illustrated in Figure 5, the negative effects of a single nonresponsive service combined with nonaggressive timers could cascade throughout the SOA, having extremely harmful consequences on the SOA's stability and availability.


Solution: aggressive timers

The solution to this problem is to implement aggressive invocation timers on synchronously invoked services. Appropriately aggressive timers alleviate the stress on shared resources within the application server—specifically managed task resources—as managed tasks are not blocked for an unreasonably lengthy period of time on a service that will likely never respond.

The aggressive timer aborts the service invocation as soon as it appears to be nonresponsive, allowing the managed task to complete its dispatch and accept new work. With more managed tasks available to process work, the application server maintains a higher service rate, thereby relieving the stress on the entire chain of synchronously interconnected components within the SOA.
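The behavior described above can be sketched in a few lines. The example below is a minimal illustration, not an application-server implementation: it runs a downstream call on a helper thread pool and lets the caller give up after a fixed number of seconds. The names `invoke_with_timer` and `ServiceTimeout` are hypothetical. One caveat applies to this particular technique: the timer frees the caller, but the helper thread stays occupied until the call itself returns, so in practice the timeout should also be configured on the underlying transport where possible.

```python
import concurrent.futures
import threading
import time


class ServiceTimeout(Exception):
    """Raised when a downstream call exceeds its invocation timer."""


def invoke_with_timer(pool, timeout_s, service_call, *args):
    """Run service_call on pool and wait at most timeout_s for its result."""
    future = pool.submit(service_call, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        raise ServiceTimeout(f"no response within {timeout_s} s") from None


if __name__ == "__main__":
    release = threading.Event()

    def hung_service():
        release.wait()  # simulates a downstream service that never answers
        return "late"

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        started = time.monotonic()
        try:
            invoke_with_timer(pool, 0.2, hung_service)
        except ServiceTimeout as exc:
            print(f"aborted after {time.monotonic() - started:.1f} s: {exc}")
        finally:
            release.set()  # let the helper thread finish so the pool can shut down
```

The caller regains control after the timeout and can fail the request quickly or fall back to another path, instead of holding its own resources for the full duration of the outage.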


Figure 6. Aggressive timeout values can alleviate the impacts of nonresponsive services by releasing consumed resources sooner

The queuing theory models in Figures 7 and 8 illustrate quantitatively how aggressive timers increase the service rate (or throughput) of the application server. The model assumes an average service response time of one second, with a 99% probability that the service will complete successfully. There is a 1% probability of the service being nonresponsive. In the nonresponsive cases, the managed task executing the service doesn't respond until the service timer expires. Between the two figures, the service timeout value is varied from 60 seconds in Figure 7 (nonaggressive) to 20 seconds in Figure 8 (aggressive).


Figure 7. Simple quantitative model where the timeout value is set to 60 seconds and the effective throughput is 63 requests per second

With a 60-second timeout value, the model computes an average service time of 1.59 seconds per request (0.99 × 1 second + 0.01 × 60 seconds), which translates to a throughput of about 63 requests per second (100 / 1.59) in an application server containing 100 worker threads. Figure 8 shows how implementing a more aggressive service timer improves the service rate and throughput of the server.


Figure 8. Simple quantitative model where the timeout value is changed to 20 seconds and the effective throughput is increased to 84 requests per second

With the more aggressive 20-second timeout value, the model computes an average service time of 1.19 seconds per request (0.99 × 1 second + 0.01 × 20 seconds), which translates to a throughput of about 84 requests per second (100 / 1.19). The model's calculations demonstrate how the aggressive timer in Figure 8 raises throughput from 63 to 84 requests per second, an improvement of roughly one third.
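The arithmetic behind both figures can be reproduced in a few lines. The figures themselves are not available here, so the formulas below are a reconstruction: the mean time a worker thread is occupied per request is taken as the probability-weighted average of the normal response time and the timeout, and throughput is the number of worker threads divided by that mean. This reconstruction is an assumption, although it reproduces the numbers quoted in the text; the function names are illustrative.

```python
def mean_service_time(normal_s: float, timeout_s: float, p_nonresponsive: float) -> float:
    """Average time a worker thread is occupied by one request.

    A responsive call takes normal_s; a nonresponsive call holds the
    thread until the timer expires after timeout_s.
    """
    if not 0.0 <= p_nonresponsive <= 1.0:
        raise ValueError("p_nonresponsive must be between 0 and 1")
    return (1.0 - p_nonresponsive) * normal_s + p_nonresponsive * timeout_s


def throughput(worker_threads: int, mean_time_s: float) -> float:
    """Requests per second when each thread serves one request at a time."""
    return worker_threads / mean_time_s


if __name__ == "__main__":
    for timeout_s in (60.0, 20.0):
        mean_s = mean_service_time(1.0, timeout_s, 0.01)
        rate = throughput(100, mean_s)
        print(f"timeout={timeout_s:.0f}s mean={mean_s:.2f}s throughput={rate:.0f}/s")
    # timeout=60s mean=1.59s throughput=63/s
    # timeout=20s mean=1.19s throughput=84/s
```

Under the same assumptions, the functions can be used to explore other timeout values or failure probabilities for a given server.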

It's important to note that the implementation of aggressive timers doesn't correct or prevent the transaction failure. In fact, one consequence of the aggressive timer is that it might abort a service request and fail a transaction that otherwise would have completed given more time. But the more important consideration here is the overall health of the application server and, more broadly, the health of the SOA. As described above, aggressively set timers can help mitigate the negative effects that cascade across the SOA when a service becomes nonresponsive. Ultimately, well-implemented aggressive timers result in a more resilient SOA.


Conclusion

The purpose of this article was to introduce you to short-term, immediately applicable solutions for specific performance and availability problems that may arise with the use of tightly coupled, synchronously interconnected components within the SOA. The article illustrated how a single nonresponsive service can have cascading effects throughout the SOA, devastating the overall health of the SOA and possibly leading to destabilization and a disruption of service.

The solutions described here, collocation of tightly coupled business applications and the application of aggressive timers, are considered short-term because they can be applied readily to an existing SOA with very little restructuring or redesign of the SOA framework. Upcoming articles in this series present long-term solutions that are more comprehensive and require more planning, design, or restructuring effort.

These short-term solutions are vital for their immediate stabilizing effect on an SOA that might already be suffering from the problems described in this article. The solutions also apply to SOA infrastructures that are susceptible to forming such chains of synchronously interconnected components. The ability of these solutions to stabilize the SOA and avoid the disruption of SOA services leads to the increased resiliency of the SOA.

