Planning and handling timeouts in service-oriented environments
Managing exceptional situations is important in any software solution. A system could be unavailable because of a hardware or network problem, vendor software could contain a bug, or an application could fail because of a user error. Serious design and development effort must be spent on ensuring that any solution handles situations like these in a graceful manner. The same is true when developing a service-oriented solution -- and the fact that SOA solutions aim to be more loosely coupled than other software solutions increases the need for proper non-functional design, development, and testing.
One often overlooked aspect of non-functional solution design is timeouts. There are many points in a solution design where timeouts can and should be configured. Timeouts can be configured at different levels, set dynamically through code, changed administratively, and set by clients outside of your domain. All these different manners of defining timeouts must be viewed and handled in concert to achieve the desired end-to-end behavior. This article puts the focus on timeouts with a comprehensive discussion of these elements, and also looks at how you can manage timeouts with some of the products in the IBM® WebSphere® portfolio.
What is a timeout?
There are probably several formal and informal definitions, but for the purpose of this article, let’s define it as the maximum length of time for which a piece of software logic will continue to wait for a certain event to occur. This software logic can be written in a higher level programming language, or it can be executed as assembler logic on a computer chip. In all cases, however, a timeout will cause this logic to wait for something to happen, and if that something does not happen in the pre-defined amount of time, it will stop the "success" execution flow and raise this non-occurrence as an exception of some sort. If the event does occur before the defined time has expired, then normal execution continues.
In order to reasonably limit the scope of the timeout types discussed here, the remainder of this article will focus on timeouts that apply to service-oriented environments that use Java™-based application servers and, in particular, IBM WebSphere products. How and where these timeouts are specified differ in many ways:
- Programmatically, usually when connecting over a protocol or when managing threads.
- Administratively, through an application server administrative console or script, commonly for transactions and user sessions.
- By the client making a request.
A programmatic example would be defining a timeout value when receiving a message using the Java Message Service (JMS), by passing it to the receive() method on the javax.jms.MessageConsumer interface. For another example, this time using a Java Enterprise Edition (Java EE) application server (such as IBM WebSphere Application Server), you can use the "work manager" support to start so-called asynchronous beans on a new thread and then wait for this thread to complete, defining a maximum amount of time that the current thread will wait for this completion.
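The two programmatic cases above can be sketched with standard-library analogs. This is a minimal sketch under stated assumptions, not the JMS or WebSphere work manager API: a timed `BlockingQueue.poll` stands in for a timed receive on a message consumer, and a timed `Future.get` stands in for waiting on work started on another thread. The class and method names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ProgrammaticTimeouts {

    // Waits up to timeoutMillis for a message. Returns null if nothing
    // arrives in time, similar to how a timed receive reports "no message".
    static String receiveWithTimeout(BlockingQueue<String> queue, long timeoutMillis)
            throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Runs the task on a separate thread and waits at most timeoutMillis
    // for it to complete. Throws TimeoutException if the wait expires.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(task);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupts the worker thread if it is still running.
            executor.shutdownNow();
        }
    }
}
```

In both cases the caller decides what the non-occurrence means: a null message or a `TimeoutException` is where the "success" flow stops and exception handling begins.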
An administrative type of timeout is not set programmatically, but rather can be configured (for example) in an application server. A typical timeout used with WebSphere Application Server is the default transaction timeout, which defines how long a transaction can run before it is automatically rolled back. When you are programming the client, another configurable timeout is the default HTTP timeout, which is used to indicate how long an outbound HTTP request will wait for a result message to come back before the client signals an error.
Another type of timeout, also configurable, is used in the context of an application that executes across multiple servers and clients. This timeout can be defined to indicate the maximum time an inactive user (for example, using a browser to access a secured Web site) will be considered logged in before being forced to re-authenticate. This session timeout also defines how long session state is cached on the server on behalf of a user (WebSphere Application Server sets this to 30 minutes by default). Or, the client inactivity timeout in the application server's transaction service can be used to define how long a transaction will stay open between requests from a client before being rolled back (the default is 60 seconds, by the way).
Beyond using a standard Java EE application server, many enterprises utilize higher level products like IBM WebSphere Process Server or IBM WebSphere ESB. These products introduce yet more ways of defining timeouts as part of their higher level function. For example, the BPEL language, which is supported by WebSphere Process Server, lets you specify a timeout on a receive activity; that is, when the process is waiting for an external message to arrive. In WebSphere ESB, you can define an async timeout on a callout primitive, which defines how long a mediation will wait for a response that is received asynchronously.
Some of the timeouts mentioned above affect only the currently executing thread, whereas others (for example, the default transaction timeout) apply to any thread running on the application server, and still others affect only a specific application. You can, therefore, define three different groups of timeouts: thread-scoped, server-scoped, and application-scoped.
Timeouts in all three groups can be set programmatically or by configuration, even though programmatic access is typically thread-scoped.
A fourth group of timeouts are those set by the client of a request. For example, a Web browser will only keep an HTTP connection with a server alive for a certain, pre-defined time. The default for this is usually 300 seconds, or 5 minutes. Most or all browsers offer ways of changing this default. There is little you can do on the server to influence this, other than somehow transfer (dummy) data while working on a request to prevent the browser from resetting a connection.
The examples above merely scratch the surface on timeouts, which shows how ubiquitous timeouts are in any solution. Most of the configurable timeouts have default values that are set to cover the most common use cases, and so you might never have to touch them. But as you will see, many of them interact with each other, and so it is certainly worth your time to evaluate whether the default values fit your needs before putting a solution into production.
In order for you to properly design timeout values, it is important that each service documents its required timeout behavior; in other words, the time in which the service can be expected to complete a request under normal conditions. Typically, this kind of information is prepared for a service level agreement (SLA). Every SLA must take into consideration the service levels supported by downstream services and components it invokes.
The impact of intermediaries
So far in this article, a timeout has been approached as something that applies to an interaction between two partners: a consumer receives a message from a queue, or a client user has a session in a Web server, or a service consumer invokes a service provider and waits for a response. However, systems that are built following principles of service orientation, such as loose coupling and separation of concerns, will often introduce intermediaries. In other words, messages flow through multiple hops before reaching their final destination. Each interaction between two individual hops can have its own set of timeouts, all of which in concert define the overall behavior of a service invocation.
Take a look at the example shown in Figure 1.
Figure 1. Sample scenario with intermediaries
Assume that a user wants to place an order over the Web. The manufacturer offers an online interface into its service-oriented ordering system. The architecture consists of:
- An ESB gateway that handles security policies
- An ESB for message protocol transformation
- A utility service for auditing and logging of incoming orders
- An Order business service
- A legacy back-end system, running on a mainframe, where the actual orders are managed and stored.
Between each one of these components you have the potential for timeouts to occur. Moreover, processing that occurs within each component could also time out. At the same time, what matters to the user is that his or her order gets processed in a timely manner, and if it cannot be processed as such, then no unwanted side effects happen; for example, that there is no charge for an order that could not be fulfilled.
This latter point is most important: avoiding side effects. In other words, how do you establish transactional behavior without having an actual, distributed transaction? (Using distributed transactions across service invocations is unusual in SOA.) In a transaction, a set of steps either all happen, or none of them happen.
Suppose a new order -- assuming it translates into a synchronous request-response invocation of the Order service -- is executed on the back-end, but the session with the Web server times out before a confirmation can be returned to the customer. Is it acceptable if the customer receives an error message, even though the order has actually been carried out in the back-end system? While the answer depends on the actual scenario and the associated requirements, there are some timeout guidelines that generally apply.
Most importantly, the timeout values should decrease "downstream." This means that each client of an invocation should not time out before the timeout of the invoked component is reached. What this rule of thumb tries to accomplish is to give a downstream component the opportunity to time out on a request before a component further up the call stack does. Applying this to the sample scenario presented earlier might lead to timeouts for the remote invocation to be set as shown in Figure 2. (The values shown are examples only. The actual gap between the individual timeout values will differ depending on the expected latency in the network, and other factors.)
Figure 2. Applying timeout values to the sample scenario
You can translate this approach into a plain formula for timeout values in a multi-hop environment:
$t_n = t_{n+1} + \Delta_n$
where $t_n$ is the timeout used by component n when it invokes component n+1, and $\Delta_n$ equals the expected time spent within a component plus the overhead that is incurred between two intermediaries (such as network latency, serialization, and so on). Note that in Figure 2 the timeout increases by only 1 second at each step along the way, which is not very realistic. As the example shows, the timeout offered by the last component in the chain (in this case, the legacy back-end) determines the timeout that must be configured for all previous components.
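The formula can be applied backward from the last component in the chain. The following is a minimal sketch; the class name, method names, and sample values are hypothetical and are not the values from Figure 2.

```java
final class TimeoutChain {

    // deltasMillis[i] is the expected processing time plus transport overhead
    // contributed by hop i, where index 0 is the hop closest to the consumer.
    // lastTimeoutMillis is the timeout offered by the final component.
    // The result satisfies timeouts[i] = timeouts[i + 1] + deltasMillis[i].
    static long[] compute(long lastTimeoutMillis, long[] deltasMillis) {
        long[] timeouts = new long[deltasMillis.length + 1];
        timeouts[deltasMillis.length] = lastTimeoutMillis;
        for (int i = deltasMillis.length - 1; i >= 0; i--) {
            if (deltasMillis[i] < 0) {
                throw new IllegalArgumentException("delta must not be negative");
            }
            timeouts[i] = timeouts[i + 1] + deltasMillis[i];
        }
        return timeouts;
    }

    // Checks the rule of thumb: every hop must wait strictly longer
    // than the hop it calls.
    static boolean decreasesDownstream(long[] timeouts) {
        for (int i = 0; i + 1 < timeouts.length; i++) {
            if (timeouts[i] <= timeouts[i + 1]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a back-end timeout of 10 seconds with deltas of 1, 1, and 2 seconds (`compute(10_000, new long[] {1_000, 1_000, 2_000})`) yields 14, 13, 12, and 10 seconds from the outermost hop inward.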
Moreover, you have to distinguish between synchronous and asynchronous communication between the individual components. Up to now, we have assumed a synchronous interaction, where each client of an invocation expects a response message and will wait until this response arrives or times out. In cases where you send a "one way" message downstream, the situation is different in that each client will not wait for the processing of a message to complete, but simply "fire and forget" the message. Timeouts are less of a concern in those cases, because there is no dependency in an upstream component on the timeliness of processing requests downstream.
For example, the invocation of the Utility service shown in Figure 2 might happen asynchronously; that is, in parallel with the invocation of the business service. In that case, the timeout for the invocation of the business service can be set regardless of the behavior of the utility service. Whether the invocation happens synchronously or asynchronously depends on the nature of the utility service.
On an application level, however, timeouts come back into the picture when there is an expected response to a given request, but this response is received asynchronously. In other words, a client’s thread of execution might not be blocked immediately after a request has been sent. Instead, it might explicitly poll for the response at a later time (or in a parallel thread), and this reception of an expected response might still cause the overall interaction to time out if the response does not appear within a certain timeframe. In these cases, the timeout values that are used on the response stream must be set in correlation with when the request was sent.
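One way to correlate the response timeout with the time the request was sent is to compute the remaining wait budget at the moment the client starts polling. The sketch below assumes the response is represented by a `java.util.concurrent.Future`; the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AsyncResponseDeadline {

    // Returns how much of the overall timeout is left, measured from the
    // moment the request was sent. Never returns a negative value.
    static long remainingMillis(long sentAtNanos, long nowNanos, long timeoutMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(nowNanos - sentAtNanos);
        return Math.max(0L, timeoutMillis - elapsedMillis);
    }

    // Waits for the response only for the time that is left of the overall
    // timeout, no matter how late the caller starts waiting.
    static <T> T awaitResponse(Future<T> response, long sentAtNanos, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        long remaining = remainingMillis(sentAtNanos, System.nanoTime(), timeoutMillis);
        return response.get(remaining, TimeUnit.MILLISECONDS);
    }
}
```

The caller records `System.nanoTime()` when it sends the request and passes that value in later, so time spent on other work between sending and polling counts against the same timeout.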
Dynamic timeout handling
The discussion in the previous section appears to assume a statically defined environment. In other words, Figure 2 assumes that the timeout between individual components is always as defined, regardless of which service is invoked, and regardless of the content of each message.
In real life situations, however, this is rarely ever good enough. Assume, for example, that an ESB communicates with more than one back-end service provider (which, of course, is virtually always the case), and that different providers have different timeouts. As mentioned earlier, configured timeout values are determined by the last component in the chain. This means that the timeout that is used in a previous component will have to change when a new component with different timeout characteristics is added. This will become obvious when new services are added to the example (Figure 3).
Figure 3. Adding multiple service providers
Here, the addition of a service provider that requires a timeout value of 30 seconds forces the Web server, ESB gateway, and so on, to increase their timeout values accordingly.
Rather than increasing timeouts statically whenever a new component is added that requires larger timeout values, the environment should be capable of adjusting dynamically to these new requirements. Timeout values should ideally be scoped to the service that is being invoked, which means that the intermediary applies a timeout value dynamically at run time.
Figure 4. Dynamic timeout values
Figure 4 assumes that each message flowing through the system is subjected to a different set of timeouts, depending on which service is ultimately invoked at the end of the chain.
Figure 4 presents a somewhat ideal solution, because some of the components might not support dynamically setting timeouts. One example for this is the TCP/IP connection timeout between the browser and the Web server, which is always static (it is usually set to 5 minutes).
There are cases when scoping a timeout value to a particular service is not sufficient, and cases when the timeout depends on the content of a message. An example of this is when a service provider accesses multiple back-end systems with different timeout requirements, but only determines at run time which back-end is invoked. This requirement is very difficult to implement in practice, but let’s go back to the example scenario to evaluate this requirement (Figure 5).
Figure 5. Adding multiple backends
Assume, for example, that the consumer sends a message to the service provider. The service provider parses the request message and determines that it has to invoke a back-end function that requires a 5 second timeout. Prior components, like the ESB, will not know which function will be invoked, and thus cannot dynamically adjust the used timeout retroactively when passing the message to the business service provider. The only way out of this dilemma is to use a timeout that is bigger than the maximum timeout required by any of the downstream components that might be invoked. You can slightly change the formula introduced earlier to account for this:
$t_n = t_{n+1}^{\max} + \Delta_n$
Staying with the example above, however, the business service provider still should dynamically adjust the timeout value that is used for its downstream invocation, and sometimes that introduces a challenge, depending on the protocol and API that is used.
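The adjusted formula, which takes the largest timeout of any downstream component that might be invoked, can be sketched as follows. The class and method names are hypothetical, and the sample values are illustrative only.

```java
final class MaxDownstreamTimeout {

    // Returns the timeout an upstream component needs when it cannot know
    // which downstream component will be invoked: the largest candidate
    // timeout plus the upstream component's own delta.
    static long upstreamTimeout(long[] downstreamTimeoutsMillis, long deltaMillis) {
        if (downstreamTimeoutsMillis.length == 0) {
            throw new IllegalArgumentException("at least one downstream timeout is required");
        }
        long max = downstreamTimeoutsMillis[0];
        for (long candidate : downstreamTimeoutsMillis) {
            max = Math.max(max, candidate);
        }
        return max + deltaMillis;
    }
}
```

With back-end functions that require 5 and 30 seconds and a delta of 1 second, the upstream component needs 31 seconds, even for requests that end up at the 5 second function.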
Ideally, the timeout values to be used for specific services (or even specific messages) should be defined externally, so that they can be changed without having to redeploy anything. For example, you can use the IBM WebSphere Service Registry and Repository to store this information, retrieve it at run time, and then the timeout value can be set using the programmatic techniques described earlier.
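As a minimal sketch of externally defined, service-scoped timeouts, the following uses a `java.util.Properties` object as a stand-in for whatever external store is used. It does not show the WebSphere Service Registry and Repository API; the `timeout.<serviceName>` key format, the class name, and the method names are assumptions made for illustration.

```java
import java.util.Properties;

final class ServiceTimeouts {

    private final Properties config;
    private final long defaultMillis;

    ServiceTimeouts(Properties config, long defaultMillis) {
        this.config = config;
        this.defaultMillis = defaultMillis;
    }

    // Looks up the timeout for a service under the hypothetical key
    // "timeout.<serviceName>". Falls back to the default when the key is
    // missing or its value is not a valid number.
    long timeoutMillisFor(String serviceName) {
        String value = config.getProperty("timeout." + serviceName);
        if (value == null) {
            return defaultMillis;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return defaultMillis;
        }
    }
}
```

An intermediary would call `timeoutMillisFor` just before each downstream invocation and pass the result to the programmatic timeout of the protocol or API in use, so that a changed value takes effect without redeployment.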
Timeouts and exception handling
When a timeout occurs during the processing of a service request or response message, it is considered an exception and some measure must be taken to deal with the exception. The most common response is to send an error message back to the consumer indicating that a timeout occurred. If the invocation was part of a (distributed) transaction, using the XA protocol or the WS-AtomicTransaction standard for Web services, the transaction can be rolled back and all changes that were made during processing are automatically reverted. However, most SOA systems are loosely coupled and do not start distributed transactions. This means that whenever a timeout occurs on the client side of an invocation, the server side might still complete part or all of the underlying logic successfully, possibly creating changes to the system that you must compensate for since the client assumes something went wrong.
While you try to avoid these cases by decreasing timeout values downstream, as described earlier, you cannot rule out that a timeout might occur and some function will still complete, albeit slower than expected. Depending on the functionality of the invoked service, the consumer of the service might have to explicitly call an "undo" function to make sure that all possible changes and permanent side effects of the invocation are compensated. The WS-BusinessActivity standard describes a formal protocol for such compensations. Some compensation might also be possible in one of the intermediaries; for example, in the ESB.
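The consumer-side pattern of calling an "undo" function after a timeout can be sketched as follows. This is an illustration under stated assumptions, not an implementation of WS-BusinessActivity: the service call and the undo operation are passed in as plain Java callbacks, the service is assumed to return a non-null result, and the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class CompensatingInvoker {

    // Invokes the service and waits at most timeoutMillis for the result.
    // On timeout, the consumer cannot know whether the provider completed
    // the work, so it stops waiting and triggers the compensating action.
    static <T> Optional<T> invokeOrCompensate(Callable<T> service, Runnable undo,
            long timeoutMillis, ExecutorService executor)
            throws InterruptedException, ExecutionException {
        Future<T> future = executor.submit(service);
        try {
            return Optional.of(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // stop the local wait; the provider may still finish
            undo.run();          // compensate for possible side effects
            return Optional.empty();
        }
    }
}
```

Because the original request might complete after the undo has been issued, a compensating operation designed this way generally needs to be safe to call whether or not the original change was actually made.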
Another way to deal with a timeout is to simply try to invoke the service again. Some protocols and products support built-in, configurable retry. This conveniently covers cases where outages are short lived and retries often succeed. However, retries have an impact on the timeout values that are used. For example, if the timeout between the ESB and the business service provider is set to 10 seconds, and a retry count of 3 is configured with a delay of 1 second for each retry, the total maximum timeout value that must be considered for this connection becomes 33 seconds. You can use this formula to calculate your new timeout:
$t_{\max} = (t + d) \cdot n$
where t is the timeout for a single attempt, d is the configured delay, and n is the number of retries. Figure 6 shows an example of how you configure this for a callout node in WebSphere ESB.
Figure 6. Configuring retry for a callout node in WebSphere ESB
In WebSphere ESB, the default setting for this is 0, meaning there are no retries.
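The retry formula can be sketched in code as follows. The class and method names are hypothetical. The formula itself yields 0 when no retries are configured, so this sketch assumes that in that case the plain timeout of the single attempt applies; that special case is an assumption, not something stated by the formula.

```java
final class RetryBudget {

    // Total time to allow for a connection with retries, following the
    // formula (timeout + delay) * retries. Assumption: with zero retries,
    // only the single attempt's timeout applies.
    static long maxTotalMillis(long timeoutMillis, long delayMillis, int retries) {
        if (timeoutMillis < 0 || delayMillis < 0 || retries < 0) {
            throw new IllegalArgumentException("arguments must not be negative");
        }
        if (retries == 0) {
            return timeoutMillis;
        }
        return (timeoutMillis + delayMillis) * retries;
    }

    // Timeout that the calling component needs so that it does not give up
    // before the retries are exhausted: the retry budget plus its own delta.
    static long requiredUpstreamMillis(long timeoutMillis, long delayMillis,
            int retries, long deltaMillis) {
        return maxTotalMillis(timeoutMillis, delayMillis, retries) + deltaMillis;
    }
}
```

With a 10 second timeout, a 1 second delay, and a retry count of 3, `maxTotalMillis(10_000, 1_000, 3)` returns 33000 milliseconds, which every component further upstream has to accommodate.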
In some cases, timeouts must be considered a somewhat common, almost normal occurrence. Assume, for example, that you have a business service that normally responds within a certain timeframe, but that takes longer for some requests that occur regularly but rarely. Your solution design must take these expected timeouts into consideration and react appropriately, either by compensating or by communicating to the consumer that a request that was earlier reported as having timed out has in fact completed successfully.
These considerations are just part of an overall exception handling and error strategy that must describe how to handle all kinds of exceptional situations, not just timeouts.
This article presented timeout values as an important aspect of planning and designing a service-oriented solution. The loosely coupled nature of such a solution makes it necessary to coordinate timeout values that are applied by all parts of the solution so that unnecessary errors and repeated cleanup activities can be avoided.
Moreover, the inherent flexibility of service-oriented solutions introduces a high level of dynamicity, which is also reflected in how timeouts are handled. Timeouts should be set dynamically across the entire solution wherever possible, sometimes even based on individual message content.
Special thanks to Rachel Reinitz, Distinguished Engineer, IBM Software Services for WebSphere, for her contributions to this article.