Technical Blog Post
Let's look at Dispatch Timeout Handling in WebSphere Application Server for z/OS
If you run WebSphere Application Server on z/OS, you are probably aware of the many 'timer' settings that can affect the workload running in the server.
This Blog entry will focus on the topic of dispatch timeout handling, and the tradeoffs between settings that control the behavior of the environment when dispatch timeouts occur.
Let's first look at the Dispatch Process Overview in the WebSphere Application Server on z/OS.
1. Request Received by Control Region (CR)
The HTTP request is received by the CR. The CR works with the z/OS Workload Manager (WLM) to classify the work, and a WLM enclave is created for the request.
2. Request Placed on WLM Queue
The CR places the request on the WLM work queue in preparation for dispatch into the Servant Region. A dispatch timer is started for the request.
The dispatch timer is central to this discussion. The question is what happens if the work does not complete within the timeout value for the dispatch.
Several options exist; they are discussed in the Dispatch Timeout Processing Overview section later in this Blog entry.
3. Request Dispatched to Thread in Servant Region (SR)
When a thread in the servant region is available to take work, WLM will dispatch the request from the work queue to the worker thread.
If no threads are available, the request remains in the WLM queue. A hung thread is not eligible for work dispatch. That is why having hung threads build up in a servant region is a problem: eventually no eligible threads remain and WLM can no longer dispatch work to the servant.
A timeout recovery setting of SERVANT addresses this by abending the servant so that a new servant JVM is created.
A setting of SESSION leaves the servant running, so it has no way of addressing thread exhaustion in the servant region.
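To make the thread exhaustion problem concrete, here is a minimal Python sketch. It is a toy model, not WebSphere or WLM code: the Servant class, its fields, and the dispatch function are hypothetical names invented for illustration, and the model ignores busy threads so that only hung threads reduce capacity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Servant:
    """Toy model of a servant region's worker thread pool (hypothetical)."""
    total_threads: int
    hung_threads: int = 0

    def eligible_threads(self) -> int:
        # A hung thread is not eligible for work dispatch.
        return self.total_threads - self.hung_threads


def dispatch(servant: Servant, queue: deque) -> list:
    """Move requests from the work queue to eligible threads.

    Returns the requests that were dispatched; anything left in
    `queue` is still waiting for a thread.
    """
    dispatched = []
    for _ in range(min(servant.eligible_threads(), len(queue))):
        dispatched.append(queue.popleft())
    return dispatched
```

In this model, a servant with three threads, all of them hung, dispatches nothing: dispatch() returns an empty list and every request stays on the queue, which is the situation a setting of SESSION cannot recover from.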
4. Request Processing
The work begins execution. How long the work takes to complete is a function of the application design. Some requests are very short-lived; others take longer because they perform more complex processing.
The goal is to have all work complete within the defined dispatch timer value.
However, some work fails to complete within the dispatch timer value. There are many different reasons why a request may not complete in time.
Some delayed requests may, given time, complete. Other delayed requests may, depending on why they are delayed, never complete.
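The relationship between the dispatch timer and the outcome of a request can be sketched in a few lines of Python. This is an illustration only: DispatchTimer and classify are hypothetical names, times are plain seconds, and the classification is made with hindsight. At the moment the timer expires, the runtime cannot know whether a request is merely delayed or truly hung.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DispatchTimer:
    """Hypothetical timer started when a request is placed on the work queue."""
    started_at: float  # seconds
    timeout: float     # seconds allowed before the timer expires

    def expired(self, now: float) -> bool:
        return now - self.started_at > self.timeout


def classify(timer: DispatchTimer, completed_at: Optional[float]) -> str:
    """Classify a request after the fact.

    completed_at is None when the request never completed.
    """
    if completed_at is None:
        return "hung"
    if timer.expired(completed_at):
        return "delayed"   # finished, but after the timer expired
    return "on time"
```

With a 30 second timeout started at time 0, a request that completes at 12 seconds is "on time", one that completes at 45 seconds is "delayed", and one that never completes is "hung".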
Dispatch Timeout Processing Overview
At a high level, the processing is:
- The container attempts to interrupt and 'shake loose' hung threads. If the attempt succeeds, the thread frees up; otherwise, the thread remains marked as hung.
- A threshold defines the percentage of total threads that may be marked hung before an EC3 abend occurs. This is a delaying action; it gives temporarily hung threads a chance to complete.
- If the threshold percentage is exceeded, the servant region is recycled with an EC3 abend.
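The threshold check described above amounts to a simple percentage comparison. The following Python sketch shows the arithmetic; the function and parameter names are hypothetical, and the strict "greater than" comparison is an assumption based on the word "exceeded", not a statement of how the server implements it.

```python
def should_recycle_servant(hung_threads: int,
                           total_threads: int,
                           threshold_percent: float) -> bool:
    """Return True when the share of hung threads exceeds the threshold.

    Assumption: 'exceeded' means strictly greater than the threshold.
    """
    if total_threads <= 0:
        raise ValueError("total_threads must be positive")
    hung_percent = 100.0 * hung_threads / total_threads
    return hung_percent > threshold_percent
```

Under these assumptions, with 10 threads and a threshold of 30 percent, three hung threads do not trigger the recycle but a fourth one does. A threshold of 0 means the first hung thread triggers it.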
Understanding the Nature of Hung Threads
As noted earlier, some threads may be marked as 'hung' but eventually complete. Others are marked 'hung' and never complete.
The distinction is important:
- delayed threads will eventually complete and be available for other processing
- threads that are truly hung will not be available for other work
When considering thread timeout behavior and the settings that are appropriate, there are three aspects of the runtime thread environment to keep in mind:
1. The nature of the timer expiration event: a delay or a true hang condition
2. The frequency of the thread delay or hang condition
3. The underlying cause of the thread delay or hang condition
With respect to #1 - neither a delayed thread nor a hung thread is desirable. Of the two, the delayed thread is somewhat easier to manage, depending on the duration of the delay and the frequency of occurrence (#2).
With respect to #2 - a timer expiration event that occurs rarely implies a different response than timer expiration events that occur frequently or, worse still, for every request. The former may be due to a rare combination of factors; the latter suggests a more systemic, structural problem.
With respect to #3 - depending on the frequency of timeout events and business impact of those timeouts, an investigation of the underlying cause will be called for. Timeouts may occur for a variety of reasons: insufficient system resources; network delays; DB2 tuning issues; or perhaps poor application design.
- If the frequency of timer expiration events is high, neither SERVANT nor SESSION will help. SERVANT can be configured with a threshold percent, but that is at best a delaying action; unless the threads clear faster than the timeouts occur, the threshold will be reached and an EC3 abend of the servant will take place. A setting of SESSION will let the hung threads stack up, eventually leaving the servant unable to process further work.
- If the frequency is low, then either SERVANT or SESSION will work, depending on the desired behavior. See previous chart in the Dispatch Timeout Processing Overview section.
- Unless the frequency is very low, some investigation of the underlying cause is called for. The server runtime's ability to manage poor thread completion behavior is limited.
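The point that the threshold only buys time unless threads clear faster than timeouts occur can be checked with a small simulation. This Python sketch is a deliberately crude model under stated assumptions: time advances in abstract "ticks", a fixed number of threads time out and a fixed number clear in each tick, and all names are hypothetical.

```python
from typing import Optional


def ticks_until_recycle(total_threads: int,
                        threshold_percent: float,
                        timeouts_per_tick: int,
                        clears_per_tick: int,
                        max_ticks: int = 1000) -> Optional[int]:
    """Return the tick on which the hung-thread threshold is exceeded.

    Returns None if the threshold is never exceeded within max_ticks.
    """
    hung = 0
    for tick in range(1, max_ticks + 1):
        # New timeouts mark more threads as hung (capped at the pool size).
        hung = min(total_threads, hung + timeouts_per_tick)
        if 100.0 * hung / total_threads > threshold_percent:
            return tick
        # Some delayed threads complete and become available again.
        hung = max(0, hung - clears_per_tick)
    return None
```

In this model, with 10 threads, a 50 percent threshold, two timeouts per tick, and one thread clearing per tick, the threshold is exceeded on the fifth tick. If two threads clear per tick instead, the hung count never builds up and the function returns None.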
Dispatch Timeout vs. Transaction Timeout
There is a difference between the timer maintained for dispatched requests and the timer maintained for transactions created by the application.
The terms request and transaction are often used interchangeably. They may be related, but they are not synonymous.
The important point to keep in mind is that transaction has a specific meaning within the context of a timeout value discussion, and that requests and transactions have separate timers associated with them.
A value of SESSION for the dispatch timeout will not prevent the abend of the servant region if the transaction timeout value is exceeded.
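One way to picture the two independent timers is as two separate inputs to the abend decision. The Python sketch below is a simplification for illustration only: Recovery and servant_may_abend are hypothetical names, and the model ignores the hung-thread threshold and every other setting that influences real behavior.

```python
from enum import Enum


class Recovery(Enum):
    """Dispatch timeout recovery setting discussed in this entry."""
    SERVANT = "SERVANT"
    SESSION = "SESSION"


def servant_may_abend(recovery: Recovery,
                      dispatch_expired: bool,
                      transaction_expired: bool) -> bool:
    """Simplified decision: can this combination lead to a servant abend?"""
    if transaction_expired:
        # The transaction timer is separate from the dispatch timer;
        # the dispatch recovery setting does not protect against it.
        return True
    # Only a dispatch timeout with SERVANT recovery leads to an abend here.
    return dispatch_expired and recovery is Recovery.SERVANT
```

Even with Recovery.SESSION, the function returns True when the transaction timer has expired, which mirrors the statement above.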
A big THANK YOU goes out to Don Bagwell, for working with the entire WebSphere Application Server z/OS Team, Level 2 and Development, in getting this information into consumable documentation for our customers and Support teams.
There are several White Papers that go into more detail on WebSphere Application Server for z/OS Timeout settings:
- WebSphere Application Server for z/OS V7 - Dispatch Timeout Improvements
- WebSphere Application Server z/OS V8 Hidden Gems (section: Thread Hang Recovery Diagnostic Improvements)
- WAS z/OS V8 - Granular Control Functions (six sections dealing with different timeout settings and actions)