Failover is a crucial component of any application server topology. When a machine fails because of a software, network, or power failure, a second machine is expected to handle the workload until the disabled machine is back online. The WebSphere HTTP Plugin plays a central role in any failover scenario. Ideally, when a machine or application server is unavailable, the Plugin detects it, marks it as unavailable, and routes requests to other available clones. It also detects when an unavailable clone comes back online and begins distributing requests to it again. However, there are situations where it is difficult for the Plugin to immediately recognize that a clone is unavailable, and there are specific steps an administrator can take to prevent downtime from an end user's perspective.
By tuning the WebSphere HTTP Plugin configuration parameters to fit the particular environment, end users experience fewer delays and the failover performance of the WebSphere environment improves.
This best practice applies to the following product, versions, and platforms:
- WebSphere Application Server Advanced, versions 4.0.x and 5.0.x (AIX, Windows NT, Linux, Solaris)
Consider a large environment with 12 application servers, 4 Web servers, and thousands of concurrent users. If one application server stops responding (a condition that can be simulated, for instance, with a poorly coded servlet that runs a synchronized sleep method for an indefinite amount of time), its thread pool eventually reaches the maximum limit. At this point, TCP connections begin to build up on the machine, and the application server appears to hang. As soon as this happens, the entire site experiences a major loss of throughput. The throughput of the entire system then slowly decreases until the site is completely unavailable, and end users cannot get a response and begin experiencing errors.
The problem occurs because the Web server processes continually attempt to route user requests to the hanging application server and do not recognize it as being unavailable. The WebSphere Plugin, which runs within each HTTP Server process, has a parameter called RetryInterval. Any time a request is sent to an application server that is unavailable, the Plugin marks the application server as unavailable, and the RetryInterval timer begins. When that timer expires, the Plugin assumes that the application server is now available, and once again sends it requests. The default RetryInterval time is 60 seconds. In a high volume site, there are hundreds of Web server processes, and each one goes through this process individually. The system enters a vicious cycle: before all of the Web server processes can mark the application server as unavailable, the processes that marked it first retry it because their RetryInterval has expired.
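The mark-and-retry behavior described above can be modeled in a few lines of Python. This is a minimal sketch of the logic as the text describes it, not the Plugin's actual implementation; the class and method names are hypothetical, and each Web server process is assumed to hold its own independent copy of this state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloneState:
    """One Web server process's view of one clone (hypothetical model)."""
    retry_interval: float = 60.0            # seconds; 60 is the documented default
    marked_down_at: Optional[float] = None  # None means "believed available"

    def is_eligible(self, now: float) -> bool:
        """Return True if this process would route a request to the clone."""
        if self.marked_down_at is None:
            return True
        # Once the timer expires, the clone is assumed to be available again.
        return now - self.marked_down_at >= self.retry_interval

    def record_failure(self, now: float) -> None:
        """A request failed: mark the clone unavailable and restart the timer."""
        self.marked_down_at = now

    def record_success(self) -> None:
        """A request succeeded: clear the unavailable mark."""
        self.marked_down_at = None


state = CloneState(retry_interval=60.0)
state.record_failure(now=0.0)
print(state.is_eligible(now=30.0))  # still inside the retry interval
print(state.is_eligible(now=60.0))  # timer expired, so the clone is retried
```

Because every process keeps its own timer, processes that marked the clone down early become eligible to retry it while others are still discovering the failure, which is the cycle the paragraph describes.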
Note: Any time an application server is "blocked" or "hung", the system experiences a brief period of lower throughput and some additional user errors. This is because the Web servers must go through this process of marking it as unavailable. How fast this happens depends on a variety of factors, but it is most directly affected by the parameters discussed below.
WebSphere Plugin configuration parameters
The default configuration parameters are intended to fit a large percentage of customer architectures. However, in high volume scenarios, you need to tune them to provide maximum failover performance. How the parameters are changed depends on which version of WebSphere you are using, so look up the corresponding attributes in the InfoCenter for your version. The default values for the configuration parameters involved in failover are as follows:
| Parameter | Default value |
| --- | --- |
| MaxConnectionBacklog | 511 |
| ConnectTimeout | 0 |
| RetryInterval | 60 |
Recommended configuration parameters
The suggested settings below are a starting point, and may vary to fit a particular environment.
MaxConnectionBacklog is the maximum number of outstanding requests that the operating system buffers while it waits for the application server to accept connections. If the Plugin attempts to connect to the application server when this buffer is full, the connection is rejected and the Plugin marks the application server as unavailable. By default, this is 511. Generally, this is acceptable because the application server is not expected to become blocked or to hang. However, these cases do occur, and it is preferable to keep this number much lower. By setting it to 128, the Plugin becomes aware that the application server is unavailable much sooner, because the 129th request is rejected. If left at the default of 511, it takes 512 client requests (or connections) before the Plugin recognizes that an application server is unavailable.
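To make the effect of the backlog size concrete, the following sketch works through the arithmetic from the paragraph above. The function names are hypothetical, and the steady rate of 50 requests per second is an assumption chosen only for illustration; real traffic to a hung server is rarely this uniform.

```python
def requests_until_rejection(backlog: int) -> int:
    """Connection attempts needed before one is rejected.

    Follows the text: the first `backlog` attempts are buffered by the
    operating system, and the next attempt is refused.
    """
    if backlog < 0:
        raise ValueError("backlog must be non-negative")
    return backlog + 1


def seconds_to_detect(backlog: int, requests_per_second: float) -> float:
    """Rough time for a hung server's backlog to fill at a steady rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return requests_until_rejection(backlog) / requests_per_second


ASSUMED_RATE = 50.0  # requests per second sent to the hung server (illustrative)
for backlog in (511, 128):
    attempts = requests_until_rejection(backlog)
    delay = seconds_to_detect(backlog, ASSUMED_RATE)
    print(f"backlog={backlog}: rejected on attempt {attempts}, ~{delay:.2f}s")
```

Under this assumed rate, lowering the backlog from 511 to 128 shortens detection from roughly ten seconds to under three.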
When the WebSphere Plugin attempts to send a request to the application server, it assumes that the application server is available unless the TCP/IP connection is rejected. ConnectTimeout defaults to 0, which means the Plugin relies on the operating system (OS) TCP timeout values before a request is rejected. Generally, this is at least several minutes, depending on the OS. When ConnectTimeout is set, the Plugin attempts non-blocking TCP connections that time out regardless of the OS settings. This setting helps when an application server has been removed from the network by a power failure or network failure. In that case, the Plugin waits only for the configured timeout before it marks the clone unavailable and begins routing requests to other application servers.
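The general technique of bounding a TCP connection attempt with an application-level timeout, rather than waiting for the OS default, can be sketched in Python as follows. This illustrates the idea only and is not the Plugin's code; the function name is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float) -> bool:
    """Try a TCP connect that gives up after `timeout_seconds`.

    Returns False if the connection is refused, times out, or fails
    for any other socket-level reason.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Refusals, timeouts, and name resolution errors are all OSError.
        return False
```

A caller might use it as `can_connect("appserver1.example.com", 9080, 10.0)`, where the host name, port, and 10-second timeout are placeholders rather than recommended values. A host that has dropped off the network then costs at most the chosen timeout instead of the OS-level TCP timeout.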
RetryInterval is the length of time that the Plugin waits after marking a clone as unavailable before sending it another request. This setting has a major effect on the failover performance of a high volume environment. If set improperly, it causes the cycle discussed in the example above. With a value of 600, each time an application server is unreachable, it is marked as unavailable for 600 seconds. You can increase this setting without any major effect on the environment. It is important to restart a hung application server before this timer expires. If it takes longer than 600 seconds to do this, then increase this value. However, the higher the value, the longer it takes for a restarted application server to be put back in service, so for a period of time the environment runs with N-1 application servers. This is a minor effect compared to the throughput decrease caused by continually retrying the application server before it is restarted.
The final setting for RetryInterval depends on the operating procedures of the particular environment. If restarts are automated using a script or another method, you can specify a RetryInterval as low as 600. However, an automated script carries the risk of restarting an application server accidentally, so consider paging an operator to restart the application server manually instead. In that case, increase the RetryInterval value to allow the operator time to assess the situation and act.
WebSphere does not provide an automatic mechanism to discover a hanging application server and to automatically restart it. Therefore, it is often advisable to put a script in place that can monitor the number of connections between an application server and a Web server. If the number of connections reaches a set threshold, the script can take action and either notify an administrator or perform a restart on the application server. An example of such a script will be discussed in a follow-up best practice.
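As a rough illustration of the monitoring idea only (not the script promised in the follow-up best practice), the sketch below counts established TCP connections on a given port in text captured from a command such as `netstat -an`, and compares the count to a threshold. The function names, the threshold, and the port are assumptions; netstat output differs between operating systems, so the parsing is deliberately loose.

```python
def count_established(netstat_output: str, port: int) -> int:
    """Count ESTABLISHED lines in which some address ends in `port`.

    Accepts both "host:port" and "host.port" address styles, since the
    separator differs between operating systems.
    """
    suffixes = (f":{port}", f".{port}")
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if "ESTABLISHED" not in fields:
            continue
        if any(field.endswith(suffixes) for field in fields):
            count += 1
    return count


def choose_action(connections: int, threshold: int) -> str:
    """Return "alert" once the connection count reaches the threshold."""
    return "alert" if connections >= threshold else "ok"
```

A wrapper could run this periodically and, on "alert", either notify an administrator or trigger a restart, depending on the operating procedures discussed above.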
- WebSphere Application Server InfoCenter
- Failover and Recovery in WebSphere Application Server
- IBM WebSphere V4.0 Advanced Edition Handbook
- IBM WebSphere Application Server V5.0 System Management and Configuration: WebSphere Handbook Series





