Question & Answer
There are competing factors that must be considered to determine the appropriate WebSphere web server plug-in settings for your environment. There is no single configuration that is correct for all environments. Consider the following facts to determine the best values for your environment.
- Connections consume resources on both the machine where the plug-in is installed and the machine where the application server is installed.
- Connections that do not terminate do not free these resources and can exhaust the resource pool.
- Timeout values set too low cause the client to experience unnecessary failures and increase the traffic load from the retries.
- Heavy traffic and intermittent issues can cause unexpected delays in responses.
- When an application does not respond within the ServerIOTimeout value, you must decide whether to continue sending requests to that server. If you continue, you risk further timeouts and failures if there is a problem on that machine. If you stop sending requests to that server, you decrease your site's capacity.
- Requests that contain a message body, such as POST requests, typically can change the application state. They must not be retried unless the application is designed to accept multiple instances of the same request.
- Requests that do not contain a message body, typically GET and HEAD requests, are automatically retried by the plug-in when failures occur, and this behavior cannot be disabled. If you configure the plug-in to stop sending requests to a server that does not respond within a certain time period, you must adjust other settings to ensure that a single request never disables your entire site.
The PostBufferSize property controls whether a request containing data, such as a POST request, is resent to another server when failures and timeouts occur. By default, this property is set to 64K, so up to 64 KB of request content can be buffered. A request whose content is larger than this setting is not retried. Increasing the PostBufferSize property consumes more memory. Alternatively, you can disable retries for requests with content by setting the PostBufferSize property to 0, or you can allow an unlimited buffer size by setting the PostBufferSize property to -1.
Requests that do not contain a body, such as GET requests, are automatically retried to another server whenever a failure or timeout occurs. This behavior cannot be disabled.
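The retry rules in the two preceding paragraphs can be summarized as a small decision function. This is a hypothetical Python sketch, not plug-in code: the function name is illustrative, and it assumes that PostBufferSize is expressed in kilobytes and that a body exactly equal to the buffer size still fits.

```python
def is_retry_eligible(has_body: bool, content_length_kb: int,
                      post_buffer_size_kb: int) -> bool:
    """Return True if a failed request could be resent to another server.

    Hypothetical model of the documented rules; not taken from plug-in source.
    """
    if not has_body:
        # Requests without a body (GET, HEAD) are always retried.
        return True
    if post_buffer_size_kb == 0:
        # PostBufferSize 0 disables retries for requests with content.
        return False
    if post_buffer_size_kb == -1:
        # PostBufferSize -1 means an unlimited buffer, so any body can be replayed.
        return True
    # Otherwise the body can be replayed only if it fit in the buffer.
    return content_length_kb <= post_buffer_size_kb


print(is_retry_eligible(False, 0, 0))    # True
print(is_retry_eligible(True, 100, 64))  # False
print(is_retry_eligible(True, 100, -1))  # True
```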
If a non-affinity request is retried, the plug-in attempts to handle the request on each available server until a good response is received or all servers have been attempted. If an affinity request is retried and the ServerIOTimeout value is positive, the request is retried only to the affinity application server. If the ServerIOTimeout value is negative, the affinity server is marked down and retries are made to the remaining available servers.
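As a rough sketch, server selection for a retry after a ServerIOTimeout expiry might be modeled as follows. The function `retry_candidates` and its parameters are hypothetical, and the sketch only returns the servers that are eligible for the retry, not the exact order or number of attempts the plug-in makes.

```python
from typing import List


def retry_candidates(servers: List[str], failed_server: str,
                     has_affinity: bool, server_io_timeout: int) -> List[str]:
    """Servers eligible to receive a request whose ServerIOTimeout expired."""
    if server_io_timeout == 0:
        # A value of 0 never expires, so there is no timeout-driven retry.
        raise ValueError("ServerIOTimeout 0 never expires")
    if not has_affinity:
        # Non-affinity: try the other available servers in turn.
        return [s for s in servers if s != failed_server]
    if server_io_timeout > 0:
        # Affinity and positive timeout: retry only on the affinity server.
        return [failed_server]
    # Affinity and negative timeout: the affinity server is marked down.
    return [s for s in servers if s != failed_server]


servers = ["server1", "server2", "server3"]
print(retry_candidates(servers, "server1", True, 60))   # ['server1']
print(retry_candidates(servers, "server1", True, -60))  # ['server2', 'server3']
```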
The ServerIOTimeout property is designed to allow requests to expire instead of waiting indefinitely for a response from the server. In Versions 6.1 and earlier, the ServerIOTimeout property defaults to 0, which indicates that there is no time limit for requests. This value is not recommended because, if the server never responds, the resources involved in the request are never freed, and eventually the resource pool is exhausted. In some cases, the operating system or network adapter might break the connection because of inactivity, but this behavior cannot be relied upon. If you plan to set the ServerIOTimeout property to 0, you need to determine the appropriate behavior for your environment.
If the ServerIOTimeout property is set to 0, do not expect requests to be retried. A request is not retried until the previous connection is broken. If you need to retry failed requests, either set ServerIOTimeout to a nonzero value, use the operating system settings, or use some other non-plug-in mechanism to ensure a connection is broken at some predetermined time.
With Version 7.0, the default value of the ServerIOTimeout property is 60 seconds. This value is not the ideal setting for applications running intensive queries or other nontrivial functions. If you expect requests to take longer than 60 seconds to be served, the ServerIOTimeout value needs to be adjusted.
The ServerIOTimeout setting is based on how long the server takes to handle a request, not on a particular URI or application. When you specify a value for the ServerIOTimeout property, allow for the slowest, longest-running request, and then add extra time to handle peak load.
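The sizing rule above is simple arithmetic. The following sketch assumes that you have measured response times for your slowest requests; the function name and the 25% `headroom` default are hypothetical choices standing in for the peak-load margin, not values from IBM.

```python
import math
from typing import Iterable


def suggested_server_io_timeout(observed_seconds: Iterable[float],
                                headroom: float = 0.25) -> int:
    """Slowest observed response plus a peak-load margin, rounded up to whole seconds.

    The 25% default headroom is an illustrative assumption.
    """
    slowest = max(observed_seconds)
    return math.ceil(slowest * (1 + headroom))


print(suggested_server_io_timeout([2.0, 14.5, 48.0]))  # 60
```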
If the ServerIOTimeout property is set to a value of 0 or greater, a server is NOT marked down when a request fails because the response time exceeded the ServerIOTimeout value. Other requests continue to be sent to this server. If affinity is defined and server selection is based on session data, the same server is selected when the request is retried, because the server is still available. The actual number of times that the plug-in sends the request to the same server depends on the number of servers defined in the cluster. If the server is not healthy, sending requests back to it is unlikely to produce a good response and exacerbates a potential performance problem.
Starting with versions 18.104.22.168, 22.214.171.124, and 126.96.36.199, if you do NOT want the same server selected for failed affinity requests, specify a negative value for the ServerIOTimeout property. When the ServerIOTimeout value is negative, the plug-in marks the server down when a response timeout occurs. While a server is marked down, requests are not sent to it until the interval specified by the plug-in RetryInterval property expires. If there is only one server in the cluster, it is never marked down, regardless of the plug-in properties.
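The mark-down rules for non-negative and negative ServerIOTimeout values can be modeled with a small amount of state. This is a simplified, hypothetical sketch: `ServerState` and `on_response_timeout` are illustrative names, times are plain seconds, and the plug-in's actual bookkeeping is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerState:
    name: str
    down_until: Optional[float] = None  # time at which the server is marked up again

    def is_up(self, now: float) -> bool:
        return self.down_until is None or now >= self.down_until


def on_response_timeout(server: ServerState, now: float, server_io_timeout: int,
                        retry_interval: int, cluster_size: int) -> None:
    """Apply the mark-down rule after a ServerIOTimeout expires (hypothetical model)."""
    if server_io_timeout >= 0:
        # Non-negative timeout: the server stays up and keeps receiving requests.
        return
    if cluster_size == 1:
        # A single-server cluster is never marked down.
        return
    # Negative timeout: mark the server down for RetryInterval seconds.
    server.down_until = now + retry_interval


s = ServerState("server1")
on_response_timeout(s, now=5, server_io_timeout=-5, retry_interval=60, cluster_size=3)
print(s.is_up(30))  # False
print(s.is_up(65))  # True
```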
Setting the ServerIOTimeout property to a negative value that is too low has serious consequences. If a long-running request exceeds the ServerIOTimeout value and is retried to other servers, the plug-in marks down each server as the request fails on it. Therefore, if you specify a negative value for the ServerIOTimeout property, make sure that the value specified for the RetryInterval property is within the following range:
- The lowest value for the range is 1, which guarantees the server only one second to try to recover. Choose a minimum value that gives the server a reasonable amount of time to recover.
- The highest value for the range is 1 less than the absolute value of the ServerIOTimeout property multiplied by one less than the number of servers in the cluster: (|ServerIOTimeout| * (number of servers in cluster - 1)) - 1
For example, if you set the ServerIOTimeout property to -5 and you have 3 servers in the cluster, the value specified for the RetryInterval property must be in the range 1 - 9. A value in this range guarantees that the servers are never all marked down at the same time because of an unexpected intensive request.
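The RetryInterval range can be computed directly from the formula. In this hypothetical helper, the ValueError checks reflect the conditions stated above: the range matters only for a negative ServerIOTimeout and for clusters with more than one server.

```python
from typing import Tuple


def retry_interval_range(server_io_timeout: int, cluster_size: int) -> Tuple[int, int]:
    """Return (lowest, highest) RetryInterval for a negative ServerIOTimeout."""
    if server_io_timeout >= 0:
        raise ValueError("the range applies only to a negative ServerIOTimeout")
    if cluster_size < 2:
        raise ValueError("a single-server cluster is never marked down")
    # (|ServerIOTimeout| * (number of servers in cluster - 1)) - 1
    highest = abs(server_io_timeout) * (cluster_size - 1) - 1
    if highest < 1:
        raise ValueError("no RetryInterval value satisfies the range")
    return 1, highest


print(retry_interval_range(-5, 3))  # (1, 9)
```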
If your applications are designed such that a single request can have a significant impact, such as an extensive query with data locking, use the ServerIOTimeoutRetry property to prevent retries. When this property is added to the plugin-cfg.xml configuration, its default value is -1, which means that the request is retried up to the number of members in the cluster. If the value is set to 0, no retries occur after data is sent to the server. If the property is not in the plugin-cfg.xml configuration, the plug-in module behaves as if the value were set to 0; that is, requests that fail because the ServerIOTimeout expired are not retried.
Set the ServerIOTimeoutRetry value so that its product with the ServerIOTimeout value is the maximum time a client is expected to wait for a response. For example, if users accept a response time of up to 5 minutes and the ServerIOTimeout value is 60 seconds, set the ServerIOTimeoutRetry value to 5 or less (assuming there are at least 5 members in the cluster).
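A sketch of the ServerIOTimeoutRetry semantics and the sizing guideline follows. Both functions are hypothetical. `None` stands for the property being absent from plugin-cfg.xml, and capping positive values at the cluster size is an assumption based on the statement that -1 retries up to the number of cluster members.

```python
from typing import Optional


def effective_retry_limit(setting: Optional[int], cluster_size: int) -> int:
    """Retries allowed after a ServerIOTimeout expiry (hypothetical model)."""
    if setting is None or setting == 0:
        return 0  # property absent, or 0: no retries
    if setting == -1:
        return cluster_size  # -1: up to the number of cluster members
    if setting < -1:
        raise ValueError("only -1 is described as a special negative value")
    return min(setting, cluster_size)  # assumption: never more than the cluster size


def max_server_io_timeout_retry(max_wait_seconds: int, server_io_timeout: int,
                                cluster_size: int) -> int:
    """Largest ServerIOTimeoutRetry whose product with |ServerIOTimeout| fits the wait."""
    per_attempt = abs(server_io_timeout)
    if per_attempt == 0:
        raise ValueError("ServerIOTimeout 0 never expires, so there is nothing to size")
    return min(max_wait_seconds // per_attempt, cluster_size)


print(max_server_io_timeout_retry(300, 60, 6))  # 5
print(effective_retry_limit(None, 4))           # 0
```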
The ServerIOTimeoutRetry property was introduced with APAR PM70559.
The following example illustrates a non-recommended configuration and the potential problem that could arise:
There are 3 servers in the cluster.
The RetryInterval property is set to the default value of 60 seconds.
The ServerIOTimeout property is set to -5 seconds.
A request is made that does not get a response until after 10 seconds elapse.
Assume t is the time the original request is received.
The plug-in sends the request to server 1 and waits 5 seconds for a response. No response is received, so server 1 is marked down at the time the request was received plus 5 seconds (t + 5).
The request is now sent to server 2. It fails to receive a response within 5 seconds so the plug-in marks server 2 down at the time the original request was received plus 10 seconds (t + 10).
The request is now sent to server 3. It fails to get a response so the server is marked down at the time the original request was received plus 15 seconds (t + 15).
Because all of the servers are now marked down, all requests fail until server 1 is retried. Server 1 is retried at the time the original request was received plus 5 seconds (the absolute ServerIOTimeout value) plus 60 seconds (the RetryInterval value): t + 5 + 60 = t + 65. Server 2 is marked up 5 seconds later (t + 10 + 60 = t + 70), and server 3 is marked up 10 seconds after server 1 (t + 15 + 60 = t + 75). For 50 seconds, all of the servers are marked down and all requests fail: (t + 5 + 60) - (t + 15) = 50, that is, the time server 1 is marked up minus the time server 3 is marked down.
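The timeline above can be reproduced by treating each server's marked-down period as a half-open interval and intersecting the intervals. The sketch assumes, as the example does, that one slow request times out on every server in turn and that nothing else changes server state; the function names are hypothetical.

```python
from typing import List, Tuple


def mark_down_windows(cluster_size: int, server_io_timeout: int,
                      retry_interval: int) -> List[Tuple[int, int]]:
    """Return (marked_down_at, marked_up_at) per server, in seconds after t.

    Server i (1-based) times out after i * |ServerIOTimeout| seconds and stays
    down for RetryInterval seconds.
    """
    step = abs(server_io_timeout)
    return [(i * step, i * step + retry_interval)
            for i in range(1, cluster_size + 1)]


def all_down_seconds(windows: List[Tuple[int, int]]) -> int:
    """Length of the period in which every server is marked down at once."""
    start = max(down for down, _ in windows)  # last server to be marked down
    end = min(up for _, up in windows)        # first server to be marked up
    return max(0, end - start)


windows = mark_down_windows(cluster_size=3, server_io_timeout=-5, retry_interval=60)
print(windows)                    # [(5, 65), (10, 70), (15, 75)]
print(all_down_seconds(windows))  # 50
```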
The following example illustrates the recommended configuration for the preceding scenario:
There are 3 servers in the cluster.
Set the RetryInterval property within the range of 1 to 9.
((number of servers - 1) * |ServerIOTimeout|) - 1 = ((3 - 1) * 5) - 1 = 10 - 1 = 9
The ServerIOTimeout property is set to -5 seconds. NOTE: This value is used as an example only. It is not meant to imply that you should specify a timeout value of -5 seconds in all situations. For example, this value is not appropriate for the ServerIOTimeout property if you know that some responses take 10 seconds.
A request is made that does not get a response before the ServerIOTimeout expires on server 1.
Assume t is the time the original request is received.
The plug-in sends the request to server 1 and waits 5 seconds for a response. No response is received so server 1 is marked down at the time the request was received plus 5 seconds (t + 5).
The request is now sent to server 2. If the RetryInterval property is set to the minimum recommended value of 1, server 1 is marked up 1 second after server 2 receives the request: t + |ServerIOTimeout| + RetryInterval = t + 5 + 1 = t + 6. Server 2 does not get a response and is marked down at the time the original request was received plus 10 seconds: t + |ServerIOTimeout| on server 1 + |ServerIOTimeout| on server 2 = t + 10.
The request is sent to server 3, which fails to receive a response within the ServerIOTimeout value. If the RetryInterval property is set to the maximum recommended value of 9, server 1 is marked up at t + |ServerIOTimeout| + RetryInterval = t + 5 + 9 = t + 14, which is 1 second before server 3 is marked down at t + 5 + 5 + 5 = t + 15 (one ServerIOTimeout on each of servers 1, 2, and 3).
There is never a time where all servers are marked down because of an unexpected long running request.
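The recommended range can also be checked without listing every interval, by comparing when server 1 comes back with when the last server is marked down. This hypothetical helper requires server 1 to be marked up strictly before the last server is marked down, which reproduces the upper bound of ((number of servers - 1) * |ServerIOTimeout|) - 1 for integer RetryInterval values.

```python
from typing import Tuple


def first_up_and_last_down(cluster_size: int, server_io_timeout: int,
                           retry_interval: int) -> Tuple[int, int]:
    """Seconds after t at which server 1 is marked up and the last server is marked down."""
    step = abs(server_io_timeout)
    first_up = step + retry_interval  # t + |ServerIOTimeout| + RetryInterval
    last_down = cluster_size * step   # t + |ServerIOTimeout| on each server in turn
    return first_up, last_down


def server_one_returns_in_time(cluster_size: int, server_io_timeout: int,
                               retry_interval: int) -> bool:
    """True if server 1 is back up strictly before the last server is marked down."""
    first_up, last_down = first_up_and_last_down(
        cluster_size, server_io_timeout, retry_interval)
    return first_up < last_down


print(first_up_and_last_down(3, -5, 1))  # (6, 15)
print(first_up_and_last_down(3, -5, 9))  # (14, 15)
print(max(r for r in range(1, 61) if server_one_returns_in_time(3, -5, r)))  # 9
```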
12 October 2020