In each column, The WebSphere Contrarian answers questions, provides guidance, and otherwise discusses fundamental topics related to the use of WebSphere products, often dispensing field-proven advice that contradicts prevailing wisdom.
What's in a name?
So, the editor of the IBM® WebSphere Developer Technical Journal asked if I would consider writing a regular column along the lines of the Comment lines columns I have submitted over the past couple of years. In a moment of weakness, I said "yes." Reflecting on the kind of advice I usually have to offer, I have entitled this column The WebSphere Contrarian. Merriam-Webster's Online Dictionary defines "contrarian" like this:
Figure 1. Contrarian definition
No, I'm not here to offer stock market advice. If you've read any of my Comment lines articles, you likely realize that I do often take a contrary position when providing answers to WebSphere-related questions. For that matter, I often take a contrary position when answering questions on just about anything. (I'm such a troublemaker.) While some might instead describe me as a skeptic, my experience with Murphy's Law has conditioned me to question whether anything is ever really assured or foolproof, hence my skepticism or contrarian view!
So, let's begin...
High availability alternatives
When I speak on the topic of high availability (HA), one of the slides I usually use covers "HA by layer," where I discuss various alternatives for HA for most of the typical components in a J2EE™ application deployment. A portion of that slide, which includes the topic I want to cover here, is shown in Figure 2.
Figure 2. High availability by layer
The item of specific interest is the notation NOT Recommended – IP Sprayer between plug-in and Web container. I'm often asked why I recommend against such a configuration. Well, aside from the fact that I often take a contrarian view (my personal default), my rationale goes like this:
If there is any application state (for example, an HTTP session), then my standard recommendation is to not place an IP sprayer or load balancer between the HTTP server plug-in(s) and the WebSphere Application Server Web containers. While the ClusterAddress parameter in the plugin-cfg.xml file can be used for this purpose, this parameter was introduced for WebSphere Application Server on z/OS® so that the z/OS workload management could be used. As you may or may not know, z/OS workload management is very sophisticated, using application state, server responsiveness, load, and so on, to make its routing (and failover) decisions, while IP sprayers or network switches only have some portion of this information readily available.
An additional (and likely bigger) issue is configuring an IP sprayer to correctly maintain application state. Most IP sprayers rely on IP layer 3 or IP layer 4 information to maintain application state, or "stickiness," using username, client IP address, or random assignment to assign a request to a specific server. Unfortunately, peculiarities such as DHCP, Network Address Translation, and Web proxies can result in a client's IP address changing between requests. Therefore, these techniques don't provide the same affinity that the WebSphere Application Server Network Deployment HTTP server plug-in does. This means that the IP sprayer has to be configured to recognize the HTTP cookie or URL information that WebSphere Application Server uses, and unless the IP sprayer has provisions for IP layer 7 content routing, the application state will not be properly maintained.
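To make the difference concrete, here is a minimal Python sketch contrasting address-based stickiness with cookie-based affinity. It is an illustration only, not the plug-in's actual logic: it assumes the session cookie value ends with a colon followed by an identifier for the owning server, and the server names, the `CLONE_IDS` table, and both function names are hypothetical.

```python
import hashlib

SERVERS = ["serverA", "serverB"]

# Hypothetical mapping from the identifier at the end of the session
# cookie to the application server that owns the session.
CLONE_IDS = {"cloneA": "serverA", "cloneB": "serverB"}


def route_by_ip(client_ip: str) -> str:
    """Layer 3/4 style stickiness: hash the client address onto a server."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return SERVERS[digest[0] % len(SERVERS)]


def route_by_cookie(session_cookie: str) -> str:
    """Layer 7 style affinity: use the identifier after the last colon."""
    clone_id = session_cookie.rsplit(":", 1)[-1]
    return CLONE_IDS[clone_id]
```

With `route_by_cookie`, a request carrying the same session cookie reaches the same server even if the client's address changes between requests. With `route_by_ip`, a changed address can hash to a different server, and the session is then not found there.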
Even if you have an IP sprayer or network switch that does provide for IP layer 7 content routing, there is still the issue of maintaining the routing information in the IP sprayer anytime there is a change to the applications or server configurations in the WebSphere Application Server cell. Even if you can automate this process of keeping the IP sprayer in sync with WebSphere Application Server by somehow extracting the correct information from the plugin-cfg.xml file and updating the IP sprayer, this is likely to entail some amount of manual effort. Since personnel costs are by far the highest cost in IT, I'm not convinced this is such a cost-effective solution.
Even if there is no application state, I'm not a big advocate of placing an IP sprayer between the HTTP server plug-in and WebSphere Application Server anyway, since it introduces added complexity and administrative effort to the architecture, as well as the added cost of the hardware.
Lastly -- and this applies to both applications that require state and those that do not -- placing an IP sprayer between the HTTP server plug-in and the Web container requires the IP sprayer to have a mechanism for determining that an application server Web container is not responding so that it can redirect requests to another application server. While it's typically possible to configure the IP sprayer to perform some sort of "application server health check," this too adds complexity -- and if not configured properly, the IP sprayer will continue to send requests to an unavailable application server in cases where the HTTP server plug-in would recognize the outage (something that has actually occurred with clients who have tried to pursue this).
So now that I've provided the contrary view, let's discuss what I would recommend.
As a first step, I suggest that you perform some tuning of the parameters in the plugin-cfg.xml file. The parameters most likely to help are the ConnectTimeout and MaxConnections parameters. Typically, ConnectTimeout can be adjusted down from the default, which is the OS TCP/IP timeout, to a significantly smaller number, such as 5-10 seconds. This enables the plug-in to redirect the request to another application server much sooner than it would using the defaults and improves application responsiveness.
While at one time I also recommended reducing the ServerIOTimeOut from the default -- a value of 0 (zero) at that time meant that WebSphere Application Server (V6.x and later) didn't provide for a time out, but would inherit an external setting, typically the OS TCP/IP time out -- I no longer make a blanket recommendation to decrease the value. My change in position is because reducing the ServerIOTimeOut value actually increases the likelihood that a single bad HTTP GET request -- one that performs a lengthy query, for example -- can cause problems that aren't handled gracefully by the plug-in. (Be aware that with a POST request, the PostBufferSize property can be used to limit propagation of a lengthy request.)
To understand my change in position, consider what happens when a ServerIOTimeOut occurs on a GET request with a positive value for this parameter. A "bad" (or long-running) GET request will first be sent to one server in the ServerCluster. Once the ServerIOTimeOut elapses, the request will be retried on another server in the ServerCluster, and this continues until the request has been dispatched to each server in the ServerCluster. As a result, a single long-running GET will minimally result in a long wait for the application (browser) client to receive an error. "Long" in this case is the ServerIOTimeOut multiplied by the number of servers in the ServerCluster -- in other words, until all servers have been tried. In the case of a particularly "bad" (resource-intensive) request, it could potentially stall or slow other application requests.
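The arithmetic behind that "long" wait can be sketched in a few lines of Python. This is a simplified model under the assumption that every server in the cluster is tried exactly once and each attempt runs for the full timeout; the function name and the sample values are illustrative.

```python
def worst_case_wait_seconds(server_io_timeout: int, servers_in_cluster: int) -> int:
    """Seconds until a long-running GET has been tried on every server.

    Assumes a positive timeout and exactly one attempt per server.
    """
    if server_io_timeout <= 0:
        raise ValueError("this sketch only covers a positive timeout")
    if servers_in_cluster < 1:
        raise ValueError("the cluster needs at least one server")
    return server_io_timeout * servers_in_cluster


# Example: a 60-second timeout and a four-server cluster.
print(worst_case_wait_seconds(60, 4))  # 240
```

In this model, the browser client waits four minutes for its error, and the same expensive request has run on all four servers along the way.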
Beginning with WebSphere Application Server Versions 220.127.116.11, 18.104.22.168, and 22.214.171.124, ServerIOTimeOut has a provision to be set to a negative value (in prior versions, the value for ServerIOTimeOut could only be set to a positive value). With a negative value for ServerIOTimeOut, the plug-in will mark a server as unavailable (for any new requests), and the plug-in sends the request to another application server, which also will be marked unavailable when the ServerIOTimeOut value elapses. In a two-server cluster, a low value for ServerIOTimeOut (for example, 30-60 seconds) would result in all service ceasing in 60-120 seconds. Another note of caution: beginning with WebSphere Application Server V126.96.36.199, the ServerIOTimeout default value is set to 60 seconds; therefore, if you're using WebSphere Application Server V7.x, I would recommend at a minimum increasing this value (for example, to 300 seconds, which is 5 minutes).
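The two-server example can be reproduced with a small Python simulation. It is a deliberately simplified model, not the plug-in's implementation: it follows one long-running request, assumes each server is marked unavailable as soon as the timeout elapses on it, and ignores any later retry of marked servers. The function and server names are hypothetical.

```python
from typing import List, Tuple


def seconds_until_no_servers(server_io_timeout: int, servers: List[str]) -> Tuple[int, List[str]]:
    """Follow one long-running request when the timeout value is negative.

    Returns the elapsed seconds and the servers marked unavailable, in order.
    """
    if server_io_timeout >= 0:
        raise ValueError("this sketch only covers a negative timeout")
    wait = abs(server_io_timeout)
    elapsed = 0
    marked_unavailable: List[str] = []
    for server in servers:
        elapsed += wait                    # the request runs until the timeout elapses
        marked_unavailable.append(server)  # the server stops receiving new requests
    return elapsed, marked_unavailable


print(seconds_until_no_servers(-30, ["server1", "server2"]))
# (60, ['server1', 'server2'])
```

With a value of -60 the same call returns 120 seconds, which matches the 60-120 second range quoted above for a two-server cluster.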
As a result, then, until a mechanism is provided to prevent an automatic retry of an HTTP GET request in the event of a ServerIOTimeOut, you have a dilemma when setting this property since there's no clean way to prevent a single "bad" request from creating havoc in one form or another. Depending on your monitoring infrastructure in terms of how quickly you can detect and restart a "hung" application server, setting a value of 0 (zero) might be the best alternative because this would result in the plug-in connecting to a single server, eventually causing a hang which would enable you to restart the server. If you encounter this at all, I suggest you consider adjusting the ServerIOTimeOut to try and minimize the impact in your environment and to contact your IBM representative to open a Feature Request to have this addressed.
Changing the MaxConnections can also help prevent overloading a server. Typically, this should be no more than 20 - 25% greater than the thread pool size in the Web container. Otherwise, when an application server fails, you could overload the remaining application server instances with requests. I also recommend reducing the maximum open connections for the TCP transport channel, even though the WebSphere Application Server Information Center recommends staying with the default of 20,000. My rationale, again, is to avoid overloading a server with requests, especially in cases where the server is starting to serve requests slowly. While the thread used to manage connection requests for the WebSphere Application Server Web container is extremely efficient, the last thing you want to do when a server is slowing down is to queue additional work for it to perform.
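As a quick way to apply the 20 - 25% rule of thumb, the following Python sketch computes a ceiling for MaxConnections from a Web container thread pool size. The function name and the sample pool size of 50 are illustrative assumptions; rounding down keeps the result within the stated percentage.

```python
from typing import Tuple


def max_connections_ceiling(web_container_threads: int) -> Tuple[int, int]:
    """Return the (20% over, 25% over) upper bounds for MaxConnections."""
    if web_container_threads < 1:
        raise ValueError("thread pool size must be at least 1")
    # Integer arithmetic avoids floating-point rounding surprises.
    plus_20_percent = web_container_threads * 120 // 100
    plus_25_percent = web_container_threads * 125 // 100
    return plus_20_percent, plus_25_percent


print(max_connections_ceiling(50))  # (60, 62)
```

For a thread pool of 50, this suggests keeping MaxConnections at or below roughly 60 to 62, so that the surviving servers are not flooded when one instance fails.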
Okay, so what if you've tried the tuning suggested above and you're still not getting the responsiveness and Quality of Service that you desire (or require)?
If this is the case, I would suggest you look at IBM WebSphere Extended Deployment operations optimization, which has a component called the On Demand Router (ODR) that is designed to be placed between the HTTP server(s) and the WebSphere Application Server Web containers and provides application state "aware" workload management (as well as application version aware workload management). Further, the ODR also incorporates server responsiveness, load, and so on, into its workload management algorithm (just like z/OS workload management). The ODR is also integrated with the WebSphere Application Server management infrastructure, so there's no need for development of additional manual or automated mechanisms for synchronizing configuration between the WebSphere Application Server cell configuration and the IP sprayer. Additionally, WebSphere Extended Deployment is likely to be just as cost effective, if not more so, than the purchase of additional network switches, although, unlike network switches, the ODR is not supported for deployment in a DMZ.
And so this concludes the very first chapter of The WebSphere Contrarian. I hope you'll stop in again in a couple of months for the next installment of my contrary views!
Thanks to Sam Pearson and Keys Botzum for their comments.
- Comment lines by Tom Alcott: Everything you always wanted to know about WebSphere Application Server but were afraid to ask
- plugin-cfg.xml file, from the WebSphere Application Server V6.1 Information Center
- IBM WebSphere Session Management, from IBM WebSphere: Deployment and Advanced Configuration
- Tuning transport channel services, from the WebSphere Application Server V6.1 Information Center
- Creating ODRs, from the WebSphere Extended Deployment Operations Optimization V6.1 Information Center
- IBM developerWorks WebSphere
- WebSphere Extended Deployment resources