In each column, The WebSphere® Contrarian answers questions, provides guidance, and otherwise discusses fundamental topics related to the use of WebSphere products, often dispensing field-proven advice that contradicts prevailing wisdom.
Plan a successful recovery
Let me start by saying I’m not suggesting that you plan to fail; rather, that you need to be prepared in the event of a failure, specifically the failure of some portion of your application infrastructure. Further, since a failure or outage is inevitable, you need to plan and practice failures so that you are prepared for when a failure or outage occurs. Otherwise, the John Wooden phrase "Failing to prepare is preparing to fail" will apply to you.
Specifically, what you should be doing is practicing or simulating outages, preferably in a test environment, not only to determine how best to tune your application infrastructure for recovery, but also to ensure that your problem notification and problem resolution procedures are effective. As I often state when speaking to customers: You don’t want to learn by doing during an outage, which is why practice is essential.
The IBM® WebSphere® Application Server Information Center article on the Queuing Network associated with WebSphere Application Server outlines the components that you need to tune and test for performance, which is also a good starting list of the components that you need to test and tune in order to minimize the impact of a failure. However, the list (and figure) in the Information Center doesn’t necessarily list all the components, nor is there any mention of the number of components at each layer, as is typical when clustering is employed for high availability. While not intended to be all inclusive, Figure 1 adds some additional granularity to the multiple dimensions of hardware and software components, as well as the number of instances of a given hardware or software component that need to be considered.
Figure 1. WebSphere Application Server components and queues in a typical clustered environment
If you draw a similar diagram for your environment, the testing and tuning that needs to be considered should become apparent. For example:
- If one of the HTTP servers fails, does the remaining HTTP server (or servers) have sufficient capacity to handle the average production workload? What about the peak production workload? What OS tuning is required for connection timeouts and request timeouts in order to make failure recognition and request redirection by the “upstream” components (in this case the IP sprayer) as seamless as possible?
- Similarly, if an application server should fail, do the remaining application server instances have capacity to process average and peak workloads? What tuning is required of the HTTP server plug-in to minimize latency in failover of requests to the remaining application server(s)?
- In the case of a database failure, what OS tuning, WebSphere Application Server connection pool tuning, and JDBC provider tuning need to be performed? If hardware clustering for the database is employed, such as HACMP or Veritas Cluster Server, what tuning for the clustering software needs to be performed? If database replication is involved, such as IBM DB2® HADR or Oracle® RAC, again what specific tuning and configuration should be employed to make a failover scenario as seamless as possible?
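The capacity questions in the list above reduce to simple arithmetic once peak load and per-instance capacity have been measured. The following is a minimal sketch only: the function name, the requests-per-second figures, and the assumption that load spreads evenly across the surviving instances are all hypothetical and do not come from any WebSphere tool.

```python
def utilization_after_failure(total_load_rps, instances,
                              capacity_per_instance_rps, failed=1):
    """Return the fraction of surviving capacity that the load would consume.

    Assumption (hypothetical): the upstream component spreads requests
    evenly across the instances that are still running.
    """
    survivors = instances - failed
    if survivors <= 0:
        raise ValueError("no surviving instances")
    return total_load_rps / (survivors * capacity_per_instance_rps)


# Illustrative numbers only: 3 instances, each able to serve 400 requests/s.
average = utilization_after_failure(600, 3, 400)    # 600 / (2 * 400)
peak = utilization_after_failure(1000, 3, 400)      # 1000 / (2 * 400)

for label, value in (("average", average), ("peak", peak)):
    verdict = "fits" if value <= 1.0 else "exceeds remaining capacity"
    print(f"{label} load uses {value:.0%} of surviving capacity: {verdict}")
```

With these made-up numbers the average load survives the loss of one instance but the peak load does not, which is exactly the kind of gap a failover test should expose before production does.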
Implied in the preceding section is the requirement for the monitoring of your current production environment, as well as the collection of current response time and resource utilization metrics so that a representative load (one that replicates production average load and peak load) can be employed for testing and tuning. Table 1 lists the WebSphere Application Server PMI statistics that I typically employ for this purpose. Those familiar with WebSphere Application Server, or more specifically WebSphere Application Server PMI, will notice that this list is not the default for PMI statistics. The reason I employ the list below is that it provides me with the data to determine actual resource use, not just pool sizes. To enable the PMI settings shown here, follow the instructions in the Information Center for Enabling Custom PMI collection. Be aware also that a couple of the metrics listed are not PMI statistics; rather, they are available in the application server's SystemOut.log. Additionally, because the SessionObjectSize statistic has a significant performance impact, it should only be used in test, and then only sparingly.
Table 1. WebSphere Application Server PMI statistics
|Connection pools||JVM runtime||HTTP session manager||System data|
* only in test
|Thread pool||Web container||Messaging engine|
** not in PMI; available in logs
Once you have accurate data on current resource use and response time at each level, you can proceed with tuning for both performance and failover, which you will then validate via testing. To maximize performance and throughput, I typically start by observing the actual pool usage (say, the web container thread pool or the JDBC connection pool) at peak and multiply the actual pool use by 1.2 in order to size the pool before performance testing. For the purpose of testing for failover, the observed values will also come into play in order to correctly set various queue depths and timeouts that are employed by WebSphere Application Server components to detect downstream component failure and, in turn, to direct requests to an alternative (clustered) component.
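As a worked illustration of the 1.2 multiplier, here is a minimal sketch. The function name and the observed values are hypothetical; the only thing taken from the text is the idea of multiplying observed peak use by 1.2 to get a starting pool size for performance testing.

```python
import math


def initial_pool_size(observed_peak_in_use, headroom=1.2):
    """Starting pool size: observed peak use times a headroom factor, rounded up.

    The intermediate round() only guards against floating-point noise
    before the result is rounded up to a whole number.
    """
    return math.ceil(round(observed_peak_in_use * headroom, 6))


# Hypothetical observations taken at production peak.
observed = {"web container thread pool": 25, "JDBC connection pool": 12}
for pool, in_use in observed.items():
    print(f"{pool}: observed {in_use} in use, start testing at {initial_pool_size(in_use)}")
```

The result is only a starting point for the performance tests, not a final setting.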
For example, if you have long running requests (say, several seconds), either at the web tier or at the data access tier, then you’ll need to set connection timeouts or request timeouts to be long enough to permit normal requests to process without signaling a failure. The downside to this is that requests will continue to be sent to a possibly non-responsive component for several seconds after it has ceased to function, which means requests will be queued upstream of the failed component, consuming sockets or thread and connection pools while waiting for the failure to be detected and requests to be redirected.
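To get a feel for how much work can pile up during that detection window, a rough estimate is the arrival rate multiplied by the timeout. The sketch below is a back-of-the-envelope calculation under stated assumptions: the function name and the rate of 50 requests per second are hypothetical, arrivals are assumed constant, and every request counted is assumed to be routed to the failed component until the timeout expires.

```python
def requests_held_during_detection(arrival_rate_rps, timeout_seconds):
    """Approximate number of requests sent to a dead component before the
    timeout expires, assuming a constant arrival rate (hypothetical model)."""
    return arrival_rate_rps * timeout_seconds


# Hypothetical: 50 requests/s routed to the failed component.
for timeout in (180, 10, 2):
    held = requests_held_during_detection(50, timeout)
    print(f"timeout {timeout:>3}s: roughly {held} requests waiting upstream")
```

Each of those waiting requests holds a socket, thread, or pooled connection, which is why the timeout and the pool sizes have to be tuned together.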
My thoughts on tuning the HTTP Server plug-in remain unchanged from an earlier article, so I’ll focus on connection pool and JDBC provider tuning. Based upon the observed values for the connection pool use time, you’ll want to set the connection pool connection timeout to as low a number as practical for your environment and applications, but not so low that you incorrectly mark a database as unavailable because of a single long running request. Suppose you observe an average use time of 500ms; it’s likely fine to consider setting the timeout to less than 10 seconds, which is significantly less than the default of 180 seconds. How much lower? Well, is the application you’re testing representative of all your applications, or are there one or more applications that routinely run long queries? If so, the connection pool average use time for these long running applications, multiplied by some “safety factor” (such as 1.2 as mentioned before for pool sizes), should represent the lowest value for setting connection timeout. I'd add that in the later versions of WebSphere Application Server, the connection pool Purge Policy of EntirePool is the default. This is the recommended policy, so you should employ this Purge Policy if it's not the default in your environment. Once you've tuned the connection pool, you'll also need to investigate and set the timeout properties for your JDBC provider, because the WebSphere Application Server connection pool relies on the JDBC provider to manage connection and request timeouts. This information will generally be documented in your database documentation, rather than the WebSphere Application Server Information Center. In turn, the JDBC provider relies on the operating system TCP/IP implementation, so that too will need to be tuned. A quick reference to some key operating system and TCP/IP parameters is available in the Information Center.
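The lower bound described above can be sketched as a small calculation. Everything named here is hypothetical: the function, the application names, and the observed use times are made up for illustration, and the result is expressed in whole seconds on the assumption that the timeout is configured in seconds, as the 180-second default suggests.

```python
import math


def minimum_connection_timeout(avg_use_time_ms_by_app, safety_factor=1.2):
    """Lowest sensible connection timeout in whole seconds.

    Takes the slowest application's average use time, applies the safety
    factor, and rounds up, never returning less than 1 second.
    """
    slowest_ms = max(avg_use_time_ms_by_app.values())
    seconds = round(slowest_ms * safety_factor / 1000, 6)  # trim float noise
    return max(1, math.ceil(seconds))


# Hypothetical observations: one typical application, one with long queries.
observed_ms = {"storefront": 500, "reporting": 6000}
print("floor for connection timeout (s):", minimum_connection_timeout(observed_ms))
```

This is a floor, not a recommendation; the value you settle on still has to be confirmed by failover testing.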
(A hint: For DB2, two key properties for tuning connection behavior for the JDBC provider are blockingReadConnectionTimeout and loginTimeout, which control the length of time an existing connection remains open and the time spent waiting for a new connection, respectively. Extra hint: the Oracle analogs for these two properties are oracle.jdbc.ReadTimeout and setLoginTimeout (see Resources). Further, in WebSphere Application Server there's a pre-populated list of custom properties for each datasource, and for both DB2 and Oracle the WebSphere Application Server custom property loginTimeout controls the time spent waiting for a new connection.)
After initially setting values, you’ll then need to proceed with testing, though unlike normal performance testing -- which typically involves attempting to maximize throughput by driving one or more components to 100% CPU (or perhaps network saturation) -- you’ll instead simulate a workload that is representative of your production average as well as your production peak. Once you reach a steady state, you’ll then induce or simulate the failure of a given component. A commonly employed technique is pulling a network cable or stopping the network adapter, and while these techniques are often used, they might not accurately simulate many outages.
Say, for example, that one of your components slows down due to load and continues accepting requests and doesn’t actually close connections (it’s only slow, not dead). Inducing a network failure doesn’t simulate this particular type of failure. In fact, simulating a network failure should be “low hanging fruit,” since detecting refused connection requests or timed-out requests is core to most distributed computing HA, as noted earlier. If you really want to be “prepared to fail” you’ll need to perform additional tests, for example, by writing a shell script that runs several for(ever) loops, writing output in order to consume CPU and memory, effectively hanging the OS, but keeping existing connections and requests open until they time out (if they ever do). Another alternative would be to write a servlet that could be added to the applications being tested to sleep for a long time; again, the intent being to consume one or more sets of resources to eventually induce a failover. Another alternative is employing functions for "hanging" a process, which are often provided by system monitoring and debugging tools.
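The difference between a dead component and a hung one can be seen with a small local experiment. The sketch below is not the shell script or servlet described above; it swaps in a stand-in technique, a local TCP stub that accepts connections but never answers, and compares it with a port on which nothing is listening. All names (start_hung_server, probe) and the 0.5 second timeout are hypothetical choices for illustration.

```python
import socket
import threading
import time


def start_hung_server():
    """Listen on an ephemeral local port, accept connections, never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, held


def probe(port, read_timeout):
    """Send one request and report how the failure, if any, shows up."""
    started = time.monotonic()
    try:
        with socket.create_connection(("127.0.0.1", port),
                                      timeout=read_timeout) as client:
            client.sendall(b"ping\n")
            data = client.recv(1024)
            outcome = "answered" if data else "closed by peer"
    except socket.timeout:
        outcome = "timed out"
    except ConnectionRefusedError:
        outcome = "refused"
    return outcome, time.monotonic() - started


# Hung component: the connection succeeds, but the read waits for the timeout.
listener, _held = start_hung_server()
hung_outcome, hung_elapsed = probe(listener.getsockname()[1], read_timeout=0.5)
print(f"hung component: {hung_outcome} after {hung_elapsed:.2f}s")

# Dead component: reserve a port, release it, then try to connect to it.
spare = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
spare.bind(("127.0.0.1", 0))
dead_port = spare.getsockname()[1]
spare.close()
dead_outcome, dead_elapsed = probe(dead_port, read_timeout=0.5)
print(f"dead component: {dead_outcome} after {dead_elapsed:.2f}s")
```

On a typical system the dead port is rejected almost immediately, while the hung stub is only detected once the full read timeout has elapsed; that waiting period is what the timeout tuning discussed earlier is meant to shorten.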
As you progress through testing by simulating the failure of one cluster member, you’ll need to observe the number of requests that become queued and the amount of time needed to fail over. You’ll then need to start adjusting the timeouts and pool sizes in order to provide for earlier recognition by the upstream component (for example, the HTTP server plug-in for the web container, or the web container for the datasource), while at the same time not reducing throughput to a crawl in the event of a failure.
In general, increasing values for timeouts or pool sizes will delay recognition of a downstream component failure, but in the case of pool sizes a larger value also provides some buffering in the event of a failure. As you can see, tuning to prevent your website from stalling in the event of a failure will require a tradeoff between increasing and decreasing various parameters. Arriving at the optimal values for your environment will require iterative testing with various settings and failure scenarios so that you (or at least your computer systems) will be prepared to fail, which in turn should help ensure your success (and continued employment).
- Information Center: Queuing network
- Information Center: Enabling PMI data collection
- Information Center: Common IBM Data Server Driver for JDBC and SQLJ properties for all supported database products
- Information Center: Tuning the application serving environment
- The WebSphere Contrarian: Less might be more when tuning WebSphere Application Server
- The WebSphere Contrarian: A better Web application configuration for high availability
- IBM developerWorks WebSphere