Preparing for failure
In each column, The WebSphere® Contrarian answers questions, provides guidance, and otherwise discusses fundamental topics that are related to the use of WebSphere products, often dispensing field-proven advice that contradicts prevailing wisdom.
Plan a successful recovery
Let me start by saying that I’m not suggesting that you plan to fail; rather, you need to be prepared if a failure occurs, specifically the failure of some portion of your application infrastructure. Further, since a failure or outage is inevitable, you need to plan for and practice failures so that you are prepared when one occurs. Otherwise, the John Wooden phrase "Failing to prepare is preparing to fail" will apply to you.
Specifically, you should be practicing or simulating outages, preferably in a test environment, not only to determine how best to tune your application infrastructure for recovery, but also to ensure that your problem notification and problem resolution procedures are effective. As I often state when I speak with customers: You don’t want to learn by doing during an outage, which is why practice is essential.
The Queuing Network topic for IBM® WebSphere® Application Server in the IBM Knowledge Center outlines the components that you need to tune and test for performance. This topic is also a good starting list of the components that you need to test and tune to minimize the impact if a failure occurs. However, the list (and figure) in the IBM Knowledge Center doesn’t necessarily list all the components, nor does it show the multiple instances of a component at each layer that are typical when clustering is employed for high availability. While not intended to be all-inclusive, Figure 1 adds more granularity: it shows the multiple dimensions of hardware and software components, and the several instances of each hardware or software component that must be considered.
Figure 1. WebSphere Application Server components and queues in a typical clustered environment
If you draw a similar diagram for your environment, the testing and tuning that need to be considered should become apparent. For example:
- If one of the HTTP servers fails, does the remaining HTTP server (or servers) have sufficient capacity to handle the average production workload? What about the peak production workload? What OS tuning is required for connection timeouts and request timeouts to make failure recognition and request redirection by the “upstream” components (in this case the IP sprayer) as seamless as possible?
- Similarly, if an application server should fail, do the remaining application server instances have capacity to process average and peak workloads? What tuning is required of the HTTP server plug-in to minimize latency in failover of requests to the remaining application servers?
- If a database fails, what OS tuning, WebSphere Application Server connection pool tuning, and JDBC provider tuning needs to be performed? If hardware clustering for the database is employed, such as HACMP or Veritas Cluster Server, which tuning for the clustering software needs to be performed? If database replication is involved, such as IBM DB2® HADR or Oracle® RAC, again what specific tuning and configuration should be employed to make a failover scenario as seamless as possible?
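The capacity questions in the list above come down to simple arithmetic once you have measured load figures. The following is a minimal sketch, not a sizing tool: the function names and the numbers (three application servers at 400 requests per second each, 600 average and 900 peak) are hypothetical, and it assumes that load redistributes evenly across the surviving cluster members.

```python
def remaining_capacity(instances, capacity_per_instance, failed=1):
    """Total capacity left after `failed` cluster members stop serving."""
    return max(instances - failed, 0) * capacity_per_instance


def survives_failure(instances, capacity_per_instance, load, failed=1):
    """True if the surviving members can absorb `load` (same unit as capacity)."""
    return remaining_capacity(instances, capacity_per_instance, failed) >= load


# Hypothetical figures: 3 servers, 400 requests/second each.
average_ok = survives_failure(3, 400, load=600)  # 800 >= 600
peak_ok = survives_failure(3, 400, load=900)     # 800 <  900
```

In this made-up case the cluster rides out a single failure at average load but not at peak, which is exactly the kind of gap that is better found in a test environment than during an outage.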
Implied in the preceding section is the requirement for the monitoring of your current production environment and the collection of current response time and resource utilization metrics. This way, a representative load (one that replicates production average load and peak load) can be employed for testing and tuning.
Table 1 lists the WebSphere Application Server PMI statistics that I typically employ for this purpose. If you are familiar with WebSphere Application Server, or more specifically WebSphere Application Server PMI, you will notice that this list is not the default for PMI statistics for WebSphere Application Server V8 and earlier. The reason that I employ the following list is that it provides me with the data to determine actual resource use, not just pool sizes.
Table 1. WebSphere Application Server PMI statistics
|Connection pools|JVM runtime|HTTP session manager|System data|
* Only in test
|Thread pool|Web container|Messaging engine|
** Not in PMI; available in logs
To enable the PMI settings that are shown in Table 1, if you're running WebSphere Application Server V8 or earlier, follow the instructions in the Enabling Custom PMI collection topic in the IBM Knowledge Center. If you're running WebSphere Application Server V8.5 or later, no customization is needed unless you want to add more metrics, because the PMI defaults changed with the addition of Intelligent Management in WebSphere Application Server V8.5. Be aware also that a couple of the metrics that are listed are not PMI statistics. Rather, they are available in the application server's SystemOut.log. Additionally, because the SessionObjectSize statistic has a significant performance impact, it should be used only in test and then only sparingly.
When you have accurate data on the current resource use and response time at each level, you can proceed with tuning for performance and failover, which you then validate by testing. To maximize performance and throughput, I typically start by observing the actual pool usage (say, the web container thread pool or the JDBC connection pool) at peak. Then, I multiply the actual pool use by 1.2 to size the pool before performance testing. To test for failover, the observed values come into play to correctly set the various queue depths and timeouts that WebSphere Application Server components employ to detect a downstream component failure and then direct requests to an alternative (clustered) component.
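The 1.2 multiplier is easy to apply by hand, but a small helper keeps the rounding consistent across pools. This is a minimal sketch: the function name and the sample peak values are hypothetical, and the headroom is expressed as a whole percentage (20 for a factor of 1.2) so that the result is computed with integer arithmetic and always rounded up.

```python
def initial_pool_size(peak_in_use, headroom_percent=20):
    """Starting pool size: observed peak usage plus headroom, rounded up."""
    if peak_in_use < 0:
        raise ValueError("peak_in_use must not be negative")
    # Integer ceiling of peak_in_use * (100 + headroom_percent) / 100.
    return (peak_in_use * (100 + headroom_percent) + 99) // 100


# Hypothetical observed peaks taken from PMI data.
web_container_threads = initial_pool_size(40)   # 40 * 1.2 = 48
jdbc_connections = initial_pool_size(37)        # 37 * 1.2 = 44.4, rounded up to 45
```

The result is only a starting value for performance testing, not a final setting.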
For example, suppose that you have long-running requests (say, several seconds), either at the web tier or at the data access tier. In this case, you’ll need to set connection timeouts or request timeouts to be long enough to permit normal requests to process without signaling a failure. The downside to this is that requests will continue to be sent to a possibly non-responsive component for several seconds after it ceases to function. That is, requests will be queued upstream of the failed component, consuming sockets, threads, and pooled connections while they wait for a timeout before they are redirected.
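To get a feel for how quickly that upstream queuing adds up, a back-of-envelope estimate can help. The sketch below is my own simplification rather than anything WebSphere Application Server computes: it assumes that every request routed to the failed component holds one upstream thread for the full timeout, that requests arrive at a steady rate, and that nothing is retried. All names and numbers are hypothetical.

```python
def threads_tied_up(pool_size, requests_per_second, timeout_seconds):
    """Upstream threads stuck waiting on a dead component, capped at the pool size."""
    return min(pool_size, requests_per_second * timeout_seconds)


def seconds_until_pool_exhausted(pool_size, requests_per_second):
    """How long the upstream pool lasts if no stuck request is ever released."""
    return pool_size / requests_per_second


# Hypothetical: a 50-thread pool with 10 requests/second routed to the failed member.
with_long_timeout = threads_tied_up(50, 10, timeout_seconds=10)   # whole pool stuck
with_short_timeout = threads_tied_up(50, 10, timeout_seconds=3)   # 30 of 50 stuck
time_to_exhaust = seconds_until_pool_exhausted(50, 10)            # 5.0 seconds
```

Under these assumptions, any timeout longer than five seconds lets the failed component consume the entire upstream pool before the first request is released.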
My thoughts on tuning the HTTP Server plug-in remain unchanged from an earlier article, so I’ll focus on connection pool and JDBC provider tuning. Based on the observed values for the connection pool use time, set the connection pool connection timeout to as low a number as practical for your environment and applications, but not so low that you incorrectly mark a database as unavailable because of a single long-running request. Suppose that you observe an average use time of 500 ms. It’s likely fine to consider setting the timeout to less than 10 seconds, which is significantly less than the default of 180 seconds. How much lower? Well, is the application you’re testing representative of all your applications, or are there one or more applications that routinely run long queries? If so, the average connection pool use time for these long-running applications, multiplied by some “safety factor” (such as the 1.2 mentioned earlier for pool sizes), should represent the lowest value for the connection timeout. I'd add that in the later versions of WebSphere Application Server, the connection pool purge policy of EntirePool is the default. This purge policy is recommended, so you should employ it if it's not the default in your environment.
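The lower bound described above can be computed directly from the observed use times. This is a minimal sketch under stated assumptions: the application names and use times are invented, the safety factor is given as a whole percentage (20 for 1.2), and the result is a floor to test against, not a recommended setting.

```python
DEFAULT_CONNECTION_TIMEOUT_MS = 180_000  # the 180-second default mentioned in the text


def connection_timeout_floor_ms(avg_use_time_ms_by_app, safety_percent=20):
    """Lowest sensible connection timeout: longest average use time plus a margin."""
    longest = max(avg_use_time_ms_by_app.values())
    # Integer ceiling of longest * (100 + safety_percent) / 100.
    return (longest * (100 + safety_percent) + 99) // 100


# Hypothetical observed average use times per application, in milliseconds.
observed = {"storefront": 500, "reporting": 4000}
floor_ms = connection_timeout_floor_ms(observed)  # 4000 * 1.2 = 4800 ms
```

Here the long-running reporting application, not the 500 ms storefront, determines the floor, and 4.8 seconds is still far below the default.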
When you tune the connection pool, you also need to investigate and set the timeout properties for your JDBC provider because the WebSphere Application Server connection pool relies on the JDBC provider to manage connection and request timeouts. This information will generally be documented in your database documentation, rather than in the IBM Knowledge Center for WebSphere Application Server. In turn, the JDBC provider relies on the operating system TCP/IP implementation so you need to tune that too. A quick reference to some key operating system and TCP/IP parameters is available in the IBM Knowledge Center.
(Hint: For DB2, two key properties for tuning connection behavior for the JDBC provider are blockingReadConnectionTimeout and loginTimeout, which control how long a read on an existing connection can block before it times out and the time spent waiting for a new connection, respectively. Extra hint: the Oracle analogs for these two properties are oracle.jdbc.ReadTimeout and setLoginTimeout (see Resources).
Further, in WebSphere Application Server there's a pre-populated list of custom properties for each data source, and for both DB2 and Oracle the WebSphere Application Server custom property loginTimeout controls the time that is spent waiting for a new connection.)
After initially setting values, you’ll then need to proceed with testing. In normal performance testing, you typically attempt to maximize throughput by running one or more components to 100% CPU (or perhaps network saturation). However, unlike normal performance testing, here you’ll simulate a workload that is representative of your production average load and your production peak load. When you reach a steady state, you’ll then induce or simulate the failure of a component. A commonly employed technique is pulling a network cable or stopping the network adapter. While these techniques are often used, they might not accurately simulate many outages.
For example, suppose that one of your components slows down due to load but continues accepting requests and doesn’t actually close connections (it’s only slow, not dead). Inducing a network failure doesn’t simulate this particular type of failure. In fact, simulating a network failure should be “low-hanging fruit,” since detecting refused connection requests or timed-out requests is core to most distributed computing HA, as noted earlier. If you want to be “prepared to fail,” you need to perform more tests. For example, you can write a shell script that runs several for(ever) loops that write output to consume CPU and memory, effectively hanging the OS but keeping existing connections and requests open until they time out (if they ever do). Another alternative is to write a servlet, added to the application that is being tested, that sleeps for a long time. Again, the intent is to consume one or more sets of resources to eventually induce a failover. Yet another alternative is to employ the functions for "hanging" a process that are often provided by system monitoring and debugging tools.
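The "slow, not dead" case can also be reproduced in miniature without touching a real server. The sketch below is a Python stand-in for the shell script and servlet ideas above, not a replacement for testing against your actual components: a listener accepts connections and then never answers, and a client probe shows that only a read timeout detects the problem. The names start_hung_listener and probe are hypothetical, and everything runs on the loopback interface.

```python
import socket
import threading
import time


def start_hung_listener(hold_seconds):
    """Accept connections on an ephemeral local port and never reply to them."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]

    def hold(conn):
        # Keep the connection open without reading or writing, then close it.
        with conn:
            time.sleep(hold_seconds)

    def accept_loop():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # the listener was closed
            threading.Thread(target=hold, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, port


def probe(port, read_timeout):
    """Send one request and report 'ok', 'closed', 'timed out', or 'refused'."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=read_timeout) as s:
            s.sendall(b"GET / HTTP/1.0\r\n\r\n")
            data = s.recv(1024)
            return "ok" if data else "closed"
    except socket.timeout:
        return "timed out"
    except ConnectionRefusedError:
        return "refused"
```

Calling probe(port, read_timeout=0.3) against a listener started with start_hung_listener(hold_seconds=5.0) reports a timeout rather than a refusal: the connection succeeds, so the failure is visible only after the full read timeout elapses. That is the behavior that pulling a network cable does not exercise.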
As you progress through testing by simulating the failure of one cluster member, you’ll need to observe the number of requests that become queued and the amount of time that is needed to fail over. You’ll then need to start adjusting the timeouts and pool sizes to provide for earlier recognition by the upstream component (for example, the HTTP server plug-in for the web container, or the web container for the data source). At the same time, you'll try not to reduce throughput to a crawl if a failure occurs.
In general, increasing values for timeouts or pool sizes will delay recognition of a downstream component failure, but for pool sizes, a larger value also provides some buffering if a failure occurs. As you can see, tuning to prevent your website from stalling during a failure will require a trade-off between increasing and decreasing various parameters. Arriving at the optimal values for your environment requires iterative testing with various settings and failure scenarios. This way, you (or at least your computer systems) are prepared to fail, which in turn can help ensure your success (and continued employment).
Resources
- Queuing network
- Enabling PMI data collection
- Common IBM Data Server Driver for JDBC and SQLJ properties for all supported database products
- Tuning the application serving environment
- The WebSphere Contrarian: Less might be more when tuning WebSphere Application Server
- The WebSphere Contrarian: A better Web application configuration for high availability