The WebSphere Contrarian: Preparing for failure

While many enterprises ensure that application and infrastructure performance testing and tuning are part of every implementation project plan, another essential testing and tuning phase is often overlooked -- one to ensure that application and component failover occurs without any impact to application availability. This installment of The WebSphere® Contrarian discusses how to approach that task. This content is part of the IBM WebSphere Developer Technical Journal.

Tom Alcott, Senior Technical Staff Member, IBM

Tom Alcott is Senior Technical Staff Member (STSM) for IBM in the United States. He has been a member of the Worldwide WebSphere Technical Sales Support team since its inception in 1998. In this role, he spends most of his time trying to stay one page ahead of customers in the manual. Before he started working with WebSphere, he was a systems engineer for IBM's Transarc Lab supporting TXSeries. His background includes over 20 years of application design and development on both mainframe-based and distributed systems. He has written and presented extensively on a number of WebSphere run time issues.



02 November 2011

In each column, The WebSphere® Contrarian answers questions, provides guidance, and otherwise discusses fundamental topics related to the use of WebSphere products, often dispensing field-proven advice that contradicts prevailing wisdom.

Plan a successful recovery

Let me start by saying I’m not suggesting that you plan to fail; rather, that you need to be prepared in the event of a failure, specifically the failure of some portion of your application infrastructure. Further, since a failure or outage is inevitable, you need to plan and practice failures so that you are prepared for when a failure or outage occurs. Otherwise, the John Wooden phrase "Failing to prepare is preparing to fail" will apply to you.

Specifically, what you should be doing is practicing or simulating outages, preferably in a test environment, not only to determine how best to tune your application infrastructure for recovery, but also to ensure that your problem notification and problem resolution procedures are effective. As I often state when speaking to customers: You don’t want to learn by doing during an outage, which is why practice is essential.


The basics

The IBM® WebSphere® Application Server Information Center article on the Queuing Network associated with WebSphere Application Server outlines the components that you need to tune and test for performance. It is also a good starting list of the components that you need to test and tune in order to minimize the impact of a failure. However, the list (and figure) in the Information Center doesn’t necessarily include all the components, nor does it mention the number of components at each layer, which is typically more than one when clustering is employed for high availability. While not intended to be all inclusive, Figure 1 adds some granularity to the multiple dimensions of hardware and software components, as well as the number of instances of a given hardware or software component that need to be considered.

Figure 1. WebSphere Application Server components and queues in a typical clustered environment

If you draw a similar diagram for your environment, the testing and tuning that needs to be considered should become apparent. For example:

  • If one of the HTTP servers fails, does the remaining HTTP server (or servers) have sufficient capacity to handle the average production workload? What about the peak production workload? What OS tuning is required for connection timeouts and request timeouts in order to make failure recognition and request redirection by the “upstream” components (in this case the IP sprayer) as seamless as possible?
  • Similarly, if an application server should fail, do the remaining application server instances have capacity to process average and peak workloads? What tuning is required of the HTTP server plug-in to minimize latency in failover of requests to the remaining application server(s)?
  • In the case of a database failure, what OS tuning, WebSphere Application Server connection pool tuning, and JDBC provider tuning need to be performed? If hardware clustering for the database is employed, such as HACMP or Veritas Cluster Server, what tuning of the clustering software needs to be performed? If database replication is involved, such as IBM DB2® HADR or Oracle® RAC, again what specific tuning and configuration should be employed to make a failover scenario as seamless as possible?
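The capacity questions in the list above can be turned into a quick calculation before any testing starts. The following is a minimal sketch, assuming you already know a per-instance capacity from performance testing; the function names and all figures are hypothetical and only illustrate the arithmetic.

```python
def surviving_utilization(load_rps, instances, capacity_per_instance_rps, failed=1):
    """Utilization of the remaining instances once `failed` instances are lost."""
    remaining = instances - failed
    if remaining <= 0:
        raise ValueError("no instances left to take the load")
    return load_rps / (remaining * capacity_per_instance_rps)


def can_absorb_failure(load_rps, instances, capacity_per_instance_rps,
                       failed=1, max_utilization=1.0):
    """True if the surviving instances can carry the load within the limit."""
    utilization = surviving_utilization(
        load_rps, instances, capacity_per_instance_rps, failed)
    return utilization <= max_utilization


# Hypothetical cluster: 3 instances, each measured at 400 requests/second.
AVERAGE_LOAD_RPS = 600
PEAK_LOAD_RPS = 1000
average_ok = can_absorb_failure(AVERAGE_LOAD_RPS, 3, 400)
peak_ok = can_absorb_failure(PEAK_LOAD_RPS, 3, 400)
```

With these assumed numbers the two surviving instances can carry the average load but not the peak, which is exactly the kind of gap the failover tests are meant to expose. Run the same check for each tier in your diagram (HTTP servers, application servers, database nodes).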

Monitoring

Implied in the preceding section is the requirement to monitor your current production environment and to collect current response time and resource utilization metrics, so that a representative load (one that replicates production average load and peak load) can be employed for testing and tuning. Table 1 lists the WebSphere Application Server PMI statistics that I typically employ for this purpose. Those familiar with WebSphere Application Server, or more specifically WebSphere Application Server PMI, will notice that this list is not the default set of PMI statistics. The reason I employ the list below is that it provides me with the data to determine actual resource use, not just pool sizes. To enable the PMI settings shown here, follow the instructions in the Information Center for Enabling Custom PMI collection. Be aware also that a couple of the metrics listed are not PMI statistics; rather, they are available in the application server's SystemOut.log. Additionally, because SessionObjectSize has a significant performance impact, it should only be used in test, and then only sparingly.

Table 1. WebSphere Application Server PMI statistics

Connection pools (JDBC)
  • AllocateCount
  • ReturnCount
  • CreateCount
  • CloseCount
  • FreePoolSize
  • PoolSize
  • JDBCTime
  • UseTime
  • WaitTime
  • WaitingThreadCount
  • PrepStmtCacheDiscardCount

Connection pools (JMS)
  • JMS queue
  • Connection factory
  • Connection pools
  • Pool size
  • Percent maxed
  • Percent used
  • Wait time

JVM runtime
  • HeapSize
  • UsedMemory
  • % Free after GC (optional)
  • % Time spent in GC (optional)

HTTP session manager
  • ActiveCount
  • CreateCount
  • InvalidateCount
  • LiveCount
  • LifeTime
  • TimeSinceLastActivated
  • TimeoutInvalidationCount
  • SessionObjectSize* (optional)

System data
  • ProcessCpuUsage

Thread pool
  • ActiveCount
  • ActiveTime
  • CreateCount
  • DestroyCount
  • PoolSize
  • DeclaredThreadHungCount

Web container
  • ActiveCount
  • ActiveTime
  • CreateCount
  • DestroyCount
  • PoolSize
  • DeclaredThreadHungCount

Messaging engine
  • BufferedReadBytesCount
  • BufferedWriteBytesCount
  • CacheStoredDiscardCount**
  • CacheNotStoredDiscardCount**
  • AvailableMessageCount (optional)
  • LocalMessageWaitTime (optional)

* Only in test.
** Not in PMI; available in logs.



Tuning

Once you have accurate data on current resource use and response time at each level, you can proceed with tuning for both performance and failover, which you will then validate via testing. To maximize performance and throughput, I typically start by observing the actual pool usage (say, the web container thread pool or the JDBC connection pool) at peak and multiply the actual pool use by 1.2 in order to size the pool before performance testing. For the purpose of testing for failover, the observed values also come into play in order to correctly set the various queue depths and timeouts that WebSphere Application Server components employ to detect downstream component failure and, in turn, to direct requests to an alternative (clustered) component.
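As a small illustration of the sizing rule above, the sketch below multiplies an observed peak pool usage by the 1.2 factor and rounds up to a whole number of threads or connections. The function name and the sample figures are hypothetical; the result is only a starting value to be validated by testing.

```python
import math


def initial_pool_size(peak_in_use, factor=1.2):
    """Starting pool size: observed peak usage times a factor, rounded up."""
    if peak_in_use < 0:
        raise ValueError("peak_in_use must not be negative")
    # Round first so floating-point noise cannot push an exact
    # result (for example 48.0) up to the next whole number.
    return math.ceil(round(peak_in_use * factor, 6))


# Hypothetical peaks taken from PMI data (for example PoolSize minus FreePoolSize).
web_container_threads = initial_pool_size(40)
jdbc_connections = initial_pool_size(41)
```

Rounding up rather than down keeps the pool from being sized below the margin the factor was meant to provide.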

For example, if you have long running requests (say, several seconds), either at the web tier or at the data access tier, then you’ll need to set connection timeouts or request timeouts to be long enough to permit normal requests to process without signaling a failure. The downside is that requests will continue to be sent to a possibly non-responsive component for several seconds after it has ceased to function. Requests will therefore be queued upstream of the failed component, consuming sockets, threads, and pooled connections while they wait for the timeout that triggers redirection.
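To get a feel for the cost of a long timeout, a back-of-the-envelope estimate can help. The sketch below assumes that requests are spread evenly across cluster members and that the upstream component keeps routing to the failed member until one full timeout has elapsed; both are simplifying assumptions, and the function name and figures are hypothetical.

```python
def stalled_requests(arrival_rate_rps, members, timeout_s):
    """Rough count of requests sent to a failed member before it is detected."""
    if members <= 0:
        raise ValueError("members must be positive")
    rate_to_failed_member = arrival_rate_rps / members
    return rate_to_failed_member * timeout_s


# Hypothetical load: 200 requests/second across 4 cluster members.
with_long_timeout = stalled_requests(200, 4, 60)
with_short_timeout = stalled_requests(200, 4, 5)
```

Each stalled request holds a thread, socket, or pooled connection upstream for the duration of the timeout, so comparing this estimate with your upstream pool sizes gives a first indication of whether a failure would exhaust them. Treat the measurement taken during testing as authoritative, not this estimate.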

High availability note

In a distributed computing environment, high availability and failover detection between components most often rely on a communication connection failure: a premature closing of a connection, the inability to open a connection, or the timeout of a request over an existing connection. This is the basis not only for a significant portion of the WebSphere Application Server Network Deployment high availability architecture between components, but also for other products targeted at high availability and failover in distributed computing environments, such as hardware clustering offerings like HACMP, MC/Serviceguard, Veritas Cluster Server, and so on.

My thoughts on tuning the HTTP Server plug-in remain unchanged from an earlier article, so I’ll focus on connection pool and JDBC provider tuning. Based upon the observed values for the connection pool use time, you’ll want to set the connection pool connection timeout to as low a number as is practical for your environment and applications, but not so low that you incorrectly mark a database as unavailable because of a single long running request. Suppose you observe an average use time of 500ms; it’s likely fine to consider setting the timeout to less than 10 seconds, which is significantly less than the default of 180 seconds. How much lower? Well, is the application you’re testing representative of all your applications, or are there one or more applications that routinely run long queries? If there are, the connection pool average use time for these long running applications, multiplied by some “safety factor” (such as the 1.2 mentioned before for pool sizes), should represent the lowest value for the connection timeout.

I'd add that in the later versions of WebSphere Application Server, the connection pool Purge Policy of EntirePool is the default. This is the recommended policy, so you should employ this Purge Policy if it's not the default in your environment.

Once you've tuned the connection pool, you'll also need to investigate and set the timeout properties for your JDBC provider, because the WebSphere Application Server connection pool relies on the JDBC provider to manage connection and request timeouts. This information will generally be found in your database documentation, rather than in the WebSphere Application Server Information Center. In turn, the JDBC provider relies on the operating system TCP/IP implementation, so that too will need to be tuned. A quick reference to some key operating system and TCP/IP parameters is available in the Information Center.

(A hint: For DB2, two key properties for tuning connection behavior for the JDBC provider are blockingReadConnectionTimeout and loginTimeout, which control the length of time an existing connection remains open and the time spent waiting for a new connection, respectively. Extra hint: the Oracle analogs for these two properties are oracle.jdbc.ReadTimeout and setLoginTimeout (see Resources). Further, in WebSphere Application Server there's a pre-populated list of custom properties for each datasource, and for both DB2 and Oracle the WebSphere Application Server custom property loginTimeout controls the time spent waiting for a new connection.)
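The lower bound for the connection pool connection timeout described above can be worked out from the observed use times. The following is a minimal sketch of that heuristic, assuming you have an average UseTime per application in milliseconds; the application names and values are made up, and rounding up to whole seconds is my own assumption because the setting is specified in seconds.

```python
import math


def min_connection_timeout_s(avg_use_time_ms_by_app, safety_factor=1.2):
    """Lowest sensible connection timeout, in whole seconds.

    Takes the longest average use time across applications, applies the
    safety factor, and rounds up to the next full second.
    """
    if not avg_use_time_ms_by_app:
        raise ValueError("at least one application is required")
    longest_ms = max(avg_use_time_ms_by_app.values())
    seconds = longest_ms * safety_factor / 1000.0
    # Round first so floating-point noise cannot add a spurious second.
    return math.ceil(round(seconds, 6))


# Hypothetical UseTime averages; "reporting" routinely runs long queries.
observed = {"storefront": 500, "reporting": 6000}
floor_s = min_connection_timeout_s(observed)
```

The value returned is a floor, not a recommendation: anything between it and the 180 second default is a candidate, and the failover tests decide where in that range your environment should end up.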


Testing

After initially setting values, you’ll then need to proceed with testing. Unlike normal performance testing -- which typically attempts to maximize throughput by running one or more components to 100% CPU (or perhaps network saturation) -- you’ll instead simulate a workload that is representative of your production average as well as your production peak. Once you reach a steady state, you’ll then induce or simulate the failure of a given component. Commonly employed techniques are pulling a network cable or stopping the network adapter, but while these techniques are often used, they might not accurately simulate many outages.

Say, for example, that one of your components slows down due to load but continues accepting requests and doesn’t actually close connections (it’s only slow, not dead). Inducing a network failure doesn’t simulate this particular type of failure. In fact, simulating a network failure should be “low hanging fruit,” since detecting refused connection requests or timed-out requests is core to most distributed computing HA, as noted earlier. If you really want to be “prepared to fail,” you’ll need to perform additional tests. For example, you could write a shell script that runs several for(ever) loops, writing output in order to consume CPU and memory, effectively hanging the OS but keeping existing connections and requests open until they time out (if they ever do). Another alternative would be to write a servlet, added to the applications being tested, that sleeps for a long time; again, the intent is to consume one or more sets of resources to eventually induce a failover. Yet another alternative is to employ the functions for "hanging" a process that are often provided by system monitoring and debugging tools.
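The difference between a dead component and a "slow, not dead" one can be demonstrated with a few lines of code. The sketch below is not the shell script or servlet approach described above; it swaps in a simpler stand-in, a listening socket that never answers, and a probe that reports what an upstream caller would see. The function names are hypothetical, and the behavior relies on the operating system completing TCP handshakes from the listen backlog.

```python
import socket


def start_hung_listener(host="127.0.0.1"):
    """Open a listening socket that never calls accept() or replies.

    The OS completes the TCP handshake from the listen backlog, so
    clients connect successfully but never receive a response.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(5)
    return server, server.getsockname()[1]


def probe(host, port, timeout_s):
    """Classify how a component looks to an upstream caller."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping\n")
            reply = conn.recv(1)  # blocks until data, close, or timeout
            return "responded" if reply else "closed"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timed out"
```

Probing the hung listener with a short timeout should report a timeout only after the full timeout has elapsed, whereas a port with nothing listening is normally refused almost immediately. That gap in detection time is what the timeout settings discussed earlier control, and it is why a pulled cable or stopped process is the easier case to pass.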

As you progress through testing by simulating the failure of one cluster member, you’ll need to observe the number of requests that become queued and the amount of time needed to fail over. You’ll then need to start adjusting the timeouts and pool sizes in order to provide for earlier recognition by the upstream component (for example, the HTTP server plug-in for the web container, or the web container for the datasource), while at the same time not reducing throughput to a crawl in the event of a failure.

In general, increasing values for timeouts or pool sizes will delay recognition of a downstream component failure, but in the case of pool sizes a larger value also provides some buffering in the event of a failure. As you can see, tuning to prevent your website from stalling in the event of a failure will require a tradeoff between increasing and decreasing various parameters. Arriving at the optimal values for your environment will require iterative testing with various settings and failure scenarios so that you (or at least your computer systems) will be prepared to fail, which in turn should help ensure your success (and continued employment).

Resources
