Happy Thanksgiving to everyone. I hope everyone was able to get a good meal and time with family today.
This week I'm writing to you from Seoul, South Korea (it is actually Friday the day AFTER Thanksgiving here yet the Macy's Thanksgiving parade I am watching via Slingbox is still on). I'm working with some colleagues here and doing some mentoring and skills transfer to help broaden the problem determination skills within IBM. Which brings me to today's topic. We encountered a classic application hang. Sometimes, but not all the time, the administrator would restart the application on WAS v8.5 and when the test team started to apply load to the application it would hang. Javacores from kill -3 showed all threads stuck in createOrWaitForConnection. Now for those of you who do follow my blog you probably know about the various techniques I've posted to debug this situation. As we had no access to the developers it was up to us to try and figure out what was causing the hang. Various random twiddling of various AIX OS level parameters didn't work (random changes never do). If they waited long enough the application would sometimes recover and start processing again.
After watching the testing go on for a while I finally suggested we increase the connection pool maximum size to 2n+1 where n = thread pool maximum. The setting the team had set the connection pool maximum was equal to the thread pool max. There was some disbelief that we should go down this path. Any good administrator knows that we want classic funneling where thread pool max is larger than connection pool max to make optimal use of memory, CPU, etc. They re-ran the test and after the 5th attempt realized that we would not recreate the hang. I've posted this command before:
netstat -an |grep ESTA |grep <port#> |wc -l
which gives a connection count to the database on port#. It may be double the value (showing source and destination connections) so you may have to divide the value in half. In our case with thread pool max at 50 and connection pool max set to 101 we were capturing as many as 90 established connections to the database at any one time. Obviously the developers of the application were following the anti-pattern of opening a second connection to the database before closing the first connection which resulted in the deadlock our team in Seoul was observing.
So why wasn't this deadlocking with each and every test? That comes down to randomness. Load tests while they may follow a set process and scripts there is some variability between each test. While it may not vary widely test after test the variability exists in terms of timing on the server. There can be various processes running, or not, at any given point in time. Load on the CPU or tasks the OS is doing can subtly change that timing inducing variability. Timing is key and in some cases the test team got lucky and the test would work. Other times the timing was off and the application would deadlock. This particular anti-pattern is very sensitive to timing. Get the wrong timing and the application will deadlock and hard.
In addition, when they would wait a while the application would recover. This is because underneath the cover of WAS it is quietly reclaiming connections because it knows how long threads have been holding open connections. Once a threshold (timeout) is reached WAS begins the active process of reclaiming connections that have been opened too long. This results in free connections being returned to the pool and the threads that were stuck in createOrWaitForConnection can resume processing.
What is the lesson learned here? When load testing an unknown application it might be worth setting connection pool max to 2n+1 of the thread pool max just to start with and using the command line netstat command (or your application monitoring tools) to see how many connections the application attempts to use. Then once experience is gained with the application reduce the size of the connection pool to something more reasonable based off the observed high water marks in the the connection pool utilization. This is a lot easier tactic than trying to debug an application that is deadlocked in createOrWaitForConnection.