Beware of Connection Storm

White Papers


Abstract

A "connection storm" is a behaviour that can occur in large production environments as a result of a slowdown in the application server or database layer.

Content

It happens like this: Let's say your site has 50 servers, each configured with 30 Web Container threads. On a typical day, the JVMs never have more than 8 active threads at any given time, and each server maintains on average 10 connections to the database.
Now there is a slowdown, which can originate in either the database or the WebSphere Commerce JVMs. For this example, let's assume the problem is a third-party Web Service that suddenly becomes slower than usual. As requests take longer to complete, each new incoming request needs a new WebContainer thread, and the DataSource must acquire a new database connection for that thread. Within a very short period of time, the WebContainer pool can hit its maximum size and be completely in use.
Before the slowdown, the database had 500 established connections (50 JVMs x 10 connections).
Now, with the JVMs' WebContainer pools completely saturated, the number of connections has surged all the way to 1,500 (50 JVMs x 30 WebContainer threads = 1,500 connections).
The delta is 1,000 new connections created during the event. This is what we call a "connection storm".
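The arithmetic above can be sketched in a few lines of Python. The function name and parameters are illustrative only. The numbers are those of the example, and the model assumes one database connection per busy WebContainer thread.

```python
def connection_delta(jvms, steady_conns_per_jvm, web_container_threads):
    """Return (baseline, peak, delta) database connections.

    Assumes every busy WebContainer thread holds exactly one connection,
    so a saturated JVM holds as many connections as it has threads.
    """
    baseline = jvms * steady_conns_per_jvm
    peak = jvms * web_container_threads
    return baseline, peak, peak - baseline


baseline, peak, delta = connection_delta(50, 10, 30)
print(baseline, peak, delta)  # 500 1500 1000
```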
Creating a connection has a cost. The database needs to handle the TCP connection, allocate memory and create the agent. If a large number of connection requests is received in a very short period of time, this can bog down the performance of the database to the point that it appears unresponsive.
The cascade effect plays an important role here: the more connections to the database server, the slower the DB responds, which in turn drives even more new connections to the database.
Remember, as the DB is slow, requests take longer in the AppServer, and new requests need new threads and new connections.
A connection storm is a side effect of a slowdown, but it will not always be present, and when it is, it might not be a problem at all. Considering the earlier example, the most common result is a slow AppServer layer, with a saturated Web Container pool and most threads waiting on the outbound Web Service call (as seen in Javacores).
Here the database might not even show up as a factor.
For the connection surge to be a problem, these two conditions need to take place:
1. The surge of connections is large: several hundred connections. A few dozen connections from a couple of JVMs shouldn't be a problem, unless the DB was struggling to begin with.
2. The connections are requested within a very short period of time.
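The two conditions above can be checked against a series of connection-count samples. The following is a minimal sketch in Python. The function name, the 300-connection threshold and the 60-second window are assumptions chosen for illustration, not recommended values, and the samples are made up.

```python
from datetime import datetime, timedelta


def find_surges(samples, min_delta=300, window=timedelta(seconds=60)):
    """Flag samples where the connection count rose sharply and quickly.

    samples: list of (timestamp, total_connections), sorted by timestamp.
    Returns (timestamp, increase) for every sample whose count is at least
    min_delta above the lowest count seen within the preceding window.
    """
    surges = []
    for i, (t_end, total_end) in enumerate(samples):
        # Counts observed within the window ending at t_end (current sample included).
        recent = [total for t, total in samples[: i + 1] if t_end - t <= window]
        increase = total_end - min(recent)
        if increase >= min_delta:
            surges.append((t_end, increase))
    return surges


# Hypothetical samples taken every 15 seconds.
start = datetime(2020, 3, 5, 12, 0, 0)
counts = [500, 510, 520, 900, 1500, 1500]
samples = [(start + timedelta(seconds=15 * i), c) for i, c in enumerate(counts)]

for when, increase in find_surges(samples):
    print(when.time(), increase)
```

A slow ramp of the same size spread over a longer period is not flagged, which matches the second condition.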
A similar situation is bringing JVMs up under load. With cold caches (especially without WXS), JVMs can be overloaded quickly and exhaust their WebContainer threads and database connections. Here you will also see connections surge quickly, but the cost of establishing the connections will most likely be a small factor compared to the actual SQL load and the CPU cost of execution in the JVMs.
The connection storm is a side effect of some other condition slowing down the AppServers or the database. The problem is that if the database becomes impacted by all the new connection requests, most troubleshooting data, such as Javacores, will point to the database being the problem. The actual root cause of the slowdown will not be evident until the effect of the connection storm is reduced or eliminated.

Monitoring Connections
Databases offer a variety of ways to monitor established connections. In DB2, for example, you can use "list applications", snapshots or db2pd.
Using the snapappl_info administrative view is one of the simplest ways. It's lightweight and can be executed frequently.
SELECT CURRENT TIMESTAMP time,
       COUNT(*) total,
       SUM(CASE WHEN appl_status = 'UOWWAIT'  THEN 1 ELSE 0 END) uowwait,
       SUM(CASE WHEN appl_status = 'UOWEXEC'  THEN 1 ELSE 0 END) uowexec,
       SUM(CASE WHEN appl_status = 'LOCKWAIT' THEN 1 ELSE 0 END) lockwait
  FROM sysibmadm.snapappl_info
 ORDER BY 1,2,3

Minimizing the impact of a connection storm
The key is to minimize the delta between the average number of connections and the maximum number that can be established during a slowdown.
The following techniques have proven useful:
Application Server layer:
1. Avoid an unnecessarily high WebContainer pool size. If 25 active threads are enough to overload a JVM, there is no reason to set the pool to a larger size.
2. Maintain a high number of established connections. Disable the aged timeout (be mindful of firewalls killing connections) and increase or disable the unused timeout. Set a high value for Minimum connections.
3. Change the Purge Policy to FailingConnectionOnly instead of EntirePool. For details: Connection pool settings
4. Consider using "Surge Threshold" and "Surge Creation Interval". For details: Connection pool advanced setting
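As a rough illustration of how the first two techniques shrink the delta, the sketch below compares a few hypothetical settings. It assumes each JVM's pool has already grown to its Minimum connections and keeps them established, and that a saturated JVM needs at most one connection per WebContainer thread. The function name and the tuned values are illustrative, not recommendations.

```python
def storm_delta(jvms, min_connections, web_container_threads):
    """Worst-case number of new connections opened during a slowdown.

    Assumes each JVM already holds min_connections established connections
    and can grow to one connection per WebContainer thread.
    """
    return jvms * max(web_container_threads - min_connections, 0)


# Settings from the earlier example: 10 established connections, 30 threads.
print(storm_delta(50, 10, 30))  # 1000
# Hypothetical tuning: smaller WebContainer pool, higher Minimum connections.
print(storm_delta(50, 20, 25))  # 250
# Minimum connections equal to the WebContainer pool size.
print(storm_delta(50, 25, 25))  # 0
```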
Database layer (DB2):
1. Review the NUM_INITAGENTS setting. If it is made equal to MAX_CONNECTIONS, there will be initial overhead when the DB is activated, but afterwards there won't be overhead to create new agents as the JVMs connect. For details: num_initagents - Initial number of agents in pool configuration parameter
2. The NUM_POOLAGENTS setting can be used to keep the agents in memory after the application disconnects. For details: num_poolagents - Agent pool size configuration parameter

Document Information

Modified date:
05 March 2020

UID

ibm13133773