Slow and hung request detection

The requestTiming-1.0 feature provides diagnostic information when the duration of any request exceeds the configured threshold.

The latest documentation about slow and hung request detection is available on the Open Liberty website.

The request timing feature can track the duration of every request that is coming into the system. You can configure the feature to watch for slow and hung requests.

Slow request detection
Hung request detection
requestTiming-1.0 feature restrictions

Slow request detection

When a request runs for longer than it was configured, a warning message is written in the messages log file. Details about the request and events that made up the request are captured.

The following sample shows the log message for a request that crossed the slow request threshold. The default is 10 seconds:

[12/1/14 11:58:09:629 IST] 0000001d com.ibm.ws.request.timing.SlowRequestTimer
W TRAS0112W: Request AABjnS+lIn0_AAAAAAAAAAb has been running on thread 00000021 for at least 10003.571ms. 
The following stack trace shows what this thread is currently running.

 	 at java.util.HashMap.getEntry(HashMap.java:516)
	 at java.util.HashMap.get(HashMap.java:504)
	 at org.apache.derby.iapi.store.access.BackingStoreHashtable.get(Unknown Source)
	 at org.apache.derby.impl.sql.execute.HashScanResultSet.getNextRowCore(Unknown Source)
	 at org.apache.derby.impl.sql.execute.NestedLoopJoinResultSet.getNextRowCore(Unknown Source)
	 at org.apache.derby.impl.sql.execute.ProjectRestrictResultSet.getNextRowCore(Unknown Source)
	 at org.apache.derby.impl.sql.execute.DMLWriteResultSet.getNextRowCore(Unknown Source)
	 at org.apache.derby.impl.sql.execute.DeleteResultSet.setup(Unknown Source)
	 at org.apache.derby.impl.sql.execute.DeleteResultSet.open(Unknown Source)
	 at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
	 at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
	 at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
	 at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown Source)
	 at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeUpdate(Unknown Source)
	 at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.executeUpdate(WSJdbcPreparedStatement.java:626)
	 at com.ibm.websphere.samples.trade.direct.TradeDirect.resetTrade(TradeDirect.java:1832)
	 at com.ibm.websphere.samples.trade.web.TradeConfigServlet.doResetTrade(TradeConfigServlet.java:65)
	 at com.ibm.websphere.samples.trade.web.TradeConfigServlet.service(TradeConfigServlet.java:348)
	 at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
	 at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1275)
	 ....
	 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1121)
	 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
	 at java.lang.Thread.run(Thread.java:769)
 
Duration    Operation
10007.571ms + websphere.servlet.service | TradeWeb | TradeConfigServlet?action=resetTrade
    3.923ms       websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from holdingejb where holdingejb.account_accountid is null
    0.853ms       websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from accountprofileejb where userid like 'ru:%'
 5271.341ms +     websphere.datasource.psExecuteUpdate | jdbc/TradeDataSource | delete from orderejb where account_accountid in (select accountid from accountejb a where a.profile_useri like 'ru:%')

The request continues to be monitored, and a further warning is logged if the request is running beyond another 10 seconds. The log messages are in the following format:

TRAS0112W: Request <(Request ID)> has been running on thread <THREADID> for at least <DURATION>. The following stack trace shows what this thread is currently running.
<STACK TRACE>
<DURATION AND OPERATIONS Table>

REQUEST ID: This same ID can be used to search for log and trace messages corresponding to the request. In particular, if you use binary logging, you can search for log and trace entries with the same requestID extension by using the binary log command.
STACK TRACE: Indicates the method that is running. In the previous sample, you can see a stack trace of the current request after TRAS0112W line.
DURATION AND OPERATIONS Table: After the stack trace, you can find the tabular format of the request that shows the duration and operation (also referred to as the event). The Duration column indicates the time that is taken by the corresponding operation of the request. The plus sign (+) indicates events within the request that are still running. The next line shows the duration without +, which indicates that the corresponding operation completed in the specified duration. Operation shows the EVENT TYPE and CONTEXT INFO (optional) for that operation. For more information about event types and context information, see Event Logging.

By analyzing the messages, you can figure out why the request is slow. However, it might be difficult to determine whether the request is stuck at that point or is still running slowly. Hence you can see three messages that are logged for any slow request at the interval of specified <slowRequestThreshold>. Using the three different stack trace and request data, you get a better insight into the issue. After the third warning, no further warnings are logged about the request unless the duration of the request crosses the hung request detection threshold.

Hung request detection

If the request exceeds the default hungRequestThreshold or the configured threshold value, a warning message is written in the messages log file along with the details about the request. Details about the request and events that made up the request are captured. For a hung request detection, a series of three thread dumps (javacores) is taken with a 1-minute delay between them. The following log message sample shows the log messages for a request that crossed the hung request detection threshold. The default duration value is 10 min. The value that is configured in the following example is 4 min.

[WARNING ] TRAS0114W: Request AAA7WlpP7l7_AAAAAAAAAAA was running on thread 00000021 for at least 240001.015ms. The following table shows the events that have run during this request.

Duration       Operation
240001.754ms + websphere.servlet.service | TestWebApp | TestServlet?sleepTime=480000
     0.095ms       websphere.session.setAttribute | mCzBMyzMvAEnjMJJx9zQYIw | userID
     0.007ms       websphere.session.setAttribute | mCzBMyzMvAEnjMJJx9zQYIw | visitCount

If a request gets completed after it was detected as hanging initially, a message similar to the following example is logged:

TRAS0115W: Request AAA7WlpP7l7_AAAAAAAAAAA

It was previously detected to be hung and completed after 479999.681 s.

Note: When a request is detected to be hanging, a series of three thread dumps is initiated. After the completion of the three thread dumps, further thread dumps are created only if the new requests are detected to be hanging.

If interruptHungRequests is set to true in the request timing configuration, an attempt is made to interrupt the hung request after the first thread dump has completed. When the server is using an IBM JRE, the classes in the com.ibm.jvm package, such as InterruptibleIOContext and InterruptibleLockContext, are used to interrupt the request. Other methods might be used to interrupt the request depending on what features are configured in the server, and what code the application is executing when the hung request is detected. In most cases, the result of an interrupt is a declared exception that is thrown to the application code that was in control when the hung request was detected. The application can then handle the exception and return an appropriate error to the caller.

Note: If multiple hung requests are detected concurrently, each request waits until the first thread dump for the first hung request is complete before attempting to interrupt that request.

Configuring the slow and hung request thresholds

The slow and hung request thresholds are configured by setting slowRequestThreshold and hungRequestThreshold in the <requestTiming/> configuration in the server.xml file. This setting sets a single slow and hung request threshold for all requests that are timed.

You can also set different slow and hung request thresholds for different types of requests. An embedded <servletTiming/> or <jdbcTiming/> configuration in the server.xml file overrides the configured slow and hung request threshold that are defined in <requestTiming/>. A <servletTiming/> configuration can set the slow and hung request threshold by application name, servlet name, or both. A <jdbcTiming/> configuration can set the slow and hung request threshold by datasource name, sql string, or both. In both cases, the names are taken from the context information for the request, as reported by the eventLogging-1.0 feature.

The event log contains the type and context information for each request that was processed by the server. With type and context information, you can create a <servletTiming/> or <jdbcTiming/> configuration in the server.xml file for the requests that require different threshold values.

For more information about what context information is provided for each request, see Event Logging in the product documentation. A <servletTiming/> or <jdbcTiming/> configuration can only be used to time a root level request. For example, if a servlet request makes a JDBC query, the JDBC query cannot be timed because the servlet request is the root level request.

See the following <requestTiming/> configuration with embedded <servletTiming/> configuration for one specific servlet and one specific web application:

<requestTiming slowRequestThreshold="10s" hungRequestThreshold="10m">
  <servletTiming servletName="Arm10Servlet" hungRequestThreshold="20s"/>
  <servletTiming appName="SecurityExchangeApp" hungRequestThreshold="2h"/>
</requestTiming>

requestTiming-1.0 feature restrictions

For the requestTiming-1.0 feature, the following restriction applies:

The requestTiming-1.0 feature, when activated, has been shown to have a 4% adverse effect on the maximum possible application throughput when measured with the DayTrader application. While the effect on your application might be more or less than that, you should be aware that some performance degradation might be noticeable.