This document applies to Rational DOORS Next Generation 6.0.6 iFix003 and later.
It is possible to construct a query across a large data set which runs for a long period of time. Such a long-running query may consume resources and cause the system to become unstable.
A new Rogue Query Monitor has been introduced to capture these queries.
It is now possible to abort a long-running query before it impacts system stability.
Queries that exceed a specified timeout produce a warning message, which is displayed to the end user in the web UI and written to rm.log on the server. Since 6.0.6 iFix003, queries that exceed a defined timeout can also be aborted automatically.
Note: Some RM internal queries are excluded from monitoring because they are expected to be long-running. These include ETL jobs for Reporting.
This document details how to use the timeout values and how to disable this feature, if required.
The following steps apply to the new Rogue Query Monitor functionality introduced as a stability fix in Rational DOORS Next Generation 6.0.6 iFix003.
The RM server uses the following advanced properties for the Rogue Query Monitor:
1) Rogue Query Monitor run interval
- Name - Rogue Query Monitor run interval (in seconds)
- Description - Run interval in seconds for the Rogue Query Monitor. A value of 0 will disable the Rogue Query Monitor
- Default value - 30 seconds
- Changing the value requires a server restart to become active
- A value of 0 seconds means that no query monitor runs. This restores the pre-iFix003 behaviour and should only be used under the direction of IBM Support.
2) Rogue Query timeout
- Name - Rogue SPARQL Query abort timeout (in ms)
- Description - Enables the RM Rogue Query Monitor to abort exceedingly long running SPARQL queries. If the execution of a query exceeds the amount of time in milliseconds specified here, the query execution will abort to avoid locking up the server
- Default value - 60000 (1 minute)
- Can change value on running server
- No minimum setting. If set to 0 or a negative value such as -1 (ms), the value is still added to the client timeout value (a negative value is deducted from it)
3) Web UI Query timeout (exists already)
- Name - query.client.timeout
- Description - Time after which the web UI expects a query to time out
- Default value - 30000 milliseconds (30 seconds)
Allowed runtime calculation
Logic used by the RM Rogue Query Monitor to calculate query abort:
Query start time + (Web UI Query timeout + Rogue Query timeout) < current time : abort query in Jena using current Thread Id
Using this calculation:
the minimum query runtime before an abort with the default settings is (30 seconds client timeout + 60 seconds rogue timeout) + 1 second for the rogue query monitor interval = 91 seconds
the maximum query runtime before an abort with the default settings is (30 seconds client timeout + 60 seconds rogue timeout) + 30 seconds for the rogue query monitor to kick in = 120 seconds (plus a minimal delay while the query monitor iterates through each of the queries running at that moment)
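The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the RM server's implementation: the class and method names are hypothetical, and the "+ 1 second" term simply mirrors the figure quoted in this document.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RogueQuerySettings:
    """Hypothetical holder for the three advanced properties (defaults from this document)."""
    client_timeout_ms: int = 30_000  # query.client.timeout
    rogue_timeout_ms: int = 60_000   # Rogue SPARQL Query abort timeout (in ms)
    run_interval_s: int = 30         # Rogue Query Monitor run interval (in seconds)

    def allowed_runtime_ms(self) -> int:
        # A negative rogue timeout is effectively deducted from the client timeout.
        return self.client_timeout_ms + self.rogue_timeout_ms

    def should_abort(self, start_ms: int, now_ms: int) -> bool:
        # Abort once the start time plus the allowed runtime lies in the past.
        return start_ms + self.allowed_runtime_ms() < now_ms

    def abort_window_s(self) -> Optional[Tuple[float, float]]:
        # A run interval of 0 disables the monitor, so nothing is ever aborted.
        if self.run_interval_s == 0:
            return None
        allowed_s = self.allowed_runtime_ms() / 1000
        # Earliest: the monitor runs just after the threshold (the "+ 1 second" above).
        # Latest: the monitor ran just before the threshold, so a full interval passes.
        return allowed_s + 1, allowed_s + self.run_interval_s


print(RogueQuerySettings().abort_window_s())  # (91.0, 120.0)
```

Changing `rogue_timeout_ms` in this sketch shows how much extra runtime a given setting grants, for example when sizing the timeout so that a long export can complete.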
The RM admin debug page for running SPARQL queries now has a checkbox to allow for long-running queries.
Additional logging accompanies this functionality and is enabled by default. Advanced logging can be enabled via a log4j property.
Informational logging that will be available in rm.log:
At server startup:
CRRRS8752I The RM query monitor task started. The run interval for the task is set to 30 seconds. The maximum query run time is set to 1 minute 30 seconds.
CRRRS8753I The RM query monitor task is disabled. The run interval for the task is set to 0 seconds.
When a rogue query is detected:
CRRRS8754W The RM query monitor detected and will cancel a query that started running at 8/23/18 3:50 PM and has been running for 1 minute 37 seconds. The query ID is 33f9b07d-d771-437a-854b-3a3437a2b0ed with thread 574.
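To review which queries were cancelled, rm.log can be scanned for CRRRS8754W entries. The following is a minimal sketch that assumes the message wording matches the sample above exactly; the pattern, the function name and the field names are illustrative, and the wording may differ by version or locale.

```python
import re

# Pattern derived only from the sample CRRRS8754W message in this document.
CANCELLED_QUERY = re.compile(
    r"CRRRS8754W .*?started running at (?P<started>.+?) "
    r"and has been running for (?P<duration>.+?)\. "
    r"The query ID is (?P<query_id>[0-9a-fA-F-]+) "
    r"with thread (?P<thread>\d+)"
)


def find_cancelled_queries(lines):
    """Yield one dict per log line that reports a cancelled rogue query."""
    for line in lines:
        match = CANCELLED_QUERY.search(line)
        if match:
            yield match.groupdict()


if __name__ == "__main__":
    # The path to rm.log is an assumption; adjust it for your installation.
    with open("rm.log", encoding="utf-8", errors="replace") as log:
        for hit in find_cancelled_queries(log):
            print(hit["started"], hit["duration"], hit["query_id"], hit["thread"])
```

The thread number in each hit can then be matched against a thread dump or other log entries when investigating what the query was doing.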
Debug logging can be invoked in order to fully understand what is occurring when operations and queries are timing out, via:
If you are asked to run a specific tool by IBM Support to troubleshoot a situation, or run a corrective procedure, you may need to adjust these settings. If it appears that nothing is happening in the GUI after 2 minutes, then please refer to the rm.log.
An example would be ReqIF export. Jazz.net defect: https://jazz.net/jazz03/web/projects/Requirements%20Management#action=com.ibm.team.workitem.viewWorkItem&id=126574/ APAR PH05142 is an example of a user action impacted by this. The advice is to amend the rogue query timeout to a value which allows your exports to complete.
IBM Support will advise whether to turn off this feature temporarily, or whether an adjustment to the time-out will suffice.
Internal Use Only
Areas where we may need to make temporary adjustments to these settings:
See self-host: https://bit.ly/2QW3D93 for further details, including how to use a header to exempt SPARQL queries from monitoring
A customer hit these timeouts when using the Module analyzer tool to list the modules. It did not report an error; nothing was returned. We will create an APAR for this and list it here
1. From test runs on an RM server with a high workload, we saw that relatively aggressive, short runtime timeout settings are needed to prevent the server from hanging with the CPU pegged at 100%. Hence the 30 + 60 second default settings.
2. On such a server, when setting higher timeout values, one rogue query can result in queries piling up in the Jena query engine, exhausting CPU and memory, and soon pushing the server to a hang.
3. On a server with a lower workload, the server can cope with queries running for a long time (we have seen an RM view that resulted in a complex query running for 45 minutes and returning 0 results). Set a value for the "Rogue SPARQL Query abort timeout" advanced property to avoid a user waiting on a long-running query that will probably finish without results
4. There is no upper limit on the query timeout setting. If the customer never sees queries taking many minutes, an option is to disable the query monitor. If they do have queries that take a long time, but it is a low-load server, the query timeout could be increased. However, to prevent the whole server from hanging, the earlier the query is killed, the better
RDNG; Rational DOORS Next Generation
13 November 2018