This document explains issues that can arise under heavy load and simultaneous execution of the same job, agent, or report on distributed installs of the IBM Cognos platform. Errors such as CNC-ASV-0025 (Agent condition invalid), RSV-DST-0014 (Unable to deliver the output of the report) and CM-SYS-5192 (deadlock) can appear.
It then outlines the solutions put in place in post-September 2013 releases and some best practices, before going into detail on the underlying architecture of the agent and job services.
Agents and Jobs under load
Consider a distributed system with multiple job and monitor services running under heavy load. Each system uses SQL via JDBC to access and manipulate queue data.
When conflicting updates occur, the database tries to resolve them via locking, usually locking a row but sometimes a page that contains multiple rows. Transactions usually just have to wait for a lock, but sometimes a deadlock occurs and one of the transactions is rejected.
EMF components have been tested at distributed load and have handling code that retries transactions to cope with deadlock exceptions from the database.
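For illustration, the sketch below shows the general shape of such a deadlock-retry strategy around a JDBC transaction. The class, the retry limit and the SQLSTATE check are assumptions for the example, not the actual EMF handling code.

import java.sql.Connection;
import java.sql.SQLException;

public class DeadlockRetry {
    // "40001" is the standard SQLSTATE for serialization failures / deadlocks;
    // some databases report vendor-specific codes instead.
    private static final String DEADLOCK_SQLSTATE = "40001";
    private static final int MAX_RETRIES = 3;

    interface QueueWork { void run(Connection cx) throws SQLException; }

    // Assumes auto-commit is disabled on the connection.
    static void runWithRetry(Connection cx, QueueWork work) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                work.run(cx);            // manipulate queue rows inside one transaction
                cx.commit();
                return;
            } catch (SQLException e) {
                cx.rollback();
                boolean deadlock = DEADLOCK_SQLSTATE.equals(e.getSQLState());
                if (!deadlock || attempt >= MAX_RETRIES) {
                    throw e;             // not a deadlock, or retried too many times
                }
                // deadlock victim: fall through and retry the rejected transaction
            }
        }
    }
}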
Simultaneous Jobs and Agents under load
If you schedule jobs or agents faster than they can run, they stack up on the queue; the same effect can be produced by simultaneously triggering the same job or agent. You may even have introduced recursion by nesting a job or agent in an agent that calls the parent agent. This type of recursion is not recommended.
Jobs and agents can therefore run simultaneously on a distributed system and collide at the database level as separate systems call into the Content Manager and EMF databases, trying to access and modify the same rows.
For example, schedule an agent to run every minute, then create a job that runs the same agent and schedule that job for every minute. This can quickly build up a queue of pending runnables and cause the same runnable to be executed simultaneously on separate installs.
This leads to potential deadlock issues at the database level. Both Content Manager and EMF have strategies to cope with this kind of scenario.
To clarify, we are talking about simultaneous execution of the same agent.
Agents query for business events, which are rows in a database table, by running a report, then store the report output in Content Manager, then asynchronously run tasks, then write history details to Content Manager.
Agents must not run simultaneously, as the event information they search for may be corrupted or lost, and concurrent access of Content Manager objects can cause problems.
For releases prior to September 2013, this sort of agent collision caused sporadic problems such as null pointer errors from the agent service and report service errors regarding the condition report output. Releases after September 2013 have the following enhancements.
Solution part I – Agent serialization
We added a database-backed locking mechanism for EMF to use. The agent service detects that it is running in a distributed system and switches its single-server agent lock to the database-backed lock.
The lock key is the agent's store ID, so the lock only blocks simultaneous runs of the same agent. You can now overlap runs of agents and the lock will serialise them.
Agents must wait for the lock in order to run; because the lock key is their store ID, agents only lock themselves out. There is a lock wait timeout that defaults to 10 minutes. If an agent cannot get the lock within the timeout period, that means the agent is already running, so it stops waiting and marks itself as succeeded. If the lock it is waiting for has itself been waiting for longer than the timeout period, the agent is marked as failed.
In most cases 10 minutes is plenty of time to wait for a lock, and any longer wait indicates a slow or overloaded system.
In rare cases of highly concurrent parameterized agents that must not fail, this timeout can be extended with the advanced property com.cognos.jsmcommon.lock.wait.timeout, which is specified in milliseconds, so the 10-minute default is
com.cognos.jsmcommon.lock.wait.timeout = 600000
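To make the mechanism concrete, here is a minimal sketch of a database-backed lock keyed by a store ID with a wait timeout. The lock table, column names and polling approach are assumptions for illustration, not the actual EMF implementation.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StoreIdLock {
    // Hypothetical lock table: CREATE TABLE AGENT_LOCK (STORE_ID VARCHAR(64) PRIMARY KEY)
    private static final long WAIT_TIMEOUT_MS = 600000;   // mirrors the 10 minute default
    private static final long POLL_INTERVAL_MS = 1000;

    // Returns true if the lock was acquired; false means the same agent is
    // already running elsewhere and the wait timed out.
    static boolean acquire(Connection cx, String storeId)
            throws SQLException, InterruptedException {
        long deadline = System.currentTimeMillis() + WAIT_TIMEOUT_MS;
        while (System.currentTimeMillis() < deadline) {
            try (PreparedStatement ps = cx.prepareStatement(
                    "INSERT INTO AGENT_LOCK (STORE_ID) VALUES (?)")) {
                ps.setString(1, storeId);
                ps.executeUpdate();
                cx.commit();
                return true;               // lock row created: this run owns the lock
            } catch (SQLException e) {
                cx.rollback();             // duplicate key: another run holds the lock
                Thread.sleep(POLL_INTERVAL_MS);
            }
        }
        return false;                      // timed out waiting for the lock
    }

    static void release(Connection cx, String storeId) throws SQLException {
        try (PreparedStatement ps = cx.prepareStatement(
                "DELETE FROM AGENT_LOCK WHERE STORE_ID = ?")) {
            ps.setString(1, storeId);
            ps.executeUpdate();
            cx.commit();
        }
    }
}

Because the lock key is the agent's store ID, two different agents never contend for the same row; only overlapping runs of the same agent serialise behind it.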
Solution part II – History write serialization
A rarer problem can occur at high load when the history writes from the monitor service for simultaneous runs of the same agent overlap and cause issues. The same lock mechanism is used for this; it is controlled by an advanced property, intended for systems where customers are likely to run the same thing on multiple servers simultaneously.
Agent runs and history writes are now serialised and will not tangle in Content Manager.
1. Use Triggers
Use a triggered schedule rather than a frequent polling schedule for agents, if possible, to minimise contention.
2. Ensure resources
Ensure enough system resources are available for Content Manager, authentication source and notification database access.
On heavily loaded systems, async conversations may be abandoned due to non-response of the target server, or plain socket timeout errors; this indicates resource starvation.
3. If using rapid schedules (or concurrent triggers)
Use number-based history / output retention (rather than duration-based).
Use the audit database to track run histories, to reduce history read / write contention.
Ensure that the history / output retention is high enough to cope with the number of simultaneous runs.
Use the monitor service advanced property advanced.history.write.lock = true
Use these Content Manager advanced properties
4. If using highly concurrent parameterized agents
If such agents are passed data that must not be lost (as opposed to agents that query a data source for their data), the lock timeout may mean that they do not run at all. In this case the lock wait timeout period can be increased from its default value of 600000 milliseconds:
com.cognos.jsmcommon.lock.wait.timeout = 600000