This article describes how to detect overlapping scheduler job executions and presents a solution that prevents the problem at cluster scope or lower. The solution applies to WebSphere Commerce 7.
The WebSphere Commerce job scheduler is a background service that executes commands at scheduled intervals. A common configuration problem is that executions of the same command sometimes overlap. This usually results in problems such as deadlocks/timeouts, optimistic concurrency exceptions, high CPU usage, and out-of-memory errors on the server.
Scheduled jobs can be configured using the Commerce Administration Console. Each job runs as a separate thread, and multiple jobs can be scheduled to run simultaneously. Examples of jobs are below:
Messages queued with the sendTransacted() sending service are sent by the SendTransactedMsgCmd command. If an exception is encountered while a message is being sent, its MSGSTORE.RETRIES value is decremented by 1 and the next message found is attempted. The failed message is retried as long as its RETRIES value is greater than 0.
Any file uploaded to WebSphere Commerce using the catalog and marketing tools, or the fileloader utility, becomes a managed file. Managed files are initially written to the WebSphere Commerce database. Once certain criteria are met, managed files are copied (promoted) from the WebSphere Commerce database to the WebSphere Commerce EAR file by the ScheduledContentManagedFileEARUpdateCmd command.
When a new job is configured in the Commerce Administration Console, an Application type can be selected:
Picture 1: Scheduling a job - Application type selection
A fixed number of threads is assigned to each Application type. Each Application type maps to a Work Manager definition in the WebSphere Application Server Integrated Solutions Console. Usually, jobs are configured to use a common out-of-the-box Application type, such as Default, which is configured with a maximum number of threads greater than one. In that case, nothing prevents the scheduler from starting a new execution of the same command while the previously scheduled execution is still running.
How to detect overlapping scheduler job executions
Below are examples of jobs that sometimes run longer than predicted:
The SendTransactedMsgCmd job can run longer than expected because of a larger than usual email count, email server connection problems, or general server overload. This becomes a problem if the job is scheduled to execute at short intervals.
The ScheduledContentManagedFileEARUpdateCmd job can run longer than expected when many managed files (CMFILE table) have to be promoted from the database to the file system. This becomes a problem when the job schedule does not allow enough time for the number of unpromoted files.
To detect overlapping job executions, check in the database whether the intervals between start times of a specific job were shorter than its execution durations. An example for one job, ScheduledContentManagedFileEARUpdate, follows:
SELECT * FROM SCHCONFIG WHERE SCCPATHINFO LIKE 'ScheduledContentManagedFileEARUpdateCmd%' WITH UR
Take note of the value in the SCCJOBREFNUM column, and use it in the query below:
SELECT SCSSTATE, COUNT(*) FROM SCHSTATUS WHERE SCSJOBNBR=<value> AND SCSSTATE IN ('R', 'C', 'CF') GROUP BY SCSSTATE WITH UR
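The above query returns the number of executions of the job in each state: 'R' (job is running), 'C' (job completed successfully), and 'CF' (job completed with an exception during the execution). If the count for state 'R' is greater than one, the problem is confirmed. If not, the next step is to check whether the scheduled intervals were shorter than the execution durations. Note that DB2 returns the difference between two timestamps as a timestamp duration in the form yyyymmddhhmmss.ffffff with leading zeros omitted, not as a number of seconds, so a DURATION of 2311.616398 means 23 minutes and 11.6 seconds: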
SELECT SCSSTATE, SCSPREFSTART, (SCSEND-SCSACTLSTART) AS DURATION FROM SCHSTATUS WHERE SCSJOBNBR=<value> ORDER BY SCSPREFSTART DESC WITH UR
SCSSTATE SCSPREFSTART               DURATION
-------- -------------------------- ----------------------
CF       2013-09-09-15.30.00.000000            2311.616398
CF       2013-09-09-15.00.00.000000            5317.250398
CF       2013-09-09-14.30.00.000000           12317.733398
CF       2013-09-09-14.00.00.000000           15317.261398
CF       2013-09-09-13.30.00.000000           22317.396398
CF       2013-09-09-13.00.00.000000           25317.504398
C        2013-09-09-12.30.00.000000           30915.618000
CF       2013-09-09-12.00.00.000000           35317.683398
C        2013-09-09-11.30.00.000000           13911.514000
CF       2013-09-09-11.00.00.000000           45317.669398
C        2013-09-08-22.00.00.000000           20034.241000
We can see from the above example that some executions of the job last much longer than the scheduled interval, and that the execution duration has a very wide range, which makes it difficult to predict.
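The comparison of durations against scheduled start times can also be scripted. The following Python sketch is only an illustration: the function names are hypothetical, the rows are copied from the sample output, and it makes two assumptions. First, it treats DURATION as a DB2 timestamp duration (yyyymmddhhmmss.ffffff, leading zeros omitted) and converts it before comparing. Second, it uses SCSPREFSTART as a stand-in for the actual start time, because SCSACTLSTART is not part of the sample output.

```python
from datetime import datetime, timedelta

# (SCSSTATE, SCSPREFSTART, DURATION) rows copied from the sample output.
ROWS = [
    ("CF", "2013-09-09-15.30.00.000000", "2311.616398"),
    ("CF", "2013-09-09-15.00.00.000000", "5317.250398"),
    ("CF", "2013-09-09-14.30.00.000000", "12317.733398"),
    ("CF", "2013-09-09-14.00.00.000000", "15317.261398"),
    ("CF", "2013-09-09-13.30.00.000000", "22317.396398"),
    ("CF", "2013-09-09-13.00.00.000000", "25317.504398"),
    ("C", "2013-09-09-12.30.00.000000", "30915.618000"),
    ("CF", "2013-09-09-12.00.00.000000", "35317.683398"),
    ("C", "2013-09-09-11.30.00.000000", "13911.514000"),
    ("CF", "2013-09-09-11.00.00.000000", "45317.669398"),
    ("C", "2013-09-08-22.00.00.000000", "20034.241000"),
]


def parse_db2_timestamp(text):
    """Parse a DB2 timestamp such as 2013-09-09-15.30.00.000000."""
    return datetime.strptime(text, "%Y-%m-%d-%H.%M.%S.%f")


def parse_db2_duration(text):
    """Convert a DB2 timestamp duration (yyyymmddhhmmss.ffffff) to a timedelta."""
    whole, _, fraction = text.partition(".")
    digits = whole.zfill(14)  # restore the leading zeros DB2 omits
    if int(digits[:6]) != 0:
        # Years and months have no fixed length, so this sketch rejects them.
        raise ValueError("durations with year or month parts are not supported")
    return timedelta(
        days=int(digits[6:8]),
        hours=int(digits[8:10]),
        minutes=int(digits[10:12]),
        seconds=int(digits[12:14]),
        microseconds=int(fraction.ljust(6, "0")[:6]),
    )


def find_overruns(rows):
    """Return executions that were still running at the next scheduled start."""
    runs = sorted(
        (parse_db2_timestamp(start), parse_db2_duration(duration))
        for _state, start, duration in rows
    )
    overruns = []
    for (start, duration), (next_start, _next_duration) in zip(runs, runs[1:]):
        # Assumption: the execution began at its preferred start time.
        if start + duration > next_start:
            overruns.append((start, duration, next_start))
    return overruns


if __name__ == "__main__":
    for start, duration, next_start in find_overruns(ROWS):
        print(f"{start} ran for {duration}, past the next start at {next_start}")
```

Under these assumptions, nine of the eleven sample executions were still running when the next execution was due to start.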
Overlapping scheduler jobs can also be detected without validating the job's configuration in the database; an administrator may spot error messages such as the following in the SystemOut.log file:
[8/20/09 16:06:23:154 PDT] 000000bc EJSJDBCFinder E CNTR0040E: Finder failure as a result of exception COM.ibm.db2.jdbc.DB2Exception: [IBM][CLI Driver][DB2/AIX64] SQL0911N The current transaction has been rolled back because of a deadlock or timeout. Reason code "68". SQLSTATE=40001
The above error message may be an indication of an overlapping scheduled job because it contains the deadlock/timeout error (an attempt to lock a resource already locked by another instance of the same command) and the scheduler namespace. Troubleshooting of Commerce locking problems should not be limited to the scheduler, however; read the following article to learn more: Troubleshooting locking problems in WebSphere Commerce with DB2
If a high CPU usage or an out-of-memory problem is being investigated using javacores, check whether any scheduler threads were active at the time. If such threads exist, check whether they are executing the same command (e.g. ScheduledContentManagedFileEARUpdateCmdImpl). If they are, validate the scheduled job's configuration by comparing the execution duration with the scheduled interval.
Configuring scheduler jobs to prevent overlapping executions
To prevent overlapping executions of the same type of command, a simple solution is to configure the job (or a group of jobs that must execute serially) either to use a longer interval or to execute on a dedicated thread. For the latter, create a new Work manager at the desired scope (e.g. cell, cluster), set its maximum number of threads to one, and then (ripple) restart the WebSphere instances in that scope. Then, when configuring a scheduled job, select the new Work manager's name from the 'Application type' drop-down menu. When defining a new Work manager, keep the name short: the JNDI name must not be longer than wm/commerce/scheduler/12345678 (e.g. wm/commerce/scheduler/ibmstmsg, where 'ibmstmsg' is the part you can customize).
Picture 2: Creating a new Work manager – selecting scope
Picture 3: Creating a new Work manager – JNDI name must not be longer than wm/commerce/scheduler/12345678
Picture 4: Creating a new Work manager – select Service names as depicted, and limit the Maximum number of threads to 1
Picture 5: Configuring a job to execute using the new Work manager definition
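Why a single-thread Work manager prevents overlaps can be illustrated with a small analogy. The Python sketch below is not WebSphere code and does not use any WebSphere API; the function run_jobs and its parameters are hypothetical. It submits several executions of the same job to a thread pool and records how many ran at the same time. A pool limited to one thread plays the role of a Work manager whose maximum number of threads is 1.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_jobs(max_threads, executions=4, job_seconds=0.05):
    """Run the same job several times and return the peak number of
    executions that were in progress simultaneously."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def job():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(job_seconds)  # stands in for the real work of the command
        with lock:
            running -= 1

    # Leaving the "with" block waits for all submitted executions to finish.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for _ in range(executions):
            pool.submit(job)
    return peak


if __name__ == "__main__":
    print("peak with 1 thread: ", run_jobs(max_threads=1))
    print("peak with 4 threads:", run_jobs(max_threads=4))
```

With one thread the executions queue up and run one after another, so the peak is always 1. With more threads the peak can rise above 1, which corresponds to the overlapping executions described in this article.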
EDIT 19th February 2015: For the work manager approach to work in a cluster, limit the job to one cluster member by editing the Allowed host field. An alternative to using a work manager, if you have Commerce 7 fix pack 8, is to add the job parameter maxThreads=1. This parameter can be applied to any job type and works across the cluster.
Creating work managers:
Troubleshooting locking problems in WebSphere Commerce with DB2:
Thanks to Andres Voldman for valuable comments.