Technical Blog Post
WebSphere Java Batch common issues and how to troubleshoot them
Occasionally, you might encounter unexpected behavior in the IBM WebSphere Batch component. The following are some of the most common problems and their resolutions, in addition to those documented in technotes and in the product documentation. This is not intended to be a comprehensive list of every issue that you might encounter while working with the WebSphere Batch component.
1. Lifecycle Consistency
1.1. Job stuck in submitted / restartable / executing state
a. Job stuck in submitted state: When a job is stuck in the submitted state, the common reasons are that the endpoint is not active, the application is not started, or the PGCController is not initialized. Check the part.0.log file of the job in question for the details. The SystemOut.log will also have entries with an explanation.
b. Job stuck in restartable state: For jobs stuck in the restartable state, the cause is usually a runtime issue. You can find the reason in the part.1.log file for that particular job ID.
c. Job stuck in executing state: A job usually remains in the executing state when there is a communication issue such as a missing heartbeat, when the database cannot be updated, or when the connection between the scheduler and the endpoint is lost.
1.2. Job Capacity Leak detection & recovery (PI07496):
Over a period of time and under certain circumstances (for example, a server going down unexpectedly or cleanup not being performed), job capacity can be lost. When this occurs, jobs remain in the submitted state and the part.0 job log reports no capacity, even though no jobs are running (or not enough jobs are running to account for the missing capacity).
This happens because the Job Scheduler may not keep track of ended jobs and update its internal capacity counters, which can cause job dispatches to fail due to full capacity. The fix PI07496 addresses this issue in all releases of WebSphere V8.5 starting with V126.96.36.199, but to enable the feature you need to add Job Scheduler custom properties.
Three scheduler custom properties control this feature: one sets the frequency with which detection occurs, one enables detection only, and one enables detection and recovery.
- The frequency with which this detection occurs is configurable with the scheduler custom property MaxJobCapacityLeakDetectionFrequencyMinutes.
- To enable just the job class capacity leak detection, set the scheduler custom property EnableJobCapacityLeakDetection.
- To enable job class capacity leak detection and recovery, set the scheduler custom property EnableJobCapacityLeakDetectionAndRecovery.
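For example, the detection-and-recovery variant might be configured with Job Scheduler custom properties like the following (the values are illustrative; the 30-minute frequency is an assumption, not a product default):

```
EnableJobCapacityLeakDetectionAndRecovery = true
MaxJobCapacityLeakDetectionFrequencyMinutes = 30
```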
1.3. Database sequence generator to create job number
Prior to WebSphere V188.8.131.52, some customers experienced issues with job numbers rolling over and being reclaimed. Using a database sequence generator provides a streamlined process for the Job Scheduler to obtain a job number and reduces complexity at runtime. Enable this feature in two steps:
a. Update the DB tables using the updateSchedulerDBtoUseSequenceDB2.ddl script.
b. Set the job scheduler custom property “UseSequenceForJobNumber” to "true".
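As a sketch, assuming a DB2 backend and the DB2 command-line processor (the script location varies by release), step a might look like:

```
db2 -tf updateSchedulerDBtoUseSequenceDB2.ddl
```

For step b, add the custom property UseSequenceForJobNumber with the value true on the Job Scheduler; a restart of the scheduler is typically needed for custom property changes to take effect.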
1.4. New Endpoint joblog location and behavior
In the previous fixpacks, the default location of joblogs generated on the endpoint was <was_home>/joblogs/<server_name>/<section>/<jobId>. If a job was restarted on the same endpoint server, the joblogs of the restarted execution were placed in the same location as the previous execution.
Starting in WebSphere V184.108.40.206, the default behavior is that all endpoint joblogs are written under a path similar to the scheduler joblogs, where each execution has its joblogs in a separate directory.
The new default path is <was_home>/joblogs/<server_name>/<section>/<jobid>/<timestamp>
For example: /WebSphere/ND/AppServer3/profiles/default/joblogs/endpoint_ndnode3/section1/XDCGIVT_000009/20140914_152254
2. Resiliency & Recovery
When you do not define the database, or when you reference a wrong DB name, you will get exceptions indicating that the Scheduler is not initialized.
The Caused by section of the stack trace in the SystemOut log will have information about these errors, such as UnableToInitializeException or StaleConnectionException. Based on the messages in the logs, make the appropriate changes to resolve these issues.
Eg: For StaleConnectionException: Database <db name> not found. DSRA0010E: SQL State = XJ004, Error Code = 40,000
If an endpoint goes down while processing jobs, the scheduler will update its status to Unknown after 5 minutes. When the endpoint comes back up, it will mark all executing jobs as restartable or execution failed, and the scheduler will resynchronize its status with that of the endpoint.
3.1 How to suppress log lines to the server logs, job logs, or both
You can suppress the writing of log lines to the server logs, job logs, or both by using the job log filter SPI and implementing the JobLogFilter interface.
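As a rough sketch only: the real SPI interface, package, and method signature are defined by the product (see the documentation topic referenced below), so the interface here is a hypothetical stand-in to illustrate the filtering idea, not the actual com.ibm.wsspi signature.

```java
// Hypothetical stand-in for the job log filter SPI; the real interface,
// package, and method signature are defined by WebSphere - consult the
// "Job scheduler System Programming Interfaces (SPI)" topic.
interface JobLogFilter {
    // Possible routing decisions for a single log line.
    enum Action { WRITE_BOTH, JOB_LOG_ONLY, SERVER_LOG_ONLY, SUPPRESS }

    // Decide where (if anywhere) a log line should be written.
    Action filter(String loggerName, String logLine);
}

// Example filter that suppresses noisy DEBUG lines everywhere
// and lets everything else flow to both log destinations.
class SuppressDebugFilter implements JobLogFilter {
    @Override
    public Action filter(String loggerName, String logLine) {
        return logLine.contains("DEBUG") ? Action.SUPPRESS : Action.WRITE_BOTH;
    }
}
```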
Please review the product documentation topic Job scheduler System Programming Interfaces (SPI) for spi.job.log.filter.
3.2 How to set the TransactionTimeout value in the xJCL:
The transaction time out value in the xJCL is expressed in seconds. WebSphere allows you to configure both a maximum and default transaction timeout. The transaction timeout specified in the xJCL will override the default transaction timeout, but cannot exceed the configured WebSphere maximum transaction timeout.
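For illustration, a 60-second transaction timeout might be set on the time-based checkpoint algorithm in xJCL as follows. The element structure and the TransactionTimeOut prop name follow the pattern in the product documentation, but verify the exact names and values against your release:

```xml
<checkpoint-algorithm name="timebased">
  <classname>com.ibm.wsspi.batch.checkpointalgorithms.timebased</classname>
  <props>
    <prop name="interval" value="15" />
    <prop name="TransactionTimeOut" value="60" />
  </props>
</checkpoint-algorithm>
```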
3.3 How to configure transaction modes in xJCL:
Use the transaction mode to define whether job-related artifacts are called in global transaction mode or local transaction mode. This can be specified in the xJCL with the com.ibm.websphere.batch.transaction.policy property.
a. When global is specified, all job-related artifacts (callbacks, BDSes, checkpoint algorithm, etc) are called in global transaction mode.
b. When local is specified, all job-related artifacts (callbacks, BDSes, checkpoint algorithm, etc) are called in local transaction mode.
c. When compat is specified, BatchJobStep is invoked in global transaction mode.
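For example, local transaction mode might be requested with a step-level prop in xJCL. The step name is illustrative, and the property may also apply at other levels; check the product documentation for your release:

```xml
<job-step name="Step1">
  <props>
    <prop name="com.ibm.websphere.batch.transaction.policy" value="local" />
  </props>
</job-step>
```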
4.1 Load is always going to the same endpoint.
If the load goes to the same endpoint, either the job submission rate is low, the jobs are small, or that endpoint is finishing jobs more quickly. This is expected behavior: the endpoint selection logic is based on an equalization-line calculation, not strict round robin.
The calculation is based on the server weight and the number of outstanding jobs on an endpoint.
Equalization line = server weight - outstanding job count of an endpoint.
The server weight is a static value that is set when a server is created. You can view the value in the admin console.
OutstandingJobs is the in-memory counter of how many jobs the scheduler thinks are running on that particular endpoint.
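The selection described above can be sketched as follows; the endpoint names, weights, and outstanding counts are illustrative, not values from the product:

```java
// Minimal sketch of the equalization-line endpoint selection described above.
import java.util.LinkedHashMap;
import java.util.Map;

public class EqualizationLineDemo {

    // Equalization line = server weight - outstanding job count
    static int equalizationLine(int serverWeight, int outstandingJobs) {
        return serverWeight - outstandingJobs;
    }

    // Pick the endpoint with the highest equalization line.
    // Map value: [0] = server weight, [1] = outstanding job count.
    static String chooseEndpoint(Map<String, int[]> endpoints) {
        String chosen = null;
        int best = Integer.MIN_VALUE;
        for (Map.Entry<String, int[]> e : endpoints.entrySet()) {
            int line = equalizationLine(e.getValue()[0], e.getValue()[1]);
            if (line > best) {
                best = line;
                chosen = e.getKey();
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, int[]> endpoints = new LinkedHashMap<>();
        endpoints.put("endpoint1", new int[] {20, 5});   // line = 20 - 5  = 15
        endpoints.put("endpoint2", new int[] {20, 12});  // line = 20 - 12 = 8
        System.out.println(chooseEndpoint(endpoints));   // prints endpoint1
    }
}
```

With equal weights, the endpoint with fewer outstanding jobs wins the next dispatch, which is why a fast endpoint that keeps draining its jobs can legitimately keep receiving the load.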
4.2 Memory-overload protection for the endpoint servers
If you define the GRID_MEMORY_OVERLOAD_PROTECTION WebSphere variable, the endpoint server delays the running of a job when insufficient Java heap memory is available to run it. The job is delayed until other currently running jobs complete and free up enough memory.
The endpoint server determines the amount of available memory by querying the Java virtual machine (JVM) and assessing the memory requirements of all active jobs currently running within the server.
You can specify the memory requirement for a job by defining the memory attribute of the job element in the xJCL. If you do not specify the memory attribute, then the value of the GRID_MEMORY_OVERLOAD_PROTECTION WebSphere variable is used as the default. If you define the GRID_MEMORY_OVERLOAD_PROTECTION WebSphere variable as ?, then the endpoint server estimates the average job memory requirement by assessing the current active job count and the amount of memory currently in use.
If you do not define the GRID_MEMORY_OVERLOAD_PROTECTION WebSphere variable, then memory-overload protection is disabled.
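As an illustrative xJCL fragment, the per-job memory requirement is declared with the memory attribute of the job element. The job name, class, and value here are made up; consult the product documentation for the expected units and any other required attributes:

```xml
<job name="SampleJob" class="default" memory="512">
  <!-- job steps omitted -->
</job>
```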
This is a scheduler custom variable that, when set to true, causes fetching of job logs for WSGrid jobs to use a more optimized path. The default value is false.
The Job Management Console (JMC) is not secured even though Global security is enabled.
Solution: Check the “Enable application security” check box.
If you map users to the All Authenticated special subject in the application’s realm, the JMC will not issue a security challenge.
When you use a Service Integration Bus(SIBus) and WebSphere Application Server security is enabled for the server or cell, by default the service integration bus queue destination inherits the security characteristics of the server or cell. So if the server or cell has basic authentication enabled, then the client request fails.
To resolve the problem:
• Disable Bus Security
• Change from "Restrict the use of defined transport channel chains to those protected by SSL" to "Allow the use of all defined transport channel chains"
• Change SSL-required to SSL-supported.
If you select SSL-required, then the server opens an SSL listener port only and all inbound requests are received using SSL.
If you select SSL-supported, then the server opens both a TCP/IP and an SSL listener port and most inbound requests are received using SSL.
The Job.LogFile.Mapping file was read sequentially. As the size of the file grows, finding the correct entry for the job log directory takes longer, causing a slowdown in processing time.
This behavior was changed in WebSphere V8.5 and the storage of job class and job log file mapping information is changed from files to database tables.
With this change, a runtime update is required that adds the new tables JOBCLASSREC and JOBLOGREC. Both the Job Scheduler and endpoint servers depend on the updated tables. DDLs and SPUFIs are available to assist with the table changes for the various databases that are supported with WebSphere Batch.
If you see an exception such as CWLRB3020E: [Long Running Job Scheduler <scheduler name>] failed: java.sql.SQLSyntaxErrorException: 'JOBCLASS' is not a column in table or VTI 'LRSSCHEMA.JOBSTATUS', you need the runtime update that adds the new tables JOBCLASSREC and JOBLOGREC in WebSphere V220.127.116.11, V18.104.22.168, V22.214.171.124, V126.96.36.199 & V188.8.131.52.
Update the tables by executing the appropriate DDLs. These files can be found in the WAS_HOME/util/Batch directory.
See this technote FAQ: Database table changes for WebSphere Batch in WebSphere V8.5.x
7. Parallel Job Manager(PJM)
If you are running parallel jobs and the JOBCLASSMAXCONCJOBS count reaches its maximum capacity, the jobs in that job class do not get dispatched.
This issue can be resolved by specifying different jobclass values for top level jobs and subjobs in xJCL.
For parallel jobs using ComputeGrid V8 xJCL with the “run element”, the top level job will run under the job class that is specified by the "class" attribute of the job element.
To specify a job class for the subjobs or parallel steps, add the "job-class" attribute to the run element of the xJCL.
If the "class" attribute of the job element or "job-class" attribute of the run element is not specified, the job will run under the default job class.
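Putting the two attributes together, an illustrative xJCL skeleton might look like this. The names are made up, and other attributes of the job and run elements are omitted for brevity:

```xml
<job name="ParallelJob" class="TopLevelJobClass">
  <!-- the subjobs/parallel steps run under SubJobClass -->
  <run job-class="SubJobClass" />
</job>
```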
8. Discovery & Heartbeat
Could not send heart beat to:
For the scheduler to detect that an endpoint is no longer available, it depends on heartbeat communication from the endpoint. The scheduler marks an endpoint inactive after heartbeats have been missed for a certain threshold. At that point, it puts all of that endpoint's jobs into the Unknown state and informs the JobSchedulerMDB, which issues a cancel for any job status notification of Unknown.
To resolve the issue, make sure the endpoint is up and running.
WebSphere Compute Grid WSGrid job submission script fails with JCLException if the specified xjcl contains elements that are split across multiple lines.
If the xjcl file specified on the WSGrid invocation command line contains elements that are split across multiple lines, the following exception is seen:
error:JobSchedulerMDB.SecureSubmitter: caught exception com.ibm.websphere.longrun.JCLException: Element type "prop" must be followed by either attribute specifications, ">" or "/>".
WSGrid reads the xJCL one line at a time, trimming white space at the beginning and end of each line and then concatenating the results. As a result, elements that are split across lines are smashed together and then fail parsing.
To avoid this occurrence, xJCL files should not contain elements split across multiple lines.
Example xJCL that avoids this problem (each <prop> element is on a single line):
<prop name="supportclassIn" value="com.ibm.websphere.batch.devframework.datastreams.patterns.TextFileReader" />
<prop name="supportclassOut" value="com.ibm.websphere.batch.devframework.datastreams.patterns.TextFileWriter" />
Please note that this issue also affects jobs submitted via Rational Application Developer (RAD) when invoked as follows with xJCL that contains line-split elements:
Run As -> Modern Batch Job