WebSphere Java Batch common issues and how to troubleshoot
Mahesh Reddy
Occasionally, you might encounter unexpected behavior in the IBM WebSphere Batch component. The following are some of the most common problems and their resolutions. They are in addition to those documented in technotes and in the product documentation. This is not intended to be a comprehensive list of all the issues that you might encounter while working with the WebSphere Batch component.
1. Lifecycle Consistency
1.1. Job stuck in submitted / restartable / executing state
a. Job stuck in submitted: When a job is stuck in the submitted state, the common reasons are that the endpoint is not active, the application is not started, or the PGCController is not initialized. Check the part.0.log file of the job in question for the details. The SystemOut.log also contains entries that explain the cause.
1.2. Job capacity leak detection & recovery (PI07496):
Over a period of time and under certain circumstances (server being down unexpectedly or cleanup not being performed) job capacity can be lost. When this occurs, jobs will remain in submitted state and the part.0 job log will mention no capacity, without any jobs running (or not enough jobs running to account for no capacity).
Three properties control this feature: one sets the frequency with which detection occurs, one enables job class capacity detection only, and one enables job class capacity detection and recovery.
1.3. Database sequence generator to create job number
Prior to WebSphere V184.108.40.206, some customers encountered issues with job numbers rolling over and being reclaimed. A database sequence generator provides a streamlined process for the Job Scheduler to obtain a job number and is aimed at reducing complexity at runtime. Enable this feature in two steps:
a. By updating the DB tables using upda
1.4. New Endpoint joblog location and behavior
In the previous fixpacks, the default location of joblogs generated on the endpoint was <was
Starting in WebSphere V220.127.116.11, the default behavior is that all endpoint joblogs are written to a path similar to that of the scheduler joblogs, with the joblogs of each execution in a separate directory.
The new default path is <was
2. Resiliency & Recovery
When you do not define the database, or when you reference the wrong database name, you get exceptions stating that the scheduler is not initialized.
The Caused by section of the stack in the SystemOut log will have the information about these error messages such as Unab
Eg: For Stal
If an endpoint goes down while processing jobs, the scheduler updates its status to Unknown after 5 minutes. When the endpoint comes back up, it marks all executing jobs as restartable or execution failed. The scheduler then resyncs its status with that of the endpoint.
3.1 How to suppress log lines to the server logs, job logs or both types of logs.
If you want to suppress the writing of log lines to the server logs, the job logs, or both, you can do so by using the job log filter SPI and implementing the JobLogFilter interface.
Please review the product documentation topic Job scheduler System Programming Interfaces (SPI) for spi.job.log.filter.
3.2 How to set TransactionTimeout value settings in the xJCL:
The transaction timeout value in the xJCL is expressed in seconds. WebSphere allows you to configure both a maximum and a default transaction timeout. The transaction timeout specified in the xJCL overrides the default transaction timeout, but cannot exceed the configured WebSphere maximum transaction timeout.
3.3 How to configure transaction modes in xJCL:
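The precedence described here can be sketched in a few lines of Python. This is only an illustration, not product code: the function and parameter names are hypothetical, and it assumes that a value above the maximum is capped at the maximum. The text only says the maximum cannot be exceeded; the product might reject such a value instead.

```python
def effective_timeout(xjcl_timeout, default_timeout, max_timeout):
    """Return the transaction timeout (in seconds) that would apply to a job.

    xjcl_timeout    -- value from the xJCL, or None if the xJCL sets none
    default_timeout -- configured WebSphere default transaction timeout
    max_timeout     -- configured WebSphere maximum transaction timeout

    Assumption: a request above the maximum is capped rather than rejected.
    """
    # The xJCL value, when present, overrides the configured default.
    requested = default_timeout if xjcl_timeout is None else xjcl_timeout
    # The result never exceeds the configured maximum.
    return min(requested, max_timeout)


print(effective_timeout(None, 120, 300))  # no xJCL value: default applies
print(effective_timeout(200, 120, 300))   # xJCL value overrides the default
print(effective_timeout(600, 120, 300))   # xJCL value is limited by the maximum
```

If a long-running step still times out after you raise the value in the xJCL, the configured maximum is the next setting to check.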
Use the transaction mode to define whether job-related artifacts are called in global transaction mode or local transaction mode. This can be specified in the xJCL by com.
a. When global is specified, all job-related artifacts (callbacks, BDSes, checkpoint algorithm, etc) are called in global transaction mode.
4.1 Load is always going to the same endpoint.
If the load always goes to the same endpoint, either the job submission rate is low, the jobs are small, or the endpoint is finishing the jobs quickly. This is expected behavior: endpoint selection is based on equalization line logic, not strict round robin.
The equalization line is calculated from the server weight and the number of outstanding jobs on an endpoint.
Equalization line = server weight - number of outstanding jobs on the endpoint.
Server weight is a static value that is set when a server is created. You can view the value on the admin console.
OutstandingJobs is the in-memory counter of how many jobs the scheduler thinks are running on that particular endpoint.
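A small worked example shows why one endpoint can keep receiving the work. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes the scheduler picks the endpoint with the highest equalization line and breaks ties by taking the first endpoint in the list. The text above does not spell out either detail.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    server_weight: int     # static value, set when the server is created
    outstanding_jobs: int  # scheduler's in-memory count of running jobs


def equalization_line(ep):
    # Equalization line = server weight - number of outstanding jobs
    return ep.server_weight - ep.outstanding_jobs


def pick_endpoint(endpoints):
    # Assumption: the highest equalization line wins; max() keeps the
    # first endpoint when several have the same value.
    return max(endpoints, key=equalization_line)


endpoints = [Endpoint("ep1", 2, 0), Endpoint("ep2", 2, 0)]
chosen = []
for _ in range(3):
    target = pick_endpoint(endpoints)
    chosen.append(target.name)
    target.outstanding_jobs += 1  # job is dispatched
    target.outstanding_jobs -= 1  # job finishes before the next submission

print(chosen)  # ['ep1', 'ep1', 'ep1']
```

Because each job completes before the next one is submitted, the outstanding count on ep1 is back to zero every time, so its equalization line never drops below that of ep2. Under a higher submission rate or with longer jobs, the counts would diverge and the load would spread.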
If you define GRID
The endpoint server determines the amount of available memory by querying the Java virtual machine (JVM) and assessing the memory requirements of all active jobs currently running within the server.
You can specify the memory requirement for a job by defining the memory attribute of the job element in the xJCL. If you do not specify the memory attribute, then the value of the GRID
If you do not define the GRID
This is a scheduler custom variable. When it is set to true, fetching job logs for WSGrid jobs uses a more optimized path. The default value is false.
The Job Management Console (JMC) is not secured even though Global security is enabled.
If you map users to All Authenticated in Application’s Realm under Special subjects, the JMC will not challenge for credentials.
When you use a Service Integration Bus (SIBus) and WebSphere Application Server security is enabled for the server or cell, the service integration bus queue destination by default inherits the security characteristics of the server or cell. If the server or cell has basic authentication enabled, the client request fails.
• Disable Bus Security
The Job.LogFile.Mapping file was read sequentially. As the file grew, finding the correct entry for the job log directory took longer, which slowed down processing.
This behavior was changed in WebSphere V8.5: the storage of job class and job log file mapping information was moved from files to database tables.
This change is a runtime update that requires two new tables, JOBCLASSREC and JOBLOGREC. Both the job scheduler and the endpoint servers depend on the updated tables. DDLs and SPUFIs are available to assist with the table changes for the various databases that are supported with WebSphere Batch.
7. Parallel Job Manager (PJM)
If you are running parallel jobs and the JOBCLASSMAXCONCJOBS count reaches its maximum capacity, the jobs in that job class do not get dispatched.
This issue can be resolved by specifying different jobclass values for top level jobs and subjobs in xJCL.
For parallel jobs using ComputeGrid V8 xJCL with the “run element”, the top level job will run under the job class that is specified by the "class" attribute of the job element.
To specify a job class for the subjobs or parallel steps, add the "job-class" attribute to the run element of the xJCL.
8. Discovery & Heartbeat
Could not send heart beat to:
To detect that an endpoint is no longer available, the scheduler depends on heartbeat communication from the endpoint. The scheduler marks an endpoint inactive after it has missed heartbeats for a certain threshold. At that point, it takes the necessary steps to put all jobs on that endpoint into the Unknown state and informs the JobSchedulerMDB. The JobSchedulerMDB issues a cancel for any job status notification of Unknown.
To resolve the issue, please make sure the endpoint is up and running.
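The detection logic described above can be pictured with the following Python sketch. It is not the scheduler's implementation. The function names, the data layout, and the 300-second threshold used in the usage lines are all assumptions made for illustration.

```python
def inactive_endpoints(last_heartbeat, now, threshold_seconds):
    """Return the names of endpoints whose last heartbeat is too old.

    last_heartbeat maps endpoint name -> time of the last heartbeat (seconds).
    """
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > threshold_seconds
    )


def mark_jobs_unknown(jobs, inactive):
    """Return job id -> state, with jobs on inactive endpoints set to Unknown.

    jobs maps job id -> (endpoint name, current state).
    """
    return {
        job_id: ("Unknown" if endpoint in inactive else state)
        for job_id, (endpoint, state) in jobs.items()
    }


# Hypothetical data: ep2 last reported 400 seconds ago.
heartbeats = {"ep1": 990, "ep2": 600}
jobs = {"JobA:1": ("ep1", "Executing"), "JobB:2": ("ep2", "Executing")}

down = inactive_endpoints(heartbeats, now=1000, threshold_seconds=300)
print(down)                           # ['ep2']
print(mark_jobs_unknown(jobs, down))  # JobB:2 becomes Unknown
```

The sketch also shows why a network interruption can look the same as a stopped endpoint: the scheduler only sees that heartbeats stopped arriving.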
The WebSphere Compute Grid WSGrid job submission script fails with a JCLException if the specified xJCL contains elements that are split across multiple lines.
If the xJCL file specified on the WSGrid invocation command line contains elements that are split across multiple lines, the following exception is seen:
WSGrid reads the xJCL one line at a time, trims the white space at the beginning and end of each line, and then concatenates the results. Elements that are split across lines are therefore joined without any separating white space, and the resulting text fails to parse.
To avoid this occurrence, xJCL files should not contain elements split across multiple lines.
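The effect of trimming and concatenating can be reproduced outside WebSphere. The Python sketch below is a stand-in, not the WSGrid code: it uses Python's own XML parser instead of the xJCL parser, and the property name and value are hypothetical.

```python
import xml.etree.ElementTree as ET


def concatenate_trimmed(text):
    # Mimic the described behavior: strip each line, then join with no separator.
    return "".join(line.strip() for line in text.splitlines())


def parses(text):
    try:
        ET.fromstring(concatenate_trimmed(text))
        return True
    except ET.ParseError:
        return False


# Hypothetical fragment with one element split across two lines.
split_xjcl = """<props>
  <prop name="inputFile"
        value="/tmp/in.txt"/>
</props>"""

# The same fragment with the element on a single line.
single_line_xjcl = """<props>
  <prop name="inputFile" value="/tmp/in.txt"/>
</props>"""

print(concatenate_trimmed(split_xjcl))
print(parses(split_xjcl))        # False
print(parses(single_line_xjcl))  # True
```

In the split version the two attributes end up adjacent with no white space between them, which is not well-formed XML. Keeping each element on one line preserves the space between attributes.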
Example xJCL that will avoid this occurrence: The <prop> elements are in one single line.
Please note that this issue also affects jobs submitted via Rational Application Developer (RAD) when invoked as follows with xJCL that contains line-split elements:
Run As -> Modern Batch Job