APAR status
Closed as new function.
Error description
Several improvements to WSGRID takeover
Local fix
Problem summary
****************************************************************
* USERS AFFECTED:  All users of IBM WebSphere Application      *
*                  Server for z/OS Java Batch                  *
****************************************************************
* PROBLEM DESCRIPTION: WSGRID Java Batch jobs on z/OS are not  *
*                      taken over if the owning scheduler      *
*                      becomes unavailable                     *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
When a Java Batch job is submitted on z/OS using the WSGRID native client, communication from the Job Scheduler back to the client is done with MQ messages. These messages are issued by an MDB that monitors the status of the job through that same Job Scheduler.

If that Job Scheduler becomes unavailable for some reason, the client detects it by periodically checking whether a specified lock name (ENQ name) is still held. If the lock is gone, the client knows the Job Scheduler is down and the WSGRID proxy job is ended.

By default on z/OS, when this happens the Java Batch job itself on the batch endpoint server is also stopped, which keeps the status in sync. If the WebSphere custom property com.ibm.websphere.batch.policy.EndJobWhenSchedulerEnds is set to false, the Java Batch job continues to run and another active Job Scheduler takes over receiving status updates from the job.

An external scheduler that submits WSGRID proxy jobs monitors the status of the proxy job, so it sees the job as ended abnormally. Although the job may complete successfully on the batch endpoint, other steps are required to determine the status of the job and surface it to any external scheduling tools. If the WSGRID job could remain up and continue to receive status messages, these additional steps to get the jobs back in sync could be avoided.
Problem conclusion
A code update has been made in both the WebSphere for z/OS WSGRID native client and the Java Batch runtime to allow the takeover of status messaging back to the WSGRID native client.

Because the original Job Scheduler is gone, and with it the MDB that was monitoring the job, messaging back to the client after a takeover is done with a new code path. Job log streaming back to the client is not done with this new code path, and a message indicating this is issued. The job logs still exist on the file system and can be retrieved by other means.

New properties have been added on the WSGRID native client side:

endJobWhenSchedulerEnds=<true/false> (default true)
    Controls whether the WSGRID proxy job should end if it is determined that the Job Scheduler is no longer active.

takeover-timeout=<timeout_in_milliseconds>
    Specifies how long the proxy job should wait to receive a message that another scheduler has taken over the job before timing out and ending abnormally, as it would have before.

useTakeoverReturnCodes=<true/false> (default false)
    Indicates whether the proxy job should use special return codes for completed, restartable, or failed jobs after a takeover. This allows differentiation between, for example, a job that completed and a job that was taken over and then completed.

On the Java Batch Job Scheduler side, a new Scheduler custom property has been added:

com.ibm.websphere.batch.allow.wsgrid.takeover=<true/false> (default false)
    Indicates that if a job submitted initially by WSGRID is taken over by another active Job Scheduler, that scheduler should take over sending messages back to the WSGRID client.

The WebSphere for z/OS documentation for WSGRID and custom properties will be updated with this new information.

The fix for this APAR is targeted for inclusion in fix pack 9.0.5.12 and 8.5.5.22. For more information, see 'Recommended Updates for WebSphere Application Server': https://www.ibm.com/support/pages/node/715553
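As an illustration only, the following Java sketch shows how the three client-side properties and their documented defaults could be read from a properties-style input. It is not the WSGRID client's code. The class name WsgridTakeoverSettings, the sample timeout of 60000 ms, the use of -1 for an unspecified timeout, and the waitsForTakeover() helper are all assumptions made for this example. In particular, treating endJobWhenSchedulerEnds=false as "wait for a takeover" is one reading of the property descriptions above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the new WSGRID client properties (illustration only).
public class WsgridTakeoverSettings {
    final boolean endJobWhenSchedulerEnds; // documented default: true
    final long takeoverTimeoutMillis;      // -1 = not specified (assumption; the APAR states no default)
    final boolean useTakeoverReturnCodes;  // documented default: false

    WsgridTakeoverSettings(Properties p) {
        endJobWhenSchedulerEnds =
            Boolean.parseBoolean(p.getProperty("endJobWhenSchedulerEnds", "true").trim());
        useTakeoverReturnCodes =
            Boolean.parseBoolean(p.getProperty("useTakeoverReturnCodes", "false").trim());
        String timeout = p.getProperty("takeover-timeout");
        takeoverTimeoutMillis = (timeout == null) ? -1L : Long.parseLong(timeout.trim());
    }

    // Parses key=value text in java.util.Properties format.
    static WsgridTakeoverSettings parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new WsgridTakeoverSettings(p);
    }

    // Assumed interpretation: the proxy job waits for a takeover message
    // only when it is told not to end with the scheduler.
    boolean waitsForTakeover() {
        return !endJobWhenSchedulerEnds;
    }

    public static void main(String[] args) {
        // Sample values are illustrative, not recommendations.
        String sample = "endJobWhenSchedulerEnds=false\n"
                      + "takeover-timeout=60000\n"
                      + "useTakeoverReturnCodes=true\n";
        WsgridTakeoverSettings s = parse(sample);
        System.out.println("waits for takeover: " + s.waitsForTakeover());
        System.out.println("timeout (ms): " + s.takeoverTimeoutMillis);
        System.out.println("takeover return codes: " + s.useTakeoverReturnCodes);
    }
}
```

For the proxy job to benefit from these settings, the Scheduler custom property com.ibm.websphere.batch.allow.wsgrid.takeover would also have to be set to true on the Job Scheduler side, as described above.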
Temporary fix
Comments
APAR Information
APAR number
PH36899
Reported component name
WEBSPHERE FOR Z
Reported component ID
5655I3500
Reported release
900
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-05-03
Closed date
2022-03-23
Last modified date
2022-03-23
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WEBSPHERE FOR Z
Fixed component ID
5655I3500
Applicable component levels
Document Information
Modified date:
24 March 2022