ZWSTECHNOTE : Job Tracking : Preventing the loss of TRACKING DATA ( event records )

Troubleshooting

Problem

An operation in z Workload Scheduler (zWS) did not track properly, and it appears that tracking data was lost or ignored.

This is often reported as: the JES joblog shows that the job completed, but zWS shows the job in STARTED status.

Symptom

The status of a job in zWS does not match what the JES joblog shows.
(Example: the JES2 joblog shows that the job completed, but the zWS dialog shows the same job in status S (started).)

Cause

One of the five tracking data records (event records) that are used to mark a job complete was lost.

Diagnosing The Problem

In general terms, the following applies to the loss of tracking data in zWS.

How z Workload Scheduler Tracking works...

zWS marks a job complete when the controller has received the event records IJ1, A1, A2, A3J, and A3P. The records can arrive from ANY of the trackers.

The IJ1 record is cut from whatever LPAR the job was submitted on.
The A2 and A3J records are cut specifically from the LPAR where the job actually executed.
The A1 and A3P records are cut by WHICHEVER LPAR happens to hold the JES2 checkpoint file at the time.
Review TABLE 33 of the Planning and Installation Guide for additional details on these records and their usage.
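
As a rough illustration of the rule above, the following Python sketch compares the event records logged for one job against the five required ones and reports which are missing, along with where each would have been cut. The function name and the input format (a plain list of record names, for example transcribed by hand from a tracklog or audit report) are assumptions made for illustration; this is not a zWS interface.

```python
# Where each of the five event records is cut, as described above.
RECORD_ORIGIN = {
    "IJ1": "LPAR where the job was submitted",
    "A1": "LPAR holding the JES2 checkpoint at the time",
    "A2": "LPAR where the job executed",
    "A3J": "LPAR where the job executed",
    "A3P": "LPAR holding the JES2 checkpoint at the time",
}


def missing_records(received):
    """Return {record: origin} for each required record not in `received`.

    `received` is a hypothetical input: any iterable of record names.
    """
    seen = {name.strip().upper() for name in received}
    return {rec: origin for rec, origin in RECORD_ORIGIN.items() if rec not in seen}


if __name__ == "__main__":
    # Example: everything arrived except the A3P record.
    for rec, origin in missing_records(["IJ1", "A1", "A2", "A3J"]).items():
        print(f"{rec} missing; expected from the {origin}")
```

A job with an empty result has all five records; anything else names the record to chase and the kind of LPAR to look at.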

Points to consider, to help AVOID the LOSS of tracking data:

- LPAR IPL:
As documented in PQ38839, the ONLY way to prevent the loss of the A1 and A3P records when an LPAR is going to be IPLed
is to place the LPAR into INDEPENDENT MODE before shutting down the tracker ahead of the IPL.

  As documented (this is in a section called "How to make sure that events are not lost" in Managing the Workload):

  > 1. Remove the system being stopped from the JES2 MAS by placing it into INDEPENDENT mode via the command "$T MEMB,IND=Y".
  > 2. Allow jobs currently running on this system to complete.
  > 3. Stop the tracker (P OPCx).
  > 4. Stop JES.
  > 5. Re-IPL.
  > 6. Restart JES.
  > 7. Restart the tracker.
  > 8. Resume normal work (issue $T MEMB,IND=N).

If you do NOT do this for EACH of your LPARs (assuming you IPL them one at a time), you can lose tracking records, and jobs will not be marked complete.
If you have not seen this problem before, you have been lucky: it could have happened at any IPL.

- Sometimes there is ANOTHER LPAR, such as a test LPAR, that only occasionally joins the MAS and DOES NOT have a tracker.

Such an LPAR can create tracking records, but because no TRACKER is running on it, the controller never receives them.

Use the $DMEMBER command to check whether your MAS contains any such LPARs.

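The check described above amounts to a set difference between the MAS members and the systems that run a tracker. A minimal Python sketch follows, assuming you have already written both lists down by hand (for example from the $DMEMBER output and from your own tracker inventory). The function name and the system names SYSA, SYSB, and TEST1 are hypothetical.

```python
def members_without_tracker(mas_members, tracker_systems):
    """Return the MAS member names that have no tracker, sorted.

    Both arguments are hand-collected lists of system names; this
    sketch does not parse any command output.
    """
    trackers = {name.strip().upper() for name in tracker_systems}
    members = {name.strip().upper() for name in mas_members}
    return sorted(members - trackers)


if __name__ == "__main__":
    # Hypothetical MAS with one occasional test member and no tracker on it.
    print(members_without_tracker(["SYSA", "SYSB", "TEST1"], ["SYSA", "SYSB"]))
```

Any name this returns is a member that can cut event records which never reach the controller.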

Resolving The Problem

If, after reviewing this information, you STILL have concerns or cannot explain the loss, here is what we will need:

(1) The controller and tracker MLOGs that cover the timeframe from just before to just after you noticed the tracking data was missing.

(2) The EQQTROUT (tracklog) for the time in question (from before any of the affected jobs were submitted until at least 30 minutes after your controller was back up following an IPL)
AND/OR an EQQAUDIT report covering the same time period, if you ran one previously. Providing both is even better.

(3) The SMF type 26 records (JES2 PURGE) for the jobs in question (be prepared to check and collect them for each TRACKER in the JES MAS).

(4) SYSLOGs for the time the job did not track (from a short time before through a short time afterwards).

(5) The EQQEVDS (event data sets) from all the trackers that are connected to the controller.
These must be sent tersed. If the EQQEVDS is allocated as DSORG=PSU, you must make a NON-PSU copy
of the data set before you can terse it, with JCL similar to the following:

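//* Copy the PSU event data set to a new non-PSU (DSORG=PS) data set.
//ALLOCS2 EXEC PGM=IEBGENER
//SYSPRINT DD SYSOUT=*
//* Input: the existing event data set
//SYSUT1   DD DISP=SHR,DSN=TWSZ.V9R3M0.EV
//* Output: new copy modeled on the input, but with DSORG=PS
//SYSUT2   DD DISP=(NEW,CATLG),DSN=TWSZ.V9R3M0.EV.NONPSU,
//         UNIT=SYSDA,LIKE=TWSZ.V9R3M0.EV,DSORG=PS
//SYSIN    DD DUMMY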

Then terse the "NONPSU" file.

The tracklog/audit report shows which tracking records the controller received for each of the jobs. The ACTUALS section of the SMF type 26 record contains fields that indicate on which LPAR (SMF ID) the job went through INPUT, CONVERSION, EXECUTION, and OUTPUT processing.
This is useful because if, for instance, the audit shows that an A3P record is missing, the SMF26OID field shows where the job went through OUTPUT processing, which is where the A3P record should have been generated.

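The cross-check between the audit report and the SMF type 26 record can be sketched as a simple lookup. The Python below assumes you have copied the per-phase system IDs out of the ACTUALS section into a dictionary keyed by phase name. The dictionary layout, the function name, and the system names are hypothetical. Only the pairings stated in this note are encoded (A2 and A3J with EXECUTION, A3P with OUTPUT); IJ1 and A1 are deliberately left out.

```python
# Processing phase whose system should have cut each record.
# Only pairings stated in this note are included.
RECORD_PHASE = {"A2": "EXECUTION", "A3J": "EXECUTION", "A3P": "OUTPUT"}


def system_to_check(missing_record, phase_systems):
    """Return the SMF ID of the system expected to have cut `missing_record`.

    `phase_systems` maps phase names (INPUT, CONVERSION, EXECUTION,
    OUTPUT) to SMF IDs copied by hand from the SMF type 26 record.
    Returns None when this note does not state a pairing for the record.
    """
    phase = RECORD_PHASE.get(missing_record.strip().upper())
    if phase is None:
        return None
    return phase_systems.get(phase)


if __name__ == "__main__":
    # Hypothetical job that ran on SYSB and went through OUTPUT on SYSC.
    actuals = {"INPUT": "SYSA", "CONVERSION": "SYSA",
               "EXECUTION": "SYSB", "OUTPUT": "SYSC"}
    print(system_to_check("A3P", actuals))
```

The returned system is the one whose tracker, event data set, and exits deserve the first look.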
Also, check the status of the exits with the following commands:

$DEXIT51
$DEXIT7
$DLOADMOD(TWSXIT51)
$DLOADMOD(OPCAXIT7)
D PROG,EXIT
D PROG,EXIT,EX=SYS.IEFACTRT,DIAG
D PROG,EXIT,EX=SYS.IEFUJI,DIAG
D PROG,EXIT,EX=SYS.IEFU83,DIAG
D PROG,EXIT,EX=SYSJES2.IEFACTRT,DIAG
D PROG,EXIT,EX=SYSJES2.IEFUJI,DIAG
D PROG,EXIT,EX=SYSJES2.IEFU83,DIAG

If the LOAD MODULE exits found by these commands are the DEFAULT ones (IEFACTRT, IEFUJI, IEFU83),
use TSO ISRDDN to see whether each exit module is the expected one or the DEFAULT one.
The DEFAULT ones (in SYS1.LPALIB) are small modules (about 20 bytes) that show a compile date from 1989 or 1990.
Using these default load modules would cause loss of tracking data.

To avoid any issues with these LPA load modules, use a different load module name (for example, ZWSACTRT instead of IEFACTRT).

Document Location

Worldwide

