IBM Support

PI14673: ISSUES WITH SUBJOB STATE FOR PARALLEL JOBS (USING PJM FUNCTION).

APAR status

  • Closed as program error.

Error description

  • Issues with subjob state for parallel jobs (using PJM function).
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  Users of WebSphere Extended Deployment      *
    *                  Compute Grid 8.0 and the batch function of  *
    *                  WebSphere Application Server who use the    *
    *                  parallel job manager function.              *
    ****************************************************************
    * PROBLEM DESCRIPTION: Restart and stop/cancel of a            *
    *                      top-level job do not propagate          *
    *                      correctly to the subjobs, especially    *
    *                      for not-yet-submitted subjobs.          *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    With the parallel job manager (PJM) function, a failure during
    subjob submission should result in a halt to further subjob
    submission and a failure of the top-level job.  On restart of
    the top-level job, any subjobs that had been dispatched and
    begun executing should be resubmitted, any subjobs that had
    not yet been submitted should be submitted for the first time,
    and no action should be taken for any subjobs that had
    completed.
    Instead, on restart of a top-level job, the PJM wrongly fails
    to submit subjobs that had never been submitted (or whose
    submission failed) during the original top-level job
    execution.  The top-level job may then complete in 'ended'
    state on the restart even though some subjobs were never
    executed.  A related problem is that a subjob is sometimes
    resubmitted on top-level job restart but then fails with a
    NullPointerException on resubmission.
    Another related, minor problem is that during restart the
    status of the top-level job may appear in the Job Management
    Console as 'restartable' after it has already appeared as
    'executing', even though it is in fact still executing along
    with the remaining subjobs.
    The PJM function has a similar problem when a top-level job is
    stopped or cancelled by the user before all subjobs have been
    submitted, dispatched, and begun executing on an endpoint.
    The stop or cancel does not propagate correctly to the
    subjobs, which continue to be submitted, dispatched, and
    executed even after the top-level job has been stopped or
    cancelled.
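
The expected restart behavior described above can be sketched as a
per-subjob decision. This is only an illustration, not the PJM
implementation: the state names, the `restart_action` and
`plan_restart` functions, and the choice to treat a failed submission
like a subjob that was never submitted are all assumptions made for
the example.

```python
from enum import Enum
from typing import Dict


class SubjobState(Enum):
    """Hypothetical subjob states; not the product's internal names."""
    NOT_SUBMITTED = "not_submitted"          # never reached submission
    SUBMISSION_FAILED = "submission_failed"  # submission attempted but failed
    EXECUTING = "executing"                  # dispatched and began executing
    COMPLETED = "completed"                  # ran to completion


class RestartAction(Enum):
    """Hypothetical actions taken for a subjob on top-level job restart."""
    SUBMIT = "submit"      # submit for the first time
    RESUBMIT = "resubmit"  # submit again after an interrupted execution
    NONE = "none"          # leave the subjob alone


def restart_action(state: SubjobState) -> RestartAction:
    """Map a subjob's recorded state to the action expected on restart."""
    if state in (SubjobState.NOT_SUBMITTED, SubjobState.SUBMISSION_FAILED):
        # Assumption: a failed submission is handled like "never submitted".
        return RestartAction.SUBMIT
    if state is SubjobState.EXECUTING:
        return RestartAction.RESUBMIT
    return RestartAction.NONE  # completed subjobs need no action


def plan_restart(subjobs: Dict[str, SubjobState]) -> Dict[str, RestartAction]:
    """Build the restart plan for every subjob of a top-level job."""
    return {name: restart_action(state) for name, state in subjobs.items()}
```

In terms of this sketch, the defect corresponds to the first branch
being skipped, so that never-submitted subjobs received no action and
the top-level job could still reach 'ended' state.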
    

Problem conclusion

  • The PJM function was fixed to correctly handle failures that
    occur before all subjobs have been submitted and dispatched,
    so that the top-level job can be restarted correctly.
    
    The stop/cancel operation was also fixed to propagate
    correctly to subjobs that have not yet been submitted and to
    prevent their submission, so that the top-level job can be
    restarted cleanly.
    
    Finally, if the original top-level job fails so early in its
    initial execution that the complete subjob information needed
    for restart has not yet been persisted, the top-level job is
    placed into EXECUTION_FAILED status to make clear that it
    cannot be restarted from that point.
    
    APAR PI14673 is currently targeted for inclusion in Service
    Level (Fix Pack) 8.0.0.4 of WebSphere Compute Grid 8.0.
    
    Please refer to the Recommended Updates page for delivery
    information:
    http://www.ibm.com/support/docview.wss?uid=swg27022998
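
The corrected submission behavior described in this conclusion can be
sketched as a loop that checks for a stop or cancel request before
each subjob submission and halts on the first submission failure.
This is a rough illustration under stated assumptions, not the
product code: `run_submission`, its parameters, and the status
strings other than `execution_failed` (modeled on EXECUTION_FAILED)
are hypothetical labels chosen for the example.

```python
import threading
from typing import Callable, List, Tuple


def run_submission(
    subjobs: List[str],
    submit: Callable[[str], None],
    stop_requested: threading.Event,
    subjob_info_persisted: bool,
) -> Tuple[str, List[str]]:
    """Submit subjobs in order; return (top-level status, submitted names).

    All names and status labels here are illustrative assumptions.
    """
    submitted: List[str] = []
    for name in subjobs:
        # A stop/cancel must prevent submission of the remaining subjobs.
        if stop_requested.is_set():
            return "stopped", submitted
        try:
            submit(name)
        except Exception:
            # A submission failure halts further submission and fails the
            # top-level job. Without persisted subjob information the job
            # cannot be restarted, so it is marked as failed outright.
            if subjob_info_persisted:
                return "restartable", submitted
            return "execution_failed", submitted
        submitted.append(name)
    return "all_submitted", submitted
```

With this shape, a stop request raised while the first subjob is
being submitted leaves the later subjobs unsubmitted, which is what
allows a clean restart afterwards.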
    

Temporary fix

  • An interim fix is available upon request.
    

Comments

APAR Information

  • APAR number

    PI14673

  • Reported component name

    WXD COMPUTE GRI

  • Reported component ID

    5725C9301

  • Reported release

    800

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2014-03-28

  • Closed date

    2014-06-06

  • Last modified date

    2014-06-06

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WXD COMPUTE GRI

  • Fixed component ID

    5725C9301

Applicable component levels

  • R800 PSY

       UP

Document Information

Modified date:
28 April 2022