PM74855: HUNG ENDPOINT SERVER CAUSES WSGRID FUNCTION IN COMPUTE GRID JOB SCHEDULER SERVER TO STOP SENDING OUTPUT.

Fixes are available

8.0.0.3: WebSphere Extended Deployment Compute Grid V8.0 Fix Pack 3
8.0.0.4: WebSphere Extended Deployment Compute Grid V8.0 Fix Pack 4
8.0.0.5: WebSphere Extended Deployment Compute Grid V8.0 Fix Pack 5

APAR status

Closed as program error.

Error description

The customer was running Compute Grid 8.0 in their production
environment. The jobs in this environment are triggered via
WSGrid, and they observed that several jobs had active WSGrid
sessions that were not reflected as jobs in the JMC. Only one
job was in the executing state in the JMC, but its joblog
indicated that it should have been in a different state.
Once they cycled the endpoint appserver that this job had run
on, the normal flow of jobs through the environment via WSGrid
resumed.

Local fix

Problem summary

****************************************************************
* USERS AFFECTED:  All users of WebSphere Extended Deployment  *
*                  Compute Grid Version 8.                     *
****************************************************************
* PROBLEM DESCRIPTION: Job log streaming for jobs submitted    *
*                      via WSGrid (e.g. via an external        *
*                      scheduler) appears to stop due to a     *
*                      hang or a slowdown on an endpoint       *
*                      server executing already-submitted      *
*                      (via WSGrid) jobs.                      *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
The problem can happen when an endpoint server executing
jobs dispatched via the WSGrid interface experiences a
slowdown, e.g. because it is thrashing en route to running out
of memory.
If an endpoint slows down enough, it might not respond to
the scheduler's requests to receive job log updates and status
updates and send them back to the WSGrid client (e.g. the
external scheduler).
The scheduler threads may hang for a long time, waiting for the
endpoint's response.  Since there is only a single thread pool
in the scheduler used to stream the output from all endpoints,
this can lead to the situation where no output at all is
received over the WSGrid interface, because all the relevant
scheduler threads are hung waiting for output from a single
bad server.
However, the jobs submitted to the other (good) endpoints
should still have executed normally in this scenario, even
though their output is not handled properly and sent back to
the WSGrid client.
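
The thread-pool exhaustion described above can be illustrated with
a minimal Java sketch. This is not the Compute Grid scheduler
code: the class name, method name, and pool sizes are hypothetical,
and a CountDownLatch stands in for an endpoint that never responds.
It only shows how unbounded waits on one shared fixed-size pool can
keep a request for a healthy endpoint from ever being serviced.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PoolStarvationSketch {

    /**
     * Hypothetical helper: returns true if a fetch for a healthy endpoint
     * completes within waitMillis while hungFetches tasks block forever
     * on a shared pool of poolSize threads.
     */
    static boolean healthyFetchCompletes(int poolSize, int hungFetches, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Stands in for the hung endpoint: not released until cleanup.
        CountDownLatch hungEndpoint = new CountDownLatch(1);
        try {
            for (int i = 0; i < hungFetches; i++) {
                pool.submit(() -> {
                    try {
                        hungEndpoint.await(); // unbounded wait, as before the fix
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Output request for a healthy endpoint, queued behind the hung ones.
            Future<String> healthy = pool.submit(() -> "job log line");
            try {
                healthy.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException | ExecutionException e) {
                return false;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } finally {
            hungEndpoint.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Every pool thread is stuck on the hung endpoint.
        System.out.println(healthyFetchCompletes(2, 2, 200));
        // One pool thread is still free to serve the healthy endpoint.
        System.out.println(healthyFetchCompletes(2, 1, 200));
    }
}
```

In the first call both pool threads are occupied by the simulated
hung endpoint, so the healthy request stays queued. In the second
call one thread remains free and the request is served.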

Problem conclusion

The scheduler threads streaming output from the endpoint server
for WSGrid-submitted jobs back to the WSGrid client will now
time out rather than hang indefinitely.  A single bad endpoint
can therefore still slow down output streaming, but only in
proportion to the number of jobs on that endpoint compared to
the total jobs managed by this scheduler, rather than preventing
streaming of all WSGrid output.
The fix for this APAR is currently targeted for inclusion in
fixpack 8.0.0.3. Please refer to the Recommended Updates page
for delivery information:
http://www.ibm.com/support/docview.wss?uid=swg27022998
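
The bounded-wait behavior described in this section can be sketched
in Java as follows. This is an illustration under stated
assumptions, not the code shipped in the fix pack: the class and
method names, the 200 ms timeout, and the "<timed out>" marker are
all hypothetical, and a long sleep stands in for the hung endpoint.
Each fetch is waited on with Future.get(timeout, unit), so a hung
endpoint costs at most one timeout per poll instead of blocking the
caller indefinitely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedStreamingSketch {

    // Hypothetical marker recorded when an endpoint does not answer in time.
    static final String TIMED_OUT = "<timed out>";

    /** Polls each endpoint fetch in turn, waiting at most timeoutMillis for each. */
    static List<String> pollEndpoints(List<Callable<String>> fetches, long timeoutMillis) {
        ExecutorService calls = Executors.newCachedThreadPool();
        List<String> results = new ArrayList<>();
        try {
            for (Callable<String> fetch : fetches) {
                Future<String> pending = calls.submit(fetch);
                try {
                    results.add(pending.get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (TimeoutException e) {
                    pending.cancel(true); // give up on this endpoint for now
                    results.add(TIMED_OUT);
                } catch (ExecutionException e) {
                    results.add("<failed: " + e.getCause() + ">");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    pending.cancel(true);
                    break;
                }
            }
        } finally {
            calls.shutdownNow();
        }
        return results;
    }

    /** Three simulated endpoints; the middle one never responds in time. */
    static List<String> demo() {
        List<Callable<String>> fetches = Arrays.asList(
            () -> "log from endpoint A",
            () -> { Thread.sleep(60_000); return "never delivered"; },
            () -> "log from endpoint C");
        return pollEndpoints(fetches, 200);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the timeout in place, the output for endpoints A and C is
still delivered; the only cost of the hung endpoint is the time
spent waiting for it to time out.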

Temporary fix

Comments

APAR Information

APAR number
PM74855
Reported component name
WXD COMPUTE GRI
Reported component ID
5725C9301
Reported release
800
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2012-10-11
Closed date
2013-01-02
Last modified date
2013-04-19

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

PM75190

Fix information

Fixed component name
WXD COMPUTE GRI
Fixed component ID
5725C9301

Applicable component levels

R800 PSY
UP

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSFVRM","label":"WebSphere Extended Deployment Compute Grid"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"8.0","Line of Business":{"code":"LOB45","label":"Automation"}}]

Document Information

Modified date:
29 October 2021
