Fixes are available
APAR status
Closed as program error.
Error description
This issue affects parallel jobs running in SMP, MPP, or YARN environments where a firewall with a timeout is set up between the conductor node and the section leader processes. If the job's execution time is longer than the firewall timeout, the conductor node can fail to capture messages from the section leader processes. Even if the parallel engine processes complete successfully, the job reports a fatal error because of a timeout occurring in the PX process manager.

In a YARN configuration, the error message is of the form:
----
Item #: iiiii
Event ID: eeeee
Timestamp: YYYY-MM-DD HH:MM:SS
Type: Fatal
User Name: <USERNAME>
Message Id: IIS-DSEE-TFPM-00520
Message: main_program: Fatal Error: Failed to read data from Application Master sending container status message. Check the health of YARN/Hadoop. Socket error: Connection timed out. Look into Application Master's logs for more information at <ENGINE_SERVER_NAME>:/<HADOOP_FILE_SYSTEM_ROOT>/hdfs/data8/nmlog/application_1556627420495_0229/container_1556627420495_0229_01_000001/oshjob.0229_0. These logs will be moved to a log aggregation directory after job completion if Hadoop log aggregation is enabled.
----
In that configuration, a keep-alive mechanism is the only way to make the parallel job complete successfully.
Local fix
Set the firewall timeout to a value longer than the execution time of the longest-running job.
Problem summary
This issue affects parallel jobs running in SMP, MPP, or YARN environments where a firewall with a timeout is set up between the conductor node and the section leader processes. If the job's execution time is longer than the firewall timeout, the conductor node can fail to capture messages from the section leader processes. Even if the parallel engine processes complete successfully, the job reports a fatal error because of a timeout occurring in the PX process manager.

In a YARN configuration, the error message is of the form:

Item #: iiiii
Event ID: eeeee
Timestamp: YYYY-MM-DD HH:MM:SS
Type: Fatal
User Name:
Message Id: IIS-DSEE-TFPM-00520
Message: main_program: Fatal Error: Failed to read data from Application Master sending container status message. Check the health of YARN/Hadoop. Socket error: Connection timed out. Look into Application Master's logs for more information at <ENGINE_SERVER_NAME>:/<HADOOP_FILE_SYSTEM_ROOT>/hdfs/data8/nmlog/application_1556627420495_0229/container_1556627420495_0229_01_000001/oshjob.0229_0. These logs will be moved to a log aggregation directory after job completion if Hadoop log aggregation is enabled.

In that configuration, a keep-alive mechanism is the only way to make the parallel job complete successfully.
Problem conclusion
A patch is available that adds a keep-alive function, which fixes the problem. The patch sets keep-alive at the socket level, overriding the system default values by using the following socket options:
TCP_KEEPCNT
TCP_KEEPIDLE
TCP_KEEPINTVL
Keep-alive probes are sent from the section leader and player processes on the compute nodes to the conductor at regular intervals, as determined by these settings, so that some traffic is maintained on the connections. The values are set from InfoSphere Information Server by using the following environment variables:
APT_TCPKEEPCNT
APT_TCPKEEPIDLE
APT_TCPKEEPINTVL
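The socket-level keep-alive settings named above can be illustrated with a short Python sketch. It is not the patch's implementation. It only shows how the three standard TCP options behave on a Linux socket. The assumptions are that each APT_* variable holds a plain integer (seconds for APT_TCPKEEPIDLE and APT_TCPKEEPINTVL, a probe count for APT_TCPKEEPCNT) and maps one-to-one to the matching TCP_* option. The default values and all function names are hypothetical and exist only for this example.

```python
import os
import socket

# Hypothetical defaults for this sketch only; they are NOT the product's defaults.
SKETCH_DEFAULTS = {
    "APT_TCPKEEPIDLE": 600,   # assumed: idle seconds before the first probe
    "APT_TCPKEEPINTVL": 60,   # assumed: seconds between unanswered probes
    "APT_TCPKEEPCNT": 5,      # assumed: unanswered probes before the peer is declared dead
}


def keepalive_settings(env=os.environ):
    """Read the three APT_* variables as integers, falling back to the sketch defaults."""
    return {name: int(env.get(name, default)) for name, default in SKETCH_DEFAULTS.items()}


def enable_keepalive(sock, settings):
    """Turn on keep-alive for one TCP socket and override the system-wide defaults (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, settings["APT_TCPKEEPIDLE"])
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, settings["APT_TCPKEEPINTVL"])
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, settings["APT_TCPKEEPCNT"])


def dead_peer_detection_seconds(settings):
    """Approximate time until an unresponsive peer is detected: idle + interval * count."""
    return settings["APT_TCPKEEPIDLE"] + settings["APT_TCPKEEPINTVL"] * settings["APT_TCPKEEPCNT"]


def probes_before_firewall_timeout(settings, firewall_timeout_seconds):
    """True if the first probe is sent before the firewall would drop an idle connection."""
    return settings["APT_TCPKEEPIDLE"] < firewall_timeout_seconds
```

For a firewall that drops idle connections after, say, 300 seconds, the idle value must be below 300 so that a probe crosses the firewall before the connection is discarded. `probes_before_firewall_timeout` captures that check.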
Temporary fix
Comments
APAR Information
APAR number
JR61174
Reported component name
WIS DATASTAGE
Reported component ID
5724Q36DS
Reported release
B70
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-06-05
Closed date
2019-07-30
Last modified date
2019-07-30
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Modules/Macros
SERVER
Fix information
Fixed component name
WIS DATASTAGE
Fixed component ID
5724Q36DS
Applicable component levels
RB71 PSY
UP
Business Unit: IBM Software w/o TPS (BU059)
Product: InfoSphere DataStage (SSVSEF)
Platform: Platform Independent (PF025)
Version: 11.7
Line of Business: Data and AI (LOB10)
Document Information
Modified date:
17 October 2021