IBM Support

JR61174: ADD KEEP-ALIVE OPTION FOR INFORMATION SERVER PARALLEL ENGINE CONDUCTOR/SECTION-LEADER MESSAGES


APAR status

  • Closed as program error.

Error description

  • This issue affects parallel jobs running in SMP, MPP, or YARN
    environments where a firewall with an idle timeout sits between
    the conductor node and the section leader processes. If the
    job's execution time is longer than the firewall timeout, the
    conductor node can fail to capture messages from the section
    leader processes.
    Even if the parallel engine processes complete successfully,
    the job reports a fatal error because of the timeout occurring
    in the PX process manager.
    In a YARN configuration, the error message is of the form:
    ----
       Item #: iiiii
       Event ID: eeeee
       Timestamp: YYYY-MM-DD HH:MM:SS
       Type: Fatal
       User Name: <USERNAME>
       Message Id: IIS-DSEE-TFPM-00520
       Message: main_program: Fatal Error: Failed to read data from
    Application Master sending container status message. Check the
    health of YARN/Hadoop. Socket error: Connection timed out. Look
    into Application Master's logs for more information at
    <ENGINE_SERVER_NAME>:/<HADOOP_FILE_SYSTEM_ROOT>/hdfs/data8/nmlog
    /application_1556627420495_0229/container_1556627420495_0229_01_
    000001/oshjob.0229_0. These logs will be moved to a log
    aggregation directory after job completion if Hadoop log
    aggregation is enabled.
    ----
    In that configuration, a keep-alive mechanism is the only
    possible way to make the Parallel Job complete successfully.
    

Local fix

  • Adjust the firewall timeout to a value longer than the longest
    job's execution time.
    

Problem summary

  • This issue affects parallel jobs running in SMP, MPP, or YARN
    environments where a firewall with an idle timeout sits between
    the conductor node and the section leader processes. If the
    job's execution time is longer than the firewall timeout, the
    conductor node can fail to capture messages from the section
    leader processes.
    Even if the parallel engine processes complete successfully,
    the job reports a fatal error because of the timeout occurring
    in the PX process manager.
    In a YARN configuration, the error message is of the form:
    Item #: iiiii
    Event ID: eeeee
    Timestamp: YYYY-MM-DD HH:MM:SS
    Type: Fatal
    User Name:
    Message Id: IIS-DSEE-TFPM-00520
    Message: main_program: Fatal Error: Failed to read data from
    Application Master sending container status message. Check the
    health of YARN/Hadoop. Socket error: Connection timed out. Look
    into Application Master's logs for more information at
    <ENGINE_SERVER_NAME>:/<HADOOP_FILE_SYSTEM_ROOT>/hdfs/data8/nmlog
    /application_1556627420495_0229/container_1556627420495_0229_01_
    000001/oshjob.0229_0. These logs will be moved to a log
    aggregation directory after job completion if Hadoop log
    aggregation is enabled.
    In that configuration, a keep-alive mechanism is the only
    possible way to make the Parallel Job complete successfully.
    

Problem conclusion

  • A patch is available which adds a keep-alive function that
    fixes the problem.
    The patch implements socket-level keep-alive connection settings
    by overriding the system default values for
    TCP_KEEPCNT
    TCP_KEEPIDLE
    TCP_KEEPINTVL
    Keep-alive probes are sent from the section leader / player
    processes on the compute nodes to the conductor at regular
    intervals, as per the above settings, so that some traffic is
    maintained on the connections.
    The above settings are controlled from InfoSphere Information
    Server using the environment variables
    APT_TCPKEEPCNT
    APT_TCPKEEPIDLE
    APT_TCPKEEPINTVL
    A minimal illustration of this kind of socket-level override
    follows below.
    
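    The following is a minimal sketch, in C for Linux, of how such
    per-socket keep-alive overrides are typically applied. It is not
    the actual PX engine code; the helper names and the fallback
    values are illustrative only.
    ----
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Read an integer from the environment, with a fallback default
       (the fallback values below are placeholders, not the patch
       defaults). */
    static int env_int(const char *name, int fallback)
    {
        const char *val = getenv(name);
        return (val && *val) ? atoi(val) : fallback;
    }

    /* Enable keep-alive on a connected socket and override the
       system defaults for idle time, probe interval, and probe
       count. */
    static int enable_keepalive(int fd)
    {
        int on    = 1;
        int idle  = env_int("APT_TCPKEEPIDLE",  60); /* seconds before first probe */
        int intvl = env_int("APT_TCPKEEPINTVL", 10); /* seconds between probes */
        int cnt   = env_int("APT_TCPKEEPCNT",    5); /* unanswered probes allowed */

        if (setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof on)    < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof idle)  < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof cnt)   < 0) {
            perror("setsockopt");
            return -1;
        }
        return 0;
    }
    ----
    With settings of this kind in place, the operating system keeps
    probing an otherwise idle conductor/section-leader connection, so
    a stateful firewall does not treat it as inactive and drop it.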

Temporary fix

Comments

APAR Information

  • APAR number

    JR61174

  • Reported component name

    WIS DATASTAGE

  • Reported component ID

    5724Q36DS

  • Reported release

    B70

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-06-05

  • Closed date

    2019-07-30

  • Last modified date

    2019-07-30

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Modules/Macros

  • SERVER
    

Fix information

  • Fixed component name

    WIS DATASTAGE

  • Fixed component ID

    5724Q36DS

Applicable component levels

  • RB71 PSY

       UP

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSVSEF","label":"InfoSphere DataStage"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"11.7","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 October 2021