IBM Support

JR35910: AFTER APPLYING FP1, SOME JOBS INTERMITTENTLY FAIL WITH THE FOLLOWING FATAL ERRORS.

APAR status

  • Closed as program error.

Error description

  • [Problem]
    After applying FP1, some jobs intermittently abort with the
    following FATAL errors. (The messages were translated from
    Japanese to English, so the wording might not be exact.)
    The jobs ran successfully before FP1 was applied.
    
    
       Item #: 22
       Event ID: 153
       Timestamp: 2010-01-26 17:32:26
       Type: FATAL
       Username: dsadm
       Message ID: IIS-DSEE-TFIO-00231
       Message: /gpf/data/mid/GPF_DS_CD/GPFDSU1200_2.ds,1:
    Configured timeout of 600 seconds reached for accepting player
    connections for pid 13,636. Pending fifo count: 0. Pending
    shared memory count: 1.  This is most likely due to the failure
    of an upstream operator.
    
       Item #: 23
       Event ID: 154
       Timestamp: 2010-01-26 17:32:26
       Type: FATAL
       Username: dsadm
       Message ID: IIS-DSEE-TFPM-00123
       Message: /gpf/data/mid/GPF_DS_CD/GPFDSU1200_2.ds,1: Fatal
    Error: Cannot start  ORCHESTRATE network connection on Node
    node2 (gpfds). APT_PMConnectionSetup::acceptConnection: Cannot
    accept the connection.
    
     [Additional info.]
    - The same issue happens on several jobs.
    - The issue is intermittent. Sometimes the job aborts and
    sometimes it finishes without any problem, even though the
    same job and the same data are used.
    - The error message reports a 600-second timeout, but the job
    does not actually wait 600 seconds when the issue happens.
    - With a 1-node configuration, the issue did not occur in 10
    test runs. The issue occurs when the number of nodes is more
    than 2.
    - I am now confirming whether anything changed on the system
    around the time FP1 was applied.
    - I have requested a job design that can be used to reproduce
    the issue.
    

Local fix

Problem summary

  • When a multi-node APT_CONFIG_FILE is used, one or more jobs may
    abort with the following error even though much less than 10
    minutes (600 seconds) has elapsed.
    
    Message ID: IIS-DSEE-TFIO-00231
    Message: <the-stage-name with node-number>: Configured timeout
    of 600 seconds reached for accepting player connections for pid
    <the-pid>. Pending fifo count: 0. Pending shared memory count:
    1.  This is most likely due to the failure of an upstream
    operator.
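
To check whether a saved job log contains this symptom, the message text quoted above can be matched with a regular expression. The following is only a minimal sketch: the function name find_player_timeouts and the assumption that the log has been exported as plain text are illustrative and are not part of the APAR or of any DataStage tool.

```python
import re

# Pattern built from the TFIO-00231 message text quoted in this APAR.
# The pid may be printed with a thousands separator (e.g. "13,636").
TIMEOUT_RE = re.compile(
    r"Configured timeout of (?P<timeout>\d+) seconds reached for accepting "
    r"player connections for pid (?P<pid>[\d,]+)\."
    r"\s+Pending fifo count: (?P<fifo>\d+)\."
    r"\s+Pending shared memory count: (?P<shm>\d+)\."
)


def find_player_timeouts(log_text):
    """Return one dict per player-connection timeout message found.

    Hypothetical helper: assumes log_text is a plain-text export of the
    job log in which long messages may be wrapped across lines.
    """
    # Collapse line wraps and repeated spaces so the pattern can match
    # a message that was broken across several lines.
    flat = " ".join(log_text.split())
    results = []
    for match in TIMEOUT_RE.finditer(flat):
        results.append({
            "timeout_seconds": int(match.group("timeout")),
            "pid": int(match.group("pid").replace(",", "")),
            "pending_fifo": int(match.group("fifo")),
            "pending_shared_memory": int(match.group("shm")),
        })
    return results
```

A match only shows that the message was logged. Comparing the job's actual elapsed time with the reported timeout still has to be done from the log timestamps.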
    

Problem conclusion

  • Install the patch.
    

Temporary fix

  • Use a 1-node configuration file.
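
As an illustration of this workaround, the sketch below generates the text of a configuration file that defines a single node, in the usual parallel engine layout (node, fastname, pools, resource disk, resource scratchdisk). The function name one_node_config is hypothetical, and the host name and directory paths in the usage note are placeholders rather than values from this APAR; substitute the values from the existing multi-node file.

```python
def one_node_config(fastname, disk, scratchdisk, node_name="node1"):
    """Build the text of a single-node parallel configuration file.

    All arguments are site-specific; nothing here is taken from the APAR.
    """
    return (
        "{\n"
        f'  node "{node_name}"\n'
        "  {\n"
        f'    fastname "{fastname}"\n'
        '    pools ""\n'
        f'    resource disk "{disk}" {{pools ""}}\n'
        f'    resource scratchdisk "{scratchdisk}" {{pools ""}}\n'
        "  }\n"
        "}\n"
    )


if __name__ == "__main__":
    # Placeholder host name and paths, for illustration only.
    print(one_node_config("myhost", "/data/datasets", "/data/scratch"))
```

The output can be saved to a file, and APT_CONFIG_FILE can then be pointed at that file for the affected jobs until the patch is installed.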
    

Comments

APAR Information

  • APAR number

    JR35910

  • Reported component name

    WIS DATASTAGE

  • Reported component ID

    5724Q36DS

  • Reported release

    810

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2010-03-14

  • Closed date

    2011-05-13

  • Last modified date

    2011-05-13

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WIS DATASTAGE

  • Fixed component ID

    5724Q36DS

Applicable component levels

  • R810 PSY

       UP

  • R850 PSY

       UP

Document Information

Modified date:
12 October 2021