IBM Support

PK97793: AFTER APPLYING FIX FOR RATLC01299590, PROCESSES HANG WITHIN A ZOMBIE STATE FOR SEVERAL MINUTES UNDER CERTAIN CONDITIONS


APAR status

  • Closed as program error.

Error description

  • Processes hang within a zombie state for several minutes under
    certain conditions
    
    ClearCase 7.0.x
    
    Any Linux OS
    
    
    Description of Problem:
    
    Previous history of fix applied:
    ***   ***   ***
            When a process is killed, for whatever reason, all
    files opened by that process (and not voluntarily closed) are
    automatically closed by the OS.  However, the OS does not remove
    the process from the process list until the entire file cleanup
    completes. On a regular file, the close operation is simply done
    and the process exits.  However, when the file being closed is
    within an MVFS mount, MVFS must first communicate with
    the VOB/View server, using RPC calls.
    
    The problem arises when the process is signaled (such as by a
    ctrl-c).  Under these circumstances, the OS denies RPC
    initiation (which is a big problem, as we rely on the RPC
    communication to complete our tasks).  Since we try to
    communicate and are blocked, we retry after 5 seconds.
    To wait those 5 seconds, (in 7.0.x) we are using mdelay, which
    is a busy loop.  This is why CPU usage climbs to 100
    percent.
    ***   ***   ***
    
    Specific problem now with the code modification:
    
            Because of the events described in the defect
    above, after each failed communication attempt we spend 5
    seconds waiting before each RPC retry to the VOB/View server.
    Remember: this delay occurs only when the process has open files
    and pending fatal signals.  On review, each close
    call expands to 2 or more calls: a 'flush' call for each
    time the file was opened (at least 1) and a 'release' call when
    there are no more references to that file (the 'last' close, or
    the real close). So, a single close operation requires more than
    one set of RPC communications to the VOB/View server, plus the
    time required by the underlying FS to actually close the
    cleartext file.  Given all of the retries, the delay of a
    minute or so is to be expected
    with the current MVFS internal processes.
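
    As a rough illustration of how the delay accumulates, the arithmetic above can be sketched as follows. The 5-second wait and the flush/release breakdown come from the description; the function names and the assumption of exactly one failed attempt (one wait) per RPC are hypothetical, since the APAR does not state how many retries actually occur.

```python
RETRY_DELAY_SECONDS = 5  # wait before each RPC retry, as stated in the description


def rpc_calls_for_close(times_opened):
    """RPC-backed calls needed to close one file: one 'flush' per open
    plus a single 'release' for the last (real) close."""
    if times_opened < 1:
        raise ValueError("a file must have been opened at least once")
    return times_opened + 1


def estimated_delay_seconds(open_counts, retries_per_call=1):
    """Estimate the total wait for a process killed with files still open.

    open_counts: how many times each open file was opened.
    retries_per_call: ASSUMPTION, number of 5-second waits per call.
    """
    total_calls = sum(rpc_calls_for_close(n) for n in open_counts)
    return total_calls * retries_per_call * RETRY_DELAY_SECONDS


if __name__ == "__main__":
    # Six files, each opened once: 6 * (1 flush + 1 release) = 12 calls.
    print(estimated_delay_seconds([1] * 6))
```

    Under these assumptions, six files that were each opened once already account for 12 calls and 60 seconds of waiting, before adding the time the underlying FS needs to close the cleartext files.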
    
    
    Work Around:  Handle ctrl-c (SIGINT) and other terminating
    signals so that the process gracefully closes its open files
    and exits cleanly.
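
    A minimal sketch of the workaround for an application written in Python, assuming the application can track the files it opens. The helper names (tracked_open, close_all, install_handlers) and the choice of SIGINT and SIGTERM are illustrative assumptions, not part of ClearCase.

```python
import signal
import sys

# Hypothetical registry of file objects this process has opened.
open_files = []


def tracked_open(path, mode="r"):
    """Open a file and remember it so it can be closed on a signal."""
    f = open(path, mode)
    open_files.append(f)
    return f


def close_all():
    """Close every tracked file; return how many were actually closed."""
    closed = 0
    while open_files:
        f = open_files.pop()
        if not f.closed:
            f.close()
            closed += 1
    return closed


def handle_signal(signum, frame):
    """Close open files voluntarily, then exit with the usual 128+N code."""
    close_all()
    sys.exit(128 + signum)


def install_handlers(signals=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler for ctrl-c and other catchable signals."""
    for sig in signals:
        signal.signal(sig, handle_signal)
```

    An application would call install_handlers() once at startup and open files under the MVFS mount through tracked_open(). Signals that cannot be caught, such as SIGKILL, are not covered by this approach.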
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    When a process is being terminated by a signal, the RPC
    layer blocks any further RPCs from that process, returning
    the error ERESTARTSYS to the calling layer.  MVFS includes
    retry logic for certain RPC errors, and that retry loop,
    compounded by the number of files the process had open, was
    adding significant time to the process termination.
    The MVFS RPC code now handles this case by checking for the
    pending signal.
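
    The APAR does not show the actual MVFS change. As a user-space simulation of the idea only, a retry loop that stops as soon as a signal is pending might look like the sketch below. The function name, the callbacks, the retry limit, and the string stand-in for ERESTARTSYS are all assumptions made for illustration.

```python
ERESTARTSYS = "ERESTARTSYS"  # symbolic stand-in for the kernel error code


def call_with_retry(rpc, signal_pending, max_retries=3, delay=lambda: None):
    """Call rpc() and retry on ERESTARTSYS, unless a signal is pending.

    rpc: callable returning a result or ERESTARTSYS (hypothetical).
    signal_pending: callable returning True if the process is being
        terminated by a signal (hypothetical).
    delay: callable that waits between retries (no-op by default).
    Returns (result, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        result = rpc()
        if result != ERESTARTSYS:
            return result, attempts
        # Retrying cannot succeed while the signal is pending, so give up
        # immediately instead of waiting and trying again.
        if signal_pending() or attempts > max_retries:
            return result, attempts
        delay()
```

    With a pending signal the loop returns after a single attempt and never waits, which is the behavior the summary describes: the per-file retry delays no longer add up during process termination.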
    

Problem conclusion

  • Fixed in ClearCase 7.1.1.8, 7.1.2.5, and 8.0.0.1.
    

Temporary fix

Comments

APAR Information

  • APAR number

    PK97793

  • Reported component name

    CLEARCASE UNIX

  • Reported component ID

    5724G2901

  • Reported release

    701

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2009-10-02

  • Closed date

    2011-12-16

  • Last modified date

    2011-12-16

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    CLEARCASE UNIX

  • Fixed component ID

    5724G2901

Applicable component levels

  • R701 PSN

       UP


Document Information

Modified date:
16 December 2011