PK97793: AFTER APPLYING FIX FOR RATLC01299590, PROCESSES HANG WITHIN A ZOMBIE STATE FOR SEVERAL MINUTES UNDER CERTAIN CONDITIONS

Fixes are available

APAR status

Closed as program error.

Error description

Processes hang within a zombie state for several minutes under
certain conditions

ClearCase 7.0.x

Any Linux OS


Description of Problem:

Previous history of fix applied:
***   ***   ***
        When a process is killed, for whatever reason, all
files opened by that process (and not voluntarily closed) are
automatically closed by the OS. However, the OS does not remove
the process from the process list until the entire file cleanup
completes. For a regular file, the close operation simply
completes and the process exits. However, when the file being
closed is within an MVFS mount, MVFS must first communicate
with the VOB/View server, using RPC calls.

The problem arises when the process is signaled (for example,
with ctrl-c). Under these circumstances, the OS denies RPC
initiation, which is a significant problem because we rely on
RPC communication to complete our tasks. Because our
communication attempt is blocked, we retry after 5 seconds.
To wait those 5 seconds, 7.0.x uses mdelay, which is a busy
loop. This is why CPU usage climbs to 100 percent.
***   ***   ***
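
As a user-space analogy only (mdelay itself is a Linux kernel
busy-wait delay and is not called here), the following Python
sketch contrasts a spinning delay with a sleeping delay of the
same length. The function names are illustrative and do not
come from MVFS.

```python
import time

def busy_wait(seconds):
    """Spin until the deadline passes, the way a busy-loop delay does."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass  # the CPU stays busy for the whole interval

def cpu_seconds_used(delay_fn, seconds):
    """Return the CPU time this process consumed while delay_fn ran."""
    start = time.process_time()
    delay_fn(seconds)
    return time.process_time() - start

busy_cpu = cpu_seconds_used(busy_wait, 0.2)
sleep_cpu = cpu_seconds_used(time.sleep, 0.2)
print(f"busy loop: {busy_cpu:.3f}s CPU, sleep: {sleep_cpu:.3f}s CPU")
```

The spinning delay should report CPU time close to the full
interval, while the sleeping delay should report almost none.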

Specific problem now with the code modification:

        Because of the events described in the defect above,
each failed communication attempt is followed by a 5-second
wait before the next RPC retry to the VOB/View server. Note
that this delay occurs only when the process has open files
and pending fatal signals. Each close call expands to 2 or
more calls: a 'flush' call for each time the file was opened
(at least 1) and a 'release' call when there are no more
references to that file (the 'last' close, or the real close).
So, a single close operation requires more than one set of RPC
communications to the VOB/View server, plus the time required
by the underlying FS to actually close the cleartext file.
Given all of these retries, a delay of a minute or so is to be
expected with the current MVFS internal processes.
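
The arithmetic above can be sketched as a rough lower-bound
estimate. The 5-second wait and the flush/release counts come
from the description; the number of failed attempts per call is
not stated, so failed_attempts_per_call is an assumed parameter,
and the function names are illustrative only.

```python
RETRY_DELAY_SECONDS = 5  # wait after each failed attempt, per the description

def rpc_calls_for_file(times_opened):
    """One 'flush' per open plus one 'release' on the last close."""
    if times_opened < 1:
        raise ValueError("a file must have been opened at least once")
    return times_opened + 1

def estimated_delay_seconds(open_counts, failed_attempts_per_call=1):
    """Estimate the time spent waiting on retries while the process exits.

    open_counts: how many times each still-open file was opened.
    failed_attempts_per_call: assumed number of blocked attempts per call.
    """
    total_calls = sum(rpc_calls_for_file(n) for n in open_counts)
    return total_calls * failed_attempts_per_call * RETRY_DELAY_SECONDS

# Six files, each opened once: 6 * (1 flush + 1 release) = 12 calls.
print(estimated_delay_seconds([1] * 6))  # 12 calls * 1 attempt * 5 s = 60
```

Under these assumptions, even a handful of open files accounts
for a delay of about a minute, before counting the time the
underlying FS needs to close the cleartext files.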


Work Around:  Install signal handlers for ctrl-c (SIGINT) and
other terminating signals so that the process closes its open
files gracefully and exits cleanly.
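
A minimal sketch of this workaround in Python, assuming the
application tracks the files it opens. The helper names
(tracked_open, close_all_files, install_handlers) are
hypothetical, and the choice of SIGINT and SIGTERM is an
assumption. Signals that cannot be caught, such as SIGKILL, are
not covered by this approach.

```python
import signal
import sys

open_files = []  # files this process is responsible for closing

def tracked_open(path, mode="r"):
    """Open a file and remember it so a handler can close it later."""
    f = open(path, mode)
    open_files.append(f)
    return f

def close_all_files():
    """Close every tracked file, continuing past individual failures."""
    while open_files:
        f = open_files.pop()
        try:
            f.close()
        except OSError:
            pass  # keep closing the remaining files

def handle_termination(signum, frame):
    """Close files first, then exit with the conventional 128+signal code."""
    close_all_files()
    sys.exit(128 + signum)

def install_handlers(signals=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler; must be called from the main thread."""
    for signum in signals:
        signal.signal(signum, handle_termination)
```

A program would call install_handlers() once at startup and
open files under the MVFS mount through tracked_open(), so that
the files are already closed by the time the process exits.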

Local fix

Problem summary

****************************************************************
* USERS AFFECTED:                                              *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
When a process is being terminated by a signal, the RPC
layer blocks any further RPCs from that process, returning
the error ERESTARTSYS to the calling layer.  MVFS includes
retry logic for certain RPC errors, and that retry loop,
compounded by the number of files the process had open, was
adding significant time to the process termination.
The MVFS RPC code now handles this case by checking for the
pending signal.
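
The change can be illustrated with a simplified retry loop in
Python. This is not the MVFS code: rpc and signal_pending are
hypothetical callables, the attempt limit is an assumption, and
giving up immediately when a signal is pending is one plausible
reading of "checking for the pending signal". ERESTARTSYS is a
Linux kernel-internal error code (512), defined here by hand.

```python
ERESTARTSYS = 512  # Linux kernel-internal error code, defined for illustration
RETRY_DELAY_SECONDS = 5

def call_with_retry(rpc, signal_pending, max_attempts=3, sleep=lambda s: None):
    """Retry rpc() on ERESTARTSYS unless a signal is pending.

    rpc: hypothetical callable returning 0 on success or an error code.
    signal_pending: hypothetical callable, True if the caller is being killed.
    sleep: delay function; a no-op by default so the sketch runs instantly.
    Returns (result, attempts_made, seconds_waited).
    """
    waited = 0
    result = None
    for attempt in range(1, max_attempts + 1):
        result = rpc()
        if result != ERESTARTSYS:
            return result, attempt, waited
        if signal_pending():
            # Retrying cannot succeed while the signal is pending: stop now.
            return result, attempt, waited
        if attempt < max_attempts:
            sleep(RETRY_DELAY_SECONDS)
            waited += RETRY_DELAY_SECONDS
    return result, max_attempts, waited

blocked_rpc = lambda: ERESTARTSYS
print(call_with_retry(blocked_rpc, signal_pending=lambda: False))  # (512, 3, 10)
print(call_with_retry(blocked_rpc, signal_pending=lambda: True))   # (512, 1, 0)
```

Without the check, every blocked call accumulates retry waits;
with it, each call returns after a single attempt, so the
termination time no longer grows with the number of open files.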

Problem conclusion

Fixed in ClearCase 7.1.1.8, 7.1.2.5, and 8.0.0.1.

Temporary fix

Comments

APAR Information

APAR number
PK97793
Reported component name
CLEARCASE UNIX
Reported component ID
5724G2901
Reported release
701
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2009-10-02
Closed date
2011-12-16
Last modified date
2011-12-16

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
CLEARCASE UNIX
Fixed component ID
5724G2901

Applicable component levels

R701 PSN
UP


Document Information

Modified date:
16 December 2011
