RE: [cciug] DO Corruption

From: Mark Keil (Mark.Keil@cportcorp.com)
Date: Thu Feb 03 2000 - 18:52:22 EST


DO corruption can happen for a number of reasons.
1. Network hiccup
2. NFS bugs
3. Build system crashes/hiccups
4. Tool bugs
5. OS bugs
6. Caching Raid box bugs/Caching software raid bugs
7. Other

1. Yup, network hardware can and does mess up packets on occasion.
   Scan back through the cciug archives, and you will
   find cases of network hardware that has caused a variety
   of failures. (though I don't specifically remember DO cases)
2. There used to be the NFS page-of-nulls problem, where a
   page-aligned page of 0's would randomly show up in files.
   It only occurred on pushes (writes), not on pulls (reads).
   I haven't heard of this one in a while.
3. If one of the builders crashes or hiccups during a build
   a DO can be corrupted. Clearcase catches most of these I believe.
4. gcc 2.8.1 had a bug where it would not remove the .o file if
   there was an error. We had to disable DO creation so these
   things didn't get shared until it was fixed.
   (NOTE, this was on NT, and may have been peculiar to our version
    of gcc)
   Build steps that create a DO in several steps, and don't
   check the return status at each step, could also be a cause.
   Do you have clean builds? Or do you have "OK" errors?
   The corrupt DO's may have been accompanied by ignored
   error/warning messages of some sort. "OK" errors make these hard
   to find because developers are used to ignoring anomalies.
5. OS's have been known to scramble data sometimes...
   One way to make SCM systems go fast is to have extra memory for
   large page buffer caches.
6. Bugs happen.
7. Your installation is unique. You may have interesting factors
   in your setup.

Mike's problem could be an OS caching bug. Is the corruption at all
aligned?
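
One way to answer the alignment question, if a known-good rebuild of the
same DO is available: compare the two byte for byte and see whether the
damaged ranges start and end on page boundaries. This is only a rough
sketch. The 4096-byte page size and the names diff_runs and
alignment_report are assumptions for illustration, not anything ClearCase
provides.

```python
PAGE_SIZE = 4096  # assumption: set this to the page size of the hosts involved


def diff_runs(good: bytes, bad: bytes):
    """Yield (start, end) byte ranges, end exclusive, where the buffers differ.

    Bytes that happen to match inside a damaged region split it into
    several runs, so read the result as a hint rather than proof.
    """
    n = min(len(good), len(bad))
    start = None
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, n)
    # A length mismatch is reported as its own trailing range.
    if len(good) != len(bad):
        yield (n, max(len(good), len(bad)))


def alignment_report(good: bytes, bad: bytes, page_size: int = PAGE_SIZE):
    """Return (start, end, start_aligned, end_aligned) for each differing range."""
    return [
        (s, e, s % page_size == 0, e % page_size == 0)
        for s, e in diff_runs(good, bad)
    ]
```

Read both files in binary mode and pass the contents in. Ranges that keep
landing on page boundaries point toward a caching or NFS layer rather than
the tool that wrote the file.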

Things to do...

1. Network hardware hiccups are hard to find...
2. If it were the page of nulls problem,
   a page of nulls checker could be written.
   Some of these are caught in the link phase.
3. Doing an ls -l of a particular DO kind
   can find short DO's that are a result
   of an incomplete build. These can then be removed with rmdo.
4. Turn off DO creation for a class of build types
   if a pattern is found.
5. Check OS vendor bug lists
6. Check Raid vendor bug lists.
7. Understand your setup.
   Ask, where could an intermittent problem crop up?
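
Items 2 and 3 above could be sketched roughly as follows: a checker for
page-aligned pages of nulls, and a size check that flags DO's much shorter
than their siblings of the same kind. The page size, the 50% threshold and
the function names are all assumptions. Object files can legitimately
contain pages of zeros, so treat every hit as a candidate to inspect, not
as confirmed corruption.

```python
import os
import statistics

PAGE_SIZE = 4096  # assumption: set this to the page size of the hosts involved


def null_pages(path, page_size=PAGE_SIZE):
    """Return offsets of page-aligned pages that consist entirely of zero bytes."""
    hits = []
    zero = bytes(page_size)
    with open(path, "rb") as f:
        offset = 0
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break  # a trailing partial page is ignored
            if page == zero:
                hits.append(offset)
            offset += page_size
    return hits


def short_files(paths, min_fraction=0.5):
    """Return paths whose size is below min_fraction of the group's median size.

    Pass in files of one kind only (for example all the .o files of one
    library), otherwise the median means little.
    """
    sizes = {p: os.path.getsize(p) for p in paths}
    if not sizes:
        return []
    median = statistics.median(sizes.values())
    return sorted(p for p, s in sizes.items() if s < min_fraction * median)
```

Anything either function reports still needs a human look before it is
removed with rmdo.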

ALSO:
   Check the CR's (configuration records) of the corrupt DO's.
   Is there a pattern?
   A machine?
   A network switch?
   A tool?

   And check the cciug archives.
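
Looking for a pattern in the CR's can be as simple as a tally. The sketch
below assumes you have already copied the interesting facts out of each
corrupt DO's CR by hand into a small record. The field names host, tool
and switch are hypothetical, and nothing here parses CR output.

```python
from collections import Counter


def tally(records, fields=("host", "tool", "switch")):
    """Count how often each value of each field shows up among corrupt DO's.

    records is a list of dicts, one per corrupt DO. Missing or empty
    fields are skipped.
    """
    return {
        field: Counter(r[field] for r in records if r.get(field))
        for field in fields
    }


# Hypothetical example data, one record per corrupt DO.
corrupt = [
    {"do": "a.o", "host": "build1", "tool": "gcc"},
    {"do": "b.o", "host": "build1", "tool": "ar"},
    {"do": "c.o", "host": "build2", "tool": "gcc"},
]

for field, counts in tally(corrupt).items():
    print(field, counts.most_common(3))
```

A value that dominates one field is worth chasing first, though with only
a handful of samples a skew can easily be coincidence.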

-Mark

-----Original Message-----
From: Mike Krupicka [mailto:krupicka@tellabs.com]

According to lwillard@packetengines.com:
   We have been doing Clearmake builds in preparation for a RELEASE, so
   there have been a lot of checkins and changes over the last 24 hours.
   Most of the builds have been working fine, but a few of our engineers
   have had problems with the archive part of one section of the build
   command. It seems to fail on just a few files of the build. The main
   files that it fails on are mem_map.h and modid.h. We figured out that if
   we did the clearmake with the -u option we could do the builds by
   rebuilding everything from scratch, but this is kind of like hammering a
   nail with a sledge hammer. The question that I have is: how should I
   have gone about trying to figure out what the problem was with the few
   builds that failed? Is there a way to make part of the build more
   verbose, or a log file I could look into? Or is there a way to tell if a
   checked-in file is corrupt in some fashion? All permissions, both UNIX
   and through ClearCase, looked fine.

What are the contents of the corrupt DO? I currently have a case open
with Rational support where a couple of DO's were corrupted with data
that looks like it is part of a configuration record from another DO
built at the same time (using distributed cmake).

If this is what you are seeing, please contact Rational Support (and
let me know). The bad news is that there are no audits that can
determine that your DO got hosed. You could run the UNIX file command
on all of the DO's in the VOB storage to verify that the file type is
correct, but that is slow and it is a constantly moving target.
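
A cheaper stand-in for running the file command on everything is to look
only at the first few bytes of each file. The sketch below is built on
assumptions: that the files are visible under names with their usual
suffixes, that objects are ELF (other platforms use other formats, so the
table needs adjusting), and that libraries are ar archives. The names
EXPECTED_MAGIC and wrong_type are made up for illustration.

```python
import os

# Assumption: the leading bytes expected for each suffix. Edit this table
# to match the platform; ELF objects are only one possibility.
EXPECTED_MAGIC = {
    ".o": (b"\x7fELF",),
    ".a": (b"!<arch>\n",),
}


def wrong_type(root):
    """Return files under root whose leading bytes do not match their suffix.

    Files with a suffix that is not in EXPECTED_MAGIC are skipped.
    """
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            magics = EXPECTED_MAGIC.get(os.path.splitext(name)[1])
            if magics is None:
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                head = f.read(16)
            if not head.startswith(magics):
                bad.append(path)
    return sorted(bad)
```

This only catches damage at the very start of a file, and like the file
command it is chasing a moving target while builds are running.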

To clean up the problem, the offending DO's should be removed with ct-rmdo.
But if your DO's are corrupt, talk to Rational. My case has been
open for a while, but unless they get more samples, they can't do much
about it (other than grasping at straws).

-- 

Mike Krupicka krupicka@tellabs.com


