DO corruption can happen for a number of reasons.
1. Network hiccup
2. NFS bugs
3. Build system crashes/hiccups
4. Tool bugs
5. OS bugs
6. Caching RAID box bugs / caching software RAID bugs
7. Other
1. Yup, network hardware can and does mess up packets on occasion.
Scan back through the cciug archives, and you will
find cases of network hardware that has caused a variety
of failures. (though I don't specifically remember DO cases)
2. There used to be the NFS page of nulls problem, where a
page-aligned page of 0's would randomly show up in files.
It only occurred in push cases (writes) instead of pulls (reads).
I haven't heard of this one in a while.
3. If one of the builders crashes or hiccups during a build,
a DO can be corrupted. ClearCase catches most of these, I believe.
4. gcc 2.8.1 had a bug where it would not remove the .o file if
there was an error. We had to disable DO creation so these
things didn't get shared until it was fixed.
(NOTE, this was on NT, and may have been peculiar to our version
of gcc)
Build steps that create a DO in several steps, and don't
do return status checking at each step could also be a cause.
Do you have clean builds? Or do you have "OK" errors?
The corrupt DO's may have been accompanied by ignored
error/warning messages of some sort. "OK" errors make these hard
to find because developers are used to ignoring anomalies.
5. OS's have been known to scramble data sometimes...
One way to make SCM systems go fast is to have extra memory for
large page buffer caches.
6. Bugs happen.
7. Your installation is unique. You may have interesting factors
in your setup.
Mike's problem could be an OS caching bug. Is the corruption at all
aligned?
Things to do...
1. Network hardware hiccups are hard to find...
2. If it were the page of nulls problem,
a page of nulls checker could be written.
Some of these are caught in the link phase.
3. Doing an ls -l of a particular DO kind
can find short DO's that are a result
of an incomplete build. These can then be removed with rmdo.
4. Turn off DO creation for a class of build types
if a pattern is found.
5. Check OS vendor bug lists
6. Check RAID vendor bug lists.
7. Understand your setup.
Ask, where could an intermittent problem crop up?
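The page of nulls checker mentioned in item 2 could look roughly like the Python sketch below. It is only an illustration: the function name find_null_pages and the 4096-byte page size are assumptions, not anything from ClearCase or from the original NFS problem report. Legitimate files can contain zero-filled pages too, so a hit is a candidate to look at, not proof of corruption.

```python
PAGE_SIZE = 4096  # assumption: adjust to the page size of the affected hosts


def find_null_pages(path, page_size=PAGE_SIZE):
    """Return the byte offsets of page-aligned pages that are entirely zero.

    A trailing partial page is ignored, since the problem described was
    a full, page-aligned page of 0's.
    """
    zero_page = bytes(page_size)
    offsets = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(page_size)
            if len(chunk) < page_size:
                break  # end of file or partial last page
            if chunk == zero_page:
                offsets.append(offset)
            offset += page_size
    return offsets
```

Run it over the suspect DO's and compare the hits against a known good rebuild of the same target.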
ALSO:
Check the CR's of the corrupt DO's
Is there a pattern?
A machine?
A network switch?
A tool?
And check the cciug archives.
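For item 3 in the list above (spotting short DO's by size), here is a rough Python equivalent of eyeballing ls -l output. The name find_short_files, the suffix argument and the "less than half the median size" rule are all assumptions made for illustration. Object sizes vary a lot, so treat the result as a list of things to inspect before running rmdo on anything.

```python
import os
import statistics


def find_short_files(root, suffix, ratio=0.5):
    """Return files under root ending in suffix that look suspiciously short.

    A file is flagged if it is empty or smaller than ratio times the
    median size of all files of that kind. This is a heuristic only.
    """
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                sizes[path] = os.path.getsize(path)
    if not sizes:
        return []
    median = statistics.median(sizes.values())
    return sorted(p for p, s in sizes.items() if s == 0 or s < ratio * median)
```

For example, find_short_files("/path/to/build/dir", ".o") would list the .o files that are much smaller than their siblings; the path here is a placeholder.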
-Mark
-----Original Message-----
From: Mike Krupicka [mailto:krupicka@tellabs.com]
According to lwillard@packetengines.com:
We have been doing Clearmake builds in preparation for a RELEASE so there
have been a lot of checkins and changes over the last 24 hours. Most of the
builds have been working fine but a few of our engineers have had problems
with the archive part of one section of the build command. It seems to
fail on just a few files of the build. The main files that it fails
on are mem_map.h and modid.h. We figured out if we did the clearmake with
the -u option we could do the builds by rebuilding everything from scratch
but this is kind of like hammering a nail with a sledge hammer. The
question that I have is: how should I have gone about trying to figure out
what the problem was with the few builds that failed? Is there a way to
make part of the build more verbose, or a log file I could look into? Or
is there a way to tell if a checked in file is corrupt in some fashion?
All permissions both UNIX and through clearcase looked fine.
What is the contents of the corrupt DO? I currently have a case open
with Rational support where a couple of DO's were corrupted with data
that looks like it is part of a configuration record from another DO
built at the same time (using distributed cmake).
If this is what you are seeing, please contact Rational Support (and
let me know). The bad news is that there are no audits that can
determine that your DO got hosed. You could run the UNIX file command
on all of the DO's in the VOB storage to verify that the file type is
correct, but that is slow and it is a constantly moving target.
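As a stand-in for running the file command on each DO, a small script can check the first few bytes against what that kind of file should start with. This is a sketch under assumptions: it swaps file(1) for a plain magic-byte comparison, it assumes .o files are ELF objects and .a files are ar archives (other platforms use other formats), and the names MAGIC and has_expected_magic are made up for the example. It has the same drawbacks noted above: it is slow over a whole VOB storage area and the set of DO's keeps changing underneath it.

```python
# Assumed mapping from file suffix to acceptable leading bytes.
MAGIC = {
    ".o": (b"\x7fELF",),      # ELF object file (assumption about the platform)
    ".a": (b"!<arch>\n",),    # ar archive
}


def has_expected_magic(path):
    """Return True/False if the file's leading bytes match its suffix.

    Returns None when the suffix is not one this sketch knows about.
    """
    for suffix, prefixes in MAGIC.items():
        if path.endswith(suffix):
            with open(path, "rb") as f:
                head = f.read(8)
            return any(head.startswith(p) for p in prefixes)
    return None
```

A DO whose contents came from another DO's configuration record would be expected to fail this check, since it would not start with the object or archive header.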
To clean up the problem, the offending DO's should be removed with ct-rmdo.
But if your DO's are corrupt, talk to Rational. My case has been
open for a while, but unless they get more samples, they can't do much
about it (other than grasping at straws).
--Mike Krupicka krupicka@tellabs.com
This archive was generated by hypermail 2b29 : Sun May 06 2001 - 00:23:01 EDT