Level: Intermediate
Lewin Edwards (sysadm@zws.com), Design Engineer, Freelance
14 Nov 2006
While per-transistor failure rates may be down, overall reliability hasn't improved as much as people sometimes assume, and modern systems are often much harder to repair than older ones. Following up on a previous article, Lewin Edwards reviews more of the problems modern engineers face.
Most people reading this article will be familiar with operating system
wars -- perhaps even combatants (I was a Team OS/2 member myself). The
frontline combatants in these wars frequently make statements like,
"Operating system A is cheaper than our product, but the total cost of
ownership is much higher because A needs more support." These statements
are quasi-useless because they are generally backed up by some extremely
subjective data collected under minutely controlled, nay contrived
conditions.
On the other hand, most people unquestioningly accept as a modern truth
that mass-produced goods manufactured today are vastly more reliable
than "mass"-produced (or even artisan-made) goods of previous decades.
Many people also believe that modern industrial machinery is much more efficient
at producing consumer goods with less waste than in times of yore. My
goal in this article is to show you the potential fallacy of some of these assumptions. In fact, when you look
closely, the above ideas are just as much smoke-and-mirrors arguments as the weaponry
used in operating systems warfare. I'll stick to the tried-and-true list
format that most of you enjoyed so much in the recent "Top five engineering hints you'll rarely hear" articles (see Resources).
The underlying theme in my arguments below is that commoditization
itself is a force that works against reliability. When single
transistors cost multiple dollars apiece, they and the designs that
contained them were tested with relative care. Now, we are rapidly
approaching the point where it is cheaper to build excess stock of a
product to service warranty claims than it is to test the product
thoroughly before it leaves the factory.
With that said, here are my specific cases:
#1: Sustained high production volume from an established vendor is not,
by itself, a guarantee of quality
I want to get this point out of the way, because I have observed a lot
of magical, quasi-teleological thinking from people who ought to know
better. Many people seem to use the "ten million flies can't be wrong"
philosophy when they're thinking about product reliability. The
underlying cognitive process apparently goes something like this:
1. Manufacturer XYZ is making and selling a million units a day.
2. Manufacturer XYZ is still in business after many years.
3. Therefore, the output quality of the production process must be good.
Refuting this argument is easy, at least with respect to electronics,
by looking at a contradictory example from an industry that most of you are probably unfamiliar with, at least from an engineering
perspective: electronic toys. (I used to work in this field many years
ago; in Resources you'll find a link to an article I wrote in early
2001, describing the design process for a simple electronic toy. It's
really another world from "real" embedded engineering.)
The microcontrollers used in most speaking toys contain fairly large
memory arrays to hold all that recorded speech data; sixteen megabits is
not unheard of (though of course, many parts are much smaller). That's
an awful lot of die area where production errors could occur. These
parts are also practically always shipped as raw dice, which are mounted
directly on the target PCB, bonded out and encapsulated, often by hand.
Electronic toys are made in enormous volume. A toy company would rarely
consider making a mass-market toy unless the anticipated volumes were at
least a couple of hundred thousand pieces. You'd think they'd use the best
possible components to keep their return rates down and their margins
up, right?
Wrong. Toys are profitable because they are built using the cheapest
processes and materials that money can (grudgingly) buy. In fact, the
ICs used in these applications are sometimes not even sample-tested by
the foundry; instead, standard industry practice is to ship an
overage of as much as 10% to compensate for defects. Buying these chips
is a bit like bidding on an "as-is" auction on eBay. Long-term
reliability isn't even characterized, let alone guaranteed.
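
The overage arithmetic is easy to sketch. The snippet below is only an illustration: it assumes defects are independent and spread evenly through a shipment, and the function names and the 200,000-unit build are hypothetical, not figures from any real foundry. It estimates how many untested dice you must receive to expect a given number of good ones, and the highest defect rate a given overage can absorb.

```python
import math


def dice_needed(good_units: int, defect_rate: float) -> int:
    """Dice to receive so the expected count of good dice covers the build.

    Assumes each die fails independently with probability defect_rate.
    """
    if not 0.0 <= defect_rate < 1.0:
        raise ValueError("defect_rate must be in [0, 1)")
    return math.ceil(good_units / (1.0 - defect_rate))


def max_defect_rate_covered(overage: float) -> float:
    """Largest defect rate that an overage (0.10 means 10% extra) can absorb."""
    return overage / (1.0 + overage)


if __name__ == "__main__":
    # Hypothetical build of 200,000 toys, one die per toy.
    print(dice_needed(200_000, 0.05))
    print(round(max_defect_rate_covered(0.10), 4))
```

Under these assumptions, a 10% overage only covers a defect rate of roughly 9.1%, because the extra dice are just as likely to be bad as the rest of the shipment.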
#2: Requirements that are not driven by engineering goals are constantly
injected into the design process
The ramifications of these
requirements are almost always ill understood, and the consequences are
therefore unpredictable.
Of course, this sort of annoyance has been going on since long before
Leonardo da Vinci bowed to the demands of his patrons. Unfortunately,
the consequences are becoming increasingly serious, partly because of
their (initial) invisibility and unpredictability, and mainly because
everybody's life is governed to a much greater degree by "high"
technology than it was a couple of centuries ago. RoHS is perhaps one of
this era's best examples of massive legal changes to the engineering
process, spearheaded by people utterly unable to analyze the
consequences of these changes. You'll find a link in Resources to a
fairly recent Standards and Specs article here on developerWorks, which
references the tin whisker problem with lead-free electronic assemblies.
Briefly, the problem is that metal crystals grow between component leads
and eventually short them out. This problem, which has been well known
for many years, can be mitigated to a great degree by adding lead to
solder mixtures. The RoHS directive removes that preventative. Moreover,
many of the substances used in RoHS components are handily toxic in
their own right; choice of which materials to ban appears to have been
made on a "how long is a piece of string?" basis.
Another issue with RoHS compliance, which people don't discuss quite as
much as the vaunted whisker problem, is that lead-free solders just
don't behave the same way as normal solders. They don't flow and fillet
the same way as leaded solders, they require significantly hotter
soldering profiles, and they result in joints that have a frosted,
"cold" appearance. Inspection of these joints requires retraining
personnel, and the fiery soldering profiles are a real process headache
for some types of components.
As a result of both of these issues, you'll be seeing a large number of
prematurely deceased consumer appliances over the next few years, either
due to whiskers or to production-related issues; what price reliability?
A good analogy to this situation, by the way, is the removal of
tetraethyl and tetramethyl lead from gasoline mixtures. These compounds
increase the octane rating of fuels, and they are referred to as
anti-knock additives. Starting in 1979, these lead-containing compounds were
phased out in favor of MTBE (methyl tertiary-butyl ether). Unrelated
environmental legislation changes in 1992 caused the amount of MTBE in
gasoline blends to increase significantly. Unfortunately, about 20
years after the large-scale introduction of this fascinating compound,
it was determined that MTBE is making its way into the water table, and,
like everything else in the known universe, it may cause cancer in
laboratory rats. MTBE is therefore being replaced with a dash of ethanol
(in some areas, this change is mandatory, but it's likely that the oil
industry will permanently change all its formulations nationwide to
avoid liability issues that have grown up surrounding MTBE). Long term,
ethanol is probably one of the least unwholesome additives we're ever
going to find, but it has the side effect of lowering fuel economy,
resulting in greater net pollution levels.
#3: Purchasing and engineering are frequently separated
In fact, purchasing and engineering are frequently not even in the same
country! This leads to a lack of
communication, and odd mistakes occur, despite procedural barriers.
Separating the component procurement and engineering arms of a hi-tech
manufacturer leads to problems that might not become apparent to quality
control, but are very evident when you look into the longevity
statistics for the company's products.
Take a look around your junk pile. Chances are, you'll find one or more
dead el-cheapo DVD players in there. (For this article, I
took a quick survey in a couple of rows of cubes in the engineering
department where I work during the day and found that 31 such
units were sitting in the basements of 12 engineers.) I picked DVD
players, by the way, because these are very large-volume consumer
articles that also happen to include some high-speed digital and
mixed-signal electronics, as well as a switching power supply and fairly
complex precision mechanical components; they're a veritable cornucopia
of miscellaneous parts, and hence we can expect to see a plethora of
failure modes represented in such an appliance.
As part of this unscientific study, I acquired 14 of the units in
question and inspected them. One unit had a lubrication problem on the
worm gear driving the laser head, and one unit was just dead -- I
observed address and data bus activity on power-up, but the unit didn't
boot past a certain point. I suspect that the flash device holding the
firmware was corrupted.
All 12 of the remaining units had been retired due to failed
electrolytic capacitors in their power supply sections. Interestingly,
all of these failed devices for which I was able to find datasheets were
standard-temperature-grade, standard-life span capacitors rated to only
+85 degrees Celsius. These bottom-of-the-range parts are typically rated
for 1000 or 2000 hours of operation at their maximum temperature (the
life span increases if your application doesn't get quite so warm).
Conversely, if you exceed the temperature rating, the expected life span
plummets drastically. This sort of application would more typically use
extended-temperature parts (+105 degrees Celsius).
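
To put rough numbers on this, designers often quote a "10-degree rule" for aluminum electrolytic capacitors: expected life roughly doubles for every 10 degrees Celsius the part runs below its rated temperature, and halves for every 10 degrees above it. The sketch below applies that rule of thumb only; it ignores ripple current and applied voltage, real datasheet formulas differ between vendors, and the 65-degree operating temperature is an assumed example, not a measurement from the units described here.

```python
HOURS_PER_YEAR = 24 * 365


def estimated_life_hours(rated_hours: float,
                         rated_temp_c: float,
                         operating_temp_c: float) -> float:
    """Rule-of-thumb life estimate: life doubles per 10 C below the rating."""
    return rated_hours * 2.0 ** ((rated_temp_c - operating_temp_c) / 10.0)


if __name__ == "__main__":
    operating_temp_c = 65.0  # assumed temperature inside the enclosure
    for rated_temp_c in (85.0, 105.0):
        hours = estimated_life_hours(2000.0, rated_temp_c, operating_temp_c)
        years = hours / HOURS_PER_YEAR
        print(f"{rated_temp_c:.0f} C part: {hours:.0f} h, "
              f"about {years:.1f} years powered continuously")
```

With these assumed figures, a 2000-hour part rated to 85 degrees gets about 8,000 hours at 65 degrees, while a 2000-hour part rated to 105 degrees gets about 32,000 hours. By this estimate, the two-step temperature upgrade buys a fourfold difference in life.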
Now, I don't actually know the corporate history behind these specific
products (and I'll charitably assume the device was originally
engineered properly), but I have on many occasions observed a similar
problem occur due to the activity of a purchasing department; in brief,
engineering will initially design in the right part for a particular
job, but someone else will later substitute a cheaper version that looks
similar but is missing some critical parameter. Everything works fine
until the units have been out in the field for a while. (I wrote about
this problem at some length in my first "Top five engineering hints..."
article -- see Resources.)
This problem is steadily getting worse as companies become more
globalized.
#4: Modern components are designed for modern mass-production techniques
After reading this heading, you're probably asking yourself, "So what?"
The reason this fact is important is that tolerances for everything --
component placement, PCB trace-space separation, component lead spacing,
and so forth -- are shrinking. The trend towards completely leadless
packages (QFN, MLF, and BGA variants, for instance) is a big part of
this. Unfortunately, one fact that comes along for the ride here is that
dense packages with high pin counts now require multilayer PCBs to achieve fanout. These boards are more difficult to fabricate, and
more susceptible to damage, than older two-layer boards. BGA solder
joints are also notoriously difficult to inspect, and faulty BGA solder
jobs are frequently a source of intermittent failures in fielded
products (see Resources for information on this topic).
Process variations that would have been minor -- perhaps undetectable --
on a 1980s-era production line are very serious issues today, and not
all of these issues are caught by outgoing quality control inspections.
#5: Service information is becoming increasingly difficult to obtain and
less useful when you do manage to find it
Even as late as the mid-1990s, many entry-level consumer television sets
(by way of example) included a schematic diagram and sketches of the
waveforms you would expect to see at various test points in the circuit.
This information is not included with modern sets, but it would be
pretty much useless anyway. Much like every other appliance, a modern TV
set consists of a fistful of generic parts surrounding one or two
application-specific or even product-specific components that are only
available from the set's manufacturer. The rework cost alone to remove
some of these devices is prohibitive, even if replacement parts were
available (which they frequently aren't).
Simply finding a way to probe internal signals and diagnose a fault is a
real challenge on modern equipment. Most of the signal paths are buried
inside custom mixed-signal integrated circuits, and the remainder all
too frequently exit a chip package in the middle of a BGA footprint,
dive directly into an internal layer of the PCB, and only surface again
under another BGA. In between, they travel through those internal layers
and blind vias, quite impossible to probe.
As a result, most modern equipment is serviced (if at all) only on the
subassembly level. To appreciate what sort of costs and anguish this can
generate, you need look no further than the Usenet groups where notebook
computers are discussed. A US$1,000 notebook that suffers a single
component failure on the mainboard may be more or less undiagnosable.
The "fix" is to replace the mainboard, typically at a cost of US$500 or
more. Users who ask for detailed service manuals are frequently
shocked to find that even the most detailed "secret" service manuals
provided to authorized repair centers only cover the most basic sort of
subassembly-level disassembly and troubleshooting.
#6: The feedback loop between customers and engineering is often
nonexistent today
In today's hi-tech marketplace there are many "shake and bake"
companies that scarcely manufacture a single product. All of the
manufacturing, and frequently the design, are outsourced. This is
not inherently a bad thing, but unfortunately it means that the design
flow is all one-way. Particularly in the case of fast-moving products
like personal computers, there may be no feedback given to engineering
on the success or failure of its latest design.
One example of this I encountered some time ago is a certain
little-known brand of laptop computer. If you were to browse
through eBay and similar venues, you would see that practically every
machine of this model up for sale has major damage to the bottom
plastics around the LCD hinges. The reason for this is that the hinges
are brittle, cast parts. They have two connection points: one screw that
goes up the center of a post, and another screw that goes through a tab
projecting out the side of the post. As long as both of these attachment
points are secure, everything is fine. Unfortunately, over time the
following events occur (apparently inevitably):
1. The hinges collect dirt, corrode, or the lubricant is displaced. The
hinge mechanism begins to stiffen.
2. The side-projecting tab on each hinge breaks off the post.
3. Enormous stress (during closing and opening of the LCD) is concentrated
on the screw that runs through the bottom of the post.
4. The bottom of one post breaks out of the back of the lower plastic.
5. The other post eventually breaks out as well.
The mechanical design of this particular computer was developed by an
outside design house. If it had been an internally generated design, direct
customer feedback might have filtered back to engineering and
led to design improvements.
Putting it all together
In conclusion then: While individual components are certainly becoming
more reliable (on a "failures per transistor" rate, anyway), the
complexity of applications and interconnects, the high density of most
modern electronics, commercial priorities of the manufacturer, and
administrative complications around the production process, are draining
away a large proportion of the reliability benefits. Field
serviceability of consumer electronics (in particular) has also reached
near-zero.
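
The tension between better parts and more of them can be shown with the simplest textbook reliability model. The sketch below assumes a "series" system in which every part must work and parts fail independently; the per-part reliabilities and part counts are invented for illustration and are not measurements of any real product.

```python
def system_reliability(part_reliability: float, part_count: int) -> float:
    """Probability that every part survives, assuming independent failures
    and a series system (any single part failure kills the product)."""
    return part_reliability ** part_count


if __name__ == "__main__":
    # Hypothetical older design: fewer parts, each less reliable.
    old = system_reliability(0.999, 100)
    # Hypothetical modern design: parts ten times less likely to fail,
    # but a hundred times as many of them.
    new = system_reliability(0.9999, 10_000)
    print(f"older design:  {old:.3f}")
    print(f"modern design: {new:.3f}")
```

Under these made-up numbers the older design survives about 90% of the time and the modern one under 40%, even though each individual part is ten times less likely to fail. Real products are not pure series systems, so treat this as a way to reason about the trend rather than as a prediction.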
The idea that modern processes and components have led to a net increase
in product longevity seems, therefore, to be something of a sophism. The
consequences of this go considerably beyond a few dead DVD players.
While my intention is not to climb on a soapbox about the deleterious
environmental effects of a consumption-driven culture, the cost in energy and raw materials is enormous when we manufacture devices
that require a large entropy input (so to speak) to build and then become
useless junk shortly afterwards. I'm not even counting the march of
progress making appliances obsolete, either; that's a whole separate story.
Resources
About the author
Lewin A.R.W. Edwards works for a Fortune 50 company as a wireless security/fire safety device design engineer. Prior to that, he spent five years
developing x86, ARM and PA-RISC-based networked multimedia appliances at
Digi-Frame Inc. He has extensive experience in encryption and security
software and is the author of two books on embedded systems development.
He can be reached at sysadm@zws.com.