Don't let these disasters happen to you: A pox on modern engineering, Part 2

Service life and the design cycle

Level: Intermediate

Lewin Edwards (sysadm@zws.com), Design Engineer, Freelance

14 Nov 2006

While per-transistor failure rates may be down, overall reliability hasn't improved as much as people sometimes assume, and modern systems are often much harder to repair than older ones. Following up on a previous article, Lewin Edwards reviews more of the problems modern engineers face.

Most people reading this article will be familiar with operating system wars, and some may even have been combatants (I was a Team OS/2 member myself). The frontline combatants in these wars frequently make statements like, "Operating system A is cheaper than our product, but the total cost of ownership is much higher because A needs more support." These statements are quasi-useless because they are generally backed up by extremely subjective data collected under minutely controlled, nay contrived, conditions.

On the other hand, most people unquestioningly accept as a modern truth that mass-produced goods manufactured today are vastly more reliable than "mass-"produced (or even artisan-made) goods of previous decades. Many people also believe that modern industrial machinery is much more efficient at producing consumer goods with less waste than in times of yore. My goal in this article is to show you the potential fallacy of some of these assumptions. In fact, when you look closely, the above ideas are just as much smoke-and-mirrors arguments as the weaponry used in operating systems warfare. I'll stick to the tried-and-true list format that most of you enjoyed so much in the recent "Top five engineering hints you'll rarely hear" articles (see Resources).

The underlying theme in my arguments below is that commoditization itself is a force that works against reliability. When single transistors cost multiple dollars apiece, they and the designs that contained them were tested with relative care. Now, we are rapidly approaching the point where it is cheaper to build excess stock of a product to service warranty claims than it is to test the product thoroughly before it leaves the factory.

With that said, here are my specific cases:

#1: Sustained high production volume from an established vendor is not, by itself, a guarantee of quality

I want to get this point out of the way, because I have observed a lot of magical, quasi-teleological thinking from people who ought to know better. Many people seem to use the "ten million flies can't be wrong" philosophy when they're thinking about product reliability. The underlying cognitive process apparently goes something like this:

Manufacturer XYZ is making and selling a million units a day. Manufacturer XYZ is still in business after many years. Therefore, the output quality of the production process must be good.

Refuting this argument is easy, at least with respect to electronics, by looking at a contradictory example from an industry that most of you are probably unfamiliar with, at least from an engineering perspective: electronic toys. (I used to work in this field many years ago; in Resources you'll find a link to an article I wrote in early 2001, describing the design process for a simple electronic toy. It's really another world from "real" embedded engineering.)

The microcontrollers used in most speaking toys contain fairly large memory arrays to hold all that recorded speech data; sixteen megabits is not unheard of (though of course, many parts are much smaller). That's an awful lot of die area where production errors could occur. These parts are also practically always shipped as raw dice, which are mounted directly on the target PCB, bonded out and encapsulated, often by hand.

Electronic toys are made in enormous volume. A toy company would rarely consider making a mass-market toy unless the anticipated volumes were at least a couple of hundred thousand pieces. You'd think they use the best possible components to keep their return rates down and their margins up, right?

Wrong. Toys are profitable because they are built using the cheapest processes and materials that money can (grudgingly) buy. In fact, the ICs used in these applications are sometimes not even sample-tested by the foundry; instead, standard industry practice is to ship an overage of as much as 10% to compensate for defects. Buying these chips is a bit like bidding on an "as-is" auction on eBay. Long-term reliability isn't even characterized, let alone guaranteed.
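To put the overage figure in perspective, here is a rough sketch of the arithmetic in Python. It assumes defects are spread evenly across a shipment and that every bad die is caught during assembly. The function names and the sample quantities are illustrative only and are not taken from any foundry's terms.

```python
import math


def max_defect_rate_covered(overage: float) -> float:
    """Largest defect fraction that an order padded by `overage` can absorb
    while still yielding the nominal quantity of good parts."""
    # shipped = nominal * (1 + overage); good = shipped * (1 - defect_rate)
    # Setting good == nominal and solving for defect_rate gives:
    return 1.0 - 1.0 / (1.0 + overage)


def dice_to_ship(good_needed: int, defect_rate: float) -> int:
    """Untested dice to ship so that, on average, `good_needed` of them work."""
    return math.ceil(good_needed / (1.0 - defect_rate))


if __name__ == "__main__":
    # A 10% overage covers a defect rate of a little over 9%.
    print(f"{max_defect_rate_covered(0.10):.1%}")
    # Hypothetical run of 200,000 toys with 5% of dice defective.
    print(dice_to_ship(200_000, 0.05))
```

In other words, under these assumptions a 10% overage quietly tolerates roughly one bad die in eleven.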

#2: Requirements that are not driven by engineering goals are constantly injected into the design process

The ramifications of these requirements are almost always ill understood, and the consequences are therefore unpredictable.

Of course, this sort of annoyance has been going on since long before Leonardo da Vinci bowed to the demands of his patrons. Unfortunately, the consequences are becoming increasingly serious, partly because of their (initial) invisibility and unpredictability, and mainly because everybody's life is governed to a much greater degree by "high" technology than it was a couple of centuries ago. RoHS is perhaps one of this era's best examples of massive legal changes to the engineering process, spearheaded by people utterly unable to analyze the consequences of these changes. You'll find a link in Resources to a fairly recent Standards and Specs article here on developerWorks, which references the tin whisker problem with lead-free electronic assemblies. Briefly, the problem is that hair-like metal crystals grow out of tin-plated surfaces, bridge adjacent component leads, and eventually short them out. This problem, which has been well known for many years, can be mitigated to a great degree by adding lead to solder mixtures. The RoHS directive removes that preventive measure. Moreover, many of the substances used in RoHS components are handily toxic in their own right; the choice of which materials to ban appears to have been made on a "how long is a piece of string?" basis.

Another issue with RoHS compliance, which people don't discuss quite as much as the vaunted whisker problem, is that lead-free solders just don't behave the same way as normal solders. They don't flow and fillet the same way as leaded solders, they require significantly hotter soldering profiles, and they result in joints that have a frosted, "cold" appearance. Inspection of these joints requires retraining personnel, and the fiery soldering profiles are a real process headache for some types of components.

As a result of both of these issues, you'll be seeing a large number of prematurely deceased consumer appliances over the next few years, either due to whiskers or to production-related issues; what price reliability?

A good analogy to this situation, by the way, is the removal of tetraethyl and tetramethyl lead from gasoline mixtures. These compounds increase the octane rating of fuels, and they are referred to as anti-knock additives. In 1979, these lead-containing compounds were phased out in favor of MTBE (methyl tertiary-butyl ether). Unrelated environmental legislation changes in 1992 caused the amount of MTBE in gasoline blends to increase significantly. Unfortunately, about 20 years after the large-scale introduction of this fascinating compound, it was determined that MTBE is making its way into the water table, and, like everything else in the known universe, it may cause cancer in laboratory rats. MTBE is therefore being replaced with a dash of ethanol (in some areas, this change is mandatory, but it's likely that the oil industry will permanently change all its formulations nationwide to avoid liability issues that have grown up surrounding MTBE). Long term, ethanol is probably one of the least unwholesome additives we're ever going to find, but it has the side effect of lowering fuel economy, resulting in greater net pollution levels.

#3: Purchasing and engineering are frequently separated

In fact, purchasing and engineering are frequently not even in the same country! This leads to a lack of communication, and odd mistakes occur, despite procedural barriers.

Separating the component procurement and engineering arms of a hi-tech manufacturer leads to problems that might not become apparent to quality control, but are very evident when you look into the longevity statistics for the company's products.

Take a look around your junk pile. Chances are, you'll find one or more dead el-cheapo DVD players in there. (Not related to this article, I took a quick survey in a couple of rows of cubes in the engineering department where I work during the day and found that 31 such units were sitting in the basements of 12 engineers.) I picked DVD players, by the way, because these are very large-volume consumer articles that also happen to include some high-speed digital and mixed-signal electronics, as well as a switching power supply and fairly complex precision mechanical components; they're a veritable cornucopia of miscellaneous parts, and hence we can expect to see a plethora of failure modes represented in such an appliance.

As part of this unscientific study, I acquired 14 of the units in question and inspected them. One unit had a lubrication problem on the worm gear driving the laser head, and one unit was just dead -- I observed address and data bus activity on power-up, but the unit didn't boot past a certain point. I suspect that the flash device holding the firmware was corrupted.

All 12 of the remaining units had been retired due to failed electrolytic capacitors in their power supply sections. Interestingly, every one of the failed capacitors for which I was able to find a datasheet was a standard-temperature-grade, standard-life-span part rated to only +85 degrees Celsius. These bottom-of-the-range parts are typically rated for 1000 or 2000 hours of operation at their maximum temperature (the life span increases if your application doesn't get quite so warm). Conversely, if you exceed the temperature rating, the expected life span plummets. This sort of application would more typically use extended-temperature parts (+105 degrees Celsius).
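The temperature dependence described above is commonly approximated by the "10-degree rule": expected electrolytic capacitor life roughly doubles for every 10 degrees Celsius the part runs below its rated temperature, and halves for every 10 degrees above it. The Python sketch below applies that rule of thumb. The rule, the function name, and the operating temperatures are assumptions for illustration, not measurements from the players described here, and a real design should use the capacitor manufacturer's own lifetime formula.

```python
def estimated_life_hours(rated_hours: float,
                         rated_temp_c: float,
                         actual_temp_c: float) -> float:
    """Rule-of-thumb life estimate for an aluminum electrolytic capacitor.

    Assumes life doubles for each 10 degrees C below the rated temperature
    and halves for each 10 degrees C above it (an approximation only).
    """
    return rated_hours * 2.0 ** ((rated_temp_c - actual_temp_c) / 10.0)


if __name__ == "__main__":
    # Hypothetical operating points inside a warm power supply enclosure.
    for label, rated_temp in (("85 C part", 85.0), ("105 C part", 105.0)):
        for actual in (65.0, 95.0):
            hours = estimated_life_hours(2000.0, rated_temp, actual)
            print(f"{label} at {actual:.0f} C: {hours:,.0f} hours")
```

Under these assumptions, a 2000-hour, 85-degree part running at 65 degrees lasts about 8,000 hours, which is less than a year of continuous operation, while a 105-degree part with the same hour rating lasts about 32,000 hours.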

Now, I don't actually know the corporate history behind these specific products (and I'll charitably assume the device was originally engineered properly), but I have on many occasions observed a similar problem occur due to the activity of a purchasing department; in brief, engineering will initially design in the right part for a particular job, but someone else will later substitute a cheaper version that looks similar but is missing some critical parameter. Everything works fine until the units have been out in the field for a while. (I wrote about this problem at some length in my first "Top five engineering hints..." article -- see Resources.)

This problem is getting steadily worse as companies become more globalized.

#4: Modern components are designed for modern mass-production techniques

After reading this heading, you're probably asking yourself "so what?". The reason this fact is important is that tolerances for everything -- component placement, PCB trace-space separation, component lead spacing, and so forth -- are shrinking. The trend towards completely leadless packages (QFN, MLF, and BGA variants, for instance) is a big part of this. Unfortunately, one fact that comes along for the ride here is that dense packages with high pin counts now require multilayer PCBs to achieve fanout. These boards are more difficult to fabricate, and more susceptible to damage, than older two-layer boards. BGA solder joints are also notoriously difficult to inspect, and faulty BGA solder jobs are frequently a source of intermittent failures in fielded products (see Resources for information on this topic).

Process variations that would have been minor -- perhaps undetectable -- on a 1980s-era production line are very serious issues today, and not all of these issues are caught by outgoing quality control inspections.

#5: Service information is becoming increasingly difficult to obtain and less useful when you do manage to find it

Even as late as the mid-1990s, many entry-level consumer television sets (by way of example) included a schematic diagram and sketches of the waveforms you would expect to see at various test points in the circuit. This information is not included with modern sets, but it would be pretty much useless anyway. Much like every other appliance, a modern TV set consists of a fistful of generic parts surrounding one or two application-specific or even product-specific components that are only available from the set's manufacturer. The rework cost alone to remove some of these devices is prohibitive, even if replacement parts were available (which they frequently aren't).

Simply finding a way to probe internal signals and diagnose a fault is a real challenge on modern equipment. Most of the signal paths are buried inside custom mixed-signal integrated circuits, and the remainder all too frequently exit a chip package in the middle of a BGA footprint, dive directly into an internal layer of the PCB, and only surface again under another BGA. In between, they travel through those internal layers and blind vias, quite impossible to probe.

As a result, most modern equipment is serviced (if at all) only on the subassembly level. To appreciate what sort of costs and anguish this can generate, you need look no further than the Usenet groups where notebook computers are discussed. A US$1,000 notebook that suffers a single component failure on the mainboard may be more or less undiagnosable. The "fix" is to replace the mainboard, typically at a cost of US$500 or more. Users who ask for detailed service manuals are frequently shocked to find that even the most detailed "secret" service manuals provided to authorized repair centers only cover the most basic sort of subassembly-level disassembly and troubleshooting.

#6: The feedback loop between customers and engineering is often nonexistent today

Today's hi-tech marketplace contains many "shake and bake" companies that scarcely manufacture a single product. All of the manufacturing, and frequently the design, is outsourced. This is not inherently a bad thing, but unfortunately it means that the design flow is all one-way. Particularly in the case of fast-moving products like personal computers, there may be no feedback given to engineering on the success or failure of its latest design.

One example of this I encountered some time ago is a certain little-known brand of laptop computer. If you were to browse through eBay and similar venues, you would see that practically every machine of this model up for sale has major damage to the bottom plastics around the LCD hinges. The reason for this is that the hinges are brittle, cast parts. They have two connection points: one screw that goes up the center of a post, and another screw that goes through a tab projecting out the side of the post. As long as both of these attachment points are secure, everything is fine. Unfortunately, over time the following events occur (apparently inevitably):

1. The hinges collect dirt or corrode, or the lubricant is displaced.
2. The hinge mechanism begins to stiffen.
3. The side-projecting tab on each hinge breaks off the post.
4. Enormous stress (during closing and opening of the LCD) is concentrated on the screw that runs through the bottom of the post.
5. The bottom of one post breaks out of the back of the lower plastic.
6. The other post eventually breaks out as well.

The mechanical design of this particular computer was developed by an outside design house. Had it been an internally generated design, direct customer feedback might have filtered back to engineering and led to design improvements.

Putting it all together

In conclusion then: While individual components are certainly becoming more reliable (on a "failures per transistor" rate, anyway), the complexity of applications and interconnects, the high density of most modern electronics, the commercial priorities of the manufacturer, and administrative complications around the production process are draining away a large proportion of the reliability benefits. Field serviceability of consumer electronics (in particular) has also reached near-zero.
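One way to see how complexity can cancel out per-component gains is the textbook series-reliability model, in which a system works only if every one of its parts works and failures are independent. The Python sketch below is a simplification under exactly those assumptions. The part counts and survival probabilities are made-up numbers chosen only to show the effect.

```python
def system_reliability(component_reliability: float, component_count: int) -> float:
    """Probability that a series system survives, assuming every component
    must work and component failures are independent."""
    return component_reliability ** component_count


if __name__ == "__main__":
    # Hypothetical older design: few parts, each moderately reliable.
    old = system_reliability(0.999, 100)
    # Hypothetical modern design: parts fail 100 times less often,
    # but there are 100 times as many of them.
    new = system_reliability(0.99999, 10_000)
    # The same improved parts in a design that is five times larger again.
    bigger = system_reliability(0.99999, 50_000)
    print(f"old: {old:.3f}  new: {new:.3f}  bigger: {bigger:.3f}")
```

With these invented figures, a hundredfold improvement in each part is entirely consumed by a hundredfold increase in part count: both designs come out at about 0.905, and the larger design fares considerably worse.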

The idea that modern processes and components have led to a net increase in product longevity seems, therefore, to be something of a sophism. The consequences of this go considerably beyond a few dead DVD players. While my intention is not to climb on a soapbox about the deleterious environmental effects of a consumption-driven culture, the cost in energy and raw materials is enormous when we manufacture devices that require a large entropy input (so to speak) to build and then become useless junk shortly afterwards. I'm not even counting the march of progress making appliances obsolete, either; that's a whole separate story.



Resources



About the author

Lewin A.R.W. Edwards works for a Fortune 50 company as a wireless security/fire safety device design engineer. Prior to that, he spent five years developing x86, ARM and PA-RISC-based networked multimedia appliances at Digi-Frame Inc. He has extensive experience in encryption and security software and is the author of two books on embedded systems development. He can be reached at sysadm@zws.com.



