The Internet has exactly one and a half zillion articles (I counted them) containing lists of things to do and not to do in embedded systems design. Many of these articles focus on well-understood topics like switch debouncing and how to estimate maximum stack depth. Dozens of these articles quote our old favorite embedded disasters: Therac, Ariane, and the Mars Polar Lander. This article gives you a few choice tidbits of advice you won't find mentioned quite so frequently in other places. It also includes some anecdotes that will show you just how easily your work life can turn into the inspiration for a Dilbert cartoon if you're not careful. (Having just finished writing a book on related topics, I'm brimming with anecdotes I didn't have room to print earlier.)
Practically nothing can be done effectively in a large company unless a defined formal process for achieving the goal is in place. You might not like this fact, but it's true. As an engineer, you're working with a head full of specialized knowledge while you design your widget, but at the other end of the equation there's more than likely a factory employee or even a subcontractor following "put flap A in slot B" instructions. The huge, bureaucratic, and often exasperating machine that turns your schematic into a finished product consists largely of formal processes that turn engineering documents into flap A, slot B manufacturing and quality control documents.
While a product is in the development stage, engineering needs to be able to make quick and sometimes speculative changes. Accordingly, processes that support engineering development need to be simple and streamlined. It needs to be easy for engineering to request a build exception -- for example, "build 50 of these boards; 25 matching the schematic, and the other 25 with different resistor values at certain locations."
On the other hand, when something is in actual large-scale production, you want to be very leery indeed of making changes. Any change to the bill of materials -- say, removing a resistor from your board -- that appears minor from the engineering standpoint affects a huge number of production processes. The schematic has to be changed, and the bill of materials needs to be updated so manufacturing knows how many of that resistor to order to meet each month's demand. Purchasing might need to renegotiate contracts with the resistor vendor, or return excess inventory. The pick-and-place programming needs to be altered so the head won't try to place that part. The in-circuit tester needs to be told not to check for that component. In some cases, optical inspection equipment might need retraining to recognize good boards. Depending on where that resistor is, and what the product is, you might also need to resubmit the device for type approval. This can be amazingly expensive, and certain regulatory approvals, even for simple consumer equipment, can take a flabbergastingly long time (considerably more than a year in some cases).
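The ripple effect of a single BOM change can be sketched as a checklist generator. This is a minimal illustration, not a real change-control tool: the BOM format (reference designator mapped to a house part number), the sample part numbers, and the function name `change_impact` are all hypothetical, and the task list simply restates the steps named above.

```python
# Hypothetical sketch: diff two BOM revisions and list the downstream
# work each difference triggers. Task wording paraphrases the article;
# nothing here models a real change-control system.

DOWNSTREAM_TASKS = [
    "update schematic",
    "update BOM and monthly demand quantities",
    "purchasing: renegotiate contract or return excess inventory",
    "reprogram pick-and-place",
    "update in-circuit test program",
    "retrain optical inspection (if used)",
    "check whether type approval must be resubmitted",
]


def change_impact(old_bom, new_bom):
    """Return {reference designator: (kind of change, tasks)} for every difference."""
    impact = {}
    for ref in sorted(set(old_bom) | set(new_bom)):
        before, after = old_bom.get(ref), new_bom.get(ref)
        if before == after:
            continue  # unchanged position, no downstream work
        if after is None:
            kind = "removed"
        elif before is None:
            kind = "added"
        else:
            kind = "changed"
        impact[ref] = (kind, list(DOWNSTREAM_TASKS))
    return impact


if __name__ == "__main__":
    old = {"R1": "HPN-1001", "R2": "HPN-1002", "U1": "HPN-2001"}
    new = {"R1": "HPN-1001", "U1": "HPN-2001"}  # R2 deleted
    for ref, (kind, tasks) in change_impact(old, new).items():
        print(f"{ref} {kind}: {len(tasks)} downstream tasks")
        for task in tasks:
            print("  -", task)
```

Even in this toy form, deleting one resistor produces seven follow-up items, which is the point: the engineering change is the cheap part.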
Here's a (true) story of what happens when the wrong process rules are applied to a situation. A certain product was required to carry a type approval number on its injection-molded plastic housing. This approval number was on the outside of the housing, visible to the end user, and for regulatory reasons it had to change every time the circuit changed. On the interior (invisible) face of the part, standard procedure was to mold the part number and revision, among other information. Quality control would look at that internal part number to determine how to stock these parts as they were delivered; in other words, that internal part number was the deciding factor as to whether these parts would be received, and appear in the inventory control system, as version A or version B. The device was in full-scale production and all was well. Then, somebody started to develop an enhanced version of the product, with a new circuit board and hence a new approval number. Somehow, the engineering prototype changes were applied to the live tool for the production plastic parts, bypassing all the normal safeguards on this sort of thing -- a classic example of engineering exceptions being injected into the production process by mistake.
I'll call the old plastic version "A" and the new version "B." Once the mold had been changed, it was no longer possible to make version A. Unfortunately, there was enough inventory of version A in the system that, by the time purchasing noticed a short supply, the mold had already been converted to version B. Customers were screaming for product, and marketing people were scurrying to and fro in panic mode.
The engineering lead on the product was called in to solve the problem. A self-adhesive label with the correct information was suggested, but there were several complications:
- The existing lettering on the product was raised, not engraved. This made it hard to ensure that a label would stay on properly.
- There was a screw hole inside the text window on this product.
- The screw hole had to be accessible because the end user would need to remove the screw in order to change the battery.
- There wasn't time to get a custom die-cut label specified and produced -- it was necessary to find a solution that would use a stock Avery label. This raised an intractable problem with trying to place the label accurately enough to ensure that whatever was printed on it would not be obliterated by the screw.
- The manufacturing facility was in another country with a different native language, introducing delays and training problems.
- The plastics vendor was in a third country and a radically different time zone.
Eventually -- after numerous video conferences and several full person-weeks of work -- it was decided to stick a small label over just the approval number. Labels were printed and sent to the factory. The factory guys ran a test batch and responded that the process seemed to work. They then asked respectfully why engineering required them to stick on a label when the label said the same thing as the plastic. (Cue surprise and amazement in engineering.)
It turned out -- after the factory sent photos of the parts they had on the shelf -- that the vendor of the plastic part had updated the inside of the mold to show the new "B" version, but had forgotten to update the actual visible text on the exterior, so the parts were externally correct, despite being labeled and stocked as the wrong thing. This was clearly a case of two wrongs making a right, but the waste of engineering, marketing, and manufacturing resources was horrific. Had correct procedures been followed, this engineering development change would not have been permitted to impact production tooling.
This one is somewhat self-explanatory. It really boils down to "measure twice, cut once." Many sorts of products -- in fact, the vast majority of embedded systems -- need to have regulatory type-approval testing: FCC, CE, UL, and so forth. Many very talented people have spent entire lifetimes trying to develop standardized tests that can be described completely and reproduced exactly at different sites. Unfortunately, nobody has unarguably reached that goal yet for a lot of tests (particularly tests that measure signal propagation: sound pressure, radio, magnetic field strength, and so on). Worse, many of these tests require expensive and complicated equipment, and they yield results that can be terribly site-specific and highly counter-intuitive.
What this means to you is that it can be very hard to establish a high confidence level in test results generated internally, especially for the types of tests I highlighted above. This is an annoyance for new project development because it means you have to wait for official test results before you can release the product. However, it can be a very serious and costly problem if you have to revise something that's currently in production. Say you have a component that suddenly goes obsolete without much warning or opportunity for an end-of-life buy. (This happens more frequently than you might think.) For some sorts of type approval, there's a process where you can submit revised samples and paperwork and start shipping the modified product immediately as long as you have in-house test data to support its continued compliance. Great, if your new design really does pass testing after your paperwork gets through the queue -- but in the case of some of these "fuzzy" tests, there's a significant chance that it won't. You're then in a bad position because you've shipped some number of units that carry the applicable approval logo, but are known not to be compliant. This often means product recalls, enormous costs, and pain.
It's usually desirable to perform your type-approval tests in-house if possible, because you won't be subjected to queue time at the regulatory body (so, if your device fails the first time around, you can investigate why and modify the design quickly). Sometimes, regulators will be happy with this approach if you simply let them send around an invigilator to watch the tests in progress. Usually, however, you need to undergo some very complicated site certification before your test results can be considered valid, in which case you need to weigh the cost and benefit of such certification the same way you might consider the purchase of a large and expensive piece of test equipment.
A reasonable second-best to doing the testing in-house is to contract a certified third-party laboratory to do the testing for you. Although this is not typically cheap, the third-party labs will usually give you a lot of detail on any problems they encounter during the process. In some cases, they'll even work with you to tweak your design, providing intelligent recommendations on how to pass. This can be more than worth the money if you don't have a lot of domain knowledge about the particular type approval being sought.
I don't have a particularly funny anecdote about the problems you can get into here (at least, not one I can mention in public), but I have seen them crop up more than once. For example, at a now-defunct company with which I'm familiar, a product needed to have UL and CE logo testing. Units were sent off to a third-party lab, and a separate group, not part of the normal product development engineering crew, was working on the approval project. After about eighteen months of futzing, the approval crew proudly delivered an approvable prototype covered in copper tape and bristling with ferrite beads on every wire. Unfortunately, the device in question was based around a PC motherboard. PC peripherals have a much shorter life cycle than eighteen months, so there was literally not a single internal component of the approvable version that could still be purchased.
Remember the movie Gremlins, featuring cute fluffy creatures that turn into evil demonic beasts if you feed them after midnight? One of the rules for owning a gremlin (in its fuzzy, lovable form) was "But the most important thing, the thing you must never forget... no matter how much they cry, no matter how much they beg, never, never feed them after midnight!"
Marketing and Sales personnel can be very similar to this. Volumes have been written about the relationship between marketing and engineering, but I'd like to focus on the line between keeping marketing happy and keeping your own sanity.
Marketing -- and even more so, Sales -- has a lot of direct customer contact. As such, they hear a lot of feedback and get a lot of requests for special products. The problem is that the design and development process in a large company is like one of those dinosaurs with such a slow nervous system that it needed a brain in its tail so the head wouldn't eat the other end by accident. The information that your marketing people are receiving from customers is mostly only useful in a tactical context -- Fred wants a product now that has twice as much battery life; Sue needs a version of your product made of baby-blue plastic to score a specific contract.
Tactical information is of little or no value to engineering, because product development is so slow (in most high-tech corporations, typically between one and two years) that it is inherently strategic. Fred or Sue cannot realistically get what they want in time for it to be useful to them. Marketing and sales personnel who really know their stuff realize this and realize that their job is to aggregate opinions like Fred's and Sue's and provide strategic steering that says, "People need more battery life" and "There's a better market for this product if we can customize the colors."
The place where this can bite engineering the hardest is that critical window just after a specification is defined, but before development has started. (After development has started, any specification change should be forced to go through review and approval. Just Say No to feature creep. You have a perfectly legitimate comeback to Marketing: development has started already!) Inside the aforementioned window, there's a constant danger that Marketing is going to mention the product at a focus group and get "constructive" input, which they will then attempt to back-door into the specification without proper review. All of a sudden you'll find yourself at the end of the project with unresolvable quality assurance (QA) problems due to conflicting goals. Perhaps the real answer here is to start development as soon as you can after the specification is finalized. That way, you close the window quickly!
All companies that exceed a certain critical point in production volume have an internal part numbering system. This facilitates tasks such as approving second sources for parts. If a second source appears, the specification and approved vendor list for the house part number are updated; the schematics that reference the part number do not need to be updated.
Unfortunately, there is almost no such thing as a guaranteed 100% drop-in substitute for a given component, even in those cases where a manufacturer explicitly claims drop-in replacement for some other company's product. You really need to examine every application where the part is used before you can declare that a new version is compatible. Some of this examination will be cursory (for example, a 10K pullup resistor on a micro input line probably isn't going to need much analysis), but you need to invest the effort to think about every application of the part because occasionally you will run into some very special situations.
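One way to make "examine every application" concrete is a where-used record that refuses to treat a second source as approved until each application has been signed off. The sketch below is only an illustration under assumed names: `HousePart`, `propose_second_source`, `sign_off`, and the sample part numbers and product names are hypothetical, not the API of any real parts database.

```python
from dataclasses import dataclass, field


@dataclass
class HousePart:
    """Hypothetical model of a house part number and its where-used list."""
    number: str
    approved_vendors: set = field(default_factory=set)
    used_in: set = field(default_factory=set)    # every application of the part
    pending: dict = field(default_factory=dict)  # candidate -> applications not yet reviewed

    def propose_second_source(self, vendor_part):
        # Every existing application must be reviewed, even if only cursorily.
        self.pending[vendor_part] = set(self.used_in)

    def sign_off(self, vendor_part, application):
        remaining = self.pending[vendor_part]
        remaining.discard(application)
        if not remaining:
            # Only now does the candidate join the approved vendor list.
            del self.pending[vendor_part]
            self.approved_vendors.add(vendor_part)

    def is_approved(self, vendor_part):
        return vendor_part in self.approved_vendors


if __name__ == "__main__":
    relay = HousePart("HPN-3001", {"VendorA-123"}, {"Product X", "Product Y"})
    relay.propose_second_source("VendorB-456")
    relay.sign_off("VendorB-456", "Product X")
    print(relay.is_approved("VendorB-456"))  # False: Product Y not reviewed yet
    relay.sign_off("VendorB-456", "Product Y")
    print(relay.is_approved("VendorB-456"))  # True
```

The design choice worth noting is that approval is tied to the where-used list, not to the datasheet comparison: a part that matches on paper still stays pending until someone has looked at each place it is fitted.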
Many years ago, I was involved in designing a very compact product that used a slightly magical (hard-to-find) latching relay in its circuit. This relay was the largest component in the design, and its size was the driving factor for the spacing between two circuit boards and hence the entire size of the product's housing. The relay was specified and purchased without incident, and over the course of a couple of years, other products were designed around the same relay.
One day, a rep wandered into the building and offered Component Engineering a cheaper, pin-compatible version of the same relay. All the specified dimensions and ratings were the same or better, and the price was much cheaper; there was much rejoicing. This jubilation lasted right up to the moment at which the field failure rate on my product went through the roof. Units were coming back with cracked joints under a ball-grid array (BGA) chip on one of the boards.
I'll skip over about four months spent looking at boards under an X-ray machine and analyzing purported manufacturing issues and cut right to the punch line. It transpired that there was a via (a hole in the board connecting one layer to another) on the PCB immediately above the relay, which caused a solder blob to sit slightly proud of the plane of the board in that spot. On the relay we originally specified, there was a dimple in the plastic shroud, exactly fitting the position of that via. That dimple wasn't specified in the drawing for the relay, and it didn't appear in our 3D model of the part -- it was just one of those serendipitous things. Unfortunately, the replacement relay had a piece of mold sprue instead of the dimple, and it protruded in exactly the right place to hit the via on the board above it. When the unit was assembled, this stump of sprue hit the via and flexed the upper board slightly. The BGA part in question lay right on the flexure line and had mechanical stress transmitted directly to its balls -- leading to premature failure.
I'm sure you've heard the old axiom that it is impossible to design an idiot-proof device, because nature continues to develop better idiots. Unfortunately, the general idea stated there is applicable even to highly technical people. With power electronics, my two pieces of advice are:
- Design connectors and switches so that it is impossible to miswire things.
- Recognize that the impossible is inevitable and design the circuit so that it will survive miswiring.
I have about a million stories on this topic, more's the pity, but here's one of my particular favorites. I was working with a product (not my own) that had a lead-acid backup battery. The board had two sets of spade lugs on it to support two batteries in parallel if necessary. The positive lugs were mounted vertically; the negative lugs ran horizontally. Investigating an unrelated problem, I watched a QA technician set up the board. She plugged the negative lead into one of the negative lugs, then brought the positive lead from the battery up to the other negative lug on the board! As it approached, it spat sparks and the technician jumped. Then, gathering her resolve, she pushed the lead firmly onto its connector. The battery cables were instantly called upon to deliver several dozen amps; they fried off their insulation, burned a pattern on the desk, and sat there glowing sullenly until someone managed to tug them off the battery with a pair of pliers. The coolest part was the way a perfectly defined smoke shape went up to the ceiling, preserving the exact shape of the cables as they were lying on the desk (highly theatrical).
The failures here were:
- Connectors for positive and negative battery straps were not keyed.
- There was no diode on the inputs to protect against this sort of thing. (There was a reverse battery protection diode, but it was downstream from the input terminals; the terminals were just hardwired in parallel. Adding a single diode and altering the wiring a little would have prevented the problem I described.)
- Almost an incidental issue, but the technician's training was obviously somewhat lacking.
Don't let this happen to you.
In this article, I've covered the absurd, the annoying, and the physically dangerous, and have given you a few hints for avoiding all of those sorts of situations. Most of what I've said here is aimed at engineers working in large corporations. In the next article in this series, I'll give the same sort of treatment to a few issues that more directly affect freelancers and engineers working at small corporations.
- Not all jellybeans are simple resistors. The MIT Jellybean Machine is composed of parts that are considerably more complicated, but nevertheless considered jellybeans.
- Traditional references always talk about the Therac-25
- There's an entire science of analyzing failures in BGA package-to-board solder joints.
- People tell war stories about engineering disasters because it's a great way to learn. Every industry has these; although it's not semiconductor engineering, the delightful archive of heating, piping, and air conditioning war stories HPAC maintains has a lot of relevance to the transition from laboratory testing to the real world.
- It's not just physical engineering; system admins have war stories
Lewin A.R.W. Edwards works for a Fortune 50 company as a wireless security/fire safety device design engineer. Prior to that, he spent five years developing x86, ARM and PA-RISC-based networked multimedia appliances at Digi-Frame Inc. He has extensive experience in encryption and security software and is the author of two books on embedded systems development. He can be reached at email@example.com.