The previous installment in the Don't let these disasters happen to you series listed five not-so-commonly-stated -- yet invaluable -- hints for embedded engineers. Most of those hints were intended to be of interest to the engineer working in a large corporation. In this article, I cast before you another five pearls of wisdom, focusing more on topics that interest the freelance contractor or small engineering shop.
By the way, I wouldn't exactly call it a resource as such, but I heartily recommend the series of autobiographical books by James Herriot, a Yorkshire veterinarian. His hilarious stories of field failures and irksome clients are all animal-related, but there's really not much qualitative difference -- any engineer will recognize elements of these war stories and appreciate the victories.
Without further ado, here are my next five slightly off-the-beaten-track pieces of engineering advice.
Tip 1: Do your own integration
You'll usually save time in the long run if you can deliver a complete, functional system rather than testing pieces of it in vitro and letting the customer integrate them.
Depending on what section of industry you work in, you might find that this piece of advice sounds somewhat gratuitous. Many projects by their nature cannot be developed and tested in vitro; you need to have all the pieces your device will talk to in order to be sure that everything will work together. However, there do occur "simple" projects which -- at least according to the specification -- have inputs and outputs that are easy to simulate and relatively difficult to create for real. Some projects also exist -- and I'm thinking here of my past experience in the electronic toy industry -- where the product as a whole consists of a complex mechanical system with a few electromechanical pieces in it. You can often, for development purposes, simulate such systems by simply putting all the electromechanical pieces on the bench, without the other physical bits to which they are supposed to connect.
I'll illustrate some of the downsides to this approach with a project I completed in the not too distant past. This project was a control system for a moderately complicated piece of automatic manufacturing equipment. While the plant itself was structurally complex (note, by the by, that I am using the word "plant" in its technical jargon sense as employed in control system theory), it was possible to model the machine very easily. The plant consisted entirely of relays, lamps, switches, and push buttons, plus a mono speaker output providing audio feedback. There were no analog sensors or high-speed inputs of any sort; in fact, the customer's original proof-of-concept implementation was executed by a simple relay controller. There was no critical timing to speak of (except audio output); everything was very relaxed.
Since the plant itself was huge, heavy, and a thousand miles away, it wasn't really practical to ship it or build a working copy at my lab. The ideal route would have been for me to travel to the site and work directly there, but I had other commitments that made this impossible. So, I built a simulation fixture in my lab that used push buttons and LEDs to simulate the input and output devices on the real hardware. The integration difficulties I experienced when the customer tried to hook it all up to real machinery fell into four categories:
a) Problems due to real-world analog effects not adequately modeled by my little blinkenlights simulator. For example, there were problems with motor noise getting into the audio output from the controller, as well as a problem with an external high-current relay elsewhere in the machine that wasn't properly snubbed; every time that relay opened, it glitched the DC supply rail enough to reset the controller. These were the problems I expected to have to deal with, the symptoms and fixes were obvious, and they really weren't much trouble to fix remotely.
b) Issues with assembling and wiring the device. Several control boards went up in smoke because the customer's wiring harness had errors. If I had been working on this installation, I would have verified a few wires at a time, then connected just those few wires to the controller.
c) Problems caused by the fact that the customer and I made different assumptions about certain things during troubleshooting, and both of us believed that the other person had made the same assumptions. For example, the project included a removable Flash card and DOS-compatible file system. The customer was having an apparently bizarre problem -- even though the card was almost empty, the machine couldn't create a new log file on it. Some days of research led to the discovery that the customer was trying to store his log files in the root directory of the card, and there were already a large number of small files in the root directory. FAT12 and FAT16 volumes have a fixed (and small) root directory size; no matter how much free space remains, once you've used up the available entries in the root directory, you can't create another file. (This is one of the reasons why digital cameras put their files inside a subdirectory; a subdirectory grows as files are added to it.) It would never have occurred to me to store a large collection of files in the root directory of a Flash card, because this FAT limitation is ingrained in my soul. Note that FAT32 treats the root directory like a subdirectory, so the limitation is very specific to FAT12 and FAT16.
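To make the root directory limit concrete, here is a minimal Python sketch that reads the root-entry count from a FAT boot sector and estimates how many files fit before the root directory is full. The field offset and the 13-characters-per-entry figure come from the published FAT layout; the function names, the file names, and the synthetic boot sector are assumptions made up for illustration and have nothing to do with the customer's actual card. The estimate ignores the volume label and any deleted entries, which also occupy slots.

```python
import math
import struct

LFN_CHARS_PER_ENTRY = 13     # each long-file-name entry holds 13 UTF-16 characters


def root_entry_count(boot_sector: bytes) -> int:
    """Return BPB_RootEntCnt, the fixed number of root directory entries.

    The field is a little-endian 16-bit value at byte offset 17.
    FAT32 volumes store 0 here because their root directory is an
    ordinary cluster chain.
    """
    (count,) = struct.unpack_from("<H", boot_sector, 17)
    return count


def entries_needed(name: str, is_short_name: bool) -> int:
    """Directory entries consumed by one file.

    A plain 8.3 name takes one entry; a long name takes one short entry
    plus one extra entry per 13 characters of the long name.
    """
    if is_short_name:
        return 1
    return 1 + math.ceil(len(name) / LFN_CHARS_PER_ENTRY)


def files_that_fit(boot_sector: bytes, name: str, is_short_name: bool) -> int:
    """How many files with names like `name` fit in the root directory."""
    return root_entry_count(boot_sector) // entries_needed(name, is_short_name)


# Hypothetical FAT16 boot sector: all zeros except a root-entry count of 512,
# a common value for FAT16 volumes.
sector = bytearray(512)
struct.pack_into("<H", sector, 17, 512)

print(files_that_fit(bytes(sector), "LOG00001.TXT", True))
print(files_that_fit(bytes(sector), "machine-log-2006-01-15.txt", False))
```

With these assumed values, 512 short-named files fit, but only 170 files with the 26-character long name do, because each of those consumes three entries. The card can be almost empty in terms of bytes and still refuse to create one more file.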
d) Silly difficulties caused by not being able to oversee what the customer was doing. These were terribly annoying and wasted a lot of time. One example that piqued me in particular revolved around a capacitive proximity sensor used in the machine. The customer required all inputs to be optoisolated, which was no problem, but he also wanted the minimum possible number of connections running to the board. This meant wiring up the LED sides of the optocouplers in groups with a common cathode (or anode) -- which in turn meant that these inputs were polarity-sensitive. Before I laid out the board, I read the datasheet for the customer's specified proximity sensor and decided on the correct polarity. While the boards were being etched and assembled, the customer bought a large number of these (expensive) sensors. When everything was assembled, these sensors didn't work when connected to my control board.
I spent absolutely ages puzzling over this -- the sensor worked fine by itself, but the boards in my customer's hands simply would not recognize it. Of course, the sample I had in my office worked just fine with or without the board. Then one day I was on the phone with the customer trying still more debugging steps, when he casually mentioned that they had actually ordered a slightly different part number from what I had, "but it's probably something quite minor; looks like a revision level." The change in question was that the last five characters of the part number went from AN6X2 to AP6X2. One next-day air parcel later, looking at the wiring diagram printed on top of the new sensor, I realized that the customer had specified an open-collector NPN device, but actually bought an open-collector PNP sensor! Grrr!! Hundreds of boards needed to be hand-patched to cope with the reversed polarity sensor, since the customer couldn't return the ones he had bought.
Tip 2: Customers and end users never tell you what they're really doing
Sometimes they don't even know what they're doing. Your life will be easier -- and your frequent flyer mileage will be lower -- if you include diagnostic features that can give you insight into what's really going on when your product gets behind closed doors with the customer.
Whether you're dealing with well-intentioned individuals cursed with a poor memory, or end users who are actively trying to conceal the addlepated things they do to your products, there can be a lot of frustration before you finally get the "eureka!" moment of successfully debugging a field problem. Here are a few things I've added to past projects that have helped me with bizarre field issues:
- The ability to log engineering data to a removable media type. This is a hugely important feature. You might choose to use a Flash media card (SD or MMC are generally the favorites, because they're very easy to use and only require a few I/O pins), or simply a socketed serial EEPROM device.
- Water-sensitive labels that change color when the product is immersed. These are commonly used on cellular phones; they look like a regular white thermally printed label until they are soaked, at which point they typically turn pale blue. The label is coated with a waterproof layer. Moisture can only ingress through the edges, so the ink won't be triggered by humidity, but only by actual immersion.
- Liberal use of a microcontroller's on-chip thermal diode, if present, to determine and log operating temperature for post-mortem purposes (or even just to display to the operator). In some applications, the internal temperature of the unit can be wildly different from ambient. Having thermal log data will help you see a correlation between failure and internal temperature. These sorts of problems just leap off a graph and into your face when you have the right data collected.
- Use of thermal indicator paint. These paints have an accurately calibrated melting point. You paint a dot onto the unit and it assumes a matte finish after the solvent evaporates. If the temperature point for that paint is exceeded, the paint melts and surface tension causes it to develop a glossy surface which is retained on cooling. These paints have helped me to locate inadequate heatsinks and identify customers who failed to follow airflow guidelines during installation, among other problems.
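As a rough illustration of the engineering-data log in the first bullet, the following Python sketch packs fixed-size records (uptime, internal temperature, event code) with a CRC, so that a half-written or corrupted record can be detected when the card or EEPROM comes back from the field. The record layout, the field names, and the event codes are all assumptions invented for this example; a real product would define its own, and the firmware side would normally be written in C against the same layout.

```python
import struct
import zlib

# Hypothetical record layout: uptime in seconds (uint32), temperature in
# tenths of a degree C (int16), event code (uint16), then CRC-32 (uint32).
PAYLOAD_FMT = "<IhH"
RECORD_FMT = PAYLOAD_FMT + "I"
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 12 bytes, no padding


def pack_record(uptime_s, temp_c, event):
    """Build one log record with a CRC over the payload."""
    payload = struct.pack(PAYLOAD_FMT, uptime_s, round(temp_c * 10), event)
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_record(raw):
    """Return (uptime_s, temp_c, event), or None if the record is damaged."""
    if len(raw) != RECORD_SIZE:
        return None
    uptime_s, temp_tenths, event, crc = struct.unpack(RECORD_FMT, raw)
    if zlib.crc32(raw[:-4]) != crc:
        return None
    return uptime_s, temp_tenths / 10, event


def read_log(blob):
    """Yield every valid record in a dump of the log media, skipping bad ones."""
    for offset in range(0, len(blob) - RECORD_SIZE + 1, RECORD_SIZE):
        record = unpack_record(blob[offset:offset + RECORD_SIZE])
        if record is not None:
            yield record


# Made-up event codes and readings, purely for demonstration.
EVENT_BOOT, EVENT_OVERTEMP = 1, 2
log = pack_record(0, 24.5, EVENT_BOOT) + pack_record(3600, 71.5, EVENT_OVERTEMP)

for uptime_s, temp_c, event in read_log(log):
    print(uptime_s, temp_c, event)
```

Fixed-size records keep the reader trivial and make it easy to treat a small EEPROM as a ring buffer. Logging the temperature alongside every event is what lets a failure-versus-temperature correlation show up later on a graph.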
Tip 3: Never order just one of any untested prototype
In fact, never order just two of any such prototype. When the budget is lean, some people prefer to err on the side of lower up-front expenditure when it comes to prototypes. This is particularly true when the project in question has a high bill-of-materials cost. I can tell you quite unequivocally that this sort of thinking is a false economy. You can identify the newbies in Usenet and other electronics discussion forums because they are the people asking for inexpensive ways of building "just one or two" prototype PCBs. This attitude ignores two basic rules:
- Time -- for instance, time wasted waiting for fresh samples to arrive because you blew up the first one -- is expensive. Gyrations like this can also rapidly erode goodwill with your customers.
- The price of a prototype assembly is dominated by setup costs, and as you order more and more units, the per-unit price goes down dramatically.
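The second rule is simple arithmetic, sketched below in Python. The setup charge and per-board cost are hypothetical round numbers chosen only to show the shape of the curve; they are not quotes from any real board house.

```python
def per_unit_cost(setup_cost, unit_cost, quantity):
    """Total order cost spread across the boards actually ordered."""
    return (setup_cost + unit_cost * quantity) / quantity


# Hypothetical figures: $400 of tooling and setup, $35 of parts and
# assembly labor per board.
SETUP, UNIT = 400.0, 35.0

for qty in (1, 2, 5, 10):
    total = SETUP + UNIT * qty
    each = per_unit_cost(SETUP, UNIT, qty)
    print(f"{qty:2d} boards: total ${total:7.2f}, per board ${each:7.2f}")
```

With these made-up numbers, going from one board to five raises the order total by roughly a third ($435 to $575), while the per-board price falls from $435 to $115. The four extra boards cost far less than a second order and a second wait.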
I could go on about this topic for thousands of words, but in brief, here's the process I follow for all my projects that use a custom printed circuit board. This applies even to little one-off personal hobby projects, by the way.
Assuming that the customer has only one location, I order a minimum of five completely populated boards and one blank board. One of these boards is permanently reserved for my software development purposes, and one of them is permanently reserved as a spare for the same purpose. After initial bring-up, two boards are sent to the customer -- so he or she has a live test board and a spare. If anything happens during integration testing, and rework is required, the customer has a second bite at the cherry if the rework goes bad. The last board is used for testing cut-and-jump type hardware hacks before reworking the "good" boards. (By the end of a project, that board is usually festooned with patch wires.) The blank board is used in the worst-case scenario where I can't get any of the complete boards to work -- in which case I bring up the blank board by hand-assembling it a few parts at a time, testing after each new functional block is installed.
One special exception applies in the case of projects that have big BGA-package chips on them. Since most people -- including myself -- don't have the equipment to assemble or inspect these devices, I have an outside contractor populate those chips. For projects of this kind, I order two blank boards; one has all the BGAs populated, so I can use it for the bring-up operation I described above, and the other is left completely blank and used solely to help with tracing out problems on the live boards.
Tip 4: Treat field returns with extreme caution!
Clients are frequently less than forthcoming with critical need-to-know details about the failure modes they have observed.
Failure to heed this advice has, on occasion, been quite exciting for me. A few years ago, I got a call from a customer saying that he had just built a new batch of a particular product, and these new boards weren't working; he needed me to debug the problem and advise what sort of rework would be necessary. The board in question was entirely machine-assembled surface-mount components, and it wasn't a prototype; the customer had built several batches in the past. When you get the recipe right for a product like this, assuming there are no supplier changes, not much can go wrong. Hence, I assumed that the most likely problem was simply that a wrong reel of parts was loaded onto the pick-and-place machine for this batch. Sniffing out this sort of problem can be annoying, especially if the culprit part is an unmarked surface-mount inductor or capacitor, but it's not usually exciting.
I received a few samples, connected the first one to a power supply of the type used in the real fielded machines and attached the JTAG debugger, powered everything up and... WHOMP! The room was instantly filled from floor to ceiling with drifting, evil-smelling snowflakes. Several large electrolytic capacitors had exploded, and the shredded paper separator was the "snow." (I later found the aluminum can from one of these devices embedded in a ceiling tile. Fortunately I wasn't bending over the device when I turned it on. I never did manage to get the smell of electrolyte out of the room.)
After recovering from this incident, I called the customer to ask him what he'd done to my design, and to explain what had happened to my test unit. He responded laconically that he'd seen the same problem, and "well, it didn't work right after, did it? So, it's a failure -- right?" As it turns out, some salesman from a new contract manufacturing house had dropped by to offer the customer a special introductory rate on PCB etch and assembly. Unfortunately, they inserted their own logo in the board artwork -- which is normal operating practice, by the way -- and violated the design rules for the board. When the device was properly installed in its heatsink, a short developed across the tracks in the PCB house's logo, which is why those electros blew.
In the same vein, I must confess that in my salad days (which were spent in Australia, where the AC line voltage is 240V), I was shocked countless times by devices that the customer had seen fit to "improve." It took me a while to learn from these electrical stimuli just how homicidal my customers were -- I guess I wouldn't make a very talented lab rat -- but eventually I developed enough discipline to follow a strict procedure to avoid this kind of pain.
Let all this be a lesson to you to bring up these sorts of boards slowly and carefully. Use a GFCI (ground fault circuit interrupter) on mains-powered appliances; make it something you can reset from inside your lab, if possible, to avoid a trip down to the electrical panel. For DC-powered appliances, your approach will depend on the design of the power supply. For devices that use linear regulators, your safest path is usually to use a current-limited lab power supply and turn up the current slowly, checking carefully for heat and smoke and actuating the reset hardware frequently as you go. You should have some idea of how much current the board is likely to want when operating normally; if you get far outside this window, you know something's wrong.
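The slow ramp on a current-limited supply can be sketched as a small procedure. This Python version is only an illustration: `set_limit` and `read_current` are hypothetical callbacks standing in for whatever bench supply is in use (or for a person turning a knob and reading a meter), and the step size, the 20% ceiling, and the `FakeSupply` class are assumptions chosen for the demonstration. It does not replace watching for heat and smoke.

```python
def ramp_bringup(set_limit, read_current, expected_min_a, expected_max_a,
                 step_a=0.05):
    """Raise the supply's current limit in small steps.

    Stops as soon as the board draws less than the limit allows (the
    supply is no longer limiting), then checks that draw against the
    expected window. Returns (ok, message).
    """
    ceiling_a = expected_max_a * 1.2          # assumed safety ceiling
    steps = int(ceiling_a / step_a)
    for n in range(1, steps + 1):
        limit_a = n * step_a
        set_limit(limit_a)
        drawn_a = read_current()
        if drawn_a < limit_a * 0.95:
            # The board is taking all the current it wants.
            if expected_min_a <= drawn_a <= expected_max_a:
                return True, f"settled at {drawn_a:.2f} A"
            return False, f"settled at {drawn_a:.2f} A, outside expected window"
    return False, "still current-limited near the ceiling; stop and inspect"


class FakeSupply:
    """Stand-in for a bench supply feeding a board that wants demand_a amps."""

    def __init__(self, demand_a):
        self.demand_a = demand_a
        self.limit_a = 0.0

    def set_limit(self, amps):
        self.limit_a = amps

    def read_current(self):
        return min(self.limit_a, self.demand_a)


healthy = FakeSupply(demand_a=0.30)   # board that settles inside the window
shorted = FakeSupply(demand_a=10.0)   # board that takes everything offered

print(ramp_bringup(healthy.set_limit, healthy.read_current, 0.2, 0.4))
print(ramp_bringup(shorted.set_limit, shorted.read_current, 0.2, 0.4))
```

The point of the ceiling is that a damaged board never gets more than slightly above its normal operating current, so a short shows up as a supply stuck in current limit rather than as a burnt trace.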
For devices that use a switch-mode supply, the above approach might not be advisable; undesirable things might happen while you're running with an undervoltage condition on the input side. The safer course of action is to isolate the power supply section from the rest of the circuit and test each side independently. First put a dummy load on the power supply (many switch-mode supplies won't start up or regulate properly without some load) and check that it's putting out good voltages. Then use your lab supply to power the remainder of the circuit, following the directions above. If both of these tests check out, you can reconnect the PSU and circuit and see if they play together nicely.
It goes without saying that you should double-check circuit-protective elements such as fuses before you power up the device, and have the minimum possible amount of apparatus connected to the unit as you power it up for the first time.
Tip 5: Plan and document your projects; have a process and follow it
In a big company, you have management and other people to beat you up if you don't follow process; in a small company or one-person shop, you probably need to beat yourself.
I'm closing with this point because it's the most important. As I get older and progressively more jaded, this particular issue becomes more and more of a hobby-horse for me. In my opinion, formal education doesn't expose students to enough (read: any) of the planning phase of an engineering project. Textbook problems and class assignments have little relationship to the big bad world, where anything can happen and probably will.
If you read the previous article I wrote on this topic, you'll note that I described the procedures in a big company as a vast bureaucratic machine. Now, many people get into engineering -- particularly firmware -- through the hobby route. To such people, bureaucracy is something to be avoided, and this is why you have a depressingly large number of ad-hoc projects floating about in any engineering department. These also tend to be the projects that stick in the development process for ten years and never really work right.
When you're working for yourself or in a small company, it's critical that you get the project right the first time around, at least in the major details. I'm not saying that it's easy -- there's an alligator pit of problems to be crossed in any nontrivial project -- but if you have a clear idea of engineering requirements at the beginning, you can avoid a huge percentage of those end-of-the-project "oops!" moments. Sometimes the microsecond-level details are impossible to determine up front, and you might be forced to over-dimension the system to cater for possible critical problems to be determined during development. This is nothing to be ashamed of! The important fact here is that you considered the likely problems to be encountered and designed a performance margin into the product to cover these possibilities. Just make sure you document the anticipated performance capabilities of the system versus the likely critical parameters and your conscience should be clear.
Unless the costs are really outrageous, no customer is going to berate you for having too much design margin in the product you deliver. A working design can always be optimized for cost reduction when it goes to mass-production, but a design that "almost-sorta" works carries a lot of ill-will for the engineer who delivered it.
I hope this article and its predecessor have given you a little armor, which will protect you against acquiring further stories like this from your own experience. For my next trick, I will be demonstrating how no electronic device has been designed since the 1980s.
- See all installments of the series, Don't let these disasters happen to you, by Lewin Edwards.
- Not sure what a "switch-mode power supply" is? This introductory-level piece on switched mode power supplies describes the technology in some detail.
- Find an example of detailed analysis from thermal log monitoring in this IBM article on the BladeCenter® server series.
- Here's an interesting article on analyzing power budgets for the storage media used in logging, in the context of long-term biological sensor data gathering (but applicable to all sorts of other things, too).
- Sam Goldwasser has published an excellent article on how GFCIs work.
- PCBs are just the beginning; the habit of embedding interesting images goes all the way to individual chips, as shown in the Silicon Zoo collection of semiconductor art.
Lewin A.R.W. Edwards works for a Fortune 50 company as a wireless security/fire safety device design engineer. Prior to that, he spent five years developing x86, ARM and PA-RISC-based networked multimedia appliances at Digi-Frame Inc. He has extensive experience in encryption and security software and is the author of two books on embedded systems development. He can be reached at sysadm@zws.com.