Tips on building & debugging embedded hardware & software designs: Part 2

Shhhh! Listen to the hum. That’s the sound of the incessant information processing that subtly surrounds us, that keeps us warm, washes our clothes, cycles water to the lawn, and generally makes life a little more tolerable. It’s so quiet and keeps such a low profile that even embedded designers forget how much our lives are dominated by data processing.

Sure, we rail at the banks’ mainframes for messing up a credit report while the fridge kicks into auto-defrost and the microwave spits out another meal. The average house has some 40 to 50 microprocessors embedded in appliances. There’s neither central control nor networking: each quietly goes about its business, ably taking care of just one little function. This is distributed processing at its best.

Billions and billions of 4- to 16-bit micros find their way into our lives every year, yet mostly we hear of the few tens of millions that reside on our desktops.

Now, I’d never give up that zillion-MIP little beauty I’m hunched over at the moment. We all crave more horsepower to deal with Microsoft’s latest cycle-consuming application. I’m just getting tired of 32-bit hype for embedded applications. Perhaps that 747 display controller or laser printer needs the power. Surely, though, the vast majority of applications do not.

A 4-bit controller that formed the basis for a calculator started this industry, and in many ways we still use tiny processors in these minimal applications. That is as it should be: use appropriate technology for the job at hand.

Derivatives of some of the earliest embedded CPUs still dominate the market. Motorola’s 6805 is a scaled-up 6800, which competed with the 8080 back in the embedded Dark Ages.

The 8051 and its variants are based on the almost 20-year-old 8048. 8051s, in particular, have been the glue of this industry, corresponding to the analog world’s old 741 op amp or the 555 timer. You find them everywhere. Their price, availability, and on-board EPROM made them the natural choice for applications requiring anywhere from just a hint of computing power to fairly substantial controllers with limited user interfaces.

Now various vendors have migrated this architecture to the 16-bit world.

I can’t help but wonder if this makes sense, as scaling a CPU, while maintaining backward compatibility, drags lots of unpleasant baggage along. Applications written in assembly may benefit from the increased horsepower; those coded in C may find that changing processor families buys the most bang for the buck.

Microchip, Atmel, and others understand that the volume part of the embedded industry comes from tiny little CPUs scattered with reckless abandon into every corner of the world. These are cool parts! The smaller members offer a minimum amount of compute capability that is ideal for simple, cost-sensitive systems. Higher-end versions are well suited for more complicated control applications.

Designers seem to view these CPUs as something other than computers. “Oh, yeah, we tossed in a couple of PIC16s to handle the microswitches,” the engineer relates, as if the part were nothing more than a PAL. This is a bit different from the bloodied, battered look you’ll get from the haggard designer trying to ship a 68030-based controller. The micro-controller is easy to use simply because it is stuffed into easy applications.

L.A. Gear sells sneakers that blink an LED when you walk. A PIC16C5x powers these for months or years without any need to replace the battery. Scientists tag animals in the wild with expendable subcutaneous tracking devices powered by these parts. Beyond their use in partitioning code across several processors, there are other compelling reasons to reach for these small parts.

A friend developing instruments based on a 32-bit CPU discovered that his PLDs don’t always properly recover from brown-out conditions. He stuffed a $2 controller on the board to properly sequence the PLD’s reset signals, ensuring recovery from low-voltage spikes. The part cost virtually nothing, required no more than a handful of lines of code, and occupied the board space of a small DIP. Though it may seem weird to use a full computer for this trivial function, it’s cheaper than a PAL.

Not that there’s anything wrong with PALs. Nothing is faster or better at dealing with complex combinatorial logic. Modern super-fast versions are cheap (we pay $12 in singles for a 7-nanosecond 22V10) and easy to use, and their reprogrammability is a great savior of designs that aren’t quite right. PALs, though, are terrible at handling anything other than simple sequential logic.

The limited number of registers and clocking options means you can’t use them for complicated decision making. PLDs are better, but when speed is not critical a computer chip might be the simplest way to go.

As the industry matures, lots of parts we depend on become obsolete. One acquaintance found the UART his company depended on was no longer available. He built a replacement in a PIC16C74, which was pin-compatible with the original UART, saving the company expensive redesigns.

In the good old days of microcomputing, hardware engineers also wrote and debugged all of the system’s code. Most systems were small enough that a single, knowledgeable designer could take the project from conception to final product. In the realm of small, tractable problems like those just described, this is still the case.

Nothing measures up to the pride of being solely responsible for a successful product; I can imagine how the designer’s eyes must light up when he sees legions of kids skipping down the sidewalk flashing their L.A. Gears at the crowds.

Part of the recent success of these parts comes from the aggressive use of Flash and One-Time Programmable (OTP) program memory. OTP memory is simply good old-fashioned EPROM, though the parts come without an erasure window. That small quartz opening typical of EPROMs and many PLDs is very expensive to manufacture.

You can program the memory on any conventional device programmer, but, since there’s no window, you can never erase it. When it’s time to change the code, you’ll toss the part out.

Intel sold OTP versions of their EPROMs many years ago, but they never caught on. A system that uses discrete memory devices—RAM, ROM, and the like—has intrinsically higher costs than one based on a microcontroller. In a system with $100 of parts, the extra dollar or two needed to use erasable EPROMs (which are very forgiving of mistakes) is small.

The dynamics are a bit different with a minimal system. If the entire computer is contained in a $2 part, adding a buck for a window is a huge cost hit. OTP starts to make quite a bit of sense, assuming your code will be stable. This is not to diminish Flash memory, which has all of the benefits of OTP, though sometimes with a bit more cost.

Using either technology, the code can be cast in concrete in small applications, since the entire program might require only tens to hundreds of statements. Though I have to plead guilty to one or two disasters where it seemed there were more bugs than lines of code, a program this small, once debugged and thoroughly tested, holds little chance of an obscure bug. The risk of going with OTP is pretty small.

You can’t pick up a magazine without reading about “time to market.” Managers want to shrink development times to zero. One obvious solution is to replace masked ROMs with their OTP equivalents, as producing a processor with the code permanently engraved in a metalization layer takes months – and suffers from the same risk factors as does OTP. The masked part might be a bit cheaper in high volumes, but this price advantage doesn’t help much if you can’t ship while waiting for parts to come in.

Part of the art of managing a business is to preserve your options as long as possible. Stuff happens. You can’t predict everything. Given options, even at the last minute, you have the flexibility to adapt to problems and changing markets. For example, some companies ship multiple versions of a product, differing only in the code. A Flash or OTP part lets them make a last-minute decision, on the production floor, about how many of a particular widget to build. If you have a half million dollars tied up in inventory of masked parts, your options are awfully limited.

Part of the 8051’s success came from the wide variety of parts available. You could get EPROM or masked versions of the same part. Low-volume applications always took advantage of the EPROM version. OTP reduces the costs of the parts significantly, even when you’re only building a handful.

Microcontrollers do pose special challenges for designers. Since a typical part is bounded by nothing more than I/O pins, it’s hard to see what’s going on inside. Nohau, Metalink, and others have made a great living producing tools designed specifically to peer inside of these devices, giving the user a sort of window into his usually closed system.

Now, though, as the price of controllers slides toward zero and the devices are hence used in truly minimal applications, I hear more and more from people who get by without tools of any sort. While it’s hard to condone shortchanging your efficiency to save a few dollars, it’s equally hard to argue that a 50-line program needs much help. You can probably eyeball it to perfection on the first or second iteration.

Again, appropriate technology is the watchword; 5000 lines of assembly language on a 6805 will force you to buy decent debuggers – and, I’d hope, a C compiler.

You can often bring up a microcontroller-based design without a logic analyzer, since there’s no bus to watch. Some people even replace the scope with nothing more than a logic probe.

An army of tool vendors supply very low-cost solutions to deal with the particular problems posed by microcontrollers. You have options—lots of them—when using any reasonable controller—far more than if you decide to embed a SPARC into your system. Some companies cater especially to the low end. Most do a great job, despite the low cost. I recently looked at Byte Craft’s array of compilers for microcontrollers from Microchip, Motorola, and National. Despite the limited address spaces of some of these parts, it’s clear a decent C compiler can produce very efficient code.

One friend cross-develops his microcontroller code on a PC. Using C frees him from most processor dependencies; compile-time switches select between the PC’s timer/UART, etc., and that contained in the controller. He manages to debug more than 80% of the code with no target hardware.
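A minimal sketch of what such compile-time switching might look like. The `TARGET_BUILD` macro, the register addresses, and the function names are all assumptions made for illustration; a real project would take them from its own toolchain and the controller's data sheet.

```c
#include <stdint.h>
#include <stdio.h>

/* Build with -DTARGET_BUILD for the controller; omit it to run on the PC. */
#ifdef TARGET_BUILD
/* Hypothetical memory-mapped UART registers; the addresses are placeholders. */
#define UART_DATA     (*(volatile uint8_t *)0x4000u)
#define UART_STATUS   (*(volatile uint8_t *)0x4001u)
#define UART_TX_READY 0x01u

void uart_put(char c)
{
    while ((UART_STATUS & UART_TX_READY) == 0u) {
        /* wait for the transmitter to empty */
    }
    UART_DATA = (uint8_t)c;
}
#else
/* Host build: the "UART" is simply standard output. */
void uart_put(char c)
{
    putchar(c);
}
#endif

/* Application code: identical in both builds, so it can be debugged on the PC.
   Returns the number of characters sent. */
int send_string(const char *s)
{
    int count = 0;
    while (*s != '\0') {
        uart_put(*s++);
        count++;
    }
    return count;
}
```

Only `uart_put()` differs between the two builds. Everything above that layer, which is usually most of the program, can be exercised on the desktop before any target hardware exists.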

Working in a shop using mostly midrange processors, I’m amazed at the amount of fancy equipment we rely on, and am sometimes a bit wistful for those days of operating out of a garage with not much more than a soldering iron, a logic probe, and a thinking cap. Clearly, the vibrant action in the controller market means that even small, under- or uncapitalized businesses still can come out with competitive products.

I’m constantly astonished by the utter reliability of computers. While people complain and fume about various PC crashes and other frustrations, we forget that the machine executes millions of instructions per second, even when sitting in an idle loop.

Smaller device geometries mean that sometimes only a handful of electrons represent a one or zero. A single-bit failure, for a fleetingly transient bit of time, is disaster.

Yet these failures and glitches are exceedingly rare. Our embedded systems, and even our desktop computers, switch trillions of bits without the slightest problem.

Problems can and do occur, though, due more often to hardware or software design flaws than to glitches. A watchdog timer (WDT) is a good defense for all but the smallest of embedded systems. It’s a mechanism that restarts the program if the software runs amok.

Unless it is serviced, the WDT resets the processor after a fixed interval, usually a few hundred milliseconds. It’s up to the firmware to tickle the timer frequently, restarting the countdown each time. A code crash means the timer counts down without interruption; at time-out, hardware resets the CPU, ideally bringing the system back on-line.
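The countdown-and-tickle behavior can be modeled in a few lines of C. This is only a host-side simulation under assumed names (`wdt_t`, `wdt_kick`, and so on) and an assumed 300 ms interval; on real hardware the counter and the reset line live in silicon, and the firmware's only job is the tickle.

```c
#include <stdbool.h>
#include <stdint.h>

#define WDT_TIMEOUT_MS 300u   /* assumed interval: "a few hundred milliseconds" */

typedef struct {
    uint32_t remaining_ms;    /* time left before the watchdog fires */
    bool     reset_asserted;  /* models the CPU's RESET line */
} wdt_t;

void wdt_init(wdt_t *w)
{
    w->remaining_ms = WDT_TIMEOUT_MS;
    w->reset_asserted = false;
}

/* Firmware calls this (tickles the timer) to restart the interval. */
void wdt_kick(wdt_t *w)
{
    w->remaining_ms = WDT_TIMEOUT_MS;
}

/* Models the hardware counter advancing by elapsed_ms. */
void wdt_advance(wdt_t *w, uint32_t elapsed_ms)
{
    if (elapsed_ms >= w->remaining_ms) {
        w->remaining_ms = 0u;
        w->reset_asserted = true;   /* time-out: hardware resets the CPU */
    } else {
        w->remaining_ms -= elapsed_ms;
    }
}
```

As long as `wdt_kick()` is called more often than once per interval, `reset_asserted` stays false. If the code hangs and the calls stop, the model asserts reset once the remaining time runs out.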

The first rule of watchdog design is to drive the CPU’s reset input, not an interrupt (such as NMI). A WDT time-out means that something awful happened, something that may have left the CPU in an unpredictable scrambled state. Only RESET is guaranteed to bring the part back on-line.

The non-maskable interrupt is seductive to some designers, especially when the pin is unused and there’s a chance to save a few gates. For better or worse, NMI—and all other interrupt inputs—is not fail-safe. Confused internal logic will shut down NMI response on some CPUs.

On other chips a simple software problem can render the non-maskable interrupt unusable. The 68K, for example, will crash if the stack pointer assumes an odd value. If you rely on the WDT to save the day, driving an interrupt while SP is odd results in a double bus fault, which puts the CPU in a dead state until it’s reset.

Next, think through the litigation potential of your system. Life-threatening failure modes mean you’ve got to beware of simple watchdog timers! If a single I/O instruction successfully keeps the WDT alive, then there’s a real chance that the code might crash but continue to tickle the timer.

Some companies (Toshiba, for example) require a more complex sequence of commands to the timer; it’s equally easy to create a PLD yourself that requires a fiendishly complex WDT sequence.
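As an illustration of the idea, here is a small model of a watchdog that restarts its countdown only after a specific series of writes. The key values and function names are invented for this sketch and do not describe any vendor's actual part; a real device or PLD defines its own sequence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unlock sequence, chosen only for illustration. */
static const uint8_t WDT_KEYS[] = { 0x55u, 0xAAu, 0x3Cu };
#define WDT_KEY_COUNT (sizeof WDT_KEYS / sizeof WDT_KEYS[0])

typedef struct {
    size_t next_key;   /* index of the key expected next */
} seq_wdt_t;

void seq_wdt_init(seq_wdt_t *w)
{
    w->next_key = 0;
}

/* Models one write to the watchdog register.
   Returns true only when the write completes the full sequence,
   that is, when the hardware would restart its countdown. */
bool seq_wdt_write(seq_wdt_t *w, uint8_t value)
{
    if (value != WDT_KEYS[w->next_key]) {
        w->next_key = 0;        /* wrong value: the sequence starts over */
        return false;
    }
    w->next_key++;
    if (w->next_key == WDT_KEY_COUNT) {
        w->next_key = 0;
        return true;            /* sequence complete: timer restarted */
    }
    return false;
}
```

Crashed code that happens to loop through a single stray I/O write cannot keep this watchdog alive, because no single value ever completes the sequence.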

It’s also a very bad idea to put the WDT reset code inside of an interrupt service routine. It’s always intriguing, while debugging, to find your code crashed but one or more ISRs still functioning. Perhaps the serial receive routine still accepts characters and echoes them to the sender.

After all, the ISR by definition runs independently of the rest of the code, so will often continue to function when other routines die. If your WDT tickler stays alive as the world collapses around the rest of the code, then the watchdog serves no useful purpose.

This problem multiplies in a system with an RTOS, since a reliable watchdog must monitor all of the tasks. If some of the tasks die but others stay alive (perhaps tickling the WDT), then the system’s operation is at best degraded.

In this case write the WDT code as its own task, driven by a timer. All other tasks send messages to the watchdog process, indicating “I’m alive.” Only when the WDT activity sees that all tasks that should have checked in are indeed operating does it service the watchdog.
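The bookkeeping inside such a watchdog task might look like the sketch below. The task count, type names, and function names are assumptions, and the RTOS message transport is deliberately left out: `monitor_checkin()` stands for whatever the watchdog task does when an "I'm alive" message arrives in its queue.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_COUNT 3u   /* assumed number of monitored tasks */
#define ALL_TASKS  ((uint32_t)((1u << TASK_COUNT) - 1u))

typedef struct {
    uint32_t checked_in;   /* one bit per task that has reported since the last pass */
} wdt_monitor_t;

void monitor_init(wdt_monitor_t *m)
{
    m->checked_in = 0u;
}

/* Called by the watchdog task when an "I'm alive" message from task_id arrives. */
void monitor_checkin(wdt_monitor_t *m, unsigned task_id)
{
    if (task_id < TASK_COUNT) {
        m->checked_in |= (1u << task_id);
    }
}

/* Called by the watchdog task on each timer tick.
   Returns true if every task has reported and the hardware WDT should be serviced. */
bool monitor_should_kick(wdt_monitor_t *m)
{
    bool all_alive = (m->checked_in == ALL_TASKS);
    m->checked_in = 0u;   /* every task must report again before the next pass */
    return all_alive;
}
```

Clearing the record on every pass is the important design choice here: a task that reported once and then died cannot keep the watchdog serviced.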

If you use RTOS-supplied messaging to communicate the tasks’ health (rather than dreaded though easy global variables), there’s little chance that errant code overwriting RAM can create a false indication that all’s OK.

Suppose the WDT does indeed find a fault and resets the CPU. Then what? A simple reset and restart may not be safe or wise.

One system uses very high-energy gamma rays to measure the thickness of steel. A hardware problem led to a series of watchdog time-outs. I watched, aghast, as this system cycled through WDT resets about once a second, each time opening the safety shield around the gamma ray source! The technicians were understandably afraid to approach close enough to yank the power cord.

If you cannot guarantee that the system will be safe after the watchdog fires, then you simply must add hardware to put it in a reasonable, nondangerous, mode. Even units that have no safety issues suffer from poorly thought-out WDT designs.

A sensor company complained that their products were getting slower. Over time, and with several thousand units in the field, response time to user inputs degraded noticeably. A bit of research showed that their system’s watchdog properly drove the CPU’s reset signal, and the code then recognized a warm boot, going directly to the application with no indication to the users that the time-out had occurred.

We tracked the problem down to a floating input on the CPU that caused the software to crash—up to several thousand times per second. The processor was spending most of its time resetting, leading to apparently slow user response.

If your system recovers automatically from a WDT time-out, add an LED or status display so users—or at least the programmers!—know that the system had an unexpected reset. Don’t use a bit of clever watchdog code to compensate for software or hardware glitches.
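One way to make such resets visible is to keep a small record in RAM that survives a warm boot. The sketch below assumes such a region exists and that any reset with the record intact was unexpected; the signature value and all names are hypothetical, and where the CPU offers a reset-cause register that would be a better source of truth.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_MAGIC 0xB007C0DEu   /* arbitrary signature marking valid contents */

/* On a real target this would live in RAM that the startup code does not clear;
   arranging that is toolchain-specific and not shown here. */
typedef struct {
    uint32_t magic;
    uint32_t unexpected_resets;
} boot_record_t;

/* Call once at startup. Returns true when this boot followed a reset that
   left RAM intact (for example, a watchdog time-out), so the caller can
   light a status LED or log the event. */
bool boot_record_update(boot_record_t *rec)
{
    if (rec->magic != BOOT_MAGIC) {
        /* Cold boot: RAM contents are garbage, so start counting from zero. */
        rec->magic = BOOT_MAGIC;
        rec->unexpected_resets = 0u;
        return false;
    }
    rec->unexpected_resets++;
    return true;
}
```

A count that climbs steadily in the field points at a bug the watchdog is papering over, which is exactly what the sensor company above had no way of seeing.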

It seems almost traditional to put a reset switch on the back panel of an embedded system. When something horrible happens, hit the reset and retry! Doesn’t this make the customer feel that we don’t trust our own products? Electronic systems never had reset switches until the introduction of the microprocessor. Why add them now?

A reset switch is no remedy for flaky hardware. It’s pretty easy (or at least possible) to design robust, reliable microprocessor circuits. Any failure is most likely to be a hard fault that a simple reset will not cure.

This argument implies that a reset switch is mostly useful to cure software bugs. We have a choice of writing 100% reliable code or adding some sort of an escape hatch for the user. I hereby proclaim, “We shall all now write correct code.” The problem is now cured.

OK, so perhaps a bug just might creep in once in a while. My feeling is that a reset switch is still a mistake. It conveys the message that no one really trusts the product. It’s much better to include a very robust watchdog timer that asserts a good, hard reset when things fall apart.

The code might still be unreliable, but at least we’re not announcing to the world that bugs are perhaps rampant. Remember when Microsoft eliminated the Unexpected Application Error message from Windows 3.1 – by renaming it?

No watchdog is perfect, but even a simple one will catch 99% of all possible code crashes. Combine this percentage with the (ideally) low probability of a software crash, and the watchdog failure rate falls to essentially zero.
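As a rough worked example of how the two factors combine, let p be the probability that the software crashes in some operating interval (a hypothetical quantity; the text gives no figure) and take the 99% coverage above at face value:

```latex
P(\text{crash goes unrecovered}) = p \times (1 - 0.99) = 0.01\,p
```

With an illustrative p of one crash per thousand intervals, the system would be left hung roughly once per hundred thousand intervals.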

In the bad old days we created wire-wrapped prototypes because they were faster to make than a PCB, and a lot cheaper. This is no longer the case. Except for the very smallest boards, the cost of labor is so high that a wire-wrapped prototype runs anywhere from $500 to several thousand dollars. Turnaround time is easily a week.

Cheap autorouting software means any engineer can design a PCB in a matter of a couple of days—and you’ll have to do this eventually anyway, so it’s not wasted time. Dozens of outfits will convert your design to a couple of PCBs in under a week for a very reasonable price. How much? Figure $1000–1500 for a 50-square-inch 4- to 6-layer board, with one-week turnaround.

It’s magic. Modem your board design to the vendor, and days later FedEx delivers your custom design, ready for assembly and test. PCBs are much quieter, electrically, than their wire-wrapped brethren. With fast rise times and high clock rates, noise is a significant problem even in small embedded designs.

I’ve seen far too many cases of “Well, it doesn’t work reliably, but that’s probably due to the wire wrap. It’ll probably get better when we go to PC.” These are clearly cases where the prototype does not accomplish its prime objective: identify and fix all risk factors.

Always build your prototype on a PCB, never on wirewrap or other impedance-challenged technologies. And figure on using a multilayer design, with unadulterated power and ground planes. Modern logic is just too fast, too noisy, and too intolerant of ground bounce and other impedance issues to try to mix power and signals on any PCB layer.

The best source for information about speed and noise issues on PC boards is High-Speed Digital Design: A Handbook of Black Magic, by Howard Johnson and Martin Graham (1993, PTR Prentice Hall). This is a must-read for all digital engineers.

If you felt that your college electromagnetics was a flunk-out course, one you squeaked through, fear not. The authors do use plenty of math, but their prose descriptions are so lucid you’ll gain a lot of insight by just reading the words and skipping over the equations.

Design your prototype PCB with room for mistakes. Designing a pure surface-mount board? These usually use tiny vias (the holes between layers) to increase the density.

Think about what happens during the prototyping phase: you’ll make design changes, inevitably implemented by a maze of wires. It’s impossible to run insulated wire through the tiny holes! Be sure to position a number of unusually large vias (say, 0.031″) around the board that can act as wiring channels between the component and circuit sides of the board.

Add pads for extra chips; there’s a good chance you’ll have to squeeze another PAL in somewhere. My latest design was so bad I had to glue on five extra chips. Guess who felt like an idiot for a few days???

Always build at least two copies of each prototype PCB. One may lag the other in engineering modifications, but you’ll have options if (when) the first board smokes. Anyone who has been at this for a while has blown up a board or two.

I generally buy three blank prototype PCBs, assemble two, and use the third to see where tracks run. Though sometimes you’ll have to go back to the artwork to find inner tracks, it sure is handy to have the spare blank board on the bench during debug.

It’s scary how often the firmware group receives a piece of “functional” prototype hardware from the designers accompanied by nothing more than the schematics—schematics that are usually incomprehensible to the software folks, made even more abstruse by massive use of PLDs and similar functional blocks plopped down on the page, with perhaps hundreds of connections.

They are documentation black holes—every signal goes in, and presumably something comes out, but without the designer’s suite of design tools even the brightest firmware person will never make sense of the design.

Where does one draw the line between the responsibilities of the hardware designers and those of the firmware folks? Should the designers include device drivers? Seems reasonable to me, since surely they did indeed at least hack together a bit of code to test each device.

Why not structure the development plan to make this test code part of the framework of the final software? The hardware tends to be so complex now that it’s unfair to give “naked iron” to the software people.

At the very least, deliver low-level drivers with well-defined interfaces. If you live and breathe hardware only, do talk to your software counterparts. You may be surprised to learn that all too often your cool new product makes debugging the code practically impossible.

Poor design decisions might seriously affect the firmware schedule. All embedded people must understand that their creation does not exist in isolation; the code and the chips all function together, to form the seamless gestalt that (you hope) delights the user.

After spending a couple of months writing code, it’s a bit of a shock to come back to the hardware world. Fixing bugs is a real pain! Instead of a quick edit/compile, you’ve got to break out a soldering iron, wire, parts, and then manipulate a pin that might be barely visible.

PALs, FPGAs, and PLDs all ease this process to some extent. Many changes are not much more difficult than editing and recompiling a file. It is important to have the right tools available: your frustration level will skyrocket if the PAL burner is not right at the bench.

FPGAs that are programmed at boot time via a ROM download usually have a debugging mechanism—a serial connection from the device to your PC, so you can develop the logic in a manner analogous to using a ROM emulator. Be sure to put the special connector on your design, and buy the little adapter and cable. Burning ROMs on each iteration is a terrible waste of time.

PLDs often come like EPROMs, in ceramic packages with quartz erasure windows. These are great – if you were clever enough either to socket the parts, or to have left room around the part for a socket.

On through-hole designs I generally have the technicians load sockets for every part on the prototype. I want to replace suspected failed devices quickly, without spending a lot of time agonizing over “Is it really dead?”

Sockets also greatly ease making circuit modification. With an 8-layer board it’s awfully hard to know where to cut a track that snakes between layers and under components. Instead, remove the pin from the socket and wire directly to it.

You can’t lift pins on programmable parts, as the device programmer needs all of them inserted when reburning the equations. Instead, stack sockets. Insert a spare socket between the part and the socket soldered on the board. Bend the pins up on this one. All too often the metal on the upper socket will, despite the bent-out pin, still short to the socket on the bottom. Squish the metal in the bottom socket down into the plastic to eliminate this hard-to-find problem.

Surface-mount parts are much more problematic. Get a good set of dental tools and a very fine soldering iron, so you can pry up pins as needed. You’ll need a bright light with magnifier, a steady hand, and abstinence from coffee.

A decent surface-mount rework machine (such as from Pace Electronics) is essential; get one that vectors hot air around the IC’s pins. Don’t even try to use conventional solder on fine-pitch parts; use solder paste instead, and keep it fresh (usually it’s best stored in a fridge).

Since SMT is so tough, I always make prototype boards with tracks on the outer layers. Sure, the final version might reverse this (power and ground outside to reduce emissions), but keep the signal layers outside during debug. It’s easy to cut tracks with an X-Acto knife.

Every engineer needs at least two X-Acto knives. One is for fingernail cleaning, cutting open envelopes, and tossing at the dartboard. The other is only for PCB work and always has a new, sharp blade. Keep 50 or 100 spare blades in your drawer, since PCB work invariably breaks the very sharp and very essential pointy end off in no time.

Planning

Engineers have managers, who “run” projects: they ensure that resources are available when needed, negotiate deadlines and priorities with higher-ups, and guide/mentor the developers toward producing a decent product on time.

Planning is one of any manager’s main goals. Too often, though, managers do planning that more properly belongs to the engineers. You know more about what your project needs than your boss ever will; it’s silly, and unfair, to expect him to deal with all of the details.

There are many great justifications for a project running late. In engineering it’s usually impossible to predict all of the technical problems you’ll encounter! However, lousy planning is simply an unacceptable, though all too common, reason.

I think engineers spend too much time doing, and not enough time thinking about doing. Try spending two hours every Monday morning planning the next week and the next month.

What projects will you be working on? What’s their status? What is the most important thing you need to do to get the projects done? Focus on the desired goal, and figure out what you need to do to get there. Do you need to order parts? Tools? Does some of your test equipment need repair or calibration?

Find the critical paths and do what’s required to clear the road ahead. Few engineers do this effectively; learn how, and you’ll be in much higher demand.

When you’re developing a rush project (all projects are rush projects), the first design step is a block diagram of each board. From this you’ll create the schematic, then do a PCB layout, create a bill of materials, and finally, order parts for the prototype.

Not. The worst thing you can do is have a very expensive quick-turn PCB arrive, with all of the components still on back order. The technicians will snicker about your “hurry up and wait” approach, and management will be less than thrilled to spend heavily for fast-turn boards that idle away the weeks on a shelf.

Buy the parts first, before your design is complete. Surely you’ll know what all of the esoteric parts are—the CPU, odd analog components, sensors, and the like. These are likely to be the hardest and slowest to get, so put them on order immediately.

The nickel and dime components, such as gates and PALs, resistors and capacitors, are hard to pin down until the schematic is complete. These should mostly be in your engineering spares closet.

Again, part of planning is making sure your lab has the basic stuff needed for doing the job, from soldering irons to engineering spares. Make sure you have a good selection of the sort of components your company regularly uses, and avoid the temptation to use new parts unless there’s a good reason.

To read Part 1, go to “Test points galore.”

This article is based on material from “Embedded Systems: World Class Design” edited by Jack Ganssle, used with permission from Newnes, a division of Elsevier. Copyright 2008. For more information about this title and other similar books, please visit www.elsevierdirect.com.

With 30 years in this field Jack Ganssle was one of the first embedded developers. He writes a monthly column in Embedded Systems Design about integration issues, and is the author of two embedded books: The Art of Designing Embedded Systems and The Art of Programming Embedded Systems. Jack conducts one-day training seminars that show developers how to develop better firmware, faster.
