When it comes to troubleshooting, Jack gets downright philosophical. But watch out: human nature can get in the way of careful thinking.
It's often easier to teach than to do. I recently wrote about bringing a new hardware design to life. To quote one section: “Don't forget Vcc! As a young engineer, a wiser, older fellow taught me to always round up the usual suspects before looking for exotic failures. On occasion I've forgotten that lesson, only to wander down complex paths, wasting hours and days. Check power first and often.” (“New Hardware,” May 2001).
Just days after finishing that article, I decided to fix my boat's radar detector, a device that sets off a shrill alarm when it detects an active ship radar, waking me so I can take evasive action if needed. Over the winter it failed, emitting random shrieks and trills.
The unit last failed in 1992 on a sail to England. I can't remember the symptoms but the cause was bad rechargeable batteries. Figuring that maybe the Ni-Cds were again dead, I put a voltmeter on these and found the voltage was just fine.
This is a 20-year-old design, completely microprocessor-free. It even has a schematic in the manual. A bit of ancient 4000-series CMOS logic plus plenty of analog is spread across an easy-to-access circuit board. An oscillation permeated the circuit, showing up suspiciously on too many nodes. How could it be in so many places? Power was okay; the voltmeter proved it. I spent an embarrassing amount of time chasing ghosts before putting a scope probe on the batteries. I saw the same oscillation. Not large, but clearly modulating the six-volt supply. My assumptions of what could be or should be happening continued to obscure reality as I tried to find something coupling into power. Finally, a light dawned: replacing the batteries with a power supply completely cured the problem. Apparently, the batteries' internal resistance increased as they aged. Since the unit consumes just a few milliamps, this condition allowed a signal to couple onto the six volts, rather than cause a reduction in voltage. Which, of course, is what I assumed would be the failure mode of the Ni-Cds.
I always tell young engineers to check Vcc with a scope, not a meter, for reasons my story makes obvious. My assumptions proved false, my unwillingness to use previous history (the 1992 failure) obscured the path to truth, and I wasted a couple of hours ignoring a basic rule of troubleshooting, one I had so grandly and recently written about.
(On the other hand, it was great fun to spend an afternoon fiddling with a non-microprocessor-based product! What a delight to work on a probeable, understandable, SMT-less device. Today most consumer products defy repair even by the original designers.)
So perhaps this is a good time to review a philosophy of troubleshooting and debugging systems. It matters little whether you're working on hardware or delving into firmware bugs. Both require the same approach and mindset. Our goal is to extract truth from the system (what's wrong and why) and then to apply a fix.
Our chief ally in this search for wisdom is the right world-view: a spirit of suspicion tempered by trust in the laws of physics, curiosity dulled only by the determination to stay focused on a single problem, and a zealot's regard for the scientific method.
Debugging and troubleshooting are not random acts carried out by a bewildered engineer. There's a clear process we should follow to ensure that we both find the problems and fix them completely and permanently.
The first step: observe the system's behavior to find the apparent bug. In other words, determine the symptoms. Many problems are subtle and exhibit themselves through a confusing set of symptoms. Be wary. All too often we pursue a problem only to finally hit our heads in frustration as we realize that the system is indeed supposed to act this way.
Be sure the problem manifests itself in a repeatable way. When this is not the case, try to simplify the actions needed to create the problem. A non-repeatable bug is all but impossible to find.
Simplify, simplify, simplify. Work on a single problem at a time. We're not smart enough to deal with multiple bugs all at once, unless they are manifestations of something more fundamental.
Observe collateral behavior
Watch the system to obtain as much related information as possible. What else is going on when the bug appears? Often a correlation exists between the good and the bad. Does the display flicker when firing off a big solenoid? Electrical noise, grounding, and power problems related to the properly functioning solenoid may indeed be the root of the problem.
I once worked on a system that cycled a house-sized motor back and forth every few seconds. So much EMI was generated that microprocessor-based instruments all over the factory acted oddly. Developers from different companies, working on their products on the factory floor, chased erratic system crashes coming from the one big motor source.
Round up the usual suspects
Lots of computer problems stem from the same few sources. Clocks must be stable and meet very specific timing and electrical specs, or else all bets are off. Reset, too, often has unusual timing and electrical parameters. Examine all critical hardware signals with the scope, including NMI, DMA request, clock, wait, and so on. Don't assume these are in known safe states.
I once lost too much youth over a 16-bit system that crashed erratically. Assuming the code was at fault, I searched for any clue as to what was executing at the time of the problem. The culprit was a one-nanosecond glitch (barely detectable with the scope) on the reset input that, in turn, came from a poorly designed power-fail circuit.
And never forget to check Vcc. Don't rely on the voltmeter. Use a scope, as I should have done on my failing radar detector. As an ex-emulator vendor I've seen far too many systems where the power wasn't at the right voltage, being off by just a bit. Or with too much ripple. Modern CPUs are totally intolerant of even the slightest Vcc variations. Check the spec. You may be surprised to see just how little margin the part tolerates. A quarter-volt isn't unusual.
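Where the supply can be sampled by an ADC, the meter-versus-scope difference is easy to show in a few lines. The sketch below is only an illustration: the names, the millivolt units, and the limits are assumptions, not taken from any particular part's data sheet. It reduces a burst of samples to the average a voltmeter would report and the peak-to-peak swing a scope would show.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of a burst of supply-voltage samples, in millivolts. */
typedef struct {
    long mean_mv;         /* roughly what an averaging voltmeter reports */
    long peak_to_peak_mv; /* the ripple a scope trace would reveal */
} supply_stats;

/* Reduce n samples to their mean and peak-to-peak swing. */
supply_stats supply_summarize(const int *samples_mv, size_t n)
{
    assert(samples_mv != NULL && n > 0);

    long sum = 0;
    int lo = samples_mv[0];
    int hi = samples_mv[0];

    for (size_t i = 0; i < n; i++) {
        sum += samples_mv[i];
        if (samples_mv[i] < lo) lo = samples_mv[i];
        if (samples_mv[i] > hi) hi = samples_mv[i];
    }

    supply_stats s = { sum / (long)n, (long)hi - lo };
    return s;
}

/* Nonzero only if both the average and the ripple are within limits.
   The limits are caller-supplied; take them from the part's spec. */
int supply_ok(supply_stats s, long nominal_mv, long tol_mv, long max_ripple_mv)
{
    long err = s.mean_mv - nominal_mv;
    if (err < 0) err = -err;
    return err <= tol_mv && s.peak_to_peak_mv <= max_ripple_mv;
}
```

A supply that averages exactly six volts while swinging several hundred millivolts passes the mean test and fails the ripple test, which is the situation a voltmeter alone cannot reveal.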
Does the firmware even get to the particular code you suspect? Don't waste time analyzing and theorizing till you're sure that you're working on the right function. Maybe the interrupt never came, in which case debugging that ISR is sort of pointless.
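One cheap way to answer that question is to count arrivals. What follows is a minimal sketch, not a recommendation for any particular toolchain: the trace-point IDs and the handler name `uart_rx_isr` are hypothetical, and attaching the handler to a real interrupt vector is compiler-specific and not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trace points; add one per place you want to confirm. */
enum { TRACE_RX_ISR, TRACE_PARSE_FRAME, TRACE_POINT_COUNT };

/* volatile because counters are bumped in interrupt context and read
   from the main loop or a debugger. */
static volatile uint32_t trace_hits[TRACE_POINT_COUNT];

/* Record one arrival at a trace point. The assert is a debug-build
   check only; define NDEBUG to compile it out. */
void trace_hit(unsigned id)
{
    assert(id < TRACE_POINT_COUNT);
    trace_hits[id]++;
}

/* Read back how many times a trace point has been reached. */
uint32_t trace_count(unsigned id)
{
    assert(id < TRACE_POINT_COUNT);
    return trace_hits[id];
}

/* Hypothetical receive handler under suspicion. */
void uart_rx_isr(void)
{
    trace_hit(TRACE_RX_ISR);
    /* ... the existing handler body would follow here ... */
}
```

If `trace_count(TRACE_RX_ISR)` is still zero after the event that should have raised the interrupt, the problem lies upstream of the handler. On 8- and 16-bit parts a 32-bit counter is not read atomically, so read it with interrupts briefly disabled or use a narrower type.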
Generate a hypothesis
Amateurs modify things without a deep understanding of why the system is broken. Sure, it's easy to change the code from if(a>=b) to if(a>b) and hope that solves the problem. It's easy, but it's foolish.
I used to watch in awe as analog engineers soldered circuits in three-dimensional arrays of resistors and ICs that looked like a New Age sculpture. Some did indeed create a real design and then just isolated problems by means of this prototyping. Others fiddled with op-amp damping and feedback values without a good design, stopping “when it worked.” Was it any surprise that so many of those creations couldn't survive temperature extremes or production tolerance variations? I remember one system using a homemade switching power supply. When summertime thunderstorms hit the Midwest, thousands of these units failed.
A device built by iterative trial and error will never be as robust as one with a solid, well-understood design. So too for debugging and troubleshooting. The band-aid fix will usually come back to haunt us.
Our tools have gotten too good, and it's to our detriment. Older readers will remember the development environment we used in the mainframe days. We'd laboriously punch a thick deck of cards containing the program and submit them to the high priest. He'd tell us to come back in a day, maybe two, to get the job's results.
That meant the edit, compile, link and test cycle took 24 hours or more. Fast forward to 2001 and watch a typical developer: a 21-inch monitor displays editor, debugger, and compiler windows. He encounters a bug. In a flash he changes something (maybe an == to a !=), compiles, links, and downloads. Five seconds later he's testing again. Was the bug really fixed? Did the engineer understand cause and effect or did the change simply mask the real problem?
Our ability to make changes faster than we can understand them is a problem. We need to slow down and think. The tools enable dysfunctional behavior. One solution to this particular problem is yet another tool, one that's utterly low tech. Use an engineering notebook and write down symptoms as they appear, before implementing a fix. Figure out what is really going on and write it down before changing the code. Note your proposed fix. Then change the code and run the test. The notebook gives us another 30 seconds of perspective on the problem, breaking the cycle of “change something and see what happens.”
Perhaps we don't have enough data to formulate a hypothesis. Use your tools, from BDM to ICE to scope and logic analyzer, to see exactly what is going on. Compare that to what you think should happen. Generate a theory about the cause of the bug based on that comparison.
Test the hypothesis
The scientific method shows us that a theory not backed up by solid experimental evidence is nothing more than a guess. Do you think the system's reset line is noisy? Prove it. Check with a scope. Does it seem the incoming data stream is occasionally corrupt? Instrument the data, or examine it with a debugger, to convince yourself that this is indeed the problem, and that it's truly worth fixing.
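Instrumenting a data stream can be as simple as counting how many frames fail an integrity check. The sketch below assumes a made-up frame layout (payload bytes followed by a single 8-bit additive checksum) purely for illustration; a real protocol will have its own format and very likely a stronger check such as a CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running totals for the suspect stream. */
typedef struct {
    uint32_t frames_seen;
    uint32_t frames_bad;
} stream_stats;

/* 8-bit additive checksum over len bytes (wraps modulo 256). */
uint8_t sum8(const uint8_t *data, size_t len)
{
    uint8_t s = 0;
    for (size_t i = 0; i < len; i++) {
        s = (uint8_t)(s + data[i]);
    }
    return s;
}

/* Check one frame and update the totals.
   Assumed layout: payload followed by one checksum byte.
   Returns nonzero if the frame is intact. */
int stream_check_frame(stream_stats *st, const uint8_t *frame, size_t len)
{
    assert(st != NULL && frame != NULL && len >= 2);

    int ok = sum8(frame, len - 1) == frame[len - 1];
    st->frames_seen++;
    if (!ok) {
        st->frames_bad++;
    }
    return ok;
}
```

After a soak test, the ratio of `frames_bad` to `frames_seen` turns "the data seems corrupt sometimes" into a number, and a count of zero tells you to look elsewhere.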
It's fine to be wrong. It's inexcusable to be wrong and rampage onward, making changes blindly.
And don't be so enamored of your new grand hypothesis that you miss data that might disprove it. The purpose of a hypothesis is to crystallize your thinking: if it is correct, you'll know what step to take next. If it's wrong, collect more data to formulate yet another theory. When Chernobyl exploded, Moscow sent in the USSR's top reactor experts. They walked through the parking lot, tripping over graphite chunks blown out of the building, shaking their heads and repeating “it can't have blown up.” Yet the evidence was brutally obvious.
One corollary is that a problem that mysteriously goes away tends to just as mysteriously return. When you fix the bug without ever developing an adequate hypothesis, you've likely left a lurking time bomb in the product.
And never use the old “glitch” excuse. There are no glitches. Transient failures come from physical causes; it's our responsibility to find and fix those causes. It's tempting to dismiss an intermittent bug as a glitch. For example, the Mars Pathfinder mission suffered from a software fault as it fell through the planet's atmosphere on its way to land. The mission did land successfully, and the designers later very impressively uploaded fixed code to the spacecraft, 40 million miles away. Amazing. But they saw the failure on Earth during test, twice, and characterized it as a “glitch.”
Fix the bug
There's more than one way to fix a problem. Hanging a capacitor on a PAL output to skew it a few nanoseconds is one approach. Another is to adjust the design to avoid the race condition entirely. We can try to beat that nasty function no one understands into submission (one more time) or we can recode it to make it reliable.
Sometimes a quick-and-dirty fix is worthwhile to avoid getting hung up on one little point if you are after bigger game. Always revisit the kludge and re-engineer it properly. Electronics have an unfortunate tendency to work in the engineering lab and not go wrong until the 5,000th unit is built. Firmware fails when stressed, when exceptions occur. If a fix feels bad, or if you have to furtively look over your shoulder and glue it in when no one is looking, then it is bad.
Finally, never fix the bug and assume all is okay just because the symptom has disappeared. Apply a little common sense and scope the signals to make sure you haven't serendipitously fixed the problem by creating a lurking new one.
Feedback stabilizes systems, be they electronic circuits or our approach to making things, or even how we deal with relationships. (“Okay, doing that annoys her; I'll avoid it in the future.”)
The best developers I know (and there are darn few in this category) fix a problem and then look for ways to never have the same problem again. Not all problems yield to preventative measures, but it's surprising how many do. In my collection of embedded disasters, the most common theme is inadequately tested exception handlers. This tells us, if we listen to the feedback, that testing those portions of the code effectively pays big dividends. One of the best practices of Extreme Programming is writing tests concurrently with the code, which can help find these problems early.
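Exception paths are hard to test mostly because the failure is hard to provoke on demand. One common remedy is to pass the failing dependency in, so a test can substitute one that always fails. The sketch below is hypothetical throughout: `sensor_sample`, the `read_fn` signature, and the fall-back-to-last-good-value policy are invented to show the shape of such a test, not drawn from any real driver.

```c
#include <assert.h>
#include <stddef.h>

/* A read routine returns 0 on success and nonzero on failure. */
typedef int (*read_fn)(int *out);

typedef struct {
    int last_good;   /* most recent successful reading */
    unsigned faults; /* how many reads have failed */
} sensor_state;

/* Return a fresh reading, or fall back to the last good value
   if the read fails. The failure branch is the exception handler
   that rarely runs in the lab. */
int sensor_sample(sensor_state *s, read_fn read)
{
    assert(s != NULL && read != NULL);

    int value;
    if (read(&value) != 0) {
        s->faults++;
        return s->last_good;
    }
    s->last_good = value;
    return value;
}

/* Test doubles: one read that always works, one that always fails. */
int read_ok(int *out)
{
    *out = 42;
    return 0;
}

int read_fail(int *out)
{
    (void)out; /* deliberately leaves *out untouched */
    return -1;
}
```

Calling `sensor_sample` first with `read_ok` and then with `read_fail` exercises the fallback branch every time the tests run, instead of waiting for the hardware to misbehave.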
One developer told me he found a bug where his program attempted to overwrite ROM. After finding the bug, he left the logic analyzer connected, to continue to find any overwrites to ROM. He found seven more cases. That's a great example of employing corrective feedback.
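Where no logic analyzer is at hand, a software trap can stand in for one during debug builds. This is a rough sketch under stated assumptions: the ROM address range is invented, and `checked_write8` is a hypothetical wrapper that suspect code would have to call in place of a raw pointer write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memory map; substitute the real ROM range. */
#define ROM_START ((uintptr_t)0x08000000u)
#define ROM_END   ((uintptr_t)0x0801FFFFu)

static_assert(ROM_START < ROM_END, "ROM range must be non-empty");

/* Evidence left behind for the debugger. */
unsigned rom_write_attempts;
uintptr_t rom_last_bad_addr;

/* Nonzero if addr falls inside the ROM range. */
int addr_in_rom(uintptr_t addr)
{
    return addr >= ROM_START && addr <= ROM_END;
}

/* Write one byte unless the target is in ROM.
   Returns 1 if the write was performed, 0 if it was trapped. */
int checked_write8(uintptr_t addr, uint8_t value)
{
    if (addr_in_rom(addr)) {
        rom_write_attempts++;
        rom_last_bad_addr = addr;
        return 0;
    }
    *(volatile uint8_t *)addr = value;
    return 1;
}
```

Leaving the trap in place after the first bug is fixed mirrors leaving the analyzer connected: any further nonzero `rom_write_attempts` points at the next offender. A breakpoint on the trapped branch will show the caller.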
And so, after fixing the radar detector and feeling tremendously foolish, I decided to repair the back-up, a unit cheaply purchased from a surplus shop. Armed with my previous experience, I checked the batteries first, found them bad, and installed new ones. The unit still didn't work, but it was then easy to locate and replace a bad potentiometer. So I guess it is possible to learn, though sometimes human nature gets in the way.
That's one reason we need a disciplined process for debugging and troubleshooting: to guide us when chaos reigns, and when we're apt to slip up.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. He founded two companies specializing in embedded systems.