11 steps to successful hardware troubleshooting
When the Gantt charts are drawn up at the beginning of a project, perhaps the hardest part for the hardware engineer to estimate is the debug phase of a product development. It is also one of the most ignored sections in planning.
CAD tools have progressed over the years in terms of ease of use and integration with PCB and mechanical design. But ultimately, the design work is carried out by a person who is not only fallible, but may also be working with incomplete or incorrect data. Some bugs are inevitable on all but the simplest designs, and so the art of troubleshooting these bugs is all-important.
Bugs can range from something going BANG the first time power is applied to intermittent glitches reported in association with completely unrelated things like “it was raining” or “it only happens on his bench, not on mine”. Consequently the ease of fixing bugs similarly ranges from a five-minute job to months of work.
Debugging can be the most fun part of electronics design when it is going well. There is a great deal of satisfaction in finding and fixing the intractable bugs. But to succeed, it is important to be systematic in the approach taken to fixing bugs.
This article lists the steps needed to bring such a systematic approach to troubleshooting hardware in product development. To illustrate these principles, I will refer back occasionally to work I performed years ago as a junior engineer on a system that was used for monitoring sixteen analog audio inputs at the same time. It looked something like Figure 1 below:
Occasionally, the DSP would stop receiving interrupts and the whole system would grind to a halt. This could happen days apart, or in a matter of just minutes. Software bugs had been eliminated as the cause, so this looked like a hardware bug and I was asked to investigate.
Step 1: Picture success
An important part of debugging is having the right mental attitude, as persistent problems can grind down your morale. In particular, it feels bad going to work two days in succession with the investigation stuck at exactly the same point. In such a case, ask yourself “Will I still be working on this bug in a year’s time?” The answer: Of course not! This bug isn’t forever, it’s going to be fixed. It’s not that there’s no solution, it’s that I simply haven’t seen it yet.
Step 2: Keep notes
Resist the temptation to dive straight in and try to fix the bug immediately. It is important to determine first whether others have dealt successfully with a similar problem. Collect reports from multiple sources, even though they may sometimes have conflicting data attached. A spreadsheet can work well here to organize what you find.
Step 3: Reproduce the problem
This is often the hardest and most time-consuming part. The frequency with which bugs show themselves varies enormously. So at this point, based on the information you have collected, you need to create the conditions under which you can make the bug happen on command.
At this point diagnosis can begin. The initial bug report may be “It stopped working”, “It crashed”, or something equally vague. Keep working until you have all the information you can get from the person who reported the problem and have enough to narrow down the range of possible causes.
Don’t worry about or speculate on the cause, just focus on reproducing the bug. Be careful at this point. Sometimes, similar but different issues can appear to be caused by the same bug, with one masking the other. If other bugs are uncovered while looking for the initial bug, make a note of them and go back to them later, but don’t get side-tracked.
In the example shown in Figure 1, the bug was extremely sporadic, so the first part of the exercise was to write software that exercised the system vigorously, sampling at the maximum rate. This reduced the time between failures and made analysis easier.
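As a rough illustration of why exercising the system harder helps, the sketch below simulates a fault that has a small, fixed chance of firing on every sample. The per-sample probability, the two sample rates and the function name are all hypothetical and are not taken from the original system. Under that assumption, the mean time between failures falls in proportion to the sample rate.

```python
import random

def mean_time_to_failure(sample_rate_hz, p_fail_per_sample, trials=200, seed=1):
    """Simulate `trials` runs and return the mean time (seconds) to first failure.

    Assumes (hypothetically) that every sample has the same independent
    chance `p_fail_per_sample` of triggering the bug.
    """
    rng = random.Random(seed)
    total_samples = 0
    for _ in range(trials):
        samples = 1
        # Keep sampling until the simulated fault fires.
        while rng.random() >= p_fail_per_sample:
            samples += 1
        total_samples += samples
    return (total_samples / trials) / sample_rate_hz

# Illustrative rates only: a "normal" rate and a "maximum" rate.
slow = mean_time_to_failure(sample_rate_hz=1_000, p_fail_per_sample=1e-3)
fast = mean_time_to_failure(sample_rate_hz=48_000, p_fail_per_sample=1e-3)
print(f"normal rate: {slow:.3f} s, maximum rate: {fast:.5f} s between failures")
```

A real bug will rarely be this well behaved, but if the failure rate does not rise when the system is exercised harder, that is itself useful evidence about where the fault is not.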
Step 4: Gather the evidence
Be methodical and document what you see and what happens. Don’t theorise at this point about causes, just create a table of what aggravates or alleviates the bug as well as what has no effect. Be aware that multiple bugs can have the same symptoms, which can produce contradictory evidence.
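One possible way to keep such a table honest is to record hours run and failures seen for each condition and let a small script do the classification. This is a minimal sketch: the condition names, the observations and the 50% `margin` threshold are invented for illustration, and a spreadsheet would serve just as well.

```python
from collections import defaultdict

def summarise(observations, baseline="baseline", margin=0.5):
    """Classify each condition by its failure rate relative to the baseline.

    observations: iterable of (condition, hours_run, failures).
    margin: hypothetical threshold; a rate within +/-50% of the baseline
            is reported as "no effect".
    """
    hours = defaultdict(float)
    failures = defaultdict(int)
    for condition, hours_run, count in observations:
        hours[condition] += hours_run
        failures[condition] += count

    base_rate = failures[baseline] / hours[baseline]
    table = {}
    for condition in hours:
        if condition == baseline:
            continue
        rate = failures[condition] / hours[condition]
        if rate > base_rate * (1 + margin):
            verdict = "aggravates"
        elif rate < base_rate * (1 - margin):
            verdict = "alleviates"
        else:
            verdict = "no effect"
        table[condition] = (rate, verdict)
    return table

# Made-up observations, purely for illustration.
log = [
    ("baseline", 10.0, 4),
    ("max sample rate", 2.0, 9),
    ("fan switched off", 10.0, 5),
    ("slower bus clock", 10.0, 0),
]
for condition, (rate, verdict) in summarise(log).items():
    print(f"{condition:18s} {rate:5.2f} failures/hour  {verdict}")
```

With only a handful of failures per condition the verdicts are weak, so treat the output as a prompt for more testing rather than as a conclusion.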
Step 5: Try the easy stuff first
Now that you can reproduce the problem, look for the easy explanations first. For instance, are the connectors wired back-to-front? Are the chip pin-outs as per the data sheet? Are clocks running at the correct frequency? Very often, bugs are caused by mistakes that look dumb only in hindsight.
Remember that when you did the design work, you probably had thousands of small decisions to make; most of these were correct. If the error proves to be “obvious” you can correct it and skip to step 9.
Step 6: Break the problem down
So now the easy things have been checked out. You can reproduce the problem, but perhaps only occasionally, or there are conflicting messages. It is important to remember that in complex systems, multiple bugs can show the same symptoms on the surface, but require different cures.
To help clarify the issue, eliminate as much of the system as possible that doesn’t appear to be relevant to the bug. For instance, you could power down devices on the PCB that are unrelated, or unplug cables to other boards. Do this while retesting. If the bug suddenly stops when an unrelated module is taken out of the equation, you have a smoking gun. Document it. Try to reproduce the bug again, with the module and then again without it.
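The elimination procedure above can be sketched as a simple loop. Everything here is hypothetical: the module names are invented, and `bug_reproduces` stands in for whatever bench test, manual or automated, tells you whether the bug still shows up with a given set of modules enabled.

```python
def find_suspects(modules, bug_reproduces):
    """Return the modules whose removal makes the bug stop reproducing.

    modules: names of parts that can be powered down or unplugged.
    bug_reproduces: callable taking the set of enabled modules and
        returning True if the bug still shows up.
    """
    everything = set(modules)
    if not bug_reproduces(everything):
        raise RuntimeError("bug does not reproduce on the full system")

    suspects = []
    for module in modules:
        without = everything - {module}
        # Retest without the module, then again with it restored, so that
        # a single lucky run is not mistaken for a smoking gun.
        stops_without = not bug_reproduces(without)
        returns_with = bug_reproduces(everything)
        if stops_without and returns_with:
            suspects.append(module)
    return suspects

# Stand-in for a bench test: in this made-up case the bug needs the
# Ethernet module to be powered.
def fake_bench_test(enabled):
    return "ethernet" in enabled

print(find_suspects(["adc", "ethernet", "display", "usb"], fake_bench_test))
# prints ['ethernet']
```

For an intermittent bug, each call to the bench test would need to run long enough to be meaningful, which is where the stress software from step 3 earns its keep.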
In the DSP system described earlier (Figure 1), the only symptom was that the interrupts from the FPGA sometimes just stopped coming in. We wrote a simple program that just thrashed the system, reading from the interrupt status registers at the highest rate possible with the sample rate as high as possible, but without any other accesses to other parts of the system.
The frequency of the bug dramatically increased and a bug in the FPGA was found, relating to the system reading the interrupt status register at the same time as a new interrupt arrived. We fixed the issue and put the board on for a 24-hour soak test. After 23¾ hours, the bunting was up and we were starting to push the champagne corks out of the bottles… then it crashed again.
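To show how a read that coincides with a new interrupt can lose that interrupt, here is a toy model of a status register. It is a sketch under assumptions: the clear-on-read behaviour and the `set_wins` flag are invented for illustration, since the details of the real FPGA logic are not given here.

```python
class StatusRegister:
    """Toy model of a clear-on-read interrupt status register.

    The clear-on-read behaviour and the `set_wins` flag are assumptions
    made for illustration, not a description of the real FPGA design.
    """

    def __init__(self, set_wins):
        self.set_wins = set_wins
        self.pending = 0

    def clock(self, new_irq_bits=0, read=False):
        """Advance one clock cycle; return the value read, or None."""
        if not read:
            self.pending |= new_irq_bits
            return None
        value = self.pending
        if self.set_wins:
            # Clear only the bits the CPU saw, then latch the new interrupt.
            self.pending = (self.pending & ~value) | new_irq_bits
        else:
            # Buggy: the clear overrides the set arriving in the same cycle.
            self.pending = 0
        return value

def interrupts_seen(reg):
    """Run a three-cycle scenario and return every bit the CPU observed."""
    reg.clock(new_irq_bits=0b01)                     # interrupt 0 arrives
    seen = reg.clock(new_irq_bits=0b10, read=True)   # read collides with interrupt 1
    seen |= reg.clock(read=True)                     # CPU reads again
    return seen

print(bin(interrupts_seen(StatusRegister(set_wins=False))))  # prints 0b1
print(bin(interrupts_seen(StatusRegister(set_wins=True))))   # prints 0b11
```

In the buggy variant the second interrupt is never seen by software, which matches the symptom of interrupts that simply stop arriving.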
Step 7: Talk it over with a colleague
When dealing with what seem to be intractable bugs, just talking it over with someone else can often help, even when that person is from a different engineering discipline. Explaining what you see to another can be all that is required for you to see the bug from a different point of view and realize a crucial fact. At the very least you may come up with inconsistencies that need ironing out or receive suggestions of other things to try.
This conversation is best had away from the action. Go through exactly what the evidence is, one bit at a time, then look for what other experiments or investigations can be carried out. Then go back to the board and carry on.
In my example, we had a meeting with the office hardware guru and ran through the system, drawing it all up on a whiteboard. Looking at how the FPGA logic worked in action, he suggested there might be a metastability issue in the FPGA.
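For readers who have not met metastability before, the commonly quoted estimate for how often a flip-flop sampling an asynchronous signal fails to settle is MTBF = exp(t_settle / tau) / (T_w × f_clk × f_data). The sketch below evaluates that expression. All parameter values are invented for illustration and do not come from any data sheet or from the system described here.

```python
import math

def synchronizer_mtbf(settle_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Estimated mean time between metastability failures, in seconds.

    settle_time_s: time the flip-flop output has to resolve before it is used.
    tau_s, window_s: device-dependent constants (illustrative values below).
    f_clk_hz: sampling clock frequency.
    f_data_hz: average rate at which the asynchronous input changes.
    """
    return math.exp(settle_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)

# All numbers are invented for illustration, not taken from any data sheet.
params = dict(tau_s=100e-12, window_s=100e-12, f_clk_hz=50e6, f_data_hz=1e6)
short_settle = synchronizer_mtbf(settle_time_s=1e-9, **params)
long_settle = synchronizer_mtbf(settle_time_s=1e-9 + 1 / 50e6, **params)
print(f"short settling time:    {short_settle:.1f} s between failures")
print(f"one extra clock period: {long_settle:.1e} s between failures")
```

The exponential term is why metastability bugs can hide for hours or days, and why giving the signal a little more time to settle can push the failure rate down so far that it is never seen again.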
Step 8: Apply the fix
You understand the bug and have come up with a rational solution. You run the code and the problem appears to be solved. However, your job isn’t over yet.
Step 9: Try to break it again
Try to break the system again. To be sure you have succeeded, you will need to put the system through an appropriate series of stress tests an order of magnitude beyond that of the original implementation.
For instance, if a real-time system such as the one above crashed every ten minutes and never lasted longer than an hour, but now runs for ten hours, the bug is almost certainly fixed.
You may find that the system behaves better, but still crashes. In that case you may have discovered a new bug that had been masked by the previous one. You need to treat it as such and go back to step one, creating a fresh investigation on the “cured” system.
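The "almost certainly" can be given a rough number. This is a minimal sketch that assumes failures arrive at random with a constant average rate, which is a modelling assumption and not something established for the system described here; the function name is illustrative.

```python
import math

def survival_probability(run_hours, mtbf_hours):
    """Chance that a still-buggy system runs `run_hours` without failing,
    assuming failures arrive at random with a constant average rate."""
    return math.exp(-run_hours / mtbf_hours)

# A bug that used to strike every ten minutes on average, and a ten-hour run.
p = survival_probability(run_hours=10, mtbf_hours=10 / 60)
print(f"chance the old bug is still there and simply stayed quiet: {p:.1e}")
```

Under that assumption the chance is vanishingly small, far below anything worth worrying about. The same function shows why a run only two or three times longer than the old time between failures proves very little.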
Happily, in my example, our resident guru was correct and a simple modification to the FPGA solved the problem. There had been two bugs with one symptom: one caused crashes roughly every five minutes, the other every few hours (typically about five). The nearly 24 hours in our soak test turned out to be a fluke. We had finally reproduced, analysed, understood and fixed the problem.
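The same reasoning can be turned around to estimate how long a soak test needs to be. The sketch below again assumes failures occur at random at a constant average rate and takes the "about five hours" figure as the mean time between failures; both are simplifications, and the function name is illustrative.

```python
import math

def soak_hours_needed(mtbf_hours, confidence=0.99):
    """Hours of failure-free running needed before a bug with the given
    mean time between failures would have shown up with probability
    `confidence`, assuming failures occur at random at a constant rate."""
    return -mtbf_hours * math.log(1.0 - confidence)

print(f"soak time for 99% confidence: {soak_hours_needed(5):.1f} hours")

# Chance that a bug striking about every five hours stays quiet for 23.75 hours.
fluke = math.exp(-23.75 / 5)
print(f"chance of surviving 23.75 hours: {fluke:.4f}")
```

On these assumptions a 23¾-hour clean run with the second bug still present had a probability of under one percent, so "fluke" is a fair description.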
Step 10: Remember ‘disappearing’ bugs are still there if you haven’t fixed them
Sometimes bugs just appear to go away by themselves. This can be frustrating, but you can be sure that you haven’t fixed the bug. Either the initial report was incorrect or the bug is still there. These are the sort of bugs that reappear when your boss, his boss, or a customer is present.
It can be tempting to lift up the carpet and sweep these bugs under it, but don’t. Document the bug and carry on looking into other issues, as it may return by itself. Ultimately, though, you need to go back and fix it, which means putting more effort into aggravating the problem until the bug can be reproduced.
Step 11: Celebrate
Remember how bad it felt when the bug was grinding you down? Now, celebrate when you win. It’s you: 1, bugs: 0. Now the game can move on and you can be sure you’ll fix the next bug too.
In the next part in this series I provide some real world examples of how these suggestions led to successful bug resolution.
Dunstan Power is a chartered electronics engineer providing design, production and support in electronics to all of ByteSnap Design's clients. Having graduated with a degree in engineering from Cambridge University, Dunstan has been working in the electronics industry since 1992, and in 2004 founded Diglis Design Ltd, an electronic design consultancy, where he developed many successful electronic board and FPGA designs.