11 steps to successful hardware troubleshooting – Part 2
In Part 1 of this article I outlined 11 steps to follow when debugging embedded hardware. Here in Part 2, I will describe the use of these steps with some real bugs that colleagues of mine have encountered. I asked our engineers to tell me of any particularly intractable bugs they have encountered or heard about. I will refer to those engineers collectively as “Fred” and to customers as “Phil.”
First, to review, here are the 11 steps I described in Part 1:
- Step #1: Picture success
- Step #2: Keep notes
- Step #3: Reproduce the problem
- Step #4: Gather the evidence
- Step #5: Try the easy stuff first
- Step #6: Break the problem down
- Step #7: Talk it over with a colleague
- Step #8: Apply the fix
- Step #9: Try to break it again
- Step #10: Remember ‘disappearing’ bugs are still there if you haven’t fixed them
- Step #11: Celebrate
Bug example #1 - crossed lines
The concept was pretty simple within a complex system. Fred had designed a video surveillance board that took two analog camera inputs, digitized them, compressed them into MPEG4, and then streamed that data over Ethernet to a PC.
The project was running late because the original aim of streaming over USB had been abandoned due to the low performance of the MPEG4 compressor’s USB interface. So the marginally more expensive option of Ethernet was used instead. But now the Ethernet was behaving strangely.
An experienced engineer with years of Ethernet designs under his belt, Fred had always regarded Ethernet as one of those interfaces that “just worked” (unlike a certain other well-known interface with a three-letter name). In this case, the MPEG4 SoCs streamed their data via a 3-port hub. The software supplied with the SoC had turned out to be pretty dodgy and unreliable, but as a core function, the Ethernet solution should have worked out of the box. Instead, something was causing the Ethernet output to be horribly corrupted. Time for the 11 steps.
Fred’s colleague, John, was writing the software that handled the Ethernet output on the SoCs. He was good at getting to the bottom of bugs so Step 1 was not a problem. (Throughout this article, refer to the list at the top to remind yourself what each step entails, or go to Part 1 for details.) The real question was how long the debugging process would take.
He started by using Wireshark to record what data was actually being received, then he entered this data into a spreadsheet (Steps 2, 3, and 4). Initially this was a mess, as the board was booting and attempting to pass streaming video data to a known IP address. John stripped the software back to the basics (Steps 5 and 6), running just a bootloader to try and establish a connection through a broadcast write.
John wrote to Fred: “I’ve made a hex dump of all frames received by device, and imported them into Wireshark. It seems that they are all valid Ethernet frames (or at least they start as valid Ethernet frames). At the moment I’m comparing what I’ve got with what is actually transmitted in the network (imported data with Wireshark capture on my PC network card). On the first look, there is some similarity, but many frames are missing, and those received are truncated. It doesn’t seem that the data within the frames is damaged, just truncated.
“Further analysis shows that all frames have 1 to 6 bytes chopped off the end of the frame.”
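The comparison John describes, matching each received frame against what was actually transmitted and counting the bytes lost from the end, can be sketched in a few lines of Python. This is only an illustration, not John's actual tooling: the function name `compare_frames` is hypothetical, and it assumes both captures have already been reduced to lists of raw frame bytes.

```python
def compare_frames(transmitted, received):
    """Match each received frame to a transmitted frame that it is a prefix of.

    Returns (report, missing), where report is a list of
    (received_frame, bytes_chopped) tuples, with bytes_chopped set to None
    when no transmitted frame starts with the received data, and missing
    is the list of transmitted frames that never showed up at all.
    """
    remaining = list(transmitted)
    report = []
    for rx in received:
        # A truncated frame is a strict or exact prefix of the original.
        match = next((tx for tx in remaining if tx.startswith(rx)), None)
        if match is None:
            report.append((rx, None))  # damaged, not merely truncated
        else:
            remaining.remove(match)
            report.append((rx, len(match) - len(rx)))
    return report, remaining


if __name__ == "__main__":
    # Made-up example data: two frames sent, one arrives four bytes short.
    sent = [bytes(range(64)), b"\xaa" * 60]
    seen = [bytes(range(60))]
    report, missing = compare_frames(sent, seen)
    for frame, chopped in report:
        print(len(frame), "bytes received,", chopped, "bytes chopped")
    print(len(missing), "frame(s) missing")
```

A report in which every entry has a small positive `bytes_chopped` and none is `None` corresponds to John's observation: the data is intact but cut short.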
John could see that the problem was on the RX (receive) side of the Ethernet interface. The TX (transmit) data was getting through fine, and the RX side was receiving broadcast packets but not unicasts. Even on broadcast receives, however, the data was strangely truncated. Having already waded through a swamp of software bugs in the vendor’s SDK, John had assumed that it was another software issue. But this looked like a hardware bug. Was it a problem with the hub, or with the SoC itself? It was time to talk it over with Fred (Step 7). Together, they pored over the schematics once again (Figures 1 and 2).
Figure 1 and Figure 2 show the incorrect and correct implementation, respectively, of a segment of the Ethernet hub schematic. Can you spot the difference?
Did you find it?
The answer is that the RX_DV (receive data valid) lines are swapped. These data valid strobes worked to an extent on a broadcast, as the same data was output on both ports. But there was a buffer delay between the ports, hence the truncation of the data. Unicasts didn’t work at all, as the data valid was going active on the wrong port.
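To see why a swapped strobe truncates broadcasts and kills unicasts, here is a deliberately simplified Python model. It is a sketch under stated assumptions, not a description of the real hub: it works in whole byte-times rather than MII nibbles, the three-byte inter-port delay is an arbitrary illustrative value, and the names `port_signals` and `mac_receive` are hypothetical. The receiver ignores the line until it sees the start-of-frame delimiter and then collects bytes for as long as its data valid input stays asserted.

```python
PREAMBLE = [0x55] * 7   # Ethernet preamble bytes
SFD = 0xD5              # start-of-frame delimiter
IDLE = None             # nothing on the data lines


def port_signals(payload, start, total):
    """Return (data, dv) lists, one entry per byte-time, for one hub port."""
    frame = PREAMBLE + [SFD] + list(payload)
    data = [IDLE] * total
    dv = [False] * total
    for i, byte in enumerate(frame):
        data[start + i] = byte
        dv[start + i] = True
    return data, dv


def mac_receive(data, dv):
    """Collect the bytes that follow the SFD while dv stays asserted."""
    received = []
    seen_sfd = False
    for byte, valid in zip(data, dv):
        if not valid:
            if seen_sfd:
                break           # strobe dropped: end of frame
            continue            # strobe not yet active
        if not seen_sfd:
            seen_sfd = (byte == SFD)
            continue
        if byte is IDLE:
            break               # simplification: stop when the data runs out
        received.append(byte)
    return bytes(received)


payload = bytes(range(20))
delay = 3                       # assumed buffer delay between ports
total = 8 + len(payload) + delay + 4

# Broadcast: the same frame leaves both ports, port 2 slightly later.
data1, dv1 = port_signals(payload, 0, total)
data2, dv2 = port_signals(payload, delay, total)
correct = mac_receive(data2, dv2)    # port 2 wired to its own strobe
swapped = mac_receive(data2, dv1)    # port 2 wired to port 1's strobe

# Unicast: data only on port 2, so port 1's strobe never goes active.
unicast = mac_receive(data2, [False] * total)

print(len(correct), len(swapped), len(unicast))
```

In this model the swapped broadcast loses exactly `delay` bytes from the end of the frame, because the borrowed strobe asserts and drops early, while the unicast yields nothing because the strobe that does go active belongs to the other port.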
Now came Step 8: resolving this particular bug with a scalpel and soldering iron. They tested the fix, and the board started streaming video data.