In Part 1 of this article I outlined 11 steps to follow when debugging embedded hardware. Here in Part 2, I will describe the use of these steps with some real bugs that colleagues of mine have encountered. I asked our engineers to tell me of any particularly intractable bugs they have encountered or heard about. I will refer to those engineers collectively as “Fred” and to customers as “Phil.”
First, to review, here are the 11 steps I described in Part 1:
- Step #1: Picture success
- Step #2: Keep notes
- Step #3: Reproduce the problem
- Step #4: Gather the evidence
- Step #5: Try the easy stuff first
- Step #6: Break the problem down
- Step #7: Talk it over with a colleague
- Step #8: Apply the fix
- Step #9: Try to break it again
- Step #10: Remember ‘disappearing’ bugs are still there if you haven’t fixed them
- Step #11: Celebrate
Bug example #1 – crossed lines
The concept was pretty simple within a complex system. Fred had designed a video surveillance board that took two analog camera inputs, digitized them, compressed them into MPEG4, and then streamed that data over Ethernet to a PC.
The project was running late because the original aim of streaming over USB had been abandoned due to the low performance of the MPEG4 compressor’s USB interface. So the marginally more expensive option of Ethernet was used instead. But now the Ethernet was behaving strangely.
An experienced engineer with years of Ethernet designs under his belt, Fred regarded Ethernet as one of those interfaces that “just worked” (unlike a certain other well-known interface with a three-letter name). In this case, the MPEG4 SoCs streamed their data via a 3-port hub. The software supplied with the SoC had turned out to be pretty dodgy and unreliable, but as a core function, the Ethernet solution should have worked out of the box. Instead, something was causing the Ethernet output to be horribly corrupted. Time for the 11 steps.
Fred’s colleague, John, was writing the software that handled the Ethernet output on the SoCs. He was good at getting to the bottom of bugs so Step 1 was not a problem. (Throughout this article, refer to the list at the top to remind yourself what each step entails, or go to Part 1 for details.) The real question was how long the debugging process would take.
He started by using Wireshark to record what data was actually being received, then he entered this data into a spreadsheet (Steps 2, 3, and 4). Initially this was a mess, as the board was booting and attempting to pass streaming video data to a known IP address. John stripped the software back to the basics (Steps 5 and 6), running just a bootloader to try and establish a connection through a broadcast write.
John wrote to Fred: “I’ve made a hex dump of all frames received by device, and imported them into Wireshark. It seems that they are all valid Ethernet frames (or at least they start as valid Ethernet frames). At the moment I’m comparing what I’ve got with what is actually transmitted in the network (imported data with Wireshark capture on my PC network card). On the first look, there is some similarity, but many frames are missing, and those received are truncated. It doesn’t seem that the data within the frames is damaged, just truncated.
“Further analysis shows that all frames have 1 to 6 bytes chopped off the end of the frame.”
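The comparison John describes can be approximated in a few lines of Python. This is an illustrative sketch only, not the tooling John used: the function name `bytes_missing` and the idea of matching a received frame to a sent frame by prefix are assumptions. It reports how many trailing bytes a received frame has lost relative to the first sent frame that begins with it.

```python
from typing import List, Optional


def bytes_missing(received: bytes, sent_frames: List[bytes]) -> Optional[int]:
    """Return how many trailing bytes were lost from a received frame.

    The received frame is matched to the first sent frame that starts with
    it. Returns None when the frame is empty or no sent frame matches.
    """
    if not received:
        return None
    for sent in sent_frames:
        if sent.startswith(received):
            return len(sent) - len(received)
    return None


if __name__ == "__main__":
    # Made-up frame contents, purely for demonstration.
    sent = [bytes(range(64)), bytes(range(10, 80))]
    print(bytes_missing(sent[0][:-3], sent))
```

Prefix matching works here because the frames were reported as truncated rather than damaged. Frames sharing the same addresses begin with identical bytes, so a heavily truncated frame could match the wrong original; with only 1 to 6 bytes missing that is unlikely to matter.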
John could see that the problem was on the RX (receive) side of the Ethernet interface. The TX (transmit) data was getting through fine, and the RX side was receiving broadcast packets but not unicasts. However, on RX broadcast receives, the data was strangely truncated. Having already waded through a swamp of software bugs in the vendor’s SDK, John had assumed that it was another software issue. But this looked like a hardware bug. Was it a problem with the hub, or with the SoC itself? It was time to talk it over with Fred (Step 7). Together, they pored over the schematics once again (Figures 1 and 2).
Figure 1 and Figure 2 show the incorrect and correct implementation, respectively, of a segment of the Ethernet hub schematic. Can you spot the difference?
Did you find it?
The answer is that the RX_DV (receive data valid) lines are swapped. These data valid strobes worked to an extent on a broadcast, as the same data was output on both ports. But there was a buffer delay between the ports, hence the truncation of the data. Unicasts didn’t work at all, as the data valid was going active on the wrong port.
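To see why a swapped data valid strobe would chop bytes off the end of a broadcast frame, here is a toy timing model. Everything in it is an assumption made for illustration: the tick-based timing, the four-tick buffer delay, the helper names `port_signals` and `capture`, and the simplification that a receiver ignores idle ticks until frame data appears. It is not a model of the actual hub or SoC.

```python
from typing import List, Optional, Tuple

TICKS = 40  # length of the simulated time window (arbitrary)


def port_signals(frame: bytes, start: Optional[int]) -> Tuple[List[Optional[int]], List[bool]]:
    """Return (data, dv) for one port; start=None means the port stays idle."""
    data: List[Optional[int]] = [None] * TICKS
    dv = [False] * TICKS
    if start is not None:
        for i, byte in enumerate(frame):
            data[start + i] = byte
            dv[start + i] = True
    return data, dv


def capture(data: List[Optional[int]], dv: List[bool]) -> bytes:
    """Latch a byte on every tick where the strobe is active and data is present."""
    return bytes(b for b, valid in zip(data, dv) if valid and b is not None)


frame = bytes(range(1, 21))

# Broadcast: the same frame leaves both ports, the second one 4 ticks later.
data_a, dv_a = port_signals(frame, start=2)
data_b, dv_b = port_signals(frame, start=6)
correct = capture(data_b, dv_b)   # strobe wired to its own port
swapped = capture(data_b, dv_a)   # strobe taken from the other port

# Unicast: only port A carries the frame, so the other port's strobe never fires.
_, dv_idle = port_signals(frame, start=None)
unicast_swapped = capture(data_a, dv_idle)
```

In this model the swapped strobe drops away before the delayed data has finished, so the tail of the broadcast frame is lost, and the unicast frame is never captured at all.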
Now came Step 8 to resolve this particular bug with scalpel and iron. They tested the fix and it started streaming video data.
Bug example #2 – Touchscreen malfunction
A year later Fred was working on a controller board for an industrial mixing machine. Phil, his customer, had designed the control system for the mixer; Fred had responsibility for designing the touchscreen PC that a plant operator could use to operate the system. The touchscreen system seemed to work just fine, except when it was plugged into Phil’s board.
“We have noticed that if the machine is powered up before your control board, the control board won’t boot.”
“Really?” Fred thought for a while. He only had his control board and any peripherals had run off the same supply as Phil’s board.
Fred knew he would fix the bug (Step 1), but there were a lot of interfaces between the two boards, not to mention interconnected power supplies. So debugging was going to have to be methodical and well-documented (Step 2). Phil thought it was probably a back-powering issue and while that seemed possible, Fred thought it would be better to keep an open mind until he could reproduce the problem (Step 3).
Unfortunately, due to the size of the equipment, Fred didn’t have a target board with which to work. So he tried to reproduce the issue by connecting up as many of the interfaces as possible to power supplies: the SPI to a memory device and the serial ports and USB to a PC.
He then tried to power up the boards and devices in different orders. Unfortunately, no luck. His board booted every time. He came to the conclusion that his mock system was not correctly simulating that of the customer. He drove to the customer site, test equipment in hand. Sure enough, they were able to easily demonstrate the problem (Step 4).
Now on site, he tried the easiest thing (Step 5). He powered his board from the target board but held it in reset so that even though it was powered at the same time, it would boot later. This also triggered the problem. He then tried powering the boards from different power supplies with higher current limits, but this made no difference. Now it was time to break the problem down (Step 6).
Fred disconnected each interface one at a time and tried booting. Here came the first breakthrough. The board booted fine with the SPI disconnected. Fred knew that he had tested the system back at the office with the SPI attached to an SPI slave and had no problem. So there was something different with this slave that must be causing the issue.
He then disconnected each SPI line one at a time until he identified the culprit. The MISO line (data from the target into his processor) was causing the boot failure. An oscilloscope showed him it booted up driven low, so it wasn’t powering his board in any way. However, when he looked at how a standard SPI slave operated, he saw that the MISO line normally tristated if the slave was not selected.
After scratching his head for a while, he called the office to speak to John (Step 7), the firmware engineer. Did he have any suggestions? After all, the boot mode was his domain.
John took a look at the data sheet for the module and could find nothing obvious. Delving further into the data sheet for the actual target processor board, he immediately found the problem: the MISO line was used incorrectly by that processor as a boot mode selection line and was read at start-up to determine what to boot from.
Luckily, the SPI interface on the target board was in an FPGA and a fix (Step 8) was easily added to tristate the output at boot time. Fred also added a buffer to his board to tristate the MISO line regardless of what the other board threw at it.
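The interaction between a driven MISO line and a boot mode strap can be sketched with a very small model. The strap polarity, the pull-up, and the names `resolve_line` and `boot_source` are all hypothetical; the story only tells us that the line was driven low at power-up and that the processor read it at start-up to choose its boot source.

```python
from typing import Optional, Sequence


def resolve_line(drivers: Sequence[Optional[int]], pull: int) -> int:
    """Resolve the logic level of a shared line.

    Each driver is 0, 1, or None (tristated). If nobody drives the line,
    the pull resistor sets the level.
    """
    active = {level for level in drivers if level is not None}
    if len(active) > 1:
        raise ValueError("bus contention: drivers disagree")
    return active.pop() if active else pull


def boot_source(strap_level: int) -> str:
    """Hypothetical strap decoding: high selects normal boot, low an alternate mode."""
    return "normal boot" if strap_level == 1 else "alternate boot mode"


# Before the fix: the unselected slave still drives MISO low at power-up.
before_fix = boot_source(resolve_line([0], pull=1))

# After the fix: the slave output is tristated, so the pull-up wins.
after_fix = boot_source(resolve_line([None], pull=1))
```

Under these assumptions, tristating the output is enough to restore the intended strap level, which is consistent with both the FPGA fix and the added buffer.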
Now, even when they tried their best to replicate the original problem (Step 9), the boards booted regardless of the power order.
Time to celebrate? Not so fast.
Bug example #3 – An SPI driver problem
“Fred, while you are here, there is something else we would like you to look at. When using the SPI interface, the first read after the board is booted from cold is always 0xFF, no matter what the data it is trying to read back. Once the first read has taken place, the data is always correct. We can close down and restart the device on the other end of the SPI interface with no effect, which suggests to us that this is not where the problem is. We can stop and restart the application on the embedded board used to read the data, again with no effect. This suggests to us that there may be an issue with the SPI driver.”
Fred knew he would crack it (Step 1), but initially he was stumped. The interface between the two boards was straightforward: an SPI link to an FPGA on the customer board. He had tested it using SPI memory and everything had looked normal. However, in addition to designing the attached board, the customer had implemented the software application, so there were variables outside of Fred’s control.
The board was booted, Phil ran his code and accessed their board. Sure enough, there was the bad byte in evidence (Step 3).
Time to attach an oscilloscope. Figure 3 shows what he saw:
On the face of it, it looks OK. The yellow is the clock, the green is the chip select, and the blue, the output data. The first clock edge has zero data against it. So what’s the problem?
Fred tried attaching a development board to the target with the same result (Step 4). This really did look like a driver issue but the scope trace was showing nothing to support that view. He checked the processor errata issued by the manufacturer and there was indeed a bug in the SPI controller. It occurred when the clock polarity was set to sample on the falling edge. The erratum said that in this case, at the end of the access, the chip select would not be released.
“Phil, you wouldn’t by any chance be sampling on the falling edge, would you?”
“Of course, we always do that, it is procedure 231A in our company design manual.”
At this point, it dawned on Fred that when the trace was read the way the customer was using the interface (i.e., with falling-edge sampling), there was an extra clock edge. This could easily cause the problem. But why was the SPI controller doing this at the beginning of the access? Time to talk to John, who had the driver source code (Step 7).
A quick look at the driver source code showed the likely cause: a botched software workaround had reconfigured the chip select as a GPIO and was activating it, then turning on the SPI. This meant that the clock was in the wrong state at the start of the access: a dummy bit read was executing and a one-bit offset was causing the byte to be read incorrectly.
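The effect of a one-bit offset is easy to demonstrate with a toy shift-register model. This is a sketch under stated assumptions, not the driver or the FPGA logic: the names `to_bits` and `master_read` are made up, the slave is assumed to shift out MSB first, and the line is assumed to idle high. It makes no attempt to reproduce the exact 0xFF value Phil reported, only to show how a single spurious clock edge misaligns the byte that follows.

```python
from typing import List


def to_bits(data: bytes) -> List[int]:
    """Flatten bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def master_read(slave_bits: List[int], n_bytes: int, spurious_edges: int = 0) -> bytes:
    """Assemble n_bytes from the slave's bit stream.

    Each spurious clock edge makes the slave shift out one bit before the
    master starts sampling, so the master sees the stream that many bits
    late. Bits past the end of the stream read as 1 (assumed idle-high line).
    """
    stream = slave_bits[spurious_edges:] + [1] * spurious_edges
    result = bytearray()
    for start in range(0, 8 * n_bytes, 8):
        value = 0
        for bit in stream[start:start + 8]:
            value = (value << 1) | bit
        result.append(value)
    return bytes(result)


payload = b"\xa5\x3c"  # arbitrary example data
clean = master_read(to_bits(payload), 2)
shifted = master_read(to_bits(payload), 2, spurious_edges=1)
```

With no spurious edge the payload is read back unchanged; with one, the first byte comes back as 0x4A instead of 0xA5. Which way the offset falls on real hardware depends on the controller and the slave, so treat the direction here as an assumption.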
A software fix (Step 8) was made to make sure the SPI CS was only being controlled at the end of the access, resulting in the nice transitions shown in Figure 4:
Fix applied, they attempted to break it again (Step 9). It worked fine.
Fred was so pleased he bought himself a doughnut on the way home (Step 11).
Dunstan Power is a chartered electronics engineer providing design, production, and support in electronics to all of ByteSnap Design's clients. Having graduated with a degree in engineering from Cambridge University, Dunstan has been working in the electronics industry since 1992 and, in 2004, founded Diglis Design Ltd, an electronic design consultancy, where he developed many successful electronic board and FPGA designs.
In 2008, Dunstan teamed up with his former colleague Graeme Wintle to establish a company that would supply its clients with integrated software development and embedded design services, and ByteSnap Design was born.