Today’s system-on-chip (SoC) designs aren’t just hardware anymore. In the past, the creation of hardware chips was separate from the creation of the software to be executed on those chips, but today an SoC isn’t complete until you’ve proven that the intended software works – and works well – on the platform. In other words, today the SoC itself is a full-fledged embedded system.
This comes with a benefit. In the past, if there was a hardware problem the software programmer had to figure out how to code around it because the hardware was already done. By validating the software before declaring an SoC to be complete, you get an opportunity to fix hardware issues before they’re cast into silicon. But that benefit comes with a challenge: a bug may trace back to either the hardware or the software, and you can’t assume that either one is correct.
Hardware debug before tape-out is traditionally done using simulation. Various other hardware validation strategies involving things like formal verification have augmented simulation to increase basic coverage and ensure that corner cases aren’t missed, but for debugging, simulation maintains its central role.
Software debug is traditionally done using debug engines, one per core. They take advantage of hardware features that provide some level of visibility into and control over the goings-on inside a processor. While there is a basic set of expected debug capabilities, your ability to diagnose issues is limited by the kind of access that the processor provides.
Traditional software debug also typically happens on the actual system, so you’re executing real code on real hardware at target system speeds. This allows you to get through large volumes of code to an offending routine relatively quickly.
These traditional techniques break down when debugging an SoC. Because there is no real hardware, code cannot be executed at true system speed. The hardware can theoretically be simulated as code is executed, with the benefit that you get all of the hardware visibility provided by the simulator: nothing is hidden. The problem is speed: this is an inordinately slow way to debug code.
If your SoC will be running programs over Linux, for example, you have to complete the Linux boot – billions of clock cycles – before your software can even begin executing. It is estimated that, at typical simulation speeds (around 10 Hz equivalent), a complete Linux boot would take over 28 years.
This is where emulation becomes critical. The SoC design is mapped onto real hardware, typically an FPGA or some other programmable element, giving it much higher speed. On such a system, the Linux boot can be accomplished in as little as 15 minutes, depending on the actual speed being run.
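To make the speed gap concrete, here is a rough back-of-the-envelope sketch. The cycle count is an assumption, back-computed from the 28-year estimate at 10 Hz, and the 10 MHz emulation clock is likewise a hypothetical figure chosen for illustration rather than a number from a specific emulator.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def run_time_seconds(cycles, clock_hz):
    """Wall-clock time needed to execute `cycles` clock cycles at `clock_hz`."""
    return cycles / clock_hz


# Assumption: boot length inferred from "28 years at 10 Hz" (roughly 8.8 billion cycles).
boot_cycles = 28 * SECONDS_PER_YEAR * 10

# Simulation at about 10 Hz versus emulation at a hypothetical 10 MHz.
sim_years = run_time_seconds(boot_cycles, 10) / SECONDS_PER_YEAR
emu_minutes = run_time_seconds(boot_cycles, 10e6) / 60

print(f"boot length: {boot_cycles:.2e} cycles")
print(f"simulation at 10 Hz: {sim_years:.1f} years")
print(f"emulation at 10 MHz: {emu_minutes:.1f} minutes")
```

Under these assumptions the same boot that would take decades in simulation finishes in just under 15 minutes, a speedup of about a million times that comes purely from the clock rate.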
Critically, emulators also provide the kind of control and visibility that you get in a hardware debugger. You get the best of both software and hardware debug worlds: breakpoints and waveforms both play a part.
Regardless of the program being debugged, traditional hardware and software debug don’t know anything about each other. It would be inefficient if the two types of debug always had to be done independently, going back and forth and rerunning the software over and over in an attempt to locate a problem.
Allowing the two to work together, as if they were in sync, is highly preferable, and this can be done on an emulator. There are at least three ways to use hardware and software debuggers together in a manner that gets the most information from a single run.
The first relies on the software debugger as the primary tool. The idea is to get to a point in the program where things start going awry, and then bring the hardware debugger into play. So you set two breakpoints in your software, one each for the start and stop points of the waveform dump, and then start executing. At the first breakpoint, everything will come to a stop, and you can set up the hardware debugger to start dumping signals for waveform viewing. Execution then can resume, proceeding until the next breakpoint, when you turn the dumping back off (and finish execution if desired).
You can also set up the hardware debugger to trigger on a certain condition – say, when a particular function is reached. This is made possible by “software symbol awareness,” whereby software names can be used in the hardware debugger, and the translation of a software symbol into its hardware equivalent – for instance, from function name to program counter value – happens automatically.
Software symbol awareness is made possible when the software is compiled with the debug flag on. There are a variety of tools that can then be used to extract those symbols for use by the emulator. Of course, that only helps with static symbols – like global or static variables or functions. Stack-allocated data, like local variables and parameters, has no fixed address, but it can be located by calculating its position from the stack pointer.
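As a rough illustration of what symbol extraction involves, the sketch below parses a symbol listing in the three-column style printed by the `nm` utility and resolves a function name to an address. The addresses, the sample listing and the helper names are all hypothetical; in a real flow the stack offset of a local variable would come from the compiler's debug information.

```python
# Hypothetical listing in `nm` style: address, symbol type, name.
NM_OUTPUT = """\
c01a2f40 T serial8250_interrupt
c01a3080 t serial8250_handle_port
c0301000 D console_loglevel
"""


def parse_symbols(nm_text):
    """Build a name -> address map from an nm-style symbol listing."""
    symbols = {}
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            addr, _kind, name = parts
            symbols[name] = int(addr, 16)
    return symbols


def local_address(stack_pointer, offset):
    """Address of a stack-resident variable at `offset` bytes from the stack pointer."""
    return stack_pointer + offset


symbols = parse_symbols(NM_OUTPUT)
trigger_pc = symbols["serial8250_interrupt"]
print(f"trigger when PC == {trigger_pc:#010x}")
print(f"local variable at {local_address(0xC7FF0000, 8):#010x}")
```

The static case is a plain table lookup, while the stack case needs the live stack pointer value at the moment of interest, which is why the two kinds of symbols are handled differently.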
By pre-defining when signal dumping should start and stop, you can start the execution from the software debugger, and the hardware debugger will automatically create the waveform.
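The start and stop behavior can be pictured with a small toy model. This is only a sketch of the idea, not how any particular hardware debugger implements it: the program counter trace and the two trigger addresses are made-up values, and a real tool would capture signal waveforms rather than sample indices.

```python
def dump_window(pc_trace, start_pc, stop_pc):
    """Return indices of the trace samples captured between the two triggers."""
    dumping = False
    captured = []
    for index, pc in enumerate(pc_trace):
        if not dumping and pc == start_pc:
            dumping = True  # start trigger reached: begin dumping
        if dumping:
            captured.append(index)
            if pc == stop_pc:
                dumping = False  # stop trigger reached: end dumping
    return captured


# Hypothetical program counter values sampled once per cycle.
trace = [0x100, 0x104, 0x200, 0x204, 0x208, 0x300, 0x108]
window = dump_window(trace, start_pc=0x200, stop_pc=0x300)
print(window)
```

Only the samples between the two trigger points are kept, which is what keeps the waveform database small enough to be useful after a long run.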
The second way of having the two debuggers work together is by instrumenting the hardware. You can use assertions or additional logic to force some sort of activity – print a message, set another value, anything that can be done in hardware or via an assertion.
Both of these approaches use the software debugger to set things up and then see what happens in hardware. The third approach also assumes you run the software up to a point of interest before breaking, but at that point you can use the hardware debugger to force various values in the hardware, and then continue running to see what happens on the software side of things.
Debugging a kernel panic
On a system running Linux, the fact that the OS has to boot before overlying software can be run means that Linux itself has to be up and working properly before you can start debugging your own software. But, in fact, any embedded system that requires Linux also requires effort to get Linux running properly. A successful boot is a battle won. And subtle issues with the hardware can impact whether or not the system will come up properly.
An example of such a problem can illustrate how the interplay between hardware and software is critical in the debugging of odd problems. In this example, Linux fails to boot – it “panics” or issues an internal error, and we need to figure out why.
The hardware platform (called the design-under-test, or DUT) in this case consists of a Diamond DC-232L processor core with 16 MB of ROM and 128 MB of RAM. The system is equipped with a UART and the ability to drive an LCD. For debug purposes, it also has a JTAG test access port (TAP).
The system runs Linux from MontaVista, version 4.0.1, implementing kernel 2.6. Initramfs is used for the initial RAM disk, and most of the shell utilities are handled using Busybox 1.5.
The RTL describing this system is loaded into a ZeBu emulator and then connected to a host system. Such emulators communicate with a host via transactors, which exchange data at the transaction level so that commands and interactions with the host don’t slow down emulation speed. For this example, there are three transactors: one for the UART, one for the LCD, and one for the JTAG TAP.
On the host side, we run a Linux console from the UART transactor, the LCD display from the LCD transactor, and a software debugger that connects through the JTAG transactor. For hardware debugging, zRun is used. zRun has a software symbol awareness feature that lets you work with the hardware side using the language of the software side. The entire system is shown in Figure 1 below.
The processor is emulated at 12.5 MHz, and it takes a couple of seconds to download the design into the FPGAs on the emulator. Once execution starts, if everything boots correctly, it takes 70 seconds to get to the Linux prompt at this speed.
In this particular example, at some point, when the boot process is almost complete, something goes wrong, and the console points to an interrupt issue with irq4: “too much work” (Figure 2 below).
The next step is to find the offending code in the kernel source. The kernel has a lot of code, but the error message appears to indicate where it comes from, serial8250, and a search for that string turns up the routine in which the panic message is generated, serial8250_interrupt (Figure 3 below).
The key player in this routine is the Interrupt Identification Register, or IIR. It may contain clues as to what’s going on, but there’s a problem unique to the IIR: using a software debugger to read the IIR has the side-effect of clearing it. This is an “intrusive” observation in that it changes the state of the system, something to be avoided if possible.
The way to get around this issue is to use hardware debugging instead. With a hardware debugger, you’re simply looking at various portions of the hardware. To the hardware debugger, a register is a register, and this particular register can be viewed without its special software semantics. Said a different way, the hardware debug option is non-intrusive.
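The difference between the two kinds of observation can be sketched with a toy register model. This is a deliberate simplification and not an accurate model of a real UART: the class and method names are hypothetical, and the register is simply assumed to clear to zero on any bus read.

```python
class ReadToClearRegister:
    """Toy status register whose bus reads clear it (not a full UART model)."""

    def __init__(self):
        self._bits = 0

    def raise_event(self, bits):
        """Hardware side: latch one or more pending-event bits."""
        self._bits |= bits

    def bus_read(self):
        """Read over the processor bus, as a software debugger would: clears the register."""
        value = self._bits
        self._bits = 0
        return value

    def probe(self):
        """Observe the storage directly, as a waveform probe would: no side effect."""
        return self._bits


reg = ReadToClearRegister()
reg.raise_event(0x2)
probed_twice = (reg.probe(), reg.probe())  # state unchanged by probing
first_read = reg.bus_read()                # returns the pending bits, then clears them
second_read = reg.bus_read()               # evidence is already gone
print(probed_twice, first_read, second_read)
```

In this model, probing any number of times leaves the pending bits in place, while the first bus read consumes them, so the interrupt handler that runs afterwards would see an empty register.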
First we want to watch specifically what happens during this routine. Because the hardware debugger is aware of software symbols, you can set the trigger to break or start tracing when the program counter enters the routine, which is as simple as looking for serial8250_interrupt, the name of the function (Figure 4 below).
We can then set up to capture all of the UART signals as well as other critical points like the IIRs and memory bus, as shown in Figure 5 below.
We then continue execution to create and subsequently analyze the waveforms for the selected signals.
Examination of the waveforms shows that the UART controller isn’t communicating with the processor in the way it should be. Linux buffers characters intended for the UART until the UART controller signals that it’s ready to receive the characters by virtue of its own empty outgoing buffer.
In this case, the UART controller never generates that initial interrupt saying it’s ready for output, even though its buffer is starting out empty. It’s only later, when someone hits a key on the keyboard, that the interrupt is generated. By that time, Linux has overflowed its own character buffer, and that’s where things fall apart.
If the interrupt signal isn’t being generated, then we need to look into the hardware definition for clues (Figure 6 below).
Here it’s clear that the defining event for generating the interrupt is completion of a transmission. In other words, it’s set up to send an interrupt when the buffer becomes empty. But the buffer is also empty when the system starts up, before anything has been transmitted. The problem is that this logic only triggers when the buffer becomes empty, not when it is empty.
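The distinction between “becomes empty” and “is empty” can be sketched as two small behavioral models. This is not the RTL from the figure. The function names, the per-cycle sample lists and the reset assumption are all hypothetical, and the second function shows only one plausible alternative behavior, not necessarily the fix that was applied to this design.

```python
def irq_on_becoming_empty(tx_empty):
    """Edge-style behavior: interrupt only on a not-empty to empty transition."""
    irq = []
    previous = True  # assumption: the buffer is already empty coming out of reset
    for empty in tx_empty:
        irq.append(empty and not previous)
        previous = empty
    return irq


def irq_while_empty(tx_empty, enabled=True):
    """Level-style behavior: interrupt whenever the buffer is empty and enabled."""
    return [enabled and empty for empty in tx_empty]


# Startup: the buffer is empty on every cycle and nothing has been sent yet.
startup = [True, True, True, True]
print(any(irq_on_becoming_empty(startup)))  # no interrupt is ever raised
print(any(irq_while_empty(startup)))        # interrupt is raised right away

# Later: a character is loaded and then finishes transmitting.
after_send = [True, False, False, True]
print(irq_on_becoming_empty(after_send))
```

With the edge-style model the startup sequence never produces an interrupt, which matches the behavior seen in the waveforms: the first interrupt arrives only after some transmission has completed.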
What this illustrates is the fact that software issues may reveal subtle underlying hardware problems, the kinds of issues that, left undetected before tape-out, would ultimately require an expensive mask revision. The only way to uncover such devilry is by executing software before the hardware is created, and the only way to do that in a reasonable timeframe is by using emulation.
It’s situations like this that reinforce the fact that, when you’re creating a complex SoC – or even a simple one, as shown in this example – you need to run both software and hardware debug together in an emulator in order to verify that both the hardware and software components are operating correctly, and that the system is truly ready to be committed to silicon.
Donald Cramb is director of the Consulting Services Division of EVE-USA in San Jose, Calif., and is responsible for customer services, applications and design solutions to support specific customer requests. Previously, he was a partner at ArchSilc Design Automation, a company focused on system-level verification solutions. Earlier in his career, Cramb was director of Services at Quickturn where he helped expand its offerings into key target markets, including wireless, graphics, multi-media and networking. He went on to become vice president of Technical Services for three years at Virtio and held a similar position at Tharas Systems. After graduating with a bachelor’s degree in Electrical Engineering from the University of Edinburgh, Cramb spent 11 years employed by Philips in the United Kingdom and Silicon Valley. He began his career as a design engineer.