Problems like this can only be debugged non-intrusively, that is, with debugging that has no side effects on the system. Trace was invented to solve these kinds of problems. Before we go on to talk about trace, let's look at these problems in more detail, including a real-world debugging situation that trace helped solve.
Bugs that disappear when you run the program under a debugger, or even when you add printf() statements in the most innocuous places, are usually caused by memory corruption or race conditions that depend on a very particular sequence and timing of events.
Adding a printf() statement alters the memory footprint of the program and slows it down. Running a program under a debugger can also slow it down, depending on how the debugger interacts with the target being debugged.
Applications that can't be stopped or slowed down are at the heart of many embedded products. For example, a cellphone can't be halted in the middle of a call because doing so would drop the call.
We were once reminded that we had left an inkjet printer halted in our lab by the smoke that started coming out of the printer as its print heads began burning the paper. Hard drive firmware code has large comment blocks that remind would-be human debuggers not to step through certain parts of the code, or else risk crashing the drive head into the platter.
Going beyond bug-finding, code optimization is often possible only when guided by non-intrusive measurement. The traditional way of profiling code is to co-opt a timer on the target and periodically poll the program counter to get a statistical view of slow spots in the code.
However, since this is statistical, it can only get an approximate view of performance: some events may not be sampled often enough or even not at all. Increasing the sampling rate will only slow the target down, thereby decreasing the accuracy of the measurement.
Statistical profiling also has to store its data somewhere and usually has to output its profiling data once its target buffers have filled up. This uses memory on the target, and intrudes on the target's run-time, which can have unexpected, serious effects. Clearly, traditional methods of collecting profiling data are seriously limited.
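To make the sampling limitation concrete, here is a minimal Python sketch. It is a toy model, not a real profiler: the timeline, the function names main_loop and rare_handler, and the sampling periods are all hypothetical. It shows how a short burst of activity can fall entirely between samples when the program counter is polled too rarely.

```python
from collections import Counter

def build_timeline(total_cycles, burst_start, burst_len):
    """Return the function 'executing' at each cycle: main_loop everywhere
    except for one short burst of rare_handler (both names are made up)."""
    timeline = ["main_loop"] * total_cycles
    for cycle in range(burst_start, burst_start + burst_len):
        timeline[cycle] = "rare_handler"
    return timeline

def sample_profile(timeline, period):
    """Poll the simulated program counter every `period` cycles,
    the way a timer-driven statistical profiler would."""
    return Counter(timeline[cycle] for cycle in range(0, len(timeline), period))

timeline = build_timeline(total_cycles=10_000, burst_start=4_010, burst_len=50)

coarse = sample_profile(timeline, period=1_000)  # 10 samples in total
fine = sample_profile(timeline, period=10)       # 1000 samples in total

print("coarse:", dict(coarse))  # the 50-cycle burst is never sampled
print("fine:  ", dict(fine))    # the burst shows up, at 100x the sampling cost
```

In this toy model the coarse profile never sees rare_handler at all, while the fine profile catches it only by interrupting the target a hundred times more often, which on real hardware is exactly the intrusion we are trying to avoid.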
During the development of the Green Hills Probe V2 (GHP2), an Ethernet-connected JTAG probe, we got a very real reminder of this kind of problem as we were chasing a performance problem that seemed to appear and disappear when code that had nothing to do with the problem area was changed. We were seeing variations in download speed ranging from 490 kilobytes per second to 850 kB/sec.
After some thought, we decided that it was probably a cache problem, but how do we prove that? Traditionally, this would have involved a bit of guesswork and experiments that can only indirectly hint at the problem.
Fortunately, the CPU used in GHP2 has trace, which can non-intrusively provide enough information for us to see what was happening to the cache. After collecting trace data, we quickly wrote a small Python script to simulate the CPU's cache system using the trace data collected to characterize cache usage with the fastest and slowest firmware.
Just as we had suspected, the slow firmware had far more cache misses than the fast firmware. The code in the critical loop was being bumped out of cache by code that had nothing to do with the loop, other than having the misfortune of being associated with the same cache lines.
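The original script is not reproduced here, but the idea can be sketched in a few lines of Python. The sketch below assumes a direct-mapped instruction cache with 32-byte lines and 128 lines; the real cache geometry of the GHP2's CPU, the addresses, and the helper names are all assumptions made for illustration. It replays a list of executed PC addresses and counts misses for two hypothetical code layouts.

```python
def count_misses(pc_trace, line_size=32, num_lines=128):
    """Replay executed PC addresses through a direct-mapped cache model
    and return the number of misses."""
    tags = [None] * num_lines        # tag currently held by each cache line
    misses = 0
    for pc in pc_trace:
        block = pc // line_size      # memory block containing this address
        index = block % num_lines    # cache line that block maps to
        tag = block // num_lines     # distinguishes blocks sharing a line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag        # evict whatever occupied the line
    return misses

CACHE_BYTES = 32 * 128  # 4 KB in this hypothetical configuration

def make_trace(loop_addr, helper_addr, iterations=1000):
    """Hypothetical trace: a critical loop that runs one unrelated
    helper instruction in the middle of every iteration."""
    trace = []
    for _ in range(iterations):
        trace.extend([loop_addr, loop_addr + 4, helper_addr, loop_addr + 8])
    return trace

# "Slow" layout: the helper sits exactly one cache size away from the loop,
# so both map to the same cache line and keep evicting each other.
slow_misses = count_misses(make_trace(0x1000, 0x1000 + CACHE_BYTES))

# "Fast" layout: the helper is moved one line further along, so the two
# never compete for the same line.
fast_misses = count_misses(make_trace(0x1000, 0x1000 + CACHE_BYTES + 32))

print(slow_misses, fast_misses)  # 2001 2
```

With real trace data, pc_trace would come from the decoded trace stream rather than make_trace, and the cache parameters would be set to match the actual CPU.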
Now, using this system, we could also optimize our system by configuring the linker directives file so that the critical loop is never evicted from its cache line. By doing this, we significantly exceeded the download speeds of even the fastest firmware: the download speeds now consistently hover around 1000 kB/sec, which is more than double the slowest speed.
Just as significantly, this was all accomplished in one afternoon's work. Without trace, we don't know how long it would have taken us to find the problem, much less the optimal layout for no cache misses. Trace not only helped us identify the problem, but also helped us find a solution that would not have been possible without it.
Hopefully, we've shown you how trace is useful in typical embedded debugging situations. We'll quickly review trace as it exists today, and then look at high-speed serial trace, which is the next major evolution of this important debugging technology.
The limits of parallel trace
Trace is a non-intrusive history of a CPU's execution. It usually
indicates which PC addresses have been executed, and can also include
the memory areas accessed by the executed instructions.
Because it has to run at the core clock speed of the CPU, trace is usually highly compressed (version 3 of ARM's Embedded Trace Macrocell claims compression ratios of 32-to-1) and is output over multiple high-speed data pins. ARM's ETM standard can use as many as 20 pins, with almost every pin running at hundreds of megahertz.
Despite the compression, this is still a huge amount of data: 1 gigabyte of ARM ETM version 1 trace data is good for only about 1 second of execution time on a 300 MHz ARM9 CPU. As you can imagine, this huge data output of trace causes many problems for many different parts of a trace-capable debugging tool. And it doesn't get you much run-time to characterize your problem: some systems take more than 1 second to just boot!
High-speed serial trace as we discuss below will solve two of these problems: dedicating large numbers of high-speed pins on a chip die, and outputting ever-more data as CPU speeds increase.
Before that, we will quickly review some of the other problems. A complete discussion would be well beyond the scope of this article, but a brief look is necessary to appreciate the enormity of the task of using this trace data effectively.
The biggest problem of trace is its size and bandwidth. Collecting trace data at enormous speeds and storing it in real time to an enormous, fast memory array is challenging enough. But what you do with the data afterwards is even more difficult.
We're jaded to storage these days, perhaps from reading too many electronics store ads that advertise $250 1-terabyte hard drives, and using desktop operating systems that require 1 gigabyte of memory to work only tolerably well.
The storage and memory available today make 1 GB of data look pedestrian. Yet 1 GB is a huge amount of data: 32-bit Linux only provides 2 GB of available memory in a process's address space. Earlier, we mentioned that trace data can be compressed by as much as 32 times, so 1 GB of compressed trace can expand to as much as 32 GB. That makes it impractical to directly manipulate even 1 second's worth of uncompressed trace data on a 32-bit computer.
And even if we had 64-bit computers with dozens of gigabytes of memory, moving 1 GB of data from the trace collection probe to a host computer is not a trivial task. 100base-T Ethernet would take about 80 seconds to transfer 1 GB of data, assuming the trace collection probe and the host computer can fully saturate and utilize 100base-T Ethernet. Due to network traffic and operating system overheads, it often can't.
Even if we had Gigabit Ethernet and could saturate it, which would be 10 times faster than 100base-T, hard drive write speeds would still limit our transfer speeds. The fastest desktop hard drives can perhaps write between 20 and 30 MB/sec, which is some 4 to 6 times slower than Gigabit Ethernet.
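The arithmetic behind these figures is easy to check. The short Python sketch below assumes decimal units (1 GB = 1,000,000,000 bytes) and perfectly saturated links with no protocol overhead, which is the best case described above. The link names are only labels.

```python
TRACE_BYTES = 1_000_000_000  # 1 GB, assuming decimal units

def seconds_to_move(num_bytes, megabytes_per_sec):
    """Best-case time to move num_bytes at a sustained rate in MB/sec."""
    return num_bytes / (megabytes_per_sec * 1_000_000)

# Sustained rates in MB/sec; Ethernet rates are the line rate divided by 8.
links = {
    "100base-T Ethernet": 100 / 8,    # 12.5 MB/sec
    "Gigabit Ethernet": 1000 / 8,     # 125 MB/sec
    "hard drive at 20 MB/sec": 20,
    "hard drive at 30 MB/sec": 30,
}

times = {name: seconds_to_move(TRACE_BYTES, rate) for name, rate in links.items()}
for name, secs in times.items():
    print(f"{name}: {secs:.1f} s")
```

Under these assumptions, 100base-T needs 80 seconds and Gigabit Ethernet needs 8 seconds, but a drive writing 20 to 30 MB/sec still takes roughly 33 to 50 seconds to store the same data.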
So the storage and bandwidth requirements of 0.1 percent of the largest desktop hard drives we can practically buy still far outpace any technology that can be used to process it. We may be able to store it, but it's very difficult to do anything with it after we store it.
Let's assume that storage and bandwidth aren't limiting factors. In that case, we meet what is probably the biggest limiting factor of all: human interaction with trace data. For the ARM9's trace port that we mentioned earlier, 1 GB of trace data holds about 384 million CPU cycles of instructions.
And this is for a modest 1 second of actual runtime. Current trace tools basically ask you to find a bug in over 300 million CPU instructions. No one in their right mind would attempt this --- it is literally worse than finding a needle in a haystack!
Clearly, if we want to debug a very modest amount of CPU runtime with trace, we must overcome some very high hurdles. Doing this requires rethinking completely how we use trace data, and how it fits into our tools.
What use is all the data in the world if you can't do something useful with it? Before attempting to answer that, let's look at an even more fundamental issue: how do we get enough trace data off a CPU in today's increasing technology curve so that we can worry about very advanced tools later on?
Let's look instead at how the competing demands for ever higher execution speed, ever lower costs, and ever lower power consumption are already hampering current trace technology.
The execution ability of CPUs has grown by leaps and bounds, and now debugging technology has to keep up with it so that we don't paint ourselves into an undebuggable corner with the new complexity that's possible with today's very fast and capable CPUs. More systems designers are using SoCs with more complex devices integrated onto one chip, and need to debug these complex systems.
What HSST brings to the game
Just as demand for increased bandwidth in other technologies has driven
their transmission channels to high-speed serial channels, trace is on
the verge of replacing fast, wide parallel channels with fewer,
significantly faster serial channels.
Hard drives have switched to Serial ATA, and consumer high-definition video basically requires HDMI; both use transmission protocols similar to those in the various high-speed serial trace proposals.
Increasing bandwidth makes high-speed parallel protocols more expensive and difficult to implement. For example, interchannel skew is difficult to control across 20 fast channels, and requires expensive cabling to guarantee performance.
Most trace collection probes today use a micro-coaxial ribbon cable from Precision Interconnect, which we buy for well over $100 for modest quantities of very short lengths.
As speeds increase, crosstalk between channels of a parallel interface also increases. Again, we solve this with heroic $100-per-foot cable, and by adding even more conductors as ground lines between each signal line.
Switching transient current draw for many high-speed lines is enormous, and causes ground bounce due to the finite resistance of conductors. These transients cause glitches that corrupt data. We inadvertently encountered this phenomenon during the development of the SuperTrace probe, a high-speed 1 GB trace collection probe.
We discovered that during certain operations, very infrequently, we would get corrupted data. After spending a few days trying to figure out what was going on, we finally realized that our highest-speed logic had been placed into a corner of the FPGA that had the fewest ground pins.
After re-routing the design for a ground-rich corner of the FPGA, we no longer had data corruption. As CPU speeds and parallel trace port speeds increase, problems like this will only become more common, and more difficult to solve.
More important for ASIC designers is the large number of pins required by parallel trace. While 20 pins may give the best performance from an ARM trace module, designers can often afford fewer than half that number of pins, which can significantly hamstring the performance of the trace port. With an abridged trace port, you may be lucky to get uninterrupted trace of the program counter, and data trace may be impossible.
Developers are forced into an impossible dilemma: do we give up enough pins so the chip will fit and meet its budget, or do we give the software developers (who are often the bottleneck of any electronic product) good enough trace facilities, so the product isn't held back from production for months by obscure bugs?
HSST solves the bandwidth problem by using fewer channels, but running them far faster. Fewer channels means fewer pins and lower power requirements. Because each serial channel carries its own embedded clock, interchannel skew is no longer a problem, and noise susceptibility and emissions, both important for complying with EMI standards, are greatly reduced.
If more than one high-speed serial channel is used, skew still isn't a problem because multiple serial channels can be bonded to guarantee certain skew specifications.
Serial channels also use some kind of encoding scheme to balance DC and to provide enough transitions for clock recovery. The so-called 8b10b encoding used in Gigabit Ethernet, for example, where 8 bits are encoded to 10 bits in order to equalize the time the wires spend at 1 and 0, is currently the front-runner for HSST. However, 8b10b encoding incurs a 20 percent bandwidth overhead, so a 4 Gigabit-per-second channel has 3.2 Gb/sec of useful bandwidth.
Serial channels under consideration include Xilinx's RocketIO, which can go as fast as 6.25 Gb/sec. Current discussions with various customers, vendors and standards committees include proposals for using 4 of these channels for an aggregate bandwidth of 25 Gbit/sec, which we believe will cover almost all trace needs for at least a few years. For comparison, the highest bandwidth parallel trace ports currently in use are less than 8 Gbit/sec.
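The encoding overhead and the aggregate figures above can be worked through in a few lines of Python. One assumption to flag: the text does not say whether the 6.25 Gb/sec figure is the raw line rate or the payload rate. This sketch treats it as the raw line rate, so the useful figure it computes for four lanes is our own inference, not a number from any of the proposals.

```python
def payload_gbps(line_rate_gbps, data_bits=8, coded_bits=10):
    """Useful bandwidth of one serial lane after 8b10b-style encoding,
    where every `data_bits` of payload are sent as `coded_bits` on the wire."""
    return line_rate_gbps * data_bits / coded_bits

# The single-channel example from the text: 4 Gb/sec on the wire.
single_lane_useful = payload_gbps(4.0)

# Four lanes at 6.25 Gb/sec each (assumed here to be the raw line rate).
lanes = 4
aggregate_raw = lanes * 6.25
aggregate_useful = lanes * payload_gbps(6.25)

print(f"one 4 Gb/sec lane carries {single_lane_useful:.1f} Gb/sec of data")
print(f"{lanes} lanes: {aggregate_raw:.1f} Gb/sec raw, "
      f"{aggregate_useful:.1f} Gb/sec after encoding")
```

Under that assumption, four bonded lanes would still carry about 20 Gb/sec of trace data after encoding, comfortably more than the parallel ports mentioned above.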
What's next for HSST?
Do we expect CPU core speeds to increase by 300 percent in the next few
years? They may, but what is definite is that higher levels of
integration in SoC designs will output more trace data than ever, even
if CPU core speeds remain constant.
Multi-core designs will output more trace data and programmers will increasingly depend on trace to solve the significantly more difficult problems we will create with multi-core systems.
Timing problems will only increase, because we now have true asynchronicity with independent CPUs running, instead of just the simulated asynchronicity we have with single-chip multitasking. Trace protocols must include some way of synchronizing and correlating trace data collected from multiple cores.
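As a rough illustration of what correlating trace from multiple cores involves on the host side, the Python sketch below merges two per-core event streams into one timeline. It assumes every event carries a timestamp from a timebase shared by all cores; the TraceEvent record, its fields, and the sample values are hypothetical and do not reflect any particular trace protocol.

```python
import heapq
from typing import NamedTuple

class TraceEvent(NamedTuple):
    timestamp: int  # ticks of a timebase shared by all cores (assumed)
    core: int       # which core produced the event
    pc: int         # program counter value that was executed

def merge_traces(*per_core_streams):
    """Merge per-core streams, each already sorted by timestamp,
    into a single list ordered by timestamp."""
    return list(heapq.merge(*per_core_streams, key=lambda e: e.timestamp))

core0 = [TraceEvent(10, 0, 0x8000), TraceEvent(40, 0, 0x8004)]
core1 = [TraceEvent(25, 1, 0x9000), TraceEvent(30, 1, 0x9004)]

merged = merge_traces(core0, core1)
for event in merged:
    print(f"t={event.timestamp:3d} core{event.core} pc={event.pc:#06x}")
```

Without a common timebase in the trace protocol itself, there is nothing reliable to sort on, which is why the synchronization has to be designed into the protocol rather than added by the tools afterwards.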
SoCs will also have configurable logic, like FPGA fabrics, as well as specialized processors to handle application-specific tasks. These devices need debugging as well, and will also output trace data along with normal CPU data.
ARM's CoreSight system already provides a mechanism for combining multiple sources of trace data on an SoC for output in a single trace stream to a trace collection probe. We now need to provide enough bandwidth for this data. Fortunately, it's relatively straightforward to use a high-speed serial interface for these systems.
A serializing module on the parallel outputs of an ARM CoreSight port, for example, outputs high-speed serial data to a serial receiver, which converts the serial stream back into parallel CoreSight data. From the parallel trace port's point of view, nothing has changed except for a huge bandwidth increase.
And this is not an over-simplification of HSST implementation either as the first prototype systems used exactly this scheme. A conventional parallel trace system was connected to a serializer that sent its output over a cable to a deserializer that fed a conventional parallel trace collection probe.
The parallel-trace tools had no idea that such a conversion was being done, and worked, more or less. Of course, over time, more direct integration will see systems transmitting serial trace directly instead of attaching very expensive serial transceivers to existing systems, but the concept does work in actual use.
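The serializer and deserializer pair can be modeled as a simple round trip. The Python sketch below is only a conceptual model: the 16-bit word width and the most-significant-bit-first ordering are assumptions, and real serial links also add line encoding and clock recovery, which are omitted here.

```python
def serialize(words, width=16):
    """Flatten parallel trace words into a list of bits, MSB first."""
    return [(word >> bit) & 1
            for word in words
            for bit in range(width - 1, -1, -1)]

def deserialize(bits, width=16):
    """Regroup a bit stream into parallel words of `width` bits."""
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words

parallel_in = [0x1234, 0xFFFF, 0x0001]   # made-up parallel trace port words
serial_bits = serialize(parallel_in)      # what travels over the cable
parallel_out = deserialize(serial_bits)   # what the trace probe receives

print(parallel_out == parallel_in)  # True
```

Because the words that come out match the words that went in, the tools on either side of the link need no knowledge that a serial hop exists in between.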
It's an exciting time to be in the debugging tools business: we are on the cusp of a very big change in the capabilities of our tools as they start to make debugging of traditionally very difficult problems manageable.
Andre Yew manages Green Hills Software's Target Connections group, which connects the MULTI Integrated Development Environment to hardware targets. The group is responsible for Green Hills' debug devices and their supporting software. Andre has a Bachelor of Science in Engineering and Applied Science from the California Institute of Technology.