Debugging: Making the move from parallel to high speed serial trace

Have you ever had a bug that disappeared when you tried to debug it? Or how about an application that has to run at full speed, and can't be stopped or slowed down to take a look at strange behavior?

Problems like this can only be debugged non-intrusively, that is, with debugging that has no side effects on the system. Trace was invented to solve these kinds of problems. Before we go on to talk about trace, let's look at these problems in more detail, including a real-world debugging situation that trace helped solve.

Bugs that disappear when you run the program under a debugger, or even when you add printf() statements in the most innocuous places, are usually caused by memory corruption or race conditions that depend on a very particular sequence and timing of events.

Adding a printf() statement alters the memory footprint of the program and slows it down. Running a program under a debugger can also slow it down, depending on how the debugger interacts with the target being debugged.

Applications that can't be stopped or slowed down are at the heart of many embedded products. For example, a cellphone can't be halted in the middle of a call because doing so will drop the call.

We were once reminded that we had left an inkjet printer halted in our lab by the smoke that started coming out of the printer as its print heads began burning the paper. Hard drive firmware code has large comment blocks that remind would-be human debuggers not to step through certain parts of the code, or else risk crashing the drive head into the platter.

Going beyond bug-finding, code optimization is often possible only when guided by non-intrusive measurement. The traditional way of profiling code is to co-opt a timer on the target and periodically poll the program counter to get a statistical view of slow spots in the code.

However, since this is statistical, it can only get an approximate view of performance: some events may not be sampled often enough, or even not at all. Increasing the sampling rate will only slow the target down, thereby decreasing the accuracy of the measurement.
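
To make the sampling limitation concrete, here is a minimal Python sketch. It is not a real profiler: the timeline, the function names (main_loop, irq_handler) and the sampling periods are all made up for illustration, and the "program counter" is simply a list with one entry per tick.

```python
from collections import Counter

def sample_profile(timeline, period):
    """Sample the simulated program counter every `period` ticks.

    timeline: list of function names, one entry per tick of execution.
    Returns a Counter of how many samples landed in each function.
    """
    return Counter(timeline[t] for t in range(0, len(timeline), period))

# Hypothetical workload: in every 100-tick block, a short 3-tick interrupt
# handler runs at ticks 50..52; the rest of the time is spent in main_loop.
timeline = []
for _ in range(10):
    block = ["main_loop"] * 100
    block[50:53] = ["irq_handler"] * 3
    timeline.extend(block)

coarse = sample_profile(timeline, period=100)  # samples at ticks 0, 100, 200, ...
exact = sample_profile(timeline, period=1)     # every tick: accurate, but maximally intrusive

print("coarse:", dict(coarse))
print("exact: ", dict(exact))
```

In this contrived case the coarse sampler never lands inside irq_handler, even though the handler accounts for 3 percent of the run. A real target would pay for every extra sample with CPU time, which is the trade-off described above.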

Statistical profiling also has to store its data somewhere, and usually has to output that data once its target buffers have filled up. This uses memory on the target and intrudes on the target's run-time, which can have unexpected, serious effects. Clearly, traditional methods of collecting profiling data are seriously limited.

During the development of the Green Hills Probe V2 (GHP2), an Ethernet-connected JTAG probe, we got a very real reminder of this kind of problem as we chased a performance problem that seemed to appear and disappear when code that had nothing to do with the problem area was changed. We were seeing download speeds that varied from 490 kB/sec to 850 kB/sec.

After some thought, we decided that it was probably a cache problem, but how could we prove that? Traditionally, this would have involved a bit of guesswork and experiments that could only indirectly hint at the problem.

Fortunately, the CPU used in GHP2 has trace, which can non-intrusively provide enough information for us to see what was happening to the cache. After collecting trace data, we quickly wrote a small Python script that simulated the CPU's cache system from the collected trace data, and used it to characterize cache usage with the fastest and slowest firmware.

Just as we had suspected, the slow firmware had far more cache misses than the fast firmware. The code in the critical loop was being bumped out of cache by code that had nothing to do with the loop, other than having the misfortune of mapping to the same cache lines.
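
The kind of script described above can be sketched in a few lines of Python. This is not the script we used: it assumes a direct-mapped cache with made-up geometry (32-byte lines, 128 lines), and the "trace" is a hypothetical list of fetch addresses rather than decoded trace data. It is only meant to show how two unrelated pieces of code can evict each other when they map to the same cache line.

```python
def count_misses(addresses, line_size=32, num_lines=128):
    """Replay fetch addresses through a direct-mapped cache model.

    Returns the number of misses. line_size and num_lines are
    illustrative values, not those of any particular CPU.
    """
    tags = [None] * num_lines          # tag currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing the address
        index = block % num_lines      # cache line that block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag          # evict the old block, load the new one
    return misses

CACHE_BYTES = 32 * 128                 # 4 KiB in this model

def trace_for(loop_addr, helper_addr, iterations=1000):
    """Hypothetical trace: a critical loop alternating with unrelated code."""
    return [loop_addr, helper_addr] * iterations

# Unrelated code exactly one cache size away: same line, so they evict each other.
slow = count_misses(trace_for(0x1000, 0x1000 + CACHE_BYTES))
# Unrelated code moved by one line: different cache line, so both stay resident.
fast = count_misses(trace_for(0x1000, 0x1000 + CACHE_BYTES + 32))

print("conflicting layout misses:", slow)
print("relocated layout misses:  ", fast)
```

In this toy model the conflicting layout misses on every access, while the relocated layout misses only on the first touch of each line. Replaying real trace through a model of the real cache gives the same kind of comparison between firmware builds.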

Using this system, we could also optimize our firmware by configuring the linker directives file so that the critical loop is never evicted from its cache lines. By doing this, we significantly exceeded the download speeds of even the fastest firmware: the download speeds now consistently hover around 1000 kB/sec, which is more than double the slowest speed.

Just as significantly, this was all accomplished in one afternoon's work. Without trace, we don't know how long it would have taken us to find the problem, much less a layout that avoids those cache misses. Trace not only helped us identify the problem, but also helped us find a solution that would not have been possible without it.

Hopefully, we've shown you how trace is useful in typical embedded debugging situations. We'll quickly review trace as it exists today, and then look at high-speed serial trace (HSST), which is the next major evolution of this important debugging technology.

The limits of parallel trace
Trace is a non-intrusive history of a CPU's execution. It usually indicates which PC addresses have been executed, and can also include the memory areas accessed by the executed instructions.

Because it has to run at the core clock speed of the CPU, trace is usually highly compressed (version 3 of ARM's Embedded Trace Macrocell claims compression ratios of 32-to-1) and is output over multiple high-speed data pins. ARM's ETM standard can use as many as 20 pins, with almost every pin running at hundreds of megahertz.

Despite the compression, this is still a huge amount of data: 1 gigabyte of ARM ETM version 1 trace data is good for only about 1 second of execution time on a 300 MHz ARM9 CPU. As you can imagine, this huge data output causes problems for many different parts of a trace-capable debugging tool. And it doesn't get you much run-time to characterize your problem: some systems take more than 1 second just to boot!

High-speed serial trace, as we discuss below, will solve two of these problems: dedicating large numbers of high-speed pins on a chip die, and outputting ever more data as CPU speeds increase.

We will first quickly review some of the other problems. A complete discussion would be well beyond the scope of this article, but a brief look is necessary to appreciate the enormity of the task of using this trace data effectively.

The biggest problem of trace is its size and bandwidth. Collecting trace data at enormous speeds and storing it in real time to an enormous, fast memory array is challenging enough. But what you do with the data afterwards is even more difficult.

We're jaded about storage these days, perhaps from reading too many electronics store ads that advertise $250 1-terabyte hard drives, and from using desktop operating systems that require 1 gigabyte of memory to work only tolerably well.

The storage and memory available today make 1 GB of data look pedestrian. Yet 1 GB is a huge amount of data: 32-bit Linux provides only 2 to 3 GB of address space to a process, depending on kernel configuration. Earlier, we mentioned that trace data can be compressed by as much as 32 times, so 1 GB of compressed trace can expand to tens of gigabytes, which makes it impractical to directly manipulate even 1 second's worth of uncompressed trace data on a 32-bit computer.

And even if we had 64-bit computers with dozens of gigabytes of memory, moving 1 GB of data from the trace collection probe to a host computer is not a trivial task. 100base-T Ethernet would take about 80 seconds to transfer 1 GB of data, assuming the trace collection probe and the host computer can fully saturate and utilize 100base-T Ethernet. Due to network traffic and operating system overheads, they often can't.

Even if we had and could saturate Gigabit Ethernet, which is 10 times faster than 100base-T, hard drive write speeds would still limit our transfer speeds. The fastest desktop hard drives can perhaps write between 20 and 30 MB/sec, which is at least 4 times slower than Gigabit Ethernet.
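
The transfer times quoted above are easy to reproduce. The short Python sketch below assumes decimal units (1 GB = 10^9 bytes), perfectly saturated links and no protocol overhead, so the results are best-case figures rather than measurements.

```python
def transfer_seconds(size_bytes, rate_bytes_per_sec):
    """Best-case time to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_sec

GB = 10**9  # assumption: decimal units throughout
MB = 10**6

# Sustained rates in bytes per second; link rates are divided by 8 bits per byte.
links = {
    "100base-T Ethernet (100 Mbit/s)": 100e6 / 8,
    "Gigabit Ethernet (1 Gbit/s)": 1e9 / 8,
    "hard drive writing 20 MB/s": 20 * MB,
    "hard drive writing 30 MB/s": 30 * MB,
}

for name, rate in links.items():
    print(f"{name}: {transfer_seconds(1 * GB, rate):.0f} s")
```

Under these assumptions, 1 GB takes 80 seconds over 100base-T and 8 seconds over Gigabit Ethernet, but between 33 and 50 seconds to write to a 20 to 30 MB/sec disk, so the disk becomes the bottleneck.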

So a quantity of data equal to just 0.1 percent of the largest desktop hard drive we can practically buy still far outpaces the technology available to move and process it. We may be able to store it, but it's very difficult to do anything with it afterwards.

Let's assume that storage and bandwidth aren't limiting factors. In that case, we meet what is probably the biggest limiting factor of all: human interaction with trace data. For the ARM9's trace port that we mentioned earlier, 1 GB of trace data holds about 384 million CPU cycles of instructions.

And this is for a modest 1 second of actual runtime. Current trace tools basically ask you to find a bug in over 300 million CPU instructions. No one in their right mind would attempt this: it is literally worse than finding a needle in a haystack!
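
As a quick cross-check of the numbers above, the following Python sketch relates the 384 million cycles held in a 1 GB buffer to the 300 MHz core clock. It assumes a decimal gigabyte and a steady trace rate, neither of which holds exactly for real compressed trace, so treat the results as rough averages.

```python
def buffer_runtime(cycles_captured, core_hz):
    """Seconds of execution covered by a buffer holding cycles_captured cycles."""
    return cycles_captured / core_hz

def bits_per_cycle(buffer_bytes, cycles_captured):
    """Average number of trace bits stored per CPU cycle in a full buffer."""
    return buffer_bytes * 8 / cycles_captured

GB = 10**9               # assumption: decimal gigabyte
CYCLES = 384_000_000     # cycles held in a 1 GB buffer, from the text
CORE_HZ = 300_000_000    # 300 MHz ARM9, from the text

print(f"{buffer_runtime(CYCLES, CORE_HZ):.2f} s of execution")
print(f"{bits_per_cycle(GB, CYCLES):.1f} trace bits per cycle on average")
```

This works out to about 1.28 seconds of execution and roughly 21 bits of trace per cycle, which is in line with the "about 1 second" figure and with a trace port of about 20 pins.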

Clearly, if we want to debug a very modest amount of CPU runtime with trace, we must overcome some very high hurdles. Doing this requires completely rethinking how we use trace data, and how it fits into our tools.

What use is all the data in the world if you can't do something useful with it? Before attempting to answer that, let's look at an even more fundamental issue: how do we get enough trace data off a CPU, given today's steepening technology curve, so that we can worry about very advanced tools later on?

Let's look at how the competing demands for ever higher execution speed, ever lower costs, and ever lower power consumption are already hampering current trace technology.

The execution ability of CPUs has grown by leaps and bounds, and now debugging technology has to keep up with it so that we don't paint ourselves into an undebuggable corner with the new complexity that's possible with today's very fast and capable CPUs. More system designers are using SoCs with more complex devices integrated onto one chip, and need to debug these complex systems.

What HSST brings to the game
Just as demand for increased bandwidth has driven other technologies to high-speed serial channels, trace is on the verge of replacing fast, wide parallel channels with significantly faster, but fewer, serial channels.

Hard drives have switched to Serial ATA, and consumer high-definition video basically requires HDMI, both of which use transmission protocols similar to those of the various high-speed serial trace proposals.

Increasing bandwidth makes high-speed parallel protocols more expensive and difficult to implement. For example, interchannel skew is difficult to control across 20 fast channels, and requires expensive cabling to guarantee performance.

Most trace collection probes today use a micro-coaxial ribbon cable from Precision Interconnect, which we buy for well over $100 for modest quantities of very short lengths.

As speeds increase, crosstalk between channels of a parallel interface also increases. Again, we use heroic $100-per-foot cable to solve this, as well as adding even more conductors for ground lines between each signal line.

Switching transient current draw for many high-speed lines is enormous, and causes ground bounce due to the finite resistance of conductors. These transients cause glitches that corrupt data. We inadvertently encountered this phenomenon during the development of the SuperTrace probe, a high-speed 1 GB trace collection probe.

We discovered that during certain operations, very infrequently, we would get corrupted data. After spending a few days trying to figure out what was going on, we finally realized that our highest-speed logic had been placed into a corner of the FPGA that had the fewest ground pins.

After re-routing the design into a ground-rich corner of the FPGA, we no longer had data corruption. As CPU speeds and parallel trace port speeds increase, problems like this will only become more common, and more difficult to solve.

More important for ASIC designers is the large number of pins required by parallel trace. While 20 pins may give the best performance from an ARM trace module, designers can often afford fewer than half that number of pins, which can significantly hamstring the performance of the trace port. With an abridged trace port, you may be lucky to get uninterrupted trace of the program counter, and data trace may be impossible.

Developers are forced into an impossible dilemma: do we give up enough pins so the chip will fit and meet its budget, or do we give the software developers (who are often the bottleneck of any electronic product) good enough trace facilities, so the product isn't held back from production for months by obscure bugs?

HSST solves the bandwidth problem by using fewer channels, but running them far faster. Fewer channels means fewer pins and lower power requirements. Because the data is wrapped into serial channels, each with its own embedded clock, interchannel skew is no longer a problem, and noise susceptibility and emissions, both important for complying with EMI standards, are greatly reduced.

If more than one high-speed serial channel is used, skew still isn't a problem because multiple serial channels can be bonded to guarantee certain skew specifications.

Serial channels also use some kind of encoding scheme to balance DC and to provide enough transitions for clock recovery. The so-called 8b10b encoding used in Gigabit Ethernet, for example, where 8 bits are encoded to 10 bits in order to equalize the time the wires spend at 1 and 0, is currently the front-runner for HSST. However, 8b10b encoding incurs a 20 percent bandwidth overhead, so a 4 Gigabit-per-second channel has 3.2 Gb/sec of useful bandwidth.

Serial channels under consideration include Xilinx's RocketIO, which can go as fast as 6.25 Gb/sec. Current discussions with various customers, vendors and standards committees include proposals for using 4 of these channels for an aggregate bandwidth of 25 Gbit/sec, which we believe will cover almost all trace needs for at least a few years. For comparison, the highest bandwidth parallel trace ports currently in use are less than 8 Gbit/sec.
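
The effect of line coding on these figures can be worked out with a small helper. The Python sketch below is only arithmetic: payload_gbps is a made-up name, and it assumes every lane runs at the same rate with the same 8-bits-in-10 coding. Whether a final HSST proposal would count the 25 Gbit/sec aggregate before or after coding overhead is not something this sketch settles.

```python
def payload_gbps(line_rate_gbps, lanes=1, data_bits=8, code_bits=10):
    """Useful bandwidth in Gbit/s after line coding, summed across lanes.

    With 8b10b, every 10 bits on the wire carry 8 bits of data.
    """
    return line_rate_gbps * lanes * data_bits / code_bits

single = payload_gbps(4.0)               # one 4 Gbit/s channel
raw_aggregate = 6.25 * 4                 # four 6.25 Gbit/s channels, before coding
coded_aggregate = payload_gbps(6.25, lanes=4)

print(f"one 4 Gbit/s channel carries {single:.1f} Gbit/s of data")
print(f"four 6.25 Gbit/s channels: {raw_aggregate:.0f} Gbit/s raw, "
      f"{coded_aggregate:.0f} Gbit/s of data if 8b10b coded")
```

Even after subtracting the 20 percent coding overhead, four such lanes would carry about 20 Gbit/sec of data, still well above the sub-8 Gbit/sec parallel ports mentioned above.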

What's next for HSST?
Do we expect CPU core speeds to increase by 300 percent in the next few years? They may, but what is definite is that higher levels of integration in SoC designs will output more trace data than ever, even if CPU core speeds remain constant.

Multi-core designs will output more trace data, and programmers will increasingly depend on trace to solve the significantly more difficult problems we will create with multi-core systems.

Timing problems will only increase, because we now have true asynchronicity with independent CPUs running, instead of just the simulated asynchronicity we have with single-core multitasking. Trace protocols must include some way of synchronizing and correlating trace data collected from multiple cores.
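
One simple way to picture the correlation step is as a merge of per-core streams by timestamp. The Python sketch below assumes a hypothetical record format of (timestamp, core_id, pc) and a common time base shared by all cores. Real trace protocols encode synchronization differently, so this only illustrates the idea of producing a single ordered timeline.

```python
import heapq

def merge_traces(*streams):
    """Merge per-core trace streams into one timeline ordered by timestamp.

    Each stream is an iterable of (timestamp, core_id, pc) tuples that is
    already sorted by timestamp, as a per-core trace buffer would be.
    Records with equal timestamps keep the order of the input streams.
    """
    return list(heapq.merge(*streams, key=lambda rec: rec[0]))

# Made-up records for two cores sharing one time base.
core0 = [(10, 0, 0x8000), (30, 0, 0x8004), (50, 0, 0x8008)]
core1 = [(20, 1, 0x9000), (30, 1, 0x9004), (40, 1, 0x9008)]

timeline = merge_traces(core0, core1)
for ts, core, pc in timeline:
    print(f"t={ts:3d} core{core} pc={pc:#06x}")
```

With the two cores interleaved like this, a tool can show what one core was executing at the moment the other touched shared state, which is where multi-core timing bugs tend to hide.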

SoCs will also have configurable logic, like FPGA fabrics, as well as specialized processors to handle application-specific tasks. These devices need debugging as well, and will also output trace data along with normal CPU data.

ARM's CoreSight system already provides a mechanism for combining multiple sources of trace data on an SoC for output in a single trace stream to a trace collection probe. We now need to provide enough bandwidth for this data. Fortunately, it's relatively straightforward to use a high-speed serial interface for these systems.

A serializing module on the parallel outputs of an ARM CoreSight port, for example, outputs high-speed serial data to a serial receiver, which converts the serial stream back into parallel CoreSight data. From the parallel trace port's point of view, nothing has changed except for a huge bandwidth increase.
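
The round trip described above can be modeled in a few lines of Python. This is a toy model under stated assumptions: the port is taken to be 16 bits wide, the sample values are made up, and line coding, clock recovery and framing are all ignored. It only shows that the parallel side sees the same words come out that went in.

```python
def serialize(words, width=16):
    """Flatten parallel words into a list of bits, most significant bit first."""
    bits = []
    for word in words:
        for i in range(width - 1, -1, -1):
            bits.append((word >> i) & 1)
    return bits

def deserialize(bits, width=16):
    """Regroup a bit stream into parallel words of `width` bits each."""
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words

parallel_in = [0xBEEF, 0x0001, 0x8000]   # made-up trace port samples
stream = serialize(parallel_in)          # what would travel over the serial link
parallel_out = deserialize(stream)       # what the collection probe sees

print(len(stream), "bits on the wire")
print("round trip intact:", parallel_out == parallel_in)
```

A real serializer would also have to mark word boundaries so that the receiver knows where each parallel word starts. This sketch sidesteps that by assuming both ends agree on the width and start in step.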

And this is not an over-simplification of HSST implementation either, as the first prototype systems used exactly this scheme. A conventional parallel trace system was connected to a serializer that sent its output over a cable to a deserializer that fed a conventional parallel trace collection probe.

The parallel-trace tools had no idea that such a conversion was being done, and worked, more or less. Of course, over time, more direct integration will see systems transmitting serial trace directly instead of attaching very expensive serial transceivers to existing systems, but the concept does work in actual use.

It's an exciting time to be in the debugging tools business: we are on the cusp of a very big change in the capabilities of our tools as they start to make debugging of traditionally very difficult problems manageable.

Andre Yew manages Green Hills Software's Target Connections group, which connects the MULTI Integrated Development Environment to hardware targets. The group is responsible for Green Hills' debug devices and their supporting software. Andre has a Bachelor of Science in Engineering and Applied Science from the California Institute of Technology.
