CMP EMBEDDED.COM


Debugging: Making the move from parallel to high speed serial trace
Andre Yew describes the history of trace debug and the evolution of High-Speed Serial Trace (HSST), and discusses how it replaces conventional parallel trace, especially as CPU speeds and System-on-Chip integration complexity increase.




The limits of parallel trace
Trace is a non-intrusive history of a CPU's execution. It usually indicates which PC addresses have been executed, and can also include the memory areas accessed by the executed instructions.

Because it has to keep pace with the core clock speed of the CPU, trace is usually highly compressed (version 3 of ARM's Embedded Trace Macrocell claims compression ratios of 32-to-1) and is output over multiple high-speed data pins. ARM's ETM standard can use as many as 20 pins, with almost every pin running at hundreds of megahertz.

Despite the compression, this is still a huge amount of data: 1 gigabyte of ARM ETM version 1 trace data covers only about 1 second of execution time on a 300 MHz ARM9 CPU. This volume of output causes problems for many different parts of a trace-capable debugging tool. And it doesn't get you much run-time in which to characterize your problem: some systems take more than 1 second just to boot!

High-speed serial trace, which we discuss below, addresses two of these problems: the need to dedicate large numbers of high-speed pins on a chip, and the need to output ever more data as CPU speeds increase.

First, we will quickly review some of the other problems. A complete discussion is well beyond the scope of this article, but a brief look is necessary to appreciate the enormity of the task of using trace data effectively.

The biggest problem of trace is its size and bandwidth. Collecting trace data at enormous speeds and storing it in real time to an enormous, fast memory array is challenging enough. But what you do with the data afterwards is even more difficult.

We're jaded about storage these days, perhaps from reading too many electronics store ads that advertise $250 1-terabyte hard drives, and from using desktop operating systems that need 1 gigabyte of memory to work even tolerably well.

The storage and memory available today make 1 GB of data look pedestrian. Yet 1 GB is a huge amount of data: 32-bit Linux provides only 2 GB of available memory in a process's address space. Earlier, we mentioned that trace data can be compressed as much as 32 times, so 1 GB of compressed trace can expand to as much as 32 GB. That makes it impractical to directly manipulate even 1 second's worth of uncompressed trace data on a 32-bit computer.
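As a rough check on that claim, the expansion can be worked out directly. This is a minimal sketch, assuming decimal gigabytes and assuming the best-case 32-to-1 ratio applies to the whole capture; the variable and function names are illustrative only.

```python
GB = 10**9  # assumption: decimal gigabyte (the article does not say which)

def uncompressed_bytes(compressed_bytes, ratio):
    """Upper bound on the expanded trace size for a given compression ratio."""
    return compressed_bytes * ratio

compressed = 1 * GB        # about 1 second of trace, per the article
ratio = 32                 # best-case compression ratio quoted in the article
address_space = 2 * GB     # per-process limit quoted for 32-bit Linux

expanded = uncompressed_bytes(compressed, ratio)
times_over_limit = expanded / address_space

print(f"expanded trace: {expanded / GB:.0f} GB")
print(f"times larger than the address space: {times_over_limit:.0f}")
```

Under these assumptions the expanded trace is 32 GB, or 16 times the address space a 32-bit process can use, so it could only ever be examined in pieces.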

And even if we had 64-bit computers with dozens of gigabytes of memory, moving 1 GB of data from the trace collection probe to a host computer is not a trivial task. 100base-T Ethernet would take about 80 seconds to transfer 1 GB of data, assuming the trace collection probe and the host computer can fully saturate the link. Due to network traffic and operating system overheads, they often can't.

Even if we had Gigabit Ethernet and could saturate it, at 10 times the speed of 100base-T, hard drive write speeds would still limit our transfer speeds. The fastest desktop hard drives can perhaps write between 20 and 30 MB/sec, which is roughly 4 to 6 times slower than Gigabit Ethernet's theoretical 125 MB/sec.
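The transfer times quoted above follow from simple division. The sketch below is illustrative only: it assumes decimal units (1 GB = 10^9 bytes), fully saturated links with no protocol overhead, and uses made-up names for the table of rates.

```python
GB = 10**9  # assumption: decimal gigabyte
MB = 10**6  # assumption: decimal megabyte

def transfer_seconds(size_bytes, rate_bytes_per_sec):
    """Ideal time to move size_bytes at a sustained rate, ignoring overhead."""
    return size_bytes / rate_bytes_per_sec

# Link rates are nominal line rates in bits/sec divided by 8;
# the drive rates are the 20 to 30 MB/sec range given in the article.
rates = {
    "100base-T Ethernet": 100e6 / 8,
    "Gigabit Ethernet": 1000e6 / 8,
    "hard drive at 30 MB/sec": 30 * MB,
    "hard drive at 20 MB/sec": 20 * MB,
}

times = {name: transfer_seconds(1 * GB, rate) for name, rate in rates.items()}

for name, seconds in times.items():
    print(f"{name}: {seconds:.0f} s per GB")
```

With these idealized numbers, 1 GB takes 80 seconds over 100base-T and 8 seconds over Gigabit Ethernet, but 33 to 50 seconds to write to disk, so the drive becomes the bottleneck as soon as the network is fast enough.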

So 1 GB of trace, just 0.1 percent of the capacity of the largest desktop hard drive we can practically buy, still has storage and bandwidth requirements that far outpace any technology we can use to process it. We may be able to store it, but it's very difficult to do anything with it afterwards.

Let's assume that storage and bandwidth aren't limiting factors. In that case, we meet what is probably the biggest limiting factor of all: human interaction with trace data. For the ARM9's trace port that we mentioned earlier, 1 GB of trace data holds about 384 million CPU cycles of instructions.

And this is for a modest 1 second of actual runtime. Current trace tools basically ask you to find a bug in over 300 million cycles' worth of CPU instructions. No one in their right mind would attempt this: it is worse than finding a needle in a haystack!
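To put those figures side by side, here is a small back-of-the-envelope sketch. It uses only the article's numbers (1 GB, 384 million cycles, 300 MHz) plus two labeled assumptions: decimal gigabytes, and a purely hypothetical manual review rate of one cycle per second.

```python
GB = 10**9                 # assumption: decimal gigabyte
trace_bytes = 1 * GB       # trace capture size from the article
cycles_in_trace = 384e6    # cycles held in 1 GB, from the article
cpu_clock_hz = 300e6       # 300 MHz ARM9, from the article

# Run time covered by the capture, and the average trace bytes per cycle.
seconds_covered = cycles_in_trace / cpu_clock_hz
bytes_per_cycle = trace_bytes / cycles_in_trace

# Hypothetical review rate, NOT from the article: one cycle inspected per second.
review_rate_per_sec = 1.0
seconds_per_year = 365 * 24 * 3600
review_years = cycles_in_trace / review_rate_per_sec / seconds_per_year

print(f"run time covered: {seconds_covered:.2f} s")
print(f"trace bytes per cycle: {bytes_per_cycle:.2f}")
print(f"years to review by hand: {review_years:.1f}")
```

The capture works out to about 1.28 seconds of execution at roughly 2.6 bytes of trace per cycle, and even at the optimistic hypothetical review rate it would take a person around 12 years to step through it.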

Clearly, if we want to debug a very modest amount of CPU runtime with trace, we must overcome some very high hurdles. Doing this requires rethinking completely how we use trace data, and how it fits into our tools.

What use is all the data in the world if you can't do something useful with it? Before attempting to answer that, let's look at an even more fundamental issue: as CPU technology keeps advancing, how do we get enough trace data off the chip in the first place, so that we can worry about very advanced tools later on?

To answer that, let's look at how the competing demands for ever higher execution speed, ever lower costs, and ever lower power consumption are already hampering current trace technology.

The execution ability of CPUs has grown by leaps and bounds. Debugging technology now has to keep up, so that we don't paint ourselves into an undebuggable corner with the new complexity that today's very fast and capable CPUs make possible. More system designers are using SoCs with more complex devices integrated onto one chip, and they need to debug these complex systems.
