Debugger performance matters: The importance of good metrics

Debugging is the most difficult and costly phase of software development for systems large and small. Deeply embedded systems don't have the standard PC user interfaces of keyboards, mice, graphic displays, or even network consoles, so you need specialized debugging tools to get the critical system information necessary to find and fix bugs.

For many systems, that access is provided by a hardware debug device which communicates with your system's microprocessors through an on-chip debug (OCD) port. These debug devices can have dramatically different performance characteristics.

Your development time is valuable, so when you build your system and select a hardware debug device, make sure you carefully consider debugging performance, or you may find yourself waiting when you should be debugging.

Debugging Performance Metrics
The difficulty in measuring "debugging performance" lies in the varied nature of debugging. One day's task is nothing like the next, and debugging as a whole can be thought of as a sequence of experiments undertaken one after the other.

Figure 1. Typical embedded debug setup

Piece by piece, you try to pin down the problem, working through the same section of code as different scenarios are examined, tested, and then set aside. During this process, you spend a lot of time reloading or reprogramming your application, re-running the application to specific breakpoints, stepping through code, uploading logging or trace information, and examining the state of the system.

Fortunately for measuring debugging performance, the time taken for these tasks is dominated by a single factor: memory access speed on your system. Reloading or reprogramming the application requires direct writes to RAM and/or non-volatile memory.

Need to read out that log of what the system was doing when it died? Want to view a peripheral's memory-mapped registers? Trying to debug your deadlocked application, looking for which of your system's tasks is holding a semaphore when it shouldn't?

In all of these cases, it's memory access to the rescue. If memory access is slow, you're looking at a lot of dead time while debugging, waiting on your debugging system to catch up.

As such, memory access speed is the fundamental measure of productivity and performance for debugging an embedded processor. It is also easy to measure; simply dump a large number of pseudo-random bytes from the debugging host into the memory of the system under debug and time how long it takes to complete. This will give the memory write speed of the system, and read speed can be measured by simply reading the data back.
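The benchmark described above is easy to script if your debugger exposes a bulk-write call. A minimal sketch, assuming a hypothetical `debug_write(address, data)` function standing in for whatever your debugger's scripting interface actually provides:

```python
import os
import time

def measure_write_speed(debug_write, n_bytes=1 << 20):
    """Time a bulk write of pseudo-random bytes through a debug link.

    `debug_write` is a stand-in for the debugger's bulk-write call
    (a hypothetical API, named here for illustration only).
    Returns measured throughput in KB/sec.
    """
    payload = os.urandom(n_bytes)            # pseudo-random test pattern
    start = time.perf_counter()
    debug_write(0x2000_0000, payload)        # example target RAM base address
    elapsed = time.perf_counter() - start
    return n_bytes / 1024 / elapsed
```

Read speed can be measured the same way: read the region back, time it, and compare the data against the original payload to catch link errors.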

Why does memory access performance vary from system to system? Performance bottlenecks lurk everywhere, but the most important ones are the hardware debug device and the design of the microprocessor's debug port.

In addition, certain other design factors of the system under debug (not just the selection of microprocessor) can affect memory access speed. To understand this better, we must examine the details of debug-mode memory access through a debug port.

Memory Access Through a Debug Port
Debug ports come in many shapes and sizes. The focus here will be on debug ports based on the widely used IEEE 1149.1 boundary scan standard (commonly known as "JTAG"). This four-signal standard was originally designed for in-circuit PCB and device testing, but has been extended to include software debugging.

The standard has a number of characteristics which make it well-suited to the task of in-system debugging, including access to multiple devices simultaneously, and the possibility of combining software debug, manufacturing test, and device programming into a single low-pin-count connector.

JTAG is a simple interface at the pin level, with a single clock called TCK driven by the debug device. Along with TCK, the debug device sends one bit of data per TCK cycle on the TDI signal and one bit of control information on the TMS signal.

On each cycle the system under debug replies with a single bit of data out on the TDO signal. Since only one bit can be sent and received per TCK period, the frequency of TCK is a significant factor in the performance of the JTAG interface; at a TCK frequency of 10MHz the interface can carry no more than 10 million bits per second.

On top of this simple signaling scheme, microprocessors add their own protocols for allowing memory access. Some device families require thousands of JTAG TCK periods per byte read or written from memory, while the most efficient device families require only slightly more than 8 TCK cycles for each byte of memory accessed.

On the whole, most devices add somewhere between 20% and 100% of overhead in TCK periods for their most efficient memory access method, so each byte of memory read or written requires 10 to 16 JTAG TCK periods.

The topology of the system under debug can also affect memory access efficiency. One of the strengths of the JTAG standard lies in its ability to serially chain multiple devices from different manufacturers into a single scan chain that is all accessible through a single debug device.

This makes system-level testing, visibility and debug very convenient, but it comes with a cost. Systems with multiple devices in a scan chain incur extra overhead for each operation, which reduces throughput. A system with tens of devices chained together can easily cut the theoretical best-case memory access throughput in half.

Careful system design and signal routing are also required for a JTAG-based system to perform at its full potential. Remember that JTAG-based systems can send and receive only a single bit of data per TCK cycle, so it is very important that the system handle high TCK frequencies while maintaining the timing relationship of TCK to the other three JTAG signals.

If the four high-speed JTAG signals are not treated carefully in circuit design and layout, the maximum frequency of the JTAG interface may be limited, and this will limit the maximum memory access performance of the system.

The ARM1176JZF-S: putting performance metrics to work
For more in-depth analysis of memory access, let's examine the debug system provided on the ARM1176JZF-S high-performance embedded processor core (for a full discussion, see ARM's excellent user manual for this core).

The ARM11 debug port allows arbitrary opcodes to be fed to and executed on the processor core while in debug mode, and offers a register (the Debug Data Transfer Register, or DTR) that is visible to both the processor core and the debug port. A naïve but logical way to read memory from the debug port is shown in Figure 2 below.

Figure 2. Unoptimized scan sequence

This works, but for large-scale memory access it is inefficient, requiring 648 JTAG clock cycles to read a single 4-byte value from memory. To put that level of efficiency into context, we can easily compute the memory access speed of a debug device when given the number of TCK cycles required per memory access:

memory access speed = (bytes per access × TCK frequency) ÷ (TCK cycles per access)

So this scan sequence running at a typical 10MHz JTAG clock can read memory at no more than 60.3 kilobytes per second:

(4 bytes × 10,000,000 Hz) ÷ 648 cycles ≈ 61,728 bytes/sec ≈ 60.3 KB/sec

This same sequence with minor changes can be used to write memory at the same efficiency. Unfortunately, 60 kilobytes per second isn't very fast. As an example, a developer with a 2.5 megabyte application would have to wait 42 seconds each time the program is downloaded. An extra 42 seconds for every new test case or scenario quickly adds up to a significant loss of expensive developer time.
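This throughput arithmetic is easy to script. The following sketch (my own illustration of the formula, not vendor code) computes debug-mode memory throughput and download time from the TCK frequency and the cycle cost of one access:

```python
def memory_speed_kbps(tck_hz, cycles_per_access, bytes_per_access=4):
    """Debug-mode memory throughput in KB/sec for a JTAG scan sequence."""
    bytes_per_sec = tck_hz * bytes_per_access / cycles_per_access
    return bytes_per_sec / 1024

def download_seconds(app_bytes, tck_hz, cycles_per_access):
    """Time to push an application image through the debug port."""
    return app_bytes / (tck_hz * 4 / cycles_per_access)

# Unoptimized ARM11 sequence: 648 TCK cycles per 4-byte access at 10MHz
speed = memory_speed_kbps(10_000_000, 648)                      # ~60.3 KB/sec
wait = download_seconds(int(2.5 * 1024 * 1024), 10_000_000, 648)  # ~42 seconds
```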

Fortunately, it is easy to do much better. If we only execute steps 1 through 4 once and use a load instruction with auto-increment in step 5, then we increase efficiency so only 216 cycles are used for each 4-byte load or store. Thanks to the ingenuity and forethought of the ARM11 engineering team, steps 5 and 6 can also be combined and optimized so each 4-byte load or store consumes just 41 JTAG clock cycles, as shown in Figure 3, below.

Figure 3. Optimized scan sequence

Now the best-case memory transfer speed at a 10MHz JTAG clock is much faster, and debugging cycles for our hypothetical developer are practically instantaneous:

(4 bytes × 10,000,000 Hz) ÷ 41 cycles ≈ 975,610 bytes/sec ≈ 952.7 KB/sec
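Plugging the optimized 41-cycle figure into the same throughput formula (again, a sketch of the arithmetic rather than vendor code) shows the size of the win:

```python
def memory_speed_kbps(tck_hz, cycles_per_access, bytes_per_access=4):
    """Debug-mode memory throughput in KB/sec."""
    return tck_hz * bytes_per_access / cycles_per_access / 1024

naive = memory_speed_kbps(10_000_000, 648)      # ~60.3 KB/sec
optimized = memory_speed_kbps(10_000_000, 41)   # ~952.7 KB/sec
speedup = 648 / 41                              # ~15.8x fewer TCK cycles
# The 2.5MB download from the earlier example drops from ~42 seconds
# to under 3 seconds.
```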

This analysis is simple for the ARM1176JZF-S, but for other devices the process of efficient memory access is not always obvious or well documented. It is critical that debug devices use efficient memory access routines, and they must execute those routines within tight time constraints in order to achieve high performance.

Debug Device Implementation
One common way to drive these JTAG lines is through the general-purpose I/O signals (GPIOs) of a small microcontroller. This has the advantage of being inexpensive and simple. The major drawback is speed: the microcontroller must compute JTAG command sequences, extract TDI bit values, compute any TMS values and do bit operations on the GPIO registers.

If it takes even as few as 20 cycles of the microcontroller per TCK clock edge and the microcontroller runs at 60MHz, 1,640 microcontroller cycles will be required per 4-byte shift command and the maximum effective clock speed of the JTAG interface for this device will be only about 1.5MHz:

60,000,000 cycles/sec ÷ (20 cycles per edge × 2 edges per TCK period) = 1.5MHz

After accounting for the microcontroller handling the transfer of data to and from the debugging host, such a system is slowed further: if the microcontroller spends half its time moving data from the host, the effective TCK speed drops to 750kHz. Substituting this figure into the memory transfer speed equation for the ARM1176 (41 cycles per 4-byte load/store) yields a transfer speed of 71.5 KB/sec:

(4 bytes × 750,000 Hz) ÷ 41 cycles ≈ 73,171 bytes/sec ≈ 71.5 KB/sec
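The bit-banged GPIO model reduces to a couple of lines of arithmetic. A sketch, using the article's figures (20 MCU cycles per TCK edge, a 60MHz microcontroller, half the CPU consumed by host I/O):

```python
def bitbang_effective_tck(mcu_hz, cycles_per_edge, host_io_fraction=0.0):
    """Effective TCK frequency when a microcontroller bit-bangs JTAG on GPIOs.

    Each TCK period needs a rising and a falling edge; host_io_fraction is
    the share of CPU time spent shuttling data to/from the debug host."""
    return mcu_hz * (1.0 - host_io_fraction) / (cycles_per_edge * 2)

tck = bitbang_effective_tck(60_000_000, 20)               # 1.5MHz
tck_loaded = bitbang_effective_tck(60_000_000, 20, 0.5)   # 750kHz with host I/O
speed_kbps = tck_loaded * 4 / 41 / 1024                   # ~71.5 KB/sec (ARM1176)
```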

One way to speed things up is to use programmable logic to handle transforming high-level "shift commands" into the bit-level signal patterns. Using the same microcontroller example, if 100 cycles are required on average per command and each command can shift 16 JTAG TCK cycles, then each 4-byte memory access (which uses 41 JTAG TCK cycles) can be accomplished in 300 microcontroller CPU cycles.

With only 300 cycles required per 4-byte memory access and assuming that 50% of the CPU cycles are still dedicated to transferring data from the debugging host, memory access throughput increases above 390 KB/sec:

(60,000,000 cycles/sec ÷ 600 cycles per access) × 4 bytes = 400,000 bytes/sec ≈ 390.6 KB/sec
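The PLD-assisted model can be sketched the same way, using the article's assumptions (100 MCU cycles per shift command, 16 TCK cycles per command, half the CPU still spent on host I/O):

```python
import math

def pld_speed_kbps(mcu_hz, mcu_cycles_per_cmd, tck_per_cmd,
                   tck_per_access, host_io_fraction=0.5):
    """Throughput when programmable logic handles bit-level JTAG shifting."""
    cmds = math.ceil(tck_per_access / tck_per_cmd)     # 3 commands for 41 TCKs
    mcu_cycles = cmds * mcu_cycles_per_cmd             # 300 MCU cycles
    effective = mcu_cycles / (1.0 - host_io_fraction)  # 600 cycles with host I/O
    return mcu_hz / effective * 4 / 1024               # KB/sec, 4 bytes/access

speed = pld_speed_kbps(60_000_000, 100, 16, 41)  # ~390.6 KB/sec
```

Note that the TCK frequency never appears in this calculation, which is exactly the point the next section makes.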

Does actual TCK frequency matter?
An interesting fact emerges here: the actual TCK frequency no longer matters, as you can see from its absence in the above equation (one caveat: TCK frequency must be above about 4MHz, or the system will be limited by the TCK frequency; but for figuring the upper boundary of performance, TCK frequency is no longer necessarily a limiting factor).

This microcontroller system could have a TCK clock generator capable of 50MHz or more, and throughput would still be exactly 390.63 KB/sec. The throughput of the debug device has become limited by the computing power of the microcontroller and its programmable shifting logic.

The only way to increase memory access performance for this debug device is to increase the computational throughput of the microcontroller, either by increasing its clock rate or by decreasing the number of cycles needed per shift command.

Figure 4. Hardware debug device speed

This is an important piece of information to remember as you consider any JTAG-oriented hardware debug device: maximum TCK frequency is important, but the ability to fill those TCK cycles with useful work is even more critical.

Today's typical high-end hardware debug devices are often built like the example microcontroller+PLD device benchmarked in Figure 4, above. With a high-performance microprocessor and dedicated JTAG management logic, memory access speeds of 2MB/sec or more on a system like this ARM1176 example are common, but they are only possible with a well-designed and highly optimized hardware debug device.

In fact, the debug ports of some of today's devices are capable of correct operation at TCK speeds above 100MHz, offering a challenge to designers of hardware debug devices.

For the ARM1176 example, 100MHz means a throughput of 9527 KB/sec, making debugging and programming tasks virtually instantaneous. To live up to the performance potential of such systems, careful system-level design of the hardware debug device is required to ensure that bottlenecks within the device do not limit performance.

If the device is connected to the debug host by USB, the device must support USB 2.0 high speed or be limited by the 1.5 MB/sec throughput of USB 1.1. If the debug device is connected by Ethernet, a high-performance networking subsystem capable of nearly saturating 100-megabit Ethernet must be used.

On top of that, the debug device must be able to issue whatever commands are necessary to execute a 4-byte load or store every 410 ns to maintain the 9527 KB/sec transfer rate, and the system must have sufficient buffering and power to sustain that throughput while simultaneously transferring nearly 10 megabytes of data per second from the debugging host.
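The per-access time budget follows directly from the numbers above; a quick check of the arithmetic for the 100MHz ARM1176 case:

```python
tck_hz = 100_000_000          # 100MHz TCK
cycles_per_access = 41        # optimized ARM1176 scan sequence

# Each 4-byte load/store occupies 41 TCK periods: the 410 ns budget.
access_time_ns = cycles_per_access / tck_hz * 1e9   # 410 ns

# Sustained throughput at that rate.
speed_kbps = tck_hz * 4 / cycles_per_access / 1024  # ~9527 KB/sec
host_mbps = speed_kbps / 1024                       # ~9.3 MB/sec from the host

# ~9.3 MB/sec exceeds USB 1.1's ~1.5 MB/sec but fits within
# USB 2.0 high speed or near-saturated 100-megabit Ethernet.
```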

At the end of the day, you just want to get your work done in as quick and painless a way as possible. With this in mind, it is important to examine hardware debug performance when designing a new system, and especially when selecting a hardware debug device.

High-performance debugging equipment means you can spend less time waiting around for system restart and critical debugging information, and more time solving the real-world problems of a deeply embedded system.

Anderson MacKay is Engineering Manager in Green Hills Software's Target Connections group, responsible for product planning, engineering, and project management for the Probe and SuperTrace Probe products.
