By D. R. Gonzales and Brian Branson
The good news is that we can now cram a
lot of functionality into our system-on-chips, including multiple
processors. The bad news is that we still have to debug our
hardware/software designs and make them work. The really good news
is that we have the test hardware and software to do the job, even
for MP systems. This debuggability is especially critical for
wireless portable devices, which may have multiple processors,
generally a DSP and a microcontroller.
SoC design methodologies for programmable cores now include
static debug blocks, which may be used during the early stages of
product development. By including additional debug-related
capability on-chip, suppliers are offering designers the ability to
fully understand the behavior of a given system, including
validation of both hardware and software architectures, and their
interdependence. This is essential for evaluating real-time power
consumption in Internet-ready handheld devices.
Wireless SoCs must also satisfy two demands: the need to develop
and debug power-efficient handheld wireless devices, and the
market's need for rapid product development. Debugging highly
integrated multiple-core systems on a single chip can be done if the
necessary hardware and software resources are available.
A good example of SoC debuggability is the Motorola 3G Baseband
Transceiver, an SoC that implements a handset design by integrating
an M•CORE M341 micro-RISC core and a StarCore DSP core onto a
single chip. This SoC also implements a real-time debug port based
on the IEEE Industry Standards and Technology Organization (ISTO)
Nexus 5001 Forum specification.
The IEEE-ISTO Nexus 5001 Forum is
defining a common set of microcontroller on-chip debug features,
protocols, pins, and interfaces to external tools and tool sets
that address the static debug requirements of embedded real-time
architectures. The Nexus standard defines a scalable set of
features that enable existing debug blocks to be accessed via an
extensible auxiliary port. The features associated with this new
auxiliary port focus on the required real-time transfer of
information to and from the embedded microcontroller.
The Design—Power Profiles
Designing a low power/high performance system, such as a
cellular handset, is no mean feat. These are complex designs that
must live within critical power and performance envelopes.
Figure 1: A digital cellular handset can be partitioned
into three main sections. The RF section receives and transmits
analog and/or digital information; the Analog Baseband and Control
section handles intermediate frequency conversion, user interaction
and power control; and the Power Management section distributes and
manages power to all elements of the handset.
In first- and second-generation digital cellular solutions,
overall baseband power consumption is derived from the combination
of:
- Standby leakage power
- Active power for time-based protocol software stacks and data
(voice) transmission
- System event power induced by an active page or call, or other
user-induced event as illustrated in Figure 2.
Figure 2: Characteristic Power Consumption for Cellular
Handset
The relative periods of standby and active power can be
calculated with decent accuracy based on knowledge of the wireless
protocol. Standby power consumption can hence be estimated via
leakage current information for a given technology, and knowledge
of the amount of time the chip stays in this inactive mode.
Active power consumption is a bit more difficult to estimate,
but for repetitive software stacks performing known protocol
functions, this too can be relatively accurately determined.
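The standby and active estimates described above amount to a duty-cycle-weighted average. The following is a minimal sketch of that arithmetic; the state names, currents, and time fractions are hypothetical and do not come from any real handset.

```python
def average_current_ma(states):
    """Return the time-weighted average current in mA.

    `states` maps a state name to (current_ma, fraction_of_time).
    The fractions must sum to 1.
    """
    total_fraction = sum(frac for _, frac in states.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(current * frac for current, frac in states.values())


# Hypothetical figures for illustration only.
example_states = {
    "standby": (1.0, 0.95),   # leakage-dominated, most of the time
    "active": (60.0, 0.05),   # protocol stack and voice traffic
}
avg = average_current_ma(example_states)
print(f"average current: {avg:.2f} mA")
```

The protocol fixes the time fractions, which is why these two states can be estimated on paper; user-induced events have no such fixed schedule.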
Consider then the problem of estimating and optimizing on-chip
power consumption during user-induced events such as Wireless
Application Protocol (WAP) browsing, high-speed downlink/uplink
transactions, and Moving Picture Experts Group (MPEG-4)
structured audio activity.
The embedded system contains all the necessary capability to
perform these functions, even in parallel with other events, but
their behavior is much less deterministic. Software written to
handle this multitude of system activity must be carefully
optimized to improve overall battery life for a particular
application.
| Task | Digital Power | Analog Power | RF Power |
| --- | --- | --- | --- |
| Network Access | 40 mA | 20 mA | 40 mA |
| Call Service | 20/30 mA | 20 mA | 50 mA |
| 3G Playback | 35 mA | 15 mA | 50 mA |
Table 1: Prior studies show that the three main blocks of
the cellular handset each draw from 15 to 50 mA of current,
depending on their state of activity.
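Summing the Table 1 figures per task gives a rough total handset current. This sketch assumes that the "20/30" entry denotes a low/high range, which the table does not state explicitly; the dictionary layout and function name are illustrative only.

```python
# Currents from Table 1, in mA, stored as (low, high) pairs.
# Treating "20/30" as a range is an assumption of this sketch.
TABLE_1 = {
    "Network Access": {"digital": (40, 40), "analog": (20, 20), "rf": (40, 40)},
    "Call Service": {"digital": (20, 30), "analog": (20, 20), "rf": (50, 50)},
    "3G Playback": {"digital": (35, 35), "analog": (15, 15), "rf": (50, 50)},
}


def total_current_ma(task):
    """Return the (low, high) total current in mA across all three blocks."""
    blocks = list(TABLE_1[task].values())
    return (sum(low for low, _ in blocks), sum(high for _, high in blocks))


for task in TABLE_1:
    low, high = total_current_ma(task)
    print(f"{task}: {low}-{high} mA")
```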
Real-Time Performance Analysis
Lab bench analysis of prototype systems permits conventional
methods of evaluation such as circuit boards with logic analyzer
interfaces. Typically these boards provide a means for initial
power up and integration of software and hardware modules.
Each core in the baseband processor chip is evaluated
individually in a static debug form. They are each put in a special
mode of operation and their programmer model registers and memory
are checked while single stepping a test program that was
downloaded from a host computer. Once the system passes the "smoke
test" where each processor exits reset and performs initialization
functions correctly, the task of debugging real-time kernel and
interrupt structures begins.
This type of debugging for a real-time wireless device has
traditionally required a logic analyzer to monitor the external bus
interface at selected clock cycles and to sample the bus activity
for recording and later analysis. Unfortunately, this approach is
becoming extremely expensive and even physically impossible as
microcontroller and DSP speeds move above 100-MHz. It is even worse
for SoCs with on-chip microcontrollers, DSP, and memory, where
there is no place to attach the logic analyzer to monitor the
on-chip processor activity.
Traditionally, when the bus interfaces are not available,
developers embedded "printf" statements in their code at strategic
points so that the data needed is sent to a peripheral port and
retrieved by a host processor. In this approach, the information
captured is minimal and the technique inserts intrusive delays in
the application. It is also unacceptably time-consuming,
especially as system software layers become more complex. An
on-chip hardware/software approach is needed that enables the
on-chip processors to be monitored and those results to be
transferred to the host for analysis.
Cellular Performance Considerations
But debugging cellular phones effectively requires more than just
simple software debugging. The ability to track the performance and
power usage of the design is also needed. Moreover, these are not
simple one-processor designs. Typically, cellular handsets have
multiple processors, usually a microcontroller for control
functions and a DSP for signal processing. The debugging resources
need to be able to track both processors and their coordination, as
well as to track the on-chip peripherals. Thus cellular handsets
require specialized debugging resources tailored to the specifics
of handset design.
The heart of the cellular handset is the baseband transceiver,
which performs all computations relative to call service, Internet
Web interaction, and handset control.
Figure 3: A block diagram of a Motorola wireless baseband
processor, including separate MCU and DSP core complexes,
interfaced to separate on-chip RAM and ROM memories and
core-specific peripheral and I/O functions.
One way to fully track each processor core's operation
separately, as well as its interaction with the other, would be to
pin out the internal core buses to external pads, thus achieving
good visibility of both cores' bus cycles. However, this approach
is simply too costly, as cellular SoCs need to minimize I/O and
package costs. Nonetheless, system hardware and software architects
still need a method of understanding standalone and integrated core
processor behavior.
WAP Debug Hardware Requirements
There is a solution, a standard solution at that. The IEEE-ISTO
Nexus 5001 Specification defines the mechanisms needed for
debugging an Internet-ready handset. The WAP architectural
specification focuses on optimizing for efficient use of device
resources. But the task of providing a communications protocol as
well as an Internet protocol layer dictates that the RAM required
will be 1- to 4-MB and the Flash ROM, which will hold the kernel,
will be on the order of 256- to 512-KB.
Since the number of external accesses to RAM directly affects
power consumption, the microcontroller engine must have an
efficient instruction set and resident cache and Memory Management
Unit (MMU) to reduce external bus transactions as illustrated in
Figure 4. A key power consumption goal is to write the handset code
so that it efficiently uses the cache.
Figure 4: M•CORE M341 Processor with Nexus 5001
Debug Port
Once the cache and MMU are enabled, the interaction of the core
with the cache is no longer visible to the user unless there is a
cacheable instruction or data miss resulting in an external access
to fill the cache. This problem is aggravated further when
developers must debug code that exhibits abnormal behavior in
real-time or there is a need to capture power measurements when
running specific code.
A good example of a handset SoC design that implements a Nexus
5001 port for debugging is the M•CORE M341 microcontroller
core, the microcontroller part of the handset design. The same
techniques are used for the DSP processor as well. The
M•CORE's Nexus 5001 port supports accessing user resources
using a high speed output port to transmit real-time program and
data information. The feature set of the Nexus 5001 port is
Class 3, providing static debug capability and real-time process
identification, program trace, data trace, and read/write access to
M•CORE Local Bus (MLB) resources. Of course, using a 2- to
8-bit output port for reporting real-time 32-bit address and data
values requires an efficient data transmission method.
Using Public Messages
A set of data packets commonly referred to as Public Messages in
the Nexus 5001 specification has been defined for the efficient
transfer of debug information between the embedded processor and a
development system. Public Messages consist of a transfer code or
TCODE, source processor identification number, and the data
associated with the particular feature being reported. A key
requirement in the definition of the Public Messages is efficiency;
thus packets may be variable in length depending on the TCODE.
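To make the variable-length packet idea concrete, here is a minimal sketch that packs a TCODE, a source ID, and a payload into one bit string and counts the clocks needed to shift it out over a narrow port. The field widths, the field ordering, and the example TCODE value are assumptions made for illustration and are not taken from the M341 implementation.

```python
import math

# Field widths are assumptions for this sketch, not values from the article.
TCODE_BITS = 6
SOURCE_ID_BITS = 4


def pack_message(tcode, source_id, payload, payload_bits):
    """Pack TCODE, source ID and payload into one integer, TCODE in the low bits.

    Returns (value, total_bits). payload_bits varies with the message type.
    """
    for name, field, width in (
        ("tcode", tcode, TCODE_BITS),
        ("source_id", source_id, SOURCE_ID_BITS),
        ("payload", payload, payload_bits),
    ):
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    value = tcode | (source_id << TCODE_BITS)
    value |= payload << (TCODE_BITS + SOURCE_ID_BITS)
    return value, TCODE_BITS + SOURCE_ID_BITS + payload_bits


def port_clocks(total_bits, port_width):
    """Number of output clocks needed to shift total_bits over the port."""
    return math.ceil(total_bits / port_width)


# Hypothetical message: TCODE 4, processor 1, 8-bit payload.
value, bits = pack_message(tcode=4, source_id=1, payload=0x1F, payload_bits=8)
print(bits, port_clocks(bits, 2), port_clocks(bits, 8))
```

A shorter payload directly shortens the time the message occupies the port, which is why keeping packets variable in length matters on a 2-bit interface.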
Messaging capability is controlled by a JTAG serial interface.
The JTAG interface couples to a OnCE static debug block and
provides access to all Nexus 5001 registers on the M•CORE
M341 processor. Messaging capability is enabled prior to
deassertion of the reset pin so exit from reset may be
monitored.
Monitoring Program Flow
Following program behavior can be reduced to the changes in the
program counter due to branching, jumping to subroutines, and
servicing interrupts and exceptions. Analysis shows that on average
12-13% of instructions executed in a program are change-of-flow
instructions. Therefore, it is not necessary to report every
instruction's address but rather only report the change of flow.
What is needed to follow the source listing is where you are
relative to a reference start address, and where you are going when
you change program flow.
Three types of public messages report program flow behavior.
Real-time operating system (OS) debugging must have a means for
reporting a process ownership identifier. The objective of the
Ownership Trace Message is to give the most current value of the
data bus when a process writes to a special address. This address
is called the User Base Address; comparators in the Nexus 5001
snoop logic trigger a capture of the data bus when it is written.
Thus, whenever a context switch of the OS occurs, a process
identifier may be transmitted using an Ownership Trace Message.
This may be key for correlating virtual to physical address maps of
the MMU when sending messages to the source level debugger.
Branch Trace Messages report when direct branch or indirect
branch instructions are executed. The difference in the messages is
that in a direct branch occurrence, the only information needed is
the number of instructions executed since the last change of flow.
A reference address using a Sync Message is normally transmitted to
establish where the program counter currently is. After that, all
references are made to that address until an indirect change of
flow occurs. This reduces the number of bits transmitted in a
message. Indirect branch messages report the number of instructions
executed since the last change of flow and the address where the
program counter is jumping to, thus establishing a new reference
address.
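The host-side reconstruction described above can be sketched as follows. This is an illustration under stated assumptions, not the Nexus decoding algorithm: the tuple message format is invented, instructions are assumed to be a fixed 2 bytes, the instruction count is assumed to include the branch itself, and direct branch targets are assumed to come from a table built from the program image (`direct_targets`).

```python
INSN_BYTES = 2  # assumed fixed instruction size for this sketch


def reconstruct_flow(messages, direct_targets):
    """Rebuild (branch_address, destination) pairs from trace messages.

    messages: ("sync", address), ("direct", icnt) or ("indirect", icnt, target)
    direct_targets: branch address -> target, taken from the program image
    """
    reference = None
    flow = []
    for message in messages:
        kind = message[0]
        if kind == "sync":
            reference = message[1]       # full address: new reference point
            continue
        if reference is None:
            raise ValueError("branch message received before a sync message")
        icnt = message[1]                # instructions since last change of flow
        branch_address = reference + (icnt - 1) * INSN_BYTES
        if kind == "direct":
            destination = direct_targets[branch_address]  # known statically
        elif kind == "indirect":
            destination = message[2]     # target carried in the message
        else:
            raise ValueError(f"unknown message kind: {kind}")
        flow.append((branch_address, destination))
        reference = destination
    return flow


trace = [("sync", 0x1000), ("direct", 3), ("indirect", 2, 0x2000)]
flow = reconstruct_flow(trace, direct_targets={0x1004: 0x1020})
print([(hex(a), hex(b)) for a, b in flow])
```

Only the indirect message carries an address, since a direct branch's destination can be read from the code itself.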
If you need to report specific memory accesses, the Watchpoint
Message does the job. This message triggers off hardware
comparators and complex access qualifiers, which monitor the
M•CORE virtual bus.
The idea is to set a watchpoint trigger where a signal as well
as a message may be transmitted. The message tells which of the
watchpoint triggers occurred. This is especially valuable for
debugging variable writes. For example, if you have a global
variable that is being modified by a number of processes and you
want to pinpoint which of those processes is accessing that
variable, the watchpoint message is the tool to use.
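The global-variable scenario can be modeled by combining ownership and watchpoint messages. The sketch below is a software model only; the addresses, the message tuples, and the single watchpoint index are hypothetical.

```python
USER_BASE_ADDRESS = 0xFFFF0000   # hypothetical ownership-trace address
WATCHED_VARIABLE = 0x20000010    # hypothetical address of the global variable


def trace_bus(accesses):
    """Turn (kind, address, data) bus accesses into trace messages."""
    messages = []
    for kind, address, data in accesses:
        if kind == "write" and address == USER_BASE_ADDRESS:
            messages.append(("ownership", data))   # data holds the process ID
        elif kind == "write" and address == WATCHED_VARIABLE:
            messages.append(("watchpoint", 0))     # watchpoint number 0 fired
    return messages


def writers(messages):
    """Return the process ID in effect at each watchpoint hit."""
    current_process = None
    hits = []
    for kind, value in messages:
        if kind == "ownership":
            current_process = value
        else:
            hits.append(current_process)
    return hits


bus = [
    ("write", USER_BASE_ADDRESS, 7),   # OS switches to process 7
    ("write", WATCHED_VARIABLE, 1),
    ("read", WATCHED_VARIABLE, 1),     # reads do not match the qualifier
    ("write", USER_BASE_ADDRESS, 9),   # OS switches to process 9
    ("write", WATCHED_VARIABLE, 2),
]
print(writers(trace_bus(bus)))
```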
This feature also asserts an event out pin, which may trigger a
logic analyzer to capture specific public messages and/or
peripheral signals. For power analysis, the trigger may be used to
capture current measurements at specific points in code or data
accesses, which may be useful for pinpointing power consuming hot
spots.
Monitoring Data Variables
Data Trace messages provide a means for reporting real-time data
accesses to memory locations. Data loads and stores occur much
more frequently than program flow changes. Analysis
has shown that as much as 25% of instructions executed in a program
are data accesses. Data Messages may be used to report stack
contents, global and local variables as well as peripheral port
accesses.
To control the number of Data Messages transmitted, data trace
qualifiers include the access type (read, write, or either) as
well as a start and stop address range. If the data address and
access type qualifiers are met, Data Messages are generated and
sent to the debug port. This narrows the window of memory
locations that may generate a Data Message.
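The qualifier logic can be expressed as a simple predicate. This is a minimal model; the class name, field names, and the inclusive end address are assumptions of the sketch rather than details of the Nexus registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataTraceQualifier:
    start: int     # first address in the trace window
    end: int       # last address in the trace window (inclusive, assumed)
    access: str    # "read", "write", or "either"

    def matches(self, kind, address):
        """True if an access of this kind and address should be traced."""
        type_ok = self.access == "either" or self.access == kind
        return type_ok and self.start <= address <= self.end


qualifier = DataTraceQualifier(start=0x2000, end=0x20FF, access="write")
accesses = [
    ("write", 0x2004),   # in range, right type: traced
    ("read", 0x2004),    # in range, wrong type
    ("write", 0x3000),   # right type, out of range
]
traced = [a for a in accesses if qualifier.matches(*a)]
print(traced)
```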
Sending only the unique portion of the data address instead of
the complete address reduces output bandwidth requirements for the
debug port. Consequently, the host reconstructs each data trace
address relative to the prior message, starting from the full
reference address carried by a synchronization message.
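One way to send only the unique portion of an address is to XOR it with the previous address and drop the leading zero bits. The sketch below assumes that scheme for illustration; the article does not say which encoding the M341 port uses.

```python
def compress(address, previous):
    """Return (delta, bits): address XOR previous, with leading zeros dropped.

    XOR-based relative addressing is an assumption of this sketch.
    """
    delta = address ^ previous
    return delta, max(delta.bit_length(), 1)


def expand(delta, previous):
    """Recover the full address on the host side."""
    return delta ^ previous


reference = 0x20000100                       # full address from a sync message
first = compress(0x20000104, reference)      # nearby access
second = compress(0x20000180, 0x20000104)    # relative to the prior message
print(first, second)
```

Accesses that stay close together in memory differ only in a few low-order bits, so each message carries a handful of address bits instead of all 32.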
Real-Time Data Access Capability
The M•CORE M341's Nexus port provides access to the MLB
mapped resources via the JTAG port. A Ready for Transfer pin (RDY)
was added to increase the transfer rate. Calculations show that
accesses to the Read/Write Data Register allow for a throughput of
1-Mbps on an M•CORE M341 microcontroller operating at 50-MHz
system clock. Block transfers are possible with a single setup of
Read/Write control and address registers. This permits 32-bit
transfers in 38 JTAG clocks where each JTAG clock is one-half of
the system clock.
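The clock figures above imply a ceiling on the block transfer rate, worked through below. This is raw arithmetic on the quoted numbers only: it ignores register setup, RDY handshaking, and host-side latency, so it is an upper bound and is not directly comparable to the throughput figure quoted for accesses to the Read/Write Data Register.

```python
SYSTEM_CLOCK_HZ = 50_000_000             # system clock from the text
JTAG_CLOCK_HZ = SYSTEM_CLOCK_HZ / 2      # JTAG clock is half the system clock
CLOCKS_PER_TRANSFER = 38                 # JTAG clocks per 32-bit transfer
BITS_PER_TRANSFER = 32

transfers_per_second = JTAG_CLOCK_HZ / CLOCKS_PER_TRANSFER
payload_bits_per_second = transfers_per_second * BITS_PER_TRANSFER

print(f"{transfers_per_second:,.0f} transfers/s")
print(f"{payload_bits_per_second / 1e6:.2f} Mbit/s payload ceiling")
```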
This capability significantly reduces program and data load
times as well as enables the developer to examine arrays of memory
without stopping the application. Data Trace Messages only report
data movement within a well-defined data window while Read/Write
Access permits accessing values asynchronously. This feature is
very useful for downloading new filter coefficients or encryption
keys when testing communication protocols. Another important use is
the retrieval of power data values, which may be built into the
power management unit of the handset.
Real-Time Debug Tool Support
Two key ingredients to successfully reducing development time
are on-chip circuits such as the Nexus 5001 port and the associated
development tools, which support the Nexus interface. The Nexus
5001 specification defines the pins, connectors, and the protocol
for transferring messages to and from the host computer.
However, it is very difficult to define stringent rules for
Nexus register sizes, bit positions, and other implementation
specific details, which may not suit particular semiconductor
vendors' architectures. An application programming interface (API)
that abstracts implementation details is the ideal solution for
tool vendors.
Figure 5: A lab debug environment that uses an integrated
software tool set coupled to a logic analyzer and power
source/monitor for complete handset development
The emulation controller provides the abstraction layer so that
an API may be defined that hides the details of the emulation
controller's Nexus interface from the tool vendor. An FPGA was
added to the emulation controller to reconstruct the full message
from the two Nexus port output pins into a 40-bit wide message with
a message trigger signal. This improves utilization of the logic
analyzer's trace buffer.
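The reassembly step performed by the FPGA can be modeled in software as shown below. The bit ordering (first sample holds the least significant bits) and the function names are assumptions of this sketch; only the 2-pin port and the 40-bit word width come from the text.

```python
PORT_WIDTH = 2    # output pins sampled per clock, per the text
WORD_BITS = 40    # reconstructed message width, per the text


def split_word(word):
    """Model the chip side: emit the word as 2-bit samples, low bits first."""
    mask = (1 << PORT_WIDTH) - 1
    return [(word >> shift) & mask for shift in range(0, WORD_BITS, PORT_WIDTH)]


def assemble_word(samples):
    """Model the FPGA side: rebuild one 40-bit word from the port samples."""
    if len(samples) != WORD_BITS // PORT_WIDTH:
        raise ValueError("wrong number of samples for one word")
    word = 0
    for index, sample in enumerate(samples):
        if not 0 <= sample < (1 << PORT_WIDTH):
            raise ValueError("sample wider than the port")
        word |= sample << (index * PORT_WIDTH)
    return word


original = 0xABCDEF0123
samples = split_word(original)
rebuilt = assemble_word(samples)
print(len(samples), hex(rebuilt))
```

Storing one assembled word per trace entry, rather than 20 separate 2-bit samples, is what lets the logic analyzer's buffer hold more messages.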
Classical debuggers use a load, arm, go scenario where once the
debugger has sent the target processor(s) to work, the debug
environment is frozen until the target processor re-enters a debug
or interrogation mode. To fully exploit the real-time debug
capabilities of the M341 Nexus port, the debugger must permit
interrogation of target resources while the target executes code
in real time.
During initial development of the M341 Nexus port, a Hi-Ware
(Hiwave) debugger was interfaced to a Tektronix TLA-714 logic
analyzer. The Hiwave debugger directly interacts with the Tektronix
logic analyzer to arm its trace buffer for message captures and
later displays the trace buffer contents within the Hiwave
environment.
Since the M341 processor has a different instruction set and
programmer's model than the DSP56600 architecture, a dual
integrated environment with split windows, one for each
processor, is used for debugging the baseband
transceiver. A single emulation controller is used that can
communicate with each processor using the JTAG protocol. A
semaphore configuration in the dual debugger's control module
regulates traffic to the emulation controller so there are no
message collisions when communicating with either processor.
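The arbitration idea can be sketched with a lock guarding a shared controller object. This is a generic illustration of the semaphore approach, not the actual debugger implementation; the class, the step names, and the use of Python threads are all hypothetical.

```python
import threading


class EmulationController:
    """Shared controller; one multi-step JTAG transaction at a time."""

    def __init__(self):
        self._lock = threading.Lock()   # binary semaphore guarding the link
        self.log = []

    def transaction(self, core, steps):
        # Hold the lock for the whole transaction so that steps addressed
        # to different cores never interleave on the shared link.
        with self._lock:
            for step in steps:
                self.log.append((core, step))


def run_demo():
    controller = EmulationController()
    steps = ["select", "shift-ir", "shift-dr"]
    threads = [
        threading.Thread(target=controller.transaction, args=(core, steps))
        for core in ("MCU", "DSP")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return controller.log


log = run_demo()
print(log)
```

Whichever debugger window acquires the lock first completes its full sequence before the other begins, so the order of cores may vary between runs but the steps never mix.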
Minimizing Debugging Penalties
The additional feature set of the Nexus port doesn't come
without some die area and power penalty. Therefore, during its
implementation all sub-module clocks were gated off for inactive
circuitry and the message decode state machine and logic were
enabled/disabled via Nexus control. Special consideration was given
to the message queues to reduce power and the output port was made
variable width to accommodate a 2- or 8-bit width.
This is quite important from a development perspective. During
lab analysis the 8-bit port was used, since there was room to
add larger connectors on the evaluation cards. Once the handset
ergonomics had been finalized and the high-density double-sided
surface mount board was in use, the reduced bandwidth of the
narrower output port was judged feasible at that point in
system development.
Overall the Nexus Class 3 implementation was 7.5% of the M341
processor area. But considering the size of the complete baseband
transceiver, it becomes quite small relative to the addition of
on-chip memories and the DSP.
For more information on the IEEE-ISTO Nexus 5001 Forum, read
"IEEE-ISTO Nexus 5001 Forum Targets Real-Time Debugging."