When you've got multiple processors and peripherals in a system, understanding the real-time dynamics of all these chips is critical to developing reliable, cost-effective products, especially given today's short product-development schedules. Real-time embedded systems are increasingly implemented on multicore ASIC or system-on-chip (SoC) devices to take advantage of the lower power consumption, lower unit cost, and higher integration that these devices offer. Many of the standard tools that developers have relied on to see into the operation of products implemented in older technology can no longer be used in these powerful new multifunction designs. By their integrated nature, SoC architectures have processors, memory, peripherals, controllers, and other important subsystem components all on the same silicon die. High-speed internal buses connect the various components, and overall performance depends upon the efficient management of data flow among them. Bottlenecks, latency, and contention over shared resources such as buses and memory are killers for real-time data delivery. Developers, more than ever before, need visibility into what's happening under the hood in order to optimize performance.
Unfortunately, monitoring transactions among system components is no longer a simple matter of hooking up a logic analyzer or bus analyzer, since many of the signals of interest are buried deep within the chip. Visibility into an SoC requires a mix of hardware and software mechanisms to collect data within the SoC itself, backed by profiling and correlation tools that can help developers interpret the data collected.
You can get visibility in a variety of ways. In many cases, however, the very components (memory, system bus, I/O port) you're watching are used to capture information. In other words, it can be difficult to assess bus utilization or memory contention when the assessment process itself affects the results. The challenge is to capture and upload data points without adversely affecting system performance.
Welcome to the real world
Traditionally, when a logic analyzer wasn't available or was too much trouble to set up, developers used software instrumentation to gain visibility into their designs. They would add debugging code to the target that collected, processed, and uploaded debug data. Turning a timer on and off when entering and leaving a function, for example, was and still is a straightforward way to profile a function through software instrumentation.
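As a minimal sketch of the timer-based approach, the C fragment below brackets a function with enter and exit hooks that accumulate a call count, total ticks, and worst-case ticks. Everything here is hypothetical: `read_timer()` stands in for whatever free-running timer your device provides (it's faked with a variable so the sketch runs on a host), and `run_filter()` is just a placeholder for the function being profiled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-running tick counter. On real hardware this would read a
 * timer register; here it is a plain variable so the sketch runs anywhere. */
static uint32_t fake_timer;
static uint32_t read_timer(void) { return fake_timer; }

typedef struct {
    uint32_t calls;        /* number of completed invocations */
    uint32_t total_ticks;  /* ticks accumulated across all calls */
    uint32_t max_ticks;    /* longest single call seen so far */
    uint32_t start;        /* timer value captured at entry */
} profile_t;

static void profile_enter(profile_t *p)
{
    assert(p != NULL);               /* debug-build sanity check */
    p->start = read_timer();
}

static void profile_exit(profile_t *p)
{
    /* Unsigned subtraction stays correct across a single counter wrap. */
    uint32_t elapsed = read_timer() - p->start;
    p->calls++;
    p->total_ticks += elapsed;
    if (elapsed > p->max_ticks)
        p->max_ticks = elapsed;
}

static profile_t filter_profile;

/* Placeholder for the function under test. */
static void run_filter(uint32_t work_ticks)
{
    profile_enter(&filter_profile);
    fake_timer += work_ticks;        /* simulate time passing during the work */
    profile_exit(&filter_profile);
}
```

Note that the hooks only update a few words of memory; nothing is formatted or uploaded inside the function being measured.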
While it takes only a few C printf calls to instrument code, neatly format the collected data, and send it to a standard I/O device, such code can significantly affect the system's code size, memory utilization, cache performance, timing, and resource contention. These drawbacks frequently make printf suitable only for instrumenting non–real-time control code. For real-time or deterministic code, such as that used in consumer media players and recorders, hand-held wireless communication devices, telecom, robotics, or automotive applications, other, less intrusive techniques are needed to avoid interfering with the execution of the target programs or missing real-time deadlines.
There are a number of ways to increase visibility while also decreasing intrusiveness. Conceptually, monitoring a system involves collecting data, buffering the data, uploading the data from the target device, post-processing it, and displaying it. Carefully scheduling when and where these activities occur allows you to minimize the impact they have on system performance. Reducing the memory footprint associated with the instrumented code and the data-collection infrastructure enables you to collect more data and increase the accuracy or scope of your measurements of the system's real-time behavior.
Typically, it takes several times the size of a data point to record the context information needed to allow it to be interpreted meaningfully. For example, in addition to the data value at the time of collection, you may need to tag the name of the variable the data is associated with, capture a timestamp for when the value was collected, note what function was executing when the capture was made, and so on. Several techniques capture and organize this type of contextual information without relying on printf and its string-formatting capabilities. Often data contains patterns, and if you collect the data in a certain way, you can infer some of these additional characteristics without having to include them in the buffer. Some techniques for increasing visibility include:
- Record format: If you collect a single variable in a buffer, you no longer need to tag what variable was collected. If you're collecting multiple values, you can create a record format where each value has a designated slot and, again, avoid having to designate what you're collecting.
- Multiple buffers: By grouping like data points together, you can simplify circular-buffer management, reducing the latency of collecting a data point. Likewise, if you segregate data collection by priority, then when the system is at full utilization you can let circular buffers capturing less-critical information overflow instead of upsetting real-time system deadlines with an untimely upload. In any case, you'll need mechanisms to flag the overflow and potentially track how much data is lost if the buffer makes any assumptions, such as an implied timestamp that needs to be reconstructed.
- Sampled data: Configuring hardware counters and letting them run is nonintrusive. Reading a counter and uploading its value, however, is intrusive. The more frequently you log a counter, the more accurate your log but the more intrusive the collection and uploading will be. Keep frequencies low until you determine that you actually need more accurate information.
For example, a periodic profiler that logs which function is currently executing can secure a fairly accurate percentage-use profile of code. Such a profiler collects a fraction of the information collected during a log of every function call and is that much less intrusive. You can also sample data points as a low-priority task, although this may skew your results.
- Deterministic data: If the frequency of capture of sampled data is fixed, you don't need to include a timestamp. Alternatively, if data must pass through a set series of computational blocks, you need only record the data value and timestamp, since the actual block can be determined from the order of the timestamps. If you're capturing several values it might be more efficient to assume the flow of data through a series of blocks, logging the function and timestamp only and assuming the data-record format. Additionally, you can employ multiple buffers to assume both function and record, further reducing intrusiveness.
- Dynamic/intelligent logging: By collecting data only when you need to (in other words, the circumstances under which the information is of interest to you), you reduce the impact of collecting the data. Using several debug flags enables you to narrow what you capture by setting a particular flag only when the circumstances of interest are active. In this way, you don't overload the system by uploading information you don't really need. This also reduces capture overhead and conserves buffer space. Setting or checking a flag requires only a processor cycle or two, so it's a useful technique to employ even with hardware-based counters.
- Piece-wise logging: In some cases it may be possible to halt the target without affecting its execution, such as when you have no real-time operations active. If so, you can “avoid” buffer-uploading overhead by triggering a halt when it's safe to do so, then uploading buffers while the system is suspended.
- Piece-wise uploading: If you have an idle task, you can use it to upload buffers in sections when the system isn't fully utilized. While this doesn't reduce the overhead of uploading, it does shift the intrusiveness of uploading to a time when it has significantly less influence on system performance.
- RTOS monitoring: For more complex monitoring, you may find support as near as your real-time operating system. Many operating systems have built-in mechanisms and libraries that support on-chip monitoring hardware, both to ease configuration and provide foundation code for managing circular buffers and the infrastructure required to stream data off chip, as well as hooks for self-monitoring. By abstracting the process of logging and offloading data, you can quickly reconfigure what you monitor, how you monitor it, how frequently you monitor, where you capture data, and how you offload it. Before you create your own infrastructure to instrument your code, check what the RTOS provides first.
- Avoid accessing memory and other system resources: Software instrumentation should only be used as a fallback when hardware mechanisms are insufficient, such as when you need an extremely large buffer or when limited processing reduces overall intrusiveness. Ideally, if you can monitor the system bus or memory without using either the system bus or memory, you'll get more accurate results. If you keep the amount of data collected to a minimum, you can avoid using buffers in memory; just feed data directly across the JTAG or trace bus.
- Monitor in stages: If you need to collect a great deal of data, consider collecting it in a series of runs. Note that because you'll be collecting different information each run, you won't be able to correlate the results because the timestamps include differing instrumentation latencies. This technique is best when trying to figure out where to start looking for a problem. As you narrow down the possibilities, you can narrow the level of monitoring as well.
- Modules: If you pass data across a JTAG or trace bus, you can place a processing module between the target and host to handle timestamp creation and limited processing of data. By offloading the timestamp to the module, you free up bandwidth on the test bus for sending more information. Modules are also an effective way of enabling completely nonintrusive monitoring. For example, the module could snoop the system bus, monitor a specific memory-address range over the test bus, or trigger a near–real-time capture of a block of data using direct memory access (DMA).
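Several of the techniques above (a fixed record format, a circular buffer that's allowed to overflow, and flag-gated logging) can be combined in a few lines of C. The following is only a sketch under stated assumptions: the record fields, flag names, and buffer size are invented for illustration, and the code assumes a single execution context, so a real system would need to protect the indices if an interrupt and an upload task share the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 8u               /* records; must be a power of two */

/* Fixed record layout: each value has a designated slot, so no per-value tag
 * or format string is stored. Field names are illustrative only. */
typedef struct {
    uint32_t timestamp;
    uint16_t queue_depth;
    uint16_t cpu_load;
} log_record_t;

typedef struct {
    log_record_t slots[LOG_CAPACITY];
    uint32_t head;      /* total records written */
    uint32_t tail;      /* total records consumed or discarded */
    uint32_t dropped;   /* records overwritten before they were uploaded */
} log_buffer_t;

/* Debug flags: set a bit only while the circumstances of interest are active. */
enum { LOG_FLAG_QUEUE = 1u << 0, LOG_FLAG_AUDIO = 1u << 1 };
static volatile uint32_t log_flags;

static void log_put(log_buffer_t *b, uint32_t flag, log_record_t rec)
{
    if ((log_flags & flag) == 0)
        return;                                  /* capture disabled: one test, no store */
    if (b->head - b->tail == LOG_CAPACITY) {     /* full: discard the oldest record */
        b->tail++;
        b->dropped++;                            /* lets the host see how much was lost */
    }
    b->slots[b->head & (LOG_CAPACITY - 1u)] = rec;
    b->head++;
}

/* Remove the oldest record; returns 0 when the buffer is empty. */
static int log_get(log_buffer_t *b, log_record_t *out)
{
    assert(b != NULL && out != NULL);
    if (b->head == b->tail)
        return 0;
    *out = b->slots[b->tail & (LOG_CAPACITY - 1u)];
    b->tail++;
    return 1;
}
```

Giving each priority level its own `log_buffer_t` is one way to let less-critical data overflow while the important buffer is uploaded first.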
Given the idiosyncrasies of various real-time applications, you'll want to test different collection schemes to determine which combination achieves the most visibility with the least intrusion. For example, if you have spare memory you can increase buffer sizes. What's best depends upon what and how much data you're collecting.
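To make the sampled-data idea from the list above concrete, here's a rough sketch of a periodic profiler. It assumes a timer interrupt fires at a fixed rate and can tell which function it interrupted; how you obtain that function ID (a variable the application updates, or a lookup on the interrupted program counter) is left open, and the IDs shown are hypothetical. Because the sample period is fixed, no timestamps are stored at all.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function IDs for the code being profiled. */
enum { FN_IDLE, FN_DECODE, FN_MIX, FN_COUNT };

static uint32_t sample_hist[FN_COUNT];  /* one counter per function */
static uint32_t sample_total;           /* all samples taken so far */

/* Called from a periodic timer interrupt with the ID of the function that
 * was executing when the interrupt fired. */
static void profiler_tick(unsigned fn_id)
{
    assert(fn_id < (unsigned)FN_COUNT);
    sample_hist[fn_id]++;
    sample_total++;
}

/* Share of samples that landed in fn_id, in whole percent (rounded down). */
static unsigned profiler_percent(unsigned fn_id)
{
    assert(fn_id < (unsigned)FN_COUNT);
    if (sample_total == 0)
        return 0;
    return (unsigned)((uint64_t)sample_hist[fn_id] * 100u / sample_total);
}
```

The whole data set is one small histogram, so there's nothing to upload until you choose to read it, and the sample rate is the only knob that trades accuracy against intrusiveness.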
In some cases instrumenting code is either still too intrusive or not accurate enough, or is simply unable to gather the information required to understand the dynamics of the dataflow through a complex SoC. Increasingly, SoC architectures include features that assist in monitoring the operation of the device in hardware to meet these needs.
- Event counters: Many subtle details aren't detectable when you rely on software to monitor an event. For example, counting the number of times a particular CPU core stalled while waiting for access to a shared resource (such as external memory) just doesn't work in software. Hardware designs that include a few well-placed counters can provide valuable insight into the dynamics of your system at little additional cost. They can be read out through the debugger's JTAG interface, or they can be read periodically by, for example, a background task in the software and written to a buffer for interrogation at a later time.
- High-watermark counters: Frequently developers need to understand the worst-case extremes the device is operating under, such as the maximum amount of time it took to service an interrupt or the minimum and maximum jitter in an input. High-watermark counters provide hardware that can be configured to monitor specific bus events and latch in the maximum (high-watermark) or minimum (low-watermark) characteristics of the event. They can provide valuable statistics without a lot of the overhead that would otherwise be required to implement them either in target software, or to collect the data and move it off chip for post-processing.
- Trace: A more expensive, but extremely valuable, form of hardware-assisted monitoring is trace, where bus transactions are logged to a dedicated piece of on-chip memory so that the last N bus transactions leading up to an event can be captured.
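If your device has event counters but no watermark hardware, a background task can approximate the same statistics from periodic counter reads. The sketch below is a software model of that idea, not a description of any particular device: it assumes a free-running 32-bit counter, and the `watermark_t` type and `watermark_update()` function are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t last;   /* previous raw counter reading */
    uint32_t low;    /* smallest per-interval count seen (low watermark) */
    uint32_t high;   /* largest per-interval count seen (high watermark) */
    int primed;      /* 0 until the first reading has been taken */
} watermark_t;

/* Feed one raw reading of a free-running 32-bit event counter. */
static void watermark_update(watermark_t *w, uint32_t raw)
{
    assert(w != NULL);
    if (!w->primed) {                /* first reading only sets the baseline */
        w->last = raw;
        w->low = UINT32_MAX;
        w->high = 0;
        w->primed = 1;
        return;
    }
    uint32_t delta = raw - w->last;  /* modulo 2^32, so one wrap is harmless */
    w->last = raw;
    if (delta < w->low)
        w->low = delta;
    if (delta > w->high)
        w->high = delta;
}
```

Unlike a true hardware watermark, this only sees the count per polling interval, so a short burst inside one interval is averaged away; that's the price of doing it in software.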
Uploading captured data
Typically you'll upload data to a development system (such as a PC) or to a monitoring module for further analysis. Once you've figured out what debug information you need to collect and how to collect it as unobtrusively as possible, you have to figure out how to get the data off the chip, ideally while the application is still running.
One of the key tradeoffs you can make is between buffer depth and frequency of uploading. The smaller your debug data buffers, the more frequently you have to upload data. Frequent uploading can have a sustained effect on system performance. If you have a large memory pool available for buffering debug data, you can collect data with fewer consequences for system performance. However, larger buffers require more target memory and uploading them while the device is running has a more pronounced effect on system performance.
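A little arithmetic makes this tradeoff easier to reason about. The helper functions below are a sketch with made-up names; they estimate how long a buffer lasts at a given logging rate and how much upload bandwidth the log stream needs on average.

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds until a buffer of buffer_bytes fills when records of
 * record_bytes arrive at records_per_sec. This is the longest you can wait
 * between uploads without losing data. */
static uint32_t fill_time_ms(uint32_t buffer_bytes, uint32_t record_bytes,
                             uint32_t records_per_sec)
{
    assert(record_bytes > 0 && records_per_sec > 0);
    uint64_t records = buffer_bytes / record_bytes;   /* whole records that fit */
    return (uint32_t)(records * 1000u / records_per_sec);
}

/* Average upload bandwidth the log stream needs, in bytes per second. */
static uint32_t upload_bytes_per_sec(uint32_t record_bytes,
                                     uint32_t records_per_sec)
{
    return record_bytes * records_per_sec;
}
```

For example, 8-byte records logged 10,000 times per second fill a 4-KB buffer in about 51 ms and a 64-KB buffer in about 819 ms, while the average upload load stays at 80,000 bytes per second either way. A deeper buffer changes how often you upload, not how much.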
In the field, it may be useful to be able to upload the data over a system port such as a serial bus, TCP/IP connection, or USB port. The additional data throughput consumes bandwidth over that port, however, as well as processor cycles to handle the protocols and data transmission.
When you're collecting more data than you can stream off the chip in real time, you'll inevitably introduce gaps in the captured data. In these cases, it's necessary to periodically insert enough context information to ensure that the data can be successfully decoded after it has finally been captured off-chip. Packetizing the data or introducing periodic “sync points” are two ways to provide this extra information in the data stream. You can do this as part of the data-upload process so that the redundant information doesn't have to be stored on-chip.
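One possible shape for such a packet is sketched below. The layout is an assumption made for illustration, not a standard format: each packet carries a sequence number and an absolute base timestamp (the sync point), while the records inside carry only small time deltas. The host-side decoder uses the sequence number to detect gaps and restarts its clock from each packet's base timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RECORDS_PER_PACKET 4u

typedef struct {
    uint16_t seq;          /* increments per packet; a jump reveals a gap */
    uint32_t base_time;    /* sync point: absolute time at the packet start */
    uint16_t delta[RECORDS_PER_PACKET];  /* ticks since the previous record */
    uint16_t value[RECORDS_PER_PACKET];
} packet_t;

typedef struct {
    uint16_t next_seq;     /* sequence number expected next */
    uint32_t lost_packets; /* packets missing from the stream so far */
    int started;           /* 0 until the first packet has been decoded */
} decoder_t;

/* Decode one packet: writes the absolute time of each record to times[] and
 * returns how many packets were missing immediately before this one. */
static uint16_t decode_packet(decoder_t *d, const packet_t *p,
                              uint32_t times[RECORDS_PER_PACKET])
{
    assert(d != NULL && p != NULL && times != NULL);
    uint16_t missing = 0;
    if (d->started)
        missing = (uint16_t)(p->seq - d->next_seq);   /* modulo-65536 gap */
    d->started = 1;
    d->next_seq = (uint16_t)(p->seq + 1u);
    d->lost_packets += missing;

    uint32_t t = p->base_time;   /* restart from the sync point after any gap */
    for (unsigned i = 0; i < RECORDS_PER_PACKET; i++) {
        t += p->delta[i];
        times[i] = t;
    }
    return missing;
}
```

The header can be added by the upload code as each packet leaves the chip, so the buffers themselves hold only the compact delta records.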
When multiple CPU cores are working together in an SoC, it's often necessary to upload the information captured for each core in parallel so that a complete picture of the system can be assembled. If multiple upload paths aren't available, you'll either need to combine the data from multiple cores into a single buffer before uploading or multiplex it in some way to share the upload path. Once again, the dynamics of the system and the relative importance of the data need to be taken into account when deciding the best approach to dealing with this problem. If there's a lot of relatively unimportant data coming from one core and important information is coming infrequently from another core, you'll need a means of ensuring that the important information takes precedence over the unimportant information.
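Here's a minimal sketch of one way to share a single upload path, assuming each core's buffer has been given a priority. The types and function names are hypothetical, and the "upload" is reduced to decrementing a pending count so the scheduling logic stands out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t core_id;
    uint8_t priority;    /* larger value = more important */
    uint32_t pending;    /* records waiting to be uploaded */
} core_queue_t;

/* Pick the queue to service next: highest priority among non-empty queues,
 * lowest index on ties. Returns -1 when nothing is pending. */
static int select_queue(const core_queue_t *q, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (q[i].pending == 0)
            continue;
        if (best < 0 || q[i].priority > q[best].priority)
            best = (int)i;
    }
    return best;
}

/* Upload up to budget records in total; sent[i] receives the per-queue count. */
static void upload_slice(core_queue_t *q, size_t n, uint32_t budget,
                         uint32_t *sent)
{
    assert(q != NULL && sent != NULL);
    for (size_t i = 0; i < n; i++)
        sent[i] = 0;
    while (budget > 0) {
        int i = select_queue(q, n);
        if (i < 0)
            break;
        q[i].pending--;     /* stands in for pushing one record out the port */
        sent[i]++;
        budget--;
    }
}
```

Strict priority like this can starve the low-priority core entirely when the important one is busy, which may be exactly what you want for debug data, but a weighted scheme is worth considering if you need at least some data from every core.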
Analysis and visualization
Converting all the raw data generated by an SoC device into a format that's simple to understand offers its own set of challenges. The variety of data types that can be collected, the hardware-specific mechanisms necessary to collect them, and the different kinds of application-specific problems users could be trying to solve make creating a single all-purpose analysis and visualization tool very challenging. Modular tools that can be customized to work with specific hardware, software, and problem domains are often the best approach for providing the flexibility needed to overcome these challenges. Using a modular framework makes it easier to correlate data from multiple data streams, analyze the correlated data for specific types of information, and present the information that has been pulled out of the data in an easy-to-understand display. Some examples of the types of capabilities such a framework should offer are outlined here:
- Correlating data points: When addressing system-level issues such as bottlenecks, contentions, or load balancing in a multiprocessor SoC, you may need to collect data from multiple processors and accelerators. In this case, reconstructing system behavior requires correlating multiple logs to a single time line. On some systems the availability of a system clock can make this easy. On others, you may be able to leverage a clock by accessing it from other cores through system-level resources such as a DMA. If a common clock isn't practical then other mechanisms can be used to periodically synchronize the time of multiple cores. One way of doing this is to use interrupts to pass a synchronizing timestamp via shared memory. Ideally, the visualization tool can adapt to these different methods of log correlation.
- Analysis infrastructure: A modular framework enables you to implement common analysis activities as modules that can be used to implement a number of different analysis and visualization tools, for example:
• A generic customizable data translator and table view can easily be used to create a message log viewer
• An analysis module that's able to correlate data streams based on an event or timestamp can be used as a building block that constructs data feeds for other analysis modules
• A module that analyzes high-water marks over time can provide the basis of an application-specific dashboard, bandwidth-utilization monitor, and so forth
- Extensibility: Although a lot of collected data can be evaluated with generic components, it's nice to be able to create custom components to extend the tools' environment.
- Configurability: Visualization tools are critical for extracting meaning from large buffer uploads, and developers need to be able to configure tools to highlight the specific discrepancies and peaks of data to pinpoint both abnormal and typical behavior. In order to offload processing of data from the target, the tools should provide a programmable foundation to allow building application intelligence into the tool and reduce the amount of data that needs to be collected. They should also provide enough control to specify what data is to be collected at any given time.
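The log-correlation step described under "Correlating data points" can be sketched on the host side as follows. This assumes each core timestamps events with its own clock, that the clocks tick at the same rate, and that one synchronizing event was stamped by both the core's clock and the common clock; the type and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t time;      /* core-local on input, common time line after rebasing */
    uint8_t core_id;
    uint16_t event;
} event_t;

/* Shift a core's log onto the common time line. sync_local and sync_common
 * are the two clock readings taken for the same synchronizing event. */
static void rebase_log(event_t *log, size_t n, uint32_t sync_local,
                       uint32_t sync_common)
{
    uint32_t offset = sync_common - sync_local;  /* modulo math covers either sign */
    for (size_t i = 0; i < n; i++)
        log[i].time += offset;
}

/* Merge two time-ordered logs into out[], which must hold na + nb events. */
static size_t merge_logs(const event_t *a, size_t na,
                         const event_t *b, size_t nb, event_t *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i].time <= b[j].time) ? a[i++] : b[j++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

If the clocks drift relative to one another, a single offset isn't enough; periodic sync points would let the tool recompute the offset for each stretch of the log.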
The challenge of achieving visibility into real-time SoC systems is certainly not trivial. Collecting enough information to produce meaningful results without skewing those results requires a system-level approach. By using software instrumentation libraries, taking advantage of hardware-assisted monitoring, and managing how the data is moved off-chip, developers can collect more information less intrusively, increasing the accuracy, width, depth, and granularity of the data collected. New flexible tool suites and software development strategies will help today's developers meet the challenge of testing and debugging complex SoC architectures for real-time applications with accuracy and confidence.
Brian Cruickshank is a software architect for Texas Instruments. He has a Bachelor of electrical engineering from McMaster University, 12 patents, and over 20 years of software, hardware, and systems development experience, specializing in digital signal processing applications. He has worked at Texas Instruments for three years as a member of the Code Composer Studio development team and is currently focusing on software development tools for heterogeneous devices. You can reach him at firstname.lastname@example.org.
Imtaz Ali is an engineering manager for Code Composer Studio-related products at Texas Instruments. After graduating from the University of Toronto with a degree in electrical engineering, he began his career as a software designer in Nortel Networks' PBX division and then its Voice Messaging division. He joined Go-DSP in its early stages and was instrumental in the success of Code Composer. Since Texas Instruments' acquisition of Go-DSP in 1998, Imtaz has played a key role in directing the evolution of Code Composer Studio.