Non-intrusive debug and performance optimization for multicore systems


This Multicore Expo 2011 paper accompanies the class “ME862: Non-intrusive Debug and Performance Optimization for Multicore Systems” to be held on May 3, 2011 in San Jose, CA.

Efficiently partitioned and bug-free software running on multiple cores is crucial for taking full advantage of multicore systems. Debugging such complex software systems is more intricate than debugging single-core software, because sub-system interfaces and buses are becoming less accessible and because processing is concurrent.

With this in mind, developers need to have a deeper understanding of on-chip debug technologies, tools and techniques that can accelerate debug cycles and overcome constraints for such complex multicore systems.

In traditional single-core systems, debugging and performance optimization have been core-centric, primarily involving debugging at the core level and optimizing algorithms. In multicore systems, these traditional techniques do not scale well, and the system-on-chip (SoC) architecture adds further complexity. As SoCs grow more complex, gaining the visibility needed to analyze and optimize internal bus transactions remains challenging.

Traditionally, a programmer could write special routines and instrument code to get more run-time visibility, but these techniques are tedious and intrusive and can change application behavior. Their success varies widely, and they can delay development if developers lack a strong system-level understanding or familiarity with the overall codebase. This can significantly impact time to market, and the perceived multicore advantage can turn into a disadvantage.

To expedite debug cycles and eliminate intrusiveness, core trace has been one of the most widely used technologies so far. Core trace provides core-level execution visibility without any code instrumentation.

However, core trace lacks a system-wide or SoC-wide view. For system-level visibility, system trace (STM) technology is gaining momentum. It provides indispensable capabilities by collecting instrumentation traces from the cores and combining them with hardware monitoring events from outside the processor.

Today, semiconductor companies such as Texas Instruments offer devices with STM capabilities, including a hardware-assisted software instrumentation mechanism and hardware events that help analyze bus transactions, power and clock transitions, and on-chip performance.

On-chip performance monitoring is one of STM’s key features. The performance monitors provide non-intrusive visibility into the complex SoC interconnection network, so that developers can understand sustained data bandwidth and latency characteristics and meet real-time performance goals.

Performance monitoring provides metrics such as throughput, latency, arbitration conflicts and data transaction link occupancy on various data flows. This kind of information is critical for resolving multicore system-level issues.

The performance monitors are implemented as part of the Open Core Protocol (OCP) L3 interconnect infrastructure and can probe links and record bus events. These events are counted within a user-defined sampling window, and the results are periodically reported via an STM channel and recorded by a trace receiver. Performance monitoring also provides filtering and triggering capabilities to focus on specific initiators and targets.
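The windowed counting and filtering described above can be illustrated with a small software model. This is only a sketch of the idea, not the hardware's actual behavior or any TI API: the event fields, the `monitor` function, and the initiator and target names are all hypothetical, and the trace data is made up for illustration.

```python
from collections import namedtuple

# Hypothetical bus event: cycle timestamp, issuing initiator, receiving target, payload size.
BusEvent = namedtuple("BusEvent", "cycle initiator target payload_bytes")
# Hypothetical per-window report, standing in for what would be sent over an STM channel.
WindowReport = namedtuple("WindowReport", "window_index packets payload_bytes")


def monitor(events, window_cycles, initiator=None, target=None):
    """Count matching events per fixed-length sampling window.

    Returns one WindowReport per window, from window 0 up to the window
    containing the last event, including windows with no matching traffic.
    """
    if window_cycles <= 0:
        raise ValueError("window_cycles must be positive")
    events = list(events)
    if not events:
        return []
    last_window = max(e.cycle for e in events) // window_cycles
    packets = [0] * (last_window + 1)
    payload = [0] * (last_window + 1)
    for e in events:
        # Filtering: skip traffic from initiators or to targets we are not probing.
        if initiator is not None and e.initiator != initiator:
            continue
        if target is not None and e.target != target:
            continue
        w = e.cycle // window_cycles
        packets[w] += 1
        payload[w] += e.payload_bytes
    return [WindowReport(i, packets[i], payload[i]) for i in range(last_window + 1)]


# Made-up trace: an imaging subsystem and a DSP sharing an external memory interface.
trace = [
    BusEvent(10, "imaging", "emif", 64),
    BusEvent(40, "dsp", "emif", 32),
    BusEvent(90, "imaging", "emif", 128),
    BusEvent(130, "imaging", "emif", 64),
    BusEvent(260, "imaging", "sram", 16),
]
reports = monitor(trace, window_cycles=100, initiator="imaging", target="emif")
for r in reports:
    print(r)
```

In this model, narrowing the filter to one initiator and one target plays the role of focusing a probe on a single data flow. Windows with no matching traffic still produce a report with zero counts.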

A typical SoC video application consists of several processing engines communicating with external shared random access memory (RAM). The processing engines can be thought of as independent initiators and the RAM as the target. To locate bottlenecks, the performance monitors can profile the traffic that each processing engine generates on the RAM.

Figures 1 and 2 below are performance monitoring examples of video being captured through a camera on a device with STM performance monitoring capabilities. In both examples, performance is evaluated by measuring the average burst length and throughput that an imaging subsystem produces on the external memory interface (EMIF).

In these cases, the on-chip performance monitors snoop activity through probes connected to OCP signals and calculate the EMIF load by computing traffic statistics within a sampling window. The probes detect OCP events for payload and word transfers and increment internal counters. The counter values are periodically reported to the user through the STM interface.

These counter values are used to derive information such as average bytes per packet and throughput per sampling window. Generally, such a performance measurement exercise takes just a few minutes, versus a few hours or days when using software instrumentation or other traditional approaches.
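The arithmetic behind these derived metrics is simple. The sketch below assumes each report carries a packet count and a byte count for one sampling window; the function name, the report layout and the sample values are hypothetical and are not measurements taken from the figures. The optional window length in microseconds is likewise an assumption, included only to show how a per-window byte count could be converted to a rate.

```python
def window_metrics(packet_count, byte_count, window_us=None):
    """Derive per-window metrics from two raw counter values.

    packet_count: packets observed in the sampling window
    byte_count:   payload bytes observed in the same window
    window_us:    optional window length in microseconds (assumed known)
    """
    # Guard against an idle window, where no packets were seen.
    avg = byte_count / packet_count if packet_count else 0.0
    metrics = {
        "avg_bytes_per_packet": avg,
        "bytes_per_window": byte_count,
    }
    if window_us is not None:
        # Convert bytes per window to bytes per second.
        metrics["bytes_per_second"] = byte_count * 1_000_000 / window_us
    return metrics


# Illustrative (packets, bytes) counter pairs, one per sampling window.
samples = [(1, 64), (50, 6400), (10, 440), (0, 0)]
results = [window_metrics(p, b, window_us=1000) for p, b in samples]
for m in results:
    print(m)
```

Plotting `avg_bytes_per_packet` or `bytes_per_window` against the window index would give trend graphs of the same kind as the two figures discussed here.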

The graph shown in Figure 1 below plots average bytes per packet (X-axis) per sampling period (Y-axis). The measurements show that the average varies between 44 and 128 bytes per packet per sampling window within a video frame.

Figure 1: Average packet bytes per sampling window on EMIF.

Figure 2 below shows the average throughput in bytes (X-axis) per sampling period (Y-axis), representing the EMIF throughput trend for the video capture. The measurements show that average throughput varies between 64 bytes and 6400 bytes per sampling period within a video frame.

Figure 2: Average throughput bytes per sampling window on EMIF.

These examples show how STM performance monitors improve system-level visibility and analysis without any code changes. This information can be very useful for developers investigating performance bottlenecks or throughput-related issues.

In the absence of such visibility, developers have to rely on custom solutions and their own experience to understand such issues. That approach can take days or even weeks before the issue is understood and resolved. With performance monitoring, an adequate level of information and visibility is available in minutes, which shortens time to market.

By gaining a deeper understanding of debug and performance optimization techniques for multicore systems, such as core trace and STM, developers can accelerate debug cycles, shorten time to market and overcome obstacles.

Vikas Varshney manages the emulation debug and trace tools group at Texas Instruments and is responsible for creating next generation XDS class emulator products and associated debug and trace tooling for TI chipsets. He has several years of experience in the field of embedded debug tools and has been part of a number of industry leading debug and trace tools for embedded systems.

To learn more about this topic, register for the Spring ESC, May 2-4, sign up for the Multicore Expo and attend Vikas Varshney’s class on this topic.
