Use virtual prototypes to model multiprocessor system power needs

Portable and embedded products that consume less power have a very significant advantage in today's extremely competitive markets. Each generation of product planning must satisfy substantial increases in functionality and performance plus substantial reductions in power consumption. This is particularly true in the case of battery-powered embedded and portable (often wireless) consumer electronics systems.

These portable products are becoming physically smaller with each new generation, yet consumers have grown to expect more and better functionality (which requires increased processing capability and performance) and to demand longer battery life. In addition to actually making telephone calls, for example, a modern cell phone may include features such as the ability to act as a personal organizer; play games; take, transmit, and receive still pictures and/or short videos; browse the internet; and so forth.

In the past, the focus of next-generation product planning has been concentrated largely on the micro-architecture of the underlying microprocessing units. However, the improvement of the processor micro-architecture typically yields only second- or third-order effects with regard to improving performance.

By comparison, the overall hardware (platform) architecture and the architecture and algorithmic content of the software that runs on it both have first-order effects at the system level.

Creating optimal low-power designs requires making sophisticated tradeoffs in the hardware architecture, the software architecture, and the underlying software algorithms. Successful power-sensitive design requires system architects and engineers (both hardware and software) to perform and quantify such tradeoffs accurately and efficiently. To do so, they need to be able to access and analyze power data early in the design process.

Virtual prototyping for power management
Virtual prototypes (VPs) and virtual system prototypes (VSPs) can be a powerful and effective means to model, analyze, and optimize real-time system power requirements and trade-offs. Using such tools, very precise and incremental changes can be made in the hardware architecture and software algorithms which significantly affect the power consumption of a system.

But characteristics such as performance and power for a complex system such as a cell phone, including its software, cannot be represented and computed as a formal mathematical problem. The only realistic solution for determining such characteristics is some form of simulation.

One option for this simulation is hardware acceleration and/or emulation. Unfortunately, in addition to providing only limited visibility into the inner workings of the system, the highest level of abstraction supported by these solutions is the register transfer level (RTL).

As a result, development and evaluation cannot commence until late in the design cycle, when the hardware portion of the design is largely complete. In turn, this limits the design team's ability to explore, evaluate, and optimize the hardware architecture. In addition, FPGA implementations of processors typically are slow, executing software at around 1 MIPS, about 50 times slower than a virtual processor model of the same processor.

The related concepts of virtual prototypes (VPs) and virtual system prototypes (VSPs) provide a solution. A VP is a functionality-accurate and timing-accurate software model of the hardware portions of an electronic system. Such a model will typically include processor cores, memory subsystems, peripherals, buses, bridges, mechanical and RF devices, and so on. By comparison, a VSP is a model of the entire system: that is, the combination of the VP and the software that will run on it.

Fully evaluating the characteristics of a complex system may require performing many hundreds of experiments on various system configurations. Furthermore, it is not unusual for a single simulation to require that 100 billion instructions be run to reproduce a problem or to compute a representative result. This represents less than one hour of simulation time using a high-performance, timing-accurate VSP.

By comparison, the same software simulation would take between 100 and 500 hours or more using a typical timing-accurate structural instruction set simulator (ISS) model, and 100,000 hours or more using an RTL model.
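The throughput implied by these run times can be checked with a few lines of arithmetic. The sketch below is not taken from the study; it simply converts the quoted run times for a 100-billion-instruction workload into instructions per second, and the helper name implied_rate is our own.

```python
# Back-of-the-envelope check of the simulation times quoted above.
# implied_rate() is an illustrative helper, not part of any VSP tool.

INSTRUCTIONS = 100e9  # 100 billion instructions per simulation run


def implied_rate(hours: float) -> float:
    """Instructions per second needed to finish the run in `hours`."""
    return INSTRUCTIONS / (hours * 3600.0)


if __name__ == "__main__":
    for label, hours in [
        ("VSP (one hour, the upper bound)", 1),
        ("ISS, 100 hours", 100),
        ("ISS, 500 hours", 500),
        ("RTL, 100,000 hours", 100_000),
    ]:
        print(f"{label:34s} {implied_rate(hours):>14,.0f} instructions/s")
```

Finishing in under an hour implies a rate of at least about 28 million instructions per second, which is consistent with the earlier statement that a virtual processor model runs roughly 50 times faster than a 1 MIPS FPGA implementation. The ISS figures work out to roughly 56,000 to 278,000 instructions per second, and the RTL figure to fewer than 300.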

A key advantage of using a VSP is that the hardware and software portions of the system can be developed and evaluated concurrently. A VSP allows different hardware architectures to be quickly and easily tested and analyzed under real software workloads. While hard real-time software code is being developed, its execution yields trace data (from probes inserted into the models) from which performance data (timing, reaction times, latency times, etc.) and power data can be derived alongside normal debug data.

Using VSP for cell phone power modeling
To get a good sense of how the VSP approach works, let's look at how it would be used in a cell phone system (Figure 1, below) consisting of two ARM926E processors, a StarCore SC1200 processor, hierarchical bus and memory subsystems, and a variety of peripherals.

Figure 1: 3G cell phone controller VSP

There are many techniques for constructing objective functions for system attributes such as power. The classical technique is to track event frequencies and/or latencies and to construct the power function based on events that contribute significantly to the computation of power. By comparison, a VSP enables power analysis to be made in the context of alternative hardware and software architectures running real software workloads.

The first step with the VSP is to assign “weights” for each class of function that contributes to the system's overall power consumption. As a simple example, we could start by assigning a default weight of 1.0 to a generic register file access, and then express the other weights as multiples of this default weight. Consider some of the weights that might be associated with the ARM926E CPU and its cache and memory accesses, as shown in Table 1, below.

As seen here, the weights (W) for each of these function classes have been set to constant multiples of the generic register access function. However, these weights may be represented by more complex functions; for example, the cache hit/miss weights could each be a function of the cache structure (size, wayness, policies, etc.).

The next step is to build an interpretation table that defines the component bindings, as shown in Table 2, below. Although these tables are large, the event bindings themselves are simple to implement, since each is typically a pointer to a function and a history buffer of events.
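To make the idea of a binding concrete, here is a minimal sketch of one table row as a function reference plus a bounded history buffer. The class, field, and event names are hypothetical and do not come from any VSP tool; the weights reuse the register access default of 1.0 and the jump and coprocessor values quoted later in this article.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict


@dataclass
class EventBinding:
    """One hypothetical row of an interpretation table."""
    power_fn: Callable[[int], float]  # the "pointer to a function"
    history: Deque[int] = field(default_factory=lambda: deque(maxlen=8))

    def record(self, count: int) -> float:
        """Log an event count and return its weighted power contribution."""
        self.history.append(count)
        return self.power_fn(count)


def make_table() -> Dict[str, EventBinding]:
    # Weights are multiples of a generic register file access (1.0);
    # 2.0 and 12.0 are the jump and coprocessor weights used in the article.
    return {
        "reg_access": EventBinding(lambda n: 1.0 * n),
        "jump": EventBinding(lambda n: 2.0 * n),
        "coproc": EventBinding(lambda n: 12.0 * n),
    }


def account(table: Dict[str, EventBinding], events: Dict[str, int]) -> float:
    """Dispatch each observed event count through its binding and sum the results."""
    return sum(table[name].record(count) for name, count in events.items())
```

Because each binding owns its own function, a constant weight can later be swapped for a history-dependent one (reading self.history) without touching the rest of the table.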

In the case of power calculations, the basic function to be computed is that of instant power, which calculates the total energy consumed over some period of time or some number of events (such as clock cycles). Based on this instant power, the two main derived functions that are of interest for optimization purposes are:

1) The maximum power consumed over a particular period of time (this will be the maximum of the instant powers); and,
2) The average power consumed over the course of an entire experiment.
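Both derived functions reduce to simple operations over the series of instant-power samples. The following is a minimal sketch that assumes every sample covers a window of equal length (for example, the same number of clock cycles); the function names are our own.

```python
from typing import Sequence


def peak_power(instant: Sequence[float]) -> float:
    """Maximum power over the period: the largest instant-power sample."""
    if not instant:
        raise ValueError("need at least one instant-power sample")
    return max(instant)


def average_power(instant: Sequence[float]) -> float:
    """Average power over the experiment, assuming equal-length windows."""
    if not instant:
        raise ValueError("need at least one instant-power sample")
    return sum(instant) / len(instant)
```

If the windows differ in length, the average would need to weight each sample by its window duration instead.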

If we assume that only one of the ARM926E processors is enabled (to match the experiments presented in the next section), then a simplified accumulating function used to compute the instant power per k cycles is as follows:

fPower = (WInst x fInst) + (WPipe x fPipe) + (WCache x fCache) + …


fInst = (WiJmp x NiJmp) + (WiArith x NiArith) + (WiCoproc x NiCoproc) + …

Here Nx means “the number of instructions of type x executed over the course of k cycles,” and WiJmp, WiArith, WiCoproc, etc. are the weights associated with the function types and events from Table 1. Thus:

fInst = (2.0 x NiJmp) + (1.0 x NiArith) + (12.0 x NiCoproc) + …

Similar sub-functions occur for fPipe, fCache, and so forth. Note that the weights for each of the classes of functions contributing to the main accumulating function fPower (that is, WInst, WPipe, WCache, etc.) may be assumed here to be a constant value of 1.0.
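The accumulating function translates directly into code. The sketch below uses the three instruction weights given above (2.0, 1.0, and 12.0) and defaults every class weight to 1.0, as the text assumes; the dictionary keys and function names are illustrative, and any pipeline or cache sub-totals passed in would come from their own sub-functions.

```python
from typing import Mapping, Optional

# Instruction-class weights from the fInst expression above,
# expressed relative to a generic register access of 1.0.
INST_WEIGHTS = {"jmp": 2.0, "arith": 1.0, "coproc": 12.0}


def f_inst(counts: Mapping[str, int]) -> float:
    """Weighted sum of instruction counts observed over k cycles."""
    return sum(INST_WEIGHTS[kind] * n for kind, n in counts.items())


def f_power(sub_totals: Mapping[str, float],
            class_weights: Optional[Mapping[str, float]] = None) -> float:
    """Combine sub-function totals (inst, pipe, cache, ...) into instant power.

    Class weights default to 1.0, matching the simplification in the text.
    """
    weights = class_weights or {}
    return sum(weights.get(name, 1.0) * value
               for name, value in sub_totals.items())
```

For example, 10 jumps, 100 arithmetic instructions, and 2 coprocessor instructions in one window give f_inst = 20 + 100 + 24 = 144.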

In more sophisticated studies, however, these weights might be replaced with more complex functions relevant to computing power in ways not considered for the purposes of the simple examples presented in this paper. For example, such functions might include history-dependent and implementation-dependent attributes.

Experimental Results
For the purposes of these simple example experiments, we created a VSP corresponding to Figure 1 above, but with only one of the ARM processors enabled and its instruction and data busses bridged to a shared memory. (We put the second ARM processor and the StarCore SC1200 processor in reset mode so that they consumed no cycles and no power.)

We then constructed four suites of experiments (58 experiments in all) to investigate the effects of various arrangements of cache, busses, memory hierarchy, and algorithms with regard to power consumption and performance (speed).

With regard to the software workloads used to exercise our experiments, we employed Viterbi and Sieve programs from the Embedded Microprocessor Benchmark Consortium (EEMBC) test suite, a prime number program downloaded from the web, and a Linux boot of MontaVista Linux v2.6.

Viterbi: The results from seven Viterbi-based experiments used for calibration were as expected; an uncached implementation was poor with regard to speed and power. However, when the cache was enabled, even a minimal cache of 1,024 bytes proved sufficient for this algorithm. Furthermore, because the hit rate on the data and instruction caches was better than 99.5%, the cache line size was shown to be immaterial, as were the bus width and memory type (either DDR or SDR).

Linux Boot: The results from nine structural variants of the experimental VSP were computed while booting Linux. These variants were generated as a subset of all possible variants based on different mixtures of cache size (1K, 8K, 32K), cache line size (16B, 32B), memory configured as DDR (first word delayed five cycles, subsequent words available per half cycle) and SDR (first word delayed five cycles and subsequent words available per cycle), and a bus data width of four bytes.
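The memory timing parameters above determine how long a cache line fill takes, which is why the memory type matters once the miss rate is significant. The sketch below is a simplified model under our own assumptions: the first word arrives after the stated five-cycle delay, each subsequent word adds one per-word interval, and nothing else (arbitration, refresh, bus contention) is modeled. The function name is hypothetical.

```python
def line_fill_cycles(line_bytes: int, bus_bytes: int,
                     first_word: float, per_word: float) -> float:
    """Cycles to fetch one cache line over a bus of the given width.

    Assumes the first word arrives after `first_word` cycles and each
    remaining word takes `per_word` cycles.
    """
    if line_bytes % bus_bytes != 0:
        raise ValueError("line size must be a multiple of the bus width")
    words = line_bytes // bus_bytes
    return first_word + (words - 1) * per_word


if __name__ == "__main__":
    for line in (16, 32):
        sdr = line_fill_cycles(line, 4, first_word=5, per_word=1.0)
        ddr = line_fill_cycles(line, 4, first_word=5, per_word=0.5)
        print(f"{line}B line: SDR {sdr} cycles, DDR {ddr} cycles")
```

Under these assumptions a 32-byte line over the four-byte bus takes 12 cycles from SDR and 8.5 cycles from DDR, while a 16-byte line takes 8 and 6.5 cycles respectively.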

As is expected in an environment where the working set size of the target code greatly exceeds the cache size, the impact of the memory hierarchy on power and speed is considerable. With regard to booting Linux in our VSP, setting the ARM926E cache size to 32 Kbytes, the cache line size to 32 bytes, and using DDR memory yielded the minimum power consumption and maximum performance.

However, reducing the cache size to 16 Kbytes adversely impacted both power and performance by only around 1%, while reducing the silicon cost by around 30%. Similarly, further reducing the cache size to 8 Kbytes negatively impacted power and speed by 5-10% while yielding a further 25% reduction in silicon cost.

Alternate Memory Hierarchies: This portion of our experiments was used to investigate the best tradeoff between power consumption, performance, and silicon cost for a controller executing a limited amount of code: a prime number generator using the Sieve of Eratosthenes algorithm. In particular, we wanted to determine the near-minimum cache size that would still yield within 5% of optimum performance and power.
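The "within 5% of optimum" criterion can be expressed as a simple filter over a set of experiment results. The sketch below is our own illustration, not the authors' procedure: the Result type and function name are hypothetical, and the sample values in the usage section are invented placeholders, not measurements from these experiments.

```python
from typing import NamedTuple, Optional, Sequence


class Result(NamedTuple):
    cache_bytes: int
    run_time: float  # lower is better
    power: float     # lower is better


def smallest_within(results: Sequence[Result],
                    tolerance: float = 0.05) -> Optional[Result]:
    """Smallest cache whose run time and power are both within
    `tolerance` of the best values seen across all results."""
    if not results:
        return None
    best_time = min(r.run_time for r in results)
    best_power = min(r.power for r in results)
    candidates = [
        r for r in results
        if r.run_time <= best_time * (1 + tolerance)
        and r.power <= best_power * (1 + tolerance)
    ]
    return min(candidates, key=lambda r: r.cache_bytes) if candidates else None


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT data from the study.
    sweep = [
        Result(512, 180.0, 140.0),
        Result(2048, 102.0, 104.0),
        Result(4096, 100.0, 100.0),
        Result(8192, 100.0, 101.0),
    ]
    print(smallest_within(sweep))
```

Note that the best run time and the best power may come from different configurations, in which case no single result may satisfy both bounds and the function returns None.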

In these experiments, we considered instruction and data cache characteristics of size (0B, 64B, 128B, 256B, 1 KB, 4 KB, 8 KB), cache line size (16B, 32B), wayness (1, 2, 4), cache power rating (3, 4, 5; we varied the relative power consumption depending on size), and memory type (DDR, SDR). The results were as expected except that the transition between a cache size of 64B and 128B was very sharp, and at 128B we essentially achieved full speed. By comparison, in the case of power, the uncached power consumption was 20-35% less than the power consumed by 64B caches, and 200% higher than that consumed by 128B caches.

From this we determined that the effect of installing a small cache in a processor to achieve a four-fold increase in performance has a detrimental effect on power consumption due to the infrastructure required to support the cache. The cost of the cache (and its infrastructure) is also high in terms of silicon real estate. These considerations led to an investigation of alternative memory hierarchies that might achieve a better tradeoff between speed, power, and cost for a controller running a limited amount of code in an embedded application.

Thus, we decided to mimic the relative power consumed by a dedicated external 128B buffer (essentially a small, physically-addressed, direct-mapped, on-chip cache external to the processor). The result was that we could achieve a further 40% saving in power while maintaining optimum speed. This is significantly better than the earlier results because it minimizes power consumption and cost while maximizing performance.

Algorithmic Optimization: Last but certainly not least, our final ten experiments considered the effect of using different versions of an algorithm to optimize the combination of hardware and software for a particular (embedded) application. Since we already had good empirical data for our initial Sieve-based prime number generation algorithm, we acquired another version known as Kazmierczak's algorithm.

This algorithm required a small external 512B buffer to achieve its maximum speed, which was 40% faster than the Sieve-based algorithm while consuming only 15% more power. This provides an alternate solution that is optimized for speed where the tolerance for power and cost is more elastic.

Optimizing the hardware and software components of systems with complex objective functions is non-intuitive. Sophisticated tradeoffs between the hardware architecture and the software and algorithmic loads that are run on the hardware cannot be achieved by intuition or by formal mathematical analysis alone.

In the case of the Sieve and Kazmierczak algorithmic experiments discussed above, for example, it is inconceivable that the optimal hardware-software architectures determined by these experiments could have been determined simply by looking at, or mathematically analyzing, these two algorithms.

However, the use of a high-performance, functionality-accurate and timing-accurate VSP allows objective functions such as power consumption to be determined in the context of alternative hardware and software architectures running real software workloads.

Graham Hellestrand is founder and chief technology and strategy officer for VaST Systems Technology.
