CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Providing memory system and compiler support for MPSoc designs: Part 3
Compiler Support



Embedded.com

Power/Energy Optimization
Power/energy-directed and performance-directed optimizations often conflict as far as an MPSoC-based environment is concerned. For example, the two graphs shown in Figure 9-6 below give the execution cycles and energy consumption behavior when an array-based application (tomcatv from the Spec benchmark suite) is parallelized over a different number of processors (ranging from 1 to 8).

Each bar in these graphs represents a result (for the corresponding loop nest), normalized with respect to the single processor case (that is, the energy consumption or execution cycles when the nest is executed on a single processor).

Figure 9-6 Energy consumption (top) and execution cycles (bottom) for different nests of tomcatv. For each nest, each bar corresponds to a specific number of processors (from one processor, the leftmost bar, to eight processors, the rightmost bar).

From these graphs, one can make several observations. First, energy and performance trends are different from each other. That is, in a given nest, the best processor size from the energy perspective is (in general) different from the best processor size from the execution time perspective.

This is because in many cases increasing the number of processors beyond a certain point may continue to improve performance (execution cycles), but the extra energy to power up the additional processor and its caches may offset any potential energy benefits coming from the reduced execution cycles (as a result of increased parallelism).

Second, for a given nest, increasing the number of processors does not always reduce execution cycles. This occurs because there is an extra performance cost of parallelizing a loop, which includes spawning parallel threads, interprocessor synchronization during loop execution if there are data dependences, and synchronizing the threads after the loop execution. If the loop in question does not have a sufficient number of iterations to justify the use of a certain number of processors, this extra cost can dominate overall performance behavior.

Third, the best number of processors from the energy or performance perspectives depends on the loop nest in question. That is, different nests may require different processor sizes for the best results, and it is not a good idea to use the same number of processors for all the nests in a given application.

Given these observations, one can see that it is not easy to select the ideal processor sizes for each nest in a given code to optimize an objective function, which might have both performance and energy elements. As an example, consider a compilation (parallelization) scenario in which we would like to optimize energy consumption of the application keeping the execution time under a predetermined limit. Important questions in this scenario are (1) whether there are processor sizes for each nest to satisfy this compilation constraint, and (2) if there are multiple solutions, how one can choose the best one.

Most published work on parallelism for high-end machines  is static; that is, the number of processors that execute the code is fixed for the entire execution. For example, if the number of processors that execute a code is fixed at 8, all parts of the code (for example, all loop nests) are executed using the same eight processors.

However, this may not be a very good strategy from the viewpoint of energy consumption. Instead, one can consume much less energy by using the minimum number of processors for each loop nest (and shutting down the caches of the unused processors). In adaptive parallelization, the number of processors is tailored to the specific needs of each code section (e.g., a nested loop in array-intensive applications).

In other words, the number of processors that are active at a given period of time changes dynamically as the program executes. For instance, an adaptive parallelization strategy can use 4, 6, 2, and 8 processors to execute the first four nests in a given code. There are two important issues that need to be addressed in designing an effective adaptive parallelization strategy for an on-chip multiprocessor:

  • Mechanism: How is the number of processors to execute each code section determined? There are at least two ways of determining the number of processors per nest: the dynamic approach and the static approach. The first option is adopting a fully dynamic strategy whereby the number of processors (for each nest) is decided in the course of execution (at run time). Kandemir et al. has presented such a dynamic strategy.

    Although this approach is expected to generate better results once the number of processors has been decided (as it can take run-time code behavior and dynamic resource constraints into account), it may also incur some performance overhead during the process of determining the number of processors. This overhead can, in some cases, offset the potential benefits of adaptive parallelism.

    In the second option, the number of processors for each nest is decided at compile time. This approach has a compile-time overhead, but it does not lead to any run-time penalty. It should be emphasized, however, that although this approach determines the number of processors statically at compile time, the activation/deactivation of processors and their caches occurs dynamically at run time.

  • Policy: What is the basis of the criterion on which we decide the number of processors? An optimizing compiler can target different objective functions such as minimizing execution time of the compiled code, reducing executable size, improving power or energy behavior of the generated code, and so on.

    In addition, it is also possible to adopt complex objective functions and compilation criteria such as "minimize the sum of the cache and main memory energy consumptions while keeping the total execution cycles bounded by M." For example, the framework described in ref. 400 accepts as objective function and constraints any linear combination of energy consumptions and execution cycles of individual hardware components.

Future Research.
Perhaps the most important future research direction in compilation of MPSoC applications will be compiling applications under multiple constraints that involve power, energy, execution cycles, and memory usage. Unfortunately, current compilation techniques (even those developed in the context of high-end multiprocessor systems) do not directly extend to the MPSoC domain since they only focus on performance.

Equally important is studying techniques that produce quick estimation of metrics of interest (e.g., power and memory usage). Optimizing compilers can use such estimates to rank different (and sometimes conflicting) optimizations and to select the best one given the constraints that need to be satisfied.

That is, we envision future MPSoC compilers to have numerous estimation and optimization modules (each targeting at a different metric) and a solver, which selects the best optimization under multiple constraints.

 Conclusion
The significant interest in MPSoC architectures in recent times has caused researchers to study architectural optimizations from a different perspective. With the full advance knowledge of the applications being implemented by the system, many design parameters can be optimized and/or customized. This is especially true of the memory subsystem, in which a vast array of different organizations can be employed for application-specific systems, and the designer is not restricted to the traditional cache hierarchy. The optimal memory architecture for an application-specific system can be significantly different from the typical cache hierarchy of processors.

Although some of the analytical techniques can be automated, much work remains to be performed before a completely "push-button" methodology evolves for application-specific customization of the memory organization in MPSoCs.

To read Part 1, go to Types of Memory Architectures.
To read Part 2, go to  Customization of memory architectures

This series of articles is based on copyrighted material submitted by Nikil Dutt and Mahmut Kandemir to "Multiprocessor Systems-On-Chips"  edited byWayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.

Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of computer science for Embedded Computer Systems at the University of California, Irvine

1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :