Power/Energy Optimization
Power/energy-directed and performance-directed optimizations often
conflict as far as an MPSoC-based environment is concerned. For
example, the two graphs shown in Figure
9-6 below give the execution cycles and energy consumption
behavior when an array-based application (tomcatv from the Spec
benchmark suite) is parallelized over a different number of processors
(ranging from 1 to 8).
Each bar in these graphs represents a result (for the corresponding
loop nest), normalized with respect to the single processor case (that
is, the energy consumption or execution cycles when the nest is
executed on a single processor).
 |
| Figure
9-6 Energy consumption (top) and execution cycles (bottom) for
different nests of tomcatv. For each nest, each bar corresponds to a
specific number of processors (from one processor, the leftmost bar, to
eight processors, the rightmost bar). |
From these graphs, one can make several observations. First, energy
and performance trends are different from each other. That is, in a
given nest, the best processor size from the energy perspective is (in
general) different from the best processor size from the execution time
perspective.
This is because in many cases increasing the number of processors
beyond a certain point may continue to improve performance (execution
cycles), but the extra energy to power up the additional processor and
its caches may offset any potential energy benefits coming from the
reduced execution cycles (as a result of increased parallelism).
Second, for a given nest, increasing the number of processors does
not always reduce execution cycles. This occurs because there is an
extra performance cost of parallelizing a loop, which includes spawning
parallel threads, interprocessor synchronization during loop execution
if there are data dependences, and synchronizing the threads after the
loop execution. If the loop in question does not have a sufficient
number of iterations to justify the use of a certain number of
processors, this extra cost can dominate overall performance behavior.
Third, the best number of processors from the energy or performance
perspectives depends on the loop nest in question. That is, different
nests may require different processor sizes for the best results, and
it is not a good idea to use the same number of processors for all the
nests in a given application.
Given these observations, one can see that it is not easy to select
the ideal processor sizes for each nest in a given code to optimize an
objective function, which might have both performance and energy
elements. As an example, consider a compilation (parallelization)
scenario in which we would like to optimize energy consumption of the
application keeping the execution time under a predetermined limit.
Important questions in this scenario are (1) whether there are
processor sizes for each nest to satisfy this compilation constraint,
and (2) if there are multiple solutions, how one can choose the best
one.
Most published work on parallelism
for high-end machines is
static; that is, the number of processors that execute the code is
fixed for the entire execution. For example, if the number of
processors that execute a code is fixed at 8, all parts of the code
(for example, all loop nests) are executed using the same eight
processors.
However, this may not be a very good strategy from the viewpoint of
energy consumption. Instead, one can consume much less energy by using
the minimum number of processors for each loop nest (and shutting down
the caches of the unused processors). In adaptive parallelization, the
number of processors is tailored to the specific needs of
each code section (e.g., a nested loop in array-intensive applications).
In other words, the number of processors that are active at a given
period of time changes dynamically as the program executes. For
instance, an adaptive parallelization strategy can use 4, 6, 2, and 8
processors to execute the first four nests in a given code. There are
two important issues that need to be addressed in designing an
effective adaptive parallelization strategy for an on-chip
multiprocessor:
- Mechanism: How is the
number of processors to execute each code section determined? There are
at least two ways of determining the number of processors per nest: the
dynamic approach and the static approach. The first option is adopting
a fully dynamic strategy whereby the number of processors (for each
nest) is decided in the course of execution (at run time). Kandemir et
al. has presented such a dynamic strategy.
Although this approach is
expected to generate better results once the number of processors has
been decided (as it can take run-time code behavior and dynamic
resource constraints into account), it may also incur some performance
overhead during the process of determining the number of processors.
This overhead can, in some cases, offset the potential benefits of
adaptive parallelism.
In the second option, the number of processors
for each nest is decided at compile time. This approach has a
compile-time overhead, but it does not lead to any run-time penalty. It
should be emphasized, however, that although this approach determines
the number of processors statically at compile time, the
activation/deactivation of processors and their caches occurs
dynamically at run time.
- Policy: What is the basis
of the criterion on which we decide the number of processors? An
optimizing compiler can target different objective functions such as
minimizing execution time of the compiled code, reducing executable
size, improving power or energy behavior of the generated code, and so
on.
In addition, it is also possible to adopt complex objective
functions and compilation criteria such as "minimize the sum of the
cache and main memory energy consumptions while keeping the total
execution cycles bounded by M." For example, the framework described in
ref. 400 accepts as objective function and constraints any linear
combination of energy consumptions and execution cycles of individual
hardware components.
Future Research.
Perhaps the most important future research direction in compilation of
MPSoC applications will be compiling applications under multiple
constraints that involve power, energy, execution cycles, and memory
usage. Unfortunately, current compilation techniques (even those
developed in the context of high-end multiprocessor systems) do not
directly extend to the MPSoC domain since they only focus on
performance.
Equally important is studying techniques that produce quick
estimation of metrics of interest (e.g., power and memory usage).
Optimizing compilers can use such estimates to rank different (and
sometimes conflicting) optimizations and to select the best one given
the constraints that need to be satisfied.
That is, we envision future
MPSoC compilers to have numerous estimation and optimization modules
(each targeting at a different metric) and a solver, which selects the
best optimization under multiple constraints.
Conclusion
The significant interest in MPSoC architectures in recent times has
caused researchers to study architectural optimizations from a
different perspective. With the full advance knowledge of the
applications being implemented by the system, many design parameters
can be optimized and/or customized. This is especially true of the
memory subsystem, in which a vast array of different organizations can
be employed for application-specific systems, and the designer is not
restricted to the traditional cache hierarchy. The optimal memory
architecture for an application-specific system can be significantly
different from the typical cache hierarchy of processors.
Although some of the analytical techniques can be automated, much
work remains to be performed before a completely "push-button"
methodology evolves for application-specific customization of the
memory organization in MPSoCs.
To read Part 1, go to
Types of Memory Architectures.
To read Part 2, go to
Customization
of memory architectures
This series of
articles is based on copyrighted material submitted by Nikil Dutt and
Mahmut Kandemir to "Multiprocessor
Systems-On-Chips" edited byWayne
Wolf and Ahmed Amine Jerraya. It is used with the permission of the
publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be
purchased on-line.
Mahmut
Kandemir is an assistant professor in the Computer Science and
Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of
computer science for Embedded Computer Systems at the University of
California, Irvine