An optimizing compiler that targets MPSoC
environments should tackle a number of critical issues. Building on
what was learned in Part 1 and Part 2, we
first explain these issues and then study potential solutions. From
the performance viewpoint, perhaps the two most important
memory-related tasks to be performed in an MPSoC environment are
optimizing parallelism and locality. Other important issues relate to
power/energy consumption and memory space.
The
problem with parallelism
Optimizing parallelism is obviously important, since parallelism is the
main reason to employ multiple processors in a single unit. In fact, a
parallelization strategy determines how memory is utilized by multiple
on-chip processors and can be an important factor for achieving an
acceptable performance. However, maximum parallelism may not always be
easy to achieve because of several factors. For example, intrinsic data dependences in
the code may not allow full utilization of all on-chip processors.
Similarly, in some cases, interprocessor communication costs can be
overwhelming as one increases the number of processors used.
Finally, performance benefits due to increased interprocessor
parallelism may not be sufficient when one considers the increase in
power consumption. Because of all these, it may be preferable to avoid
increasing the number of processors arbitrarily. In addition, the
possibility of different parts of the same application demanding
different number of processors can make the problem much harder.
Instruction
and Data Locality
An equally important problem is ensuring locality of data/instruction
accesses. Although achieving acceptable instruction cache performance
is not very difficult (since instructions are read-only and exhibit
perfect spatial locality), the same cannot be said for data locality.
This is because straightforward coding of many applications can lead
to poor data cache utilization. In addition, in an MPSoC environment,
interprocessor communication can lead to frequent cache line
invalidations/updates (due to interprocessor data sharing), which in
turn increases overall latency.
This last issue becomes particularly problematic when false sharing
occurs (i.e., the multiple processors share a cache line but not the
same data in it). Therefore, an important task for the compiler is to
minimize false sharing as much as possible.
Power/Energy
Consumption
Considering the environments for which MPSoC based systems can be
employed, one can see that performance is only one part of the big
picture. Minimizing power/energy consumption can become vital for
system availability. In fact, power/energy consideration is one of the
most important factors that distinguish an MPSoC-based system from its high-end
counterparts. . In
other words, one cannot directly transfer pure performance-oriented
compilation policies from a high-end
parallel computing domain. This is because these
techniques in general favor increasing the number of processors (to be
used for execution of the application) as long as there are additional
performance benefits in doing so.
However, such a strategy may not be preferable in our context since
increasing the number of processors means powering up more processors
along with their local caches (and/or SPMs). Obviously, this can lead
to higher power consumption and a larger overall energy-delay product
value. Therefore, an optimizing compiler should be able to balance the
increase in power consumption and decrease in execution cycles. In
addition, it is well known from high-end parallel computing that
parallelism and data locality can sometimes impose conflicting demands
and require different program/data optimizations.
Adding power/energy minimization as an additional metric makes the
overall optimization problem extremely hard and justifies, in our
opinion, the use of heuristic solutions.
Memory
Space
Yet another memory-related issue is the memory space required by the
application. Since most prior existing locality-oriented studies have
focused on improving memory access patterns, there is no guarantee that
such techniques will reduce space demands. However, reducing space
consumption might be of critical importance as doing so increases the
effectiveness of on-chip memory utilization and can reduce the number
of off-chip references.
Looking for Solutions
Let's start the search with a discussion of several techniques that can
be used for enhancing memory behavior of MPSoC architectures. We focus
on both performance and energy aspects.
Optimizing Parallelism
Parallelism can either be expressed by the programmer at the source
level or be automatically derived by an optimizing
compiler from the
sequential code [397]. In the former case (which is called the
user-assisted method), the compiler's job is relatively easy, and
consists of distributing the computations across the processors (as
indicated by the user) and minimizing interprocessor communication.
This may involve program transformations that reduce the number of
communications (messages) as well as the volume of data communicated at
each message. In the latter case (i.e., when the compiler is to
generate parallel code from a sequential one), the compiler needs to
analyze data dependences and extract available parallelism. After that,
by assessing the volume and intensity of interprocessor communication,
the compiler can come up with a suitable parallelization strategy
(e.g., in a loop nest-based code, it determines which loops should be
run in parallel). After this step, the compilation process proceeds as
in the user-directed (user-assisted) parallelism case.
Although in the high-end computing arena it is generally accepted
that user-assisted parallelism is a more viable approach than compiler-initiated parallelism
[397], the choice is not clear in the context of MPSoC due to
additional constraints of low power and limited memory space. That is,
it can be extremely difficult for the programmer to strike a balance
among performance, power, and memory space consumption for a given
application. What makes it even more difficult is that not all MPSoC
applications are array/loop nest-based, making the task of identifying
parallelism very difficult for the user.
In order for the compiler to parallelize a code given performance
(execution cycles), energy, and memory space consumption constraints,
it needs to carry out two major tasks: (1) estimating performance,
power consumption, and space requirements of a given piece of code; and
(2) optimizing the objective function under multiple constraints (which
might be of performance, power, memory space, or a combination of
these). Based on this observation, there are at least two different
ways of dealing with this MPSoC parallelization problem:
- Static approaches: These
techniques are generally applicable when the code is statically
analyzable. Many applications from embedded image and video processing
areas fall into this category.
In this case, the compiler can analyze the application code and
estimate its execution cycles, power/energy
consumption, and memory space demands. Then it can conduct a similar
estimation process for each potential optimized version, and it selects
the best candidate for the output code. Obviously, an important
question here is how to generate different optimized versions.
Among the techniques proposed so far in literature are heuristic
schemes and
integer linear programming (ILP)-based techniques that make use of
profiling [399]. It is to be noted that the static analyzability of the
code also makes it possible for the compiler to come up with a small
set of candidate optimized versions.
- Dynamic approaches: The
schemes in this category also work with a candidate set of optimized
versions; but it is less clear (compared with the previous case) at
compile time how these candidates would rank with respect to each
other.
In other words, the selection of the best optimized version is
postponed until run time. One approach recently proposed in ref. 400 is
first to execute a portion of the code fragment to be optimized and
then, based on the run-time statistics collected, select the most
suitable parallelization strategy at run time. Clearly, the time (and
energy) spent in determining the best optimized version at run time is
an overhead and needs to be optimized away as much as possible.
<>It should also be emphasized that whether the memory hierarchy in
question is constructed using caches or SPMs can make a difference in
the success of the parallelization strategy chosen. More specifically,
it is easier to estimate the performance and energy consumption with
SPMs (as opposed to caches) since the compiler is in full control of
memory transfers in the former (whereas the second one is managed by
the hardware).