An optimizing compiler that targets MPSoC
environments should tackle a number of critical issues. Building on
what was learned in Part 1 and Part 2, we
first explain these issues and then study potential solutions. From
the performance viewpoint, perhaps the two most important
memory-related tasks to be performed in an MPSoC environment are
optimizing parallelism and locality. Other important issues relate to
power/energy consumption and memory space.
The
problem with parallelism
Optimizing parallelism is obviously important, since parallelism is the
main reason to employ multiple processors in a single unit. In fact, a
parallelization strategy determines how memory is utilized by multiple
on-chip processors and can be an important factor for achieving an
acceptable performance. However, maximum parallelism may not always be
easy to achieve because of several factors. For example, intrinsic data dependences in
the code may not allow full utilization of all on-chip processors.
Similarly, in some cases, interprocessor communication costs can be
overwhelming as one increases the number of processors used.
Finally, performance benefits due to increased interprocessor parallelism may not be sufficient when one considers the increase in power consumption. Because of all these, it may be preferable to avoid increasing the number of processors arbitrarily. In addition, the possibility of different parts of the same application demanding different number of processors can make the problem much harder.
Instruction
and Data Locality
An equally important problem is ensuring locality of data/instruction
accesses. Although achieving acceptable instruction cache performance
is not very difficult (since instructions are read-only and exhibit
perfect spatial locality), the same cannot be said for data locality.
This is because straightforward coding of many applications can lead
to poor data cache utilization. In addition, in an MPSoC environment,
interprocessor communication can lead to frequent cache line
invalidations/updates (due to interprocessor data sharing), which in
turn increases overall latency.
This last issue becomes particularly problematic when false sharing occurs (i.e., the multiple processors share a cache line but not the same data in it). Therefore, an important task for the compiler is to minimize false sharing as much as possible.
Power/Energy
Consumption
Considering the environments for which MPSoC based systems can be
employed, one can see that performance is only one part of the big
picture. Minimizing power/energy consumption can become vital for
system availability. In fact, power/energy consideration is one of the
most important factors that distinguish an MPSoC-based system from its high-end
counterparts. . In
other words, one cannot directly transfer pure performance-oriented
compilation policies from a high-end
parallel computing domain. This is because these
techniques in general favor increasing the number of processors (to be
used for execution of the application) as long as there are additional
performance benefits in doing so.
However, such a strategy may not be preferable in our context since increasing the number of processors means powering up more processors along with their local caches (and/or SPMs). Obviously, this can lead to higher power consumption and a larger overall energy-delay product value. Therefore, an optimizing compiler should be able to balance the increase in power consumption and decrease in execution cycles. In addition, it is well known from high-end parallel computing that parallelism and data locality can sometimes impose conflicting demands and require different program/data optimizations.
Adding power/energy minimization as an additional metric makes the overall optimization problem extremely hard and justifies, in our opinion, the use of heuristic solutions.
Memory
Space
Yet another memory-related issue is the memory space required by the
application. Since most prior existing locality-oriented studies have
focused on improving memory access patterns, there is no guarantee that
such techniques will reduce space demands. However, reducing space
consumption might be of critical importance as doing so increases the
effectiveness of on-chip memory utilization and can reduce the number
of off-chip references.
Looking for Solutions
Let's start the search with a discussion of several techniques that can
be used for enhancing memory behavior of MPSoC architectures. We focus
on both performance and energy aspects.
Optimizing Parallelism
Parallelism can either be expressed by the programmer at the source
level or be automatically derived by an optimizing
compiler from the
sequential code [397]. In the former case (which is called the
user-assisted method), the compiler's job is relatively easy, and
consists of distributing the computations across the processors (as
indicated by the user) and minimizing interprocessor communication.
This may involve program transformations that reduce the number of communications (messages) as well as the volume of data communicated at each message. In the latter case (i.e., when the compiler is to generate parallel code from a sequential one), the compiler needs to analyze data dependences and extract available parallelism. After that, by assessing the volume and intensity of interprocessor communication, the compiler can come up with a suitable parallelization strategy (e.g., in a loop nest-based code, it determines which loops should be run in parallel). After this step, the compilation process proceeds as in the user-directed (user-assisted) parallelism case.
Although in the high-end computing arena it is generally accepted that user-assisted parallelism is a more viable approach than compiler-initiated parallelism [397], the choice is not clear in the context of MPSoC due to additional constraints of low power and limited memory space. That is, it can be extremely difficult for the programmer to strike a balance among performance, power, and memory space consumption for a given application. What makes it even more difficult is that not all MPSoC applications are array/loop nest-based, making the task of identifying parallelism very difficult for the user.
In order for the compiler to parallelize a code given performance (execution cycles), energy, and memory space consumption constraints, it needs to carry out two major tasks: (1) estimating performance, power consumption, and space requirements of a given piece of code; and (2) optimizing the objective function under multiple constraints (which might be of performance, power, memory space, or a combination of these). Based on this observation, there are at least two different ways of dealing with this MPSoC parallelization problem:
In this case, the compiler can analyze the application code and estimate its execution cycles, power/energy consumption, and memory space demands. Then it can conduct a similar estimation process for each potential optimized version, and it selects the best candidate for the output code. Obviously, an important question here is how to generate different optimized versions.
Among the techniques proposed so far in literature are heuristic schemes and integer linear programming (ILP)-based techniques that make use of profiling [399]. It is to be noted that the static analyzability of the code also makes it possible for the compiler to come up with a small set of candidate optimized versions.
In other words, the selection of the best optimized version is postponed until run time. One approach recently proposed in ref. 400 is first to execute a portion of the code fragment to be optimized and then, based on the run-time statistics collected, select the most suitable parallelization strategy at run time. Clearly, the time (and energy) spent in determining the best optimized version at run time is an overhead and needs to be optimized away as much as possible.
If the hierarchy has private L1 caches (one per processor), then one
can make use of the theory developed
in the optimizing compiler
community. Compiler optimizations that
target improving data locality can be broadly divided into two major
categories: (1) optimizations that target the reduction of memory
latency, and (2) optimizations that target the hiding of memory
latency.
In the first group are the techniques that modify program access pattern, memory layout of variables, or both. Loop transformations change the data access pattern exhibited by the loop nests in the application to make them more cache-friendly. Their major drawback is that data dependences in the code may prevent the compiler from using the best transformation from the locality perspective. Among the most popular loop transformations are loop interchange (which changes the order of two loops in the nest), loop fusion (which combines two neighboring loop nests into a single nest), and tiling (which implements a block matrix style of computation automatically).
The drawbacks of loop transformations motivated researchers to
investigate alternative optimizations. One of these is data
space
(memory layout)
transformations, or data
transformations for
short. In applying a data transformation, instead of the program
access pattern, the compiler modifies the underlying memory layouts of
the data structures involved. For multidimensional array structures,
implementing a data transformation involves transforming the array
subscript expressions and modifying the corresponding array
declarations.
Work in this area focuses on pointer-intensive codes and is based on the idea of co-locating data items that are accessed in close proximity in time. It should be noted that it is possible to apply data transformations statically or dynamically. In static transformations, the best layout for a given data structure is determined statically (at compile time), and all references to the old data structure are replaced by the references to the new data structure. Many layout transformations that target temporary data structure variables fall into this category.
In the dynamic transformation case, on the other hand, the layout
transformations take place during the course of execution.
Specifically, the compiler statically determines the layouts of the
data structures of interest at compile time at each program point and
inserts code for explicit layout transformation. However, the actual
layout transformations themselves take place at run time.
For example,
consider the dynamic layout optimization approach.
This
technique extends the static layout optimization techniques to select
dynamically changing layouts to further improve the locality of data
accesses. It divides the job of optimizing locality between compile
time and run time.
More specifically, the layouts of arrays for different regions of code are determined at compile time (e.g., running a static layout optimizer for each loop nest in the code), and the bookkeeping code that is necessary to transform memory layouts dynamically between different program regions is inserted in the code (again at compile time). These bookkeeping codes, however, are executed at run time.
The techniques discussed so far try to improve data locality,
thereby reducing the number of accesses to slower levels in the memory
hierarchy. Software data prefetching, in contrast, tries to bring data
to the fast memory ahead of time (i.e., before the data are actually
needed). If this can be achieved, the data can be accessed from cache
memory when they are needed.
In order to issue the prefetches ahead of time, a compiler transformation known as software pipelining needs to be used. Mowry presents a compiler-directed software prefetching algorithm that consists of the following three steps: (1) determine data references that are likely to be cache misses and therefore need to be prefetched; (2) isolate the predicted cache miss instances through loop splitting [398]; and (3) apply software pipelining and insert explicit prefetch instructions in the code.
However, if the first level of memory in the hierarchy is an SPM (one per processor), the compiler needs to employ a different set of optimizations. This is because, as mentioned earlier, in an SPM-based system, the compiler is in full control of data transfers between SPM and the lower levels of the memory hierarchy. Therefore, it needs to insert in the code explicit data movement instructions at compile time. These instructions, when they are executed at run time, cause data transfers between SPM and other memory components.
Obviously, this is a desired compilation strategy only if one wants to manage SPMs dynamically. On the other hand, it is also possible to fix the contents of the SPMs statically, i.e., at compile time. In this case, the compiler does not insert explicit data movement calls in the code since there are no run-time data movements. As discussed earlier in this series, Panda et al. has presented such a strategy for a single processor embedded architecture. Although it is possible to extend the approach to the MPSoC case, one needs to be careful in selecting the frequently shared data items to be placed in the SPM.
It is also possible to allow multiple processors to share a single
(or a set of) on-chip SPM. A solution along this direction is called
the
virtually shared SPM (VS-SPM).
[ 408]. Here,the
execution model in a VS-SPM-based architecture is as follows. The
system takes as input a parallelized application (user-assisted or
through compiler-based dependence analysis).
In this model, each loop
nest is parallelized as much as possible. In processing a parallel
loop, all processors in the system participate in computation, and each
executes a subset of loop iterations. When the execution of a parallel
loop has completed, the processors synchronize using a special
construct called a barrier before starting the next loop.
The synchronization and communication between processors are maintained using fast on-chip communication links. Based on the parallelization strategy, each processor works on a portion of each array in the code. Since its local SPM space is typically much smaller than the portion of the array it is currently operating on, it divides its portion into chunks (also called data tiles) and operates on one chunk at a time. When a data tile has been used, it is either discarded or written back into off-chip memory (if modified).
In order to improve the reuse of data in the VS-SPM, one can
consider intraprocessor data reuse and interprocessor data reuse.
Intraprocessor data reuse corresponds to optimizing data reuse when
considering the access pattern of each processor in isolation. Previous
work, just discussed above,
addresses this problem.
It should be noted, however, that exploiting
intraprocessor reuse only may not be very effective in a VS-SPM-based
environment. This is because intraprocessor data reuse has a local
processor-centric perspective and does not take interprocessor data
sharing (communication) effects into account. Such effects are
particularly important for applications in which data regions touched
by different processors overlap.
This is very common in many
array-intensive embedded image processing applications. Interprocessor
data reuse, on the other hand, focuses on the problem of optimizing
data accesses considering access patterns of all processors in the
system. In other words, it has an application-centric view of data
accesses. The compilation approach proposed in ref. 408 maximizes
interprocessor data reuse for SPM-based data.
More specifically, the data tile accesses of individual processors are restructured (coordinated) in such a way that whenever a nonlocal array element is required (to perform some computation), it can be found in some remote SPM (instead of going to the off-chip memory). If this can be achieved for all nonlocal accesses, one can eliminate all extra off-chip memory accesses due to interprocessor communication.
From a multiple processor perspective, the locality problem is more
challenging to tackle than the single processor case. The SPM approach
is simple and has been summarized in the previous two paragraphs. If
the processors have conventional L1 caches instead of SPMs, the
situation becomes much more difficult to handle. This is because
on-chip processors can share cache lines in many different ways.
In
true-sharing, processors share the same data item, and the integrity of
the item should be maintained using an on-chip coherence protocol. In
false-sharing, they just share the cache line, not the data item
itself. Note that this case also causes a coherence activity (which
costs extra execution cycles as well as energy), and in the
false-sharing case, this activity is pure overhead that needs to be
minimized. Although we expect that the approaches
used in high-end
computing may be very useful in this context, one also
needs to
consider the energy consumption problem. Therefore, we see energy-aware
coherence protocols/algorithms as an important potential research
direction.
Optimizing Memory Space Utilization
Another important issue in an MPSoC-based environment is the management
of limited memory space. It should be observed that optimizing for data
locality does not mean that the optimized application will execute
using fewer data items. To reduce data space requirements of a given
application, one may need to consider lifetimes of program data
structures/variables. In particular, if the lifetimes of two different
data items do not overlap, they can potentially share the same memory
location, thereby reducing overall memory space demand.
For array-based applications, in a given loop nest, one can define the lifetime of an array element as the difference (in loop iterations) between the time it is first assigned (written) and the time it is last used (read). For a given array index a (which might be multidimensional), the start of its lifetime is referred to as S(a), whereas the end of its lifetime is denoted using E(a) - both in terms of loop iterations. Using these definitions, the lifetime vector for this array element can be given as s = E(a) - S(a), where "-" denotes vector subtraction.
Note that the lifetime of a is expressed as a vector, as in general
there might be multiple loops in the next, and expressing lifetime as a
vector allows the compiler to measure the impact of loop
transformations on it. As an example, if an array element (that is
accessed in a nest with two loops) is produced in iteration (1 3)T and
consumed (i.e., last-read) in iteration (4 5)T, its lifetime
vector is s = (4 5)T - (1 3)T = (3 2)T.
It should be noted that before S(a) and after E(a), the memory location allocated to this array element could be used for storing another array element (which belongs to the same array or to a different array). Obviously, the shorter the difference between E(a) and S(a) the better, as it leaves more room for other array elements.
However, it is to be noted that reducing lifetimes of array elements does not guarantee memory location reuse. This is because in order to have memory location reuse, the lifetimes should be shortened such that the different elements can share the same memory location. As a result, it is important for the compiler to make sure that this happens (when lifetimes are reduced). That is, the compiler can check whether the lifetime reduction will really be beneficial; if not, it may choose not to apply it.
Along these lines, Lefebvre and
Feautrier propose a method for
automatic storage management for loop-based parallel programs with
affine references. They show that parallelization by total data
expansion can be replaced by partial data expansion without distorting
the degree of parallelism.
Grun et al. have presented memory size estimation techniques for multimedia applications. Strout et al. introduce the universal occupancy vector that allows schedule-independent storage space optimization. Their objective is to reduce data space requirements of a given code without preventing subsequent locality optimizations from being performed. Wilde and Rajopadhye have investigated memory reuse by performing static lifetime computations of variables using a polyhedral model. Lim et al. have defined an optimization method called the array contraction.
The work done by Song et al. and Catthoor and his colleagues shows
how loop fusion can be used for minimizing data
space
requirements. Along similar lines, Fraboulet
et al. has also presented
an algorithm to reduce the use of temporary arrays by loop fusion. They
demonstrate that although the complexity of their algorithm is not
polynomial, it is highly efficient in practice. Thies et al. has
described a unified general framework for scheduling and storage
optimization.
One of the problems that they address, namely, "choosing a store for a given schedule" tries to assign storage locations to array elements with storage size reduction in mind. It should also be observed that reducing memory space requirements alone might help improve data locality (cache behavior). This is because in array-intensive applications a significant portion of data cache misses (called capacity misses) occurs because large datasets are accessed in a short period of time (e.g., within a loop). By reducing the number of array elements, one can also improve cache behavior.
We also need to point out an important tradeoff that needs to be
made for an MPSoC-based execution environment. In general, more memory
space means more parallelism, as discussed in studies such as those of
Tu and Padua and Barthou et al.
This is
because a large
percentage of data dependences in a given application are
anti-dependences and output-dependences, which occur mainly due to
reuse of the same memory location for different data items.
By
allocating a separate memory location for each data item, one can
eliminate these dependences. However, as we have discussed above,
reusing the same memory location for different data items is an
important optimization as far as memory space utilization is concerned.
Therefore, an optimizing compiler should be able to resolve this
tradeoff considering the performance and memory space constraints at
the same time.
Power/Energy Optimization
Power/energy-directed and performance-directed optimizations often
conflict as far as an MPSoC-based environment is concerned. For
example, the two graphs shown in Figure
9-6 below give the execution cycles and energy consumption
behavior when an array-based application (tomcatv from the Spec
benchmark suite) is parallelized over a different number of processors
(ranging from 1 to 8).
Each bar in these graphs represents a result (for the corresponding
loop nest), normalized with respect to the single processor case (that
is, the energy consumption or execution cycles when the nest is
executed on a single processor).
![]() |
| Figure 9-6 Energy consumption (top) and execution cycles (bottom) for different nests of tomcatv. For each nest, each bar corresponds to a specific number of processors (from one processor, the leftmost bar, to eight processors, the rightmost bar). |
From these graphs, one can make several observations. First, energy
and performance trends are different from each other. That is, in a
given nest, the best processor size from the energy perspective is (in
general) different from the best processor size from the execution time
perspective.
This is because in many cases increasing the number of processors beyond a certain point may continue to improve performance (execution cycles), but the extra energy to power up the additional processor and its caches may offset any potential energy benefits coming from the reduced execution cycles (as a result of increased parallelism).
Second, for a given nest, increasing the number of processors does not always reduce execution cycles. This occurs because there is an extra performance cost of parallelizing a loop, which includes spawning parallel threads, interprocessor synchronization during loop execution if there are data dependences, and synchronizing the threads after the loop execution. If the loop in question does not have a sufficient number of iterations to justify the use of a certain number of processors, this extra cost can dominate overall performance behavior.
Third, the best number of processors from the energy or performance perspectives depends on the loop nest in question. That is, different nests may require different processor sizes for the best results, and it is not a good idea to use the same number of processors for all the nests in a given application.
Given these observations, one can see that it is not easy to select the ideal processor sizes for each nest in a given code to optimize an objective function, which might have both performance and energy elements. As an example, consider a compilation (parallelization) scenario in which we would like to optimize energy consumption of the application keeping the execution time under a predetermined limit. Important questions in this scenario are (1) whether there are processor sizes for each nest to satisfy this compilation constraint, and (2) if there are multiple solutions, how one can choose the best one.
Most published work on parallelism
for high-end machines is
static; that is, the number of processors that execute the code is
fixed for the entire execution. For example, if the number of
processors that execute a code is fixed at 8, all parts of the code
(for example, all loop nests) are executed using the same eight
processors.
However, this may not be a very good strategy from the viewpoint of energy consumption. Instead, one can consume much less energy by using the minimum number of processors for each loop nest (and shutting down the caches of the unused processors). In adaptive parallelization, the number of processors is tailored to the specific needs of each code section (e.g., a nested loop in array-intensive applications).
In other words, the number of processors that are active at a given period of time changes dynamically as the program executes. For instance, an adaptive parallelization strategy can use 4, 6, 2, and 8 processors to execute the first four nests in a given code. There are two important issues that need to be addressed in designing an effective adaptive parallelization strategy for an on-chip multiprocessor:
Although this approach is expected to generate better results once the number of processors has been decided (as it can take run-time code behavior and dynamic resource constraints into account), it may also incur some performance overhead during the process of determining the number of processors. This overhead can, in some cases, offset the potential benefits of adaptive parallelism.
In the second option, the number of processors for each nest is decided at compile time. This approach has a compile-time overhead, but it does not lead to any run-time penalty. It should be emphasized, however, that although this approach determines the number of processors statically at compile time, the activation/deactivation of processors and their caches occurs dynamically at run time.
In addition, it is also possible to adopt complex objective functions and compilation criteria such as "minimize the sum of the cache and main memory energy consumptions while keeping the total execution cycles bounded by M." For example, the framework described in ref. 400 accepts as objective function and constraints any linear combination of energy consumptions and execution cycles of individual hardware components.
Future Research.
Perhaps the most important future research direction in compilation of
MPSoC applications will be compiling applications under multiple
constraints that involve power, energy, execution cycles, and memory
usage. Unfortunately, current compilation techniques (even those
developed in the context of high-end multiprocessor systems) do not
directly extend to the MPSoC domain since they only focus on
performance.
Equally important is studying techniques that produce quick estimation of metrics of interest (e.g., power and memory usage). Optimizing compilers can use such estimates to rank different (and sometimes conflicting) optimizations and to select the best one given the constraints that need to be satisfied.
That is, we envision future MPSoC compilers to have numerous estimation and optimization modules (each targeting at a different metric) and a solver, which selects the best optimization under multiple constraints.
Conclusion
The significant interest in MPSoC architectures in recent times has
caused researchers to study architectural optimizations from a
different perspective. With the full advance knowledge of the
applications being implemented by the system, many design parameters
can be optimized and/or customized. This is especially true of the
memory subsystem, in which a vast array of different organizations can
be employed for application-specific systems, and the designer is not
restricted to the traditional cache hierarchy. The optimal memory
architecture for an application-specific system can be significantly
different from the typical cache hierarchy of processors.
Although some of the analytical techniques can be automated, much
work remains to be performed before a completely "push-button"
methodology evolves for application-specific customization of the
memory organization in MPSoCs.
Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of computer science for Embedded Computer Systems at the University of California, Irvine