Optimizing Locality
Locality optimization can be performed for both code and data. As
mentioned earlier, since code locality is in general very good to begin
with, most recent studies on the topic concentrate on data locality.
Broadly speaking, there are two facets to the data locality problem:
(1) single processor perspective, and (2) multiprocessor perspective.
From the viewpoint of a single processor in an MPSoC, the locality
optimization problem is not very different from one in a single
processor architecture. The main goal is to maximize the number of data
accesses satisfied from the highest level of memory in the hierarchy.>
If the hierarchy has private L1 caches (one per processor), then one
can make use of the theory developed
in the optimizing compiler
community. Compiler optimizations that
target improving data locality can be broadly divided into two major
categories: (1) optimizations that target the reduction of memory
latency, and (2) optimizations that target the hiding of memory
latency.
In the first group are the techniques that modify program
access pattern, memory layout of variables, or both. Loop
transformations change the data access pattern exhibited by the loop
nests in the application to make them
more
cache-friendly. Their
major drawback is that data dependences in the code may prevent the
compiler from using the best transformation from the locality
perspective. Among the most popular loop transformations are loop
interchange (which changes the order of two loops in the nest), loop
fusion (which combines two neighboring loop nests into a single nest),
and tiling (which implements a block
matrix style of computation
automatically).
The drawbacks of loop transformations motivated researchers to
investigate alternative optimizations. One of these is data
space
(memory layout)
transformations, or data
transformations for
short. In applying a data transformation, instead of the program
access pattern, the compiler modifies the underlying memory layouts of
the data structures involved. For multidimensional array structures,
implementing a data transformation involves transforming the array
subscript expressions and modifying the corresponding array
declarations.
Work in this area focuses on pointer-intensive
codes and is based on the idea of co-locating data items
that are
accessed in close proximity in time. It should be noted that it is
possible to apply data transformations statically or dynamically. In
static transformations, the best layout for a given data structure is
determined statically (at compile time), and all references to the old
data structure are replaced by the references to the new data
structure. Many layout transformations that target temporary data
structure variables fall into this category.
In the dynamic transformation case, on the other hand, the layout
transformations take place during the course of execution.
Specifically, the compiler statically determines the layouts of the
data structures of interest at compile time at each program point and
inserts code for explicit layout transformation. However, the actual
layout transformations themselves take place at run time.
For example,
consider the dynamic layout optimization approach.
This
technique extends the static layout optimization techniques to select
dynamically changing layouts to further improve the locality of data
accesses. It divides the job of optimizing locality between compile
time and run time.
More specifically, the layouts of arrays for
different regions of code are determined at compile time (e.g., running
a static layout optimizer for each loop nest in the code), and the
bookkeeping code that is necessary to transform memory layouts
dynamically between different program regions is inserted in the code
(again at compile time). These bookkeeping codes, however, are executed
at run time.
The techniques discussed so far try to improve data locality,
thereby reducing the number of accesses to slower levels in the memory
hierarchy. Software data prefetching, in contrast, tries to bring data
to the fast memory ahead of time (i.e., before the data are actually
needed). If this can be achieved, the data can be accessed from cache
memory when they are needed.
In order to issue the prefetches ahead of
time, a compiler transformation known as software pipelining needs to
be used. Mowry
presents
a compiler-directed software prefetching
algorithm that consists of the following three steps: (1) determine
data references that are likely to be cache misses and therefore need
to be prefetched; (2) isolate the predicted cache miss instances
through loop splitting [398];
and (3) apply software pipelining and
insert explicit prefetch instructions in the code.
However, if the first level of memory in the hierarchy is an SPM
(one per processor), the compiler needs to employ a different set of
optimizations. This is because, as mentioned earlier, in an SPM-based
system, the compiler is in full control of data transfers between SPM
and the lower levels of the memory hierarchy. Therefore, it needs to
insert in the code explicit data movement instructions at compile time.
These instructions, when they are executed at run time, cause data
transfers between SPM and other memory components.
Obviously, this is a desired compilation strategy only if one wants
to manage SPMs dynamically. On the
other hand, it is also possible to fix the contents of the SPMs
statically, i.e., at compile time. In this case, the compiler does not
insert explicit data movement calls in the code since there are no
run-time data movements. As discussed earlier in this series, Panda et
al. has presented such a strategy for a single
processor
embedded
architecture. Although it is possible to extend the approach to the
MPSoC case, one needs to be careful in selecting the frequently shared
data items to be placed in the SPM.
It is also possible to allow multiple processors to share a single
(or a set of) on-chip SPM. A solution along this direction is called
the
virtually shared SPM (VS-SPM).
[ 408]. Here,the
execution model in a VS-SPM-based architecture is as follows. The
system takes as input a parallelized application (user-assisted or
through compiler-based dependence analysis).
In this model, each loop
nest is parallelized as much as possible. In processing a parallel
loop, all processors in the system participate in computation, and each
executes a subset of loop iterations. When the execution of a parallel
loop has completed, the processors synchronize using a special
construct called a barrier before starting the next loop.
The
synchronization and communication between processors are maintained
using fast on-chip communication links. Based on the parallelization
strategy, each processor works on a portion of each array in the code.
Since its local SPM space is typically much smaller than the portion of
the array it is currently operating on, it divides its portion into
chunks (also called data tiles) and operates on one chunk at a time.
When a data tile has been used, it is either discarded or written back
into off-chip memory (if modified).
In order to improve the reuse of data in the VS-SPM, one can
consider intraprocessor data reuse and interprocessor data reuse.
Intraprocessor data reuse corresponds to optimizing data reuse when
considering the access pattern of each processor in isolation. Previous
work, just discussed above,
addresses this problem.
It should be noted, however, that exploiting
intraprocessor reuse only may not be very effective in a VS-SPM-based
environment. This is because intraprocessor data reuse has a local
processor-centric perspective and does not take interprocessor data
sharing (communication) effects into account. Such effects are
particularly important for applications in which data regions touched
by different processors overlap.
This is very common in many
array-intensive embedded image processing applications. Interprocessor
data reuse, on the other hand, focuses on the problem of optimizing
data accesses considering access patterns of all processors in the
system. In other words, it has an application-centric view of data
accesses. The compilation approach proposed in ref. 408 maximizes
interprocessor data reuse for SPM-based data.
More specifically, the
data tile accesses of individual processors are restructured
(coordinated) in such a way that whenever a nonlocal array element is
required (to perform some computation), it can be found in some remote
SPM (instead of going to the off-chip memory). If this can be achieved
for all nonlocal accesses, one can eliminate all extra off-chip memory
accesses due to interprocessor communication.
From a multiple processor perspective, the locality problem is more
challenging to tackle than the single processor case. The SPM approach
is simple and has been summarized in the previous two paragraphs. If
the processors have conventional L1 caches instead of SPMs, the
situation becomes much more difficult to handle. This is because
on-chip processors can share cache lines in many different ways.
In
true-sharing, processors share the same data item, and the integrity of
the item should be maintained using an on-chip coherence protocol. In
false-sharing, they just share the cache line, not the data item
itself. Note that this case also causes a coherence activity (which
costs extra execution cycles as well as energy), and in the
false-sharing case, this activity is pure overhead that needs to be
minimized. Although we expect that the approaches
used in high-end
computing may be very useful in this context, one also
needs to
consider the energy consumption problem. Therefore, we see energy-aware
coherence protocols/algorithms as an important potential research
direction.