CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Providing memory system and compiler support for MPSoc designs: Part 3
Compiler Support



Embedded.com
Optimizing Locality
Locality optimization can be performed for both code and data. As mentioned earlier, since code locality is in general very good to begin with, most recent studies on the topic concentrate on data locality. Broadly speaking, there are two facets to the data locality problem: (1) single processor perspective, and (2) multiprocessor perspective. From the viewpoint of a single processor in an MPSoC, the locality optimization problem is not very different from one in a single processor architecture. The main goal is to maximize the number of data accesses satisfied from the highest level of memory in the hierarchy.

If the hierarchy has private L1 caches (one per processor), then one can make use of the theory developed in the optimizing compiler community. Compiler optimizations that target improving data locality can be broadly divided into two major categories: (1) optimizations that target the reduction of memory latency, and (2) optimizations that target the hiding of memory latency.

In the first group are the techniques that modify program access pattern, memory layout of variables, or both. Loop transformations change the data access pattern exhibited by the loop nests in the application to make them more cache-friendly. Their major drawback is that data dependences in the code may prevent the compiler from using the best transformation from the locality perspective. Among the most popular loop transformations are loop interchange (which changes the order of two loops in the nest), loop fusion (which combines two neighboring loop nests into a single nest), and tiling (which implements a block matrix style of computation automatically).

The drawbacks of loop transformations motivated researchers to investigate alternative optimizations. One of these is data space (memory layout) transformations, or data transformations for short. In applying a data transformation, instead of the program access pattern, the compiler modifies the underlying memory layouts of the data structures involved. For multidimensional array structures, implementing a data transformation involves transforming the array subscript expressions and modifying the corresponding array declarations.

Work in this area focuses on pointer-intensive codes and is based on the idea of co-locating data items that are accessed in close proximity in time. It should be noted that it is possible to apply data transformations statically or dynamically. In static transformations, the best layout for a given data structure is determined statically (at compile time), and all references to the old data structure are replaced by the references to the new data structure. Many layout transformations that target temporary data structure variables fall into this category.

In the dynamic transformation case, on the other hand, the layout transformations take place during the course of execution. Specifically, the compiler statically determines the layouts of the data structures of interest at compile time at each program point and inserts code for explicit layout transformation. However, the actual layout transformations themselves take place at run time.

For example, consider the dynamic layout optimization approach. This technique extends the static layout optimization techniques to select dynamically changing layouts to further improve the locality of data accesses. It divides the job of optimizing locality between compile time and run time.

More specifically, the layouts of arrays for different regions of code are determined at compile time (e.g., running a static layout optimizer for each loop nest in the code), and the bookkeeping code that is necessary to transform memory layouts dynamically between different program regions is inserted in the code (again at compile time). These bookkeeping codes, however, are executed at run time.

The techniques discussed so far try to improve data locality, thereby reducing the number of accesses to slower levels in the memory hierarchy. Software data prefetching, in contrast, tries to bring data to the fast memory ahead of time (i.e., before the data are actually needed). If this can be achieved, the data can be accessed from cache memory when they are needed.

In order to issue the prefetches ahead of time, a compiler transformation known as software pipelining needs to be used. Mowry presents a compiler-directed software prefetching algorithm that consists of the following three steps: (1) determine data references that are likely to be cache misses and therefore need to be prefetched; (2) isolate the predicted cache miss instances through loop splitting [398]; and (3) apply software pipelining and insert explicit prefetch instructions in the code.

However, if the first level of memory in the hierarchy is an SPM (one per processor), the compiler needs to employ a different set of optimizations. This is because, as mentioned earlier, in an SPM-based system, the compiler is in full control of data transfers between SPM and the lower levels of the memory hierarchy. Therefore, it needs to insert in the code explicit data movement instructions at compile time. These instructions, when they are executed at run time, cause data transfers between SPM and other memory components.

Obviously, this is a desired compilation strategy only if one wants to manage SPMs dynamically. On the other hand, it is also possible to fix the contents of the SPMs statically, i.e., at compile time. In this case, the compiler does not insert explicit data movement calls in the code since there are no run-time data movements. As discussed earlier in this series, Panda et al.  has presented such a strategy for a single processor embedded architecture. Although it is possible to extend the approach to the MPSoC case, one needs to be careful in selecting the frequently shared data items to be placed in the SPM.

It is also possible to allow multiple processors to share a single (or a set of) on-chip SPM. A solution along this direction is called the virtually shared SPM (VS-SPM). [ 408]. Here,the execution model in a VS-SPM-based architecture is as follows. The system takes as input a parallelized application (user-assisted or through compiler-based dependence analysis).

In this model, each loop nest is parallelized as much as possible. In processing a parallel loop, all processors in the system participate in computation, and each executes a subset of loop iterations. When the execution of a parallel loop has completed, the processors synchronize using a special construct called a barrier before starting the next loop.

The synchronization and communication between processors are maintained using fast on-chip communication links. Based on the parallelization strategy, each processor works on a portion of each array in the code. Since its local SPM space is typically much smaller than the portion of the array it is currently operating on, it divides its portion into chunks (also called data tiles) and operates on one chunk at a time. When a data tile has been used, it is either discarded or written back into off-chip memory (if modified).

In order to improve the reuse of data in the VS-SPM, one can consider intraprocessor data reuse and interprocessor data reuse. Intraprocessor data reuse corresponds to optimizing data reuse when considering the access pattern of each processor in isolation. Previous work, just discussed above, addresses this problem.

It should be noted, however, that exploiting intraprocessor reuse only may not be very effective in a VS-SPM-based environment. This is because intraprocessor data reuse has a local processor-centric perspective and does not take interprocessor data sharing (communication) effects into account. Such effects are particularly important for applications in which data regions touched by different processors overlap.

This is very common in many array-intensive embedded image processing applications. Interprocessor data reuse, on the other hand, focuses on the problem of optimizing data accesses considering access patterns of all processors in the system. In other words, it has an application-centric view of data accesses. The compilation approach proposed in ref. 408 maximizes interprocessor data reuse for SPM-based data.

More specifically, the data tile accesses of individual processors are restructured (coordinated) in such a way that whenever a nonlocal array element is required (to perform some computation), it can be found in some remote SPM (instead of going to the off-chip memory). If this can be achieved for all nonlocal accesses, one can eliminate all extra off-chip memory accesses due to interprocessor communication.

From a multiple processor perspective, the locality problem is more challenging to tackle than the single processor case. The SPM approach is simple and has been summarized in the previous two paragraphs. If the processors have conventional L1 caches instead of SPMs, the situation becomes much more difficult to handle. This is because on-chip processors can share cache lines in many different ways.

In true-sharing, processors share the same data item, and the integrity of the item should be maintained using an on-chip coherence protocol. In false-sharing, they just share the cache line, not the data item itself. Note that this case also causes a coherence activity (which costs extra execution cycles as well as energy), and in the false-sharing case, this activity is pure overhead that needs to be minimized. Although we expect that the approaches used in high-end computing may be very useful in this context, one also needs to consider the energy consumption problem. Therefore, we see energy-aware coherence protocols/algorithms as an important potential research direction.

1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :