Providing memory system and compiler support for MPSoc designs: Part 3
Compiler Support
By Mahmut Kandemir and Nikil Dutt
Embedded.com
(01/07/09, 01:41:00 AM EST)

An optimizing compiler that targets MPSoC environments should tackle a number of critical issues. Building on what was learned in Part 1 and Part 2, we first explain these issues and then study potential solutions. From the performance viewpoint, perhaps the two most important memory-related tasks to be performed in an MPSoC environment are optimizing parallelism and locality. Other important issues relate to power/energy consumption and memory space.

The problem with parallelism
Optimizing parallelism is obviously important, since parallelism is the main reason to employ multiple processors in a single unit. In fact, a parallelization strategy determines how memory is utilized by multiple on-chip processors and can be an important factor for achieving an acceptable performance. However, maximum parallelism may not always be easy to achieve because of several factors. For example, intrinsic data dependences in the code may not allow full utilization of all on-chip processors. Similarly, in some cases, interprocessor communication costs can be overwhelming as one increases the number of processors used.

Finally, performance benefits due to increased interprocessor parallelism may not be sufficient when one considers the increase in power consumption. Because of all these, it may be preferable to avoid increasing the number of processors arbitrarily. In addition, the possibility of different parts of the same application demanding different number of processors can make the problem much harder.

Instruction and Data Locality
An equally important problem is ensuring locality of data/instruction accesses. Although achieving acceptable instruction cache performance is not very difficult (since instructions are read-only and exhibit perfect spatial locality), the same cannot be said for data locality.

This is because straightforward coding of many applications can lead to poor data cache utilization. In addition, in an MPSoC environment, interprocessor communication can lead to frequent cache line invalidations/updates (due to interprocessor data sharing), which in turn increases overall latency.

This last issue becomes particularly problematic when false sharing occurs (i.e., the multiple processors share a cache line but not the same data in it). Therefore, an important task for the compiler is to minimize false sharing as much as possible.

Power/Energy Consumption
Considering the environments for which MPSoC based systems can be employed, one can see that performance is only one part of the big picture. Minimizing power/energy consumption can become vital for system availability. In fact, power/energy consideration is one of the most important factors that distinguish an MPSoC-based system from its high-end counterparts. . In other words, one cannot directly transfer pure performance-oriented compilation policies from a high-end parallel computing domain. This is because these techniques in general favor increasing the number of processors (to be used for execution of the application) as long as there are additional performance benefits in doing so.

However, such a strategy may not be preferable in our context since increasing the number of processors means powering up more processors along with their local caches (and/or SPMs). Obviously, this can lead to higher power consumption and a larger overall energy-delay product value. Therefore, an optimizing compiler should be able to balance the increase in power consumption and decrease in execution cycles. In addition, it is well known from high-end parallel computing that parallelism and data locality can sometimes impose conflicting demands and require different program/data optimizations.

Adding power/energy minimization as an additional metric makes the overall optimization problem extremely hard and justifies, in our opinion, the use of heuristic solutions.

Memory Space
Yet another memory-related issue is the memory space required by the application. Since most prior existing locality-oriented studies have focused on improving memory access patterns, there is no guarantee that such techniques will reduce space demands. However, reducing space consumption might be of critical importance as doing so increases the effectiveness of on-chip memory utilization and can reduce the number of off-chip references.

Looking for Solutions
Let's start the search with a discussion of several techniques that can be used for enhancing memory behavior of MPSoC architectures. We focus on both performance and energy aspects.

Optimizing Parallelism
Parallelism can either be expressed by the programmer at the source level or be automatically derived by an optimizing compiler from the sequential code [397]. In the former case (which is called the user-assisted method), the compiler's job is relatively easy, and consists of distributing the computations across the processors (as indicated by the user) and minimizing interprocessor communication.

This may involve program transformations that reduce the number of communications (messages) as well as the volume of data communicated at each message. In the latter case (i.e., when the compiler is to generate parallel code from a sequential one), the compiler needs to analyze data dependences and extract available parallelism. After that, by assessing the volume and intensity of interprocessor communication, the compiler can come up with a suitable parallelization strategy (e.g., in a loop nest-based code, it determines which loops should be run in parallel). After this step, the compilation process proceeds as in the user-directed (user-assisted) parallelism case.

Although in the high-end computing arena it is generally accepted that user-assisted parallelism is a more viable approach than compiler-initiated parallelism [397], the choice is not clear in the context of MPSoC due to additional constraints of low power and limited memory space. That is, it can be extremely difficult for the programmer to strike a balance among performance, power, and memory space consumption for a given application. What makes it even more difficult is that not all MPSoC applications are array/loop nest-based, making the task of identifying parallelism very difficult for the user.

In order for the compiler to parallelize a code given performance (execution cycles), energy, and memory space consumption constraints, it needs to carry out two major tasks: (1) estimating performance, power consumption, and space requirements of a given piece of code; and (2) optimizing the objective function under multiple constraints (which might be of performance, power, memory space, or a combination of these). Based on this observation, there are at least two different ways of dealing with this MPSoC parallelization problem:

<>It should also be emphasized that whether the memory hierarchy in question is constructed using caches or SPMs can make a difference in the success of the parallelization strategy chosen. More specifically, it is easier to estimate the performance and energy consumption with SPMs (as opposed to caches) since the compiler is in full control of memory transfers in the former (whereas the second one is managed by the hardware).

Optimizing Locality
Locality optimization can be performed for both code and data. As mentioned earlier, since code locality is in general very good to begin with, most recent studies on the topic concentrate on data locality. Broadly speaking, there are two facets to the data locality problem: (1) single processor perspective, and (2) multiprocessor perspective. From the viewpoint of a single processor in an MPSoC, the locality optimization problem is not very different from one in a single processor architecture. The main goal is to maximize the number of data accesses satisfied from the highest level of memory in the hierarchy.

If the hierarchy has private L1 caches (one per processor), then one can make use of the theory developed in the optimizing compiler community. Compiler optimizations that target improving data locality can be broadly divided into two major categories: (1) optimizations that target the reduction of memory latency, and (2) optimizations that target the hiding of memory latency.

In the first group are the techniques that modify program access pattern, memory layout of variables, or both. Loop transformations change the data access pattern exhibited by the loop nests in the application to make them more cache-friendly. Their major drawback is that data dependences in the code may prevent the compiler from using the best transformation from the locality perspective. Among the most popular loop transformations are loop interchange (which changes the order of two loops in the nest), loop fusion (which combines two neighboring loop nests into a single nest), and tiling (which implements a block matrix style of computation automatically).

The drawbacks of loop transformations motivated researchers to investigate alternative optimizations. One of these is data space (memory layout) transformations, or data transformations for short. In applying a data transformation, instead of the program access pattern, the compiler modifies the underlying memory layouts of the data structures involved. For multidimensional array structures, implementing a data transformation involves transforming the array subscript expressions and modifying the corresponding array declarations.

Work in this area focuses on pointer-intensive codes and is based on the idea of co-locating data items that are accessed in close proximity in time. It should be noted that it is possible to apply data transformations statically or dynamically. In static transformations, the best layout for a given data structure is determined statically (at compile time), and all references to the old data structure are replaced by the references to the new data structure. Many layout transformations that target temporary data structure variables fall into this category.

In the dynamic transformation case, on the other hand, the layout transformations take place during the course of execution. Specifically, the compiler statically determines the layouts of the data structures of interest at compile time at each program point and inserts code for explicit layout transformation. However, the actual layout transformations themselves take place at run time.

For example, consider the dynamic layout optimization approach. This technique extends the static layout optimization techniques to select dynamically changing layouts to further improve the locality of data accesses. It divides the job of optimizing locality between compile time and run time.

More specifically, the layouts of arrays for different regions of code are determined at compile time (e.g., running a static layout optimizer for each loop nest in the code), and the bookkeeping code that is necessary to transform memory layouts dynamically between different program regions is inserted in the code (again at compile time). These bookkeeping codes, however, are executed at run time.

The techniques discussed so far try to improve data locality, thereby reducing the number of accesses to slower levels in the memory hierarchy. Software data prefetching, in contrast, tries to bring data to the fast memory ahead of time (i.e., before the data are actually needed). If this can be achieved, the data can be accessed from cache memory when they are needed.

In order to issue the prefetches ahead of time, a compiler transformation known as software pipelining needs to be used. Mowry presents a compiler-directed software prefetching algorithm that consists of the following three steps: (1) determine data references that are likely to be cache misses and therefore need to be prefetched; (2) isolate the predicted cache miss instances through loop splitting [398]; and (3) apply software pipelining and insert explicit prefetch instructions in the code.

However, if the first level of memory in the hierarchy is an SPM (one per processor), the compiler needs to employ a different set of optimizations. This is because, as mentioned earlier, in an SPM-based system, the compiler is in full control of data transfers between SPM and the lower levels of the memory hierarchy. Therefore, it needs to insert in the code explicit data movement instructions at compile time. These instructions, when they are executed at run time, cause data transfers between SPM and other memory components.

Obviously, this is a desired compilation strategy only if one wants to manage SPMs dynamically. On the other hand, it is also possible to fix the contents of the SPMs statically, i.e., at compile time. In this case, the compiler does not insert explicit data movement calls in the code since there are no run-time data movements. As discussed earlier in this series, Panda et al.  has presented such a strategy for a single processor embedded architecture. Although it is possible to extend the approach to the MPSoC case, one needs to be careful in selecting the frequently shared data items to be placed in the SPM.

It is also possible to allow multiple processors to share a single (or a set of) on-chip SPM. A solution along this direction is called the virtually shared SPM (VS-SPM). [ 408]. Here,the execution model in a VS-SPM-based architecture is as follows. The system takes as input a parallelized application (user-assisted or through compiler-based dependence analysis).

In this model, each loop nest is parallelized as much as possible. In processing a parallel loop, all processors in the system participate in computation, and each executes a subset of loop iterations. When the execution of a parallel loop has completed, the processors synchronize using a special construct called a barrier before starting the next loop.

The synchronization and communication between processors are maintained using fast on-chip communication links. Based on the parallelization strategy, each processor works on a portion of each array in the code. Since its local SPM space is typically much smaller than the portion of the array it is currently operating on, it divides its portion into chunks (also called data tiles) and operates on one chunk at a time. When a data tile has been used, it is either discarded or written back into off-chip memory (if modified).

In order to improve the reuse of data in the VS-SPM, one can consider intraprocessor data reuse and interprocessor data reuse. Intraprocessor data reuse corresponds to optimizing data reuse when considering the access pattern of each processor in isolation. Previous work, just discussed above, addresses this problem.

It should be noted, however, that exploiting intraprocessor reuse only may not be very effective in a VS-SPM-based environment. This is because intraprocessor data reuse has a local processor-centric perspective and does not take interprocessor data sharing (communication) effects into account. Such effects are particularly important for applications in which data regions touched by different processors overlap.

This is very common in many array-intensive embedded image processing applications. Interprocessor data reuse, on the other hand, focuses on the problem of optimizing data accesses considering access patterns of all processors in the system. In other words, it has an application-centric view of data accesses. The compilation approach proposed in ref. 408 maximizes interprocessor data reuse for SPM-based data.

More specifically, the data tile accesses of individual processors are restructured (coordinated) in such a way that whenever a nonlocal array element is required (to perform some computation), it can be found in some remote SPM (instead of going to the off-chip memory). If this can be achieved for all nonlocal accesses, one can eliminate all extra off-chip memory accesses due to interprocessor communication.

From a multiple processor perspective, the locality problem is more challenging to tackle than the single processor case. The SPM approach is simple and has been summarized in the previous two paragraphs. If the processors have conventional L1 caches instead of SPMs, the situation becomes much more difficult to handle. This is because on-chip processors can share cache lines in many different ways.

In true-sharing, processors share the same data item, and the integrity of the item should be maintained using an on-chip coherence protocol. In false-sharing, they just share the cache line, not the data item itself. Note that this case also causes a coherence activity (which costs extra execution cycles as well as energy), and in the false-sharing case, this activity is pure overhead that needs to be minimized. Although we expect that the approaches used in high-end computing may be very useful in this context, one also needs to consider the energy consumption problem. Therefore, we see energy-aware coherence protocols/algorithms as an important potential research direction.

Optimizing Memory Space Utilization
Another important issue in an MPSoC-based environment is the management of limited memory space. It should be observed that optimizing for data locality does not mean that the optimized application will execute using fewer data items. To reduce data space requirements of a given application, one may need to consider lifetimes of program data structures/variables. In particular, if the lifetimes of two different data items do not overlap, they can potentially share the same memory location, thereby reducing overall memory space demand.

For array-based applications, in a given loop nest, one can define the lifetime of an array element as the difference (in loop iterations) between the time it is first assigned (written) and the time it is last used (read). For a given array index a (which might be multidimensional), the start of its lifetime is referred to as S(a), whereas the end of its lifetime is denoted using E(a) - both in terms of loop iterations. Using these definitions, the lifetime vector for this array element can be given as s = E(a) - S(a), where "-" denotes vector subtraction.

Note that the lifetime of a is expressed as a vector, as in general there might be multiple loops in the next, and expressing lifetime as a vector allows the compiler to measure the impact of loop transformations on it. As an example, if an array element (that is accessed in a nest with two loops) is produced in iteration (1 3)T and consumed (i.e., last-read) in iteration (4 5)T, its lifetime vector is s = (4 5)T - (1 3)T = (3 2)T.

It should be noted that before S(a) and after E(a), the memory location allocated to this array element could be used for storing another array element (which belongs to the same array or to a different array). Obviously, the shorter the difference between E(a) and S(a) the better, as it leaves more room for other array elements.

However, it is to be noted that reducing lifetimes of array elements does not guarantee memory location reuse. This is because in order to have memory location reuse, the lifetimes should be shortened such that the different elements can share the same memory location. As a result, it is important for the compiler to make sure that this happens (when lifetimes are reduced). That is, the compiler can check whether the lifetime reduction will really be beneficial; if not, it may choose not to apply it.

Along these lines, Lefebvre and Feautrier propose a method for automatic storage management for loop-based parallel programs with affine references. They show that parallelization by total data expansion can be replaced by partial data expansion without distorting the degree of parallelism.

 Grun et al.  have presented memory size estimation techniques for multimedia applications. Strout et al.  introduce the universal occupancy vector that allows schedule-independent storage space optimization. Their objective is to reduce data space requirements of a given code without preventing subsequent locality optimizations from being performed. Wilde and Rajopadhye have investigated memory reuse by performing static lifetime computations of variables using a polyhedral model. Lim et al. have defined an optimization method called the array contraction.

The work done by Song et al.  and Catthoor and his colleagues shows how loop fusion can be used for minimizing data space requirements. Along similar lines, Fraboulet et al. has also presented an algorithm to reduce the use of temporary arrays by loop fusion. They demonstrate that although the complexity of their algorithm is not polynomial, it is highly efficient in practice. Thies et al. has described a unified general framework for scheduling and storage optimization.

One of the problems that they address, namely, "choosing a store for a given schedule" tries to assign storage locations to array elements with storage size reduction in mind. It should also be observed that reducing memory space requirements alone might help improve data locality (cache behavior). This is because in array-intensive applications a significant portion of data cache misses (called capacity misses) occurs because large datasets are accessed in a short period of time (e.g., within a loop). By reducing the number of array elements, one can also improve cache behavior.

We also need to point out an important tradeoff that needs to be made for an MPSoC-based execution environment. In general, more memory space means more parallelism, as discussed in studies such as those of Tu and Padua and Barthou et al. This is because a large percentage of data dependences in a given application are anti-dependences and output-dependences, which occur mainly due to reuse of the same memory location for different data items.

By allocating a separate memory location for each data item, one can eliminate these dependences. However, as we have discussed above, reusing the same memory location for different data items is an important optimization as far as memory space utilization is concerned. Therefore, an optimizing compiler should be able to resolve this tradeoff considering the performance and memory space constraints at the same time.

Power/Energy Optimization
Power/energy-directed and performance-directed optimizations often conflict as far as an MPSoC-based environment is concerned. For example, the two graphs shown in Figure 9-6 below give the execution cycles and energy consumption behavior when an array-based application (tomcatv from the Spec benchmark suite) is parallelized over a different number of processors (ranging from 1 to 8).

Each bar in these graphs represents a result (for the corresponding loop nest), normalized with respect to the single processor case (that is, the energy consumption or execution cycles when the nest is executed on a single processor).

Figure 9-6 Energy consumption (top) and execution cycles (bottom) for different nests of tomcatv. For each nest, each bar corresponds to a specific number of processors (from one processor, the leftmost bar, to eight processors, the rightmost bar).

From these graphs, one can make several observations. First, energy and performance trends are different from each other. That is, in a given nest, the best processor size from the energy perspective is (in general) different from the best processor size from the execution time perspective.

This is because in many cases increasing the number of processors beyond a certain point may continue to improve performance (execution cycles), but the extra energy to power up the additional processor and its caches may offset any potential energy benefits coming from the reduced execution cycles (as a result of increased parallelism).

Second, for a given nest, increasing the number of processors does not always reduce execution cycles. This occurs because there is an extra performance cost of parallelizing a loop, which includes spawning parallel threads, interprocessor synchronization during loop execution if there are data dependences, and synchronizing the threads after the loop execution. If the loop in question does not have a sufficient number of iterations to justify the use of a certain number of processors, this extra cost can dominate overall performance behavior.

Third, the best number of processors from the energy or performance perspectives depends on the loop nest in question. That is, different nests may require different processor sizes for the best results, and it is not a good idea to use the same number of processors for all the nests in a given application.

Given these observations, one can see that it is not easy to select the ideal processor sizes for each nest in a given code to optimize an objective function, which might have both performance and energy elements. As an example, consider a compilation (parallelization) scenario in which we would like to optimize energy consumption of the application keeping the execution time under a predetermined limit. Important questions in this scenario are (1) whether there are processor sizes for each nest to satisfy this compilation constraint, and (2) if there are multiple solutions, how one can choose the best one.

Most published work on parallelism for high-end machines  is static; that is, the number of processors that execute the code is fixed for the entire execution. For example, if the number of processors that execute a code is fixed at 8, all parts of the code (for example, all loop nests) are executed using the same eight processors.

However, this may not be a very good strategy from the viewpoint of energy consumption. Instead, one can consume much less energy by using the minimum number of processors for each loop nest (and shutting down the caches of the unused processors). In adaptive parallelization, the number of processors is tailored to the specific needs of each code section (e.g., a nested loop in array-intensive applications).

In other words, the number of processors that are active at a given period of time changes dynamically as the program executes. For instance, an adaptive parallelization strategy can use 4, 6, 2, and 8 processors to execute the first four nests in a given code. There are two important issues that need to be addressed in designing an effective adaptive parallelization strategy for an on-chip multiprocessor:

Future Research.
Perhaps the most important future research direction in compilation of MPSoC applications will be compiling applications under multiple constraints that involve power, energy, execution cycles, and memory usage. Unfortunately, current compilation techniques (even those developed in the context of high-end multiprocessor systems) do not directly extend to the MPSoC domain since they only focus on performance.

Equally important is studying techniques that produce quick estimation of metrics of interest (e.g., power and memory usage). Optimizing compilers can use such estimates to rank different (and sometimes conflicting) optimizations and to select the best one given the constraints that need to be satisfied.

That is, we envision future MPSoC compilers to have numerous estimation and optimization modules (each targeting at a different metric) and a solver, which selects the best optimization under multiple constraints.

 Conclusion
The significant interest in MPSoC architectures in recent times has caused researchers to study architectural optimizations from a different perspective. With the full advance knowledge of the applications being implemented by the system, many design parameters can be optimized and/or customized. This is especially true of the memory subsystem, in which a vast array of different organizations can be employed for application-specific systems, and the designer is not restricted to the traditional cache hierarchy. The optimal memory architecture for an application-specific system can be significantly different from the typical cache hierarchy of processors.

Although some of the analytical techniques can be automated, much work remains to be performed before a completely "push-button" methodology evolves for application-specific customization of the memory organization in MPSoCs.

To read Part 1, go to Types of Memory Architectures.
To read Part 2, go to  Customization of memory architectures

This series of articles is based on copyrighted material submitted by Nikil Dutt and Mahmut Kandemir to "Multiprocessor Systems-On-Chips"  edited byWayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.

Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of computer science for Embedded Computer Systems at the University of California, Irvine