CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Providing memory system and compiler support for MPSoc designs: Part 3
Compiler Support



Embedded.com

An optimizing compiler that targets MPSoC environments should tackle a number of critical issues. Building on what was learned in Part 1 and Part 2, we first explain these issues and then study potential solutions. From the performance viewpoint, perhaps the two most important memory-related tasks to be performed in an MPSoC environment are optimizing parallelism and locality. Other important issues relate to power/energy consumption and memory space.

The problem with parallelism
Optimizing parallelism is obviously important, since parallelism is the main reason to employ multiple processors in a single unit. In fact, a parallelization strategy determines how memory is utilized by multiple on-chip processors and can be an important factor for achieving an acceptable performance. However, maximum parallelism may not always be easy to achieve because of several factors. For example, intrinsic data dependences in the code may not allow full utilization of all on-chip processors. Similarly, in some cases, interprocessor communication costs can be overwhelming as one increases the number of processors used.

Finally, performance benefits due to increased interprocessor parallelism may not be sufficient when one considers the increase in power consumption. Because of all these, it may be preferable to avoid increasing the number of processors arbitrarily. In addition, the possibility of different parts of the same application demanding different number of processors can make the problem much harder.

Instruction and Data Locality
An equally important problem is ensuring locality of data/instruction accesses. Although achieving acceptable instruction cache performance is not very difficult (since instructions are read-only and exhibit perfect spatial locality), the same cannot be said for data locality.

This is because straightforward coding of many applications can lead to poor data cache utilization. In addition, in an MPSoC environment, interprocessor communication can lead to frequent cache line invalidations/updates (due to interprocessor data sharing), which in turn increases overall latency.

This last issue becomes particularly problematic when false sharing occurs (i.e., the multiple processors share a cache line but not the same data in it). Therefore, an important task for the compiler is to minimize false sharing as much as possible.

Power/Energy Consumption
Considering the environments for which MPSoC based systems can be employed, one can see that performance is only one part of the big picture. Minimizing power/energy consumption can become vital for system availability. In fact, power/energy consideration is one of the most important factors that distinguish an MPSoC-based system from its high-end counterparts. . In other words, one cannot directly transfer pure performance-oriented compilation policies from a high-end parallel computing domain. This is because these techniques in general favor increasing the number of processors (to be used for execution of the application) as long as there are additional performance benefits in doing so.

However, such a strategy may not be preferable in our context since increasing the number of processors means powering up more processors along with their local caches (and/or SPMs). Obviously, this can lead to higher power consumption and a larger overall energy-delay product value. Therefore, an optimizing compiler should be able to balance the increase in power consumption and decrease in execution cycles. In addition, it is well known from high-end parallel computing that parallelism and data locality can sometimes impose conflicting demands and require different program/data optimizations.

Adding power/energy minimization as an additional metric makes the overall optimization problem extremely hard and justifies, in our opinion, the use of heuristic solutions.

Memory Space
Yet another memory-related issue is the memory space required by the application. Since most prior existing locality-oriented studies have focused on improving memory access patterns, there is no guarantee that such techniques will reduce space demands. However, reducing space consumption might be of critical importance as doing so increases the effectiveness of on-chip memory utilization and can reduce the number of off-chip references.

Looking for Solutions
Let's start the search with a discussion of several techniques that can be used for enhancing memory behavior of MPSoC architectures. We focus on both performance and energy aspects.

Optimizing Parallelism
Parallelism can either be expressed by the programmer at the source level or be automatically derived by an optimizing compiler from the sequential code [397]. In the former case (which is called the user-assisted method), the compiler's job is relatively easy, and consists of distributing the computations across the processors (as indicated by the user) and minimizing interprocessor communication.

This may involve program transformations that reduce the number of communications (messages) as well as the volume of data communicated at each message. In the latter case (i.e., when the compiler is to generate parallel code from a sequential one), the compiler needs to analyze data dependences and extract available parallelism. After that, by assessing the volume and intensity of interprocessor communication, the compiler can come up with a suitable parallelization strategy (e.g., in a loop nest-based code, it determines which loops should be run in parallel). After this step, the compilation process proceeds as in the user-directed (user-assisted) parallelism case.

Although in the high-end computing arena it is generally accepted that user-assisted parallelism is a more viable approach than compiler-initiated parallelism [397], the choice is not clear in the context of MPSoC due to additional constraints of low power and limited memory space. That is, it can be extremely difficult for the programmer to strike a balance among performance, power, and memory space consumption for a given application. What makes it even more difficult is that not all MPSoC applications are array/loop nest-based, making the task of identifying parallelism very difficult for the user.

In order for the compiler to parallelize a code given performance (execution cycles), energy, and memory space consumption constraints, it needs to carry out two major tasks: (1) estimating performance, power consumption, and space requirements of a given piece of code; and (2) optimizing the objective function under multiple constraints (which might be of performance, power, memory space, or a combination of these). Based on this observation, there are at least two different ways of dealing with this MPSoC parallelization problem:

  • Static approaches: These techniques are generally applicable when the code is statically analyzable. Many applications from embedded image and video processing areas fall into this category.

    In this case, the compiler can analyze the application code and estimate its execution cycles, power/energy consumption, and memory space demands. Then it can conduct a similar estimation process for each potential optimized version, and it selects the best candidate for the output code. Obviously, an important question here is how to generate different optimized versions.

    Among the techniques proposed so far in literature are heuristic schemes and integer linear programming (ILP)-based techniques that make use of profiling [399]. It is to be noted that the static analyzability of the code also makes it possible for the compiler to come up with a small set of candidate optimized versions.

  • Dynamic approaches: The schemes in this category also work with a candidate set of optimized versions; but it is less clear (compared with the previous case) at compile time how these candidates would rank with respect to each other.

    In other words, the selection of the best optimized version is postponed until run time. One approach recently proposed in ref. 400 is first to execute a portion of the code fragment to be optimized and then, based on the run-time statistics collected, select the most suitable parallelization strategy at run time. Clearly, the time (and energy) spent in determining the best optimized version at run time is an overhead and needs to be optimized away as much as possible.

<>It should also be emphasized that whether the memory hierarchy in question is constructed using caches or SPMs can make a difference in the success of the parallelization strategy chosen. More specifically, it is easier to estimate the performance and energy consumption with SPMs (as opposed to caches) since the compiler is in full control of memory transfers in the former (whereas the second one is managed by the hardware).

1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :