Providing memory system and compiler support for MPSoc designs: Part 1
Memory Architectures
By Mahmut Kandemir and Nikil Dutt
Embedded.com
(01/05/09, 06:50:00 PM EST)
System-on-chip (SoC) architectures are being increasingly employed to solve a diverse spectrum of problems in the embedded and mobile systems domain. The resulting increase in the complexity of applications ported into SoC architectures places a tremendous burden on the computational resources required to deliver the required functionality.

An emerging architectural solution places multiple processor cores on a single chip to manage the computational requirements. Such a multiprocessor system-on-chip (MPSoC) architecture has several advantages over a conventional strategy that employs a single, more powerful (but complex) processor on the chip.

First, the design of an on-chip multiprocessor composed of multiple simple processor cores is simpler than that of a complex single-processor system. This simplicity also helps reduce the time spent in verification and validation..

Second, an on-chip multiprocessor is expected to result in better utilization of the silicon space. The extra logic that would be spent on register renaming, instruction wake-up, speculation/predication, and register bypass on a complex single processor can be spent on providing higher bandwidth on an on-chip multiprocessor.

Third, an MPSoC architecture can exploit loop-level parallelism at the software level in array-intensive embedded applications. In contrast, a complex single-processor architecture needs to convert loop-level parallelism to instruction-level parallelism at run time (that is, dynamically) using sophisticated (and power-hungry) strategies. During this process, some loss in parallelism is inevitable.

Finally, a multiprocessor configuration provides an opportunity for energy savings through careful and selective management of individual processors. Overall, an on-chip multiprocessor is a suitable platform for executing array-intensive computations commonly found in embedded image and video processing applications.

One of the most critical components that determine the success of an MPSoC based architecture is its memory system. This is because many applications spend a significant portion of their cycles in the memory hierarchy. In fact, one can expect this to be even more so in the future, considering the ever-increasing dataset sizes, coupled with the widening processor-memory gap.

In addition, from an energy consumption angle, the memory system can contribute up to 90% of the overall system power. In fact, one can expect that a significant portion of the transistors in an MPSoC-based architecture will be devoted to the memory hierarchy.

There are at least two major (and complementary) ways of optimizing the memory performance of an MPSoC-based system: (1) constructing a suitable memory organization/hierarchy and (2) optimizing the software (application) for it. This chapter focuses on these two issues and discusses different potential solutions for them.

On the architecture side, one can employ a traditional cache-based hierarchy or can opt to build a customized memory hierarchy, which can consist of caches, scratch pad memories, stream buffers, LIFOs, or a combination of these. It is also possible to make some architectural features reconfigurable and tune their parameters at run time according to the needs of the application being executed.

Traditional compilation techniques for multiprocessor architectures focus only on performance (execution cycles). However, in an MPSoC-based environment, one might want to include other metrics of interest as well, such as energy/power consumption and memory space usage. Therefore, the compiler's job is much more difficult in our context compared with the case of traditional high-end multiprocessors.

Memory Architectures
The application-specific nature of embedded systems presents new opportunities for aggressive customization and exploration of architectural issues. Since embedded systems typically implement a fixed application or problem in a particular domain, knowledge of the applications can be used to tailor the system architecture to suit the needs of the given application.

Such an architectural exploration scheme is quite different from the development of general-purpose computer systems that are designed for good average performance over a set of typical benchmark programs that cover a wide range of applications with different behaviors.

However, in the case of embedded systems, the features of the given application can be used to determine the architectural parameters. This is particularly true for MPSoC-based systems, in which we have numerous power-hungry components. For example, if an application does not use floating point arithmetic, then the floating point unit can be removed from the MPSoC, thereby saving area and power in the implementation.

Since the memory subsystem will dominate the cost (area), performance, and power of an MPSoC, we have to pay special attention to how the memory subsystem can benefit from customization. Unlike a general-purpose processor, in which a standard cache hierarchy is employed, the memory hierarchy - indeed the overall memory organization - of an MPSoC-based system can be tailored in various ways.

The memory can be selectively cached; the cache line size can be determined by the application; the designer can opt to discard the cache completely and employ specialized memory configurations such as FIFOs and stream buffers; and so on. The exploration space of different possible memory architectures is vast, and there have been attempts to automate or semiautomate this exploration process .

Traditionally, memory issues have been separately addressed by disparate research groups: computer architects, compiler writers, and the CAD/embedded systems community. Memory architectures have been studied extensively by computer architects. Memory hierarchy, implemented with cache structures, has received considerable attention from researchers.

Cache parameters such as line size, associativity, and write policy, and their impact on typical applications have been studied in detail. Recent studies have also quantified the impact of dynamic memory (DRAM) architectures. Since architectures are closely associated with compilation issues, compiler researchers have addressed the problem of generating efficient code for a given memory architecture by appropriately transforming the program and data. Compiler transformations such as blocking/tiling are examples of such optimizations. Note that many of these designs/optimizations need a fresh look when an MPSoC-based system is under consideration.

Finally, researchers in the area of computer-assisted design (CAD)/embedded systems have typically employed memory structures such as register files, static memory (SRAMs), and DRAMs in generating application-specific designs. Although the optimizations identified by the architecture and compiler community are still applicable in MPSoC design, the architectural flexibility available in the new context adds a new exploration dimension.

To be really effective, these optimizations need to be integrated into the design process as well as enhanced with new optimization and estimation techniques. In this section, we first present an overview of different memory architectures and then survey some of the ways in which these architectures have been customized.

Types of Architectures
Before we get into deails on how to customize memory architectures for MPSoCs, let's review some common architectures that can be used in memories of MPSoC-based systems, with particular emphasis on the more unconventional ones. Note that MPSoC-based systems can contain both hardware-managed and software-managed memory components.

Cache. As application-specific systems became large enough to use a processor core as a building block, the natural extension in terms of memory architecture was the addition of instruction and data caches. Since the organization of typical caches is well known, we omit the basic explanation. Caches have many parameters (e.g., line size, associativity) that can be customized for a given application. Some of these customizations are described later in this series.

Scratch Pad Memory. An MPSoC designer is not restricted to using only a traditional cached memory architecture. S/he can use unconventional architectural variations that suit the specific application under consideration. One such design alternative is scratch pad memory (SPM).

SPM refers to data memory residing on-chip that is mapped into an address space disjoint from the off-chip memory but connected to the same address and data busses. Both the cache and SPM (usually SRAM) allow fast access to their residing data, whereas an access to the off-chip memory requires relatively longer access times.

The main difference between the scratch pad SRAM and a conventional data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to cache misses. The concept of SPM is an important architectural consideration in modern embedded systems, in which advances in embedded DRAM technology have made it possible to combine DRAM and logic on the same chip.

Since data stored in embedded DRAM can be accessed much faster and in a more power-efficient manner than that in off-chip DRAM, a related optimization problem that arises in this context is how to identify critical data in an application, for storage in on-chip memory.

Figure 9-1  below shows an SPM from the perspective of a single processor, with the parts enclosed in the dotted rectangle implemented in one chip, interfacing with an off-chip memory, usually realized with DRAM. The address and data busses from the CPU core connect to the data cache, SPM, and external memory interface (EMI) blocks.

On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the SPM determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the signal S_HIT. If both the cache and SRAM report miss, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM.

Figure 9-1 Block diagram of a core with SPM.

One possible data address space mapping for this memory configuration is shown in Figure 9-2 below, for a sample addressable memory of size N data words. Memory addresses 0...(P - 1) map into the on-chip SPM and have a single cycle access time. Memory addresses P...(N - 1) map into the off-chip DRAM and are accessed by the CPU through the data cache.

A cache hit for an address in the range P...N - 1 results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip and cache memory, may result in a delay of, say, 50 to 100 processor cycles for an embedded processor operating in the range of 100 to 400MHz. We illustrate the use of this SPM with the following example..

Figure 9-2. Dividing data address space between SPM and off-chip memory.

Example 1. A small (4 x 4) matrix of coefficients (mask) slides over the input image (source) covering a different 4 x 4 region in each iteration of y, as shown in Figure 9-3 below. In each iteration, the coefficients of the mask are combined with the region of the image currently covered, to obtain a weighted average, and the result (acc) is assigned to the pixel of the output array (dest) in the center of the covered region.

If the two arrays source and mask were to be accessed through the data cache, the performance would be affected by cache conflicts. This problem can be solved by storing the small mask array in the SPM. This assignment eliminates all data conflicts in the data cache - the data cache is now used for memory accesses to source, which are very regular. Storing mask on-chip ensures that frequently accessed data are never ejected off-chip, thereby significantly improving the memory performance and energy dissipation.



Figure 9-3 (TOP) Procedure CONV. (BOTTOM) Memory access pattern in CONV.

Another proposed memory assignment exploits this architecture by first determining a ttal conflict factor (TCF) for each array based on the access frequency and possibility of conflict with other arrays and then considering the arrays for assignment to SPM in the order of TCF/(array size), giving priority to high-conflict/small-size arrays.

Dynamic data transfers. In the above formulation, the data stored in the SPM were statically determined. This idea can be extended to the case of dynamic data storage. However, since there is no automatic hardware-controlled mechanism to transfer data between the SPM and the main memory, such transfers have to be explicitly managed by the compiler.

In another proposed technique, the compiler uses a tiling-like transformation, moves the data tiles (blocks) into SPM (for processing), and then moves it back to main memory after the computation is complete.

Storing instructions in SPM. An SPM storing a small amount of frequently accessed data on-chip has an equivalent in the instruction cache. The idea of using a small buffer to store blocks of frequently used instructions was first introduced by Jouppi.. Recent extensions of this strategy are the decoded instruction buffer and the L-cache.

Researchers have also examined the possibility of storing both instructions and data in the SPM. In one proposed formulation, the frequency of access for both data and program blocks is analyzed and the most frequently occurring ones among them are assigned to the SPM. Chen et al. describe a compiler-directed management strategy for an instruction SPM.

DRAM
DRAMs have been used in a processor-based environment for quite some time, but the context of their use in embedded systems - both from a hardware synthesis viewpoint and from an embedded compiler viewpoint - have been investigated relatively recently.

DRAMs offer better memory performance through the use of specialized access modes that exploit the internal structure and steering/buffering/banking of data within these memories. Explicit modeling of these specialized access modes allows the incorporation of such high-performance access modes into synthesis and compilation frameworks.

New synthesis and compilation techniques have been developed that employ detailed knowledge of the DRAM access modes and exploit advance knowledge of an embedded system's application to improve system performance and power.

A typical DRAM memory address is internally split into a row address consisting of the most significant bits and a column address consisting of the least significant bits. The row address selects a page from the core storage, and the column address selects an offset within the page to arrive at the desired word.

When an address is presented to the memory during a READ operation, the entire page addressed by the row address is read into the page buffer, in anticipation of spatial locality. If future accesses are to the same page, then there is no need to access the main storage area since it can just be read off the page buffer, which acts like a cache. Thus, subsequent accesses to the same page are very fast.

A scheme for modeling the various memory access modes and using them to perform useful optimizations in the context of behavioral synthesis has been described. The main observation is that the input behavior's memory access patterns can potentially exploit the page mode (or other specialized access mode) features of the DRAM.

The key idea is the representation of these specialized access modes as graph primitives that model individual DRAM access modes such as row decode, column decode, precharge, and so on; each DRAM family's specialized access modes are then represented using a composition of these graph primitives to fit the desired access mode protocol.

These composite graphs can then be scheduled together with the rest of the application behavior, both in the context of synthesis and for code compilation. For instance, some additional DRAM-specific optimizations include:

1. Read-modify-write (R-M-W) optimization that takes advantage of the R-M-W mode in modern DRAMs, which provides support for a more efficient realization of the common case in which a specific address is read, the data are involved in some computation, and then the output is written back to the same location.

2. Hoisting, whereby the row-decode node is scheduled ahead of a conditional node if the first memory access in both branches is on the same page.

3. Unrolling optimization in the context of supporting the page mode accesses.

Synchronous DRAM
As DRAM architectures evolve, new challenges are presented to the automatic synthesis of embedded systems based on these memories. Synchronous DRAM represents an architectural advance that presents another optimization opportunity: multiple memory banks.

The core memory storage is divided into multiple banks, each with its own independent page buffer, so that two separate memory pages can be simultaneously active in the multiple page buffers.

There a number of problems modeling the access modes of synchronous DRAMs, including:

Memory bank assignment can be performed by creating an interference graph between arrays and partitioning it into subgraphs so that data in each part are assigned to a different memory bank. The bank assignment algorithm is related to techniques that address memory assignment for DSP processors such as the Motorola 56000, which has a dual-bank internal memory/register file.

The bank assignment problem is targeted at scalar variables and is solved in conjunction with register allocation by building a constraint graph that models the data transfer possibilities between registers and memories followed by a simulated annealing step. Note that such techniques can be particularly suitable for MPSoCs in which a single reduced instruction set computing (RISC)-like core manages multiple DSP-like slave processors.

Chang and Lin approach the SDRAM bank assignment problem by first constructing an array distance table. This table stores the distance in the DFG (dataflow graph) between each pair of arrays in the specification. A short distance indicates a strong correlation, possibly indicating that they might be, for instance, two inputs of the same operation and hence would benefit from being assigned to separate banks. The bank assignment is finally performed by considering array pairs in increasing order of their array distance information.

Whereas the previous discussion has focused primarily on the context of hardware synthesis, similar ideas have been employed to exploit aggressively the memory access protocols for compilers. In the traditional approach of compiler/architecture co-design, the memory subsystem was separated from the microarchitecture; the compiler typically dealt with memory operations using the abstractions of memory loads and stores, with the architecture (e.g., the memory controller) providing the interface to the (typically yet unknown) family of DRAMs and other memory devices that would deliver the desired data.

However, in an embedded system, the system architect has advance knowledge of the specific memories (e.g., DRAMs) used; thus we can employ memory-aware compilation techniques that exploit the specific access modes in the DRAM protocol to perform better code scheduling. In a similar manner, it is possible for the code scheduler to employ global scheduling techniques to hide potential memory latencies using knowledge of the memory access protocols and in effect, improve the ability of the memory controller to boost system performance.

Special Purpose Memories
In addition to the general memories such as caches, and memories specific to embedded systems, such as SPMs, there exist various other types of custom memories that implement specific access protocols. Such memories include memory implementing last-in, first-out protocol (LIFO), memory implementing queue or first-in, first-out protocol (FIFO), and content-addressable memory (CAM). Typically, CAMs are used in search applications, LIFOs are used in microcontrollers, and FIFOs are used in network chips.

Next in Part 2: Customization of memory architectures

This series of articles is based on copyrighted material submitted by Nikil Dutt and Mahmut Kandemir to "Multiprocessor Systems-On-Chipsp  edited byWayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.

Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of computer science for Embedded Computer Systems at the University of California, Irvine.