CMP EMBEDDED.COM


Providing memory system and compiler support for MPSoC designs: Part 1
Memory Architectures



Embedded.com

Types of Architectures
Before we get into details on how to customize memory architectures for MPSoCs, let's review some common architectures that can be used in the memories of MPSoC-based systems, with particular emphasis on the more unconventional ones. Note that MPSoC-based systems can contain both hardware-managed and software-managed memory components.

Cache. As application-specific systems became large enough to use a processor core as a building block, the natural extension in terms of memory architecture was the addition of instruction and data caches. Since the organization of typical caches is well known, we omit the basic explanation. Caches have many parameters (e.g., line size, associativity) that can be customized for a given application. Some of these customizations are described later in this series.

Scratch Pad Memory. An MPSoC designer is not restricted to using only a traditional cached memory architecture. S/he can use unconventional architectural variations that suit the specific application under consideration. One such design alternative is scratch pad memory (SPM).

SPM refers to data memory residing on-chip that is mapped into an address space disjoint from the off-chip memory but connected to the same address and data busses. Both the cache and SPM (usually SRAM) allow fast access to their residing data, whereas an access to the off-chip memory requires relatively longer access times.

The main difference between the scratch pad SRAM and a conventional data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to cache misses. The concept of SPM is an important architectural consideration in modern embedded systems, in which advances in embedded DRAM technology have made it possible to combine DRAM and logic on the same chip.

Since data stored in embedded DRAM can be accessed much faster and in a more power-efficient manner than that in off-chip DRAM, a related optimization problem that arises in this context is how to identify critical data in an application, for storage in on-chip memory.

Figure 9-1 below shows an SPM from the perspective of a single processor, with the parts enclosed in the dotted rectangle implemented in one chip, interfacing with an off-chip memory, usually realized with DRAM. The address and data busses from the CPU core connect to the data cache, SPM, and external memory interface (EMI) blocks.

On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the SPM determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the S_HIT signal. If both the cache and the SRAM report a miss, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM.

Figure 9-1. Block diagram of a core with SPM.
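The hit/miss signalling described above can be modeled in a few lines. This is only a behavioral sketch, not the actual interface logic: the class name `MemorySystem`, the line size `LINE_WORDS`, and the assumption that the SPM occupies the lowest `spm_words` addresses are all illustrative choices, and cache eviction is not modeled.

```python
from dataclasses import dataclass, field

LINE_WORDS = 4  # assumed cache line size in words (hypothetical value)


@dataclass
class MemorySystem:
    spm_words: int  # assumed: word addresses 0..spm_words-1 map into the SPM
    cached_lines: set = field(default_factory=set)  # line numbers held in the cache

    def access(self, addr):
        """Return (s_hit, c_hit, emi_transfer) for one CPU data access."""
        # The SPM interface claims the access if the address maps on-chip.
        s_hit = addr < self.spm_words
        line = addr // LINE_WORDS
        # The cache reports a hit only for addresses outside the SPM range.
        c_hit = (not s_hit) and line in self.cached_lines
        # Both units report a miss: the EMI moves one cache line
        # between DRAM and the cache (no eviction is modeled here).
        emi_transfer = not (s_hit or c_hit)
        if emi_transfer:
            self.cached_lines.add(line)
        return s_hit, c_hit, emi_transfer
```

For instance, with `MemorySystem(spm_words=64)`, an access to address 10 asserts only S_HIT, while the first access to address 100 triggers an EMI line transfer and the next access to the same line asserts C_HIT.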

One possible data address space mapping for this memory configuration is shown in Figure 9-2 below, for a sample addressable memory of size N data words. Memory addresses 0...(P - 1) map into the on-chip SPM and have a single-cycle access time. Memory addresses P...(N - 1) map into the off-chip DRAM and are accessed by the CPU through the data cache.

A cache hit for an address in the range P...(N - 1) results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip and cache memory, may result in a delay of, say, 50 to 100 processor cycles for an embedded processor operating in the range of 100 to 400 MHz. We illustrate the use of this SPM with Example 1 below.

Figure 9-2. Dividing data address space between SPM and off-chip memory.
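To get a feel for what these latencies imply, the following back-of-the-envelope model estimates the average cycles per data access. It is a sketch under simplifying assumptions: the function name and parameters are hypothetical, SPM accesses and cache hits each cost one cycle as stated above, and `miss_cycles` is treated as the total cost of a missing access.

```python
def average_access_cycles(spm_fraction, cache_hit_rate, miss_cycles, hit_cycles=1):
    """Estimate average cycles per data access.

    spm_fraction   -- fraction of accesses that fall in the SPM range 0..P-1
    cache_hit_rate -- hit rate of the remaining accesses, served via the cache
    miss_cycles    -- assumed total cost of a cache miss (e.g. 50 to 100 cycles)
    """
    if not (0.0 <= spm_fraction <= 1.0 and 0.0 <= cache_hit_rate <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    # Cost of an access that goes through the data cache.
    cached = cache_hit_rate * hit_cycles + (1.0 - cache_hit_rate) * miss_cycles
    # SPM accesses always cost a single cycle.
    return spm_fraction * 1 + (1.0 - spm_fraction) * cached
```

Under this model, every access moved from a poorly cached array into the SPM removes that array's miss cost from the average, which is why the choice of what to place in the SPM matters.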

Example 1. A small (4 x 4) matrix of coefficients (mask) slides over the input image (source) covering a different 4 x 4 region in each iteration of y, as shown in Figure 9-3 below. In each iteration, the coefficients of the mask are combined with the region of the image currently covered, to obtain a weighted average, and the result (acc) is assigned to the pixel of the output array (dest) in the center of the covered region.
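The loop nest described in this example can be sketched as follows. This is a reconstruction from the prose only, not the procedure as printed in Figure 9-3: normalizing by the sum of the mask weights and using offset `m // 2` as the "center" of an even-sized window are assumptions made for illustration.

```python
def conv(source, mask):
    """Slide an m x m mask over an n x n image and return the output image."""
    n, m = len(source), len(mask)
    weight_sum = sum(sum(row) for row in mask)  # assumed normalization
    dest = [[0] * n for _ in range(n)]          # border pixels stay 0
    for x in range(n - m + 1):
        for y in range(n - m + 1):              # a new m x m region per y
            acc = 0
            for i in range(m):
                for j in range(m):
                    acc += source[x + i][y + j] * mask[i][j]
            # Assumed center offset; a 4 x 4 window has no exact center.
            dest[x + m // 2][y + m // 2] = acc / weight_sum
    return dest
```

Note that every element of `mask` is read once per window position, while each `source` element is read only while a window covers it, which is what makes `mask` the more heavily reused array.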

If the two arrays source and mask were both accessed through the data cache, performance would suffer from cache conflicts between them. This problem can be solved by storing the small mask array in the SPM. This assignment eliminates all data conflicts in the data cache; the cache is now used only for memory accesses to source, which are very regular. Storing mask on-chip ensures that frequently accessed data are never evicted to off-chip memory, thereby significantly improving memory performance and reducing energy dissipation.
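The effect of such conflicts can be illustrated with a toy simulation. The sketch below counts misses in a small direct-mapped cache for the read accesses of the convolution loop, once with mask cached and once with mask held in the SPM. All sizes and base addresses are hypothetical and were chosen so that mask aliases some rows of source; writes to dest are ignored.

```python
LINE_WORDS = 4     # words per cache line (assumed)
NUM_SETS = 16      # direct-mapped cache with 16 lines (assumed)
N, M = 16, 4       # source is N x N words, mask is M x M words
SOURCE_BASE = 0
MASK_BASE = N * N  # mask placed right after source, so it aliases source rows


def conv_trace(mask_in_spm):
    """Yield the word addresses that go through the data cache."""
    for x in range(N - M + 1):
        for y in range(N - M + 1):
            for i in range(M):
                for j in range(M):
                    yield SOURCE_BASE + (x + i) * N + (y + j)
                    if not mask_in_spm:
                        # mask is off-chip, so this access is cached too
                        yield MASK_BASE + i * M + j


def count_misses(trace):
    """Count misses of a direct-mapped cache over an address trace."""
    tags = [None] * NUM_SETS
    misses = 0
    for addr in trace:
        line = addr // LINE_WORDS
        index = line % NUM_SETS
        if tags[index] != line:
            tags[index] = line  # fetch the line, evicting the old one
            misses += 1
    return misses


misses_cached = count_misses(conv_trace(mask_in_spm=False))
misses_spm = count_misses(conv_trace(mask_in_spm=True))
```

With this layout, `misses_spm` is lower than `misses_cached`, because the mask accesses no longer evict source lines that map to the same cache sets. The exact counts depend entirely on the assumed sizes and addresses.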



Figure 9-3. (Top) Procedure CONV. (Bottom) Memory access pattern in CONV.

Another proposed memory assignment exploits this architecture by first determining a total conflict factor (TCF) for each array, based on its access frequency and the possibility of conflict with other arrays, and then considering the arrays for assignment to SPM in the order of TCF/(array size), giving priority to high-conflict/small-size arrays.
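The ordering step can be sketched as a simple greedy pass. How the TCF itself is computed is not specified here, so the sketch takes it as an input; the function name, the tuple layout, and the rule of skipping arrays that no longer fit are assumptions for illustration rather than the published algorithm.

```python
def assign_to_spm(arrays, spm_capacity):
    """Pick arrays for the SPM in decreasing order of TCF / size.

    arrays       -- list of (name, size_in_words, tcf) tuples, size > 0
    spm_capacity -- SPM size in words
    """
    ranked = sorted(arrays, key=lambda a: a[2] / a[1], reverse=True)
    chosen, free = [], spm_capacity
    for name, size, _tcf in ranked:
        if size <= free:  # assumed policy: skip arrays that do not fit
            chosen.append(name)
            free -= size
    return chosen
```

For example, with made-up values `[("source", 4096, 2000), ("mask", 16, 900), ("lut", 256, 300)]` and a 512-word SPM, the small, high-conflict `mask` is ranked first, `lut` second, and `source` is left off-chip because it does not fit.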

Dynamic data transfers. In the above formulation, the data stored in the SPM were statically determined. This idea can be extended to the case of dynamic data storage. However, since there is no automatic hardware-controlled mechanism to transfer data between the SPM and the main memory, such transfers have to be explicitly managed by the compiler.

In another proposed technique, the compiler uses a tiling-like transformation, moves the data tiles (blocks) into the SPM for processing, and then moves them back to main memory after the computation is complete.
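The copy-in, compute, copy-out pattern that such a compiler would generate can be sketched as follows. This is a one-dimensional illustration with hypothetical names: `main_memory` and `spm` are plain Python lists standing in for the off-chip array and the SPM buffer, and `compute` stands in for the loop body.

```python
def process_tiled(main_memory, tile_words, compute):
    """Apply compute() to every element, one SPM-sized tile at a time."""
    spm = [0] * tile_words  # stands in for the on-chip scratch pad buffer
    for start in range(0, len(main_memory), tile_words):
        n = min(tile_words, len(main_memory) - start)  # last tile may be short
        spm[:n] = main_memory[start:start + n]         # explicit copy-in
        for k in range(n):
            spm[k] = compute(spm[k])                   # all work hits the SPM
        main_memory[start:start + n] = spm[:n]         # explicit copy-out
    return main_memory
```

The transfers are explicit statements in the generated code, which is the key difference from a cache: nothing moves between main memory and the SPM unless the compiler emits the copy.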

Storing instructions in SPM. An SPM storing a small amount of frequently accessed data on-chip has an equivalent in the instruction cache. The idea of using a small buffer to store blocks of frequently used instructions was first introduced by Jouppi. Recent extensions of this strategy are the decoded instruction buffer and the L-cache.

Researchers have also examined the possibility of storing both instructions and data in the SPM. In one proposed formulation, the frequency of access for both data and program blocks is analyzed and the most frequently occurring ones among them are assigned to the SPM. Chen et al. describe a compiler-directed management strategy for an instruction SPM.
