Providing memory system and compiler support for MPSoc designs: Part 2
Customization of memory architectures



Embedded.com
Following the review and assessment of various memory architectures in Part 1 of this series, we now survey some research efforts that address the exploration space involving on-chip memories. A number of distinct memory architectures could be devised to exploit different application-specific memory access patterns efficiently.

Even if we restrict the scope of the architecture to those involving on-chip memory only, the exploration space of different possible configurations is too large, making it infeasible to simulate exhaustively the performance and energy characteristics of the application for each configuration. Thus, exploration tools are necessary for rapidly evaluating the impact of several candidate architectures. Such tools can be of great utility to a system designer by giving fast initial feedback on a wide range of memory architectures.

Cache
Two of the most important aspects of data caches that can be customized for an application are: (1) the cache line size and (2) the cache size. In one study, the cache line size is customized for an application by using an estimation technique that predicts the memory access performance, that is, the total number of processor cycles required for all the memory accesses in the application.

There is a tradeoff in sizing the cache line. If the memory accesses are very regular and consecutive, i.e., exhibit spatial locality, a longer cache line is desirable, since it minimizes the number of off-chip accesses and exploits the locality by prefetching elements that will be needed in the immediate future.

On the other hand, if the memory accesses are irregular, or have large strides, a shorter cache line is desirable, as this reduces off-chip memory traffic by not bringing unnecessary data into the cache. The maximum size of a cache line is the DRAM page size.

The estimation technique uses data reuse analysis to predict the total number of cache hits and misses inside loop nests so that spatial locality is incorporated into the estimation. An estimate of the impact of conflict misses is also incorporated. The estimation is carried out for the different candidate line sizes, and the best line size is selected for the cache.
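The line-size tradeoff can be illustrated with a deliberately simplified sketch. It is not the estimation technique from the study: it ignores conflict misses and data reuse analysis, and models each array reference as a stream with a fixed stride. The function names, the cycle costs, and the candidate line sizes are all assumptions chosen for illustration.

```python
from math import ceil

def estimate_cycles(streams, line_size, hit_cycles=1,
                    miss_latency=10, bytes_per_cycle=4):
    """Rough cycle estimate for a set of strided access streams.

    streams: list of (num_accesses, stride_bytes) pairs (hypothetical input).
    The cost parameters are illustrative assumptions, not measured values.
    """
    # A longer line costs more to fetch from off-chip memory.
    miss_penalty = miss_latency + line_size // bytes_per_cycle
    total = 0
    for num_accesses, stride in streams:
        # Fraction of accesses that land on a line not yet fetched.
        miss_ratio = min(1.0, stride / line_size)
        misses = ceil(num_accesses * miss_ratio)
        total += num_accesses * hit_cycles + misses * miss_penalty
    return total

def best_line_size(streams, candidates=(4, 8, 16, 32, 64, 128)):
    """Evaluate every candidate line size and keep the cheapest one."""
    return min(candidates, key=lambda size: estimate_cycles(streams, size))
```

Under these assumptions, a consecutive stream such as `[(1000, 4)]` favors the longest candidate line, while a large-stride stream such as `[(1000, 256)]` favors the shortest, which mirrors the tradeoff described above.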

Scratch Pad Memory
MemExplore is an exploration framework for optimizing the on-chip data memory organization. It addresses the following problem: given a certain amount of on-chip memory space, partition it into data cache and SPM so that the total access time and energy dissipation are minimized, i.e., the number of accesses to off-chip memory is minimized.

In this formulation, an on-chip memory architecture is defined as a combination of the total size of on-chip memory used for data storage and the partitioning of this on-chip memory into: scratch memory, characterized by its size; data cache, characterized by the cache size; and the cache line size. For each candidate on-chip memory size T, the technique considers different divisions of T into cache (size C) and SPM (size S = T - C), selecting only powers of 2 for C.

For the data assigned to off-chip memory (and hence accessed through the cache), the memory access performance is estimated by combining an analysis of the array access patterns in the application with an approximate model of the cache behavior. The result of the estimation is the expected number of processor cycles required for all the memory accesses in the application. For each T, the (C, L) pair that is estimated to maximize performance is selected (L denotes the line size of the cache).
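The exploration loop described above can be sketched as follows. This is only an outline of the sweep over T, C, and L. The `estimate_cycles` callback stands in for the analytical performance estimate, whose details are not given here, and the function name `explore` and its signature are assumptions.

```python
def explore(total_sizes, line_sizes, estimate_cycles):
    """For each total on-chip size T, return the best (C, L, cycles).

    estimate_cycles(C, L, S) is a hypothetical callback returning the
    estimated processor cycles for cache size C, line size L, SPM size S.
    """
    best = {}
    for T in total_sizes:
        candidates = []
        C = 1
        while C <= T:              # cache sizes restricted to powers of two
            S = T - C              # the remainder becomes scratch pad memory
            for L in line_sizes:
                if L <= C:         # a line cannot be larger than the cache
                    candidates.append((estimate_cycles(C, L, S), C, L))
            C *= 2
        cycles, C, L = min(candidates)   # fewest estimated cycles wins
        best[T] = (C, L, cycles)
    return best
```

Calling `explore` with several values of T yields one selected configuration per total size, which is the kind of data an exploration curve over on-chip memory size would be drawn from.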

Example 2. Typical exploration curves of the MemExplore algorithm are shown in Figure 9-4 below. Figure 9-4a shows that the ideal division of a 2KB on-chip space is 1KB of SPM and 1KB of data cache. Figure 9-4b shows that very little performance improvement is observed beyond a total on-chip memory size of 2KB.

The exploration curves of Figure 9-4 are generated from fast analytical estimates, which are three orders of magnitude faster than actual simulations and are independent of data size. This estimation capability is important in the initial stages of memory design, in which the number of possible architectures is large, and simulation of each architecture is prohibitively expensive.

Figure 9-4 Histogram example. (a) Variation of memory performance with different mixes of cache and SPM, for total on-chip memory of 2KB. (b) Variation of memory performance with total on-chip memory space.

DRAM
The presence of embedded DRAMs adds several new dimensions to traditional architecture exploration. One interesting aspect of DRAM architecture that can be customized for an application is the banking structure.

Figure 9-5a illustrates a common problem with the single-bank DRAM architecture. If we have a loop that accesses in succession data from three large arrays A, B, and C, each of which is much larger than a page, then each memory access leads to a fresh page being read from the storage, effectively canceling the benefits of the page buffer. This page buffer interference problem cannot be avoided if a fixed-architecture DRAM is used.

However, an elegant solution to the problem is available if the banking configuration of the DRAM can be customized for the application. Thus, in the example of Figure 9-5 below, the arrays can be assigned to separate banks, as shown in Figure 9-5b. Since each bank has its own private page buffer, there is no interference between the arrays, and the memory accesses do not represent a bottleneck.

Figure 9-5 (a) Arrays mapped to a single-bank memory. (b) A three-bank memory architecture.
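The page buffer interference effect can be reproduced with a small trace-driven sketch. The page size, element size, array size, and function names are assumptions made for illustration, and each bank is modeled as holding exactly one open page.

```python
def count_page_misses(accesses, bank_of, page_size=1024):
    """Count page-buffer misses for a trace of (array_name, byte_address).

    bank_of maps each array name to a bank; every bank keeps one open page.
    """
    open_page = {}               # bank -> page currently in its page buffer
    misses = 0
    for array, addr in accesses:
        bank = bank_of[array]
        page = addr // page_size
        if open_page.get(bank) != page:
            misses += 1          # a fresh page must be read from storage
            open_page[bank] = page
    return misses

def interleaved_trace(n, elem_size=4, array_bytes=1 << 16):
    """Loop that reads A[i], B[i], C[i] in succession for i in range(n)."""
    base = {"A": 0, "B": array_bytes, "C": 2 * array_bytes}
    for i in range(n):
        for name in ("A", "B", "C"):
            yield name, base[name] + i * elem_size

single_bank = {"A": 0, "B": 0, "C": 0}
three_banks = {"A": 0, "B": 1, "C": 2}
```

With `n = 512`, the single-bank mapping misses on every one of the 1536 accesses, because consecutive accesses always fall in different pages. The three-bank mapping misses only when an array crosses a page boundary, giving 6 misses in this model.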

In order to customize the banking structure for an application, we need to solve the memory bank assignment problem - determine an optimal banking structure (number of banks) and determine the assignment of each array variable into the banks such that the number of page misses is minimized. This objective optimizes both the performance and the energy dissipation of the memory subsystem.

In one approach, the memory bank customization problem is solved by modeling the assignment as a partitioning problem, i.e., partitioning a given set of nodes into a given number of groups such that a given criterion (bank misses in this case) is optimized. The partitioning proceeds by associating a cost with assigning two arrays to the same bank, determined by the number of accesses to the arrays and the loop count.

If the arrays are accessed in the same loop, then the cost is high, thereby discouraging the partitioning algorithm from assigning them to the same bank. On the other hand, if two arrays are never accessed in the same loop, then they are candidates for assignment into the same bank. This pairing is associated with a low cost, guiding the partitioner to assign the arrays together.
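A minimal sketch of this cost-guided partitioning is shown below. It uses a simple greedy heuristic as a stand-in for the partitioning algorithm, which is not specified here, and it takes the loop iteration count as the pairwise cost. The function names and the input format are assumptions.

```python
from itertools import combinations

def pair_costs(loops):
    """Cost of placing two arrays in the same bank.

    loops: list of (iteration_count, set_of_arrays_accessed_in_that_loop).
    Arrays that share a loop accumulate that loop's iteration count.
    """
    cost = {}
    for count, arrays in loops:
        for a, b in combinations(sorted(arrays), 2):
            cost[(a, b)] = cost.get((a, b), 0) + count
    return cost

def assign_banks(arrays, loops, num_banks):
    """Greedy heuristic: place each array in the bank where it adds least cost."""
    cost = pair_costs(loops)
    banks = [[] for _ in range(num_banks)]
    for arr in arrays:
        def added_cost(bank):
            return sum(cost.get(tuple(sorted((arr, other))), 0)
                       for other in bank)
        min(banks, key=added_cost).append(arr)
    return banks
```

For example, with loops `[(1000, {"A", "B"}), (500, {"C", "D"})]` and two banks, the heuristic separates A from B and C from D, and pairs arrays that never share a loop.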

Multiple SRAMs
In a custom memory architecture, the designer can choose memory parameters such as the number of memories and the size and number of ports on each memory. The number of memory modules used in a design has a significant impact on the access times and power consumption.

A single large monolithic memory holding all the data is more expensive in terms of both access time and energy dissipation than multiple memories of smaller size. However, the other extreme, in which all array data are stored in distinct memory modules, is also expensive, and the optimal allocation lies somewhere in between.

The memory allocation problem is closely linked to the problem of assigning array data to the individual memory modules. Arrays need to be clustered into memories based on their accesses. The clustering can be vertical (different arrays occupy different memory words) or horizontal (different arrays occupy different bit positions within the same word).

Parameters such as bit width, word count, and number of ports can be included in this analysis. The required memory bandwidth (number of ports allowing simultaneous access) can be formally determined by first building a conflict graph of the array accesses and then storing in the same memory module the arrays that do not conflict.
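The conflict-graph step can be sketched with a greedy graph coloring, in which each color corresponds to one memory module. Greedy coloring is a heuristic chosen here for brevity and is not necessarily the method used in the work surveyed. The function name and input format are assumptions.

```python
def allocate_modules(arrays, conflicts):
    """Assign arrays to memory modules so conflicting arrays are separated.

    conflicts: set of frozenset({a, b}) for arrays that must be accessed
    simultaneously and therefore cannot share a single-port module.
    Returns a dict mapping each array name to a module index.
    """
    module_of = {}
    for arr in arrays:
        # Modules already taken by arrays that conflict with this one.
        used = {module_of[other] for other in module_of
                if frozenset((arr, other)) in conflicts}
        module = 0
        while module in used:    # pick the lowest-numbered free module
            module += 1
        module_of[arr] = module
    return module_of
```

With conflicts between A and B and between B and C, arrays A and C share module 0 and B is placed in module 1, so two modules suffice in this small case.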

Special Purpose Memories
Special purpose memories such as stacks (LIFO), queues (FIFO), frame buffers, streaming buffers, and so on can be utilized when one is customizing the memory architecture for an application. Indeed, analysis of many large applications shows that a significant number of the memory references in data-intensive applications are made by a surprisingly small number of lines of code.

Thus it is possible to customize the memory subsystem by tuning the memories for these segments of code, with the goal of improving performance and reducing power dissipation. In one approach used by Dutt et al., the application is first analyzed and its different access patterns are identified. Data for the most critical access patterns are assigned to memory modules that best fit the access pattern profiles. The system designer can then evaluate different cost/performance/power profiles for different realizations of the memory subsystem.

Processor-Memory Co-exploration
In many embedded applications, it is also critical to explore both processor and memory architectures simultaneously, to capture the synergy between them.

Datapath width and memory size. The CPU's bit width is an additional parameter that can be tuned during architectural exploration of customizable processors. Shackleford et al. studied the relationship between the width of the processor data path and the memory subsystem. This relationship is important when different data types with different sizes are used in the application.

The key observation is that as the datapath width is decreased, the data memory size decreases because less space is wasted, but the required instruction memory capacity might increase. For example, storing 3-bit data in a 4-bit word (instead of an 8-bit word) reduces the demand for data memory space.

Conversely, storing 7-bit data in an 8-bit word requires only one instruction to access it, but two instructions are required if a 4-bit datapath is used. The study used a RAM and ROM cost model to evaluate the cost of candidate bit widths in a combined CPU-memory exploration.
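The arithmetic behind these two examples can be written out in a short sketch. It assumes that each data item occupies a whole number of words and that one instruction is needed per word accessed. The function name is hypothetical, and the sketch does not reproduce the RAM and ROM cost model itself.

```python
from math import ceil

def storage_and_accesses(data_bits, word_bits, num_items):
    """Return (data memory bits, instructions per item access).

    Assumes one instruction per word accessed and no packing of
    several items into one word.
    """
    words_per_item = ceil(data_bits / word_bits)
    data_memory_bits = num_items * words_per_item * word_bits
    return data_memory_bits, words_per_item
```

For 100 items of 3-bit data, a 4-bit word needs 400 bits and an 8-bit word needs 800 bits, each with one access per item. For 100 items of 7-bit data, both widths need 800 bits, but the 4-bit datapath needs two accesses per item instead of one.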

Architectural description language-driven co-exploration. Processor architecture description languages (ADLs) have been developed to allow for a language-driven co-exploration and software toolkit generation approach. Currently most ADLs assume an implicit/default memory organization or are limited to specifying the characteristics of a traditional memory hierarchy. Since embedded systems may contain nontraditional memory organizations, there is a great need to model explicitly the memory subsystem for an ADL-driven exploration approach.

One interesting approach describes the use of the EXPRESSION ADL to drive memory architecture exploration. The EXPRESSION ADL description of the processor-memory architecture is used to capture the memory architecture explicitly, including the characteristics of the memory modules (such as caches, DRAMs, SRAMs, and DMAs) and the parallelism and pipelining present in the memory architecture (e.g., resources used, timings, access modes).

Each such explicit memory architecture description is then used to generate automatically the information the compiler needs to use the features of the memory architecture efficiently, and to generate a memory simulator. Together these give the designer feedback on the match among the application, the compiler, and the memory architecture.
