Following the review and assessment of various memory architectures in
Part 1 of this series, we now survey some research efforts that address
the exploration space involving on-chip memories. A number of distinct
memory architectures could be devised to exploit different
application-specific memory access patterns efficiently.
Even if we restrict the scope to architectures involving on-chip memory
only, the space of possible configurations is too large to simulate
exhaustively the performance and energy characteristics of the
application for each configuration. Thus, exploration tools are
necessary for rapidly evaluating the impact of several candidate
architectures. Such tools can be of great utility to a system designer
by giving fast initial feedback on a wide range of
memory architectures.
Cache
Two of the most important aspects of data caches that can be customized
for an application are: (1) the cache line size and (2) the cache size.
In the study referenced above, the cache line size is customized for an
application by using an estimation technique that predicts the memory
access performance, that is, the total number of processor cycles
required for all the memory accesses in the application.
There is a tradeoff in sizing the cache line. If the memory accesses
are very regular and consecutive, i.e., exhibit spatial locality, a
longer cache line is desirable, since it minimizes the number of
off-chip accesses and exploits the locality by prefetching elements
that will be needed in the immediate future.
On the other hand, if the memory accesses are irregular, or have
large strides, a shorter cache line is desirable, as this reduces
off-chip memory traffic by not bringing unnecessary data into the
cache. The maximum size of a cache line is the DRAM page size.
The estimation technique uses data reuse analysis to predict the
total number of cache hits and misses inside loop nests so that spatial
locality is incorporated into the estimation. An estimate of the impact
of conflict misses is also incorporated. The estimation is carried out
for the different candidate line sizes, and the best line size is
selected for the cache.
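The line-size tradeoff can be illustrated with a small sketch. This is not the estimation technique from the study: it is a toy model of a single strided sweep over one array, assuming a cold cache and no conflict misses. The function names and the latency parameters (`hit_cycles`, `miss_latency`, `bytes_per_cycle`) are hypothetical values chosen only for illustration.

```python
from math import ceil

def estimate_cycles(num_accesses, stride_bytes, line_size,
                    hit_cycles=1, miss_latency=20, bytes_per_cycle=4):
    """Toy estimate of memory cycles for one strided sweep over an array.

    Assumptions (illustrative only): cold cache, no conflict misses, and
    every miss transfers one full cache line from off-chip memory.
    """
    # When the stride is smaller than the line, consecutive accesses share a line.
    accesses_per_line = max(1, line_size // stride_bytes)
    misses = ceil(num_accesses / accesses_per_line)
    # A longer line costs more to transfer on each miss.
    miss_penalty = miss_latency + line_size // bytes_per_cycle
    return num_accesses * hit_cycles + misses * miss_penalty

def best_line_size(num_accesses, stride_bytes, candidates=(4, 8, 16, 32, 64)):
    """Evaluate every candidate line size and return the cheapest one."""
    estimates = {size: estimate_cycles(num_accesses, stride_bytes, size)
                 for size in candidates}
    return min(estimates, key=estimates.get), estimates

# Unit-stride sweep over 4-byte elements versus a 64-byte stride.
regular_choice, _ = best_line_size(1024, stride_bytes=4)
strided_choice, _ = best_line_size(1024, stride_bytes=64)
```

Under these assumptions the unit-stride sweep favors the longest candidate line, while the large-stride sweep favors the shortest, which matches the qualitative argument above.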
Scratch Pad Memory
MemExplore is an exploration framework for optimizing the on-chip data
memory organization. It addresses the following problem: given a certain
amount of on-chip memory space, partition it into data cache and scratch
pad memory (SPM) so that the total access time and energy dissipation
are minimized, i.e., the number of accesses to off-chip memory is
minimized.
In this formulation, an on-chip memory architecture is defined as a
combination of the total size of on-chip memory used for data storage
and the partitioning of this on-chip memory into: scratch memory,
characterized by its size; data cache, characterized by the cache size;
and the cache line size. For each candidate on-chip memory size T, the
technique considers different divisions of T into cache (size C) and
SPM (size S = T - C), selecting only powers of 2 for C.
For the data assigned to off-chip memory (and hence accessed through
the cache), the memory access performance is estimated by combining an
analysis of the array access
patterns in the application and an approximate model of the cache
behavior. The result of the estimation is the expected number of
processor cycles required for all the memory accesses in the
application. For each T, the (C, L) pair that is estimated to maximize
performance is selected (L denotes the line size of the cache).
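The enumeration described above can be sketched as a simple search loop. This is only an outline of the search structure, not the MemExplore implementation: the `estimate_cycles` callable stands in for the analytical estimator, and its signature `(cache_size, spm_size, line_size)` is an assumption made here for illustration.

```python
def powers_of_two(limit):
    """Yield 1, 2, 4, ... up to and including limit."""
    p = 1
    while p <= limit:
        yield p
        p *= 2

def explore(total_sizes, line_sizes, estimate_cycles):
    """For each on-chip size T, pick the (C, L) pair with the fewest estimated cycles.

    estimate_cycles(cache_size, spm_size, line_size) is a caller-supplied
    estimator (hypothetical interface); the SPM size is always S = T - C.
    """
    best = {}
    for T in total_sizes:
        options = [
            (estimate_cycles(C, T - C, L), C, L)
            for C in powers_of_two(T)      # cache sizes restricted to powers of 2
            for L in line_sizes
            if L <= C                      # a line cannot be larger than the cache
        ]
        if not options:
            continue                       # no feasible cache for this T
        cycles, C, L = min(options)
        best[T] = {"cache": C, "spm": T - C, "line": L, "cycles": cycles}
    return best
```

Any callable with the assumed signature can be plugged in, so the same loop works with a fast analytical model early in the design and with a slower simulator once the candidate set has been narrowed.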
Example 2.
Typical exploration curves of the MemExplore algorithm are shown in
Figure 9-4 below. Figure 9-4a shows that the ideal division of a 2KB
on-chip space is 1KB of SPM and 1KB of data cache. Figure 9-4b shows
that very little performance improvement is observed beyond a total
on-chip memory size of 2KB.
The exploration curves of Figure 9-4 are generated from fast
analytical estimates, which are three orders of magnitude faster than
actual simulations and are independent of data size. This estimation
capability is important in the initial stages of memory design, in
which the number of possible architectures is large, and simulation of
each architecture is prohibitively expensive.
Figure 9-4 Histogram example. (a) Variation of memory performance with
different mixes of cache and SPM, for total on-chip memory of 2KB. (b)
Variation of memory performance with total on-chip memory space.
DRAM
The presence of embedded DRAMs adds several new dimensions to
traditional architecture exploration. One interesting aspect of DRAM
architecture that can be customized for an application is the banking
structure.
Figure 9-5a illustrates a common problem with the single-bank DRAM
architecture. If we have a loop that accesses in succession data from
three large arrays A, B, and C, each of which is much larger than a
page, then each memory access leads to a fresh page being read from the
storage, effectively canceling the benefits of the page buffer. This
page buffer interference problem cannot be avoided if a
fixed-architecture DRAM is used.
However, an elegant solution to the problem is available if the banking
configuration of the DRAM can be customized for the application.
Thus, in the example of Figure 9-5
below, the arrays can be assigned to separate banks, as shown in
Figure 9-5b. Since each bank has its own private page buffer, there is
no interference between the arrays, and the memory accesses do not
represent a bottleneck.
Figure 9-5 (a) Arrays mapped to a single-bank memory. (b) A three-bank
memory architecture.
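The page buffer interference effect can be reproduced with a very small model. The sketch below is illustrative only: it assumes each bank holds exactly one open page in its page buffer, and the page size, element size, and array base addresses are made-up values.

```python
def count_page_misses(accesses, bank_of, page_size=1024):
    """Count page-buffer misses for a stream of (array_name, address) accesses.

    bank_of maps each array name to a bank id. Each bank is modeled as
    holding a single open page (an assumption of this sketch).
    """
    open_page = {}          # bank id -> page currently in that bank's page buffer
    misses = 0
    for array, addr in accesses:
        bank = bank_of[array]
        page = addr // page_size
        if open_page.get(bank) != page:
            misses += 1     # a fresh page must be read from the storage array
            open_page[bank] = page
    return misses

def interleaved_accesses(bases, n, elem_size=4):
    """Model a loop that touches A[i], B[i], C[i] in succession for i in 0..n-1."""
    for i in range(n):
        for name, base in bases.items():
            yield name, base + i * elem_size

# Hypothetical layout: three arrays, each spanning several pages.
bases = {"A": 0, "B": 65536, "C": 131072}
single_bank = count_page_misses(interleaved_accesses(bases, 1024),
                                {"A": 0, "B": 0, "C": 0})
three_banks = count_page_misses(interleaved_accesses(bases, 1024),
                                {"A": 0, "B": 1, "C": 2})
```

With one bank, every access in this model evicts the page opened by the previous access; with one bank per array, a miss occurs only when an array crosses a page boundary.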
To customize the banking structure for an application, we need to
solve the memory bank assignment problem: determine an optimal banking
structure (number of banks) and an assignment of each array variable to
the banks such that the number of page misses is minimized. This
objective optimizes both the performance and the energy dissipation of
the memory subsystem.
In the referenced work, the memory bank customization problem is solved
by modeling the assignment as a partitioning problem, i.e., partitioning
a given set of nodes into a given number of groups such that a given
criterion (bank misses in this case) is optimized. The partitioning
proceeds by associating a cost with assigning two arrays to the same
bank, determined by the number of accesses to the arrays and the loop
count.
If the arrays are accessed in the same loop, then the cost is high,
thereby discouraging the partitioning algorithm from assigning them to
the same bank. On the other hand, if two arrays are never accessed in
the same loop, then they are candidates for assignment into the same
bank. This pairing is associated with a low cost, guiding the
partitioner to assign the arrays together.
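A minimal sketch of this cost-driven partitioning is shown below. It is not the algorithm from the referenced work: the cost formula (loop iterations multiplied by the accesses to both arrays) and the greedy placement are simplifying assumptions, and bank capacity is ignored.

```python
from itertools import combinations

def pair_costs(loops):
    """Build same-bank costs from a list of (iteration_count, {array: accesses_per_iteration}).

    Two arrays accessed in the same loop get a cost that grows with the loop
    count and their access counts (an assumed formula, for illustration).
    """
    cost = {}
    for iterations, accesses in loops:
        for a, b in combinations(sorted(accesses), 2):
            cost[(a, b)] = cost.get((a, b), 0) + iterations * (accesses[a] + accesses[b])
    return cost

def assign_banks(arrays, cost, num_banks):
    """Greedy partitioner: place each array in the bank where it adds the least cost."""
    def pair(a, b):
        return cost.get((min(a, b), max(a, b)), 0)

    banks = [[] for _ in range(num_banks)]
    # Place the most conflict-prone arrays first so they can claim separate banks.
    order = sorted(arrays, key=lambda a: -sum(pair(a, b) for b in arrays if b != a))
    for a in order:
        target = min(banks, key=lambda bank: sum(pair(a, b) for b in bank))
        target.append(a)
    return banks

# A, B, and C share a hot loop; D is only accessed in a separate loop.
loops = [(1000, {"A": 1, "B": 1, "C": 1}), (500, {"D": 1})]
banks = assign_banks(["A", "B", "C", "D"], pair_costs(loops), num_banks=3)
```

In this example the three arrays that share a loop end up in different banks, while D, which never appears in a loop with the others, can share a bank at no cost.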
Multiple SRAMs
In a custom memory architecture, the designer can choose memory
parameters such as the number of memories and the size and number of
ports on each memory. The number of memory modules used in a design has
a significant impact on the access times and power consumption.
A single large monolithic memory that holds all the data is more
expensive in terms of both access time and energy dissipation than
multiple memories of smaller size. However, the other extreme, in which
all array data are stored in distinct memory modules, is also expensive,
and the optimal allocation lies somewhere in between.
The memory allocation problem is closely linked to the problem of
assigning array data to the individual memory modules. Arrays need to
be clustered into memories based on their accesses. The clustering can
be vertical (different arrays occupy different memory words) or
horizontal (different arrays occupy different bit positions within the
same word). Parameters such as bit width, word count, and number of
ports can be included in this analysis. The required memory bandwidth
(number of ports allowing simultaneous access) can be formally
determined by first building a conflict graph of the array accesses and
then storing in the same memory module the arrays that do not conflict.
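The conflict-graph step can be sketched as a graph coloring, where each color stands for one memory module. The code below is a simplified illustration that assumes single-port modules and uses a greedy coloring heuristic; the input format (sets of arrays that must be accessed simultaneously) is an assumption of this sketch.

```python
def build_conflict_graph(parallel_accesses):
    """Build an undirected conflict graph from sets of arrays accessed simultaneously."""
    graph = {}
    for group in parallel_accesses:
        for a in group:
            graph.setdefault(a, set()).update(x for x in group if x != a)
    return graph

def assign_modules(graph):
    """Greedy coloring: conflicting arrays never share a (single-port) module."""
    module_of = {}
    # Handle the most constrained arrays first; break ties by name for determinism.
    for array in sorted(graph, key=lambda a: (-len(graph[a]), a)):
        used = {module_of[n] for n in graph[array] if n in module_of}
        module = 0
        while module in used:
            module += 1
        module_of[array] = module
    return module_of

# A and B are accessed together, as are B and C; D is accessed alone.
graph = build_conflict_graph([{"A", "B"}, {"B", "C"}, {"D"}])
modules = assign_modules(graph)
num_modules = max(modules.values()) + 1
```

Greedy coloring does not guarantee the minimum number of modules in general, but it makes the relationship between conflicts and module count concrete: arrays without an edge between them are free to share a module.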
Special Purpose Memories
Special purpose memories such as stacks (LIFO), queues (FIFO), frame
buffers, streaming buffers, and so on can be utilized when one is
customizing the memory architecture for an application. Indeed,
analysis of many large applications shows that a significant number of
the memory references in data-intensive applications are made by a
surprisingly small number of lines of code.
Thus it is possible to customize the memory subsystem by tuning the
memories for these segments of code, with the goal of improving
performance and reducing power dissipation. In one approach used by
Dutt et al., the application is first analyzed and its different access
patterns are identified. Data for the most critical access patterns are
assigned to memory modules that best fit the access pattern profiles.
The system designer can then evaluate different cost/performance/power
profiles for different realizations of the memory subsystem.
Processor-Memory Co-exploration
In many embedded applications, it is also critical to explore both
processor and memory architectures simultaneously, to capture the
synergy between them.
Datapath width and memory size. The CPU's bit width is an additional
parameter that can be tuned during architectural exploration of
customizable processors. Shackleford et al. studied the relationship
between the width of the processor datapath and the memory subsystem.
This relationship is important when different data types with different
sizes are used in the application.
The key observation is that as the datapath width is decreased, the
data memory size decreases because less space is wasted. For example,
storing 3-bit data in a 4-bit word (instead of an 8-bit word) reduces
memory space demand. However, the required instruction memory capacity
might increase: storing 7-bit data in an 8-bit word requires only one
instruction to access it, but two instructions are required if a 4-bit
datapath is used. A RAM and ROM cost model was used to evaluate the cost
of candidate bit widths in a combined CPU-memory exploration.
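The arithmetic behind these two examples can be written out as a short sketch. It assumes that a data item occupies a whole number of datapath-width words and that each word costs one access instruction; it is not the RAM and ROM cost model from the study, and the function name is hypothetical.

```python
def storage_cost(data_bits, word_bits):
    """Words, stored bits, and wasted bits for one data item at a given datapath width.

    Assumes one access instruction per word (illustrative simplification).
    """
    words = -(-data_bits // word_bits)          # ceiling division
    stored = words * word_bits
    return {
        "accesses_per_item": words,
        "bits_stored": stored,
        "wasted_bits": stored - data_bits,
    }

# 3-bit data: a narrower word wastes less data memory.
three_in_four = storage_cost(3, 4)
three_in_eight = storage_cost(3, 8)

# 7-bit data: a narrower datapath needs more access instructions.
seven_in_eight = storage_cost(7, 8)
seven_in_four = storage_cost(7, 4)
```

The first pair shows the data memory saving from a narrower word, and the second pair shows the extra accesses, and hence extra instruction memory, that a narrower datapath can require.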
Architectural description language-driven co-exploration. Processor
architecture description languages (ADLs) have been developed to allow
for a language-driven co-exploration and software toolkit generation
approach. Currently most ADLs assume an
implicit/default memory organization or are limited to specifying the
characteristics of a traditional memory hierarchy. Since embedded
systems may contain nontraditional memory organizations, there is a
great need to model explicitly the memory subsystem for an ADL-driven
exploration approach.
One interesting approach describes the use of the EXPRESSION ADL to
drive memory architecture exploration. The EXPRESSION ADL description
of the processor-memory architecture is used to capture the memory
architecture explicitly, including the characteristics of the memory
modules (such as caches, DRAMs, SRAMs, and DMAs) and the parallelism and
pipelining present in the memory architecture (e.g., resources used,
timings, access modes).
Each such explicit memory architecture description is then used to
generate automatically the information the compiler needs to use the
features of the memory architecture efficiently, and to generate a
memory simulator. Together these give the designer feedback on the match
among the application, the compiler, and the memory architecture.