Even if we restrict the scope of the architecture to those involving on-chip memory only, the exploration space of different possible configurations is too large, making it infeasible to simulate exhaustively the performance and energy characteristics of the application for each configuration. Thus, exploration tools are necessary for rapidly evaluating the impact of several candidate architectures. Such tools can be of great utility to a system designer by giving fast initial feedback on a wide range of memory architectures.
Cache
Two of the most important aspects of data caches that can be customized
for an application are: (1) the cache line size and (2) the cache size.
The customization of cache line size for an application is performed in
the study referenced above by using an estimation technique
for predicting the memory access performance, that is, the total number
of processor cycles required for all the memory accesses in the
application.
There is a tradeoff in sizing the cache line. If the memory accesses are very regular and consecutive, i.e., exhibit spatial locality, a longer cache line is desirable, since it minimizes the number of off-chip accesses and exploits the locality by prefetching elements that will be needed in the immediate future.
On the other hand, if the memory accesses are irregular, or have
large strides, a shorter cache line is desirable, as this reduces
off-chip memory traffic by not bringing unnecessary data into the
cache. The maximum size of a cache line is the DRAM page size.
The estimation technique uses data reuse analysis to predict the total number of cache hits and misses inside loop nests so that spatial locality is incorporated into the estimation. An estimate of the impact of conflict misses is also incorporated. The estimation is carried out for the different candidate line sizes, and the best line size is selected for the cache.
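The line-size tradeoff can be illustrated with a deliberately simplified sketch. It is not the estimation technique of the referenced study: it ignores temporal reuse and conflict misses, and the cost parameters (`hit_cycles`, `miss_latency`, `bus_bytes`) and the candidate line sizes are assumptions chosen only for illustration. Each array reference is summarized by its access count and its stride in bytes.

```python
from math import ceil

def estimate_cycles(refs, line_size, hit_cycles=1, miss_latency=20, bus_bytes=4):
    """Rough cycle estimate for a list of (num_accesses, stride_bytes) references.

    Assumed model: a reference whose stride is at least one line misses on
    every access; otherwise consecutive accesses share a line.
    """
    total = 0
    for num_accesses, stride in refs:
        if stride >= line_size:
            misses = num_accesses
        else:
            misses = ceil(num_accesses * stride / line_size)
        hits = num_accesses - misses
        # A longer line costs more cycles to transfer on each miss.
        miss_cost = miss_latency + line_size // bus_bytes
        total += hits * hit_cycles + misses * miss_cost
    return total

def best_line_size(refs, candidates=(4, 8, 16, 32, 64)):
    """Pick the candidate line size with the lowest estimated cycle count."""
    return min(candidates, key=lambda size: estimate_cycles(refs, size))

regular = [(1024, 4)]      # unit-stride walk over 4-byte elements
irregular = [(1024, 128)]  # large stride: every access touches a new line
print(best_line_size(regular), best_line_size(irregular))  # prints "64 4"
```

Under these assumed costs, the regular access pattern favors the longest line and the large-stride pattern favors the shortest, matching the tradeoff described above.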
Scratch Pad Memory
In this formulation, an on-chip memory architecture is defined as a
combination of the total size of on-chip memory used for data storage
and the partitioning of this on-chip memory into: scratch memory,
characterized by its size; data cache, characterized by the cache size;
and the cache line size. For each candidate on-chip memory size T, the
technique considers different divisions of T into cache (size C) and
SPM (size S = T - C), selecting only powers of 2 for C.
For the data assigned to off-chip memory (and hence accessed through the cache), the memory access performance is estimated by combining an analysis of the array access patterns in the application with an approximate model of the cache behavior. The result of the estimation is the expected number of processor cycles required for all the memory accesses in the application. For each T, the (C, L) pair that is estimated to maximize performance is selected (L denotes the line size of the cache).
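The exploration loop just described can be sketched as follows. This is only an outline of the search structure: the `estimate` callback stands in for the analytical performance model, and its signature is a hypothetical choice made for this example, as is the restriction that a line may not be larger than the cache.

```python
def power_of_two_sizes(limit):
    """Yield 1, 2, 4, ... up to and including limit."""
    size = 1
    while size <= limit:
        yield size
        size *= 2

def explore(total_sizes, line_sizes, estimate):
    """For each total on-chip size T, find the (C, L) pair with the fewest cycles.

    estimate(spm_size, cache_size, line_size) -> estimated cycles.
    The callback is hypothetical; it stands in for the analytical model.
    Returns {T: (cycles, C, L)}.
    """
    best = {}
    for total in total_sizes:
        candidates = [
            (estimate(total - cache, cache, line), cache, line)
            for cache in power_of_two_sizes(total)  # only powers of 2 for C
            for line in line_sizes
            if line <= cache
        ]
        best[total] = min(candidates)
    return best

# Toy estimator, invented purely to exercise the loop.
def toy(spm, cache, line):
    return abs(spm - 1024) + abs(cache - 1024) + abs(line - 16)

print(explore([2048], [4, 8, 16, 32], toy))  # {2048: (0, 1024, 16)}
```

Because the estimator is a plain function argument, a simulator-backed or table-driven model could be swapped in without changing the search.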
Example 2.
Typical exploration curves of the MemExplore algorithm are shown in Figure 9-4 below. Figure 9-4a shows
that the ideal division of a 2KB on-chip space is 1KB SPM and 1KB data
cache. Figure 9-4b shows that very little performance improvement is
observed beyond a total on-chip memory size of 2KB.
The exploration curves of Figure 9-4 are generated from fast analytical estimates, which are three orders of magnitude faster than actual simulations and are independent of data size. This estimation capability is important in the initial stages of memory design, in which the number of possible architectures is large, and simulation of each architecture is prohibitively expensive.
Figure 9-4 Histogram example. (a) Variation of memory performance with different mixes of cache and SPM, for a total on-chip memory of 2KB. (b) Variation of memory performance with total on-chip memory space.
DRAM
The presence of embedded DRAMs adds several new dimensions to
traditional architecture exploration. One interesting aspect of DRAM
architecture that can be customized for an application is the banking
structure.
Figure 9-5a illustrates a common problem with the single-bank DRAM
architecture. If we have a loop that accesses in succession data from
three large arrays A, B, and C, each of which is much larger than a
page, then each memory access leads to a fresh page being read from the
storage, effectively canceling the benefits of the page buffer. This
page buffer interference problem cannot be avoided if a
fixed-architecture DRAM is used.
However, an elegant solution to the problem is available if the banking configuration of the DRAM can be customized for the application. Thus, in the example of Figure 9-5 below, the arrays can be assigned to separate banks, as shown in Figure 9-5b. Since each bank has its own private page buffer, there is no interference between the arrays, and the memory accesses do not represent a bottleneck.
Figure 9-5 (a) Arrays mapped to a single-bank memory. (b) A three-bank memory architecture.
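The page buffer interference problem can be reproduced with a small counting model. The sketch below assumes one page buffer per bank, a page of 256 words, and three arrays of 1,024 words at made-up base addresses. All of these numbers are illustrative and not taken from the figure.

```python
def count_page_misses(accesses, bank_of, page_words=256):
    """Count page-buffer misses for a sequence of (array_name, word_address).

    bank_of maps each array to a bank; each bank keeps one open page.
    """
    open_page = {}  # bank -> page currently held in that bank's page buffer
    misses = 0
    for array, address in accesses:
        bank = bank_of[array]
        page = address // page_words
        if open_page.get(bank) != page:
            misses += 1
            open_page[bank] = page
    return misses

# Assumed layout: three 1,024-word arrays, accessed in succession in one loop.
BASE = {"A": 0, "B": 4096, "C": 8192}
accesses = [(name, BASE[name] + i) for i in range(1024) for name in "ABC"]

single_bank = count_page_misses(accesses, {"A": 0, "B": 0, "C": 0})
three_banks = count_page_misses(accesses, {"A": 0, "B": 1, "C": 2})
print(single_bank, three_banks)  # prints "3072 12"
```

With a single bank every access opens a new page, whereas with one bank per array each array only misses when it crosses a page boundary.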
In order to customize the banking structure for an application, we
need to solve the memory bank assignment problem - determine an optimal
banking structure (number of banks) and determine the assignment of
each array variable into the banks such that the number of page misses
is minimized. This objective optimizes both the performance and the
energy dissipation of the memory subsystem.
In the work referenced above, the memory bank
customization problem is solved by modeling the assignment as a
partitioning problem, i.e., partition a given set of nodes into a given
number of groups such that a given criterion (bank misses in this case)
is optimized. The partitioning proceeds by associating a cost of
assigning two arrays into the same bank, determined by the number of
accesses to the arrays and the loop count.
If the arrays are accessed in the same loop, then the cost is high, thereby discouraging the partitioning algorithm from assigning them to the same bank. On the other hand, if two arrays are never accessed in the same loop, then they are candidates for assignment into the same bank. This pairing is associated with a low cost, guiding the partitioner to assign the arrays together.
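A minimal sketch of this cost-driven assignment is shown below. The pairwise cost follows the description above (accesses weighted by loop count), but the exact cost formula and the greedy placement are assumptions made for illustration. The referenced work uses its own partitioning algorithm.

```python
from itertools import combinations

def pair_costs(loops):
    """Cost of co-locating two arrays, from (iteration_count, {array: accesses}).

    Arrays accessed in the same loop accumulate cost; arrays that never share
    a loop keep an implicit cost of zero.
    """
    cost = {}
    for count, accesses in loops:
        for a, b in combinations(sorted(accesses), 2):
            cost[(a, b)] = cost.get((a, b), 0) + count * (accesses[a] + accesses[b])
    return cost

def assign_banks(arrays, cost, num_banks):
    """Greedy heuristic (an assumption): place each array in the cheapest bank."""
    banks = [[] for _ in range(num_banks)]
    for array in arrays:
        def added_cost(bank):
            return sum(cost.get(tuple(sorted((array, other))), 0) for other in bank)
        min(banks, key=added_cost).append(array)
    return banks

# Hypothetical workload: A, B, C share a hot loop; D is used in a separate loop.
loops = [(1000, {"A": 1, "B": 1, "C": 1}), (500, {"D": 1})]
banks = assign_banks(["A", "B", "C", "D"], pair_costs(loops), num_banks=3)
print(banks)  # [['A', 'D'], ['B'], ['C']]
```

Here D ends up sharing a bank with A because the two are never accessed in the same loop, while A, B, and C are kept apart.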
Multiple SRAMs
In a custom memory architecture, the designer can choose memory
parameters such as the number of memories and the size and number of
ports on each memory. The number of memory modules used in a design has
a significant impact on the access times and power consumption.
A single large monolithic memory that holds all the data is more expensive in terms of both access time and energy dissipation than multiple smaller memories. However, the other extreme, in which all array data are stored in distinct memory modules, is also expensive, and the optimal allocation lies somewhere in between.
The memory allocation problem is closely linked to the problem of
assigning array data to the individual memory modules. Arrays need to
be clustered into memories based on their accesses. The clustering can be
vertical (different arrays occupy different memory words) or horizontal
(different arrays occupy different bit positions within the same word).
Parameters such as bit width, word count, and number of ports can be included in this analysis. The required memory bandwidth (number of ports allowing simultaneous access) can be formally determined by first building a conflict graph of the array accesses and then storing in the same memory module the arrays that do not conflict.
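The conflict-graph idea can be sketched as a greedy grouping, assuming single-port memory modules and a conflict set that lists pairs of arrays accessed simultaneously. The first-fit strategy and the example conflicts are illustrative assumptions. First-fit does not guarantee a minimal number of modules.

```python
def allocate_memories(arrays, conflicts):
    """Group arrays into single-port modules so that no module holds a conflict.

    conflicts is a set of frozenset pairs of arrays that are accessed at the
    same time and therefore cannot share a single-port module.
    """
    modules = []
    for array in arrays:
        for module in modules:
            if all(frozenset((array, other)) not in conflicts for other in module):
                module.append(array)  # first module with no conflicting array
                break
        else:
            modules.append([array])   # every existing module conflicts
    return modules

# Hypothetical conflicts: a with b, and b with c.
conflicts = {frozenset(("a", "b")), frozenset(("b", "c"))}
modules = allocate_memories(["a", "b", "c", "d"], conflicts)
print(modules)  # [['a', 'c', 'd'], ['b']]
```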
Special Purpose Memories
Special purpose memories such as stacks (LIFO), queues (FIFO), frame
buffers, streaming buffers, and so on can be utilized when one is
customizing the memory architecture for an application. Indeed,
analysis of many large applications shows that a significant number of
the memory references in data-intensive applications are made by a
surprisingly small number of lines of code.
Thus it is possible to customize the memory subsystem by tuning the memories for these segments of code, with the goal of improving performance and reducing power dissipation. In one approach used by Dutt et al., the application is first analyzed and its different access patterns are identified. Data for the most critical access patterns are assigned to memory modules that best fit the access pattern profiles. The system designer can then evaluate different cost/performance/power profiles for different realizations of the memory subsystem.
Processor-Memory Co-exploration
In many embedded applications, it is also critical to explore both
processor and memory architectures simultaneously, to capture the
synergy between them.
Datapath width and memory size. The CPU's bit width is an additional
parameter that can be tuned during architectural exploration of
customizable processors. Shackleford et al.
studied the relationship between the width of the processor data
path and the memory subsystem. This relationship is important when
different data types with different sizes are used in the application.
The key observation made is that as datapath width is decreased, the
data memory size decreases because of less wasted space. For example,
storing 3-bit data in a 4-bit word (instead of 8-bit word) reduces
memory space demand, but the required instruction memory capacity might
increase.
For instance, storing 7-bit data in an 8-bit word requires only one instruction to access it, but two instructions are required if a 4-bit datapath is used. The study used a RAM and ROM cost model to evaluate the cost of candidate bit widths in a combined CPU-memory exploration.
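The arithmetic behind these two examples can be written out directly. The helper names below are hypothetical, and the model assumes each data item occupies a whole number of words and needs one access instruction per word.

```python
from math import ceil

def storage_bits(data_bits, word_bits):
    """Bits of data memory occupied by one item stored in whole words."""
    return ceil(data_bits / word_bits) * word_bits

def accesses_per_item(data_bits, word_bits):
    """Word accesses (and hence access instructions) needed for one item."""
    return ceil(data_bits / word_bits)

# 3-bit data: a 4-bit word wastes 1 bit, an 8-bit word wastes 5 bits.
print(storage_bits(3, 4), storage_bits(3, 8))            # prints "4 8"
# 7-bit data: one access on an 8-bit datapath, two on a 4-bit datapath.
print(accesses_per_item(7, 8), accesses_per_item(7, 4))  # prints "1 2"
```

The narrower datapath therefore shrinks data memory for small items but can increase instruction count, and with it instruction memory, for wider items.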
Architectural description language-driven co-exploration. Processor architecture description languages (ADLs) have been developed to allow for a language-driven co-exploration and software toolkit generation approach. Currently most ADLs assume an implicit/default memory organization or are limited to specifying the characteristics of a traditional memory hierarchy. Since embedded systems may contain nontraditional memory organizations, there is a great need to model explicitly the memory subsystem for an ADL-driven exploration approach.
One interesting
approach describes the use of the EXPRESSION ADL to
drive memory architecture exploration. The EXPRESSION ADL description
of the processor-memory architecture is used to capture the memory
architecture explicitly, including the characteristics of the memory
modules (such as caches, DRAMs, SRAMs, and DMAs), the parallelism and
pipelining present in the memory architecture (e.g., resources used,
timings, access modes).
Each such explicit memory
architecture description is then used to generate automatically the information
needed by the compiler to utilize efficiently the features in the
memory architecture and to generate a memory simulator, allowing
feedback to the designer on the match among the application, the
compiler, and the memory architecture.
Split Spatial and Temporal Caches
Various specialized memory structures proposed over the years could be
candidates for MPSoC-based embedded systems. One such concept is split
spatial/temporal caches.
Variables in real life applications present a wide variety of access
patterns and locality types (for instance scalars, such as indexes,
usually present high temporal and moderate spatial locality, whereas
vectors with small stride present high spatial locality, and vectors
with large stride present low spatial locality and may or may not have
temporal locality).
Several approaches have proposed splitting a cache into a spatial cache and a temporal cache that store data structures with high spatial and high temporal locality, respectively. These approaches rely on a dynamic prediction mechanism, based on a history buffer, to route the data to either the spatial or the temporal cache.
In an embedded system context, the approach of Grun
et al. uses similar split-cache architecture but allocates the
variables statically to the different local memory modules, avoiding
the power and area overhead of the dynamic prediction mechanism.
Thus, by targeting the specific locality types of the different
variables, better utilization of the main memory bandwidth is achieved.
The useless fetches due to locality mismatch are thus avoided. For
instance, if a variable with low spatial locality is serviced by a
cache with a large line size, a large number of the values read from
the main memory will never be used.
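The cost of a locality mismatch can be estimated with a small calculation. The sketch below assumes a strided walk over fixed-size elements in which each fetched line is used once before eviction. The function name and the example numbers are illustrative only.

```python
def line_utilization(stride_bytes, elem_bytes, line_bytes):
    """Fraction of each fetched cache line that the access pattern actually uses.

    Assumes a positive stride and that a line is not revisited after eviction.
    """
    touched = max(1, line_bytes // stride_bytes)  # elements used per line
    return min(1.0, touched * elem_bytes / line_bytes)

# 4-byte elements, 32-byte lines.
print(line_utilization(4, 4, 32))   # 1.0   (unit stride: every byte is used)
print(line_utilization(64, 4, 32))  # 0.125 (large stride: 28 of 32 bytes wasted)
print(line_utilization(64, 4, 4))   # 1.0   (same stride, short line: no waste)
```

Under these assumptions, the large-stride variable wastes seven-eighths of the bandwidth on the long-line cache and none on a cache with a line the size of one element.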
The approach described by Grun et al. shows that the memory bandwidth and memory power consumption could be reduced significantly. Note that, in an MPSoC-based architecture, each processor may demand a customized cache (or SPM) for the best behavior.
Reconfigurability and Challenges
In MPSoC-based embedded systems, modifying a given code to improve data
locality is one way of enhancing performance. An alternative approach
is to reconfigure the cache (or SPM) architecture dynamically according
to the application at hand.
That is, it might be useful to have a morphable (reconfigurable) memory/cache system that dynamically adapts itself to the application's requirements, from both performance and energy/power consumption angles. In fact, an optimizing compiler can analyze a given application, divide its code into regions, and, for each region, select an optimum cache configuration for each processor. However, several key issues need to be addressed in translating the promise of reconfigurable cache architectures into practice.
Kadayif et
al. focus on a morphable cache architecture and
array-dominated embedded codes (which are suitable for an MPSoC-based
environment) and show the potential benefits that can be obtained from
such a system. They conduct a limit study for potential benefits (from
energy and performance perspectives) for going from one configuration
to another.
The granularity that they focus on is a nested loop, which is the natural computation/access pattern boundary for array-dominated applications from a scientific domain and an image/video processing domain. Using a set of array-dominated codes, they investigate what the best cache configuration is for each nested loop under different objective functions (optimization criteria) such as cache energy, memory energy, cache misses, performance (execution time), overall energy, and energy-delay (energy-execution time) product.
In addition to morphing conventional cache parameters, they also
consider reconfigurability of energy-aware features found in some cache
architectures such as block buffering.
Their results indicate that there are potential performance and energy
benefits in adopting a morphable cache subsystem.
The results also show that, depending on the optimization objective targeted, one may select an entirely different cache configuration. For example, minimizing cache energy may require a different cache configuration for each nest than minimizing the overall memory system energy under a performance constraint.
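The dependence of the chosen configuration on the objective can be sketched as a per-nest selection. The configuration names and all numbers below are invented for illustration and are not results from Kadayif et al. The measurement layout is likewise a hypothetical one.

```python
def best_config_per_nest(measurements, objective):
    """Pick, for each loop nest, the cache configuration minimizing objective.

    measurements: {nest: {config_name: {"cache_energy", "memory_energy", "cycles"}}}
    objective: function mapping one measurement dict to a number.
    """
    return {
        nest: min(configs, key=lambda name: objective(configs[name]))
        for nest, configs in measurements.items()
    }

# Invented numbers for a single nest and two candidate configurations.
measurements = {
    "nest1": {
        "4KB/1-way":  {"cache_energy": 10, "memory_energy": 50, "cycles": 100},
        "16KB/4-way": {"cache_energy": 30, "memory_energy": 15, "cycles": 70},
    }
}

def cache_energy(m):
    return m["cache_energy"]

def overall_energy(m):
    return m["cache_energy"] + m["memory_energy"]

def energy_delay(m):
    return overall_energy(m) * m["cycles"]

print(best_config_per_nest(measurements, cache_energy))    # {'nest1': '4KB/1-way'}
print(best_config_per_nest(measurements, overall_energy))  # {'nest1': '16KB/4-way'}
```

With these made-up figures, the small cache wins on cache energy alone, while the larger cache wins once off-chip memory energy or the energy-delay product is taken into account.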
To read Part 1, go to Types of Memory Architectures.
Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of computer science for Embedded Computer Systems at the University of California, Irvine.