Types of Architectures
Before we get into details on how to customize memory architectures for
MPSoCs, let's review some common architectures that can be used
in memories of MPSoC-based systems, with particular emphasis on the
more unconventional ones. Note that MPSoC-based systems can contain
both hardware-managed and software-managed memory components.
Cache. As
application-specific systems became large enough to use a processor
core as a building block, the natural extension in terms of memory
architecture was the addition of instruction and data caches. Since the
organization
of typical caches is well known, we omit the basic
explanation. Caches have many parameters (e.g., line size,
associativity) that can be customized for a given application. Some of
these customizations are described later in this series.
Scratch Pad Memory. An MPSoC designer is not restricted to using only a
traditional cached memory architecture. The designer can use unconventional
architectural variations that suit the specific application under
consideration. One such design alternative is scratch pad memory (SPM).
SPM refers to data memory residing on-chip that is mapped into an
address space disjoint from the off-chip memory but connected to the
same address and data busses. Both the cache and SPM (usually SRAM)
allow fast access to the data they hold, whereas an access to the
off-chip memory takes considerably longer.
The main difference between the scratch pad SRAM and a conventional
data cache is that the SRAM guarantees a single-cycle access time,
whereas an access to the cache is subject to cache misses. The concept
of SPM is an important architectural consideration in modern embedded
systems, in which advances in embedded DRAM technology have made it
possible to combine DRAM and logic on the same chip.
Since data stored in embedded DRAM can be accessed much faster and
in a more power-efficient manner than that in off-chip DRAM, a related
optimization problem that arises in this context is how to identify
critical data in an application, for storage in on-chip memory.
Figure 9-1 below shows
an SPM from the perspective of a single processor, with the parts
enclosed in the dotted rectangle implemented in one chip, interfacing
with an off-chip memory, usually realized with DRAM. The address and
data busses from the CPU core connect to the data cache, SPM, and
external memory interface (EMI) blocks.
On a memory access request from the CPU, the data cache indicates a
cache hit to the EMI block through the C_HIT signal. Similarly, if the
SRAM interface circuitry in the SPM determines that the referenced
memory address maps into the on-chip SRAM, it assumes control of the
data bus and indicates this status to the EMI through the signal S_HIT.
If both the cache and SRAM report a miss, the EMI transfers a block of
data of the appropriate size (equal to the cache line size) between the
cache and the DRAM.
Figure 9-1. Block diagram of a core with SPM.
One possible data address space mapping for this memory
configuration is shown in Figure 9-2
below, for a sample addressable memory of size N data words.
Memory addresses 0...(P - 1) map into the on-chip SPM and have a single
cycle access time. Memory addresses P...(N - 1) map into the off-chip
DRAM and are accessed by the CPU through the data cache.
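The mapping just described amounts to a simple address decode. The following C sketch is illustrative only: the names mem_region_t and decode_region are hypothetical, and addresses are treated as plain word indices, which is an assumption rather than something the text specifies.

```c
#include <stddef.h>

/* Hypothetical labels for where a data word lives (not from the text). */
typedef enum { REGION_SPM, REGION_DRAM, REGION_INVALID } mem_region_t;

/*
 * Classify a word address under the mapping described above:
 *   0 .. P-1  -> on-chip SPM (single-cycle access)
 *   P .. N-1  -> off-chip DRAM, reached through the data cache
 * p is the SPM size in words; n is the total addressable size in words.
 */
static mem_region_t decode_region(size_t addr, size_t p, size_t n)
{
    if (addr >= n) {
        return REGION_INVALID;   /* outside the addressable memory */
    }
    return (addr < p) ? REGION_SPM : REGION_DRAM;
}
```

With P = 1024 and N = 65536, for instance, address 1023 would decode to the SPM and address 1024 to the cached DRAM region.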
A cache hit for an address in the range P...(N - 1) results in a
single-cycle delay, whereas a cache miss, which leads to a block
transfer between off-chip and cache memory, may result in a delay of,
say, 50 to 100 processor cycles for an embedded processor operating in
the range of 100 to 400 MHz. We illustrate the use of this SPM with the
following example.
Figure 9-2. Dividing data address space between SPM and off-chip memory.
Example 1. A small (4 x 4) matrix of coefficients (mask) slides over the
input image (source), covering a different 4 x 4 region in each iteration
of y, as shown in Figure 9-3 below. In each iteration,
the coefficients of the mask are combined with the region of the image
currently covered, to obtain a weighted average, and the result (acc)
is assigned to the pixel of the output array (dest) in the center of
the covered region.
If the two arrays source and mask were to be accessed through the
data cache, the performance would be affected by cache conflicts. This
problem can be solved by storing the small mask array in the SPM. This
assignment eliminates all data conflicts in the data cache, which is
now used only for memory accesses to source, and these are very
regular. Storing mask on-chip ensures that frequently accessed data are
never evicted to off-chip memory, thereby significantly improving memory
performance and reducing energy dissipation.
Figure 9-3. (Top) Procedure CONV. (Bottom) Memory access pattern in CONV.
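A loop nest of the kind Example 1 describes might look like the following C sketch. It is not the procedure from the figure: the function name conv, the row-major int arrays, the normalization by the sum of the coefficients, and the choice of offset (2, 2) as the "center" of an even-sized window are all assumptions made for illustration.

```c
#include <stddef.h>

#define MASK_DIM 4   /* the mask is 4 x 4, as in Example 1 */

/*
 * Sketch of a CONV-style loop nest (assumptions are listed in the lead-in).
 * On a real target, mask would be placed in the SPM address range, for
 * example through a linker section; that step is toolchain specific and is
 * not shown here.
 */
static void conv(size_t rows, size_t cols,
                 const int *source,                   /* rows x cols, row-major */
                 const int mask[MASK_DIM][MASK_DIM],
                 int *dest)                           /* rows x cols, row-major */
{
    int mask_sum = 0;
    for (size_t i = 0; i < MASK_DIM; i++) {
        for (size_t j = 0; j < MASK_DIM; j++) {
            mask_sum += mask[i][j];
        }
    }
    if (mask_sum == 0) {
        mask_sum = 1;            /* avoid dividing by zero */
    }

    for (size_t x = 0; x + MASK_DIM <= rows; x++) {
        for (size_t y = 0; y + MASK_DIM <= cols; y++) {
            int acc = 0;
            /* Every window position re-reads all 16 mask coefficients. */
            for (size_t i = 0; i < MASK_DIM; i++) {
                for (size_t j = 0; j < MASK_DIM; j++) {
                    acc += mask[i][j] * source[(x + i) * cols + (y + j)];
                }
            }
            dest[(x + MASK_DIM / 2) * cols + (y + MASK_DIM / 2)] = acc / mask_sum;
        }
    }
}
```

The inner two loops show why mask is a good SPM candidate: its 16 words are read at every window position, while each element of source is touched only by the few windows that cover it.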
Another proposed memory assignment exploits this architecture
by first determining a total conflict factor (TCF) for each array based
on the access frequency and possibility of conflict with other arrays
and then considering the arrays for assignment to SPM in the order of
TCF/(array size), giving priority to high-conflict/small-size arrays.
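The ordering step can be sketched in C as follows. The text does not say how the TCF is computed or what happens when an array does not fit, so this sketch takes the TCF as a given input and simply skips arrays that exceed the remaining SPM space; both choices, along with the names array_info_t and assign_to_spm, are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-array record; field names are illustrative. */
typedef struct {
    const char *name;
    size_t size;     /* array size in words, assumed > 0 */
    double tcf;      /* total conflict factor, computed elsewhere */
    int in_spm;      /* set to 1 if the array is assigned to SPM */
} array_info_t;

/* qsort comparator: descending TCF/(array size). */
static int by_density_desc(const void *a, const void *b)
{
    const array_info_t *x = a;
    const array_info_t *y = b;
    double dx = x->tcf / (double)x->size;
    double dy = y->tcf / (double)y->size;
    return (dx < dy) - (dx > dy);
}

/*
 * Sort the arrays by TCF/size and assign each one to SPM if it still fits.
 * Returns the number of SPM words used. The array is reordered in place.
 */
static size_t assign_to_spm(array_info_t *arrays, size_t count, size_t spm_words)
{
    size_t used = 0;

    qsort(arrays, count, sizeof arrays[0], by_density_desc);
    for (size_t i = 0; i < count; i++) {
        arrays[i].in_spm = 0;
        if (arrays[i].size <= spm_words - used) {
            arrays[i].in_spm = 1;
            used += arrays[i].size;
        }
    }
    return used;
}
```

A small, heavily conflicting array such as mask in Example 1 would sort to the front under this ordering, while a large image array would sort to the back and remain in cached off-chip memory.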
Dynamic data transfers. In the above formulation, the data stored in
the SPM were statically determined. This idea can be extended to the
case of dynamic data storage. However, since there is no automatic
hardware-controlled mechanism to transfer data between the SPM and the
main memory, such transfers have to be explicitly managed by the
compiler.
In another proposed technique, the compiler uses a tiling-like
transformation, moves the data tiles (blocks) into SPM (for
processing), and then moves them back to main memory after the
computation is complete.
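The copy-in, compute, copy-out pattern that such a compiler would generate can be sketched by hand in C. Everything specific here is an assumption: the tile size TILE_WORDS, the buffer spm_tile (an ordinary array standing in for SPM), the use of memcpy in place of whatever transfer mechanism a real platform provides, and the trivial scaling computation.

```c
#include <stddef.h>
#include <string.h>

#define TILE_WORDS 256   /* assumed SPM buffer capacity, in words */

/*
 * Stand-in for the SPM. On a real target this buffer would be placed in
 * the SPM address range (toolchain specific); here it is a plain array.
 */
static int spm_tile[TILE_WORDS];

/* Multiply every element of data by factor, one tile at a time. */
static void scale_tiled(int *data, size_t n, int factor)
{
    for (size_t base = 0; base < n; base += TILE_WORDS) {
        size_t len = (n - base < TILE_WORDS) ? (n - base) : TILE_WORDS;

        /* Main memory -> SPM */
        memcpy(spm_tile, data + base, len * sizeof data[0]);

        /* All computation touches only the SPM copy. */
        for (size_t i = 0; i < len; i++) {
            spm_tile[i] *= factor;
        }

        /* SPM -> main memory */
        memcpy(data + base, spm_tile, len * sizeof data[0]);
    }
}
```

The explicit transfers pay off only when each tile is reused several times while it resides in the SPM; the single pass above is kept deliberately simple to show the structure rather than the benefit.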
Storing instructions in SPM. An SPM storing a small amount of frequently
accessed data on-chip has an equivalent in the instruction cache. The
idea of using a small buffer to store blocks of frequently used
instructions was first introduced by Jouppi. Recent extensions of
this strategy are the decoded instruction buffer and the L-cache.
Researchers have also examined the possibility of storing both
instructions and data in the SPM. In one proposed formulation,
the frequency of access for both data and program blocks is analyzed
and the most frequently accessed ones among them are assigned to the
SPM. Chen et al. describe a compiler-directed management strategy
for an instruction SPM.