An emerging architectural solution places multiple processor cores on a single chip to manage the computational requirements. Such a multiprocessor system-on-chip (MPSoC) architecture has several advantages over a conventional strategy that employs a single, more powerful (but complex) processor on the chip.
First, the design of an on-chip multiprocessor composed of multiple simple processor cores is simpler than that of a complex single-processor system. This simplicity also helps reduce the time spent in verification and validation..
Second, an on-chip multiprocessor is expected to result in better utilization of the silicon space. The extra logic that would be spent on register renaming, instruction wake-up, speculation/predication, and register bypass on a complex single processor can be spent on providing higher bandwidth on an on-chip multiprocessor.
Third, an MPSoC architecture can exploit loop-level parallelism at the software level in array-intensive embedded applications. In contrast, a complex single-processor architecture needs to convert loop-level parallelism to instruction-level parallelism at run time (that is, dynamically) using sophisticated (and power-hungry) strategies. During this process, some loss in parallelism is inevitable.
Finally, a multiprocessor configuration provides an opportunity for energy savings through careful and selective management of individual processors. Overall, an on-chip multiprocessor is a suitable platform for executing array-intensive computations commonly found in embedded image and video processing applications.
One of the most critical components that determine the success of an
MPSoC based architecture is its memory system. This is because many
applications spend a significant portion of their cycles in the memory
hierarchy. In fact, one can expect this to be even more so in the
future, considering the ever-increasing dataset sizes, coupled with the
widening processor-memory gap.
In addition, from an energy consumption angle, the memory system can contribute up to 90% of the overall system power. In fact, one can expect that a significant portion of the transistors in an MPSoC-based architecture will be devoted to the memory hierarchy.
There are at least two major (and complementary) ways of optimizing
the memory performance of an MPSoC-based system: (1) constructing a
suitable memory organization/hierarchy and (2) optimizing the software
(application) for it. This chapter focuses on these two issues and
discusses different potential solutions for them.
On the architecture side, one can employ a traditional cache-based
hierarchy or can opt to build a customized memory hierarchy, which can
consist of caches, scratch pad memories, stream buffers, LIFOs, or a
combination of these. It is also possible to make some architectural
features reconfigurable and tune their parameters at run time according
to the needs of the application being executed.
Traditional compilation techniques for multiprocessor architectures
focus only on performance (execution cycles). However, in an
MPSoC-based environment, one might want to include other metrics of
interest as well, such as energy/power consumption and memory space
usage. Therefore, the compiler's job is much more difficult in our
context compared with the case of traditional high-end multiprocessors.
Memory Architectures
The application-specific nature of embedded systems presents new
opportunities for aggressive customization and exploration of
architectural issues. Since embedded systems typically implement a
fixed application or problem in a particular domain, knowledge of the
applications can be used to tailor the system architecture to suit the
needs of the given application.
Such an architectural exploration scheme is quite different from the development of general-purpose computer systems that are designed for good average performance over a set of typical benchmark programs that cover a wide range of applications with different behaviors.
However, in the case of embedded systems, the features of the given
application can be used to determine the architectural parameters. This
is particularly true for MPSoC-based systems, in which we have numerous
power-hungry components. For example, if an application does not use
floating point arithmetic, then the floating point unit can be removed
from the MPSoC, thereby saving area and power in the implementation.
Since the memory subsystem will dominate the cost (area),
performance,
and power of an MPSoC, we have to pay special attention to how the
memory subsystem can benefit from customization. Unlike a
general-purpose processor, in which a standard cache hierarchy is
employed, the memory hierarchy - indeed the overall memory organization
- of an MPSoC-based system can be tailored in various ways.
The memory can be selectively cached; the cache line size can be determined by the application; the designer can opt to discard the cache completely and employ specialized memory configurations such as FIFOs and stream buffers; and so on. The exploration space of different possible memory architectures is vast, and there have been attempts to automate or semiautomate this exploration process .
Traditionally, memory issues have been separately addressed by
disparate research groups: computer architects, compiler writers, and
the CAD/embedded systems community. Memory architectures have been
studied extensively by computer architects. Memory hierarchy,
implemented with cache structures, has received considerable attention
from researchers.
Cache parameters such as line size, associativity, and write policy, and their impact on typical applications have been studied in detail. Recent studies have also quantified the impact of dynamic memory (DRAM) architectures. Since architectures are closely associated with compilation issues, compiler researchers have addressed the problem of generating efficient code for a given memory architecture by appropriately transforming the program and data. Compiler transformations such as blocking/tiling are examples of such optimizations. Note that many of these designs/optimizations need a fresh look when an MPSoC-based system is under consideration.
Finally, researchers in the area of computer-assisted design
(CAD)/embedded systems have typically employed memory structures such
as register files, static memory (SRAMs), and DRAMs in generating
application-specific designs. Although the optimizations identified by
the architecture and compiler community are still applicable in MPSoC
design, the architectural flexibility available in the new context adds
a new exploration dimension.
To be really effective, these
optimizations need to be integrated into the design process as well as
enhanced with new optimization and estimation techniques. In this
section, we first present an overview of different memory architectures
and then survey some of the ways in which these architectures have been
customized.
Types of Architectures
Before we get into deails on how to customize memory architectures for
MPSoCs, let's review some common architectures that can be used
in memories of MPSoC-based systems, with particular emphasis on the
more unconventional ones. Note that MPSoC-based systems can contain
both hardware-managed and software-managed memory components.
Cache. As application-specific systems became large enough to use a processor core as a building block, the natural extension in terms of memory architecture was the addition of instruction and data caches. Since the organization of typical caches is well known, we omit the basic explanation. Caches have many parameters (e.g., line size, associativity) that can be customized for a given application. Some of these customizations are described later in this series.
Scratch Pad Memory. An MPSoC designer is not restricted to using only a traditional cached memory architecture. S/he can use unconventional architectural variations that suit the specific application under consideration. One such design alternative is scratch pad memory (SPM).
SPM refers to data memory residing on-chip that is mapped into an
address space disjoint from the off-chip memory but connected to the
same address and data busses. Both the cache and SPM (usually SRAM)
allow fast access to their residing data, whereas an access to the
off-chip memory requires relatively longer access times.
The main difference between the scratch pad SRAM and a conventional
data cache is that the SRAM guarantees a single-cycle access time,
whereas an access to the cache is subject to cache misses. The concept
of SPM is an important architectural consideration in modern embedded
systems, in which advances in embedded DRAM technology have made it
possible to combine DRAM and logic on the same chip.
Since data stored in embedded DRAM can be accessed much faster and in a more power-efficient manner than that in off-chip DRAM, a related optimization problem that arises in this context is how to identify critical data in an application, for storage in on-chip memory.
Figure 9-1 below shows
an SPM from the perspective of a single processor, with the parts
enclosed in the dotted rectangle implemented in one chip, interfacing
with an off-chip memory, usually realized with DRAM. The address and
data busses from the CPU core connect to the data cache, SPM, and
external memory interface (EMI) blocks.
On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the SPM determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the signal S_HIT. If both the cache and SRAM report miss, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM.
![]() |
| Figure 9-1 Block diagram of a core with SPM. |
One possible data address space mapping for this memory
configuration is shown in Figure 9-2
below, for a sample addressable memory of size N data words.
Memory addresses 0...(P - 1) map into the on-chip SPM and have a single
cycle access time. Memory addresses P...(N - 1) map into the off-chip
DRAM and are accessed by the CPU through the data cache.
A cache hit for an address in the range P...N - 1 results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip and cache memory, may result in a delay of, say, 50 to 100 processor cycles for an embedded processor operating in the range of 100 to 400MHz. We illustrate the use of this SPM with the following example..
![]() |
| Figure 9-2. Dividing data address space between SPM and off-chip memory. |
Example 1. A small (4
x 4) matrix of coefficients (mask) slides over the input image (source)
covering a different 4 x 4 region in each iteration of y, as shown in Figure 9-3 below. In each iteration,
the coefficients of the mask are combined with the region of the image
currently covered, to obtain a weighted average, and the result (acc)
is assigned to the pixel of the output array (dest) in the center of
the covered region.
If the two arrays source and mask were to be accessed through the data cache, the performance would be affected by cache conflicts. This problem can be solved by storing the small mask array in the SPM. This assignment eliminates all data conflicts in the data cache - the data cache is now used for memory accesses to source, which are very regular. Storing mask on-chip ensures that frequently accessed data are never ejected off-chip, thereby significantly improving the memory performance and energy dissipation.

![]() |
| Figure 9-3 (TOP) Procedure CONV. (BOTTOM) Memory access pattern in CONV. |
Another proposed memory assignment exploits this architecture by first determining a ttal conflict factor (TCF) for each array based on the access frequency and possibility of conflict with other arrays and then considering the arrays for assignment to SPM in the order of TCF/(array size), giving priority to high-conflict/small-size arrays.
Dynamic data transfers. In the above formulation, the data stored in
the SPM were statically determined. This idea can be extended to the
case of dynamic data storage. However, since there is no automatic
hardware-controlled mechanism to transfer data between the SPM and the
main memory, such transfers have to be explicitly managed by the
compiler.
In another
proposed technique, the compiler uses a tiling-like
transformation, moves the data tiles (blocks) into SPM (for
processing), and then moves it back to main memory after the
computation is complete.
Storing instructions in SPM. An SPM storing a small amount of frequently accessed data on-chip has an equivalent in the instruction cache. The idea of using a small buffer to store blocks of frequently used instructions was first introduced by Jouppi.. Recent extensions of this strategy are the decoded instruction buffer and the L-cache.
Researchers have also examined the possibility of storing both
instructions and data in the SPM. In one proposed
formulation,
the frequency of access for both data and program blocks is analyzed
and the most frequently occurring ones among them are assigned to the
SPM. Chen et al.
describe a compiler-directed management strategy
for an instruction SPM.
DRAM
DRAMs have been used in a processor-based environment for quite some
time, but the context of their use in embedded systems - both from a
hardware synthesis viewpoint and from an embedded compiler viewpoint -
have been investigated relatively recently.
DRAMs offer better memory performance through the use of specialized
access modes that exploit the internal structure and
steering/buffering/banking of data within these memories. Explicit
modeling of these specialized access modes allows the incorporation of
such high-performance access modes into synthesis and compilation
frameworks.
New synthesis and compilation techniques have been developed that employ detailed knowledge of the DRAM access modes and exploit advance knowledge of an embedded system's application to improve system performance and power.
A typical DRAM memory address is internally split into a row address
consisting of the most significant bits and a column address consisting
of the least significant bits. The row address selects a page from the
core storage, and the column address selects an offset within the page
to arrive at the desired word.
When an address is presented to the memory during a READ operation, the entire page addressed by the row address is read into the page buffer, in anticipation of spatial locality. If future accesses are to the same page, then there is no need to access the main storage area since it can just be read off the page buffer, which acts like a cache. Thus, subsequent accesses to the same page are very fast.
A scheme for modeling the various memory access modes and using them
to perform useful optimizations
in the context of behavioral synthesis
has been described. The main observation is that the input
behavior's memory access patterns can potentially exploit the page mode
(or other specialized access mode) features of the DRAM.
The key idea is the representation of these specialized access modes
as graph primitives that model individual DRAM access modes such as row
decode, column decode, precharge, and so on; each DRAM family's
specialized access modes are then represented using a composition of
these graph primitives to fit the desired access mode protocol.
These composite graphs can then be scheduled together with the rest of the application behavior, both in the context of synthesis and for code compilation. For instance, some additional DRAM-specific optimizations include:
1. Read-modify-write
(R-M-W) optimization that takes advantage of the R-M-W mode in
modern DRAMs, which provides support for a more efficient realization
of the common case in which a specific address is read, the data are
involved in some computation, and then the output is written back to
the same location.
2. Hoisting,
whereby the row-decode node is scheduled ahead of a conditional node if
the first memory access in both branches is on the same page.
3. Unrolling optimization in the context of supporting the page mode accesses.
Synchronous DRAM
As DRAM architectures evolve, new challenges are presented to the
automatic synthesis of embedded systems based on these memories.
Synchronous DRAM represents an architectural advance that presents
another optimization opportunity: multiple memory banks.
The core memory storage is divided into multiple banks, each with its own independent page buffer, so that two separate memory pages can be simultaneously active in the multiple page buffers.
There a number of problems modeling the access modes of synchronous DRAMs, including:
Memory bank assignment can be performed by creating an interference
graph between arrays and partitioning it into subgraphs so that data in
each part are assigned to a different memory bank. The bank assignment
algorithm is related to techniques that address memory assignment for
DSP processors such as the Motorola 56000, which has a dual-bank
internal memory/register file.
The bank assignment problem is targeted at scalar variables and is solved in conjunction with register allocation by building a constraint graph that models the data transfer possibilities between registers and memories followed by a simulated annealing step. Note that such techniques can be particularly suitable for MPSoCs in which a single reduced instruction set computing (RISC)-like core manages multiple DSP-like slave processors.
Chang and Lin approach the SDRAM bank assignment problem by first constructing an array distance table. This table stores the distance in the DFG (dataflow graph) between each pair of arrays in the specification. A short distance indicates a strong correlation, possibly indicating that they might be, for instance, two inputs of the same operation and hence would benefit from being assigned to separate banks. The bank assignment is finally performed by considering array pairs in increasing order of their array distance information.
Whereas the previous discussion has focused primarily on the context
of hardware synthesis, similar ideas have been employed to exploit
aggressively the memory
access protocols for compilers. In
the traditional approach of compiler/architecture co-design, the memory
subsystem was separated from the microarchitecture; the compiler
typically dealt with memory operations using the abstractions of memory
loads and stores, with the architecture (e.g., the memory controller)
providing the interface to the (typically yet unknown) family of DRAMs
and other memory devices that would deliver the desired data.
However, in an embedded system, the system architect has advance knowledge of the specific memories (e.g., DRAMs) used; thus we can employ memory-aware compilation techniques that exploit the specific access modes in the DRAM protocol to perform better code scheduling. In a similar manner, it is possible for the code scheduler to employ global scheduling techniques to hide potential memory latencies using knowledge of the memory access protocols and in effect, improve the ability of the memory controller to boost system performance.
Special Purpose Memories
In addition to the general memories such as caches, and memories
specific to embedded systems, such as SPMs, there exist various other
types of custom memories that implement specific access protocols. Such
memories include memory implementing last-in, first-out protocol
(LIFO), memory implementing queue or first-in, first-out protocol
(FIFO), and content-addressable memory (CAM). Typically, CAMs are used
in search applications, LIFOs are used in microcontrollers, and FIFOs
are used in network chips.
Next in Part 2: Customization of
memory architectures
Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University. Nikil Dutt is a professor of computer science for Embedded Computer Systems at the University of California, Irvine.