DRAM
DRAMs have been used in processor-based environments for quite some
time, but their use in embedded systems, from both a hardware synthesis
viewpoint and an embedded compiler viewpoint, has been investigated
only relatively recently.
DRAMs offer better memory performance through the use of specialized
access modes that exploit the internal structure and
steering/buffering/banking of data within these memories. Explicit
modeling of these specialized access modes allows the incorporation of
such high-performance access modes into synthesis and compilation
frameworks.
New synthesis and compilation techniques have been developed that
employ detailed knowledge of the DRAM access modes and exploit advance
knowledge of an embedded system's application to improve system
performance and power.
A typical DRAM memory address is internally split into a row address
consisting of the most significant bits and a column address consisting
of the least significant bits. The row address selects a page from the
core storage, and the column address selects an offset within the page
to arrive at the desired word.
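As a minimal sketch of this split, assuming a hypothetical device whose
page holds 2^column_bits words (the function name and the bit width are
illustrative and not taken from any particular DRAM data sheet):

```python
def split_address(addr, column_bits):
    """Split a word address into (row, column).

    column_bits is an assumed parameter: the number of least
    significant bits that select the offset within a page.
    """
    row = addr >> column_bits                   # most significant bits: page
    column = addr & ((1 << column_bits) - 1)    # least significant bits: offset
    return row, column


if __name__ == "__main__":
    row, column = split_address(0x1234, column_bits=8)
    print(hex(row), hex(column))
```

Two addresses fall in the same page exactly when their row parts are
equal, which is the property the page buffer relies on.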
When an address is presented to the memory during a READ operation,
the entire page addressed by the row address is read into the page
buffer, in anticipation of spatial locality. If future accesses are to
the same page, then there is no need to access the main storage area
since it can just be read off the page buffer, which acts like a cache.
Thus, subsequent accesses to the same page are very fast.
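The effect of the page buffer can be illustrated with a small model.
This is only a sketch: the class name, the page size, and the cycle
counts for a page hit and a page miss are assumptions chosen for
illustration, not figures for a real device.

```python
class PageModeDram:
    """Toy model of a DRAM with a single page buffer."""

    def __init__(self, column_bits=2, hit_cycles=1, miss_cycles=5):
        self.column_bits = column_bits    # assumed: 4 words per page
        self.hit_cycles = hit_cycles      # assumed cost of a page-buffer hit
        self.miss_cycles = miss_cycles    # assumed cost of opening a new page
        self.open_row = None              # row currently held in the page buffer

    def read(self, addr):
        """Return the cycle cost of reading addr and update the page buffer."""
        row = addr >> self.column_bits
        if row == self.open_row:
            return self.hit_cycles
        self.open_row = row
        return self.miss_cycles


def total_cycles(addresses, **dram_params):
    """Cost of an address trace on a freshly initialized model."""
    dram = PageModeDram(**dram_params)
    return sum(dram.read(a) for a in addresses)


if __name__ == "__main__":
    sequential = list(range(8))    # walks through two pages in order
    ping_pong = [0, 4, 0, 4]       # alternates between two pages
    print(total_cycles(sequential), total_cycles(ping_pong))
```

Under these assumed costs, the sequential trace pays for only two page
openings, while the alternating trace reopens a page on every access.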
A scheme for modeling the various memory access modes and using them
to perform useful optimizations in the context of behavioral synthesis
has been described. The main observation is that the input behavior's
memory access patterns can potentially exploit the page mode (or other
specialized access mode) features of the DRAM.
The key idea is the representation of these specialized access modes
as graph primitives that model individual DRAM access modes such as row
decode, column decode, precharge, and so on; each DRAM family's
specialized access modes are then represented using a composition of
these graph primitives to fit the desired access mode protocol.
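The composition idea can be sketched with the simplest possible graph,
a linear chain of primitives. The primitive names follow the text, but
the cycle counts, the function names, and the restriction to a chain
are assumptions made for illustration; an actual model would use a
general graph with timing constraints on its edges.

```python
# Assumed per-primitive latencies in cycles (illustrative values only).
PRIMITIVE_CYCLES = {"row_decode": 3, "column_decode": 1, "precharge": 2}


def single_read():
    """One isolated read: open the page, select the word, close the page."""
    return ["row_decode", "column_decode", "precharge"]


def page_mode_read(n_words):
    """n_words reads from one page: a single row decode is shared."""
    return ["row_decode"] + ["column_decode"] * n_words + ["precharge"]


def latency(chain):
    """Latency of a chain of primitives executed back to back."""
    return sum(PRIMITIVE_CYCLES[p] for p in chain)


if __name__ == "__main__":
    separate = 4 * latency(single_read())
    combined = latency(page_mode_read(4))
    print(separate, combined)
```

Because each access mode is just a different arrangement of the same
primitives, a scheduler can treat the composite like any other
subgraph of the application.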
These composite graphs can then be scheduled together with the rest
of the application behavior, both in the context of synthesis and for
code compilation. For instance, some additional DRAM-specific
optimizations include:
1. Read-modify-write (R-M-W) optimization, which takes advantage of the
R-M-W mode in modern DRAMs. This mode supports a more efficient
realization of the common case in which a specific address is read, the
data are involved in some computation, and the output is then written
back to the same location.
2. Hoisting, whereby the row-decode node is scheduled ahead of a
conditional node if the first memory access in both branches is on the
same page.
3. Unrolling optimization in the context of supporting the page mode
accesses.
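The hoisting condition in item 2 can be checked with a few lines of
code. This is a sketch under stated assumptions: each branch is
represented simply as the list of addresses it accesses, in order, and
the function name and the column_bits parameter are hypothetical.

```python
def hoistable_row(then_addrs, else_addrs, column_bits):
    """Return the row whose decode can be hoisted above the conditional.

    Returns None when hoisting does not apply, i.e. when a branch makes
    no memory access or the first accesses fall on different pages.
    """
    if not then_addrs or not else_addrs:
        return None
    then_row = then_addrs[0] >> column_bits    # page of first access, 'then' branch
    else_row = else_addrs[0] >> column_bits    # page of first access, 'else' branch
    return then_row if then_row == else_row else None


if __name__ == "__main__":
    # Both branches start on page 1, so its row decode can be issued early.
    print(hoistable_row([0x120, 0x300], [0x1F0], column_bits=8))
```

When the function returns a row, the row decode no longer depends on
the outcome of the condition and can overlap with its evaluation.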
Synchronous DRAM
As DRAM architectures evolve, new challenges are presented to the
automatic synthesis of embedded systems based on these memories.
Synchronous DRAM represents an architectural advance that presents
another optimization opportunity: multiple memory banks.
The core memory storage is divided into multiple banks, each with
its own independent page buffer, so that two separate memory pages can
be simultaneously active in the multiple page buffers.
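A toy model makes the benefit of independent page buffers concrete.
The class name and the hit and miss cycle counts are assumptions for
illustration; accesses are given directly as (bank, row) pairs rather
than decoded from an address.

```python
class BankedDram:
    """Toy model of a DRAM with one page buffer per bank."""

    def __init__(self, num_banks=2, hit_cycles=1, miss_cycles=5):
        self.open_rows = [None] * num_banks    # open page of each bank
        self.hit_cycles = hit_cycles           # assumed page-hit cost
        self.miss_cycles = miss_cycles         # assumed page-miss cost

    def read(self, bank, row):
        """Return the cycle cost of one access and update that bank's buffer."""
        if self.open_rows[bank] == row:
            return self.hit_cycles
        self.open_rows[bank] = row
        return self.miss_cycles


def trace_cycles(trace):
    """Cost of a list of (bank, row) accesses on a fresh model."""
    dram = BankedDram()
    return sum(dram.read(bank, row) for bank, row in trace)


if __name__ == "__main__":
    same_bank = [(0, 10), (0, 20)] * 3     # two pages fight over one buffer
    split_banks = [(0, 10), (1, 20)] * 3   # each page stays open in its own bank
    print(trace_cycles(same_bank), trace_cycles(split_banks))
```

With both pages in one bank, every access evicts the other page; with
one page per bank, only the first access to each page pays the miss.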
There are a number of problems in modeling the access modes of
synchronous DRAMs, including:
- burst mode read/write: fast successive accesses to data in the
same page
- interleaved row read/write modes: alternating burst accesses
between banks
- interleaved column access: alternating burst accesses between two
chosen rows in different banks
Memory bank assignment can be performed by creating an interference
graph between arrays and partitioning it into subgraphs so that data in
each part are assigned to a different memory bank. The bank assignment
algorithm is related to techniques that address memory assignment for
DSP processors such as the Motorola 56000, which has a dual-bank
internal memory/register file.
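One way to picture the partitioning step is a greedy heuristic over a
weighted interference graph. This is a stand-in chosen for brevity,
not the algorithm from the work referred to above; the edge weights
(how strongly two arrays benefit from separate banks), the array
names, and the function name are all assumptions.

```python
def assign_banks(arrays, interference, num_banks=2):
    """Greedily assign arrays to banks.

    interference maps frozenset({a, b}) to a weight. Each array goes to
    the bank where it shares the least interference weight with the
    arrays already placed there.
    """
    def total_weight(a):
        return sum(w for pair, w in interference.items() if a in pair)

    # Place the most heavily interfering arrays first.
    order = sorted(arrays, key=total_weight, reverse=True)
    assignment = {}
    for a in order:
        cost = [0] * num_banks
        for b, bank in assignment.items():
            cost[bank] += interference.get(frozenset((a, b)), 0)
        assignment[a] = cost.index(min(cost))
    return assignment


if __name__ == "__main__":
    weights = {
        frozenset(("A", "B")): 10,
        frozenset(("A", "C")): 5,
        frozenset(("B", "C")): 1,
    }
    print(assign_banks(["A", "B", "C"], weights))
```

In this example the strongly interfering pair A and B lands in
different banks, and C joins the bank where it interferes least.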
There, the bank assignment problem is targeted at scalar variables and
is solved in conjunction with register allocation: a constraint graph
is built that models the data transfer possibilities between registers
and memories, and a simulated annealing step follows. Note that such
techniques can be particularly suitable for MPSoCs in which a single
reduced instruction set computing (RISC)-like core manages multiple
DSP-like slave processors.
Chang and Lin approach the SDRAM bank assignment problem by first
constructing an array distance table. This table stores the distance in
the dataflow graph (DFG) between each pair of arrays in the
specification. A short distance indicates a strong correlation: the two
arrays might be, for instance, two inputs of the same operation, and
hence would benefit from being assigned to separate banks. The bank
assignment is finally performed by considering array pairs in
increasing order of their array distance.
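The ordering step can be sketched as follows. Only the idea of
visiting pairs in increasing order of distance comes from the text;
how each pair is then placed (separating a pair when at least one
array is still unassigned, and skipping pairs that are already fully
assigned) is an assumption made for this illustration, as are the
function name and the sample distances.

```python
def assign_by_distance(distance, num_banks=2):
    """Assign arrays to banks, closest pairs first.

    distance maps a pair (a, b) to the DFG distance between the two
    arrays. Pairs with a short distance are separated first.
    """
    assignment = {}
    for (a, b), _ in sorted(distance.items(), key=lambda kv: kv[1]):
        if a not in assignment and b not in assignment:
            assignment[a], assignment[b] = 0, 1 % num_banks
        elif a not in assignment:
            assignment[a] = (assignment[b] + 1) % num_banks
        elif b not in assignment:
            assignment[b] = (assignment[a] + 1) % num_banks
        # If both arrays are already placed, the earlier (closer) pairs win.
    return assignment


if __name__ == "__main__":
    table = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 5}
    print(assign_by_distance(table))
```

Here A and B (distance 1) are separated first, then C is placed away
from B (distance 2); the distant pair A and C ends up sharing a bank.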
Whereas the previous discussion has focused primarily on the context
of hardware synthesis, similar ideas have been employed to aggressively
exploit memory access protocols in compilers. In the traditional
approach of compiler/architecture co-design, the memory subsystem was
separated from the microarchitecture; the compiler typically dealt with
memory operations using the abstractions of memory loads and stores,
with the architecture (e.g., the memory controller) providing the
interface to the (typically as yet unknown) family of DRAMs and other
memory devices that would deliver the desired data.
However, in an embedded system, the system architect has advance
knowledge of the specific memories (e.g., DRAMs) used; thus we can
employ memory-aware compilation techniques that exploit the specific
access modes in the DRAM protocol to perform better code scheduling. In
a similar manner, the code scheduler can employ global scheduling
techniques to hide potential memory latencies using knowledge of the
memory access protocols and, in effect, improve the ability of the
memory controller to boost system performance.
Special Purpose Memories
In addition to the general memories such as caches, and memories
specific to embedded systems, such as SPMs, there exist various other
types of custom memories that implement specific access protocols. Such
memories include memory implementing last-in, first-out protocol
(LIFO), memory implementing queue or first-in, first-out protocol
(FIFO), and content-addressable memory (CAM). Typically, CAMs are used
in search applications, LIFOs are used in microcontrollers, and FIFOs
are used in network chips.
Next in Part 2: Customization of memory architectures
This series of articles is based on copyrighted material submitted by
Nikil Dutt and Mahmut Kandemir to "Multiprocessor Systems-On-Chips,"
edited by Wayne Wolf and Ahmed Amine Jerraya. It is used with the
permission of the publisher, Morgan Kaufmann, an imprint of Elsevier.
The book can be purchased on-line.
Mahmut Kandemir is an assistant professor in the Computer Science and
Engineering Department at Pennsylvania State University. Nikil Dutt is
a professor of computer science for Embedded Computer Systems at the
University of California, Irvine.