System-on-chip (SoC) architectures are being increasingly employed
to solve a diverse spectrum of problems in the embedded and mobile
systems domain. The resulting increase in the complexity of
applications ported into SoC architectures places a tremendous burden
on the computational resources needed to deliver the required
functionality.
An emerging architectural solution places multiple processor cores
on a single chip to manage the computational requirements. Such a
multiprocessor system-on-chip (MPSoC) architecture has several
advantages over a conventional strategy that employs a single, more
powerful (but complex) processor on the chip.
First, the design of an on-chip multiprocessor composed of multiple
simple processor cores is simpler than that of a complex
single-processor system. This simplicity also helps reduce the time
spent in verification and validation.
Second, an on-chip multiprocessor is expected to result in better
utilization of the silicon space. The extra logic that would be spent
on register renaming, instruction wake-up, speculation/predication, and
register bypass on a complex single processor can be spent on providing
higher bandwidth on an on-chip multiprocessor.
Third, an MPSoC architecture can exploit loop-level parallelism at
the software level in array-intensive embedded applications. In
contrast, a complex single-processor architecture needs to convert
loop-level parallelism to instruction-level parallelism at run time
(that is, dynamically) using sophisticated (and power-hungry)
strategies. During this process, some loss in parallelism is
inevitable.
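As a rough illustration of exploiting loop-level parallelism in software,
the sketch below splits the iterations of a dependence-free loop into
contiguous chunks, one per core. It is a minimal sketch under stated
assumptions, not a scheme taken from this chapter: the names
block_partition, scale_chunk, and parallel_scale are hypothetical, and
Python threads merely stand in for the processor cores of an MPSoC.

```python
from concurrent.futures import ThreadPoolExecutor


def block_partition(n_iters, n_cores):
    """Split range(n_iters) into n_cores contiguous, near-equal chunks."""
    base, extra = divmod(n_iters, n_cores)
    bounds = []
    start = 0
    for core in range(n_cores):
        # The first `extra` cores each take one additional iteration.
        size = base + (1 if core < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def scale_chunk(src, dst, lo, hi, factor):
    """Loop body with no cross-iteration dependence: chunks are independent."""
    for i in range(lo, hi):
        dst[i] = src[i] * factor


def parallel_scale(src, factor, n_cores=4):
    """Run the loop with one chunk per (emulated) core."""
    dst = [0] * len(src)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        futures = [
            pool.submit(scale_chunk, src, dst, lo, hi, factor)
            for lo, hi in block_partition(len(src), n_cores)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return dst
```

Because the partitioning is decided before the loop runs, no run-time
hardware is needed to rediscover the parallelism; each chunk writes a
disjoint slice of the output array.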
Finally, a multiprocessor configuration provides an opportunity for
energy savings through careful and selective management of individual
processors. Overall, an on-chip multiprocessor is a suitable platform
for executing array-intensive computations commonly found in embedded
image and video processing applications.
One of the most critical components that determine the success of an
MPSoC-based architecture is its memory system. This is because many
applications spend a significant portion of their cycles in the memory
hierarchy. In fact, one can expect this to be even more so in the
future, considering the ever-increasing dataset sizes, coupled with the
widening processor-memory gap.
In addition, from an energy standpoint, the memory system can account
for up to 90% of the overall system power. One can also expect that a
significant portion of the transistors in an MPSoC-based architecture
will be devoted to the memory hierarchy.
There are at least two major (and complementary) ways of optimizing
the memory performance of an MPSoC-based system: (1) constructing a
suitable memory organization/hierarchy and (2) optimizing the software
(application) for it. This chapter focuses on these two issues and
discusses different potential solutions for them.
On the architecture side, one can employ a traditional cache-based
hierarchy or can opt to build a customized memory hierarchy, which can
consist of caches, scratch pad memories, stream buffers, LIFOs, or a
combination of these. It is also possible to make some architectural
features reconfigurable and tune their parameters at run time according
to the needs of the application being executed.
On the software side, traditional compilation techniques for
multiprocessor architectures focus only on performance (execution
cycles). However, in an
MPSoC-based environment, one might want to include other metrics of
interest as well, such as energy/power consumption and memory space
usage. Therefore, the compiler's job is much more difficult in our
context than it is for traditional high-end multiprocessors.
Memory Architectures
The application-specific nature of embedded systems presents new
opportunities for aggressive customization and exploration of
architectural issues. Since embedded systems typically implement a
fixed application or problem in a particular domain, knowledge of the
applications can be used to tailor the system architecture to suit the
needs of the given application.
Such an architectural exploration scheme is quite different from the
development of general-purpose
computer systems that are designed for good average performance over a
set of typical benchmark programs that cover a wide range of
applications with different behaviors.
However, in the case of embedded systems, the features of the given
application can be used to determine the architectural parameters. This
is particularly true for MPSoC-based systems, in which we have numerous
power-hungry components. For example, if an application does not use
floating point arithmetic, then the floating point unit can be removed
from the MPSoC, thereby saving area and power in the implementation.
Since the memory subsystem will dominate the cost (area), performance,
and power of an MPSoC, we have to pay special attention to how the
memory subsystem can benefit from customization. Unlike a
general-purpose processor, in which a standard cache hierarchy is
employed, the memory hierarchy (indeed, the overall memory organization)
of an MPSoC-based system can be tailored in various ways.
The memory
can be selectively cached; the cache line size can be determined by the
application; the designer can opt to discard the cache completely and
employ specialized memory configurations such as FIFOs and stream
buffers; and so on. The exploration space of different possible memory
architectures is vast, and there have been attempts to automate or
semiautomate this exploration process.
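To make the idea of exploring memory parameters concrete, the sketch
below counts misses in a direct-mapped cache for several candidate line
sizes and picks the one with the fewest misses. It is only an
illustrative model under stated assumptions, not one of the exploration
tools referred to above: the function names, the 512-byte cache size,
and the synthetic sequential trace are all hypothetical.

```python
def count_misses(addresses, cache_bytes, line_bytes):
    """Count misses of a direct-mapped cache over a trace of byte addresses."""
    n_lines = cache_bytes // line_bytes
    resident = [None] * n_lines  # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes   # memory block containing this address
        index = block % n_lines      # cache line that block maps to
        if resident[index] != block:
            resident[index] = block  # fetch the block, evicting the old one
            misses += 1
    return misses


def explore_line_sizes(addresses, cache_bytes, candidates):
    """Evaluate each candidate line size on the same trace."""
    return {line: count_misses(addresses, cache_bytes, line)
            for line in candidates}


# Hypothetical trace: one sequential sweep over 256 four-byte words.
trace = [4 * i for i in range(256)]
results = explore_line_sizes(trace, cache_bytes=512, candidates=(8, 16, 32))
best_line_size = min(results, key=results.get)
```

For this purely sequential trace, larger lines win because every miss
brings in more useful neighboring data; a trace with poor spatial
locality could favor a different line size, which is why the choice is
made per application.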
Traditionally, memory issues have been separately addressed by
disparate research groups: computer architects, compiler writers, and
the CAD/embedded systems community. Memory architectures have been
studied extensively by computer architects. Memory hierarchy,
implemented with cache structures, has received considerable attention
from researchers.
Cache parameters such as line size, associativity,
and write policy, and their impact on typical applications have been
studied in detail. Recent studies have also quantified the impact
of dynamic memory (DRAM) architectures. Since architectures are
closely associated with compilation issues, compiler researchers have
addressed the problem of generating efficient code for a given memory
architecture by appropriately transforming the program and data.
Compiler
transformations such as blocking/tiling are examples of such
optimizations. Note that many of these designs/optimizations
need a fresh look when an MPSoC-based system is under consideration.
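As an illustration of blocking/tiling, the sketch below restructures a
matrix multiplication so that it works on small tiles of the operands at
a time, which improves data reuse when a tile fits in a cache or
scratch pad memory. It is a minimal sketch, not code from this chapter:
the function names and the default tile size of 2 are assumptions, and a
real compiler would choose the tile size from the memory parameters.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a * b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, iterating over tile x tile blocks of the operands."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):          # tile origin along rows of c
        for jj in range(0, p, tile):      # tile origin along columns of c
            for kk in range(0, m, tile):  # tile origin along the reduction
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The tiled version performs the same multiply-accumulate operations as
the naive one, only in a different order, so the results match for
integer inputs; the min() bounds handle matrices whose dimensions are
not multiples of the tile size.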
Finally, researchers in the area of computer-aided design
(CAD)/embedded systems have typically employed memory structures such
as register files, static memory (SRAMs), and DRAMs in generating
application-specific designs. Although the optimizations identified by
the architecture and compiler community are still applicable in MPSoC
design, the architectural flexibility available in the new context adds
a new exploration dimension.
To be really effective, these
optimizations need to be integrated into the design process as well as
enhanced with new optimization and estimation techniques. In this
section, we first present an overview of different memory architectures
and then survey some of the ways in which these architectures have been
customized.