Protected Access to Resources
A fundamental property of Linux and most general-purpose operating
systems is the separation of user-space programs from the underlying
system resources that they use. Direct access to memory and device
peripherals is permitted only when operating in supervisor (i.e.,
kernel) mode.
When a user program needs access to system resources, it must
request them from the kernel through kernel modules called drivers. The
application exists in a user memory space and accesses the driver
through virtual files. The virtual files then pass the
application's requests into the kernel memory space in which the driver
executes.
Linux provides an extremely feature-rich driver model that
encompasses standard streaming peripherals, block storage devices and
file systems, and even networking and network-based file systems.
The separation of these drivers from the user-space application
provides robustness. Furthermore, the abstraction to a common driver
interface makes it easy to stream data to a serial port, to a flash
file system or to a network shared folder, all with little change to
the underlying application code.
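As a rough illustration of that common interface, the sketch below writes blocks of data using nothing but ordinary file calls. The function name `stream_blocks` and the destination paths in the comments are hypothetical, chosen only for illustration; the point is that the destination changes while the calling code does not.

```python
import os

def stream_blocks(path, blocks):
    """Write each block to `path` with plain file calls; return bytes written."""
    written = 0
    # O_CREAT only matters for regular files; a device node would already exist.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for block in blocks:
            view = memoryview(block)
            while view:
                # os.write may accept fewer bytes than offered, so loop.
                n = os.write(fd, view)
                written += n
                view = view[n:]
    finally:
        os.close(fd)
    return written

# Hypothetical destinations (assumed paths, not taken from the article):
# stream_blocks("/dev/ttyS0", blocks)              # serial port
# stream_blocks("/mnt/flash/capture.bin", blocks)  # flash file system
# stream_blocks("/mnt/share/capture.bin", blocks)  # network shared folder
```

In each case the driver behind the path does the device-specific work, which is why the application code stays the same.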
This flexibility, however, comes at a price. The strict separation
between applications and physical resources adds some degree of
overhead. When a user-space program accesses a device peripheral, a
context switch must be made into kernel mode in order to process the
request.
Typically this is not a significant limitation because the data is
accessed in blocks as opposed to sample-by-sample, so the context
switch into kernel mode needs to be made only once per block access.
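The difference between block and sample-by-sample access can be made concrete by counting kernel entries. In the minimal sketch below, each `os.read` call is one system call and therefore one transition into kernel mode; the helper name `read_samples` and the sizes used in the usage comments are assumptions for illustration.

```python
import os

def read_samples(path, total_bytes, chunk_bytes):
    """Read total_bytes from path in chunk_bytes requests.

    Returns (bytes_received, system_calls_made).
    """
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    received = 0
    try:
        while received < total_bytes:
            data = os.read(fd, min(chunk_bytes, total_bytes - received))
            calls += 1          # one read() system call per request
            if not data:        # end of file reached early
                break
            received += len(data)
    finally:
        os.close(fd)
    return received, calls

# Usage, assuming a source that holds at least 4096 bytes of 16-bit samples:
# read_samples(path, 4096, 2)     # one kernel entry per sample
# read_samples(path, 4096, 4096)  # one kernel entry for the whole block
```

Reading the same 4096 bytes one 2-byte sample at a time takes 2048 kernel entries, whereas a single block read takes one, which is why block access hides most of the overhead.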
There are cases, however, when application code requires a tight
coupling with physical hardware. This situation occurs frequently when
using high-performance processors such as DSPs, where data throughput is
key to processing without stalls. In these cases, the
separation of physical resources in kernel space from the application
in user space may be a significant detriment to the system.
Coupling of Application and Hardware
Let us examine a typical situation encountered when performing block
video processing using the TMS320DM643x processor architecture, which
incorporates a 600 MHz / 4800 MIPS DSP processing core and a wide range
of multimedia peripherals, including a feature-rich video port
subsystem. A typical application of this hardware would be the compression
of an incoming video stream using H.264.
In order to take full advantage of the processing capability of the
DSP core, processed data should be accessed from single-cycle internal
memory as opposed to slower external memory. Although it would be
technically possible to equip the processor with enough fast on-chip
memory to store one or more full video frames, this approach would be
cost-prohibitive for most target markets. Instead, the processor
provides 80 Kbytes of single-cycle on-chip data memory.
While small relative to a full frame, 80 Kbytes has been determined
by TI through simulation to give the optimal area/performance tradeoff
for H.264 and other video processing algorithms.
To keep this memory
fed with data, the DSP uses a Direct Memory Access (DMA) controller,
which efficiently transfers sub-blocks of data
between external and internal memory without using cycles from the
processing core (Figure 1 below).
Figure 1. The DSP uses DMA hardware to transfer small sub-blocks of
a video frame from external memory into internal memory to be processed
by the DSP core.
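The sub-block scheme can be modeled in a few lines. This is only a simulation sketch: ordinary byte copies stand in for the DMA transfers, and the names `plan_sub_blocks` and `process_frame`, like the frame and sub-block sizes in the usage comment, are assumptions rather than anything taken from TI's software. Only the 80 Kbyte internal memory size comes from the text.

```python
INTERNAL_BYTES = 80 * 1024  # on-chip data memory size stated in the text

def plan_sub_blocks(frame_bytes, sub_block_bytes):
    """Return (offset, length) pairs that cover the frame in order."""
    if not 0 < sub_block_bytes <= INTERNAL_BYTES:
        raise ValueError("sub-block must fit in internal memory")
    return [(off, min(sub_block_bytes, frame_bytes - off))
            for off in range(0, frame_bytes, sub_block_bytes)]

def process_frame(frame, sub_block_bytes, kernel):
    """Process a frame one sub-block at a time.

    `kernel` is any function mapping a bytes object to a bytes object
    of the same length (a stand-in for the real processing step).
    """
    out = bytearray(len(frame))
    for off, length in plan_sub_blocks(len(frame), sub_block_bytes):
        internal = bytes(frame[off:off + length])  # models DMA: external -> internal
        result = kernel(internal)                  # models work in fast memory
        out[off:off + length] = result             # models DMA: internal -> external
    return bytes(out)

# Usage with assumed sizes: a 720x480 frame at 2 bytes per pixel
# (691,200 bytes) split into 16 Kbyte sub-blocks needs
# len(plan_sub_blocks(720 * 480 * 2, 16 * 1024)) transfers in each direction.
```

A real implementation would also reserve part of the internal memory for output and working buffers, so the usable sub-block size would be smaller than the full 80 Kbytes.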
From a whole-system perspective, this method can provide nearly the
same performance as a chip with an entire on-chip video buffer but at a
fraction of the cost. Achieving this performance, however, requires a
very tight coupling among the application, the operating system and
the underlying memory and DMA hardware.
First, the application must have a means of distinguishing between
fast internal memory and bulk external memory. Second, the application
must be able to execute many small, precisely timed DMA operations.
Because any latency incurred when accessing the DMA controller is multiplied by
the hundreds or possibly thousands of DMA accesses per video frame,
performing these DMA operations efficiently within the Linux driver
model is difficult, if not impossible.
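A back-of-the-envelope model shows how quickly per-access latency adds up. The function name `dma_overhead` and all of the numbers in the usage comment are hypothetical; they are not measurements of any real system.

```python
def dma_overhead(transfers_per_frame, latency_us, frames_per_second):
    """Return (overhead in microseconds per frame, fraction of frame period)."""
    overhead_us = transfers_per_frame * latency_us
    frame_period_us = 1_000_000 / frames_per_second
    return overhead_us, overhead_us / frame_period_us

# Assumed figures for illustration only: 1000 DMA accesses per frame,
# 10 microseconds of added latency per access, 30 frames per second.
# overhead_us, fraction = dma_overhead(1000, 10, 30)
```

Under those assumed figures, the added latency alone consumes 10,000 microseconds per frame, about 30 percent of the 33.3 ms frame period, before any processing is done.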
Practical implementations of this method have been demonstrated
using DSP/BIOS, which provides native APIs that allow applications to
request internal versus external memory. DSP/BIOS also allows applications
to access DMA registers directly, with no context-switching penalty.