Of the design benefits that FPGAs provide embedded systems designers, one key advantage is the ability to adapt and quickly respond to changing system requirements. FPGAs have evolved from the simple interface logic devices of yesterday into highly sophisticated processing devices that are capable of integrating and accelerating entire embedded systems. Modern FPGA-based systems often include multiple soft and hard processors running industry-standard real-time operating systems (RTOSs), along with processor peripherals and custom hardware accelerators for performance-critical algorithms. As a direct result of these capabilities, FPGAs are now being used to develop highly flexible, hybrid multiprocessing applications and systems.
Embedded systems designers face a wide range of processing-related design challenges. Real-time and performance-critical systems demand increased performance, but also require lowered power consumption. Critical embedded applications may require dedicated computing hardware or the use of additional processors to meet performance and power constraints.
To address the performance barrier, a standard approach in the past has been to raise the operating frequency of the processor. Increasing clock speeds increases power consumption, however, so embedded systems designers have turned to other approaches to improve the performance/power ratio. These approaches include the use of additional processors or of specialized coprocessors, including FPGAs.
Adding devices to a system can be costly, especially when considering the requirements for increased system reliability and sustainable power budgets, as well as physical size, thermal, and packaging constraints. Adding more devices to resolve performance issues forces other tradeoffs and adds yet another component to an already lengthy bill of materials. Modern FPGAs, with their ability to integrate multiple processors and coprocessors in a single device, provide one solution to this problem.
In a modern FPGA-based application, one processor may be used to run an operating system. Further integration may be achieved by adding coprocessors for noncritical algorithms. These processors can be integrated with dedicated hardware accelerators, all in the same programmable FPGA device.
The result is a hybrid multiprocessing application with a reducedcomponent count.
Solving complex computational problems through integration and parallelism is not new. It's long been recognized that many of the computing challenges in embedded and high-performance systems can be addressed using parallel-processing techniques. The use of dual- or quad-core processors, multiple processing boards, or even clustered PCs has become commonplace in many applications. In embedded applications, traditional processors can be paired with DSPs, which are often paired with custom or off-the-shelf hardware accelerators.
In recent years, the trend has been to combine multiple processing elements on one device. One example of this multicore approach is the Cell Broadband Engine Architecture, jointly designed by Sony, Toshiba, and IBM.
The Cell architecture increases the performance of graphics and video applications by introducing system-level parallelism. It also supports flexible, programmable acceleration that's highly optimized and provides for high clock frequencies while minimizing power. The keys to the Cell architecture's high performance are the Synergistic Processing Elements (SPEs), which provide coherent offload, abundant local memory, and asynchronous coherent DMA engines. End applications, such as multimedia and vector processing, benefit from the combination of the general-purpose processor core and streamlined coprocessing elements. (Editor's note: see “Programming the Cell Broadband Engine,” Alex Chunghan Chow, June 2006, for more info on SPEs.)
Figure 1 shows Nvidia's Compute Unified Device Architecture (CUDA), another type of parallel processing engine. It's based on standard graphics processing units (GPUs), which are stream processors (highlighted in light green in the figure) that have been combined to form a general-purpose, streams-oriented parallel processing engine. CUDA provides access to the native instruction set and memory of the parallel computation elements in the GPUs. Like the Cell processor, the CUDA architecture promises higher performance over standard processors, while simplifying software development using the standard C language for data-intensive problems.
Parallelism at many levels
These architectures accelerate performance by providing dedicated processing engines operating in parallel. Parallelism can exist at many levels:
System level, through the use of multiple CPUs and coprocessors
Process level, via multiple threads or communicating processes within each processor
Subroutine and loop level, using unrolling and pipelining, for example
Statement level, via instruction scheduling and parallel ALUs
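To make the subroutine- and loop-level idea concrete, here is a minimal C sketch of manual loop unrolling (the function names are illustrative, not from the article). Splitting an accumulation into independent partial sums breaks the serial dependence between iterations, so a superscalar processor, or parallel adders generated by a hardware compiler, can execute the additions concurrently.

```c
#include <stddef.h>

/* Straightforward version: one add per iteration, each
   dependent on the result of the previous iteration. */
int sum_simple(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled version: four independent partial sums per
   iteration. The four adds have no data dependence on one
   another, so they can be scheduled in parallel. */
int sum_unrolled(const int *a, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder iterations */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Software-to-hardware compilers apply this kind of transformation automatically, often combining unrolling with pipelining so that the loop body accepts new data every clock cycle.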
Where FPGAs offer a significant advantage is in the latter two types of parallelism. Parallelism is inherent in an FPGA's architecture and can be leveraged by hardware designers or by software-to-hardware compilers for algorithm acceleration. For this purpose, FPGAs are now being deployed alongside traditional processors in high-end computing systems, creating what might be called a hybrid multiprocessing approach to computing.