Accelerate system performance with hybrid multiprocessing and FPGAs

Multiprocessing is becoming a key differentiator for FPGA-based processor architectures.

Of the design benefits that FPGAs provide embedded systems designers, one key advantage is the ability to adapt and quickly respond to changing system requirements. FPGAs have evolved from the simple interface logic devices of yesterday into highly sophisticated processing devices that are capable of integrating and accelerating entire embedded systems. Modern FPGA-based systems often include multiple soft and hard processors running industry-standard real-time operating systems (RTOSs), along with processor peripherals and custom hardware accelerators for performance-critical algorithms. As a direct result of these capabilities, FPGAs are now being used to develop highly flexible, hybrid multiprocessing applications and systems.

Embedded systems designers face a wide range of processing-related design challenges. Real-time and performance-critical systems demand increased performance, but also require lowered power consumption. Critical embedded applications may require dedicated computing hardware or the use of additional processors to meet performance and power constraints.

To address the performance barrier, a standard approach in the past has been to raise the operating frequency of the processor. Increasing clock speeds increases power consumption, however, so embedded systems designers have turned to other approaches to improve the performance/power ratio. These approaches include adding processors or using specialized coprocessors, including FPGAs.

Adding additional devices to a system can be costly, especially when considering the requirements for increased system reliability and sustainable power budgets as well as physical size, thermal, and packaging constraints. Adding more devices to resolve performance issues forces other tradeoffs and adds yet another component to an already lengthy bill of materials. Modern FPGAs, with their ability to integrate multiple processors and coprocessors in a single device, provide one solution to this problem.

In a modern FPGA-based application, one processor may be used to run an operating system. Further integration may be achieved by adding additional coprocessors for noncritical algorithms. These processors can be integrated with dedicated hardware accelerators, all in the same programmable FPGA device.

The result is a hybrid multiprocessing application with a reduced component count.

Leveraging parallelism
Solving complex computational problems through integration and parallelism is not new. It's long been recognized that many of the computing challenges in embedded and high-performance systems can be addressed using parallel-processing techniques. The use of dual- or quad-core processors, multiple processing boards, or even clustered PCs has become commonplace in many applications. In embedded applications, traditional processors can be paired with DSPs, which are often paired with custom or off-the-shelf hardware accelerators.

In recent years, the trend has been to combine multiple processing elements on one device. One example of this multicore approach is the Cell Broadband Engine Architecture, jointly designed by Sony, Toshiba, and IBM.

The Cell architecture increases the performance of graphics and video applications by introducing system-level parallelism. It also supports a flexible, programmable acceleration that's highly optimized and provides for high clock frequencies while minimizing power. The keys to the Cell architecture's high performance are the Synergistic Processing Elements (SPEs) that provide coherent offload, abundant local memory, and asynchronous coherent DMA engines. End applications, such as multimedia and vector processing, benefit from the combination of the general-purpose processor core and streamlined coprocessing elements. (Editor's note: see “Programming the Cell Broadband Engine,” Alex Chunghan Chow, June 2006, for more info on SPEs.)

Figure 1 shows Nvidia's Compute Unified Device Architecture (CUDA), another type of parallel processing engine. It's based on standard graphics processing units (GPUs), which are stream processors (highlighted in light green in the figure) that have been combined to form a general-purpose, streams-oriented parallel processing engine. CUDA provides access to the native instruction set and memory of the parallel computation elements in the GPUs. Like the Cell processor, the CUDA architecture promises higher performance than standard processors, while simplifying software development by using the standard C language for data-intensive problems.

Parallelism at many levels
These architectures accelerate performance by providing dedicated processing engines operating in parallel. Parallelism can exist at many levels:

• System level through using multiple CPUs and coprocessors

• Process level via multiple threads or communicating processes within each processor

• Subroutine and loop levels, using unrolling and pipelining, for example

• Statement level via instruction scheduling and via parallel ALUs

Where FPGAs offer a significant advantage is in the latter two types of parallelism. Parallelism is inherent in an FPGA's architecture and can be leveraged by hardware designers or by software-to-hardware compilers for algorithm acceleration. For this purpose, FPGAs are now being deployed alongside traditional processors in high-end computing systems, creating what might be called a hybrid multiprocessing approach to computing.
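
As a rough illustration of loop-level parallelism, the sketch below shows a dot product written as a plain loop and again unrolled by four with independent accumulators. The function names are hypothetical and aren't tied to any particular tool. The assumption is that a C-to-hardware compiler, or a hardware designer, could map the four independent multiply-accumulate operations onto parallel logic, whereas a processor would run them one after another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one multiply-accumulate per iteration, each depending
 * on the accumulator value from the previous iteration. */
int32_t dot_rolled(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by four with independent accumulators. The four products
 * in the loop body don't depend on one another, so parallel hardware
 * could compute them in the same cycle. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0); /* sketch assumes a length that is a multiple of 4 */

    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

Both functions return the same result; only the amount of parallelism exposed in the loop body differs.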

Examples of off-the-shelf FPGA technologies with soft- and hard-logic embedded processors extendable using coprocessing engines include:

• Actel Cortex M1 32-bit soft core

• Actel CoreMP7 32-bit soft core

• Altera NIOS II 32-bit soft core

• Lattice Freedom core 32-bit soft core

• Xilinx MicroBlaze 32-bit soft core

• Xilinx PowerPC 32-bit hard core

FPGA programming models
FPGAs represent a new class of parallel processing devices that are quite unlike traditional processors. For software developers, the added tasks associated with programming FPGAs may outweigh their potential benefits for performance and power. Fortunately, FPGA programming tools are making rapid advances. With tools available today, it's possible to program FPGAs using familiar software programming languages and multiple parallel programming models.

For C programmers, multiple tools are now available that combine C-compatible parallel programming methods with automated compiler and optimizer technologies. These tools can automatically detect and exploit parallelism at a lower level, for example at the level of individual subroutines and inner code loops. At the same time, they enable programmers to describe and verify parallel behaviors at the system level, using familiar C-language compilers and debuggers combined with well-defined models for parallel computation.

The choice among the three primary programming models depends heavily on the processor selected and the types of interfaces used within the system architecture. Coprocessing models are generated using C-based language extensions and related constructs.

When considering a multiprocessing strategy for any application, it's important to consider appropriate methods of process-to-process communication. In fact, for practical applications, the quality of application partitioning and the throughput of data from one processor to another may require more development time than any other single area. Choosing an appropriate programming model for a specific application and target platform is a critical step in achieving performance goals.

In this context, the programming model is a method of abstracting the architecture of a target platform in such a way that applications can be more easily and efficiently designed for that platform. An effective programming model is one that can properly express the application while enabling developers to exploit the advantages of the platform.

C programming models for FPGAs
Programming software algorithms into FPGA hardware has traditionally required specific knowledge of hardware design methods, including the use of hardware description languages (HDLs) such as VHDL or Verilog. Although these methods may be productive for hardware designers, they typically aren't suitable for embedded systems programmers or higher-level software programmers.

Fortunately, software-to-hardware tools now exist that enable software programmers to describe their algorithms using more familiar methods and standard programming languages. For example, using a C-to-FPGA compiler tool such as Impulse C from Impulse Accelerated Technologies, a software programmer can describe an application and its key algorithms in C with the addition of relatively simple library functions to specify interprocess communications. The algorithms can then be compiled automatically into HDL representations, which are subsequently synthesized into lower-level hardware targeting one or more FPGAs. While a certain level of FPGA knowledge and in-depth hardware understanding may still be needed to optimize the application's performance, the formulation of the algorithm, the initial testing, and the prototype hardware generation can now be left to the software programmer.

For system- or application-level parallelism, standard C can be extended in support of various parallel programming models. The standard ANSI C language doesn't include support for parallel programming, but there are many thousands of software programmers today using C-language thread libraries to manage parallel control flows in an application. Threading can be an effective strategy within a single processor, if threading capabilities have been provided in the form of run-time libraries or an operating system.
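
To make the threading model concrete, here is a minimal sketch that uses the POSIX threads library, one common C thread library, to split a summation across two threads. The `sum_task` structure and the function names are hypothetical, and the two-way split is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Work description for one thread: sum data[begin..end). */
typedef struct {
    const int32_t *data;
    size_t begin, end;
    int64_t partial;
} sum_task;

static void *sum_worker(void *arg)
{
    sum_task *t = arg;
    t->partial = 0;
    for (size_t i = t->begin; i < t->end; i++)
        t->partial += t->data[i];
    return NULL;
}

/* Split the array into two halves and sum each half on its own thread. */
int64_t parallel_sum(const int32_t *data, size_t n)
{
    assert(data != NULL);

    sum_task tasks[2] = {
        { data, 0,     n / 2, 0 },
        { data, n / 2, n,     0 },
    };
    pthread_t tid[2];

    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&tid[i], NULL, sum_worker, &tasks[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    return tasks[0].partial + tasks[1].partial;
}
```

Each thread writes only to its own `partial` field, so no locking is needed. Many toolchains require the `-pthread` option to build code like this.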

Streaming is another programming model that supports parallel applications. Streaming is an ideal programming model for many applications in the domain of high-performance embedded computing, in which low-latency yet complex computation of data is required. The streaming programming model is also well suited to mixed-processor/FPGA platforms. With a streaming approach, software running on an embedded processor can be easily interfaced to one or more hardware accelerators running in the FPGA.
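
The streaming model can be sketched in plain C as a small FIFO with a producer on one side and a consumer on the other. This is a software stand-in only: `stream_t`, `stream_write`, `stream_read`, and `accel_double` are hypothetical names, not the API of any C-to-FPGA tool, and the FIFO depth is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_DEPTH 8  /* hypothetical FIFO depth */

typedef struct {
    int32_t buf[STREAM_DEPTH];
    size_t head, tail, count;
} stream_t;

void stream_init(stream_t *s)
{
    s->head = s->tail = s->count = 0;
}

/* Returns false when the FIFO is full (a real producer would stall). */
bool stream_write(stream_t *s, int32_t v)
{
    if (s->count == STREAM_DEPTH)
        return false;
    s->buf[s->tail] = v;
    s->tail = (s->tail + 1) % STREAM_DEPTH;
    s->count++;
    return true;
}

/* Returns false when the FIFO is empty (a real consumer would stall). */
bool stream_read(stream_t *s, int32_t *v)
{
    if (s->count == 0)
        return false;
    *v = s->buf[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return true;
}

/* Stand-in for an accelerator process: read each input sample, double
 * it, and write it to the output stream. Returns the number of samples
 * processed. */
size_t accel_double(stream_t *in, stream_t *out)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;
    int32_t v;
    /* Check for output space first so no sample is read and then lost. */
    while (out->count < STREAM_DEPTH && stream_read(in, &v)) {
        stream_write(out, 2 * v);
        n++;
    }
    return n;
}
```

The processing function sees its data only as a sequence of reads and writes, which is why this style maps naturally onto buffered hardware channels between a processor and an accelerator.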

Shared memory is a programming model that emphasizes the use of internal or external memory resources as a communication mechanism between connected processes. Shared memory can be quite effective for platforms that don't have high-performance streaming interfaces or in which larger amounts of data must be quickly moved from one processing element to another.
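
A minimal sketch of the shared-memory model follows, assuming a single block of memory visible to both processes and a simple ownership flag for handoff. The `shared_block` layout, the block size, and the function names are all hypothetical. A real multiprocessor system would also need the memory barriers or cache maintenance its platform requires, which this single-threaded sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 64  /* hypothetical block size */

typedef enum { OWNER_PRODUCER, OWNER_CONSUMER } owner_t;

/* One block of memory visible to both processes. */
typedef struct {
    volatile owner_t owner;   /* which side may touch the data right now */
    size_t length;            /* number of valid words in data[] */
    int32_t data[BLOCK_WORDS];
} shared_block;

/* Producer side: copy a whole block in, then hand ownership over.
 * Returns 0 on success, -1 if the consumer still owns the block. */
int block_publish(shared_block *blk, const int32_t *src, size_t n)
{
    assert(n <= BLOCK_WORDS);

    if (blk->owner != OWNER_PRODUCER)
        return -1;
    memcpy(blk->data, src, n * sizeof src[0]);
    blk->length = n;
    blk->owner = OWNER_CONSUMER;
    return 0;
}

/* Consumer side: process the block in place, then hand it back. */
int64_t block_consume_sum(shared_block *blk)
{
    assert(blk->owner == OWNER_CONSUMER);

    int64_t sum = 0;
    for (size_t i = 0; i < blk->length; i++)
        sum += blk->data[i];
    blk->owner = OWNER_PRODUCER;
    return sum;
}
```

Unlike a stream, the consumer can access the block in any order, and the whole block changes hands with a single flag update rather than one transfer per sample.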

While not a distinct programming model, user-defined instructions (UDIs) are another method that can be effective for creating accelerated, multiprocess applications. In this approach, a traditional processor makes a direct call (perhaps using a C function interface) to a custom hardware element, with data (or addresses to data) being transferred as part of the call.
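
From the software side, a UDI can look like an ordinary function call whose body has been swapped for hardware. The sketch below assumes a hypothetical build flag, `HAVE_HYPOTHETICAL_UDI`, and a hypothetical platform-supplied function, `sat_add16_udi`. Without the flag, the same call resolves to a software reference implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software reference: 16-bit addition that saturates instead of wrapping. */
static int16_t sat_add16_sw(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

#ifdef HAVE_HYPOTHETICAL_UDI
/* Hypothetical: supplied by the platform as a wrapper around a custom
 * instruction. Not a real vendor API. */
extern int16_t sat_add16_udi(int16_t a, int16_t b);
#define sat_add16 sat_add16_udi
#else
#define sat_add16 sat_add16_sw
#endif

/* Application code calls sat_add16() the same way in both builds. */
void mix_samples(const int16_t *x, const int16_t *y, int16_t *out, size_t n)
{
    assert(x != NULL && y != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = sat_add16(x[i], y[i]);
}
```

The calling code doesn't change when the accelerated version is enabled, which is the main appeal of this approach for existing software.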

The role of C-to-FPGA tools
Using standard C for application development has many advantages, not the least of which is the opportunity to use iterative, software-oriented methods of design optimization and debugging. With the Impulse C tools, for example, both hardware and software elements of the complete application can be described, partitioned, and debugged with standard C programming tools. During this process, programmers can employ familiar C-code optimizations to increase performance and throughput, without having much FPGA-specific hardware knowledge.

In streaming applications, hardware and software processes communicate through buffered data streams that are implemented directly in hardware. C-to-FPGA tools make it possible to write pipelined, parallel applications at a relatively high level of abstraction, without the cycle-by-cycle synchronization that would otherwise be required.

Throughput optimization is critical to system performance. Computation gains can be erased by poor pipelining or C code, or by inefficient use of parallel computing resources. Hardware and software optimization is required to keep the pipelines full and data moving. Optimization is best achieved by combining automated and interactive techniques and by selecting hardware/software communication methods that are appropriate for the target FPGA platform.

Let's now look at the pros and cons of the different multiprocessing acceleration options. Bus-based memory transfers data over a shared bus, with the accelerator designed as a bus-connected peripheral. Examples of this include IBM's PowerPC CoreConnect and Altera's Avalon. The advantages are that this approach is simpler from a system-interface perspective, it's good for transferring larger chunks of data, and to the processor the accelerator looks just like a standard peripheral. The disadvantage is that performance may suffer due to contention if the bus is shared.
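
To show what "looks just like a standard peripheral" can mean in software, the sketch below drives an accelerator through memory-mapped registers. The register layout, bit definitions, and function names are hypothetical and don't describe any real CoreConnect or Avalon peripheral. On real hardware, `regs` would point at the peripheral's base address; here it can point at an ordinary structure for testing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block of a bus-connected accelerator. */
typedef struct {
    volatile uint32_t ctrl;      /* bit 0: start */
    volatile uint32_t status;    /* bit 0: done */
    volatile uint32_t src_addr;  /* bus address of the input buffer */
    volatile uint32_t length;    /* number of words to process */
    volatile uint32_t result;    /* valid once the done bit is set */
} accel_regs;

#define ACCEL_CTRL_START  0x1u
#define ACCEL_STATUS_DONE 0x1u

/* Program the transfer parameters, then set the start bit last. */
void accel_start(accel_regs *regs, uint32_t src_addr, uint32_t length)
{
    assert(regs != NULL);
    regs->src_addr = src_addr;
    regs->length   = length;
    regs->ctrl     = ACCEL_CTRL_START;
}

/* Poll once. Returns 1 and stores the result when the accelerator has
 * finished, or 0 if it is still busy. */
int accel_poll(accel_regs *regs, uint32_t *result)
{
    assert(regs != NULL && result != NULL);
    if ((regs->status & ACCEL_STATUS_DONE) == 0)
        return 0;
    *result = regs->result;
    return 1;
}
```

Every register access here is a bus transaction, so the contention concern noted above applies to both the control accesses and the data the accelerator fetches.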

Streaming data transfers data in packets, with the FPGA and processor connected by multiple high-performance dedicated channels. Examples of these channels include the Xilinx FSL and APU interfaces. The advantages are high throughput rates via inherently streaming interfaces, without bus contention issues. The disadvantages are that streaming interfaces require dedicated hardware for each communication channel, and that the data being processed must be accessed serially.

The UDI is used to create a dedicated hardware replacement for a callable function. In this case, streaming or memory interfaces connect the accelerator to the CPU. One or more C functions are converted to hardware equivalents, and the UDI replaces the original function call. The advantages are that less change is required to the original software application (just call the accelerated function) and that it gives programmers a more natural way to think about acceleration. UDIs achieve the most impact when used for creating small accelerated functions. Multiple-process, pipelined hardware accelerators don't realize as large an improvement.

With the advent of FPGA-embedded processors, a common design environment for both hardware and software is highly desirable. FPGA-specific development tools can help to manage most aspects of peripheral selection and system bring-up for FPGA-based embedded applications. Board support packages can also simplify this process, resulting in a working hardware/software prototype created in minutes using hardware peripheral cores. However, the creation of custom hardware acceleration cores, such as a specialized DSP filter, requires either hand coding in HDL or the use of C-to-FPGA tools.

For development and prototyping, preconfigured boards can be used to create complete systems-on-an-FPGA. These single-chip systems might include one or more embedded soft processors, processor peripherals, and associated C hardware accelerators. In this use model, the Impulse C compiler serves as a peripheral generator, using platform-specific knowledge and automatically creating all necessary bus interface wrappers associated with the generated hardware, as shown in Figure 2. Using such a design flow, it's practical for a software engineer (one who has experience with embedded systems but not necessarily FPGA design) to specify, generate, and bring up a complete hardware-accelerated application without having to write HDL for any part of the system.

Scalability, hybrid processing
One example of a highly scalable, multidimensional, and flexible hybrid processing approach uses hard and/or soft processing cores efficiently attached to scalable coprocessing functions implemented in the soft FPGA logic fabric. Here, a 32-bit processor core is implemented in soft logic. Figure 3 and Table 1 show how the Impulse C tools were used to generate a soft coprocessing core that accelerates an FIR filter.

The key advantage to this custom accelerated system is computation performance: the C-language FIR filter algorithm coupled tightly to the soft processor via the FSL achieves over 400X acceleration in performance over a processor-only equivalent algorithm.
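
For readers who want a concrete picture of the kind of algorithm involved, here is a generic direct-form FIR filter in plain C. It's a textbook formulation, not the filter used in the measured system. The function name, the data widths, and the zero-padding of samples before the start of the input are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct-form FIR filter:
 *   y[i] = sum over k of h[k] * x[i - k],  k = 0 .. taps - 1
 * Samples before x[0] are treated as zero. */
void fir_filter(const int16_t *x, size_t n,
                const int16_t *h, size_t taps,
                int32_t *y)
{
    assert(x != NULL && h != NULL && y != NULL);
    assert(taps > 0);

    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;
        /* k <= i skips taps that would reach before the first sample. */
        for (size_t k = 0; k < taps && k <= i; k++)
            acc += (int32_t)h[k] * x[i - k];
        y[i] = acc;
    }
}
```

The inner loop is a chain of independent multiplies feeding one sum. On a processor those multiplies run one after another, while in FPGA fabric they can be laid out side by side and pipelined, which is where large speedups for this class of algorithm tend to come from.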

The ability to add coprocessing engines is limited only by the device's size. This scalability can also be applied to hard embedded processors in FPGAs; coprocessing acceleration engines implemented as hardware peripherals in the FPGA allow direct access to the processor's pipeline.

Like the accelerated FIR filter implementation just described, similar levels of scalability can be accomplished using multiple coprocessing engines attached to the dedicated APU interface. Using these hardware acceleration engines and the APU quad-word instructions (enabling the transfer of 128 bits from cache or external memory directly to/from the fabric) achieves even greater acceleration when compared with single-word accesses.

Accelerating video decoding
An accelerated streaming video example demonstrates the concept of a single-chip, multiprocessor accelerated embedded system. In this video streaming application, one PowerPC processor runs an operating system, acts as a Web server, and provides some of the computation processing. The second processor is dedicated to performance-critical processing, including IDCT and color conversion. Quad-word transfers are used as part of the IDCT and color conversion, and for transferring video images from one processor to the other and into the TFT memory for display.

Using the PowerPC accelerated system, the performance of the color conversion algorithms was increased by a factor of 10 compared with running the algorithm only in software. Accelerating the IDCT algorithm using the APU achieved a 3X increase in performance. Note that this architecture is highly flexible and can accommodate additional processors to increase computational capability.
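
The article doesn't say which color conversion was accelerated. As one plausible example of the per-pixel arithmetic in a video decoder, the sketch below converts a YCbCr pixel to RGB using the full-range coefficients commonly associated with JPEG/JFIF, in 16-bit fixed point. The choice of color space, the coefficients, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a Q16 fixed-point value to the nearest integer without relying
 * on right shifts of negative numbers. */
static int32_t round_q16(int32_t v)
{
    return v >= 0 ? (v + 32768) / 65536 : -((-v + 32768) / 65536);
}

static uint8_t clamp_u8(int32_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Full-range YCbCr to RGB (assumed coefficients, scaled by 65536):
 *   R = Y + 1.402    * (Cr - 128)
 *   G = Y - 0.344136 * (Cb - 128) - 0.714136 * (Cr - 128)
 *   B = Y + 1.772    * (Cb - 128) */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
    assert(rgb != NULL);

    int32_t d = (int32_t)cb - 128;
    int32_t e = (int32_t)cr - 128;

    rgb[0] = clamp_u8(y + round_q16(91881 * e));
    rgb[1] = clamp_u8(y - round_q16(22554 * d + 46802 * e));
    rgb[2] = clamp_u8(y + round_q16(116130 * d));
}
```

Each pixel needs only a few multiplies, adds, and clamps with no dependence on neighboring pixels, which makes this kind of conversion a natural candidate for a pipelined hardware accelerator fed by wide transfers.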

Dan Isaacs is the director of embedded processor marketing in Xilinx's Advanced Products Division. He can be reached at

Ed Trexel is a senior applications engineer at Impulse Accelerated Technologies. He has extensive experience with high-performance embedded applications including image processing, VOIP, and FPGA-based systems. Trexel, an electrical engineering graduate of the University of Colorado, can be reached at

Bruce Karsten is a senior processor specialist in Xilinx's platform sales group. He can be reached at
