Accelerator IP speeds DSP algorithms


The DFA1000 family of accelerators from Elixent is for use with standard RISC processors running DSP algorithms. Designed to improve performance compared with DSPs, it is the first implementation of D-Fabrix, the company's reconfigurable signal processing (RSP) technology (see Embedded Systems, November 2001, page 18).

There are five members of the DFA1000 family: DFA1000-128, DFA1000-256, DFA1000-512, DFA1000-1024 and DFA1000-2048, containing arrays of 128, 256, 512, 1024 or 2048 ALUs respectively.

Each DFA1000 accelerator is delivered as a hard macro, a format that provides a number of advantages.

Hard macros are ready-validated, eliminating many of the risks and unknowns in developing an SoC, while the time to port to a new silicon process, a traditional concern with hard macros, is eliminated by the small number of leaf cells.

A DFA1000 accelerator maps directly to a standard RISC via the AHB system bus and has its own dedicated I/O ports that can be used for the high bandwidth/low latency interfaces required in multimedia and communications systems.

As the footprint of the macro is parameterisable, it can be varied to fit the space available. This enables the accelerator to be scaled to meet both performance and area requirements.

FIR filters can be built with sampling rates of greater than 100MHz, and an 8×8 DCT (a major building block of JPEG and MPEG video compression systems) achieves greater than 100Mpixel/s performance despite being implemented using only a couple of hundred ALUs.
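For readers less familiar with the operation being accelerated, a direct-form FIR filter can be sketched in a few lines of Python. This is a plain software reference model, not Elixent code; the function name `fir_filter` and the tap values are assumptions chosen purely for illustration.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n-k], with x[n] = 0 for n < 0."""
    history = [0] * len(taps)  # delay line, newest sample first
    output = []
    for x in samples:
        # Shift the new sample into the delay line and drop the oldest one.
        history = [x] + history[:-1]
        # One multiply-accumulate per tap.
        output.append(sum(t * h for t, h in zip(taps, history)))
    return output


# Illustrative 4-tap moving-sum filter (tap values are an assumption).
result = fir_filter([1, 2, 3, 4, 5], [1, 1, 1, 1])
```

In software the multiply-accumulates for each output run one after another. In a hardware implementation the taps can be computed in parallel, which is the general reason such filters can reach sample rates well beyond those of a sequential loop.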

The DFA1000 family is supported with a programming toolset, providing several options for design entry.

Because reconfigurable signal processing is a new design technique, many hardware and software engineers are starting to learn the skill. Elixent has anticipated this by providing each group with a familiar programming interface.

For hardware engineers, designs can be entered as if the DFA1000 accelerator were a simple FPGA. The design is described in Verilog and synthesised down onto the array.

Each design operates like a 'virtual hardware accelerator', a familiar concept for hardware engineers. For software engineers, Celoxica provides the DK1 tool, which allows the accelerator to be programmed from C.

Both tools feed into the DFA1000 family's standard toolflow at a high level. The interface from the design entry tools into the back end of the toolflow is well defined, allowing third party suppliers to provide additional design entry options in the future.

Regardless of the programming style used, once the program is written and downloaded, the DFA1000 functions as a dedicated hardware accelerator, eliminating real-time conflicts, multiple operating systems and other issues that can cause problems.

The D-Fabrix reconfigurable signal processing (RSP) architecture uses an array of 4bit arithmetic logic units (ALUs), registers and memory blocks that may be combined to support variable data word widths.
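The idea of building wider words from 4bit units can be illustrated with a small software model. The sketch below chains 4bit adder slices through a ripple carry; the article does not describe how D-Fabrix actually propagates carries, so the ripple-carry scheme and the names `add_4bit` and `add_wide` are assumptions made only for illustration.

```python
NIBBLE_BITS = 4
NIBBLE_MASK = 0xF


def add_4bit(a, b, carry_in):
    """One 4-bit adder slice: returns (4-bit sum, carry out)."""
    total = (a & NIBBLE_MASK) + (b & NIBBLE_MASK) + carry_in
    return total & NIBBLE_MASK, total >> NIBBLE_BITS


def add_wide(a, b, width_bits):
    """Add two unsigned values by chaining 4-bit slices, lowest nibble first.

    Returns (sum truncated to width_bits, final carry out).
    """
    if width_bits <= 0 or width_bits % NIBBLE_BITS != 0:
        raise ValueError("width must be a positive multiple of 4 bits")
    result, carry = 0, 0
    for i in range(width_bits // NIBBLE_BITS):
        shift = i * NIBBLE_BITS
        # Each slice sees only its own nibble plus the carry from the slice below.
        nibble, carry = add_4bit(a >> shift, b >> shift, carry)
        result |= nibble << shift
    return result, carry
```

For example, `add_wide(0x00FF, 0x0001, 16)` uses four slices and carries out of the two lowest nibbles into the third. A 12bit or 64bit adder differs only in the number of slices chained.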

The ALUs are positioned in the style of a chessboard, alternating with adjacent 'switchboxes'.
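As a rough picture of that arrangement, the following sketch prints a small grid in which ALUs ('A') and switchboxes ('S') alternate like the squares of a chessboard. The grid size, the letters and the choice of which corner holds an ALU are all assumptions for illustration; the article gives no floorplan details.

```python
def chessboard_layout(rows, cols):
    """Return a grid where 'A' (ALU) and 'S' (switchbox) alternate like chessboard squares."""
    return [
        ["A" if (r + c) % 2 == 0 else "S" for c in range(cols)]
        for r in range(rows)
    ]


for row in chessboard_layout(4, 4):
    print(" ".join(row))
```

In this pattern every ALU has a switchbox on each of its four sides, and vice versa.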

The D-Fabrix technology supports platform devices that can be truly multi-functional and have the ability to support multiple applications, adapting efficiently to changing specifications.

The architecture integrates local high-speed RAMs, directly accessible by the array or by the RISC, and the D-Fabrix array itself.

These high-speed data interfaces are supplemented by the AMBA bus interface. This is implemented as a 32bit AHB port, fully compliant with the AMBA v2.0 release. As well as being used to load the D-Fabrix array with each 'virtual hardware' setup, it is also used for data transfer, where other bus masters (typically DMAs) can access the RAM or array, loading and unloading data for processing. It is also used for communications with, and control of, the application by the host RISC.

The DFA1000 RISC Accelerator

In the DFA1000 accelerator's coprocessor configuration, the AHB port allows the D-Fabrix array to be programmed, loaded and unloaded.

The two high-speed I/O ports give a 32bit data pipe capable of moving large amounts of data at high speed and with very low latency. They are connected directly to the processing fabric, giving peripherals the ability to drive data straight into the array. This takes very high sample rate data off the system bus.

Each 32bit port can be used as multiple synchronous ports of reduced width. Width is allocated in 4bit increments, so the port could be used to implement four 8bit ports for imaging use, or two 16bit ports for audio.
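The width allocation described above can be modelled as slicing a 32bit word into lanes whose widths are multiples of 4 bits. The sketch below is only a software illustration: the function name `split_port` and the bit ordering (first lane in the least significant bits) are assumptions, since the article does not say how lanes are mapped onto the port.

```python
PORT_WIDTH = 32  # bits per I/O port, as stated in the article
GRANULE = 4      # width is allocated in 4-bit increments


def split_port(word, lane_widths):
    """Split a 32-bit port word into lanes, lowest bits first (assumed ordering).

    lane_widths: for example [8, 8, 8, 8] or [16, 16]; each must be a
    positive multiple of 4 bits and together they must fit in 32 bits.
    """
    if any(w <= 0 or w % GRANULE for w in lane_widths):
        raise ValueError("each lane width must be a positive multiple of 4 bits")
    if sum(lane_widths) > PORT_WIDTH:
        raise ValueError("lanes exceed the 32-bit port")
    lanes, shift = [], 0
    for width in lane_widths:
        lanes.append((word >> shift) & ((1 << width) - 1))
        shift += width
    return lanes
```

With this model, `split_port(word, [8, 8, 8, 8])` corresponds to the four 8bit imaging ports and `split_port(word, [16, 16])` to the two 16bit audio ports; mixed allocations such as `[4, 12, 16]` are equally valid under the 4bit rule.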

If used in a pipeline construct, data can be piped from one array to another; or the ports may be used to draw data from a high-speed peripheral like an imager or fast ADC while simultaneously driving out to a DAC or display.

The D-Fabrix array is made up of hundreds or thousands of simple 4bit ALUs, giving very fine granularity of resource.

Creating wider execution units is simply a matter of combining ALUs, typically into 8, 12 or 16bit units but occasionally into far larger ones: for example, the CIC filter used widely in software defined radio applications would often require a 64bit datapath.
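Why a CIC filter can need such a wide datapath follows from the standard bit-growth estimate for CIC decimators (Hogenauer's result, which is not taken from the article): a filter with N stages, decimation ratio R and differential delay M needs roughly B_in + ceil(N × log2(R × M)) bits to run at full precision. The sketch below evaluates that estimate; the 14bit input and decimation ratio of 1024 are assumed values, chosen only to show how a 5th order filter can reach 64 bits.

```python
import math

ALU_BITS = 4  # width of one D-Fabrix ALU, as stated in the article


def cic_register_bits(input_bits, stages, decimation, diff_delay=1):
    """Full-precision CIC register width: B_in + ceil(N * log2(R * M))."""
    growth = math.ceil(stages * math.log2(decimation * diff_delay))
    return input_bits + growth


def alus_per_word(word_bits):
    """Number of 4-bit ALUs needed side by side to hold one word."""
    return math.ceil(word_bits / ALU_BITS)


# Assumed example: 5 stages, 14-bit samples, decimation by 1024.
bits = cic_register_bits(input_bits=14, stages=5, decimation=1024)
print(bits, alus_per_word(bits))
```

Under these assumed parameters the estimate comes to 64 bits, or sixteen 4bit ALUs per word. Smaller decimation ratios or fewer stages need correspondingly narrower units.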

Benchmarks – based on the DF1-1024 (1024 ALU arrays)

5th order CIC filter

UMTS Viterbi: ~1024 voice channels

Floyd-Steinberg Color: 16 image lines in parallel

8 x 8 DCT: 400Mpixel/s (four DCTs in parallel)

JPEG encoder: 200Mpixel/s (two macroblocks in parallel)

Elixent says that in most cases the solutions outlined can be scaled to give near-linear performance enhancement by using a larger array.

Published in Embedded Systems (Europe) May 2002
