Developing DSP code on converged hybrid DSP/RISC cores

Recently, processor designers have been reducing power consumption and part cost in a new way – by combining RISC and DSP features into a single core, known as a ‘convergent’ processor core.

Two examples of architectures that were designed from the beginning to be convergent are the Analog Devices Blackfin processor and StarCore processors (SC1000, SC2000, and SC v5). Other convergent architectures are based on well-established RISC architectures that have been modified to efficiently fulfill DSP functions. These include the MIPS 24KE, Renesas SH3-DSP, PowerPC with AltiVec, and ARM966E-S.

Compared to standalone RISC systems, such convergent processors can be much more efficient at performing DSP tasks. Yet compared to traditional DSP processor designs, convergent processors generally have more complex pipelines and can run at high enough speeds to allow for fast control-intensive computing.

But even with the design simplification and improvements in production cost and power usage such hybrid architectures can offer, programming them can still be a chore. The choice of tools can make a huge difference in the ability to bring to market quickly a product that is maintainable, robust, and well-positioned for the challenges a successful product can bring.

For example, many of the DSP algorithms that run on a converged processor have historically been written in assembly. Assembly programming provides the ultimate level of control over hardware and allows a clever programmer to squeeze every last drop of processing power out of a processor. As DSP applications get larger and more complex, however, the cost of programming in assembly increases accordingly.

While it may still be useful to profile the DSP core, find the 20% of your code that takes 80% of the processing time, and rewrite that code in assembly, the remainder of your code can be written in a higher-level programming language.

Writing all DSP processing code in low-level assembly can slow down development and leave the product development team with a completely non-portable, confusing, unmanageable source base. The C language's type system, run-time, and control-flow constructs greatly improve development time, ease of debugging, code browsing, and maintenance.

But the ISO C99 language standard still makes no provision for operations commonly used in DSP applications, such as multiple-data operations, saturating arithmetic, or fixed-point types. With a standard C compiler, an application could still be partitioned, with control tasks written in C and DSP tasks written in assembly.

One way tool vendors have improved this situation is to allow DSP programming in C by adding ‘intrinsic functions’ to their compilers. Intrinsic functions are recognized by name in the compiler and are handled differently than regular function calls: the compiler generates a known instruction or series of instructions, typically without making a function call at all.

For instance, an intrinsic function call might be used to signal to the compiler that a saturating add should take place on two integers. It would be called as if a function had been written that takes two integer parameters (whose values should be added) and returns the result in an integer. For example, if the instruction to perform a saturating add is ‘adds’ on the convergent architecture, the result of:

k = sat_add(i, j);

would simply be (assuming i, j, and k were in r1, r2, and r3, respectively):

adds r3, r1, r2

The expressive power of C and the availability of optimizations (such as register allocation, instruction scheduling, and so on) are preserved with the intrinsic function approach. Intrinsic functions alone are certainly an improvement over being required to write all DSP code in assembly, but they also have limitations.

First, the use of intrinsic functions does not preserve the portability of C. An architecture designer may mandate the intrinsic names, so these names will not be shared between different architectures.

Secondly, an intrinsic function’s name may not be particularly revealing of what it does. For instance, fixed-point types are arithmetic types that support many of the same operations as floating-point types. Yet if the user wants to port code that works with floating-point types to fixed-point types, the code would need to be rewritten to use less obvious intrinsic calls.

As an aside, the GNU compiler provides the ability to write arbitrary code directly in assembly within C source. The code can use C variables directly, providing access to functionality that cannot be expressed in standard C. While this approach is more general than the intrinsic function approach, it suffers from similar problems.

The GNU ‘asm’ statement syntax is obscure and somewhat fragile. A typo in the parameters to the assembly code can result in low-level errors that manifest themselves at later points in the C code.

Despite its potential, many regard this syntax as harder to use than intrinsic functions while serving the exact same purpose. And rewriting assembly code for a new architecture is usually more time-consuming and error-prone than changing intrinsic function names, making code that uses assembly statements less portable than code that uses intrinsic functions.

An arguably cleaner solution can be provided by C language extensions. One appropriate set is the DSP-C language extensions, a subset of which has been standardized in the ISO Embedded C specification. These extensions are designed to simplify the programming of DSPs using fixed-point types, saturating types, and circular addressing features. With DSP extensions to C, it is possible to write a fixed-point dot product in C with this more portable syntax:

Figure 1. Use of DSP-C to perform a dot-product computation
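The original figure code is not reproduced here; the following is a sketch of the kind of dot product it would show, assuming a compiler that implements the ISO/IEC TR 18037 (Embedded C) fixed-point types. It will not build with a plain ISO C host compiler.

```c
#include <stdfix.h>   /* TR 18037 convenience names: fract, accum */

/* Fixed-point dot product. 'fract' holds values in [-1, 1); 'accum'
 * is a wider type so intermediate products can accumulate without
 * overflowing, mirroring a DSP's accumulator register. */
accum dot_product(const fract a[], const fract b[], int n)
{
    accum sum = 0.0k;              /* 'k' suffix: accum literal */
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];        /* fract multiply, wide accumulate */
    return sum;
}
```

The point of the syntax is that the loop body reads exactly like its floating-point counterpart; only the type names change.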

Another area where ISO C falls short in describing operations useful in DSP processing is Single Instruction Multiple Data (SIMD) operations. SIMD instructions can be used in DSP algorithms in cases where a single operation can be applied independently to multiple values concurrently. C language extensions can be used to allow the programmer to think in terms of fixed-size vectors.

These vectors can be made part of the C type system, so that a function written with these extensions can explicitly indicate what multiple-data operations should be performed at once. A well-established C language extension for vectors exists for compilers that target architectures with AltiVec and VMX extensions.

A function that adds n 16-bit integer elements in two arrays (a and b) and stores the result into a third result array can be written using AltiVec extensions as follows:

Figure 2. Use of AltiVec extensions for SIMD vector addition computation
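The original figure code is not reproduced here; the following is a sketch of what it would show, assuming an AltiVec-capable compiler (e.g. GCC with -maltivec on PowerPC). The arrays are assumed 16-byte aligned and n a multiple of 8, since each 128-bit vector register holds eight 16-bit lanes.

```c
#include <altivec.h>

/* Add two arrays of 16-bit integers eight elements at a time using
 * AltiVec intrinsics: vec_ld loads a 128-bit vector, vec_adds performs
 * a saturating lane-wise add, and vec_st stores the result. */
void vector_add(const short *a, const short *b, short *result, int n)
{
    for (int i = 0; i < n; i += 8) {
        vector signed short va = vec_ld(0, (const vector signed short *)(a + i));
        vector signed short vb = vec_ld(0, (const vector signed short *)(b + i));
        vec_st(vec_adds(va, vb), 0, (vector signed short *)(result + i));
    }
}
```

One iteration of this loop does the work of eight iterations of a scalar loop, which is the payoff the article describes.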

This syntax could be further simplified if the tools provider creates language extensions that allow, for instance, the addition operator to be applied to vector types. Any architecture with SIMD operations could be programmed more easily with similar vector extensions.

The most portable and easiest way to write SIMD code is to write it as you would regular C and allow the compiler to generate the appropriate SIMD instructions through advanced code analysis. Some compilers are able to perform this optimization, called ‘automatic code vectorization’.

Even when using the most sophisticated tools, it is helpful to give advanced DSP developers the ability to hand-code their most time-critical algorithms. A tool suite can provide advanced assembler options to make assembly programming easier.

For instance, assembler extensions can be provided to make it possible to write code that accesses C and C++ data structures, enumerator values, and preprocessing directives. As the C code is maintained and improved, the assembly code can be rebuilt to automatically reflect those improvements.

The best development tools for converged processors accept DSP-C extensions and vector extensions, perform automatic code vectorization, and provide assembler extensions on various architectures. But a powerful set of extensions in compilers and assemblers is really only part of the story when developing a system based on a convergent processor. The linking and debugging environments should also be able to simplify development.

Once you have written your code using DSP or vector extensions, your debugger should be able to debug code that uses those extensions. For instance, viewing the value of a 16-bit signed fixed-point variable whose value is 0.5 should show this value, not the underlying value in the register (16384).

This may be an obvious expectation, as floating-point values are always shown as their human-readable value, not the IEEE-754 encoding stored in registers. But some tools environments essentially ‘forget’ the types of DSP-enhanced code and lose the ability to display their values, showing only their encodings. Similarly, a vector of eight 16-bit integers should be viewable as a list of eight integer values and not as its amorphous 128-bit encoding in the corresponding vector hardware.

If your convergent processor has internal memory, assigning and allocating it can be vital to achieving maximum operating performance. Data and text references in internal memory are generally guaranteed to be single-cycle accesses (much like references to cached memory in traditional RISC architectures).

Your build environment should be able to simplify the process of allocating and reserving that memory. Language extensions can be used to indicate which functions and data objects go into which sections. Powerful linkers can also take descriptions of how sections in object files should be placed in the various kinds of memory.

For instance, it may be useful to have all of the text and data in firfilt.o, convolv.o, and fft.o placed inside fast internal memory without modifying the source files themselves. This should be possible by modifying your linker description file, either textually or graphically with a user interface that allows assignment of object sections to particular target-resident memory areas. Because internal memory is often quite limited, it is essential for the build environment to double-check that you are not requesting more memory than is available in your core’s on-chip memory.

When it comes to actually running the code you have built, you generally have the option to run it in an instruction set simulator or on the actual target. With a target image built to optimally use the internal memory, it is necessary that your development environment can flash the resulting firmware to your board. However, flashing a large image is generally slow.

A faster approach for iterative debugging is to allow the user to download the image directly into internal and external memory as it would have been placed after booting, and then begin running the program. A state-of-the-art probe can download a target image to some convergent cores in less than 10% of the time it would take to flash the same image. This has an impact on development time that is hard to overstate. Each time a minor change is made to an application to diagnose or fix a bug, having a faster way to test the change quickly adds up to a lot of time savings.

Ken Mixter is engineering manager at Green Hills Software, Inc.
