Developing DSP code on converged hybrid DSP/RISC cores

Recently, processor designers have been reducing power consumption and part cost in a new way: by combining RISC and DSP features into a single core, known as a ‘convergent’ processor core.
Two examples of architectures that were designed from the outset to be convergent are the Analog Devices Blackfin processor and the StarCore processors (SC1000, SC2000, and SC v5). Other convergent architectures are based on well-established RISC designs that have been extended to perform DSP functions efficiently. These include the MIPS 24KE, Renesas SH3-DSP, PowerPC with AltiVec, and ARM966E-S.
Compared to standalone RISC systems, such convergent processors can be much more efficient at DSP tasks. Compared to traditional DSP designs, they generally have more complex pipelines and can run at clock speeds high enough to support fast control-intensive computing.
But even with the design simplification and the savings in production cost and power that such hybrid architectures can offer, programming them can still be a chore. The choice of tools can make a huge difference in the ability to quickly bring to market a product that is maintainable, robust, and ready for the challenges a successful product brings.
For example, many of the DSP algorithms that can run on a converged processor have been written in assembly. Assembly programming provides the ultimate level of control over the hardware and allows a clever programmer to squeeze every last drop of processing power out of a processor. But as DSP applications grow larger and more complex, the cost of programming in assembly grows with them.
While it still may be useful to profile the DSP core and find the 20% of your code that takes 80% of the processing time and rewrite that code in assembly, the remainder of your code can be written in a higher level programming language.
Writing all DSP processing code in low-level assembly can slow down development and leave the product development team with a completely non-portable, confusing, unmanageable source base. The C language's type system, run time, and control-flow constructs greatly improve development time, ease of debugging, code browsing, and maintenance.
But the ISO C99 language standard still makes no provision for operations commonly used in DSP applications, such as multiple-data operations, saturating arithmetic, or fixed point types. With a standard C compiler, an application must still be partitioned, with control tasks written in C and DSP kernels written in assembly.
One way tool vendors have improved this situation is to allow DSP programming in C by adding ‘intrinsic functions’ to their compilers. Intrinsic functions are recognized by name and handled differently from regular function calls: the compiler generates a known instruction or sequence of instructions, typically without making a function call at all.
For instance, an intrinsic function call might signal to the compiler that a saturating add should take place on two integers. It is called as if a function had been written that takes two integer parameters (whose values should be added) and returns the result in an integer. If the instruction to perform a saturating add on the convergent architecture is ‘adds’, the result of:
k = sat_add(i, j)
would simply be (assuming i, j, and k were in r1, r2, and r3 respectively):
adds r3, r1, r2
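In portable C, the behavior the intrinsic exposes in one instruction can be sketched as follows (a hypothetical fallback implementation; `sat_add` is not a standard function, and a real compiler would collapse the call into the single ‘adds’ instruction):

```c
#include <stdint.h>

/* Sketch of what a sat_add intrinsic computes: a 32-bit addition that
 * clamps to INT32_MAX/INT32_MIN instead of wrapping on overflow. */
static int32_t sat_add(int32_t i, int32_t j)
{
    int64_t sum = (int64_t)i + (int64_t)j;  /* widen so the add cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

The widened arithmetic and the two comparisons are exactly what the hardware instruction does in one cycle, which is why saturating operations written this way in plain C are so much more expensive than the intrinsic form.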
The expressive power of C and its optimizations (register allocation, instruction scheduling, and so on) remain available with the intrinsic function approach. Intrinsic functions alone are certainly an improvement over being required to write all DSP code in assembly, but they have limitations.
First, intrinsic functions do not preserve the portability of C. The intrinsic names are typically mandated by the architecture vendor, so they are not shared between different architectures.
Secondly, an intrinsic function’s name may not reveal much about what it does. For instance, fixed point types are arithmetic types that support many of the same operations as floating point types, yet porting code that works with floating point types to fixed point would require rewriting it with less obvious intrinsic calls.
As an aside, the GNU compiler provides the ability to write arbitrary code directly in assembly within C source. The code can use C variables directly and is a way to provide access to functionality that cannot be expressed by the standard C language. While this approach is more general than the intrinsic function approach, it suffers from similar problems.
The GNU ‘asm’ statement syntax is obscure and somewhat fragile; a typo in the parameters passed to the assembly code can cause low-level errors that only manifest later, elsewhere in the C code.
Despite its generality, many regard this syntax as harder to use than intrinsic functions while serving the same purpose. And rewriting assembly code for a new architecture is usually more time-consuming and error-prone than changing intrinsic function names, making code using assembly statements less portable than code using intrinsic functions.
An arguably cleaner solution can be provided by C language extensions. One set of appropriate language extensions are the DSP-C language extensions. A subset of these extensions has become standardized in the ISO Embedded C specification. These extensions are designed to simplify the programming of DSPs using fixed point types, saturating types, and circular addressing features. With DSP extensions to C, it would be possible to write a fixed point dot product in C code with this more portable syntax:
|Figure 1. Use of DSP C to perform a dot-product computation|
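Since the figure itself is not reproduced here, the following plain-C sketch shows the same idea, with the Q15 fractional arithmetic that DSP-C fract types make implicit written out by hand (the scaling, widening, and saturation below would disappear into the type system with the Embedded C extensions):

```c
#include <stdint.h>
#include <stddef.h>

/* Q15 dot product: each int16_t element represents raw/32768, a value
 * in [-1, 1). The product of two Q15 values is Q30; shifting right by
 * 15 renormalizes to Q15. The 64-bit accumulator stands in for the
 * DSP's wide accumulator register, and the final clamp mimics the
 * saturation that DSP-C saturating types provide automatically. */
static int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += (int32_t)a[k] * b[k];        /* Q30 partial products */
    acc >>= 15;                             /* renormalize to Q15   */
    if (acc > INT32_MAX) acc = INT32_MAX;   /* saturate on overflow */
    if (acc < INT32_MIN) acc = INT32_MIN;
    return (int32_t)acc;
}
```

With fixed point types in the language, the loop body reduces to an ordinary multiply-accumulate expression, and the compiler is free to map it onto the core's MAC hardware.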
Another area where ISO C falls short for DSP processing is Single Instruction, Multiple Data (SIMD) operations. SIMD instructions can be used in DSP algorithms wherever a single operation can be applied independently to multiple values concurrently. C language extensions can let the programmer think in terms of fixed-size vectors.
These vectors can be made part of the C type system so that a function written with these extensions can explicitly indicate what multiple data operations should be performed at once. A well-established C language extension for vectors exists for compilers that target architectures with AltiVec and VMX extensions.
A function that adds n 16-bit integer elements in two arrays (a and b) and stores the results into a third array can be written using the AltiVec extensions as follows:
|Figure 2. Use of AltiVec extensions for SIMD vector addition computation|
This syntax could be further simplified if the tools provider creates language extensions that allow, for instance, the addition operator to be applied to vector types. Any architecture with SIMD operations could be programmed more easily with similar vector extensions.
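As an illustration of operators on vector types, the generic vector extension available in GCC and Clang (an assumption here; not every toolchain provides it) already permits exactly this, applying ordinary arithmetic operators element-wise to fixed-size vector types:

```c
#include <stdint.h>

/* A 128-bit vector of eight 16-bit integers, declared with the
 * GCC/Clang vector_size attribute. */
typedef int16_t v8i16 __attribute__((vector_size(16)));

/* Element-wise addition expressed with the ordinary '+' operator;
 * the compiler lowers it to a single SIMD add on targets that have
 * one, or to scalar code elsewhere. */
static v8i16 vadd(v8i16 a, v8i16 b)
{
    return a + b;
}
```

The type itself carries the SIMD width, so the programmer states *what* parallel operation is wanted and leaves the instruction selection to the compiler.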
The most portable and easiest way to write SIMD code is to write it as ordinary C and let the compiler generate the appropriate SIMD instructions through advanced code analysis. Some compilers can perform this optimization, known as ‘automatic vectorization’.
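A loop written for the auto-vectorizer needs no extensions at all; with restrict qualifiers assuring the compiler that the arrays do not alias, an optimizing build can emit the same SIMD adds from plain C (a sketch; whether it actually vectorizes depends on the compiler and flags):

```c
#include <stdint.h>
#include <stddef.h>

/* Plain-C version of the Figure 2 computation. Built with an
 * optimizing, vectorizing compiler (e.g. -O3), this loop can be
 * turned into SIMD instructions automatically; 'restrict' promises
 * the compiler that the three arrays do not overlap. */
static void add16(int16_t *restrict res,
                  const int16_t *restrict a,
                  const int16_t *restrict b, size_t n)
{
    for (size_t k = 0; k < n; k++)
        res[k] = a[k] + b[k];
}
```

The same source then ports unchanged to any target, vectorized or not, which is exactly the portability argument made above.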
Even when using the most sophisticated tools, it is helpful to allow advanced DSP developers the ability to hand-code their most time-critical algorithms. A tool suite can provide advanced assembler options to make assembly programming easier.
For instance, assembler extensions can be provided to make it possible to write code that accesses C and C++ data structures, enumerator values, and preprocessing directives. As the C code is maintained and improved, the assembly code can be rebuilt to automatically reflect those improvements.
The best development tools for converged processors accept DSP extensions to C and vector extensions, perform automatic vectorization, and provide assembler extensions across various architectures. But a powerful set of compiler and assembler extensions is only part of the story when developing a system based on a convergent processor. The linking and debugging environments should simplify development as well.
Once you have written code using DSP or vector extensions, your debugger should be able to debug that code. For instance, viewing a 16-bit signed fixed point variable whose value is 0.5 should show 0.5, not the underlying register encoding (16384).
This may seem an obvious expectation, since floating point values are always shown as human-readable values, not as the IEEE-754 encodings stored in registers. But some tools environments essentially ‘forget’ the types in DSP-enhanced code and can only display the raw encodings. Similarly, a vector of eight 16-bit integers should be viewable as a list of eight integer values, not as an amorphous 128-bit encoding in the corresponding vector hardware.
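The 0.5-versus-16384 relationship is just the Q15 fixed point encoding; a tiny sketch makes explicit the mapping the debugger should perform when displaying such a variable:

```c
#include <stdint.h>

/* Convert a signed 16-bit Q15 fixed point encoding to the real value
 * it represents: value = raw / 2^15, giving a range of [-1, 1). */
static double q15_to_double(int16_t raw)
{
    return raw / 32768.0;
}
```

A debugger that carries the fixed point type information through from the compiler can apply this conversion automatically, exactly as it already does for IEEE-754 floats.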
If your convergent processor has internal memory, assigning and allocating it can be vital to achieving maximum operating performance. Data and text references to internal memory are generally guaranteed to be single-cycle accesses (much like references to cached memory in traditional RISC architectures).
Your build environment should simplify the process of allocating and reserving that memory. Language extensions can be used to indicate which functions and data objects go into which sections. Powerful linkers can also take descriptions of how sections in object files should be placed into the various kinds of memory.
For instance, it may be useful to have all of the text and data in firfilt.o, convolv.o, and fft.o placed in fast internal memory without modifying the source files themselves. This should be possible by modifying your linker description file, either textually or graphically, with a user interface that allows assignment of object sections to particular target-resident memory areas. Because internal memory is often quite limited, it is essential for the build environment to double-check that you are not requesting more memory than your core’s on-chip memory provides.
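A GNU-ld-style linker description fragment for this placement might look as follows (a hypothetical sketch: the memory names, addresses, and sizes are illustrative, not taken from any real part):

```
/* Hypothetical memory map: one small fast internal SRAM, one large
 * external SDRAM. Addresses and lengths are assumptions. */
MEMORY
{
    L1_SRAM (rwx) : ORIGIN = 0xFF800000, LENGTH = 48K   /* fast internal */
    SDRAM   (rwx) : ORIGIN = 0x00000000, LENGTH = 32M   /* external      */
}

SECTIONS
{
    /* Pin the hot DSP kernels into internal memory without touching
     * their source files; everything else falls through to SDRAM. */
    .l1_text : { firfilt.o(.text) convolv.o(.text) fft.o(.text) } > L1_SRAM
    .l1_data : { firfilt.o(.data) convolv.o(.data) fft.o(.data) } > L1_SRAM
    .text    : { *(.text) } > SDRAM
    .data    : { *(.data) } > SDRAM
}
```

Because LENGTH is declared for each region, the linker itself can perform the overflow check described above, failing the build if the selected objects exceed the internal SRAM.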
When it comes to actually running the code you have built, you generally have the option of running it in an instruction set simulator or on the actual target. With a target image built to make optimal use of internal memory, your development environment must be able to flash the resulting firmware to your board. Flashing a large image, however, is generally slow.
A faster approach for iterative debugging is to let the user download the image directly into internal and external memory, laid out as it would be after booting, and then begin running the program. A state-of-the-art probe can download a target image to some convergent cores in less than 10% of the time it would take to flash the same image. The impact on development time is hard to overstate: each time a minor change is made to diagnose or fix a bug, a faster way to test that change quickly adds up to substantial time savings.
Ken Mixter is engineering manager at Green Hills Software, Inc.