Recently processor designers have been reducing power consumption and
part cost in a new way - by combining RISC and DSP features into a
single core, known as a ‘convergent’ processor core.
Two examples of architectures that were designed from the beginning
to be convergent are the Analog Devices Blackfin processor and StarCore
processors (SC1000, SC2000, and SC v5). Some other examples of
convergent architectures are based on well established RISC
architectures and have been modified to efficiently fulfill DSP
functions. These include the MIPS 24KE, Renesas SH3-DSP, PowerPC with
AltiVec, and ARM966E-S.
Compared to standalone RISC systems, such convergent processors can
be much more efficient at performing DSP tasks. Yet compared to
traditional DSP processor designs, convergent processors generally have
more complex pipelines and can run at high enough speeds to allow for
fast control-intensive computing.
But even with the design simplification and the improvements in
production cost and power usage that such hybrid architectures can offer
the developer, programming them can still be a chore. The choice of
tools can make a huge difference in the ability to quickly bring to
market a product that is maintainable, robust, and ready for the
challenges that success can bring.
For example, many of the DSP algorithms that can run on a convergent
processor have been written in assembly. Assembly programming provides
the ultimate level of control over hardware and allows a clever
programmer to squeeze every last drop of processing power out of a
processor. As DSP applications get larger and more complex, however,
the cost of programming in assembly increases accordingly.
While it still may be useful to profile the DSP core and find the
20% of your code that takes 80% of the processing time and rewrite that
code in assembly, the remainder of your code can be written in a higher
level programming language.
Writing all DSP processing code in low-level assembly can slow down
development and leave the product development team with a completely
non-portable, confusing, unmanageable source base. The C language type
system, run-time, and control-flow constructs greatly improve
development time, ease of debugging, code browsing, and maintenance.
But the ISO C99 language standard still does not make provisions for
operations commonly used in DSP applications, such as multiple-data
operations, saturating arithmetic, or fixed point types. With a
standard C compiler, an application could still be partitioned with
control tasks written in C and DSP tasks written in assembly.
One way tool vendors have improved this situation is to allow DSP
programming in C by adding ‘intrinsic functions’ to their compiler.
Intrinsic functions are recognized by name in the compiler and are
handled differently than regular function calls in that the compiler
will generate a known instruction or series of instructions, typically
without making a function call at all.
For instance, an intrinsic function call might be used to signal to
the compiler that a saturating add should take place on two integers.
It would be called as if it were an ordinary function that takes two
integer parameters (whose values should be added) and returns the
result as an integer. If the instruction to perform a saturating add on
the convergent architecture is ‘adds’, the result of:
k = sat_add(i, j)
would simply be (assuming i, j, and k were in r1, r2, and r3
respectively):
adds r3, r1, r2
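On a compiler without such an intrinsic, the same semantics can be spelled out in portable C. The sketch below is only illustrative: sat_add here is a hypothetical helper that assumes 32-bit integers, not any vendor's actual intrinsic, and a compiler with intrinsic support would replace the whole body with the single saturating-add instruction.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for a saturating-add intrinsic.
   Instead of wrapping on overflow, the result clamps to the
   largest or smallest representable 32-bit value. */
int32_t sat_add(int32_t i, int32_t j)
{
    /* Widen to 64 bits so the intermediate sum cannot overflow. */
    int64_t sum = (int64_t)i + (int64_t)j;

    if (sum > INT32_MAX)
        return INT32_MAX;
    if (sum < INT32_MIN)
        return INT32_MIN;
    return (int32_t)sum;
}
```

The comparison and branches here are exactly the overhead that a hardware saturating add removes.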
The expressive power of C and the usual compiler optimizations (such
as register allocation and instruction scheduling) remain available
with the intrinsic function approach. Intrinsic functions alone are
certainly an improvement over being required to write all DSP code in
assembly, but they also have limitations.
First, the use of intrinsic functions does not harness the
portability of C. An architecture designer may mandate the intrinsic
names and so these names will not be shared between different
architectures.
Secondly, an intrinsic function’s name may not be particularly
revealing of what it does. For instance, fixed point types are
arithmetic types that can support many of the same operations as
floating point types. Yet if the user wants to port code that works for
floating point types to fixed point types, the code would need to be
rewritten to use less obvious intrinsic calls.
As an aside, the GNU compiler provides the ability to write
arbitrary code directly in assembly within C source. The code can use C
variables directly and is a way to provide access to functionality that
cannot be expressed by the standard C language. While this approach is
more general than the intrinsic function approach, it suffers from
similar problems.
The GNU ‘asm’ statement syntax is obscure and somewhat fragile. A
mistyped parameter to the assembly code can result in low-level errors
that manifest themselves at later points in the C code.
Despite its potential, many regard this syntax as harder to use than
intrinsic functions while serving the exact same purpose. And
rewriting assembly code for a new architecture is usually more
time-consuming and error prone than changing intrinsic function names,
making code using assembly statements less portable than code using
intrinsic functions.
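To give a feel for the syntax, here is a minimal sketch of a GNU extended asm statement. It assumes GCC or Clang targeting x86 purely so that it can be tried on a desktop host; the function name asm_add is hypothetical, and on any other compiler or target the sketch falls back to plain C. A convergent core would use its own mnemonics and register constraints.

```c
/* Adds two ints. On GCC/Clang for x86 the addition is written as a
   GNU extended asm statement; elsewhere plain C is used instead. */
int asm_add(int a, int b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    int result;
    /* "=r": result is an output held in any general register.
       "0":  a must start out in the same register as operand 0.
       "r":  b is an input held in any general register.
       "cc": the add instruction modifies the condition flags. */
    __asm__("addl %2, %0"
            : "=r"(result)
            : "0"(a), "r"(b)
            : "cc");
    return result;
#else
    return a + b;
#endif
}
```

The instruction text and the three colon-separated operand lists must agree with each other, and the compiler can check only part of that agreement. Getting a constraint wrong typically produces code that compiles but corrupts a register.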
An arguably cleaner solution can be provided by C language
extensions. One appropriate set is the DSP-C language extensions. A
subset of these extensions has become
standardized in the ISO Embedded C specification. These extensions are
designed to simplify the programming of DSPs using fixed point types,
saturating types, and circular addressing features. With DSP extensions
to C, it would be possible to write a fixed point dot product in C code
with this more portable syntax:
Figure 1. Use of DSP-C to perform a dot-product computation
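As a rough illustration of what such a fixed point dot product computes, the sketch below emulates it in standard C. It is not DSP-C or Embedded C source: the type q15_t and the function dot_q15 are hypothetical names, and the Q15 format (a 16-bit value interpreted as raw / 32768) is an assumption. With the language extensions, the operands would simply be declared with a fixed point type and the compiler would generate the scaling and saturation written out by hand here.

```c
#include <stdint.h>
#include <stddef.h>

typedef int16_t q15_t;  /* hypothetical Q15 type: value = raw / 32768 */

/* Dot product of two Q15 arrays, saturated to the Q15 range. */
q15_t dot_q15(const q15_t *a, const q15_t *b, size_t n)
{
    int64_t acc = 0;  /* wide accumulator holding Q30 products */

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];

    acc /= 32768;  /* rescale from Q30 back to Q15 */

    if (acc > INT16_MAX)
        return INT16_MAX;
    if (acc < INT16_MIN)
        return INT16_MIN;
    return (q15_t)acc;
}
```

For example, the vectors (0.5, 0.5) and (0.5, 0.25) are encoded as (16384, 16384) and (16384, 8192), and their dot product 0.375 comes back as 12288.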
Another area where ISO C falls short for DSP programming is Single
Instruction Multiple Data (SIMD) operations. SIMD instructions can be
used in DSP algorithms wherever a single operation can be applied
independently to multiple values at once. C language extensions can be
used to allow the programmer to think in terms of fixed-size vectors.
These vectors can be made part of the C type system so that a
function written with these extensions can explicitly indicate what
multiple data operations should be performed at once. A
well-established C language extension for vectors exists for compilers
that target architectures with AltiVec and VMX extensions.
A function that adds n 16-bit integer elements in two arrays (a and
b) and stores the result into a third result array can be written using
AltiVec extensions as follows:
Figure 2. Use of AltiVec extensions for SIMD vector addition computation
This syntax could be further simplified if the tools provider
creates language extensions that allow, for instance, the addition
operator to be applied to vector types. Any architecture with SIMD
operations could be programmed more easily with similar vector
extensions.
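One existing instance of this idea is the generic vector extension in GCC and Clang, which lets the ordinary + operator work on vector types. The sketch below uses that extension rather than AltiVec syntax so that it can be built on any host; the names v8i16 and vec_add_i16 are hypothetical, and other compilers fall back to the scalar loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)
/* GCC/Clang generic vector type: eight 16-bit lanes in 16 bytes. */
typedef int16_t v8i16 __attribute__((vector_size(16)));
#endif

/* result[i] = a[i] + b[i] for i in [0, n). */
void vec_add_i16(const int16_t *a, const int16_t *b,
                 int16_t *result, size_t n)
{
    size_t i = 0;

#if defined(__GNUC__)
    for (; i + 8 <= n; i += 8) {
        v8i16 va, vb, vr;
        /* memcpy avoids alignment assumptions about the arrays. */
        memcpy(&va, &a[i], sizeof va);
        memcpy(&vb, &b[i], sizeof vb);
        vr = va + vb;  /* one '+' adds all eight lanes */
        memcpy(&result[i], &vr, sizeof vr);
    }
#endif

    /* Scalar loop for leftover elements (or for other compilers). */
    for (; i < n; i++)
        result[i] = (int16_t)(a[i] + b[i]);
}
```

Whether the vector addition becomes a single SIMD instruction depends on the target; on a core without suitable hardware the compiler expands it into scalar code.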
The most portable and easiest way to write SIMD code is to write it
as you would with regular C and allow the compiler to generate the
appropriate SIMD instructions through advanced code analysis. Some
compilers are able to perform this optimization, which is called
‘automatic code vectorization’.
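A loop of the kind a vectorizing compiler looks for might be sketched as follows. The function name add_i16 is hypothetical; the restrict qualifiers are standard C99 and tell the compiler that the three arrays do not overlap, which is often what it needs to prove before it can process several elements at once.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain C with no vector types or intrinsics. 'restrict' promises
   that the arrays do not overlap, so iterations are independent. */
void add_i16(const int16_t *restrict a, const int16_t *restrict b,
             int16_t *restrict result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[i] = (int16_t)(a[i] + b[i]);
}
```

With GCC, for example, loop vectorization is controlled by the -ftree-vectorize option, which is enabled at -O3. Whether SIMD instructions are actually emitted depends on the compiler and on the target.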
Even when using the most sophisticated tools, it is helpful to give
advanced DSP developers the ability to hand-code their most
time-critical algorithms. A tool suite can provide advanced assembler
options to make assembly programming easier.
For instance, assembler extensions can be provided to make it
possible to write code that accesses C and C++ data structures,
enumerator values, and preprocessing directives. As the C code is
maintained and improved, the assembly code can be rebuilt to
automatically reflect those improvements.
The best development tools for convergent processors accept C DSP
extensions and vector extensions, perform automatic code
vectorization, and provide assembler extensions on various
architectures. But a powerful set of extensions in compilers and
assemblers is really only part of the story when developing a system
based on a convergent processor. The linking and debugging environments
should also be able to simplify development.
Once you have written your code using DSP or vector extensions, your
debugger should be able to debug your code that uses those extensions.
For instance, viewing a 16-bit signed fixed point variable whose value
is 0.5 should show that value, not the underlying encoding in the
register (16384).
This may be an obvious expectation as floating point values are
always shown as their human readable value, not the IEEE-754 encoding
stored in registers. But some tools environments essentially ‘forget’
the types of DSP enhanced code, and lose the ability to display their
values, only showing their encodings. Similarly, a vector of eight
16-bit integer types should be viewable as a list of eight integer
values and not as its amorphous 128-bit encoding in the corresponding
vector hardware.
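The relationship between the fixed point value and its encoding is simple arithmetic. The sketch below assumes the common Q15 layout (one sign bit and 15 fractional bits), which is consistent with the 0.5 and 16384 pairing above; the function name q15_to_double is hypothetical.

```c
#include <stdint.h>

/* Interpret a raw 16-bit register value as a signed Q15 fraction. */
double q15_to_double(int16_t raw)
{
    return (double)raw / 32768.0;
}
```

A debugger that keeps the type information performs this conversion for you: 16384 / 32768 = 0.5.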
If your convergent processor has internal memory, assigning and
allocating it can be vital to achieving maximum operating performance.
Data and text references in internal memory are generally guaranteed to
be single cycle accesses (much as references to cached memory in
traditional RISC architectures).
Your build environment should be able to simplify the process of
allocating and reserving that memory. Language extensions can be used
to indicate which functions and data objects go into which sections.
Powerful linkers can also take descriptions of how sections in object
files should be placed in the various kinds of memory.
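As one example of such a language extension, GCC and Clang accept a section attribute on functions and data objects. The sketch below is an assumption-laden illustration: the section names .fast_data and .fast_text, the macros, and the fir_* identifiers are all hypothetical, and the attribute is applied only on ELF targets so that the code still builds elsewhere. The linker description file remains responsible for mapping those sections onto the internal memory.

```c
#include <stddef.h>

/* Hypothetical section names; a real project would use the names
   defined by its linker description file. */
#if defined(__GNUC__) && defined(__ELF__)
#define FAST_DATA __attribute__((section(".fast_data")))
#define FAST_TEXT __attribute__((section(".fast_text")))
#else
#define FAST_DATA
#define FAST_TEXT
#endif

/* Filter coefficients the linker can place in internal memory. */
FAST_DATA int fir_coeffs[4] = { 1, 2, 2, 1 };

/* Time-critical routine placed in a section for internal memory. */
FAST_TEXT int fir_tap_sum(const int *samples, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n && i < 4; i++)
        sum += samples[i] * fir_coeffs[i];
    return sum;
}
```

Marking objects in the source this way is the alternative to placing whole object files from the linker description, which needs no source changes at all.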
For instance, it may be useful to have all of the text and data in
firfilt.o, convolv.o, and fft.o placed inside fast internal memory
without modifying
the source files themselves. This should be possible by modifying your
linker description file either textually or graphically with a user
interface that allows assignment of object sections to particular
target-resident memory areas. Because internal memory is often quite
limited, it is essential for the build environment to double-check that
you are not requesting more memory than your core’s on-chip memory
provides.
When it comes to actually running the code you have built, you
generally have the option to run it in an instruction set simulator or
run on the actual target. With a target image built that optimally uses
the internal memory, your development environment must be able to flash
the resulting firmware to your board. However, flashing a
large image is generally slow.
A faster approach to iterative debugging is to allow the user to
download the image directly into internal and external memory as it
would have been placed after booting, and then begin running the
program. A state-of-the-art probe can download a target image to some
convergent cores in less than 10% of the time it would take to flash
the same image. This has an impact on development time that is hard to
overstate. Each time a minor change is made to an application to
diagnose or fix a bug, a faster way to test the change saves time, and
those savings add up quickly.
Ken Mixter is engineering manager at Green Hills Software, Inc.