Using Embedded-C for high performance DSP programming
High level language programming has been in use for a long time for embedded system development. However, assembly programming still prevails, particularly for DSP based systems. DSP processors are often programmed in assembly language by assembly programmers who know the processor architecture inside out. The key motivation for this practice is performance, despite the many disadvantages assembly programming has over high level language programming.
Performance is key to signal processing applications because it directly translates into end-user features. A 10% lower clock speed may, for example, result in a 20% longer battery life. If the video decoding takes 80% of the CPU-cycle budget instead of 90%, there are twice as many cycles available for audio processing. This coupling of performance to end-user features is characteristic of many of the real-time applications in which DSP processors are applied.
The fixed point and named address space extensions of Embedded C provide the programmer with direct access to these features in the target processor, thus significantly improving the performance of applications. The hardware I/O extension is a portability feature of Embedded C. Its goal is to allow easy porting of device driver code between systems.
Typical DSP Architectures
A look into the typical architecture of DSP processors is required to understand the need for an extension to C.
Figure 1, above, shows an example DSP architecture. It has a highly specialized data-path that is optimized to do fixed point convolution operations. There are two memories that directly feed into the arithmetic unit.
Working at full efficiency requires that the data is allocated in the appropriate memory. The second important feature is the arithmetic unit itself. It can do a multiply-accumulate operation in one action and it has a local wide data path and accumulator register to store an intermediate result. Additionally, it has shifters to operate in a fixed point mode and can saturate overflowing results.
Local memory, fixed point arithmetic, saturation and the local accumulator register are not efficiently addressed when using standard C.
Changing Requirements and the Role of the Compiler
DSP architectures are not easy to program optimally, either by hand or with a compiler. Manual assembly programming is awkward because of the non-orthogonality of the architecture and arbitrary restrictions that can be in place. Modern compilers can deal with non-orthogonality reasonably well, but are not good at exploiting the special features that DSP processors have in place.
Many embedded applications, mobile phones for example, are implemented using two main processors. One processor is a low-power RISC processor that takes care of all control processing, user interaction and display management. It is programmed in a high level language making use of a software development kit that includes a compiler. The other processor is a DSP, which takes care of all of the signal processing. The signal processing algorithms are typically hand coded in assembly.
Changes in technological and economic requirements make it more and more expensive to continue programming the DSP processor in assembly. Staying with the mobile phone as an example, the signal processing algorithms required become increasingly complex. Features such as stronger error correction and encryption must be added. Communication protocols become more sophisticated and require much more code to implement. In certain market areas, multiple protocol stacks are implemented to be compatible with multiple service providers. In addition, backward compatibility with older protocols is needed to stay synchronized with provider networks that are in a slow process of upgrading.
On the economic side, time to market for new technology puts increasing pressure on design time. In 2006, the number of mobile phones expected to be sold worldwide is on the order of 900 million. In the western world, the time to replacement for mobile phones is between 1 and 2 years, and is driven by new features and fashion. To stay ahead in this market requires extremely fast and streamlined design projects.
Assembly programming has no place in this world. Assembly programming is error prone and slow. Assembly programs are difficult to maintain and make a company dependent on a few specialists. By definition, assembly programs are non-portable. Legacy code makes it extremely expensive to switch to a new technology. These dependencies make a company vulnerable to changes of its employees and its supplier chain.
The Programming Mismatch
Today, most embedded processors are offered with a C compiler. Despite this, programming DSP processors is still done in assembly for the signal processing parts or, at best, by using assembly written libraries supplied by the manufacturer.
The key reason for this is that although the architecture is well matched to the requirements of the signal processing application, there is no way to express the algorithms efficiently and in a natural way in standard C.
Saturated arithmetic, for example, is required in many algorithms and supplied as a primitive in many DSP processors. However, there is no such primitive in standard C.
To express saturated arithmetic in C requires comparisons, conditional statements and correcting assignments. Instead of using a language primitive, the operation is spread over a number of statements that are difficult to recognize as a single instruction by a compiler.
Enter Embedded C
Embedded C is designed to bridge the performance mismatch between the signal processing algorithms, standard C and the architecture. It extends the C language with the primitives that are needed by signal processing applications and that are commonly provided by DSP processors.
The design of Embedded C is based on DSP-C by ACE. DSP-C is an industry designed extension of C with which various DSP processor manufacturers have gained experience in their compilers since 1998.
Embedded C makes life easier for the application programmer. The primitives provided by Embedded C are the primitives that fit the conceptual model of the application. The Embedded C extension to C unlocks the high performance features of DSP processors for C programmers. Assembly programming is no longer required for a vast body of performance critical code. Maintainability and portability of code are the key winners in this process.
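Performance improving features introduced by Embedded C are fixed point and saturated arithmetic and named address spaces. For the details and language definition of Embedded C, see the ISO/IEC technical report on extensions for the programming language C to support embedded processors.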
Arithmetic. Embedded C adds two new primitive types _Fract and _Accum, and one type qualifier named _Sat. The underscores are included in these new keywords to ensure compatibility with existing applications.
The _Fract type offers fractional (also known as fixed point) data types that have a value range of [-1.0, +1.0> (-1.0 included but not +1.0). This is conveniently implemented using the two's complement arithmetic typically used for integer arithmetic. In two's complement notation, the dot of the fixed point value is imagined right after the sign bit, before all value bits. The first value bit represents 0.5, the second 0.25, etc. A fixed point number has no integer part.
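The bit weights described above can be illustrated in standard C. The following is a minimal sketch that assumes a 16-bit fractional type with one sign bit and 15 value bits; the type name q15_t and both helper functions are hypothetical and are not part of Embedded C, which leaves the exact number of bits to the implementation.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit fract: 1 sign bit, 15 value bits. */
typedef int16_t q15_t;

/* Value of a bit pattern: the raw two's complement integer scaled by 2^-15. */
static double q15_to_double(q15_t x)
{
    return (double)x / 32768.0;
}

/* Nearest bit pattern for a real value, clamped to the range [-1.0, +1.0>. */
static q15_t q15_from_double(double v)
{
    double scaled = v * 32768.0;
    if (scaled >= 32767.0)  return INT16_MAX;   /* +1.0 itself is not representable */
    if (scaled <= -32768.0) return INT16_MIN;   /* exactly -1.0 */
    /* Round half away from zero; the cast then truncates. */
    return (q15_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}
```

With this layout the pattern 0x4000 (only the first value bit set) reads as 0.5, 0x2000 as 0.25, and the pattern with only the sign bit set reads as -1.0.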
The Embedded C language does not specify the exact accuracy of the fixed point types, although a minimum accuracy is defined with which an implementation must comply. The _Fract type can be combined with the existing type specifiers short and long to define three different fractional types. The range of these types is the same, [-1.0, +1.0>, but the accuracy should stay equal or improve when moving from short _Fract to _Fract to long _Fract.
The _Accum type is also a fractional type and can also be combined with short and long. The three resulting _Accum types must match the three _Fract types in terms of accuracy, that is, the number of bits in the fraction. Additionally, the _Accum types have an integer part in their value. So, the range of an _Accum value may be, for example, [-256.0, +256.0>. Again, the number of integer bits is not specified in the Embedded C definition. The accumulator types match the accumulator registers of typical DSP processors. The aim of these registers is to keep intermediate arithmetic results without having to worry about overflow.
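To illustrate why the integer part matters, the sketch below emulates an accumulator in standard C. It assumes a 32-bit container with the same 15 fraction bits as a 16-bit fractional type; the names q15_t, acc15_t and acc_sum are hypothetical, and a real _Accum may have a different number of integer bits.

```c
#include <stdint.h>

typedef int16_t q15_t;    /* stand-in for fract: 15 fraction bits, no integer part */
typedef int32_t acc15_t;  /* stand-in for accum: the same 15 fraction bits plus integer bits */

/* Sum n fractional values; the running total is allowed to leave [-1.0, +1.0>. */
static acc15_t acc_sum(const q15_t *x, int n)
{
    acc15_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];   /* both types share one scaling, so no shift is needed */
    return acc;
}
```

Summing 0.75 four times gives 3.0, a value that no fractional type can hold but that fits in the accumulator without any overflow handling.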
The _Sat qualifier can be applied to fractional types. It causes all operations with operands of _Sat qualified type to be saturated. It does not change the storage representation. Saturation means that if overflow occurs in an operation, the result is set to the upper bound or lower bound of the type. For example, computing -0.75 + -0.75 results in -1.0 under saturated fixed point arithmetic.
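For comparison, this is roughly what the same saturating addition looks like when it has to be spelled out in standard C. The sketch assumes a 16-bit fractional value with 15 fraction bits; q15_t and q15_add_sat are hypothetical names.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* stand-in for a 16-bit sat fract */

/* Saturating add: compute in a wider type, then clamp to the 16-bit range. */
static q15_t q15_add_sat(q15_t a, q15_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp to the upper bound */
    if (sum < INT16_MIN) return INT16_MIN;   /* clamp to the lower bound, -1.0 */
    return (q15_t)sum;
}
```

In this encoding -0.75 is -24576, and adding it to itself clamps to -32768, which is -1.0. The two comparisons and the widening are exactly the kind of code a compiler has to recognize before it can use a single saturating instruction.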
Saturated arithmetic is important for signal processing applications because they often operate close to the boundaries of the arithmetic domain in order to get the best signal to noise ratio. This is unlike integer processing in C, which is usually considered "large enough" and needs bound checks only at specific places.
Spelling. The ISO report defines the "natural spelling" of the new keywords _Fract, _Accum and _Sat to be fract, accum and sat. This means that in typical programming the natural spelling is to be used, based on definitions (#define) in the system specific include file.
The unsigned type specifier (already existing in standard C) can also be applied to the fractional types, providing arithmetic domains starting from 0.0. The range of the fract type becomes [0.0, 1.0>.
Unsigned arithmetic is typically used in image processing applications, but it is not universally present on all DSP processors.
Arithmetic Operations. The arithmetic operations for fract and accum include all those defined for the int type, but exclude ~, &, | and ^.
Conversions. Within the fractional type hierarchy, the usual implicit conversions are defined. For example, the promotion of fract to unsigned fract or long fract is automatic; unsigned fract can be promoted to accum, with similar promotions for the long qualified variants.
Implicit conversions between fractional types and other types are fully defined. It is possible to write mixed type expressions. Conversions in mixed expressions are based on the rank order, which is int, fract, accum, float. Extensions were made to allow for true mixed type operators, mixing integers and fractional types. This makes an expression like 3 * 0.1r (where "r" denotes a fractional constant) meaningful. Under the usual (non-mixed) arithmetic rules, the value "3" has to be converted to a fractional value first which is out of range and would lead to a meaningless result. With the extended rules, the intended outcome of 0.3r is obtained.
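The difference between the two rule sets for 3 * 0.1r can be checked with a small emulation in standard C. This is a sketch under the assumption of a 16-bit fractional type with 15 fraction bits, in which 0.1 is approximated by the integer 3277; all names are hypothetical, and the division truncates toward zero, which may differ from the rounding of a real implementation.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* stand-in for fract, 15 fraction bits */

/* Clamp a wide intermediate result to the 16-bit fractional range. */
static q15_t q15_saturate(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Non-mixed rules: convert the integer to fract first, then multiply two fracts. */
static q15_t mul_convert_first(int n, q15_t f)
{
    q15_t nf = q15_saturate((int64_t)n * 32768);     /* 3 is out of range and clamps */
    return q15_saturate(((int64_t)nf * f) / 32768);  /* fract * fract, rescaled */
}

/* Mixed rules: scale the fractional value by the integer directly. */
static q15_t mul_mixed(int n, q15_t f)
{
    return q15_saturate((int64_t)n * f);
}
```

With f = 3277, mul_mixed(3, f) yields 9831, which is close to 0.3, while mul_convert_first(3, f) yields 3276, which is close to 0.1 because the operand 3 was clamped to just under 1.0 before the multiplication.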
Fixed Point Design Rationale. An alternative to the current choice in the fixed point design is to allow the programmer to specify exactly the number of relevant bits of the fixed point types, or even to allow the programmer to specify the number of bits for every fixed point variable. In this way, the implementation could guarantee the outcome of the computations.
Such a design would raise the abstraction level of the Embedded C language and increase the portability of code. However, it would also completely bypass the rationale of Embedded C, which is to provide a good match between the language and the performance increasing features of the processor.
Enforcing an implementation of Embedded C to implement, for example, a 40 bit accum type on a processor that offers only 24 bit accumulators, is extremely awkward and would be highly inefficient. In that case Embedded C would be unusable for its purpose, which is to provide the programmer with access to the high performance features of the processor.
Named Address Spaces. Embedded C uses address space qualifiers to identify specific address spaces in variable declarations. There are no predefined keywords for this, as the actual memory segmentation is left to the implementation.
As an example, assume that X and Y are memory qualifiers. The declaration:
X int a[25] ;
means that a is an array of 25 integers which is located in the X memory. Similarly (but less common):
X int * Y p ;
means that the pointer p is stored in the Y memory. This pointer points to integer data that is located in the X memory.
If no address qualifiers are used, the data is stored into unqualified memory that includes all address spaces. The unqualified memory abstraction is needed to keep the compatibility of the void * type, the NULL pointer and to avoid duplication of all library code that accesses memory through pointers that are passed as parameters.
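Declarations of this kind can be tried out on a host compiler by defining the qualifiers away. The sketch below does that; it assumes X and Y are the memory qualifiers from the example above, defined here as empty macros, and sum_x is a hypothetical helper. On a real Embedded C compiler the qualifiers would be supplied by the implementation and the macros would be dropped.

```c
/* Assumption: on a host compiler the qualifiers expand to nothing. */
#define X
#define Y

X int a[25];        /* array of 25 integers, placed in X memory on the target */
X int * Y p = a;    /* pointer stored in Y memory, pointing to data in X memory */

/* Sum n integers through a pointer into X memory. */
static int sum_x(const X int *src, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += src[i];
    return total;
}
```

A call such as sum_x(p, 25) reads the array through the qualified pointer. The source stays the same on host and target; only the definition of the qualifiers changes.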
I/O Hardware Addressing
The motivation to include primitives for I/O hardware addressing in Embedded C is to improve the portability of device driver code. In principle, a hardware device driver should only be concerned with the device itself. The driver operates on the device through device registers, which are device specific.
However, the method to access these registers can be very different on different systems, even though it is the same device that is connected. The I/O hardware access primitives aim to create a layer that abstracts the system specific access method from the device that is accessed. The ultimate goal is to allow source code portability of device drivers between different systems.
In the design of the I/O hardware addressing interface, three requirements needed to be fulfilled:
(1) The device driver source code must be portable.
(2) The interface must not prevent implementations from producing machine code that is as efficient as other methods.
(3) The design should permit encapsulation of the system dependent access method.
The design is based on a small collection of functions that are specified in an include file supplied by the implementation.
Accessing the Device
To access the device, Embedded C defines a set of access functions.
These interfaces provide read and write access to device registers, as well as typical methods for setting and resetting individual bits. Variants of these functions are defined (with buf appended to the names) to access arrays of registers. Variants are also defined (with l appended) to operate with long values.
All of these interfaces take an I/O register designator ioreg_designator as one of the arguments. These register designators are an abstraction of the real registers provided by the system implementation and hide the access method from the driver source code.
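As a rough illustration of driver code written against such an interface, the sketch below emulates it on a host machine. The access names iord, iowr, ioor, ioand and ioxor are assumed here to follow the ISO report; the register designators DEV_CTRL and DEV_STATUS, the simulated register array and the dev_ helper functions are hypothetical. A real implementation supplies its own designators and maps the accesses to whatever method the system requires.

```c
/* Simulated register file standing in for a real device (assumption). */
static unsigned int sim_regs[2];

/* Hypothetical I/O register designators; normally provided by the system. */
#define DEV_CTRL   0
#define DEV_STATUS 1

/* Host-side stand-ins for the access functions. */
#define iord(reg)        (sim_regs[(reg)])
#define iowr(reg, val)   ((void)(sim_regs[(reg)] = (val)))
#define ioor(reg, val)   ((void)(sim_regs[(reg)] |= (val)))
#define ioand(reg, val)  ((void)(sim_regs[(reg)] &= (val)))
#define ioxor(reg, val)  ((void)(sim_regs[(reg)] ^= (val)))

/* Driver code: it names registers and operations, never the access method. */
static void dev_enable(void)     { ioor(DEV_CTRL, 0x1u); }
static void dev_disable(void)    { ioand(DEV_CTRL, ~0x1u); }
static int  dev_is_enabled(void) { return (iord(DEV_CTRL) & 0x1u) != 0; }
```

Porting the three driver functions to another system would then only require a different definition of the designators and access macros, not a change to the driver itself.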
Managing I/O Register Designators
Three functions are defined for managing the I/O register designators. Although these are abstract entities for the device driver, the driver does have the obligation to initialize and release the access methods. Note that these functions do not access, or initialize, the device itself since that is the task of the driver. They allow, for example, the operating system to provide a memory mapping of the device in the user address space.
The iogrp_designator specifies a logical group of I/O register designators; typically this will be all the registers of one device. Like the I/O register designator, the I/O group designator is an identifier or macro that is provided by the system implementation.
The map variant allows cloning of an access method when one device driver is to be used to access multiple identical devices.
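The cloning of an access method can be sketched in the same host-side style. The function names iogroup_acquire, iogroup_release and iogroup_map are assumed here to follow the ISO report; the group designators, the table of base pointers and dev_write_data are hypothetical, and the two arrays merely simulate two identical devices.

```c
#include <stddef.h>

/* Two simulated, identical devices with two registers each (assumption). */
static unsigned int dev0_regs[2], dev1_regs[2];

/* Hypothetical I/O group designators. */
enum { DEV_GROUP, DEV0_DIRECT, DEV1_DIRECT, GROUP_COUNT };

/* Access method per group: here simply a base pointer. */
static unsigned int *group_base[GROUP_COUNT] = { NULL, dev0_regs, dev1_regs };

/* Acquire and release have nothing to set up in this simulation. */
static void iogroup_acquire(int grp) { (void)grp; }
static void iogroup_release(int grp) { (void)grp; }

/* Clone the access method of group src into group dst. */
static void iogroup_map(int dst, int src) { group_base[dst] = group_base[src]; }

#define DEV_DATA 0  /* hypothetical register designator within DEV_GROUP */

/* Driver code written once against DEV_GROUP. */
static void dev_write_data(unsigned int v) { group_base[DEV_GROUP][DEV_DATA] = v; }
```

After iogroup_map(DEV_GROUP, DEV0_DIRECT) the driver writes to the first device; after iogroup_map(DEV_GROUP, DEV1_DIRECT) the same driver code writes to the second.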
C++ compatibility was another topic of debate in the design of Embedded C. Preferably the extension should be expressed in such a way that it can be implemented as C++ classes. It implies that the extension should not depend on the use of type specifiers and qualifiers.
This, however, is against the "spirit of C" and leads to a long list of types that result when all combinations of fractional types and type specifiers are expanded. While this would be feasible for the fractional types, it would still not provide a solution for the named address space qualifiers.
Hence the current design, which follows standard C practice. If code must be written that is to be accepted by both a C and a C++ compiler, one needs to define macros such as unsigned_long_accum, which should then expand into the right pattern for the specific compiler. For named address spaces there is no similar solution yet because these do not commonly appear in C++ compilers.
Features That Did Not Make It Into Embedded C
The current specification of Embedded C is not the final station. Embedded C is defined to be a common playground for extensions to the C language required by typical application and system areas. Although there were more proposals for extensions in Embedded C, the committee decided to include only those that have shown a certain level of maturity.
Circular Buffers. A Circular Buffer is a memory addressing feature that implements wrap around access to arrays. Circular buffers are typically used to efficiently model sliding windows over streamed input data. Instead of shifting the array for every new data element in the window, the window simply wraps around at the end of the array. Hardware support for this operation avoids control flow operations that are normally required to implement such wrap around.
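Without hardware support, the wrap around has to be written out. The following sketch shows a software sliding window in standard C; the type ring_t, the window size and both functions are hypothetical and only illustrate the control flow that a hardware circular buffer makes unnecessary.

```c
#include <stdint.h>

#define WINDOW 4u  /* number of samples in the sliding window (assumption) */

typedef struct {
    int16_t  data[WINDOW];
    unsigned head;          /* index of the oldest sample, overwritten next */
} ring_t;

/* Insert a new sample, overwriting the oldest one. */
static void ring_push(ring_t *r, int16_t sample)
{
    r->data[r->head] = sample;
    r->head++;
    if (r->head == WINDOW)  /* explicit wrap around at the end of the array */
        r->head = 0;
}

/* Return the k-th most recent sample; k = 0 is the newest, k must be < WINDOW. */
static int16_t ring_recent(const ring_t *r, unsigned k)
{
    return r->data[(r->head + WINDOW - 1u - k) % WINDOW];
}
```

Every push carries a compare and a conditional reset, and every read carries a modulo operation. These are the operations that hardware circular addressing performs as part of the address computation.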
Although this feature was already (successfully) present in DSP-C, the committee found that there were too many different implementations of circular buffer support in DSP processors to be able to define a single unified specification in C that can map efficiently to all current implementations.
Fractional Complex Data Types. Complex data types are defined in the current ISO C standard (also known as C99). These are defined for floating point types. Although an extension to the fractional types is logical, this was not incorporated in the current report yet.
Binary Coded Decimal Types. BCD types were briefly considered in the discussions on Embedded C. The area of applications for these types is so diverse (also beyond embedded processing) that they were dropped. The current report deals with binary types only.
Modulo Wrapping. An alternative method of handling overflow named _Modwrap was considered until late in the design of Embedded C. It would provide an alternative qualifier for overflow rounding next to sat, which should round fixed point values modulo 1 in case of overflow. It was dropped mainly because the committee did not want to clutter the type system with another qualifier.
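The difference between the two overflow behaviours is easy to see for an unsigned fractional value. The sketch below assumes a 16-bit unsigned type with 16 fraction bits and a range of [0.0, 1.0>; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef uint16_t uq16_t;  /* stand-in for unsigned fract: 16 fraction bits */

/* Modulo wrapping: overflow discards the carry, so the result is taken modulo 1. */
static uq16_t uq16_add_wrap(uq16_t a, uq16_t b)
{
    return (uq16_t)((uint32_t)a + (uint32_t)b);
}

/* Saturation: overflow clamps to the largest representable value. */
static uq16_t uq16_add_sat(uq16_t a, uq16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return sum > UINT16_MAX ? (uq16_t)UINT16_MAX : (uq16_t)sum;
}
```

In this encoding 0.75 is 49152. Adding it to itself wraps to 32768, which is 0.5, while the saturating version clamps to 65535, just below 1.0.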
Example: FIR filter
The next figure shows an example of a FIR filter in Embedded C. It uses many of the fixed point and address space features of Embedded C.
Good Embedded C compilers translate the loop into a single cycle MAC-with-post-increment instruction with zero-overhead loop control, just like an assembly programmer would.
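For readers without an Embedded C compiler at hand, a FIR filter of this kind can be emulated in standard C. The sketch below is not the Embedded C version; it assumes 16-bit fractional data with 15 fraction bits and a 64-bit accumulator, and the names q15_t, acc_t and fir are hypothetical. In Embedded C the coefficient and sample arrays would carry fractional types and address space qualifiers, and the final rescaling and saturation would follow from the types instead of being written out.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* stand-in for fract, 15 fraction bits */
typedef int64_t acc_t;  /* stand-in for a wide accumulator register */

/* FIR filter: multiply-accumulate over all taps, then rescale and saturate. */
static q15_t fir(const q15_t *coeff, const q15_t *samples, int taps)
{
    acc_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int32_t)coeff[i] * samples[i];  /* one MAC per tap, 30 fraction bits */

    acc_t result = acc / 32768;  /* back to 15 fraction bits, truncating toward zero */
    if (result > INT16_MAX) return INT16_MAX;   /* explicit saturation */
    if (result < INT16_MIN) return INT16_MIN;
    return (q15_t)result;
}
```

With coefficients 0.5 and 0.5 and samples 0.25 and 0.75, the filter returns 16384, which is 0.5. The widening cast, the division and the two comparisons are the parts that the fixed point types of Embedded C make implicit.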
Individual Embedded C features:
Three case studies are reported here: measurements of individual features, NEC compiler results and a rewritten loop from the MiBench benchmark. The measurements are done using two CoSy based compilers for DSP-C. The results directly translate to Embedded C performance.
Table 1 above shows results for three experiments. Each set of results reports the number of instructions in the inner loop of the algorithm and the size (in bytes) of a whole procedure implementing the operation. Results marked * are optimal; they cannot be improved by assembly hand-coding.
Saturation is very expensive when implemented in ISO C because of the conditional branching and large constants involved. In DSP-C, the compiler can use the implicit saturation implemented by the hardware.
The inner loop of the FIR filter, as shown above, is translated to a single instruction when written in DSP-C. In ISO C additional shifting is needed inside the inner loop, causing the algorithm to run at half the speed. The explicit saturation in ISO C is outside the loop, thus impacting mostly the code size.
The array copy example uses address qualifiers to allocate the input and output arrays in different address spaces. Because of this the compiler is able to compute that there are no data dependencies in the loop and as a result it can use software pipelining to produce a single instruction inner loop. In the ISO C version the read and write operations must be strictly sequential because of possible overlap in the arrays. Again, this makes the code twice as slow.
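The shape of such an array copy can be sketched as follows. This is a host-side illustration in which the qualifiers X and Y are assumed to be address space qualifiers and are defined as empty macros; copy_x_to_y is a hypothetical name. On a host compiler the qualifiers carry no information, so the sketch shows the source form only, not the optimization.

```c
/* Assumption: on a host compiler the qualifiers expand to nothing. */
#define X
#define Y

/* Copy n integers from X memory to Y memory. */
static void copy_x_to_y(Y int *dst, const X int *src, int n)
{
    /* On a target with separate X and Y memories, dst and src cannot overlap,
       so each read is independent of the preceding writes. */
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}
```

A compiler that knows the two pointers refer to different address spaces is free to overlap the read of one element with the write of the previous one, which is what makes the single instruction inner loop possible.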
Results Reported by NEC
Table 2 below shows results reported by NEC for the CoSy based µPD77016 Compiler.
Four applications are reported in this table. For each the total number of clock cycles for running the application is given and the size of the application in bytes.
The first application is a control application. It is included to show the difference with signal processing applications and demonstrates that there are few places for DSP-C to be used in control code.
The other three applications are signal processing codes. They show that a code size improvement of up to a factor of 4 can be achieved by employing DSP-C. Even more impressive are the speed improvements reported for the second and fourth applications. They show almost a factor of 10 performance increase!
This factor 10 raises some questions, actually. In the previous section on individual features it was shown that typical improvements of a factor 2 are reasonable to expect, unless the code is full of saturations. The next experiment is designed to shed some more light on this.
Rewriting a loop from MiBench
For the next experiment a procedure from the MiBench embedded processor benchmarks was taken. The selected code, written in ISO C, is from the telecomm/gsm directory, procedure Short_term_analysis_filtering. This procedure uses a number of macros and functions to implement fixed point arithmetic. The inner loop does two memory read operations and one memory write, as well as two fixed point multiplications and two fixed point additions.
The compiler used is the same as that used earlier. The target architecture is a 16-bit VLIW DSP that can do two memory operations and a dual MAC operation in a single cycle. The compiler translates the ISO C code by inlining all function calls in the code, thus obtaining maximal speed.
When rewriting the code in DSP-C, no more function calls remain. All operations can be expressed in DSP-C. This makes an enormous difference not only for the compiler but also for the programmer: the clarity of the code is improved tremendously. While the original is tricky to understand because of the emulation of fixed point, the DSP-C version is absolutely clear in its intent. The results are shown in Table 3, below.
The columns ISO C and DSP-C show the huge impact of rewriting the code into DSP-C. For cycles, the number of clock cycles in the inner loop is reported. A stunning factor of 8 in performance improvement is reached. Examining the assembly code shows that this can be improved further.
Applying knowledge of implicit type conversions, two conversions to the accum type can be placed in the source code, forcing the compiler to use the efficient multiply-accumulate unit. This achieves the result in the last column, which is indeed optimal.
So what is the reason that such huge improvements can be achieved? Two factors contribute.
First, in the original ISO C code fixed point arithmetic was encoded using macros. This makes the code relatively easy to maintain but is not necessarily optimal. Careful study and optimization of the code makes it possible to postpone saturation, to temporarily skip shifts and move some expensive corrections out of the inner loop. Even in ISO C, a lot of improvements are possible in this way. The end-result would, however, be very complicated code that is extremely difficult to maintain and debug.
Second, the fixed point emulation in ISO C uses more variables. The architecture that was used has a rather small register file, just like most DSPs. For the given code the compiler runs out of registers and starts to generate spill code. Luckily for this architecture spilling and restoring is rather efficient, costing only a single clock cycle, otherwise the results would have been even worse.
These figures are a proof of concept for Embedded C. They show that for real world applications Embedded C enables the use of the high speed extensions that are found in modern DSP processors. It is possible to effectively program DSP processors in C!
Embedded C is a relatively small extension to the C language, but its impact on the programmability of embedded and DSP processors in particular is enormous. Specialized high performance features are the reasons why DSP processors exist. Without the Embedded C extension these features are inaccessible to the high level language application programmer.
Today it is still common to program these processors in assembly language. The industry cannot afford the increasing time to market that assembly programming of ever more complicated applications implies. Moreover, the inherent dependency on a specific processor and a limited number of highly specialized assembly programmers incurs great risk and paralyzes future developments.
Embedded C offers a practical solution with proven results. It is being adopted by more and more compiler developers. With the ratification of the ISO technical report, Embedded C is the standard solution for high level language programming of the many billions of embedded processors out in the field.
Marcel Beemster is Senior Software Engineer, Hans van Someren is Principal Architect, and Willem Wakker is Director of ACE Consulting at Associated Compiler Experts.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.
ACE. DSP-C, an extension to ISO/IEC IS 9899:1990. Technical Report CoSy-8025P-dsp-c, ACE Associated Compiler Experts bv, Amsterdam, The Netherlands, 1998. Downloadable from http://www.dsp-c.org.
JTC1/SC22/WG14. Extensions for the programming language C to support embedded processors. Technical report, ISO/IEC, 2003. Downloadable from http://www.dkuug.dk/JTC1/SC22/WG14.