Using Embedded-C for high performance DSP programming - Embedded.com

High-level language programming has long been used for embedded system development. However, assembly programming still prevails, particularly for DSP-based systems. DSP processors are often programmed in assembly language by programmers who know the processor architecture inside out. The key motivation for this practice is performance, despite the many disadvantages assembly programming has compared with high-level language programming.

Performance is key to signal processing applications because it translates directly into end-user features. A 10% lower clock speed may, for example, result in a 20% longer battery life. If video decoding takes 80% of the CPU-cycle budget instead of 90%, there are twice as many cycles available for audio processing. This coupling of performance to end-user features is characteristic of many of the real-time applications in which DSP processors are applied.

The fixed point and named address space extensions of Embedded C give the programmer direct access to the corresponding hardware features of the target processor, thus significantly improving the performance of applications. The hardware I/O extension is a portability feature of Embedded C. Its goal is to allow easy porting of device driver code between systems.

Typical DSP Architectures
A look at the typical architecture of DSP processors is required to understand the need for an extension to C.

Figure 1

Figure 1, above, shows an example DSP architecture. It has a highly specialized data path that is optimized for fixed point convolution operations. There are two memories that feed directly into the arithmetic unit.

Working efficiently requires that data be allocated in the appropriate memory. The second important feature is the arithmetic unit itself. It can do a multiply-accumulate operation in one action, and it has a local wide data path and an accumulator register to store an intermediate result. Additionally, it has shifters to operate in a fixed point mode and can saturate overflowing results.

Local memory, fixed point arithmetic, saturation and the local accumulator register are not efficiently addressed when using standard C.

Changing Requirements and the Role of the Compiler
DSP architectures are not easy to program optimally, either by hand or with a compiler. Manual assembly programming is awkward because of the non-orthogonality of the architecture and the arbitrary restrictions that can be in place. Modern compilers can deal with non-orthogonality reasonably well, but are not good at exploiting the special features that DSP processors provide.

Many embedded applications, mobile phones for example, are implemented using two main processors. One processor is a low-power RISC processor that takes care of all control processing, user interaction and display management. It is programmed in a high-level language using a software development kit that includes a compiler. The other processor is a DSP, which takes care of all of the signal processing. The signal processing algorithms are typically hand coded in assembly.

Changes in technological and economic requirements make it more and more expensive to continue programming the DSP processor in assembly. Staying with the mobile phone as an example, the required signal processing algorithms are becoming increasingly complex. Features such as stronger error correction and encryption must be added. Communication protocols are becoming more sophisticated and require much more code to implement. In certain market areas, multiple protocol stacks are implemented to be compatible with multiple service providers. In addition, backward compatibility with older protocols is needed to stay synchronized with provider networks that are slow to upgrade.

On the economic side, time to market for new technology puts increasing pressure on design time. In 2006, the number of mobile phones expected to be sold worldwide is on the order of 900 million. In the western world, the time to replacement for mobile phones is between 1 and 2 years, and is driven by new features and fashion. Staying ahead in this market requires extremely fast and streamlined design projects.

Assembly programming has no place in this world. Assembly programming is error prone and slow. Assembly programs are difficult to maintain and make a company dependent on a few specialists. By definition, assembly programs are non-portable. Legacy code makes it extremely expensive to switch to a new technology. These dependencies make a company vulnerable to changes in its workforce and its supply chain.

The Programming Mismatch
Today, most embedded processors are offered with a C compiler. Despite this, programming DSP processors is still done in assembly for the signal processing parts or, at best, by using assembly-written libraries supplied by the manufacturer.

The key reason for this is that although the architecture is well matched to the requirements of the signal processing application, there is no way to express the algorithms efficiently and in a natural way in standard C.

Saturated arithmetic, for example, is required in many algorithms and supplied as a primitive in many DSP processors. However, there is no such primitive in standard C.

To express saturated arithmetic in C requires comparisons, conditional statements and correcting assignments. Instead of using a language primitive, the operation is spread over a number of statements that are difficult for a compiler to recognize as a single instruction.
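As an illustration of that pattern, the following is a minimal sketch of a 16-bit saturating addition in ISO C. The function name sat_add16 and the choice of a 16-bit operand width are assumptions made for this example; they do not come from the article.

```c
#include <stdint.h>

/* Hypothetical helper: 16-bit signed saturating add in plain ISO C.
 * The sum is formed in a wider type, compared against both bounds,
 * and corrected by assignment. A compiler has to recognise this whole
 * pattern before it can emit a single saturating-add instruction. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* cannot overflow in 32 bits */

    if (sum > INT16_MAX) {
        sum = INT16_MAX;                    /* correct to the upper bound */
    } else if (sum < INT16_MIN) {
        sum = INT16_MIN;                    /* correct to the lower bound */
    }
    return (int16_t)sum;
}
```

A call such as sat_add16(30000, 10000) yields 32767 rather than wrapping around to a negative value.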

Enter Embedded C
Embedded C is designed to bridge the performance mismatch between the signal processing algorithms, standard C and the architecture. It extends the C language with the primitives that are needed by signal processing applications and that are commonly provided by DSP processors.

The design of Embedded C is based on DSP-C by ACE. DSP-C [1] is an industry-designed extension of C with which various DSP processor manufacturers have gained experience in their compilers since 1998.

Embedded C makes life easier for the application programmer. The primitives provided by Embedded C are the primitives that fit the conceptual model of the application. The Embedded C extension to C unlocks the high performance features of DSP processors for C programmers. Assembly programming is no longer required for a vast body of performance critical code. Maintainability and portability of code are the key winners in this process.

The performance improving features introduced by Embedded C are fixed point and saturated arithmetic and named address spaces. For the details and language definition of Embedded C, see [2].

Arithmetic. Embedded C adds two new primitive types, _Fract and _Accum, and one type qualifier named _Sat. The underscores are included in these new keywords to ensure compatibility with existing applications.

The _Fract type offers fractional (also known as fixed point) data types that have a value range of [-1.0, +1.0> (-1.0 included but not +1.0). This is conveniently implemented using the two's complement arithmetic typically used for integer arithmetic. In two's complement notation, the binary point of the fixed point value is imagined right after the sign bit, before all value bits. The first value bit represents 0.5, the second 0.25, etc. A fixed point number has no integer part.
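The bit layout described above can be sketched in ISO C by emulating a fractional value in a 16-bit integer (often called Q15). The width of 16 bits, the type name q15_t and both function names are assumptions for this example only; Embedded C itself does not fix the width of _Fract.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* hypothetical stand-in for a 16-bit fract */

/* Convert a real value to Q15 by scaling with 2^15, rounding to the
 * nearest step and clamping to the representable range [-1.0, +1.0>. */
q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;

    if (scaled >= 32767.0)  return INT16_MAX;   /* just below +1.0 */
    if (scaled <= -32768.0) return INT16_MIN;   /* exactly -1.0    */
    return (q15_t)(scaled < 0.0 ? scaled - 0.5 : scaled + 0.5);
}

/* Interpret the stored bits as a fraction again. */
double q15_to_double(q15_t q)
{
    return (double)q / 32768.0;
}
```

In this emulation 0.5 is stored as 0x4000 (only the first value bit set) and 0.25 as 0x2000 (only the second value bit set), while +1.0 itself is not representable and is clamped to the largest value below it.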

The Embedded C language does not specify the exact accuracy of the fixed point types, although it defines a minimum accuracy with which an implementation must comply. The _Fract type can be qualified with the existing type specifiers short and long to define three different fractional types. The range of these types is the same, [-1.0, +1.0>, but the accuracy should stay equal or improve when moving from short _Fract to _Fract to long _Fract.

The _Accum type is also a fractional type and can also be combined with short and long. The three resulting _Accum types must match the three _Fract types in terms of accuracy, that is, the number of bits in the fraction. Additionally, the _Accum types have an integer part in their value.

So, the range of an _Accum value may be, for example, [-256.0, +256.0>. Again, the number of integer bits is not specified in the Embedded C definition. The accumulator types match the accumulator registers of typical DSP processors. The aim of these registers is to hold intermediate arithmetic results without having to worry about overflow.

The _Sat qualifier can be applied to fractional types. It causes all operations with operands of _Sat-qualified type to be saturated. It does not change the storage representation. Saturation means that if overflow occurs in an operation, the result is set to the upper bound or lower bound of the type. For example, computing -0.75 + -0.75 results in -1.0 under saturated fixed point arithmetic.

Saturated arithmetic is important for signal processing applications because they often operate close to the boundaries of the arithmetic domain in order to get the best signal to noise ratio. This is unlike integer processing in C, where the range is usually considered “large enough” and bound checks are needed only at specific places.

Natural Spelling. The ISO report defines the “natural spelling” of the new keywords _Fract, _Accum and _Sat to be fract, accum and sat. This means that in typical programming the natural spelling is to be used, based on definitions (#define) in the system specific include file that map the natural spelling into the official identifiers. The official identifiers are defined with underscores to avoid name clashes in existing programs. This is similar to the way _Complex is introduced in C99.

The unsigned type specifier (already existing in standard C) can also be applied to the fractional types, providing arithmetic domains starting from 0.0. The range of the fract type becomes [0.0, 1.0>.

Unsigned arithmetic is typically used in image processing applications, but it is not universally present on all DSP processors.

Arithmetic Operations. The arithmetic operations for fract and accum include all those defined for the int type, but exclude ~, &, | and ^.

Conversions. Within the fractional type hierarchy, the usual implicit conversions are defined. For example, the promotion of fract to unsigned fract or long fract is automatic; unsigned fract can be promoted to accum, with similar promotions for the long qualified variants.

Implicit conversions between fractional types and other types are fully defined. It is possible to write mixed type expressions. Conversions in mixed expressions are based on the rank order, which is int, fract, accum, float. Extensions were made to allow for true mixed type operators, mixing integers and fractional types. This makes an expression like 3 * 0.1r (where “r” denotes a fractional constant) meaningful. Under the usual (non-mixed) arithmetic rules, the value “3” has to be converted to a fractional value first, which is out of range and would lead to a meaningless result. With the extended rules, the intended outcome of 0.3r is obtained.
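The difference between the two rules can be sketched with a Q15 emulation in ISO C. All names here (q15_t, q15_saturate, mul_via_fract, mul_mixed) are hypothetical, the 16-bit width is an assumption, and the out-of-range conversion of 3 is modeled as saturating, which is only one possible outcome.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* hypothetical stand-in for a 16-bit fract */

/* Clamp a wide intermediate value to the Q15 range. */
static q15_t q15_saturate(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Non-mixed route: convert the integer to a fract first, then do a
 * fract-by-fract multiply. For n = 3 the conversion is out of range. */
q15_t mul_via_fract(int n, q15_t f)
{
    q15_t nf = q15_saturate((int64_t)n * 32768);   /* 3 clamps to ~1.0 */
    return q15_saturate(((int64_t)nf * f) / 32768); /* Q15 x Q15 -> Q15 */
}

/* Mixed route: multiply the integer and the fract directly, so the
 * integer never has to be represented as a fract. */
q15_t mul_mixed(int n, q15_t f)
{
    return q15_saturate((int64_t)n * f);
}
```

With 0.1 stored as 3277, mul_mixed(3, 3277) gives 9831 (about 0.3), while mul_via_fract(3, 3277) gives 3276 (about 0.1), which is not the intended product.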

Fixed Point Design Rationale. An alternative to the current choice in the fixed point design is to allow the programmer to specify exactly the number of relevant bits of the fixed point types, or even to allow the programmer to specify the number of bits for every fixed point variable. In this way, the implementation could guarantee the outcome of the computations.

Such a design would raise the abstraction level of the Embedded C language and increase the portability of code. However, it would also completely bypass the rationale of Embedded C, which is to provide a good match between the language and the performance increasing features of the processor.

Forcing an implementation of Embedded C to implement, for example, a 40 bit accum type on a processor that offers only 24 bit accumulators is extremely awkward and would be highly inefficient. In that case Embedded C would be unusable for its purpose, which is to provide the programmer with access to the high performance features of the processor.

Named Address Spaces. Embedded C uses address space qualifiers to identify specific address spaces in variable declarations. There are no predefined keywords for this, as the actual memory segmentation is left to the implementation. As an example, assume that X and Y are memory qualifiers. The definition:

X int a[25] ;

means that a is an array of 25 integers which is located in the X memory.

Similarly (but less common):

X int * Y p ;

means that the pointer p is stored in the Y memory. This pointer points to integer data that is located in the X memory.

If no address qualifiers are used, the data is stored in unqualified memory, which includes all address spaces. The unqualified memory abstraction is needed to keep the void * type and the NULL pointer compatible, and to avoid duplicating all library code that accesses memory through pointers that are passed as parameters.
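A small sketch of how such qualifiers might appear in a function interface follows. Because X and Y are implementation-defined, they are defined here as empty macros so that the code also compiles with an ordinary ISO C compiler; on a real Embedded C implementation the compiler would supply them. The names src, dst and copy_x_to_y are hypothetical.

```c
#include <stddef.h>

/* Hypothetical qualifiers: empty on a host compiler, so this sketch
 * only documents the intended placement of the data. */
#ifndef X
#define X
#endif
#ifndef Y
#define Y
#endif

X int src[25];   /* intended for X memory */
Y int dst[25];   /* intended for Y memory */

/* The qualified pointer types state that input and output live in
 * different address spaces. */
void copy_x_to_y(Y int *out, const X int *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}
```

On a host compiler the macros vanish and this is an ordinary array copy; the point of the sketch is only where the qualifiers are written.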

I/O Hardware Addressing
The motivation for including primitives for I/O hardware addressing in Embedded C is to improve the portability of device driver code. In principle, a hardware device driver should only be concerned with the device itself. The driver operates on the device through device registers, which are device specific.

However, the method used to access these registers can be very different on different systems, even though it is the same device that is connected. The I/O hardware access primitives aim to create a layer that abstracts the system specific access method from the device that is accessed. The ultimate goal is to allow source code portability of device drivers between different systems.

In the design of the I/O hardware addressing interface, three requirements needed to be fulfilled:

 (1) The device driver source code must be portable.
 (2) The interface must not prevent implementations from producing machine code that is as efficient as other methods.
 (3) The design should permit encapsulation of the system dependent access method.

The design is based on a small collection of functions that are specified in the include file. These interfaces are divided into two groups: one group provides access to the device, and the second group maintains the access method abstraction itself.

Accessing the Device
To access the device, the following functions are defined by Embedded C:


These interfaces provide read and write access to device registers, as well as typical methods for setting and resetting individual bits. Variants of these functions are defined (with buf appended to the names) to access arrays of registers. Variants are also defined (with l appended) to operate with long values.

All of these interfaces take an I/O register designator ioreg_designator as one of the arguments. These register designators are an abstraction of the real registers provided by the system implementation and hide the access method from the driver source code.

Managing I/O Register Designators
Three functions are defined for managing the I/O register designators. Although these are abstract entities for the device driver, the driver does have the obligation to initialize and release the access methods. Note that these functions do not access, or initialize, the device itself since that is the task of the driver. They allow, for example, the operating system to provide a memory mapping of the device in the user address space.


The iogrp_designator specifies a logical group of I/O register designators; typically this will be all the registers of one device. Like the I/O register designator, the I/O group designator is an identifier or macro that is provided by the system implementation.

The map variant allows cloning of an access method when one device driver is to be used to access multiple identical devices.
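The layering idea can be sketched in ISO C with a simulated device. Every name below (the sim_ functions, DEV_STATUS, DEV_CTRL, the dev_ driver functions) is hypothetical and chosen for this example; none of them is the Embedded C interface itself, and the "device" is just an array in ordinary memory.

```c
#include <stdint.h>
#include <stddef.h>

/* System side: a simulated device and the access method that reaches it. */
static volatile uint16_t sim_device_memory[2];  /* stands in for hardware */
static volatile uint16_t *sim_base = NULL;      /* access method state    */

/* Hypothetical register designators, here simple indices. */
enum { DEV_STATUS = 0, DEV_CTRL = 1 };

/* Managing the access method: set it up and tear it down. */
void sim_group_acquire(void) { sim_base = sim_device_memory; }
void sim_group_release(void) { sim_base = NULL; }

/* Register access: read, write, set bits, mask bits. */
unsigned sim_rd(int reg)             { return sim_base[reg]; }
void sim_wr(int reg, unsigned v)     { sim_base[reg] = (uint16_t)v; }
void sim_or(int reg, unsigned mask)  { sim_base[reg] = (uint16_t)(sim_base[reg] | mask); }
void sim_and(int reg, unsigned mask) { sim_base[reg] = (uint16_t)(sim_base[reg] & mask); }

/* Driver side: knows register meanings, not how they are reached. */
void dev_enable(void)     { sim_or(DEV_CTRL, 0x0001u); }
void dev_disable(void)    { sim_and(DEV_CTRL, ~0x0001u); }
int  dev_is_enabled(void) { return (sim_rd(DEV_CTRL) & 0x0001u) != 0; }
```

In this arrangement only the sim_ layer would need to change if the registers were reached through port I/O or a different mapping; the dev_ functions would stay the same.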

C++ Compatibility
C++ compatibility was another topic of debate in the design of Embedded C. Preferably, the extension should be expressed in such a way that it can be implemented as C++ classes. This implies that the extension should not depend on the use of type specifiers and qualifiers.

This, however, is against the “spirit of C” and leads to a long list of types that results when all combinations of fractional types and type specifiers are expanded. While this would be feasible for the fractional types, it would still not provide a solution for the named address space qualifiers.

Hence the current design, which follows standard C practice. If code must be written that is to be accepted by both a C and a C++ compiler, one needs to define macros such as unsigned_long_accum, which should then expand into the right pattern for the specific compiler. For named address spaces there is no similar solution yet because these do not commonly appear in C++ compilers.

Features That Did Not Make It Into Embedded C
The current specification of Embedded C is not the final station. Embedded C is defined to be a common playground for extensions to the C language required by typical application and system areas. Although there were more proposals for extensions in Embedded C, the committee decided to include only those that have shown a certain level of maturity.

Circular Buffers. A circular buffer is a memory addressing feature that implements wrap around access to arrays. Circular buffers are typically used to efficiently model sliding windows over streamed input data. Instead of shifting the array for every new data element in the window, the window simply wraps around at the end of the array. Hardware support for this operation avoids the control flow operations that are normally required to implement such wrap around.

Although this feature was already (successfully) present in DSP-C, the committee found that there were too many different implementations of circular buffer support in DSP processors to be able to define a single unified specification in C that can map efficiently to all current implementations.
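For reference, a software circular buffer in ISO C might look like the sketch below, which shows the explicit wrap-around test that hardware support would make unnecessary. The window length and the names window_t, window_push and window_get are assumptions for this example.

```c
#include <stdint.h>

#define WINDOW_LEN 4u   /* hypothetical window size */

typedef struct {
    int16_t  data[WINDOW_LEN];
    unsigned head;      /* index of the next slot to overwrite */
} window_t;

/* Store a new sample over the oldest one instead of shifting the array. */
void window_push(window_t *w, int16_t sample)
{
    w->data[w->head] = sample;
    w->head++;
    if (w->head == WINDOW_LEN) {   /* the wrap-around control flow */
        w->head = 0;
    }
}

/* Read a sample by age: 0 is the newest, WINDOW_LEN - 1 the oldest. */
int16_t window_get(const window_t *w, unsigned age)
{
    unsigned idx = (w->head + WINDOW_LEN - 1u - (age % WINDOW_LEN)) % WINDOW_LEN;
    return w->data[idx];
}
```

After pushing the samples 1 to 5 into an empty window, age 0 returns 5 and age 3 returns 2; the sample 1 has been overwritten.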

Fractional Complex Data Types. Complex data types are defined in the current ISO C standard (also known as C99). These are defined for floating point types. Although an extension to the fractional types is logical, it has not yet been incorporated in the current report.

Binary Coded Decimal Types. BCD types were briefly considered in the discussions on Embedded C. The area of applications for these types is so diverse (also beyond embedded processing) that they were dropped. The current report deals with binary types only.

Modulo Wrapping. An alternative method of handling overflow, named _Modwrap, was considered until late in the design of Embedded C. It would provide an alternative qualifier for overflow handling next to sat, which would wrap fixed point values modulo 1 in case of overflow. It was dropped mainly because the committee did not want to clutter the type system with another qualifier.
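The contrast between the two overflow behaviors can be sketched with a Q15 emulation in ISO C. The wrapping shown here is plain two's complement wrap-around of a 16-bit value, which is one plausible reading of the dropped proposal and not a definition taken from it; the names q15_t, q15_add_sat and q15_add_wrap are hypothetical.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* hypothetical stand-in for a 16-bit fract */

/* Saturating add: an overflowing result sticks at the nearest bound. */
q15_t q15_add_sat(q15_t a, q15_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;

    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (q15_t)s;
}

/* Wrapping add: an overflowing result wraps around the 16-bit range. */
q15_t q15_add_wrap(q15_t a, q15_t b)
{
    /* Unsigned arithmetic is defined to wrap modulo 2^16. */
    uint16_t u = (uint16_t)((uint16_t)a + (uint16_t)b);

    /* Map back to signed without an implementation-defined conversion. */
    return (u >= 0x8000u) ? (q15_t)((int32_t)u - 65536) : (q15_t)u;
}
```

With -0.75 stored as -24576, the saturating add of -0.75 and -0.75 gives -32768 (-1.0), while the wrapping add gives 16384 (+0.5), a result with the wrong sign.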

Example: FIR filter
The next figure shows an example of a FIR filter in Embedded C. It uses many of the fixed point and address space features of Embedded C.

Good Embedded C compilers translate the loop into a single cycle MAC-with-post-increment instruction with zero-overhead loop control, just like an assembly programmer would.
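For comparison, an ISO C emulation of such a filter is sketched below. It is not the Embedded C code from the article's figure: the Q15 format, the 64-bit accumulator and the names q15_t and fir_q15 are assumptions made so that the example compiles with an ordinary C compiler.

```c
#include <stdint.h>
#include <stddef.h>

typedef int16_t q15_t;  /* hypothetical stand-in for a 16-bit fract */

/* Compute sum(coeff[k] * x[k]) for k = 0 .. taps-1 in emulated Q15. */
q15_t fir_q15(const q15_t *coeff, const q15_t *x, size_t taps)
{
    int64_t acc = 0;                        /* wide accumulator, Q30 scale */

    for (size_t k = 0; k < taps; k++) {
        acc += (int32_t)coeff[k] * x[k];    /* Q15 x Q15 -> Q30 product */
    }

    acc /= 32768;                           /* scale back to Q15 */
    if (acc > INT16_MAX) acc = INT16_MAX;   /* explicit saturation */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (q15_t)acc;
}
```

With coefficients 0.5 and 0.25 (16384 and 8192) and two input samples of 0.5 (16384), the result is 12288, that is 0.375. The scaling and the saturation are written out by hand here, which is the bookkeeping that fixed point types with saturation are meant to absorb.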

Individual Embedded C features:case studies
Three case studies are reported here: measurements on individual features, NEC compiler results and a rewritten loop from the MiBench benchmark. The measurements are done using two CoSy based compilers for DSP-C. The results translate directly to Embedded C performance.

Table 1

Table 1 above shows results for three experiments. Each set of results reports the number of instructions in the inner loop of the algorithm and the size (in bytes) of a whole procedure implementing the operation. Results marked * are optimal: they cannot be improved by hand-coding in assembly.

Saturation is very expensive when implemented in ISO C because of the conditional branching and large constants involved. In DSP-C, the compiler can use the implicit saturation implemented by the hardware.

The inner loop of the FIR filter, as shown above, is translated to a single instruction when written in DSP-C. In ISO C additional shifting is needed inside the inner loop, causing the algorithm to run at half the speed. The explicit saturation in ISO C is outside the loop, thus mostly impacting the code size.

The array copy example uses address qualifiers to allocate the input and output arrays in different address spaces. Because of this, the compiler is able to determine that there are no data dependencies in the loop, and as a result it can use software pipelining to produce a single instruction inner loop. In the ISO C version the read and write operations must be strictly sequential because of possible overlap in the arrays. Again, this makes the code twice as slow.

Results Reported for the NEC µPD77016 Compiler
Table 2 below shows results reported by NEC for the CoSy based µPD77016 Compiler.

Table 2

Four applications are reported in this table. For each, the total number of clock cycles for running the application and the size of the application in bytes are given.

The first application is a control application. It is included to show the difference with signal processing applications and demonstrates that there are few places for DSP-C to be used in control code.

The other three applications are signal processing codes. They show that a code size improvement of up to a factor of 4 can be achieved by employing DSP-C. Even more impressive are the speed improvements reported for the second and fourth applications. They show almost a factor of 10 performance increase!

This factor of 10 raises some questions, actually. In the previous section on individual features it was shown that typical improvements of a factor of 2 are reasonable to expect, unless the code is full of saturations. The next experiment is designed to shed some more light on this.

Rewriting a loop from MiBench
For the next experiment a procedure from the MiBench embedded processor benchmarks was taken. The selected code, written in ISO C, is from the telecomm/gsm directory, procedure Short_term_analysis_filtering. This procedure uses a number of macros and functions to implement fixed point arithmetic. The inner loop does two memory read operations and one memory write, as well as two fixed point multiplications and two fixed point additions.

The compiler used is the same as that used earlier. The target architecture is a 16-bit VLIW DSP that can do two memory operations and a dual MAC operation in a single cycle. The compiler translates the ISO C code by inlining all function calls in the code, thus obtaining maximal speed.

When rewriting the code in DSP-C, no more function calls remain. All operations can be expressed in DSP-C. This makes an enormous difference not only for the compiler but also for the programmer: the clarity of the code is improved tremendously. While the original is tricky to understand because of the emulation of fixed point, the DSP-C version is absolutely clear in its intent. The results are shown in Table 3, below.

Table 3

The columns ISO C and DSP-C show the huge impact of rewriting the code into DSP-C. For cycles, the number of clock cycles in the inner loop is reported. A stunning factor of 8 in performance improvement is reached. Examining the assembly code shows that this can be improved further.

Applying knowledge of implicit type conversions, two conversions to the accum type can be placed in the source code, forcing the compiler to use the efficient multiply-accumulate unit. This achieves the result in the last column, which is indeed optimal.

So what is the reason that such huge improvements can be achieved? Two factors contribute.

First, in the original ISO C code fixed point arithmetic was encoded using macros. This makes the code relatively easy to maintain but is not necessarily optimal. Careful study and optimization of the code makes it possible to postpone saturation, to temporarily skip shifts and to move some expensive corrections out of the inner loop. Even in ISO C, a lot of improvements are possible in this way. The end result would, however, be very complicated code that is extremely difficult to maintain and debug.

Second, the fixed point emulation in ISO C uses more variables. The architecture that was used has a rather small register file, just like most DSPs. For the given code the compiler runs out of registers and starts to generate spill code. Luckily, on this architecture spilling and restoring is rather efficient, costing only a single clock cycle; otherwise the results would have been even worse.

These figures are a proof of concept for Embedded C. They show that for real world applications Embedded C enables the use of the high speed extensions that are found in modern DSP processors. It is possible to effectively program DSP processors in C!

Conclusion
Embedded C is a relatively small extension to the C language, but its impact on the programmability of embedded processors, and DSP processors in particular, is enormous. Specialized high performance features are the reason why DSP processors exist. Without the Embedded C extension these features are inaccessible to the high-level language application programmer.

Today it is still common to program these processors in assembly language. The industry cannot afford the increasing time to market that assembly programming of ever more complicated applications implies. Moreover, the inherent dependency on a specific processor and a limited number of highly specialized assembly programmers incurs great risk and paralyzes future developments.

Embedded C offers a practical solution with proven results. It is being adopted by more and more compiler developers. With the ratification of the ISO technical report, Embedded C is the standard solution for high-level language programming of the many billions of embedded processors out in the field.

Marcel Beemster is Senior Software Engineer, Hans van Someren is Principal Architect, and Willem Wakker is Director of ACE Consulting at Associated Compiler Experts.

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.

References
[1] ACE. DSP-C, an extension to ISO/IEC IS 9899:1990. Technical Report CoSy-8025P-dsp-c, ACE Associated Compiler Experts bv, Amsterdam, The Netherlands, 1998. Downloadable from http://www.dsp-c.org.
