Optimizing C programs for embedded SoC applications - Embedded.com

Optimizing C programs for embedded SoC applications

Developers programming embedded processor cores within systems-on-chip (SoCs) want their code to run fast, so that the processor can operate at a lower frequency, and to consume little memory, so that memory cost is reduced. Two key factors affecting the design team’s ability to meet such goals are the compiler’s code-optimizing efficiency and the programming style of the source code.

Compilers consist of front and back ends. The front end handles syntactic and semantic processing. The back end performs general optimizations, code generation and further optimization for the specific target processor. Good back ends rely on multilevel intermediate representations (IRs).

Optimization and code generation pass down gradually through the IR, from a high level close to the program’s syntax to a low level that approaches machine code.

Machine-independent optimizations are performed earlier in the compilation, on higher IR levels, while processor-specific optimizations are performed later, on lower IR levels. Information passes through the various IR levels so that low-level optimizations benefit from high-level information captured in earlier compiler phases.

Table 1: Modern compilers contain optimizations that allow better compilation performance.

Modern compilers offer optimization levels (Table 1 above). Tensilica’s XCC C/C++ compiler has four basic optimization levels, -O0 to -O3. Furthermore, such compilers can use profile data during compilation, which helps mitigate branch delays. Profile feedback also allows the compiler to inline only frequently called functions and to place register spills in rarely used code regions. This enables modern compilers to optimize an application’s critical portions for speed and space.

Refining scripts
For best compiler performance, programmers must think the way compilers do and understand the relationship between the C language and the target processor. The following basic rules can help developers achieve better code without much effort.

Examine the generated code. Although it is not always possible to fully understand how a compiler translates a given source program, most compilers have a -save-temps or -S flag that generates an assembly language output file, with comments provided for better understanding. For performance-critical code, examine this output file to check whether the generated code is the one expected. If not, consider the following suggestions.

Observe aliasing. The C language allows arbitrary use of pointers, thus enabling aliasing, in which multiple references name the same data object. For example, if a global variable’s address is passed as a subprogram argument, the variable can be referenced by its global name or via a pointer.

In dealing with possibly aliased objects, the compiler must keep such variables in memory, not in registers, and retain the original source program’s order of use for all variables with possible aliasing. Consider the code below:

void foo(int *a, int *b)
{
int i;
for (i=0; i<100; i++) {
*a += b[i];
}
}

One might assume that the compiler will generate code that loads *a into a register before the loop, and that the loop body loads b[i] into a register and adds it into the register containing *a in every iteration. In fact, the compiler generates code that stores *a to memory on every iteration because a and b might be aliased: *a might be an element of the array b. Though it appears unlikely that the variables are aliased, the compiler cannot be sure.

For better optimization in the presence of aliasing, compile using -IPA, use global values instead of parameters, compile with special flags, or annotate a variable declaration with the __restrict attribute.

Pointers cause aliasing problems. They make it difficult for compilers to identify the target objects specified by a pointer. To avoid possible aliases, use local variables to store values from dereferenced pointers, since indirect operations and calls can affect dereferenced values but not local variables. Consequently, the compiler can put local variables in registers.
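As a minimal sketch of this advice applied to the earlier foo() routine, the sum can be kept in a local variable and written back once. The names foo_local and sum are hypothetical, and the rewrite is only equivalent to the original under the assumption that a does not point into the array b; if it did, the two versions could produce different results.

```c
#include <assert.h>

/* Variant of foo(): accumulate in a local so the sum can stay in a register.
 * Assumption (not checked here): a does not point at any element of b. */
void foo_local(int *a, const int *b)
{
    int i;
    int sum;

    assert(a && b);          /* debug-only sanity check */
    sum = *a;                /* single load of *a before the loop */

    for (i = 0; i < 100; i++) {
        sum += b[i];         /* no store through a inside the loop */
    }
    *a = sum;                /* single store after the loop */
}
```

Because sum is a local whose address is never taken, nothing in the loop can alias it, so the compiler is free to keep it in a register.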

Proper placement of pointers and elimination of aliasing produce better code. In the following code, the optimizer does not know whether *p++ = 0 will modify len, so it cannot place len in a register. Instead, len is loaded from memory on every loop iteration.

int len = 10;
void
zero(char *p)
{
   int i;
   for (i=0; i<len; i++) *p++ = 0;
}

Using a local variable instead of the global eliminates the aliasing problem.

int len = 10;
void
zero(char *p)
{
   int local_len = len;
   int i;
   for (i=0; i<local_len; i++) *p++ = 0;
}

Use const and restrict qualifiers. The __restrict keyword asserts that dereferencing the qualified pointer is the only way to access the memory pointed to by that pointer. Loads and stores through such a pointer are assumed not to alias with other loads and stores in the same function, except for loads and stores through the same pointer variable.
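A minimal sketch of a restrict-qualified routine follows. It uses the standard C99 spelling restrict; __restrict is the compiler-specific spelling accepted by GCC-compatible compilers. The function name sum_into and the length parameter n are hypothetical, and the caller is responsible for honoring the promise that a and b do not overlap.

```c
#include <assert.h>

/* restrict is the caller's promise that *a and b[0..n-1] never overlap,
 * so the compiler may keep *a in a register for the whole loop. */
void sum_into(int *restrict a, const int *restrict b, int n)
{
    int i;

    assert(n >= 0);          /* debug-only sanity check */
    for (i = 0; i < n; i++) {
        *a += b[i];
    }
}
```

If a caller breaks the promise, for example by passing the address of an element of b as a, the behavior is undefined, so the qualifier belongs only where the no-overlap condition is actually guaranteed.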

Use local over global variables. Global variables carry values throughout the program. The compiler must therefore assume that the value of a global variable might be used or modified by calls or pointer dereferences.

int g;
void foo()
{
    int i;
    for (i=0; i<100; i++) {
      fred(i,g);
    }
}

Ideally, g is loaded once outside the loop, and its value is passed in a register to function fred. However, the compiler cannot determine that fred does not modify g. Thus, if fred does not modify g, rewrite the code using a local variable. Doing so saves a load of g into a register on every loop iteration.

int g;
void foo()
{
   int i, local_g=g;
   for (i=0; i<100; i++){
     fred(i,local_g);
   }
 }

Use appropriate data types. C programmers make assumptions about the underlying representation of basic data types, and the compiler must be careful to respect them. On most modern architectures, an unsigned char uses 8 bits, thus taking the values 0 to 255. A C program can assume that adding 1 to an unsigned char variable that has the value 255 will wrap around and generate 0 as the answer.

Modern 32-bit processors perform 32-bit addition, not 8-bit addition. Consequently, if a local variable of type unsigned char is incremented, the compiler must use additional instructions to zero-extend the variable after the addition. Therefore, use type int for local scalar variables wherever possible, particularly for those used as loop indices.
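The following sketch illustrates both points, assuming an 8-bit unsigned char as described above. The function names next_byte and sum_bytes are hypothetical. The first function shows the wraparound the compiler has to preserve; the second keeps its loop index and accumulator as int, so no narrowing is needed inside the loop.

```c
#include <assert.h>

/* The result must be narrowed back to 8 bits, so 255 + 1 yields 0. */
unsigned char next_byte(unsigned char c)
{
    return (unsigned char)(c + 1);
}

/* Index and accumulator are plain int: the natural register width,
 * with no per-iteration zero-extension of the loop counter. */
int sum_bytes(const unsigned char *buf, int n)
{
    int i;
    int total = 0;

    assert(n >= 0);          /* debug-only sanity check */
    for (i = 0; i < n; i++) {
        total += buf[i];
    }
    return total;
}
```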

Meanwhile, many embedded processors have 16-bit multiplication instructions but lack 32-bit ones. On such processors, 32-bit multiplies are emulated slowly. For data computations not requiring more than 16-bit precision, use short or unsigned short variables.
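As a rough sketch of this advice, a dot product over 16-bit samples can declare its operands as short, so the compiler knows each multiply has 16-bit inputs and may be able to select a 16-bit multiply instruction where the target has one. The name dot16 is hypothetical, and the code assumes a 32-bit int so that each product and the running sum fit without overflow.

```c
#include <assert.h>

/* Operands are declared short, so each product has 16-bit inputs.
 * Assumption: int is 32 bits wide and the sum does not overflow it. */
int dot16(const short *x, const short *y, int n)
{
    int i;
    int acc = 0;

    assert(n >= 0);          /* debug-only sanity check */
    for (i = 0; i < n; i++) {
        acc += x[i] * y[i];  /* 16-bit by 16-bit multiply, int result */
    }
    return acc;
}
```

Whether a 16-bit multiply instruction is actually chosen depends on the compiler and target; the generated assembly is the place to check.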

Don’t use indirect calls. These are calls via function pointers, including those passed as subprogram arguments. They can cause unknown side effects, such as modifying global variables, that halt optimization algorithms.

Use functions that return values over pointer parameters. Pass scalar parameters by value rather than by reference (pointers) or through global variables. Pass large structures by reference, because every structure passed by value must be completely copied on entry to the function.
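The sketch below shows one way these two rules might look in practice. The structure sample_block and the functions scale and block_total are hypothetical names introduced only for illustration.

```c
#include <assert.h>

struct sample_block {        /* hypothetical large structure */
    int data[64];
    int count;               /* number of valid entries in data */
};

/* Scalars in, scalar out: arguments and result can travel in registers,
 * and no pointer is introduced that could alias other data. */
int scale(int x, int factor)
{
    return x * factor;
}

/* Large structure passed by reference: nothing is copied on entry.
 * const documents that the routine only reads the structure. */
int block_total(const struct sample_block *blk)
{
    int i;
    int total = 0;

    assert(blk->count >= 0 && blk->count <= 64);   /* debug-only check */
    for (i = 0; i < blk->count; i++) {
        total += blk->data[i];
    }
    return total;
}
```

Passing struct sample_block by value instead would copy all 65 int members on every call.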

Avoid taking a variable’s address. This degrades a program’s performance because a local variable whose address is taken might be aliased, like a global variable.

Declare pointer parameters as const in prototypes. Do this whenever possible, such as when there is no path through the routine that modifies the object the pointer refers to. This helps the compiler avoid the pessimistic assumptions normally required for pointer and reference parameters.

Use arrays over pointers. The following code accesses an array through a pointer:

   for (i=0; i<100;i++)
      *p++ = …

On every loop iteration, both *p and the pointer p itself are assigned. Assignment to the pointer can hinder optimization.

Sometimes, the pointer can point to itself, so an assignment to *p changes the pointer’s value, forcing the compiler to generate code that reloads the pointer on every iteration. Furthermore, the compiler cannot prove that the pointer is unused outside the loop, so it generates code after the loop to update the pointer with its incremented value. Thus, it is better to use arrays over pointers.

for(i=0; i<100; i++)
     p[i] = …

Write straightforward code. Compilers perform complex optimizations, but are not good at simplifying code. For example, unrolling a loop in the source code by an amount suitable for one processor makes the code less portable, preventing a compiler from choosing the unrolling amount for different target architectures.

Avoid functions that take variable arguments. If you must use such functions, use the ANSI-standard facilities in stdarg.h.

Use tables rather than if-then-else or switch statements. For example, instead of

switch ( c ) {
   case CASE0: x = 5; break;
   case CASE1: x = 10; break;
   case CASE2: x = 1; break;
}

use a table lookup indexed by c.
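A table-based equivalent of the switch above might look like the following sketch. It assumes that CASE0, CASE1 and CASE2 are the consecutive values 0, 1 and 2, which the article does not state; the names x_table and lookup_x are hypothetical.

```c
#include <assert.h>

/* Assumed values: the article does not define CASE0..CASE2. */
enum { CASE0 = 0, CASE1 = 1, CASE2 = 2 };

/* One entry per case, in the same order as the switch arms. */
static const int x_table[] = { 5, 10, 1 };

int lookup_x(int c)
{
    /* The table has no default entry, so c must be a valid case. */
    assert(c >= CASE0 && c <= CASE2);
    return x_table[c];
}
```

The call x = lookup_x(c) then replaces the whole switch with a single indexed load. If the case values were sparse or not consecutive, a direct table like this would not apply without first mapping them to indices.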

Rely on libc library functions. Such functions are coded for efficiency.

Compiler writers have developed many complex optimizations to achieve maximum performance from modern processors and are continuing to develop increasingly clever optimization algorithms. Application programmers can aid these through appropriate programming methods.

Dror Maydan is Software Engineering Director and Steve Leibson is technology evangelist at Tensilica, Inc.

