Editor's note: Excerpted from Software Engineering for Embedded Systems. In this second installment of a two-part series, the authors provide additional guidelines on getting better performance out of your software.
In addition to the basic C coding techniques for optimizing embedded software for performance outlined in Part 1, there are a number of other techniques that embedded developers can use, such as pragmas, hardware and software loops, loop transformations and unrolling, multisampling, partial summation, and software pipelining.
Pragmas can be used to communicate to the compiler information about loop bounds to help loop optimization. If the loop minimum and maximum are known, for example, the compiler may be able to make more aggressive optimizations.
In the example in Figure 11.9, a pragma is used to specify the loop count bounds to the compiler. In this syntax, the parameters are minimum, maximum, and multiple, respectively. If a non-zero minimum is specified, the compiler can avoid generating costly zero-iteration checking code. The compiler can use the maximum and multiple parameters to decide how many times to unroll the loop, if possible.
Hardware loops. These are mechanisms built into some embedded cores which allow zero-overhead (in most cases) looping by keeping the loop body in a buffer or by prefetching. Hardware loops are faster than normal software loops (decrement counter and branch) because they have less change-of-flow overhead. Hardware loops typically use loop registers that start with a count equal to the number of iterations of the loop, decrease by 1 each iteration (step size of −1), and finish when the loop counter is zero (Figure 11.10 ).
Compilers most often generate hardware loops automatically from C, even if the loop counter or loop structure is complex. However, there are certain criteria under which the compiler will be able to generate a hardware loop (these vary depending on the compiler and architecture). In some cases the loop structure prohibits generation, but if the programmer knows about this, the source can be modified so the compiler can generate the loop using the hardware loop functionality.
The compiler may have a feature to tell the programmer whether a hardware loop was generated (compiler feedback). Alternatively, the programmer should check the generated code to ensure hardware loops are being generated for critical code. As an example, the StarCore DSP architecture supports four hardware loops. Note the LOOPSTART and LOOPEND markings, which are assembler directives marking the start and end of the loop body, respectively (Figure 11.11 ).
Figure 11.11: LOOPSTART and LOOPEND markings
Additional tips and tricks
The following are some additional tips and tricks to use for further code optimization:
Memory contention. When data is placed in memory, be aware of how the data is accessed. Depending on the memory type, if two buses issue data transactions in a region/bank/etc., they could conflict and cause a penalty. Data should be separated appropriately to avoid this contention. The scenarios that cause contention are device-dependent because memory bank configuration and interleaving differs from device to device.
Use of unaligned accesses. In some embedded processors, devices support unaligned memory access. This is particularly useful for video applications. For example, a programmer might load four byte- values which are offset by one byte from the beginning of an area in memory. Typically there is a performance penalty for doing this.
Cache accesses. In the caches, place data that is used together next to each other in memory so that prefetching the caches is more likely to obtain the data before it is accessed. In addition, ensure that the loading of data for sequential iterations of the loop is in the same dimension as the cache prefetch.
Inlining small functions. The compiler normally inlines small functions, but the programmer can force inlining of functions if for some reason it isn’t happening (for example if size optimization is activated). For small functions the save, restore, and parameter-passing overhead can be significant relative to the number of cycles of the function itself. Therefore, inlining is beneficial. Also, inlining functions decreases the chance of an instruction cache miss because the function is sequential to the former caller function and is likely to be prefetched. Note that inlining functions increases the size of the code. On some processors, pragma inline forces every call of the function to be inlined (Figure 11.12 ).
Using vendor run-time libraries. Embedded processor vendors typically provide optimized library functions for common run-time routines like FFT, FIR, complex operations, etc. Normally, these are hand-written in assembly, as it may still be possible to improve performance over compiled C. These can be invoked by the programmer using the published API without the need to write such routines, speeding time to market.
General loop transformations
The optimization techniques that follow here are general in nature. They are critical to taking advantage of modern multi-ALU processors. A modern compiler will perform many of these optimizations, perhaps simultaneously. In addition, they can be applied on all platforms, at the C or assembly level. Therefore, throughout the section, examples are presented in general terms, in C and in assembly.
Loop unrolling. This is a technique whereby a loop body is duplicated one or more times. The loop count is then reduced by the same factor to compensate (see Table below). Loop unrolling can enable other optimizations such as:
- partial summation
- software pipelining.
Once a loop is unrolled, flexibility in coding is increased. For example, each copy of the original loop can be slightly changed. Different registers could be used in each copy. Moves can be done earlier and multiple register moves can be used.
The unrolling procedure is as follows: 1) duplicate the loop body N times; 2) decrease the loop count by a factor of N.
Figure 11.13 is an example of a correlation inner loop which has been unrolled by a factor of two.
Multisampling. This is a technique for maximizing the usage of multiple ALU execution units in parallel for the calculation of independent output values that have an overlap in input source data values. In a multisampling implementation, two or more output values are calculated in parallel by leveraging the commonality of input source data values in calculations. Unlike partial summation, multisampling is not susceptible to output value errors from intermediate calculation steps.
Multisampling can be applied to any signal-processing calculation of the form y[n] = Σ(m = 0 to M−1) x[n + m]·h[m].
Thus, using C pseudocode, the inner loop for the output value calculation can be written as:

tmp1 = x[n];
for(m = 0; m < M; m += 2)
{
    tmp2 = x[n + m + 1];
    y[n]     += tmp1*h[m];
    y[n + 1] += tmp2*h[m];
    tmp1 = x[n + m + 2];
    y[n]     += tmp2*h[m + 1];
    y[n + 1] += tmp1*h[m + 1];
}
The multisampled version works on N output samples at once. An example implementation on a two-MAC DSP is shown in Figure 11.14 .
Transforming the kernel into a multisample version involves the following changes:
- changing the outer loop counters to reflect the multisampling by N
- use of N registers for accumulation of the output data
- unrolling the inner loop N times to allow for common data elements in the calculation of the N samples to be shared
- reducing the inner loop counter by a factor of N to reflect the unrolling by N.
Partial summation. This is an optimization technique whereby the computation for one output sum is divided into multiple smaller, or partial, sums. The partial sums are added together at the end of the algorithm. Partial summation allows more use of parallelism since some serial dependency is broken, allowing the operation to complete sooner.
Partial summation can be applied to any signal-processing calculation of the form y[n] = Σ(m = 0 to M−1) x[n + m]·h[m].
To do a partial summation, each calculation is simply broken up into multiple sums. For example, for the first output sample, assuming M = 4:

sum0 = x[0 + 0]*h[0] + x[1 + 0]*h[1];
sum1 = x[2 + 0]*h[2] + x[3 + 0]*h[3];
y[0] = sum0 + sum1;
The partial sums can be chosen as any part of the total calculation. In this example, the two sums are chosen to be the first + the second, and the third + the fourth products.
Note: partial summation can cause saturation arithmetic errors. Saturation is not associative; for example, saturate(a + b) + c may not equal saturate(a + b + c). Care must be taken to ensure that such differences don't affect the program.
The partial summed implementation works on N partial sums at once. Figure 11.15 shows the implementation on a 2-MAC (multiply/accumulate) DSP.
Transforming the kernel involves the following changes:
- use of N registers for accumulation of the N partial sums;
- unrolling the inner loop will be necessary; the unrolling factor depends on the implementation, how values are reused and how multiple register moves are used;
- changing the inner loop counter to reflect the unrolling.
Software pipelining. This is an optimization whereby a sequence of instructions is transformed into a pipeline of several copies of that sequence. The sequences then work in parallel to leverage more of the available parallelism of the architecture. The sequence of instructions can be duplicated as many times as needed, substituting a different set of registers for each sequence. Those sequences of instructions can then be interwoven.
For a given sequence of dependent operations:
a = operation();
b = operation(a);
c = operation(b);
Software pipelining gives (where operations on the same line can be parallelized):
a0 = operation();
b0 = operation(a0); a1 = operation();
c0 = operation(b0); b1 = operation(a1);
c1 = operation(b1);
Implementation. A simple sequence of three dependent instructions can easily be software pipelined, for example the sequence in Figure 11.16 . A software pipelining of three sequences is shown.
The sequence of code at the beginning, where the pipeline is filling up (when there are fewer than three instructions grouped), is the prologue. Similarly, the end sequence of code with fewer than three instructions grouped is the epilogue. The grouping of three instructions in parallel can be transformed into a loop kernel as shown in Figure 11.17 . ( Note: software pipelining will increase the code size. Ensure that optimizations are worth the increase in size.)
Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time systems for the defense and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is also an international speaker. He has over 200 presentations and publications in various technology fields and has written several books on embedded software technology. He is an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.
Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communications. He's developed embedded software for the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.
Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012. This article was excerpted from Software Engineering for Embedded Systems, by Robert Oshana and Mark Kraeling.