Efficient coding has been an important subject for years, especially for resource-constrained systems running real-time applications.
Many developers take compiler optimizations for granted and do not realize the effect of efficient ‘C’ programming on program execution time. The examples described in this article serve as a reminder that program execution speed relies mainly on efficient programming practices.
To illustrate, consider writing a multiplication subroutine for a processor that does not have a hardware multiplier. Using a simple repeated-addition algorithm, the loop adds the multiplicand to a running total n times, where n is the multiplier. Compare the run time of 5×1000=5000, which takes 1000 additions, with that of 1000×5=5000, which takes only 5. A smarter loop would take the smaller number as the multiplier and the larger as the multiplicand.
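The idea can be sketched in a few lines of ‘C’. This is an illustration only; the function name and the 16-bit operand width are assumptions chosen for the example and do not come from the driver discussed later.

```c
#include <stdint.h>

/* Multiply by repeated addition, for a CPU without a hardware multiplier.
 * The operands are swapped if needed so that the loop count is always the
 * smaller of the two values. (Hypothetical helper, not part of the driver.) */
uint32_t MulByAddition(uint16_t a, uint16_t b)
{
    uint32_t multiplicand = a;
    uint32_t multiplier   = b;
    uint32_t product      = 0u;

    if (multiplier > multiplicand) {      /* keep the smaller value as loop count */
        uint32_t tmp = multiplicand;
        multiplicand = multiplier;
        multiplier   = tmp;
    }

    while (multiplier > 0u) {             /* runs min(a, b) times */
        product += multiplicand;
        multiplier--;
    }

    return product;
}
```

With this ordering, both MulByAddition(5u, 1000u) and MulByAddition(1000u, 5u) execute five additions instead of up to 1000.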
In this article, I apply small incremental code improvements to an embedded Ethernet driver and measure a performance gain of nearly 17%. This figure does not account for the effect of additional compiler optimizations or TCP windowing improvements.
|Table 1. Hardware configuration.|
Platform and Testing Strategy
We conducted the following experiment on the AT91SAM9261 platform. The technical details of the hardware and software environment are given in Table 1 above and Table 2 below.
|Table 2. Software configuration.|
Test benchmarks were obtained by running two applications developed for internal use at Micrium Inc. The first is an embedded target application designed to open a TCP listening socket and transmit 10 MB of data upon connection.
The second is a Windows console application intended to connect to the socket server and receive data until the connection is terminated by the peer. Without stopping the target, the test was invoked from the PC multiple times and the best result for each incremental revision of the driver was recorded.
All ‘C’ source files were compiled using IAR Embedded Workbench for ARM v5.20 with medium code optimization and processor ‘Thumb mode’ selected.
The instruction count for each test was obtained from disassembly listing files and accounts only for the instructions within the loops under test. Instructions related to swapping data octets are not counted, since octet swapping is disabled and the branch is always taken.
The Effects of Changes on Performance
With the same platform and consistent compilation parameters, the compilation was repeated six times, with incremental enhancements to the same code.
|Table 3. Optimizations and their impact on performance.|
Table 3 above summarizes those incremental improvements. The performance is computed in Mbps as indicated in the second column.
|Table 4. Improvement results of the six tests.|
Table 4 above shows the percentage improvement for each of the six tests; the incremental gains are plotted in Figure 1 below.
|Figure 1. Optimization effects on performance|
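The article does not spell out how the figures in Tables 3 and 4 were derived, so the following is only a sketch of one plausible calculation. It assumes that 1 Mbps means 10^6 bits per second and that each improvement is expressed relative to the Test 1 baseline; both function names are hypothetical.

```c
/* Throughput in Mbps for nbr_octets transferred in elapsed_s seconds.
 * Assumption: 1 Mbps = 1,000,000 bits per second. */
double ThroughputMbps(double nbr_octets, double elapsed_s)
{
    return (nbr_octets * 8.0) / (elapsed_s * 1.0e6);
}

/* Percent improvement of a measured throughput over a baseline throughput. */
double ImprovementPct(double mbps, double base_mbps)
{
    return ((mbps - base_mbps) / base_mbps) * 100.0;
}
```

For example, with purely illustrative numbers that are not the article's measurements, 10,000,000 octets received in 8 seconds gives 10 Mbps, and a result of 12 Mbps against a baseline of 8 Mbps is a 50% improvement.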
Test Details and Analysis
Test 1. This test serves as the baseline from which the percent improvement in Table 4 is computed. The NetDev_DataRd16() and NetDev_DataWr16() loops were written with the intention of implementing the required functionality without consideration for performance.
Consequently, this technique yields the lowest benchmark of the six tests performed. The read and write loops contain 13 instructions each, and the bit width of each variable used within the loops corresponds to the minimum width required for the operation.
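A loop of this general shape might look like the sketch below. It is not the driver's actual source: the function name, its parameters, and the file-scope flag are assumptions made for illustration. Only the identifier DataOctetSwap and the CPU_BOOLEAN type (unsigned char) come from the test descriptions.

```c
#include <stdint.h>

typedef unsigned char CPU_BOOLEAN;       /* 8-bit flag type, per the article  */
typedef uint16_t      CPU_INT16U;        /* assumed 16-bit unsigned type name */

static CPU_BOOLEAN DataOctetSwap = 0u;   /* assumed file-scope swap flag      */

/* Baseline shape: every variable uses the minimum width it needs, and the
 * loop-invariant swap test is evaluated on every iteration. */
void DataRd16_Baseline(volatile const CPU_INT16U *p_data_reg,
                       CPU_INT16U                *p_dst,
                       CPU_INT16U                 nbr_words)
{
    CPU_INT16U i;
    CPU_INT16U val;

    for (i = 0u; i < nbr_words; i++) {
        val = *p_data_reg;                            /* one 16-bit device read */
        if (DataOctetSwap != 0u) {                    /* checked each iteration */
            val = (CPU_INT16U)((val << 8) | (val >> 8));
        }
        p_dst[i] = val;
    }
}
```

Here p_data_reg stands in for the Ethernet controller's 16-bit data register; a write loop would mirror this structure with the transfer direction reversed.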
Test 2. This test improves upon Test 1 by changing the data type of one variable so that its width matches the processor data width. In this case, ‘DataOctetSwap’ was re-declared as a CPU_DATA (unsigned int) instead of a CPU_BOOLEAN (unsigned char). Consequently, the compiler exchanges the LDRB instruction for LDR and an 8% performance gain is realized.
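In source terms the change amounts to a one-line edit of the declaration. The sketch below assumes the flag is a file-scope variable; the helper function is hypothetical and is included only so that the flag is used.

```c
typedef unsigned char CPU_BOOLEAN;    /* 8-bit type used in Test 1           */
typedef unsigned int  CPU_DATA;       /* processor-width type used in Test 2 */

/* Test 1 form, kept here for comparison:
 *     static CPU_BOOLEAN DataOctetSwap;
 */
static CPU_DATA DataOctetSwap = 0u;   /* Test 2 form: same flag, native width */

/* Hypothetical helper: code that tests the flag is unchanged, only the
 * storage width of the flag differs. */
int DataOctetSwap_IsSet(void)
{
    return (DataOctetSwap != 0u) ? 1 : 0;
}
```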
Test 3. This test reverts ‘DataOctetSwap’ to its original data type of CPU_BOOLEAN and improves the loop design of NetDev_DataRd16().
NetDev_DataWr16() is left in its original non-optimized form. The conditional statement checking ‘DataOctetSwap’ has been moved outside the loop at the expense of additional code size.
Additional code is required since the loop has been written twice: once for the octet swapping case, and once without. This test shows only a marginal performance increase over the reference case because the test performed is mostly transmit oriented and only the NetDev_DataRd16() loop has been optimized.
Of course, small amounts of TCP acknowledgment data must be received in order to maintain a TCP connection; thus, a minimal performance increase is observed.
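The restructuring described for the read loop can be sketched as follows. As before, this is an illustration under assumptions rather than the driver's source: the function name, parameters, and file-scope flag are hypothetical.

```c
#include <stdint.h>

typedef unsigned char CPU_BOOLEAN;       /* 8-bit flag type, per the article  */
typedef uint16_t      CPU_INT16U;        /* assumed 16-bit unsigned type name */

static CPU_BOOLEAN DataOctetSwap = 0u;   /* assumed file-scope swap flag      */

/* The swap test is made once, outside the loop, and the loop body is
 * written twice: once with octet swapping and once without. */
void DataRd16_Hoisted(volatile const CPU_INT16U *p_data_reg,
                      CPU_INT16U                *p_dst,
                      CPU_INT16U                 nbr_words)
{
    CPU_INT16U i;
    CPU_INT16U val;

    if (DataOctetSwap != 0u) {
        for (i = 0u; i < nbr_words; i++) {            /* swapping variant     */
            val      = *p_data_reg;
            p_dst[i] = (CPU_INT16U)((val << 8) | (val >> 8));
        }
    } else {
        for (i = 0u; i < nbr_words; i++) {            /* non-swapping variant */
            p_dst[i] = *p_data_reg;
        }
    }
}
```

The function is larger than a single-loop version, but neither loop body contains the conditional, which is the code size for speed trade described above.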
Test 4. This test continues to leave ‘DataOctetSwap’ in its original data type of CPU_BOOLEAN and improves the loop design of both NetDev_DataRd16() (as in Test 3) and NetDev_DataWr16(). A significant 15% performance increase over the reference test is obtained because the TCP test performed is mostly transmit oriented.
Test 5. This test further improves upon the loop optimizations of the previous tests by changing the data types of the loop index and condition variables to match the processor data width.
These minor changes yield a 16.73% improvement over the reference case and 1.64% over the restructured loops of Test 3 and Test 4.
Since an 8-bit CPU would have a CPU_DATA type size of 8 bits, pre-processor macros were used to ensure that the loop index and conditional variables are declared at least 16 bits wide, as required for reading or writing full-sized Ethernet frames.
Otherwise, for 16- or 32-bit CPUs, the variables are declared to the respective processor data width.
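One way to express this selection with the pre-processor is sketched below. The macro SKETCH_CPU_DATA_BITS and the type name LOOP_IX are hypothetical stand-ins invented for this example; the driver's actual macro and type names are not given in the article.

```c
#include <stdint.h>

#ifndef SKETCH_CPU_DATA_BITS
#define SKETCH_CPU_DATA_BITS 32       /* assumed: 32-bit target such as ARM */
#endif

#if   (SKETCH_CPU_DATA_BITS == 32)
typedef uint32_t CPU_DATA;
#elif (SKETCH_CPU_DATA_BITS == 16)
typedef uint16_t CPU_DATA;
#else
typedef uint8_t  CPU_DATA;
#endif

/* Loop index type: native width where that is at least 16 bits, otherwise
 * forced to 16 bits so it can count the octets of a full-sized frame. */
#if (SKETCH_CPU_DATA_BITS < 16)
typedef uint16_t LOOP_IX;
#else
typedef CPU_DATA LOOP_IX;
#endif

/* Number of 16-bit transfers needed for nbr_octets (rounded up). */
LOOP_IX WordsForOctets(LOOP_IX nbr_octets)
{
    return (LOOP_IX)((nbr_octets + 1u) / 2u);
}
```

A full-sized Ethernet frame of 1518 octets needs 759 16-bit transfers, a count that an 8-bit index cannot hold.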
Test 6. This last test maintains the previous optimizations and changes the ‘DataOctetSwap’ data type back to the optimum data width, as in Test 2. This test does not show an incremental performance improvement since the ‘DataOctetSwap’ conditional statement is no longer within the loop body. As a result, the extra execution cycles are no longer incurred on every iteration.
In order to achieve optimized execution speed, manual code optimizations must be made. Compiler optimizations alone are not sufficient to recast variables to the natural processor width since the algorithmic use of those variables cannot necessarily be ascertained by the compiler.
It is up to the programmer to keep application performance considerations in mind and to determine when or if trading additional code space for added performance makes sense.
The examples above illustrate how changing only one variable to the natural width of the processor achieves nearly 50% of the realized gain.
Alternatively, a similar gain may be obtained at the expense of additional code space by rewriting the read and write loops to remove the conditional statement from within the loop body.
This reduces the number of instructions within the loop from 13 to 6 and thus results in a significantly reduced worst-case loop execution time.
Acknowledgment: I would like to acknowledge Ian Johns from Micrium Inc. for his assistance in debugging the above performance issues as well as his assistance in editing and refining the above test.
(Eric Shufro, who has worked as an embedded developer at Micrium, is currently in the Department of Computer Science and Engineering at Florida Atlantic University. He can be reached at firstname.lastname@example.org.)
 Andrew Sloss, Dominic Symes and Chris Wright, ARM Systems Developers Guide: Designing and Optimizing System Software, Morgan Kaufmann, April 2004.
 Jean Labrosse, MicroC/OS-II: The Real-Time Kernel, 2nd ed., CMP Books, June 2002.
 Atmel Corporation AT91SAM9261 product documentation.
 Micrium Inc.
 IAR Systems Embedded Workbench for ARM v5.20 product documentation.
 Some paper on efficient code
 Mike Katz, Encourage Efficient Coding Without Affecting Coding.
 Google book review: Phillip A. Laplante, Real-Time Systems Design and Analysis, Wiley Interscience.
 Paul Hsieh, Programming Optimization, 2007.