Editor’s Note: In a two part article from High Performance Embedded Computing , author Wayne Wolf provides a review of some of the most effective ways to improve the code performance and reliability. Part 2: instruction selection and modeling and code placement.
Instruction selection becomes important when the processor has limited or irregular resources. Many DSPs have irregular register sets and operations, so instruction selection becomes very important for these machines. Limited or irregular processor resources are also influential when operations are performed. Instruction selection and scheduling may interact as a result of resource limitations.
As part of the FlexWare system, Liem et al.  developed an instructiongeneration system for ASIPs with irregular register files. Figure 3-8 shows the intermediate representation for programs, which includes both data and control flow aspects.
Target instructions are described in the same basic format. Each instruction is annotated with register classes that identify how registers can communicate with each other. To cover the program graph with instructions, they use dynamic programming for the data flow segments and heuristics for the control flow parts of the graph.
The second-generation FlexWare compiler  generates code in three main stages. It first uses rule-based techniques such as those just described to select instructions. It then performs peephole optimizations on the code. Finally, it compacts the code to create a schedule for the operations and generates assembly language.
The PEAS-III compiler generator as described in Kobayashi et al.  creates mapping rules based on instruction descriptions. It classifies instructions into five categories: arithmetic/logical, control, load/store, stack manipulation, and special. The compiler generator creates scheduling information by tracing how each instruction uses pipeline resources, then calculating the latency and throughput of each instruction.
Mesman et al.  developed a model for the constraints on code scheduling. Constraints determine the set of feasible times when operations can be scheduled; the feasible schedule space can be searched to find the desired schedule. They represent the set of constraints as a weighted directed graph. Vertices in the graph represent operations, and edges represent precedence relationships. The constraints can also be written as linear inequalities, as shown in Figure 3-9 . Each node is constrained against the other.
Mesman et al. consider several types of constraints. Data dependencies are represented by a single edge whose weight is equal to the operation’s execution time. If the data dependency is in a loop, the value must be consumed before the next iteration; a backward arc for each data dependency represents this requirement.
Other types of latencies are also represented by single edges. Multicycle operations are represented using one node per stage of execution. They use 0–1 variable to model resource conflicts. To solve the system of constraints and find a schedule, they add edges to the graph to fix the times of operations. For example, potential resource conflicts are avoided by adding an edge that ensures the operations are not scheduled at the same time.
Constraints are also added to ensure that all the operations in a subgraph have time to execute. In addition, edges can be added to make variable lifetimes sequential.
Araujo and Malik  developed an optimal algorithm for instruction selection, register allocation, and scheduling on a class of architectures in which either one location or an unbounded number of those of a given class is available. The TI TMS320C25 is an example of this class. They define this class using a register transfer graph: each node in the graph is a register or main memory; a directed edge from one node to another indicates a possible data transfer, with a label on the edge that defines which instruction(s) can perform the transfer. An example of such a register transfer graph is shown in Figure 3-10 : the r3 register can be written from both r1 and r2 using the l instruction; all the cycles between r1 and r2 include the memory node. This condition ensures that values can be spilled to main memory when needed. Based on this property, Araujo and Malik showed that expression trees can always be scheduled without a memory spill.
They use a tree-grammar parser to select instructions and allocate registers; the patterns that define the instructions include the registers used by the instructions. They then use an O(n) algorithm, shown in Figure 3-11 , to schedule the instructions.
The location of code in memory is importantbecause it affects memory system performance. Simply moving a section ofcode or data from one address to another can speed up the program andreduce its energy consumption. Figure 3-12 illustrates the code placement process.
Assemblylanguage uses relative addresses. Code placement assigns absoluteaddresses to optimize two types of memory access: cache access time andpartitioned memory access. The result is absolute addresses that can befed into a linker and loader.
Figure 3-13 illustrates therelationship between absolute addresses in main memory and cache lines.Many blocks of memory map into the same cache line. We cannot avoid allconflicts, but by moving one block we can avoid mapping that block tothe same address as another block.
Ifthe blocks go to different cache lines, we are guaranteed that theywill never conflict. A less strong form of code placement would assigntwo blocks that rarely conflict to the same cache line. If the amount ofcode that needs to remain in the cache is relatively small, thenheuristics may be sufficient to avoid the major cache conflicts.
Ingeneral, an optimization algorithm must balance the caching behavior ofthe blocks in a large program to find the least objectionablecombination of cache conflicts. Hwu and Chang  placed instructionsto improve instruction cache performance. They analyzed traces to findthe relative execution times of code sections. These authors inlineexpanded frequently called functions to eliminate function calloverhead. They used a greedy algorithm to place frequently used tracesin the program image.
McFarling  analyzed program structurein addition to using trace information to determine code segments thatshould be placed in non-interfering memory locations. He annotated theprogram with the average number of times that loops are executed, basicblock size, and procedure call frequency. He then walked through theprogram to propagate labels, grouped together code segments based ontheir labels, and placed those code segments so that they would notinterfere in the graph.
McFarling also studied procedure inlining. He developed the following formula to estimate the number of cachemisses in a loop.
where sl is the effective loop body size, sb is the basic block size, f is the average execution frequency of the block, Ml is number of misses per loop instance, l is the average number of loop iterations, and S is the cache size. McFarling used the model to estimate the new cachemiss rate when a procedure is inlined. He used a greedy algorithm todetermine what functions to inline.
Pettis and Hansen evaluated several methods for determining code placement from profilingdata. They profiled their programs using gprof . They positionedprocedures such that the caller and callee were close to each other inthe program, increasing the chance that they ended on the same page,reducing the size of the page working set, and decreasing the chancethat they will knock each other out of the cache.
They orderedprocedures by building a call graph and weighting each edge with thenumber of times that the caller invokes the callee, then merging nodesconnected by high-weight edges. They also experimented with placementmethods for basic blocks. Pettis and Hansen rearranged blocks fromif-then-else code such that the most frequently executed path in theconditional takes advantage of the processor’s branch predictionmechanism. They placed basic blocks by analyzing a control flow graphwhose edges are annotated with the number of times that the given pathis executed. A bottom-up algorithm examined the edges starting with themost heavily weighted edge and grouped nodes and edges into paths byadding heavily weighted edges to the path. The authors also identifiedbasic blocks that were never executed in the profile; they called theseblocks “fluff”.
The fact that a basic block does not appear in aprofile does not mean that it is dead code that can be eliminated,since the input data set may not fully exercise the program. However,moving fluff blocks to separate procedures reduces the size of theprocedures that include highly used code, improving their cachebehavior. Their procedure-splitting algorithm added long branches toconnect the nonfluff code to the fluff code. Pettis and Hansen foundthat procedure positioning significantly reduced the number of longbranches that were executed.
They found that programs ran 1% to6% faster even without using profiling data and that positioning basicblocks to generate more straight-line code sequences reduced the numberof executed penalty branches by 42% and increased the number ofinstructions between branches from 6.19 to 8.08. They found thatsplitting procedures added a very small number of overhead instructions.
Tomiyamaand Yasuura  formulated trace placement as an integer linearprogramming problem. Their basic method increased code size between 13%and 30% on a series of benchmarks. They reduced code size by combiningtraces such that the size of a merged trace was a multiple of the cacheline size, eliminating unused locations.
Whendesigning a custom processor, we need to create more than just acompiler to support the ASIP’s programmers. We must also create acomplete set of programming tools: assembler, linker, loader, debugger,and graphical programming environment.
Figure 3-14 showsthe elements of the FlexWare system  including the applicationhardware (a), the ASIP hardware (b), and the embedded software and itsdevelopment environment (c).
This two part series is based on material from High-Performance Embedded Computin g by Wayne Wolf, with permission from Morgan Kaufmann, a division ofElsevier, Copyright 2007. For more information about this title andother similar books, please visit www.elsevierdirect.com .
Wayne Wolf iscurrently the Georgia Research Alliance Eminent Scholar holding theRhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded ComputerSystems at Georgia Institute of Technology's School of Electrical andComputer Engineering (ECE). Previously a professor of electricalengineering at Princeton University, he worked at AT&T BellLaboratories. He has served as editor in chief of the ACM Transactionson Embedded Computing and of Design Automation for Embedded Systems.
1. Embedded software in real-time signal processing systems: “Design technologies“, Gert Goossens, Van Praet, J.; Lanneer, D.; Geurts, W.; Kifli, A.; Liem, C.; Paulin, P.G., Proceedings of the IEEE, Mar 1997
2. “Code generation using tree matching and dynamic programming “, Alfred V. Aho, Mahadevan Ganapathi, Steven W. K. Tjiang; ACM Transactions on Programming Languages and Systems
3. “A compiler generation method for hw/sw codesign based on configurable processors “, Shinsuke Kobayashi, Kentaro Mita, Yoshinori Takeuchi, and Masaharu Imai, IEICE Transactions on Fundamentals , December, 2002
4. “Instruction-set modeling for asip code generation “, Rainer Leupers and P. Marwedel, Proceedings of the 9th International Conference on VLSI Design
5. “Automated synthesis of data paths in digital systems “, Chia-Jeng Tseng and Daniel P. Siewiorek, IEEE Trans. on CAD of Integrated Circuits and Systems, 1999
6. “Lower bound on latency for VLIW ASIP datapaths”, Margarida F. Jacome , Gustavo de Veciana, IEEE/ACM International Conference on Computer Design, 1999
7. “Instruction-set matching and selection for dsp and ASIP code generation “, Clifford Liem, T.C. May, and P.G. Paulin, Proceedings of EDAC-ETC-EUROASIC, 1994
8. “Flexware: a retargetable embedded-software development environment”, Pierre G. Paulin and, Miguel Santana., 2002 IEEE Design & Test of Computers
9. “Constraint analysis for DSP code generation “, Bart Mesman , Adwin H. Timmer , Jef L. van Meerbergen , Jochen A. G. Jess, ISSS '97 Proceedings of the 10th international symposium on System synthesis
10.”Optimal code generation for embedded memory non-homogeneous register architectures “, Guido Araujo and Sharad Malik, Eighth InternationalSymposium on System Synthesis
11.”Achieving high instruction cache performance with an optimizing compiler “, P. P. Chang and W.-W. Hwu, Proceedings of the 16th annualinternational symposium on Computer architecture
12. “Program optimization for instruction caches “, Scott McFarling, Proceedings of the thirdinternational conference on Architectural support for programminglanguages and operating systems
13.”Procedure merging with instruction caches “, Scott McFarling, Proceedingsof the ACM SIGPLAN 1991 conference on Programming language design andimplementation
14. “Profile guided code positioning “, Karl Pettis and Robert C. Hansen, Proceedingsof the ACM SIGPLAN Conference on Programming language design andimplementation
15. “Code placement techniques for cache miss rate reduction “, Hiroyuki Tomiyamaand Hiroto Yasuura, ACM Transaction on Design Automation of ElectronicSystems