Editor’s Note: In a two part article from High Performance Embedded Computing , Wayne Wolf provides a review of some of the most effective ways to improve the code performance and reliability. In Part 1: instruction modeling and register allocation.
Embedded processors are often selected for their ability to execute particular algorithms, whether they are standard parts such as DSPs or Application-specific instruction processors (ASIPs) designed for a particular project. Back-end compiler phases are often keys to exploiting the relationship between the processor architecture and the application. General-purpose compilers are designed to compile code for a wide range of purposes.
Compilation speed is often important—when designing large software systems, programmers must be able to run through the write/compile/execute/debug cycle quickly. Software developers for embedded systems have somewhat different needs. Embedded software must often meet hard targets for execution time, energy consumption, or code size.
As a result, a number of compilation algorithms developed for embedded systems are optimization algorithms; general- purpose compiler writers often have to balance the requirements of different users at the expense of precise constraint satisfaction. Embedded software developers may also be willing to wait longer for the compiler to run if it performs useful functions.
While some stages may require fast write/compile/execute/ debug cycles, other stages may require careful tuning to meet performance, size, or energy constraints. If the compiler can perform those tasks with minimal intercession by the programmer, then programmers generally are happy to wait for the compiler to do its work. As illustrated in Figure 3-1 , the major steps in code generation  are as follows:
1. Instruction selection determines which opcodes and modes are used to implement all the operations in the program. The abstract program must be covered by the selected instructions. They may also be selected to minimize program cost such as size or performance.
2. Register allocation determines which registers are used to hold the values in the program. In a general-register machine, in which all instructions can operate on all registers, register allocation can be performed strictly after instruction selection.
Many ASIPs and DSPs do not fit that model—some important instructions may work on only one or a few registers. In such cases, instruction selection must allocate critical values to registers, and the register-allocation phase allocates the remaining values to general-purpose or special-purpose registers.
3. Address generation does not serve the same purpose as the code-placement steps described later in in this series. Some instructions may use addressing modes that depend on properties of the addresses. For example, certain array-oriented instructions may work best at certain strides. Pre- or post-increment addressing can also be used to walk through the stack, providing data is put in the stack in the order in which it is used.
4. Instruction scheduling is important in pipelined and parallel machines. CPUs with branch delay slots must fill those slots with appropriate instructions. VLIW processors also require instruction scheduling.
Instruction modeling is key to the design of ASIP compilers. When designing specialized instructions, we also need to design compilers that can make use of those instructions. We cannot in general rely on compiler designers to create handcrafted compilers for ASIPs. For compilers to adapt automatically to new instructions, we must describe those instructions in ways that allow the compiler to generate code.
To understand the problems of instruction selection for complex processors, let us consider the basic problem formulation and an algorithm to solve it. We will use the twig code generator of Aho et al.  as an example, limiting ourselves to data flow operations.
We can formulate instruction selection as template matching. As shown in Figure 3-2 , we generate data flow graphs for the program. We also represent the instructions in the processor architecture as data flow graphs of the same form. A template-matching algorithm selects instructions by matching the instruction templates to the program.
The generated assembly language code must satisfy the basic correctness constraint of covering the entire program. If the program data flow graph is not fully covered, then some operations are not implemented by instructions.
Beyond correctness, the template-matching algorithm should optimize for some cost. The most basic cost is code size, but performance or energy may also be considered. The objectives may be specified by the programmer; the algorithm may make use of annotations, such as execution time or energy consumption, on the instruction templates.
A tree-rewriting rule has the form
where template is a tree, replacement is a single tree node, and cost and action are code fragments. Given a set of rules, twig generates a matching automaton that can find the low-cost match in a code segment that has been written in tree form.
Figure 3-3 shows an example of a tree-rewriting rule . The tree on the left describes an increment instruction, while the tree on the right describes an addition instruction. The right side of the rule corresponds to the right side of the formula, while the result of the formula’s left side corresponds to the left side of the rule. The cost and action code fragments are not shown in the figure.
The cost code fragment can compute the cost of a given match using arbitrary code that can examine the state of the tree matching, cost tables, or other information. The action routine is executed when a match is identified; it could, for example, generate the required code.
The algorithm used by the automaton to generate code is based on dynamic programming. Costs are evaluated at each node to guide the dynamic programming process.
This tree-based approach must be extended to accommodate ASIP processors. Most important, ASIPs do not have general-purpose register files, and many of the important instructions use specialized registers.
When designing code generators for existing machines, we only need to describe how the instruction modifies the programming model—the programmer- visible registers. When we generate ASIPs with custom instructions, we must describe the complete behavior of the instruction in the pipeline.
Paul Kohout and his coworkers , for example, describes the pipeline resources used by the instruction along with the instruction format and registers. Leupers and Marwedel  model instructions as register transfers interspersed with no-operations (NOPs). Each register transfer is executed under its execution conditions, which are described as Boolean expressions. NOPs are used to preserve live values for later consumption.
The methods used for register allocationcan be used, with some modification, for the allocation of manydifferent types of resources. We want to allocate variables to registersas illustrated in Figure 3-4 . In this case, the programvariables v1 and v3 are never used at the same time, so they can both beassigned to register R1 ; similarly, v2 and v4 can both be assigned toR1 .
Twovariables can be allocated to the same register if they are never liveat the same time—that is, if the last use of one of the variablespredates the first use of the other. Since we are assuming the scheduleof operations, we know the times at which the variable values areneeded.
Figure 3-5 shows the construction of a variable lifetimechart from a scheduled set of operations. The operations and theirschedule provide the graduation of the time axis. Each variable is givenits own column; a bar extends from the variable’s definition to itslast use. A horizontal line through the chart at any chosen time showswhich variables are in use at that time.
The example of Figure 3-5 shows only straight-line code—no loops or branches. As a result, thevariable lifetime chart has a simple structure. A more generalrepresentation for lifetimes is depicted as a conflict graph, as shownin Figure 3-6 .
Eachvariable is represented by a node. An edge joins two nodes if thosevariables have non-overlapping lifetimes. An edge, therefore, identifiesa pair of variables that can be allocated to the same register. Thepairwise information is easy to generate from the schedule. However, tochoose registers for the variables, we need to consider all theconflicts.
A register is represented in this formulation by aclique of nodes, a clique being a set in which every pair of vertices isconnected by an edge. To avoid defining a clique that is a subset of alarger clique, a clique must be the largest possible set of fullyconnected nodes. A graph and a set of cliques on it are shown in Figure 3-7 .
Dependingon the structure of the graph, some cliques may be very small: twonodes connected by a single edge or an isolated node. The fact that eachnode belongs to only one clique ensures that one variable is assignedto only one register. (So long as we keep the same value in bothregisters, duplicating a variable does no harm and, if we have nonuniform inter connections between registers and function units, multiplecopies may have some advantage. However, in most cases, duplicating avariable only takes up space that could be used by another variable.) Weensure that every variable has a register by requiring that every nodein the graph be a member of a clique. Needless to say, finding thecliques of a general graph is NP-hard.
Another way of viewingthe problem is that we want to assign a color to each node such that iftwo nodes are connected by an edge, they have different colors. If eachcolor denotes a distinct register, then forcing connected nodes to havedifferent colors ensures that variables with overlapping lifetimes havedistinct registers.
While there are many possible colorings ofthe graph, they are not equally good hardware designs. We actually wantto solve a more general problem than simply finding the cliques of theconflict graph. Using the simplest measure of cost—the total number ofregisters required–minimum cost allocation requires the fewest cliques(or, in the graph-coloring world, the coloring with the fewest colors).
Wecan define the problem more formally: given an undirected graph G = ⟨V,E⟩ , we want to partition the graph into the smallest number of cliques.Tseng and Siewiorek  developed a heuristic algorithm for finding theminimum size of the clique covering, in a conflict graph.
Registerallocation in VLIW processors has additional implications forperformance beyond the general CPU case. VLIW register files aretypically partitioned—each function unit in the data path can accessonly a subset of the registers.
When a function unit needs avalue that is in a register it cannot access the value must be copiedfrom one register file partition to another. Jacome and de Veciana developed methods to bound latency in VLIW processors with partitionedregister files.
A data flow graph is bound to the function unitsand register files of a data path. They divide the problem intowindows; the problem is defined by the start and finish steps for thewindow, the data path resource associated with the window, and the setof activities bound to that resource that we would like to schedulewithin the window’s time range. An individual window is defined on asingle operation in the data flow graph. These are used toconstruct basic windows, which group together operations that share thesame resource and scheduling range. Basic windows are then combined intoaggregated windows, which merge basic windows on the same data pathresource but not necessarily the same time window. These windows can bescheduled using an algorithm that propagates delays in order to find theglobal feasible times for the windows.
This two part series is based on material from High-Performance Embedded Computin g by Wayne Wolf, with permission from Morgan Kaufmann, a division ofElsevier, Copyright 2007. For more information about this title andother similar books, please visit www.elsevierdirect.com .
Wayne Wolf iscurrently the Georgia Research Alliance Eminent Scholar holding theRhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded ComputerSystems at Georgia Institute of Technology's School of Electrical andComputer Engineering (ECE). Previously a professor of electricalengineering at Princeton University, he worked at AT&T BellLaboratories. He has served as editor in chief of the ACM Transactionson Embedded Computing and of Design Automation for Embedded Systems.
1. Embedded software in real-time signal processing systems: “Design technologies“, Gert Goossens, Van Praet, J.; Lanneer, D.; Geurts, W.; Kifli, A.; Liem, C.; Paulin, P.G., Proceedings of the IEEE, Mar 1997
2. “Code generation using tree matching and dynamic programming “, Alfred V. Aho, Mahadevan Ganapathi, Steven W. K. Tjiang; ACM Transactions on Programming Languages and Systems
3. “A compiler generation method for hw/sw codesign based on configurable processors “, Shinsuke Kobayashi, Kentaro Mita, Yoshinori Takeuchi, and Masaharu Imai, IEICE Transactions on Fundamentals , December, 2002
4. “Instruction-set modeling for asip code generation “, Rainer Leupers and P. Marwedel, Proceedings of the 9th International Conference on VLSI Design
5. “Automated synthesis of data paths in digital systems “, Chia-Jeng Tseng and Daniel P. Siewiorek, IEEE Trans. on CAD of Integrated Circuits and Systems, 1999
6. “Lower bound on latency for VLIW ASIP datapaths”, Margarida F. Jacome , Gustavo de Veciana, IEEE/ACM International Conference on Computer Design, 1999
7. “Instruction-set matching and selection for dsp and ASIP code generation “, Clifford Liem, T.C. May, and P.G. Paulin, Proceedings of EDAC-ETC-EUROASIC, 1994
8. “Flexware: a retargetable embedded-software development environment”, Pierre G. Paulin and, Miguel Santana., 2002 IEEE Design & Test of Computers
9. “Constraint analysis for DSP code generation “, Bart Mesman , Adwin H. Timmer , Jef L. van Meerbergen , Jochen A. G. Jess, ISSS '97 Proceedings of the 10th international symposium on System synthesis
10.”Optimal code generation for embedded memory non-homogeneous register architectures “, Guido Araujo and Sharad Malik, Eighth InternationalSymposium on System Synthesis
11.”Achieving high instruction cache performance with an optimizing compiler “, P. P. Chang and W.-W. Hwu, Proceedings of the 16th annualinternational symposium on Computer architecture
12. “Program optimization for instruction caches “, Scott McFarling, Proceedings of the thirdinternational conference on Architectural support for programminglanguages and operating systems
13.”Procedure merging with instruction caches “, Scott McFarling, Proceedingsof the ACM SIGPLAN 1991 conference on Programming language design andimplementation
14. “Profile guided code positioning “, Karl Pettis and Robert C. Hansen, Proceedingsof the ACM SIGPLAN Conference on Programming language design andimplementation
15. “Code placement techniques for cache miss rate reduction “, Hiroyuki Tomiyamaand Hiroto Yasuura, ACM Transaction on Design Automation of ElectronicSystems