Dead Code Elimination
Code that will never be executed can be safely removed from the
program. The general problem of identifying code that will never be
executed is difficult, but there are some important special cases where
it can be done. Programmers will intentionally introduce dead code in
certain situations.
Consider this C code fragment:
#define DEBUG 0 ...
if (DEBUG) print_debug_stuff();
In the above case, the
print_debug_stuff() function is never executed, but the code allows the
programmer to override the preprocessor variable definition (perhaps
with a compile-time flag) to enable the debugging code.
This case is
easy to analyze because the condition is the constant 0, which C uses
for the false condition. Since there is no else clause in the if
statement, the compiler can totally eliminate the if statement,
rewriting the CDFG to provide a direct edge between the statements
before and after the if.
Some dead code may be introduced by the compiler. For
example, certain optimizations introduce copy statements that copy one
variable to another. If uses of the first variable can be replaced by
references to the second one, then the copy statement becomes dead code
that can be eliminated.
Procedure Inlining
Another machine-independent transformation that requires a little more
evaluation is procedure inlining. An inlined procedure does not have a
separate procedure body and procedure linkage; rather, the body of the
procedure is substituted in place for the procedure call. Figure 5-17
below
shows an example of function inlining in C.
The C++ programming
language provides an inline construct that tells the compiler to
generate inline code for a function. In this case, an inlined procedure
is generated in expanded form whenever possible. However, inlining is
not always the best thing to do. It does eliminate the procedure
linkage instructions. However, in a cached system, having multiple
copies of the function body may actually slow down the fetches of these
instructions. Inlining also increases code size, and memory may be
precious.
Function
definition:
int foo(a,b,c) { return a + b " c; }
Function
call:
z = foo(w,x,y,);
Inlining
result:
z = w + x " y;
Figure 5-17. Function
inlining in C.
Loop Transformations
Loops are important program structures - although they are compactly
described in the source code, they often use a large fraction of the
computation time. Many techniques have been designed to optimize loops.
A simple but useful transformation is known as loop
unrolling, which is illustrated in Example 5-4 below. Loop unrolling
is important because it helps expose parallelism that can be used by
later stages of the compiler.
Example 5-4 Loop
Unrolling
A simple loop in C follows:
for (i = 0; i < N; i++) {
a[i]=b[i]*c[i];
}
This loop is executed a fixed number of times, namely,
N. A straightforward implementation of the loop would create and
initialize the loop variable i, update its value on every iteration,
and test it to see whether to exit the loop. However, since the loop is
executed a fixed number of times, we can generate more direct code.
If we let N = 4, then we can substitute the above C
code for the following loop:
a[0] = b[0]*c[0];
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];
This unrolled code has no loop overhead code at all,
that is, no iteration variable and no tests. But the unrolled loop has
the same problems as the inlined procedure—it may interfere with the
cache and expands the amount of code required.
We do not, of course, have to fully unroll loops.
Rather than unroll the above loop four times, we could unroll it twice.
The following code results:
for (i = 0; i < 2; i++) {
a[i*2] = b[i*2]*c[i*2];
a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];
}
In this case, since all operations in the two lines of
the loop body are independent, later stages of the compiler may be able
to generate code that allows them to be executed efficiently on the
CPU's pipeline.
Loop fusion combines two or more loops
into a single loop. For this transformation to be legal, two conditions
must be satisfied. First, the loops must iterate over the same values.
Second, the loop bodies must not have dependencies that would be
violated if they are executed together - for example, if the second
loop's ith iteration depends on the results of the i + 1th iteration of
the first loop, the two loops cannot be combined. Loop
distribution is the opposite of loop fusion, that is,
decomposing a single loop into multiple loops.
Loop tiling breaks up a loop into a
set of nested loops, with each inner loop performing the operations on
a subset of the data. An example is shown in Figure 5-18 below. Here,
each
loop is broken up into tiles of size two. Each loop is split into two
loops - for example, the inner ii loop iterates within the tile and the
outer i loop iterates across the tiles.
The result is that the pattern
of accesses across the a array is drastically different - rather than
walking across one row in its entirety, the code walks through rows and
columns following the tile structure. Loop tiling changes the order in
which array elements are accessed, thereby allowing us to better
control the behavior of the cache during loop execution.
 |
| Figure
5-18. Loop tiling. |
We can also modify the arrays being indexed in loops.
Array
padding adds dummy data elements to a loop in order to change
the layout of the array in the cache. Although these array locations
will not be used, they do change how the useful array elements fall
into cache lines. Judicious padding can in some cases significantly
reduce the number of cache conflicts during loop execution.
Next in Part 5:
Register allocation
and scheduling
To read Part 1, go to "
Program
design and analysis."
To read Part 2 , go to "
Models
of program, assemblers and linkers."
To read Part 3, go to "
Basic
compilation techniques"
Used with the permission
of the publisher, Newnes/Elsevier,
this series of articles is based on copyrighted material from "Computers
as Components: Principles of Embedded Computer System Design" by
Wayne Wolf. The book
can be purchased on line.
Wayne Wolf is currently the Georgia
Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer,
Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's
School of Electrical and Computer Engineering (ECE). Previously a
professor of
electrical engineering at Princeton University, he worked at AT&T
Bell Laboratories. He has served as editor in chief of the ACM Transactions
on Embedded Computing and
of Design
Automation for Embedded Systems.