Example 5-7 Software Pipelining
in SHARC
Software pipelining can be illustrated with a small loop on the SHARC.
Consider the following simple loop for a dot product computation:
for (i = 0; i < N; i++)
sum += a[i] * b[i];
The SHARC can perform several operations in parallel.
However, we can't perform the necessary loads and arithmetic operations
on the same cycle.
Assume that we want to rewrite the loop so that we
perform two loads, an addition, and a multiplication in one iteration.
However, because the result of one operation depends on others, we
can't do all these operations for the same iteration at the same time.
Instead, the loop body will perform operations from the following three
different iterations:
1. The two
fetches of the array elements are performed for availability
in the next cycle.
2. The multiplication
a[i]*b[i] is performed on the operands fetched by
the previous loop iteration.
3. The addition into the dot
product running sum is performed using the
result of the multiplication performed in the previous loop iteration.
When we rewrite the loop, we need to generate special
header and trailer code that takes care of the first and last
iterations that cannot be pipelined. The C code below is designed to
show which operations can be performed in parallel on the SHARC.

In this code, none of the operations in the loop body
depend on each other - remember that the p in p = ai*bi is writing a
different value than is being used by sum += p, so they can operate in
parallel. This allows all the operations to be performed in a single
SHARC instruction that performs two data reads, a multiply, and an
addition.
Instruction
Selection
Selecting the instructions to use to implement each operation is not
trivial. There may be several different instructions that can be used
to accomplish the same goal, but they may have different execution
times. Moreover, using one instruction for one part of the program may
affect the instructions that can be used in adjacent code. Although we
can't discuss all the problems and methods for code generation here, a
little bit of knowledge helps us envision what the compiler is doing.
One useful technique for generating code is template
matching, illustrated in Figure
5-21 below. We have a DAG that
represents the expression for which we want to generate code. In order
to be able to match up instructions and operations, we represent
instructions using the same DAG representation. We shaded the
instruction template nodes to distinguish them from code nodes.
Each
node has a cost, which may be simply the execution time of the
instruction or may include factors for size, power consumption, and so
on. In this case, we have shown that each instruction takes the same
amount of time, and thus all have a cost of 1. Our goal is to cover all
nodes in the code DAG with instruction DAGs - until we have covered the
code DAG we haven't generated code for all the operations in the
expression.
In this case, the lowest-cost covering uses the
multiply-add instruction to cover both nodes. If we first tried to
cover the bottom node with the multiply instruction, we would find
ourselves blocked from using the multiply-add instruction. Dynamic
programming can be used to efficiently find the lowest-cost covering of
trees, and heuristics can extend the technique to DAGs.
 |
| Figure
5-21. Code generation by template matching. |
Understanding and
Using Your Compiler
Clearly, the compiler can vastly transform your program during the
creation of assembly language. But compilers are also substantially
different in terms of the optimizations they perform. Understanding
your compiler can help you get the best code out of it.
Studying the assembly language output of the compiler
is a good way to learn about what the compiler does. Some compilers
will annotate sections of code to help you make the correspondence
between the source and assembler output. Starting with small examples
that exercise only a few types of state-ments will help.
You can
experiment with different optimization levels (the -O flag on most C
compilers). You can also try writing the same algorithm in several ways
to see how the compiler's output changes. If you can't get your
compiler to generate the code you want, you may need to write your own
assembly language. You can do this by writing it from scratch or
modifying the output of the compiler.
If you write your own assembly
code, you must ensure that it conforms to all compiler conventions,
such as procedure call linkage. If you modify the compiler output, you
should be sure that you have the algorithm right before you start
writing code so that you don't have to repeatedly edit the compiler's
assembly language output. You also need to clearly document the fact
that the high-level language source is, in fact, not the code used in
the system.
 |
| Figure
5-22. Structure of a program interpretation system |
Interpreters and JIT
Compilers
Programs are not always compiled and then separately executed. In some
cases, it may make sense to translate the program into instructions
during execution. Two well-known techniques for on-the-fly translation
are interpretation and just-in-time (JIT)
compilation. The trade-offs for both techniques are similar.
Interpretation or JIT compilation adds overhead - both time and memory
- to
execution.
However, that overhead may be more than made up for in
some
circumstances. For example, if only parts of the program are executed
over some period of time, interpretation or JIT compilation may save
memory, even taking overhead into account. Interpretation and JIT
compilation also provide added security when programs arrive over the
network.
An interpreter translates program
statements one at a time. The program may be expressed in a high-level
language, with Forth being a prime example of an embedded language that
is interpreted. An interpreter may also interpret instructions in some
abstract machine language.
As illustrated in Figure
5-22 above, the
interpreter sits between the program and the machine. It translates one
statement of the program at a time. The interpreter may or may not
generate an explicit piece of code to represent the statement.
Because
the interpreter translates only a very small piece of the program at
any given time, a small amount of memory is used to hold intermediate
representations of the program. In many cases, a Forth program plus the
Forth interpreter are smaller than the equivalent native machine code.
JIT compilers have been used for many
years, but are best known today for their use in Java environments. A
JIT compiler is somewhere between an interpreter and a
stand-alone compiler.
A JIT compiler produces executable code segments
for pieces of the program. However, it compiles a section of the
program (such as a function) only when it knows it will be executed.
Unlike an interpreter, it saves the compiled version of the code so
that the code does not have to be retranslated the next time it is
executed.
A JIT compiler saves some execution time overhead
relative to
an interpreter because it does not translate the same piece of code
multiple times, but it also uses more memory for the intermediate
representation. The JIT compiler usually generates machine code
directly rather than building intermediate program representation data
structures such as the CDFG. A JIT compiler also usually performs only
simple optimizations as compared to a stand-alone compiler.
Next in Part 6,
Analysis and
optimization of execution time.
To read Part 1, go to "
Program
design and analysis."
To read Part 2 , go to "
Models
of program, assemblers and linkers."
To read Part 3, go to "
Basic
Compilation Techniques"
To read Part 4, go to "
The creation of
procedures"
Used with the permission
of the publisher, Newnes/Elsevier,
this series of six articles is based on copyrighted material from "Computers
as Components: Principles of Embedded Computer System Design" by
Wayne Wolf. The book
can be purchased on line.
Wayne Wolf is currently the Georgia
Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer,
Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's
School of Electrical and Computer Engineering (ECE). Previously a
professor of
electrical engineering at Princeton University, he worked at AT&T
Bell Laboratories. He has served as editor in chief of the ACM Transactions
on Embedded Computing and
of Design
Automation for Embedded Systems.