By Rishiyur S. Nikhil (Bluespec) and Arvind (MIT)
Polymorphism
and Overloading: The efficiency issues due to polymorphism and
overloading are similar. Polymorphism often forces a uniform
representation of data. A polymorphic procedure on arrays should work
on arrays of booleans as well as on arrays of floats, and this
naturally leads to representations where a float and a boolean occupy
the same storage space, resulting in wastage in the latter case.
For an overloaded procedure that works on arrays of booleans and on
arrays of floats, the overloading is typically accomplished by passing
in a table of methods-a table of float methods if the array contains
floats and a table of boolean methods if the array contains booleans.
The overloaded procedure then invokes the appropriate method from
the table. Unfortunately, this can add the overhead of a procedure call
for something as simple as an addition operation, which would normally
be a single machine instruction in the nonoverloaded case.
Type analysis is a compiler technique that attempts to detect more
precisely the types of certain program fragments to eliminate
polymorphism and overloading. In the best case, identification of
precise types can eliminate uniform representations in favor of natural
machine representations (this is often called "unboxing").
Finally, we should note that many of the optimizations described
above make it harder to implement a good debugger for the language. For
example, consider a debugger in which we wish to examine the data
passed into a polymorphic list-sorting function.
In order to display the objects in the input list, the debugger will
need a way to reconstruct the type of those objects. However, since the
function is polymorphic, the debugger needs to examine the caller of
the function, and perhaps its caller, and so on, in order to find out
the actual type of the input list.
This task is made more difficult if optimizations have stripped away
tags and other type-identifying information. This problem has been
well-studied and solutions are described in the literature.
Implementing
Implicit Parallelism: Earlier in Part 3 in this series, we
described the explicit fork statement of some parallel languages (such as C with P-Threads, or Java).
The fork statement allows a programmer to specify the creation of a
new, parallel, thread of control.
The fundamental problem with explicit forks is that it can be very
difficult for the programmer to choose the placement of fork calls.
Standard implementations of fork have a high overhead (it is often hundreds or thousands of times
more expensive than a procedure call).
Further, the cost of multiplexing the computer's processors amongst
threads is typically very high. Thus, it is necessary to find just the
right balance-too many threads will result in too much forking and
thread-switching overhead, whereas too few threads will leave
processors idle. Having too few threads can also introduce deadlock.
For example, suppose in a call fork f(), the
parent thread is supposed to produce data for the child thread.
Omitting the fork (thereby converting f(} into an
ordinary procedure call), will deadlock the program: the procedure f() will wait
for its data, but since it has not returned, the "parent" cannot
produce it.
This problem, of choosing where to fork, is faced by programmers of
explicitly parallel languages and by compilers of implicitly parallel
languages like pH. In principle, a pH program can be compiled into a
sequential programming language with explicit forks liberally sprinkled
at every point where there is potential parallelism according to pH
semantics.
Unfortunately, the right number of threads is generally not
syntactically obvious-it depends on the program structure and the
run-time synchronization and data dependencies. Further, it can depend
on program inputs; that is, there may be no single "correct"
identification of fork points in the program.
A first and simple step towards solving this problem is to treat
forks as hints rather than imperatives. When encountering a fork call,
the run-time system actually forks a thread only if there are idle
processors available; otherwise, it simply calls the specified function
using a normal function call.
Thus, the programmer or the compiler can freely introduce forks that
may potentially specify hundreds or thousands o£ threads,
confident that the runtime system will in fact create just enough
threads to keep the processors of the computer busy.
The language Cid is an example of a parallel language that explored
this technique [21]. Unfortunately, this solution is not a complete
solution. Most importantly, it does not solve the deadlock problem
described above, because once a fork is treated as a procedure call,
there is no going back.
Another problem, less serious, is that the decision whether or not
to fork is taken at the time when the fork is encountered, based on the
number of idle processors at that time. If the situation changes later,
there is no way to revisit the decision (e.g., the system decides not
to fork because all processors are busy, but processors become free
shortly after the decision is taken).
A more sophisticated approach is called Lazy Task Creation (or LTC).
It was originally pioneered in the language Multilisp [15] with an
efficient implementation by Marc Feeley [20] but has been further
developed in Lazy Threads,
StackThreads
and in the language Cilk from MIT [12].
Under LTC a fork statement is always treated as an ordinary function
call, but the runtime system performs just a little additional
book-keeping, leaving a small record with enough information to revoke
this decision if necessary.
If a processor becomes idle, it is able to examine the runtime data
structures and, if it finds such a record, it can retroactively convert
the procedure call into a fork point, and execute the pending work.
The processor that created the record may not be the same as the
processor that exploits such a record; in this case, the second
processor is said to "steal" the parent from the first.
The key to making this work well is to minimize the overhead of the
initial book-keeping, even if it makes stealing expensive. The
rationale is that, while there may be thousands or millions of
potential fork-points encountered during execution (it depends on the
program structure and the data), the actual number of thread creation
events (stealing) will be much lower, sometimes as low as the number of
processors, which is typically a much smaller number.