CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Making the transition from sequential to implicit parallel programming: Part 8
Turning parallel Haskell (pH) into a production language



Embedded.com

Polymorphism and Overloading: The efficiency issues due to polymorphism and overloading are similar. Polymorphism often forces a uniform representation of data. A polymorphic procedure on arrays should work on arrays of booleans as well as on arrays of floats, and this naturally leads to representations where a float and a boolean occupy the same storage space, resulting in wastage in the latter case.

For an overloaded procedure that works on arrays of booleans and on arrays of floats, the overloading is typically accomplished by passing in a table of methods-a table of float methods if the array contains floats and a table of boolean methods if the array contains booleans.

The overloaded procedure then invokes the appropriate method from the table. Unfortunately, this can add the overhead of a procedure call for something as simple as an addition operation, which would normally be a single machine instruction in the nonoverloaded case.

Type analysis is a compiler technique that attempts to detect more precisely the types of certain program fragments to eliminate polymorphism and overloading. In the best case, identification of precise types can eliminate uniform representations in favor of natural machine representations (this is often called "unboxing").

Finally, we should note that many of the optimizations described above make it harder to implement a good debugger for the language. For example, consider a debugger in which we wish to examine the data passed into a polymorphic list-sorting function.

In order to display the objects in the input list, the debugger will need a way to reconstruct the type of those objects. However, since the function is polymorphic, the debugger needs to examine the caller of the function, and perhaps its caller, and so on, in order to find out the actual type of the input list.

This task is made more difficult if optimizations have stripped away tags and other type-identifying information. This problem has been well-studied and solutions are described in the literature.

Implementing Implicit Parallelism: Earlier in Part 3 in this series, we described the explicit fork statement of some parallel languages (such as C with P-Threads, or Java). The fork statement allows a programmer to specify the creation of a new, parallel, thread of control.

The fundamental problem with explicit forks is that it can be very difficult for the programmer to choose the placement of fork calls. Standard implementations of fork have a high overhead (it is often hundreds or thousands of times more expensive than a procedure call).

Further, the cost of multiplexing the computer's processors amongst threads is typically very high. Thus, it is necessary to find just the right balance-too many threads will result in too much forking and thread-switching overhead, whereas too few threads will leave processors idle. Having too few threads can also introduce deadlock.

For example, suppose in a call fork f(), the parent thread is supposed to produce data for the child thread. Omitting the fork (thereby converting f(} into an ordinary procedure call), will deadlock the program: the procedure f() will wait for its data, but since it has not returned, the "parent" cannot produce it.

This problem, of choosing where to fork, is faced by programmers of explicitly parallel languages and by compilers of implicitly parallel languages like pH. In principle, a pH program can be compiled into a sequential programming language with explicit forks liberally sprinkled at every point where there is potential parallelism according to pH semantics.

Unfortunately, the right number of threads is generally not syntactically obvious-it depends on the program structure and the run-time synchronization and data dependencies. Further, it can depend on program inputs; that is, there may be no single "correct" identification of fork points in the program.

A first and simple step towards solving this problem is to treat forks as hints rather than imperatives. When encountering a fork call, the run-time system actually forks a thread only if there are idle processors available; otherwise, it simply calls the specified function using a normal function call.

Thus, the programmer or the compiler can freely introduce forks that may potentially specify hundreds or thousands o£ threads, confident that the runtime system will in fact create just enough threads to keep the processors of the computer busy.

The language Cid is an example of a parallel language that explored this technique [21]. Unfortunately, this solution is not a complete solution. Most importantly, it does not solve the deadlock problem described above, because once a fork is treated as a procedure call, there is no going back.

Another problem, less serious, is that the decision whether or not to fork is taken at the time when the fork is encountered, based on the number of idle processors at that time. If the situation changes later, there is no way to revisit the decision (e.g., the system decides not to fork because all processors are busy, but processors become free shortly after the decision is taken).

A more sophisticated approach is called Lazy Task Creation (or LTC). It was originally pioneered in the language Multilisp [15] with an efficient implementation by Marc Feeley [20] but has been further developed in Lazy ThreadsStackThreads and in the language Cilk from MIT [12].

Under LTC a fork statement is always treated as an ordinary function call, but the runtime system performs just a little additional book-keeping, leaving a small record with enough information to revoke this decision if necessary.

If a processor becomes idle, it is able to examine the runtime data structures and, if it finds such a record, it can retroactively convert the procedure call into a fork point, and execute the pending work.

The processor that created the record may not be the same as the processor that exploits such a record; in this case, the second processor is said to "steal" the parent from the first.

The key to making this work well is to minimize the overhead of the initial book-keeping, even if it makes stealing expensive. The rationale is that, while there may be thousands or millions of potential fork-points encountered during execution (it depends on the program structure and the data), the actual number of thread creation events (stealing) will be much lower, sometimes as low as the number of processors, which is typically a much smaller number.

1 | 2 | 3 | 4 | 5

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :