How to analyze C-code for parallelization -

How to analyze C-code for parallelization

(Note: This article originally appeared on EE Times Europe. )

Computer scientists and engineers have understood for a long time – almost from the inception of computer science as a discipline – that a possible route to accelerating the execution of a computational task is to exploit the parallelism inherent in its program flow.

If it were possible to identify those aspects of a program that can be executed in parallel, re-organise the program flow accordingly, and provide hardware to host each parallel fragment, the overall task should run to completion in greatly reduced time.

In practice, the problem has proved, over time, to be relatively intractable. For a considerable period, finding a solution was taken off the critical path to improving performance, as the semiconductor industry continued to provide rapidly-increasing processing power through faster and faster CPU clock rates. That era has now come to an end and microprocessor designers have turned to multicore architectures to further increase throughput. The availability of a suitable hardware platform on which to run parallel-structured tasks once again focuses attention on the desirability of parallelizing program flows.

The feasibility of taking a single sequential program and breaking it apart to run various pieces in parallel depends on whether these pieces rely on each other for completion. This is driven by dependencies within the program, and understanding such dependencies is critical to creating an effective parallel implementation that is functionally identical to the original sequential version.

Amdahl's Law still applies
The relationship known as Amdahl's Law states, more or less, that the amount of performance improvement that can be gained through parallel computation is effectively limited by those portions of the program that have to be run in order. For n processors, the only way to have the program run n times as fast is if it can be broken up into n mutually-independent pieces of equal length; this is almost never possible. That means that programmers will have to be satisfied with something less than a speedup of n times. The design challenge becomes one of trying to maximise that speedup.

Figure 1: Uneven parallelization of a program limits the amount of speedup that can be achieved.

In a given program, there are generally things whose performance is critical and things that are less critical. Getting the critical portions to run as quickly as possible involves ensuring that the non-critical parts are out of the way. All efforts at parallel computing rely on removing those dependencies that can be eliminated and working around the ones that remain. Understanding where all of those dependencies lie is nontrivial, and missing dependencies can result in a parallel version of a program that is functionally different from the sequential original.

The programming language used to write the program does not, in principle, affect the analysis of the program, although an ideal parallel language (which does not exist) would not need an analysis tool; the dependencies would be obvious by inspection. ANSI C is very far from that ideal; however, it is by far the most common language used to program embedded systems (as is C++ in enterprise systems), and is likely to remain so. This fact also means that in practice, the distinction between the problem of parallelizing new and legacy code is not as great as might be imagined.

Parallelization will create a section of a program that is to be executed concurrently with other sections. Such a section can be implemented as a thread or process. Threads are subordinate to a process and share a common memory space; processes are completely independent. A multicore program can be implemented as a single process with multiple threads; or as a group of small single-thread processes in an asymmetric multiprocessing (AMP) environment, using inter-process communication (IPC) to share any necessary information; or as a combination of both.

At what level should opportunities for parallel execution be investigated? At the lowest level, instruction-level parallelism can be exploited to reorder and parallelize low-level operations. However, this depends greatly on the architecture of the target processor and is something that compilers already focus on; it is hard to realize additional benefit. At the other extreme, if very large pieces of a program are examined, they will almost always be mutually dependent if the computing involves anything nontrivial.

A useful point of access is the processor-memory boundary. While the processor itself and its registers are managed by the compiler, memory architecture is not explicitly considered by the compiler; this becomes a level where investigating relationships is useful. The two critical operations here are store and load. The compiler will manage variables kept in processor registers, but as soon as they are stored into or loaded from memory, we have an opportunity to manage the dependencies that arise. This concept can be extended further to include peripherals (which are often memory-mapped): any time a variable leaves the scope of the processor (and compiler) it becomes an item of interest.

The other question that arises is how large a block of code can usefully be analyzed. It is very common to study the behavior of programs at the function level. But much of the computational work, especially in embedded systems, is accomplished through the use of loops, which are of critical interest when determining how to parallelize a program.

Much of the behavior of a program can be learned by inspection, or static analysis. But this can miss critical behaviors, especially regarding how pointers and other real-time memory-related operations behave, that can only be gleaned by dynamic analysis, observing what happens while a program executes on a specific set of data.

The key to effective parallelization of sequential code is dependencies, which come in different varieties. If one calculation uses the results of another calculation, then it must wait until that prior calculation is complete before it can proceed: this is a compute dependency because it is driven by the computing flow of the program. A useful way of viewing events is that a compute dependency starts with a load operation, retrieving a value that is operated on, until the sequence ends in a store operation. This type of dependency is readily identified through static analysis.

The next type of dependency is less obvious, made up of the hard-to-track activity associated primarily with pointers. The C language, in particular, allows undisciplined, unstructured use of pointers in a manner that – according to point of view – may be considered creative or sloppy: such usage is the single biggest source of unanticipated, hard-to-find, subtle bugs and problems, especially when trying to parallelize a sequential program. They cannot be handled by compilers or standard profilers.

Some previous attempts to address the issues arising from use of pointers have simply forbidden them, which places a huge rewriting burden on the engineer, who is probably not the original author of the code (and may therefore not have much insight as to the intent it embodies).A better approach is to separately analyze the behavior of pointers and identify thecritical dependencies that they create. Because of their association with pointers and memory access, these dependencies are referred to as memory dependencies. They start with a store operation, which is typically (but not exclusively) a pointer write, and end with one or more load operations, which are typically pointer references.

What makes these dependencies not statically analyzable is the fact that there may be many pointers referencing the same address at any given time during program execution, each of which has a different name. As a result, dynamic analysis is needed in order to identify what is reading from and writing to specific addresses.

Because this storing of data for later loading is effectively the communication of data from one part of the program to another, incorrect parallelization means that some parts of the program will not have the correct data. If a program is parallelized without correctly accounting for these dependencies, behavior will be badly disrupted in ways that may be very hard to debug.

These two major categories of dependency revolve around the need for data to be available for use, which means waiting to read the data. The reverse situation can also be true: a piece of data that is going to be overwritten may have to hold its current value until that value is no longer needed. This means delaying a write until all required reads are complete, a situation called “anti-dependency”.

Yet another type of dependency involves the (implicit) initialization of the program – pre-loading variable values, to take the simplest case – before it runs. Library calls can also create dependencies which may or may not be critical for correct program operation. For example, a printf call becomes important if the order of printing matters, but not if the only important thing is the printing, but not the order of printing.

Loops can often be parallelized to improve performance; each iteration of a loop can be assigned to a different task. If there are no dependencies between the iterations, then each iteration can happen at exactly the same time. There is a particularly important kind of memory dependency or anti-dependency: the consumption in one iteration of a loop of some value produced in another iteration.

Figure 2: Parallelization of a loop where data is produced (P) and consumed (C) in parallel tasks; the second task must await the data availability before proceeding.

The existence of such loop-carried dependencies does not make that section of a program a lost cause for parallelization, but it does require some adjustments to the analysis, including introducing concepts such as “dependency distance”. Broadly, this is the number of iterations of a loop that occur between the program generating some item of data and then requiring it again. Nested loops add complexity to the analysis, which must be data-centric, tracking the production and consumption of each item. Programs may well have multiple nested loops, with computed values being passed across several layers of the hierarchy from inner-most to outer-most loop.

Delivering meaningful analysis
Simply depicting the flow of information, through these various types of dependency discovered by dynamic analysis, is a challenge in itself; it is a key aspect of Vector Fabrics' analysis software vfAnalyst. Visualization of dependencies and other such critical elements significantly simplify what is otherwise a difficult task for engineers or programmers – even if they are the same people who wrote, or are in the process of writing, the code. It also hints at why conventional software profiling techniques are, if not entirely useless, at least of very limited value as parallelizing tools.

Figure 3: Example display from vfAnalyst, a tool that analyzes programs for concurrency and dependencies and can propose parallelization strategies.

It is easy to write code that, in its execution, gives rise to extremely complex, convoluted exchanges of data items. This can completely obscure, to manual inspection, opportunities for parallelization that will be turn out to be highly productive.

About the author:
Mike Beunder is CEO of Vector Fabrics B.V., a software tools vendor he co-founded in Eindhoven in 2007. In his early career he was a VSLI design engineer and he has worked in successful start-ups and blue-chip semiconductor companies in Australia, Germany, France, the USA and the Netherlands. Mr. Beunder gained a Batchelors Degree in Electrical Engineering and a Masters Degree from Twente University (Netherlands). He also has a PhD degree from Twente University and IMS Stuttgart (Germany).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.