In this product how-to article, Vector Fabrics’ Paul Stravers describes a more efficient way to parallelize code for embedded multicore designs, illustrating the process with the company’s online tool as it parallelizes Google’s VP8 video decoder.
Many silicon vendors rely on multicore architectures to improve performance. However, some engineers might say that those vendors have not been able to deliver compilation tools that have kept pace with the architecture improvements. The tools that are available require both a good understanding of the application and a deep understanding of the target platform.
However, an alternative approach is possible. This article highlights a way to parallelize complex applications in a short time span, without needing a deep understanding of either the application or the target platform.
This can be achieved with an interactive mapping and partitioning design flow. The flow visualizes the application behavior and allows the user to interactively explore feasible multithreaded versions. The program semantics are guaranteed to be preserved under the proposed transformations.
In many cases the design of an embedded system starts with a software collection that has not yet been partitioned to match the multicore structure of the target hardware.
As a result, the software does not meet its performance requirements and hardware resources are left idle. To resolve this, an expert (or a team of experts) comes in to change the software so that it fits the target multicore structure.
Current multicore design flow practice
Figure 1 Traditional multicore design practice involves an iterative process of analysis, partitioning, loop parallelization, incorporation of semaphores, analysis, testing and retuning of code.
A typical approach, illustrated in Figure 1 above, would include the following steps:
- Analyze the application. Find the bottlenecks.
- Partition the software over the available cores. This requires a good understanding of data access patterns and data communication inside the application to match this with the cache architecture and available bandwidth of buses and channels on the target platform. Optimize some of the software kernels for the target instruction set (e.g. Intel SSE, ARM Neon).
- Identify the loops that can be parallelized. This requires a good understanding of the application: find the data dependencies, find the anti- and output dependencies, and find the shared variables. The dependencies can be hidden very deeply, and to find them often requires complex pointer analysis.
- Predict the speedup. Predict the overhead of synchronizing, the cost of creating and joining threads. Predict the impact of additional cache overhead introduced by distributing workload over multiple CPUs. If parallelizing a loop still seems worth it, go to the next step.
- Change the software to introduce semaphores, FIFOs and other communication and synchronization means. Add thread calls to create and join threads. This requires a good understanding of the APIs available on the target platform. In this stage subtle bugs are often introduced, related to data races, deadlock or livelock, that may only manifest themselves much later, e.g. after the product has been shipped to the customer.
- Test. Does it seem to function correctly? Measure. Does the system achieve the required performance level? If not, observe and probe the system. Tooling exists to observe the system; the experts must interpret these low-level observations in the context of their expert system knowledge and then draw conclusions.
- Try again to improve performance or handle data races and deadlocks. This involves repeating the above from Step 1.
Close analysis of Figure 1 shows that this design flow has many problems. Experts who can successfully complete it are a rare breed. Even when you can find them, at the start of a project it is hard to predict how many improvement and bug-fix iterations they will need before the system stabilizes.
Multicore platforms are quickly becoming a very attractive option in terms of their cost-performance ratio. But they also become more complex every year, making it harder for developers to exploit their benefits. Therefore we need a new design flow that enables any software developer to program multicore platforms. This flow is depicted in Figure 2 below.
In this alternative flow, a tool analyzes the program before it is partitioned. It finds the loops that can be parallelized and devises a synchronization strategy for these loops. The tool also has detailed knowledge of the target platform and it can estimate the cost of different partitioning and synchronization strategies.
In the remainder of this article we show this design flow at work, in a demonstration of its use in parallelizing Google’s VP8 video decoder with a cloud-based tool from Vector Fabrics. We show that tools can answer the important questions that normally require experts: which loops dominate the workload, which of these can be parallelized, how much speedup will I get, and how do I change my program to implement my parallelization choices?
After compiling and analyzing the VP8 source code, its behaviour is visualized as shown in Figure 3 below.
Figure 3 The dominant filter loop with all its loop-carried data dependencies
Here we see a 3D view of the function call and loop stacking hierarchy of the VP8 decoder program. The main() function is all the way in the back, with callees stacked up in front. The width of each function and loop invocation corresponds directly to its relative contribution to the program execution time. Wide boxes therefore represent hot spots in the program, and they are displayed most prominently on the screen.
In VP8’s case the dark blue loop shown in Figure 3 dominates the execution time of the VP8 decoder. We therefore instruct the tool to parallelize this filter loop. It responds with a parallelization proposal for a range of CPU workloads. All the developer now has to do is choose how many CPUs to dedicate to the filter job.
To give the developer insight into the proposed parallelization of the filter loop, the tool displays the loop-carried dependencies as horizontal bars.
In the example of Figure 3 some of these dependencies are really induction expressions. Such expressions can be removed from the loop because their value depends only on the iteration sequence number and not on the actual computations going on inside the loop. Only the red-coloured data dependencies present a problem requiring deeper analysis.
The filter loop is actually a stack of loops. The outer loop iterates over macroblock rows, the next nesting level iterates over macroblock columns, and deeper loops iterate over individual pixels, as follows:
for (row = 0; row < rows; row += 1)
    for (col = 0; col < cols; col += 1)
        filter_macroblock(row, col);   /* deeper loops iterate over pixels */
When the program is analyzed, the tool not only observes the presence of data dependencies between successive calls to filter_macroblock() but also computes the corresponding dependency distance vectors. These distance vectors are an important clue for parallelization, because they tell the parallelization engine exactly how many iterations of each program loop are executed before a data value is consumed.
Even if parallelization is possible, it does not necessarily mean it will make the program run faster. Often the overhead of creating threads and synchronizing them introduces more delay than can be offset by the concurrency gain. In the case of VP8’s 2-dimensional filter loop transformation (Figure 4 below), the tool can compute a schedule that includes the overhead of synchronization.
Figure 4 shows what the schedule looks like with 4 CPUs. In this picture time flows from left to right. The induction expressions do not appear in the schedule, but the red-coloured filter dependency with distance (1, -1) does. To resolve this dependency, each row starts executing with a delay of 2 columns with respect to the immediately preceding row. At the beginning and at the end of the schedule the CPUs are only partially active, resulting in a speedup of 3.5×.
Figure 5 On a multicore x86 with two to eight cores, measured speedup is within 10 percent of the predicted estimate
Figure 5 shows the resulting performance of the parallel VP8 filter loop as measured on the actual target hardware for two different video formats. The target hardware in this study is a multicore x86 and the experiments involve machine configurations with 2 to 8 cores (shown as different colours on the horizontal axis). The measured speedup is within 10% of the predicted speedup.
Paul Stravers, co-founder and chief architect at Vector Fabrics, has roots deep in NXP and Philips, with a focus on platforms for digital media processing, processors, and hardware/software co-design. He led the CAKE/Wasabi project in the research phase, developing a low-cost, high-powered programmable multi-processor architecture, and then contributed to its transformation into production-quality technology. He holds a PhD and a cum laude master’s degree in electrical engineering from Delft University of Technology. He can be contacted at .