Tuning C/C++ compilers for optimal parallel performance in multicore apps: Part 1

An important first step for an embedded systems developer who wants to take advantage of parallelism via multithreading or partitioning in his or her multicore design is to increase the scalar performance of the application.

One of the better and easier methods of doing this is to apply aggressive compiler optimization. A compiler that targets your processor and features advanced optimizations such as automatic vectorization, interprocedural optimization, and profile-guided optimization can substantially improve performance of your application.

Applying usability features of your compiler, such as those aiding compatibility, compile time, and code size, can also substantially improve development efficiency.

Why scalar optimization is so important
An absolute prerequisite for parallel optimization is highly tuned scalar performance. Why is this claim true? Consider a hypothetical performance optimization project with the following requirement: parallel optimization must provide a 30% increase in performance over the scalar version of the application.

A development team, Team M, is created to develop a prototype of the application that employs parallel optimization and multi-core processors to meet the performance requirement. Another team, Team S, is created to see how much scalar optimization techniques can improve performance. Each team prototypes their improvement and a performance comparison is obtained.

Figure 5.1 below is a graphical representation of the performance obtained by the different teams. As you can see, Team M increased performance by 43% over the original code. Team S increased performance by 11% over the original code. The question: did Team M meet its goal?

Figure 5.1: Hypothetical scalar versus parallel performance improvement

The last column in the graph shows the performance improvement of Team M's results over Team S's, which can be considered the new scalar version of the application.

Strictly speaking, Team M did not meet the goal. The performance difference between Team M and the new scalar version (Team S) is only 29%. Now it could be argued that the goal should have been clearer in specifying that the original version of the code be used for the performance comparison, but the reality is we are discussing end results.

If the scalar optimization work could be accomplished with minimal effort and the parallel optimization effort required a large amount of resources, the parallel optimization effort may be terminated. If Team M had known about the performance headroom offered by scalar optimization, Team M could have perhaps applied parallel techniques on more of the application to meet the performance goal.

The importance of compilation
C and C++ compilers are widely used for embedded software development [1]. Compiler optimization can play a large role in increasing application performance, and employing the latest compiler technology specifically targeting your processor can provide even greater benefit.

Figure 5.2 below is a comparison of two different compilers and several optimization settings executing SPEC CINT2000 [1] on an Intel Pentium M processor system. The two compilers employed are labeled "comp1" and "comp2," respectively [2].

Figure 5.2: Compiler and target architecture comparison

Four different optimization settings were employed when comp1 compiled the benchmark. Two different optimization settings were used when comp2 compiled the benchmark. The compiler comp1, targeting an Intel486 processor, serves as the baseline and is represented by the far left bar in the graph.

Using comp1 with the -O3 option and targeting an Intel486 processor produces a 4% performance improvement. The -O3 option provides greater optimization over the default optimization setting. Using comp1 with the -O3 option and its default compiler processor target results in a 12% improvement. Using comp1 with the processor target of an Intel Pentium 4 processor results in a 17% performance improvement.

Employing the second compiler, comp2, and targeting the Pentium M processor leads to a 38% performance improvement. Finally, using comp2 targeting the Pentium M processor and advanced optimization leads to a 62% improvement.

The baseline compiler and option setting (comp1 targeting an Intel486 processor) represents legacy applications that employ dated compilation tools and target older processors while relying solely upon new processors to provide application performance improvement.

Comp2 targeting the Pentium M processor and using advanced optimization represents a compiler with the latest optimization technology and optimization that specifically targets the processor used to execute the application in the field. Terms such as advanced optimization and processor target are explained more fully in later sections of this two-part series. But two points should be clear from this data:

1) Legacy applications that have not been recompiled for new processor targets may be sacrificing performance; and

2) Employing a high-performance compiler specifically tuned for your architecture can lead to big performance improvements.

For developers who use C and C++, knowing the performance features available in their respective compilers is essential. This chapter describes performance features available in many C and C++ compilers, the benefits of these features, and how to employ them.

In Part 1 of this two-part series, I will detail some C and C++ compiler performance features such as general optimizations, advanced optimizations, and user-directed optimization. In Part 2, I will go into detail on the process to use when optimizing your application, as well as discuss usability features that can aid compatibility and methods for reducing compile time and code size. With these techniques, embedded developers can extract higher performance from their applications with less effort.

Compiler optimizations
A compiler optimizes by analyzing source code and determining a machine-language representation of that source code that executes as efficiently as possible. Compiler optimizations can be grouped into general optimizations and advanced optimizations:

1) General optimizations include both architecture-independent and architecture-dependent optimizations.

2) Advanced optimizations are specialized and show the highest benefits on very specific types of code.

General Optimizations. General optimizations consist of architecture-independent and architecture-dependent optimizations. Architecture-independent optimizations do not rely upon knowledge of the underlying architecture; they are good techniques to apply no matter what platform the code will execute on.

Examples of architecture-independent optimizations include dead code elimination, common-subexpression elimination, and loop-invariant code motion. Figure 5.3 below shows code where the compiler may perform the following optimizations to yield higher performance code:

Dead code elimination: lines 6 and 7 could be optimized away, because the if statement at line 6 can be proved to always be false, guaranteeing that line 7 can never execute.

Loop-invariant code motion: the compare at line 9 could be moved outside of the loop, because the value of y is not affected by anything else in the loop.

Common-subexpression elimination: the pointer calculations for the two references of a[i] at line 10 could be shared.

Figure 5.3: General optimization code sample
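Figure 5.3 itself is not reproduced in this version. The following minimal sketch is a hypothetical reconstruction, numbered so that the constructs land on the line numbers cited above:

1   void sample(int *a, int n, int y)
2   {
3       int x = 0;
4       int i;
5
6       if (x != 0)              /* provably always false               */
7           a[0] = -1;           /* dead code: can never execute        */
8       for (i = 0; i < n; i++) {
9           if (y > 0)           /* loop-invariant: y never changes     */
10              a[i] = a[i] + y; /* address of a[i] computed only once  */
11      }
12  }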

Architecture-dependent optimizations include register allocation and instruction scheduling. A detailed knowledge of the microprocessor architecture is required to create a good architecture-dependent optimizer. Register allocation is compiler functionality that determines which processor registers hold variables during computation.

The x86 ISA has only eight general-purpose registers, so the number of active computations that can be in progress at the same time is somewhat limited. Instruction scheduling is the ordering of machine-language code based upon internal processor constraints and program dependencies. Both register allocation and instruction scheduling are the domain of compiler writers and assembly language programmers.

Typically, combinations of general optimizations are bundled under a few compiler options. Table 5.1 below summarizes a variety of general optimizations available in a number of compilers that can be used to optimize your application.

The option -O2 is a good basic optimization flag and is the default optimization setting in some compilers. The -O3 option employs more aggressive optimizations that can improve performance, but may cost more in terms of application code size or compilation time. Consider these general optimization options as a first step in performance tuning of your application.

Table 5.1: Description of general optimization options

The -mtune= option schedules instructions based upon the processor specified. For example, to schedule instructions for a Pentium M processor, you would use -mtune=pentium-m. This option may result in lower performance on architectures that differ from the one specified.

The -march= option generates code with a specific processor in mind. This option automatically sets -mtune= to the same processor and in addition may use instructions that are not supported on older processors.

For example, the -march=prescott option may select SSE3 instructions, which would not execute on a processor that does not support SSE3, such as the Pentium III processor. Consider the -mtune= option if the majority of your target systems use a specific processor. Consider the -march= option if all of your target systems are feature-equivalent with the processor specified.
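As a brief illustration with gcc (the file name app.c is hypothetical):

gcc -O2 -mtune=pentium-m app.c -o app   (tuned for the Pentium M; still runs on older IA-32 processors)
gcc -O2 -march=pentium-m app.c -o app   (may emit SSE2 instructions; requires a Pentium M-compatible processor)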

Advanced Optimizations
Some optimizations take advantage of more recent technologies in a processor and/or require more effort on the part of the developer to use. The term advanced optimizations is used to refer collectively to these compiler features. The next sections describe three such advanced optimizations: automatic vectorization, interprocedural optimization, and profile-guided optimization.

Automatic Vectorization. Several processors have extended their instruction set to allow access to new capabilities in the processor, such as data prefetching and parallel execution. For example, Embedded Intel Architecture processors offer a set of instruction extensions termed MMX technology, SSE, SSE2, SSE3, SSSE3, and SSE4.

For higher-level languages such as C and C++, compiler technology provides the gateway to these new instructions. Methods of employing these instructions in your application include:

1) Writing inline assembly language to explicitly specify the new instructions
2) Employing a C intrinsic library or a C++ class library that gives access to the new instructions from the higher-level language (a brief sketch follows this list)
3) Employing automatic vectorization technology to automatically generate the new instructions
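As an illustration of the second method, here is a minimal sketch using SSE2 intrinsics; the function name and loop structure are assumptions for illustration, not taken from the original figures:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two arrays of doubles, two elements per instruction. */
void vec_add(double *a, const double *b, int n)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);   /* unaligned 128-bit load  */
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&a[i], _mm_add_pd(va, vb));
    }
    for (; i < n; i++)                      /* scalar remainder loop   */
        a[i] = a[i] + b[i];
}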

Automatic vectorization technology enables the compiler to analyze your C and C++ source code to determine where these instructions can be employed to speed up your code. A compiler that performs vectorization analyzes loops and determines when it is safe to execute several iterations of the loop in parallel.

Figure 5.4: Vectorization example (C source and resulting assembly code)

Figure 5.4 above is a C source code example and the resulting x86 assembly language when compiling the code with the Intel C++ compiler using automatic vectorization. The vectorizer is able to take advantage of SSE2 instructions and transforms the loop to compute two results of a[i] per iteration, instead of one result as specified by the source code.
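Figure 5.4 itself is not reproduced in this version. A hypothetical C loop of the kind it describes is sketched below (the resulting assembly is omitted); with SSE2, the compiler can compute two double-precision results per vector iteration:

/* Candidate for automatic vectorization: iterations are independent. */
void compute(double *a, const double *b, const double *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];   /* two results of a[i] per SSE2 operation */
}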

Interprocedural Optimization
Compilers typically process one function at a time and in isolation from other functions in the program. During optimization, the compiler must often make conservative assumptions regarding values in the program due to side-effects that may occur in other functions, limiting the opportunity for optimization.

Use a compiler with interprocedural optimization to optimize each function with detailed knowledge of other functions in the application. Interprocedural optimization enables other optimizations, which can be more aggressive because of the enhanced interprocedural information. Examples of typical optimizations enabled by interprocedural optimization are:

Inlining: the ability to inline functions without any user direction.

Arguments in registers: function arguments can be passed in registers instead of on the stack, which can reduce call/return overhead.

Interprocedural constant propagation: propagates constant arguments through function calls.

Loop-invariant code motion: better compiler analysis information can increase the amount of code that can be safely moved outside of loop bodies.

Dead code elimination: better global information can increase detection of code that can be proved to be unreachable.

Figure 5.5 below is an example of code that can be optimized effectively with interprocedural optimization.

Figure 5.5: Interprocedural optimization code sample

In the example in Figure 5.5 above, the entire body of the loop in the function check() can be optimized away for the following reasons (a hypothetical reconstruction of the code follows this list):

1) Interprocedural constant propagation will propagate the argument passed at line 10, which is constant, into the function check().

2) Loop-invariant code motion will recognize that the if statement at line 3 is not dependent on the loop.

3) Dead code elimination will eliminate the code under the true condition of the if statement at line 3 because the variable debug is a constant, zero.
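Figure 5.5 itself is not reproduced in this version. The minimal sketch below is a hypothetical reconstruction, numbered so that the if statement falls on line 3 and the call with the constant argument on line 10, as cited above:

1   static void check(int debug, int *data) {
2       for (int i = 0; i < 1000; i++) {
3           if (debug)           /* loop-invariant test               */
4               data[i] = i;     /* dead when debug is the constant 0 */
5       }
6   }
7   int main(void)
8   {
9       int data[1000];
10      check(0, data);          /* the constant 0 is propagated in   */
11      return 0;
12  }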

Profile-guided Optimization. Profile-guided optimization enables the compiler to learn from experience. It is a three-stage process:

Stage 1: Compile the software for profile generation. The compiler inserts instrumentation into the compiled code that records metrics about the behavior of the code when it is executed.

Stage 2: Execute the application to collect a profile that contains characteristics of what is termed a "training run" of the application.

Stage 3: Recompile the application to take advantage of the profile, that is, what has been learned about the application during the training run. (Example command lines for the three stages follow.)
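As a minimal sketch of the three stages using the gcc options discussed later in this article, and assuming a hypothetical single-file program app.c with a representative input training.dat:

gcc -O2 -fprofile-generate app.c -o app   (stage 1: instrumented build)
./app training.dat                        (stage 2: training run records a profile)
gcc -O2 -fprofile-use app.c -o app        (stage 3: recompile using the profile)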

The compiler now knows where the most frequently executed paths in the program reside and can prioritize its optimization to those areas. A description of several optimizations enabled by profile-guided optimization in a typical compiler follows:

Function ordering: Improves instruction cache hit rates by placing frequently executed functions together.

Switch-statement optimization: Optimizes the most frequently executed cases in a switch statement to occur first.

Basic block ordering: Improves instruction cache hit rates by placing frequently executed blocks together.

Improved register allocation: Gives the best register assignments to calculations in the most highly executed regions of code.

Table 5.2: Advanced optimization options

Advanced Optimization Options. The options that enable advanced optimization in several compilers are listed in Table 5.2 above. GNU gcc version 4.1 features advanced optimizations. Automatic vectorization is enabled with the -ftree-vectorize option.

General interprocedural optimization is not available; however, some interprocedural optimization is enabled with the -fwhole-program and --combine options. Profile-guided optimization is enabled with the -fprofile-generate option for profile generation and the -fprofile-use option for the profile use phase.

The Microsoft Visual C++ compiler does not currently support automatic vectorization. Interprocedural optimization is enabled with the /GL option for compilation and the /LTCG option at link time. Profile-guided optimization is enabled with the /GL option for compilation, linking with /LTCG:PGI for profile generation and /LTCG:PGO for profile application.

The Intel C++ Compiler for Linux enables automatic vectorization with the -xproc option, where proc specifies one of several processor targets. Interprocedural optimization is enabled with the -ipo option on both the compile command line and the link command line. Profile-guided optimization is enabled with the -prof-gen option for profile generation and the -prof-use option for profile application.
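For example, a hypothetical two-file build with the Intel compiler would pass -ipo at both steps:

icc -O2 -ipo -c main.c util.c         (compile with interprocedural optimization)
icc -O2 -ipo main.o util.o -o app     (the option is required at link time as well)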

The options listed above for each of the compilers are not precisely equivalent between them. For example, the gcc option -ftree-vectorize only enables automatic vectorization; however, the Intel C++ Compiler option -xproc enables automatic vectorization and targets the specified processor so other optimizations such as software prefetch and instruction scheduling for the processor target may be enabled.

Read your compiler documentation to understand what your compiler's advanced optimizations provide and any special requirements. Lastly, when comparing advanced optimization features of each compiler, keep in mind that the end value is application performance, which means it may be necessary to build with each compiler and measure the performance gained from its advanced optimizations.

Aiding optimizations
Developers can aid compiler optimization in situations where the compiler is restricted in terms of making assumptions about the code, how the code is laid out in memory, and when the developer has prior knowledge about the run-time behavior of the code.

In these cases, careful understanding of these issues and how to effectively communicate optimization information to the compiler can have a substantial impact on performance.

The first area where developers can assist the compiler involves the use of pointers and aliasing issues that can restrict optimization. Aliasing occurs when two pointers reference the same object. The compiler must be conservative in its assumptions and optimizations in the presence of aliasing.

Figure 5.6: Aliasing code sample

Figure 5.6 above shows a situation where the compiler cannot assume that a and b point at different regions of memory and must either optimize conservatively or add runtime checks that the pointers do not overlap.
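Figure 5.6 itself is not reproduced in this version; the following minimal sketch, with assumed function and parameter names, shows the pattern it illustrates:

/* The compiler cannot prove that a and b reference different memory. */
void add_arrays(int *a, int *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];   /* a store through a may alter what b points to */
}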

A compiler may even choose not to vectorize in this case. The problem highlighted in Figure 5.6 can be solved by providing the compiler with more information regarding the pointers referenced in the function. Several techniques are available for assisting the compiler in its alias analysis; they are summarized as follows:

Restrict: C99 defines the restrict keyword, which allows the developer to specify pointers that do not point to the same regions of memory.

Array notation: using array notation such as a[i] and b[i] helps denote referenced regions of memory.

Interprocedural optimization: enables greater alias analysis and may enable the compiler to prove that certain pointers do not point to the same region.

The restrict code sample in Figure 5.7 below is based upon the code in Figure 5.6 and adds the use of the restrict keyword to allow the compiler to disambiguate a and b and to use a vectorized version of the loop at run-time.

Figure 5.7: Restrict code sample

Figure 5.7 also includes the use of a pragma, #pragma vector aligned, which communicates data layout information to the compiler. Some architectures contain instructions that execute faster if the data is guaranteed to be aligned on specific memory boundaries.
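A hypothetical version of the same sketch with both hints applied; the pragma uses the Intel compiler syntax named above:

/* restrict (C99) promises the compiler that a and b never overlap. */
void add_arrays(int * restrict a, int * restrict b, int n)
{
    int i;
#pragma vector aligned        /* asserts the arrays are suitably aligned */
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];   /* now safe to vectorize with aligned loads */
}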

Proper use of pragmas in communicating knowledge to the compiler can lead to several benefits, as summarized in Table 5.3 below. For proper syntax and usage, refer to your compiler's reference manual.

Table 5.3: Typical compiler directives, attributes, and benefits

One last area where developers can aid the compiler is the memory layout of data structures used by the application. Modern architectures feature a memory hierarchy containing multiple levels of cache. Effective use of the memory hierarchy involves keeping data that is accessed close together in time also close together in space, in contiguous regions of memory. Techniques for optimizing for cache are numerous; a few of the important ones are:

1) Data layout
2) Data alignment
3) Prefetching

Data layout helps data structures fit in cache more effectively. Consider the structure declarations in Figure 5.8 below.

Figure 5.8: Data layout and alignment example

Suppose the value SIZE was defined to be 1. The size of soa on many compilers is 24 bytes because a compiler would pad three bytes after f, z, and v. The size of soa2 on many compilers is 16 bytes, which is a 33% reduction in size. Order the declarations of data in your structures to minimize unnecessary padding.
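Figure 5.8 itself is not reproduced in this version. The following hypothetical declarations are consistent with the sizes the text describes (SIZE defined as 1, 4-byte int):

#define SIZE 1

struct soa {        /* 24 bytes on many compilers                 */
    char f;         /* 3 bytes of padding inserted after f        */
    int  a[SIZE];
    char z;         /* 3 bytes of padding inserted after z        */
    int  b[SIZE];
    char v;         /* 3 bytes of padding inserted after v        */
    int  c[SIZE];
};

struct soa2 {       /* 16 bytes: a 33% reduction                  */
    int  a[SIZE];
    int  b[SIZE];
    int  c[SIZE];
    char f;
    char z;
    char v;         /* 1 byte of tail padding                     */
};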

Data alignment is a technique that employs useful padding to allow efficient access to data. Consider again the declarations in Figure 5.8, where this time SIZE is defined as a large number, say 100,000. There is no guarantee that the arrays a, b, and c are aligned on a 16-byte boundary, which prevents faster versions of SSE instructions from being employed if a loop that accesses the data is vectorized.

For example, if the for loop detailed in Figure 5.9 below is vectorized using SSE instructions, unaligned loads will be performed on the data, which are less efficient than their aligned counterparts. If you substitute soa2 for soa in the code, aligned loads will be employed.

Figure 5.9: Unaligned load code sample
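Figure 5.9 itself is not reproduced in this version; a hypothetical sketch of the loop, using the soa declaration above with a large SIZE:

struct soa s;

void sum_arrays(void)
{
    int i;
    for (i = 0; i < SIZE; i++)
        s.a[i] = s.b[i] + s.c[i];   /* may vectorize with unaligned loads;
                                       with struct soa2, aligned loads apply */
}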

Prefetch is a second technique to optimize memory access. The goal of prefetch is to request data to be placed in cache right before it is referenced by the application.

Modern processors feature both software instructions to prefetch data and automatic prefetch. The automatic prefetch is a processor feature that analyzes the stream of memory references, attempts to predict future access, and places predicted memory accesses in the cache.
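As a hedged illustration of the software side, the _mm_prefetch intrinsic can request data ahead of its use; the look-ahead distance of 16 elements below is an arbitrary assumption:

#include <xmmintrin.h>   /* declares _mm_prefetch */

/* Request data a few iterations ahead of where it is consumed. */
void scale(float *data, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
        data[i] = data[i] * 2.0f;   /* the prefetched line arrives in cache
                                       before this iteration reaches it     */
    }
}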

Automatic prefetch helps in applications that have regular, strided access to data and may hurt performance if the data access is irregular. In some cases, it may be beneficial to turn off automatic prefetch on your processor; several BIOS setup programs allow this setting to be changed. If your application has irregular, non-strided access to data, it may be worth turning off automatic prefetch to see if performance increases.

To read Part 2, go to "The Compiler Optimization Process."

Used with permission from Newnes, a division of Elsevier, from “Software Development for Embedded Multi-core Systems” by Max Domeika. Copyright 2008. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Max Domeika is a senior staff software engineer in the Developer Products Division at Intel, creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.

References
[1] SPEC 2000.
[2] MySQL.
[3] SetQuery Benchmark.
[4] ISO/IEC 9899:1990, Programming Languages - C.
[5] ISO/IEC 9899:1999, Programming Languages - C.
[6] ISO/IEC 14882, Standard for the C++ Language.
[7] Persistence of Vision Raytracer.
[8] EON.
[9] A. Bik, The Software Vectorization Handbook. Hillsboro, OR: Intel Press, 2004.
[10] A. Bik, Vectorization with the Intel Compilers (Part I).
