
Tuning C/C++ compilers for optimal parallel performance in multicore apps: Part 2

As a follow-on to Part 1, which provided, among other things, an overview of compiler optimization as it relates to parallelizing code for multicore applications, this second part details a process for applying these optimizations to your application.

Figure 5.10 below depicts this process, which consists of four steps:

1. Characterize the application.
2. Prioritize compiler optimizations.
3. Select a benchmark.
4. Evaluate the performance of the compiler optimizations.

Figure 5.10: Compiler optimization process

Optimization using the compiler begins with a characterization of the application. The goal of this step is to determine properties of the code that may favor using one optimization over another and to help in prioritizing the optimizations to try.

If the application is large, it may benefit from optimizations for cache memory. If the application contains floating-point calculations, automatic vectorization may provide a benefit. Table 5.4 below summarizes the questions to consider and sample conclusions to draw based upon the answers.

Table 5.4: Application characterization

The second step is to prioritize testing of compiler optimization settings based upon an understanding of which optimizations are likely to provide a beneficial performance increase. Performance runs take time and effort, so it is essential to prioritize the optimizations that are likely to increase performance and to foresee any potential challenges in applying them.

For example, some advanced optimizations require changes to the build environment. If you want to measure the performance of these advanced optimizations, you must be willing to invest the time to make these changes. At the least, the effort required may lower their priority. Another example is the effect of higher optimization levels on debug information.

Generally, higher optimization decreases the quality of debug information. So besides measuring performance during your evaluation, you should consider the effects on other software development requirements. If the debug information degrades to an unacceptable level, you may decide against using the advanced optimization, or you may investigate compiler options that can improve the debug information.

The third step, selecting a benchmark, involves choosing a small input set for your application so that the performance of the application compiled with different optimization settings can be compared. In selecting a benchmark, keep the following in mind:

1) The benchmark runs should be reproducible; that is, they should not produce substantially different times from run to run.

2) The benchmark should run in a short time to enable running many performance experiments; however, the execution time cannot be so short that variations in run-time using the same optimizations are significant.

3) The benchmark should be representative of what your customers typically run.

The final step is to build the application using the desired optimizations, run the tests, and evaluate the performance. The tests should be run at least three times apiece. My recommendation is to discard the slowest and fastest times and use the middle time as representative.
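
As a sketch of this measurement discipline (assuming GNU time is available; ./app and bench_input are hypothetical stand-ins for your application and benchmark input, and the application is assumed to keep stderr quiet):

for i in 1 2 3; do /usr/bin/time -f "%e" ./app bench_input 2>> times.txt; done
sort -n times.txt | sed -n '2p'    # middle (median) of the three elapsed times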

I also recommend checking your results as you obtain them, to see whether the actual results match your expectations. If the time to do a performance run is significant, you may be able to analyze and verify the runs already collected while later runs proceed, and catch any mistakes or missed assumptions early.

Finally, if the measured performance meets your performance targets, it is time to place the build changes into production. If the performance does not meet your target, consider using a performance analysis tool (such as the Intel VTune Performance Analyzer).

One key point you should remember from your reading of Part 1 in this series: let the compiler do the work for you.

There are many books that show you how to perform optimizations by hand, such as unrolling loops in the source code. Compiler technology has now reached a point where, in most cases, it can determine when loop unrolling is beneficial.

In cases where the developer has knowledge about a particular piece of code that the compiler cannot ascertain, there are often directives or pragmas through which you can provide the compiler with the missing information.
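
For example, here is a minimal sketch in C99 (the restrict qualifier is standard C99; #pragma ivdep is the Intel compiler spelling, while GCC uses #pragma GCC ivdep):

void scale(float * restrict dst, const float * restrict src, int n)
{
    /* restrict asserts that dst and src do not alias, a fact the
       compiler often cannot prove on its own */
#pragma ivdep  /* assert there are no loop-carried dependences */
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}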

To see this process at work, download the PDF of the case study describing how it is used with the MySQL open source database.

C/C++ Compiler Usability
Another area to improve before beginning an effort at parallelism is C and C++ compiler usability. The capabilities of your development environment, and specifically of your C and C++ compiler, translate into increased developer productivity for the following reasons:

1) Greater standards compliance and diagnostics lead to fewer programming errors.
2) Faster compilation times lead to more builds per day.
3) Smaller application code size can enable more functionality to be provided by the application.
4) Code coverage tools help build a robust test suite.

Performance of the generated code is a key attribute of a compiler; however, there are other attributes, collectively categorized as usability attributes, that play a key role in determining ease of use and, ultimately, the success of development.

The next several sections describe techniques for getting more out of the compiler in terms of: 1) diagnostics, 2) compatibility, 3) compile time, 4) code size, and 5) code coverage. When employed correctly, these techniques can lead to increased developer efficiency and associated cost savings in the development phase of your embedded project.

Diagnostics. Two widely used languages for embedded software development are C and C++. Each of these languages has been standardized by various organizations. For example, ISO C conformance is the degree to which a compiler adheres to the ISO C standard. Currently, two ISO C standards are commonplace in the market: C89 [4] and C99 [5].

C89 is currently the de facto standard for C applications; however, use of C99 is increasing. The ISO C++ standard [6] was approved in 1998. The C++ language is derived from C89 and is largely compatible with it; differences between the languages are summarized in Appendix C of the ISO C++ standard.

Very few C++ compilers offer full conformance to the C++ standard, due to a number of somewhat esoteric features such as exception specifications and the export template. Most compilers provide an option to specify a stricter degree of ISO conformance.

Employing an option to enforce ISO conformance helps developers who may be interested in porting their application to build in multiple compiler environments.

Compiler diagnostics also aid porting and development. Many compilers feature an option to diagnose potential problems in your code that may lead to difficult-to-debug run-time issues. Use this option during the development of your code to reduce your debug time.
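
For example, with GCC-style options (a sketch; the Intel compiler offers -strict-ansi and comparable warning controls):

gcc -std=c99 -pedantic-errors -Wall -Wextra -c app.c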

One other useful diagnostic feature is the discretionary diagnostic. A discretionary diagnostic has a number associated with it, which allows the diagnostic to be promoted or demoted among a compile-stopping error message, a warning message, a remark, or no diagnostic at all. Table 5.8 below describes the options available in the Intel C++ Compiler for Linux that provide these abilities.

Table 5.8: Description of diagnostic options

Some compilers include a configuration file that allows additional options to be applied during every compilation; this can be a useful location for additional discretionary diagnostic controls.

For example, to promote warning #1011 to an error for all C compilations, placing -we1011 in the configuration file eliminates the need to manually add the option to every compilation command line. For full information on diagnostic features, including availability and syntax, please see your compiler's reference manual.
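
As a sketch, assume diagnostic #1011 flags a missing return statement at the end of a non-void function (diagnostic numbering varies by compiler and version; check your compiler's documentation):

/* ret.c -- control can fall off the end of a non-void function */
int f(int x)
{
    if (x > 0)
        return x;
}

icc -we1011 -c ret.c    # the diagnostic is now reported as a hard error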

Compatibility. One challenge in embedded software development is porting software designed and optimized for one architecture onto a new upgraded architecture within the same family.

An example is upgrading the system that runs your application from an Intel Pentium III processor-based system to an Intel Pentium 4 processor-based system. In this case, you may want to recompile the software and optimize for the new architecture and its performance features.

One highly valued compiler feature is the ability to optimize for the new architecture while maintaining compatibility with the old architecture. Processor dispatch technology solves the problem of upgrading architectures while maintaining compatibility with legacy hardware deployed in the field.

Intel has released a number of instruction set extensions such as MMX technology (Multimedia Extensions) and Streaming SIMD Extensions (SSE, SSE2, and SSE3). For example, the Intel Pentium 4 processor supports all of these instruction set extensions, but older processors, such as the Pentium III microprocessor, support only a subset of these instructions.

Processor dispatch technology allows the use of these new instructions when the application is executing on the Pentium 4 processor and designates alternate code paths when the application is executing on a processor that does not support the instructions.

There are three main techniques available that enable developers to dispatch code based upon the architecture of execution:

1) Explicit coding of cpuid: the programmer can add run-time calls to a function that identifies the processor of execution and call different versions of code based upon the results.

2) Compiler manual processor dispatch: language support for designating different versions of code for each processor of interest. Figure 5.11 below is a manual processor dispatch example that designates one code path when executing on Pentium III processors and another when executing on other processors.

3) Automatic processor dispatch: The compiler determines the profitability of using new instructions and automatically creates several versions of code for each processor of interest.

The compiler inserts code to dispatch the call to a version of the function dependent on the processor that is executing the application at the time. These processor dispatch techniques help developers take advantage of capabilities in newer processors while maintaining backward compatibility with older processors.

The drawback to using these techniques is a code size increase in your application: there are multiple copies of the same routine, only one of which will be used on a particular platform. Developers are cautioned to be cognizant of the code size impact and to consider applying the technique only to critical regions of the application.

Figure 5.11: Manual dispatch example
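
A minimal sketch of what manual dispatch looks like with the Intel compiler's cpu_dispatch/cpu_specific attributes (the attribute spellings and processor names are Intel-specific assumptions; consult your compiler's documentation):

#include <stdio.h>

/* empty stub: its cpu_dispatch list names the targeted processors,
   and the compiler generates the run-time dispatching */
__attribute__((cpu_dispatch(pentium_iii, generic)))
void do_work(void) {}

/* one definition per targeted processor */
__attribute__((cpu_specific(pentium_iii)))
void do_work(void) { printf("SSE-capable code path\n"); }

__attribute__((cpu_specific(generic)))
void do_work(void) { printf("generic fallback code path\n"); }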

Compile Time. Compile time is defined as the amount of time required for the compiler to complete compilation of your application. Compile time is affected by many factors such as size and complexity of the source code, optimization level used, and the host machine speed. The following sections discuss two techniques to improve compile time during application development.

PCH Files. Precompiled Header (PCH) files improve compile time in instances where a subset of common header files is included by several source files in your application project. PCH files are essentially a memory dump of the compiler after processing a set of header files that form the precompiled header file set.

PCH files improve compile time by eliminating recompilation of the same header files by different source files. PCH files are made possible by the observation that many source files include the same header files at the beginning of the source file.

Figure 5.12: Precompiled header file source example

Figure 5.12 above shows a method of arranging header files to take advantage of PCH files. First, the common set of header files is included from a single header file that is in turn included by the other files in the project.

In the example, global.h contains the include directives for the common set of header files and is included from the source files file1.cpp and file2.cpp. The file file2.cpp also includes another header file, vector.
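
A sketch of the arrangement Figure 5.12 describes (the header names inside global.h are assumed for illustration):

/* global.h -- the common set of header files */
#include <iostream>
#include <string>

/* file1.cpp */
#include "global.h"
void f1() { /* ... */ }

/* file2.cpp */
#include "global.h"
#pragma hdrstop      /* match the headers above against an existing PCH file */
#include <vector>
void f2() { /* ... */ }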

Use the compiler option to enable creation of a PCH file during the first compilation and use the PCH file for the subsequent compilations. During compiles with automatic PCH files, the compiler parses the include files and attempts to match the sequence of header files with an already created PCH file.

If an existing PCH file can be used, the PCH file is loaded and compilation continues. Otherwise, the compiler creates a new PCH file for the list of included files. #pragma hdrstop tells the compiler to attempt to match the set of include files above the pragma with existing PCH files.

In the case of file2.cpp, the use of #pragma hdrstop allows the same PCH file used by file1.cpp to be reused. Please note that creating and using too many unique PCH files can actually slow compilation in some cases.

The specific compilation commands used for the code in Figure 5.12 when applying PCH files are:

icc -pch -c file1.cpp
icc -pch -c file2.cpp

Try timing the compilations with and without the -pch option on the command line. Table 5.9 below shows compile time reductions for the code in Figure 5.12 and two other applications.

Table 5.9: Compile time reduction using PCH

POV-Ray [7] is a graphical application used to create complex images. EON [8], also a graphical application, is a benchmark in SPEC CINT2000. These tests were run on a 2.8 GHz Pentium 4 processor-based system with 512 MB RAM running Red Hat Linux.

Parallel Build. A second technique for reducing compilation time is the use of a parallel build. An example of a tool that supports parallel builds is make (with the -j option).
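
For example (a sketch; -j2 permits two compile jobs at once, and GNU coreutils' nproc reports the number of available CPUs):

make -j2
make -j"$(nproc)"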

Table 5.10 below shows the compile time reduction from building POV-Ray and EON on a dual-processor system with the make -j2 command. These tests were run on a dual 3.2 GHz Pentium 4 processor-based system with 512 MB RAM running Red Hat Linux.

Table 5.10: Compile time reduction using make -j2

Code Size. The compiler has a key role in the final code size of an embedded application. Developers can arm themselves with knowledge of the compiler optimization settings that impact code size and, in cases where code size is important, can reduce it by careful use of these options.

Many compilers provide optimization switches that optimize for size. Two -O options relate specifically to code size. The option -O1 optimizes for speed while disabling optimizations that typically increase code size, such as loop unrolling.

The option -Os optimizes for code size at the expense of code speed and may choose code sequences during optimization that result in smaller code size and lower performance than other known sequences.

In addition to the -O options above, which enable or disable sets of optimizations, developers may choose to disable specific code-size-increasing optimizations. Some available options include:

1) Options for individually turning off common code-size-increasing optimizations such as loop unrolling.
2) Options to disable inlining of functions in your code or in libraries.
3) Options to link in smaller versions of libraries, such as abridged C++ libraries.
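
To see the effect on your own code, here is a sketch comparing object sizes with GNU binutils' size tool (render.c is a hypothetical source file; the flags match the descriptions above):

icc -O1 -c render.c -o render_o1.o
icc -Os -c render.c -o render_os.o
size render_o1.o render_os.o    # compare the text (code) segment sizes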

The last code size suggestion pertains to the advanced optimizations detailed in the first section of this chapter. Automatic vectorization and interprocedural optimization can increase code size.

For example, automatic vectorization may turn one loop into two: a main loop that executes multiple iterations of the original loop at a time, and a second loop that handles the cleanup iterations. The inlining performed by interprocedural optimization may also increase code size.
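
Conceptually, the transformation looks like the following sketch (illustrative source, not actual compiler output):

int i;

/* original loop */
for (i = 0; i < n; i++)
    a[i] += b[i];

/* after vectorization: a main loop that processes four elements
   per SIMD iteration... */
for (i = 0; i + 4 <= n; i += 4)
    /* one SSE instruction adds b[i..i+3] into a[i..i+3] */ ;

/* ...plus a scalar cleanup loop for the remaining 0-3 iterations */
for (; i < n; i++)
    a[i] += b[i];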

To help mitigate the code size increase observed with automatic vectorization or interprocedural optimization, use profile-guided optimizations that enable the compiler to use those optimizations only where they have the largest return. Table 5.11 below summarizes a list of common code size optimizations.

Table 5.11: Code size related options

The option -fno-exceptions should be applied carefully, and only in cases where C++ exception handling is not needed. Many embedded developers desire C++ language features such as objects but may not want the overhead of exception handling.

It is also good practice to “strip” your application of unneeded symbols. Some symbols are needed for performing relocations in the context of dynamic binding; however, unneeded symbols can be eliminated on Linux systems by using

strip --strip-unneeded <executable>

Code Coverage. Many embedded developers employ a test suite of common inputs into their embedded application that provides some confidence that newly implemented code does not break existing application functionality. One common desire of a test suite is that it exercises as much of the application code as possible.

Code coverage is functionality enabled by the compiler that provides this information. Specifically, code coverage provides the percentage of your application code that is executed given a specific set of inputs. The same technology used in profile-guided optimizations enables some code coverage tools.
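
For instance, here is a sketch using GCC's gcov (the Intel compiler's coverage tool, built on the same profile-guided optimization infrastructure, follows a similar compile-run-report flow; test_input is a hypothetical test file):

gcc --coverage -O0 -c app.c
gcc --coverage app.o -o app
./app < test_input    # exercise the application with the test suite input
gcov app.c            # writes app.c.gcov with per-line execution counts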

Figure 5.13 below is a screen shot of a code coverage tool reporting on a sample program. The tool is able to highlight functions that were not executed as a result of running the application input.

The tool is also able to highlight areas inside an executed function that were not executed. The report can then be used by test suite developers to add tests that provide coverage. Consider the use of code coverage tools to increase the effectiveness of your test suite.

Figure 5.13: Code coverage tool

Optimization Effects on Debug
Debugging is a well-understood aspect of software development that typically involves the use of a software tool, a debugger, that provides: 1) tracing of application execution; 2) the ability to stop execution at specified locations in the application; and 3) the ability to examine and modify data values during execution.

One of the trade-offs in optimizing your application is a general loss of debug information, which makes debugging more difficult. The effect of optimization on debug information is a large topic in itself. In brief, optimization affects debug information for the following reasons:

1) Many optimized x86 applications use the base pointer (EBP) as a general-purpose register, which makes accurate stack backtracing more difficult.

2) Code motion allows instructions from different lines of source code to be interspersed. Stepping through lines of source sequentially is not truly possible in this case.

3) Inlining further complicates the code motion issue because a sequential stepping through of source code could result in the display of several different source locations leading to potential confusion.

Table 5.12: Debug-related options

Table 5.12 above describes several debug-related options. These options are specific to the Intel C++ Compiler for Linux; however, similar options exist in other compilers.

Please consult your compiler documentation to determine availability. The -g option is the standard debug option that enables the source information to be displayed during a debug session.

Without this option, a debugger will only be able to display an assembly language representation of your application. The other options listed are used in conjunction with the -g option and provide the capabilities listed. Consider the use of supplemental debug options when debugging optimized code.
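
For example, with GCC-style flags (a sketch; the Intel compiler equivalents appear in Table 5.12):

gcc -g -O2 -fno-omit-frame-pointer -c app.c

Here -g retains source-level debug information even at -O2, and -fno-omit-frame-pointer keeps the base pointer available so that stack backtraces remain reliable.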

Final Note: Scalar optimization techniques are a necessary prerequisite before beginning parallel optimization. Understand and use the performance features and usability features available in your compiler to meet performance goals and increase developer efficiency. The net effect of using these features is cost savings for developing embedded applications.

To read Part 1, go to Why scalar optimization is so important.

Used with permission from Newnes, a division of Elsevier, from “Software Development for Embedded Multi-core Systems” by Max Domeika. Copyright 2008. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Max Domeika is a senior staff software engineer in the Developer Products Division at Intel, creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.

References
[1] SPEC 2000
[2] MySQL
[3] SetQuery Benchmark.
[4] ISO/IEC 9899:1990, Programming Languages C.
[5] ISO/IEC 9899:1999, Programming Languages C.
[6] ISO/IEC 14882, Standard for the C++ Language.
[7] Persistence of Vision Raytracer.
[8] EON.
[9] A. Bik, The Software Vectorization Handbook. Hillsboro, OR: Intel Press, 2004.
[10] A. Bik, Vectorization with the Intel Compilers (Part I).
