Evaluating the performance of multi-core processors - Part 2 - Embedded.com

Evaluating the performance of multi-core processors – Part 2

One common question when customers review data from microprocessor benchmarks such as those discussed in Part 1 is “How well will benchmark performance predict my particular application's performance if I employ your new processor or new compiler?”

In other words, if a new processor or different compiler increases the performance of benchmark X by Y%, how much will the processor or compiler benefit my application?

Of course, the answer is: it depends. Your application is not exactly like the benchmark program, and while the processor architects and compiler engineers may use the benchmark to assess the performance of features they are adding during development, they do not necessarily have access to or tune specifically for your application.

Therefore, you should be skeptical if someone claims that you will see the same performance benefit from using a new processor or compiler as what is shown from the benchmark data.

That said, you should expect some performance improvement and there are a couple of statistical techniques that can be employed to help improve the degree of confidence you have in benchmark data in estimating your end application performance.

There are three techniques to use when attempting to characterize your application's potential performance improvement based upon benchmark data and they are summarized as:

1) Assume the performance improvements correlate with the improvements observed in the benchmark.

2) Compute the correlation between your application and a number of benchmarks and then use the performance improvement from the benchmark with highest correlation.

3) Compute a multivariable regression using historical benchmark and application data to estimate a function, f(x) = y, where x is the performance observed on the benchmarks and y is the expected performance of your application.

The first technique and second technique are fairly easy to implement. The third technique employs a multivariable regression to estimate your application performance using the following historical data as inputs in the calculation:

1) Benchmark data for a number of past processors and the new processor of interest

2) Your application's performance on the same past processors for which you have benchmark data
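The third technique can be sketched in a few lines of Python. This is a minimal illustration, assuming an ordinary least-squares fit with an intercept, solved through the normal equations. The function names (`fit_regression`, `predict`, `solve_linear_system`) are hypothetical, and every score below is a made-up placeholder, not a value from Table 3.1.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (too few processors or collinear benchmarks)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_regression(benchmark_rows, app_scores):
    """Least-squares fit of app = b0 + b1*x1 + ... + bk*xk."""
    rows = [[1.0] + list(r) for r in benchmark_rows]  # leading 1.0 is the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, app_scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, benchmark_row):
    """Evaluate the fitted function for one processor's benchmark scores."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], benchmark_row))


# Hypothetical scores for two benchmarks on five past processors.
past_benchmarks = [[1000, 1200], [1500, 1300], [2000, 2600], [2500, 2400], [3000, 3500]]
# Hypothetical application scores, constructed here as 50 + 0.5*x1 + 0.25*x2.
past_app = [850.0, 1125.0, 1700.0, 1900.0, 2425.0]

coefficients = fit_regression(past_benchmarks, past_app)
# Benchmark scores of the new processor, on which the application has not run.
estimate = predict(coefficients, [2800, 3000])
print(round(estimate, 2))
```

Because the placeholder application scores were built from a known linear relation, the estimate should land very close to 2200. Real data will not fit exactly, and you need more past processors than benchmarks for the fit to be determined.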

Consider the performance data in Table 3.1, downloadable as a two-page PDF file. The performance results of each benchmark in SPEC CINT2000 are shown for a number of processors, including the Intel Pentium 4 processors, Intel Core Duo processors, and Intel Core 2 Quad processors.

The column labeled “Processor” indicates the processor on which the benchmark was executed. The column labeled “MHz” indicates the clock speed of the processor. The scores for the individual benchmark tests comprising CINT2000 are then listed.

The “Your App” column contains hypothetical data and indicates the performance of your application, assuming you have measured it on each of the processors listed. (The data is actually from the CINT2000/300 Base benchmark, which serves as the data to predict in this example.)

The row labeled “Correlation” is the degree of correlation between the individual benchmark data and the data in the “Your App” column. The “MV Estimate” column was created by employing a multivariable regression where the CINT2000 benchmarks are the independent variables and the “Your App” data is the dependent variable.
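A correlation row like this one can be computed as sketched below. The sketch assumes the Pearson correlation coefficient; the text does not state which measure was used. The benchmark names and all scores are hypothetical placeholders. It then applies the second technique by taking the improvement of the best-correlated benchmark as the estimate for the application.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-processor scores; one list entry per past processor.
benchmark_scores = {
    "bench_a": [1000, 1500, 2000, 2500, 3000],
    "bench_b": [1200, 1300, 2600, 2400, 3500],
}
your_app = [900, 1400, 1900, 2400, 2900]  # hypothetical measurements

correlations = {name: pearson(scores, your_app)
                for name, scores in benchmark_scores.items()}
best = max(correlations, key=correlations.get)

# Technique 2: use the best-correlated benchmark's improvement as the estimate.
old_score, new_score = 2500, 2600  # hypothetical scores of `best` on the old and new processor
improvement_pct = (new_score - old_score) / old_score * 100
print(best, round(correlations[best], 3), round(improvement_pct, 2))
```

A high correlation only says that the benchmark and the application moved together across past processors; it does not guarantee they will do so on the next one.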

Finally, the baseline column indicates the overall CINT2000 rating that is part of the historical SPEC data. If you were considering moving your application from a Dual-Core Intel Xeon 5160 processor-based system with a frequency of 3 GHz to an Intel Core 2 Extreme processor X6800 with a frequency of 2.93 GHz, Table 3.2 below shows a number of different performance estimates from the previous table.

The assumption in this scenario is that you have not executed the application on the Intel Core 2 Extreme processor X6800 and thus would not have the “Actual” performance number (??? in Table 3.2 below) for it yet.

Table 3.2 : Overall performance estimate comparison

The percentage difference for the “baseline” column suggests that your application would receive a 6.11% performance increase from moving to the new processor. If you applied the performance improvement from the benchmark that had the highest degree of correlation with your application (technique #2; here 175 Base, with a correlation of 0.994), a 2.22% performance improvement is suggested.

If you applied a multivariable regression to the data set, the estimate is a performance decrease of 1.68%. The actual performance decreased by 0.99%, which in this example makes the multivariable regression estimate the closest.
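The comparison of the three estimates against the measured result is simple arithmetic, sketched here with the percentages quoted in the text. Only the variable names are invented.

```python
# Percentage changes quoted in the text for the move to the new processor.
estimates = {
    "baseline": 6.11,
    "best-correlated benchmark": 2.22,
    "multivariable regression": -1.68,
}
actual = -0.99

# Absolute error of each estimate, in percentage points.
errors = {name: abs(value - actual) for name, value in estimates.items()}
closest = min(errors, key=errors.get)

for name, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name}: off by {err:.2f} percentage points")
```

The regression estimate misses by well under one percentage point, the best-correlated benchmark by a little over three, and the baseline by about seven.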

There is no guarantee that employing correlation or a multivariable regression will lead to a better prediction of application performance in all cases; however, reviewing three estimates of the performance improvement rather than one does provide a greater degree of confidence.

To summarize, armed with historical data from CPU2000 and your application on a number of processors, it is possible to generate two other estimates for the expected performance benefit of moving to a new processor without executing the application on the processor.

The estimates based upon degree of correlation between individual CPU2000 benchmarks and on a multivariable regression may provide a more accurate indication of expected performance.

Figure 3.10 : Image rotation performance

Characterizing Embedded System Performance
In addition to the pure performance estimates that can be created using benchmark information, benchmarks can also help characterize multi-core processor behavior on specific types of applications and across a range of design categories.

For example, consider Figure 3.10 above and Figure 3.11 below, which depict the performance of two different types of applications using varying numbers of concurrent work items on a dual-core Intel Pentium D processor-based system.

Figure 3.11 : FFT performance

As you can see, on the image rotation benchmark a supralinear performance increase is possible, most likely resulting from the application's cache behavior.

On a Fast Fourier Transform (FFT) application, using multiple work items raises the throughput by approximately 20%. Attempting to process more than two data sets concurrently results in a slight performance decrease.

Figure 3.12 : Image processing speed-up

Performance on two processor cores is the simplest multi-core case; benchmark behavior on processors with more than two cores can also be characterized. For example, consider Figure 3.12 above and Figure 3.13 below, which show the performance of an image rotation benchmark on a 16-core target.

Figure 3.13 : Image processing throughput

The figures indicate the results from running the EEMBC image rotation benchmark with an input data set of grayscale pictures comprised of 4 million pixels. The first graph shows the effect of using multiple cores in order to speed up processing of a single picture, whereas the second graph shows the result of trying to optimize for throughput (i.e., the total number of pictures processed).

The term slice size refers to the granularity of synchronization between different cores working on the same picture (workers); smaller slices require more synchronization.

Figure 3.12 illustrates that on single picture processing, using multiple cores with medium synchronization granularity, it is possible to speed up the processing by up to 5x.

Note that performance scales near linearly with 2 cores active, and reasonably well with 4 cores active, but when all 16 cores are active, performance is actually slower than the 8 processor core case. Most likely, the synchronization cost when employing 16 processor cores has more of an impact than the available processing power.

Figure 3.13 shows measured throughput, where each line indicates the number of processor cores working on any single picture (workers). A speed-up of 6x (better than the 5x achieved on a single image) is observed. However, this is still far from the theoretical 16x speed-up potential of a 16 processor core target. It is also interesting to note that, again, using 8 processor cores results in the best performance, split as processing 2 images at a time with 4 cores processing each picture.
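To put these speed-ups in perspective, the short sketch below computes parallel efficiency (speed-up divided by core count) for the 5x and 6x figures quoted above, measured against the full 16-core target. It also lists the ways a given number of active cores can be split between pictures in flight and workers per picture. The function names are illustrative and are not part of any EEMBC tooling.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speed-up that was actually achieved."""
    return speedup / cores


def splits(active_cores):
    """All (pictures_in_flight, workers_per_picture) pairs that use every active core."""
    return [(active_cores // w, w)
            for w in range(1, active_cores + 1)
            if active_cores % w == 0]


target_cores = 16
single_picture = parallel_efficiency(5.0, target_cores)  # up to 5x on one picture
throughput = parallel_efficiency(6.0, target_cores)      # 6x when optimizing for throughput

print(f"single picture: {single_picture:.2%} of ideal")
print(f"throughput:     {throughput:.2%} of ideal")
print("ways to use 8 cores:", splits(8))
```

The best case reported in the text, 2 images at a time with 4 cores each, is the `(2, 4)` entry in the list for 8 active cores.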

To summarize, it is the combination of multiple testing strategies and benchmarks targeting different segments that allows the analysis of a multi-core platform using multi-core benchmarks. Software developers can use the benchmarks to make intelligent decisions on structuring the concurrency within their programs, as well as evaluate which target system decisions will best serve their needs.

Reviewing Benchmark Data
Benchmarks serve a useful purpose by helping gauge and compare embedded system performance; however, benchmark results can also be abused and used to draw conclusions that are not reasonable. Understanding what conclusions can be drawn depends on the characteristics of the benchmarks employed. A few helpful considerations when reviewing benchmark data include:

1) System configuration
2) Benchmark certification
3) Benchmark run rules (base, peak)
4) Reviewing single threaded and single process benchmark data.

When comparing performance results between different vendors, it is important to consider the entire system configuration. Ideally, the systems should be as similar as possible to help control for differences in their components.

For example, you would not want to compare two processors using systems with different amounts of memory and disk storage and different memory latency and hard drive speed. This consideration is fairly obvious but worth mentioning.

Second, determine if benchmark results have been certified in the case of EEMBC and BDTI benchmarks or are official published results in the case of SPEC. Certified results have been reviewed by an expert third party, lending the performance results greater credibility.

Official published results are subject to review by other companies. Early releases of SPEC performance information on a given processor are labeled “SPEC estimate” and can be considered beta quality.

Third, consider how the benchmarks were executed. For example, SPEC allows two methods of executing the single-core benchmarks. The first measure, base results, requires the use of a limited number of options all of which must be the same for all of the benchmarks in the suite.

The second measure, peak results, allows the use of different compiler options on each benchmark. It would not make sense to compare a system using base results with a system using peak results.

Similarly, EEMBC allows two different types of results: out-of-the-box results and full-fury results. Out-of-the-box results specify that the EEMBC benchmark source code was not modified for execution.

Full-fury results allow modification of code to use functionally equivalent algorithms to perform the tests. It may not be reasonable to compare out-of-the-box results with full-fury results unless you clearly understand the differences.
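One way to make these run-rule checks routine is to record the suite and run rule alongside every score and refuse a direct comparison when they differ. The sketch below is a hypothetical illustration: the `BenchmarkResult` type, the `comparability_issues` function, and the systems and scores are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    system: str
    suite: str     # e.g. "SPEC CPU2000" or "EEMBC"
    run_rule: str  # e.g. "base", "peak", "out-of-the-box", "full-fury"
    score: float


def comparability_issues(a, b):
    """Return the reasons two results should not be compared directly."""
    issues = []
    if a.suite != b.suite:
        issues.append(f"different suites: {a.suite} vs {b.suite}")
    if a.run_rule != b.run_rule:
        issues.append(f"different run rules: {a.run_rule} vs {b.run_rule}")
    return issues


# Hypothetical systems and scores, for illustration only.
first = BenchmarkResult("system A", "SPEC CPU2000", "base", 2500.0)
second = BenchmarkResult("system B", "SPEC CPU2000", "peak", 2700.0)
print(comparability_issues(first, second))
```

An empty list means only that the suite and run rule match; system configuration and certification status still need to be reviewed separately.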

The BDTI Benchmark Suites address demanding signal processing applications; in these applications, software is usually carefully optimized for the target processor, and for this reason published BDTI Benchmark results are based on carefully and consistently optimized code.

In addition, BDTI requires that vendors obtain BDTI approval when they distribute product comparisons using BDTI Benchmark Suites. As a consequence, BDTI benchmark results can be safely compared without concerns about different underlying assumptions.

Finally, watch for downright abuses of benchmark data. It is not reasonable to compare a single-core processor with a multi-core processor using a single-core benchmark such as CPU2000 base and then claim that the multi-core processor is bad because it does not double performance.

For example, Table 3.1 shows that an Intel Core 2 Quad Extreme processor QX6700 has a CPU2000 overall base score of 2829 and that an Intel Core 2 Duo processor E6700 has a score of 2836.

If you concluded that the four processor core Intel Core 2 Quad Extreme processor QX6700 was therefore worse than the dual-core Intel Core 2 Duo processor E6700, you would be mistaken. CPU2000 base is a single-core benchmark and so should not be used to judge the benefits of a multi-core processor.
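A quick calculation with the two scores quoted above shows how small the gap actually is. Only the variable names here are invented.

```python
# CPU2000 overall base scores quoted from Table 3.1 in the text.
qx6700_quad_core = 2829
e6700_dual_core = 2836

# Relative difference of the quad-core score against the dual-core score.
difference_pct = (qx6700_quad_core - e6700_dual_core) / e6700_dual_core * 100
print(f"{difference_pct:.2f}%")
```

The difference is about a quarter of one percent, so the single-core benchmark is telling you that one core of each processor performs about the same, not that two extra cores add nothing.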


To read Part 1, go to Choosing and using the right benchmarks.

Used with permission from Newnes, a division of Elsevier, from “Software Development for Embedded Multi-core Systems” by Max Domeika. Copyright 2008. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Max Domeika is a senior staff software engineer in the Developer Products Division at Intel , creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.
