Evaluating the performance of multi-core processors - Embedded.com

Determining the specific multi-core processor that will suit your embedded application needs is a challenge. Relying upon the marketing collateral from a given company is not sufficient because, in many cases, the quoted results are specific to a given platform and an ambiguously described application; there is no guarantee your application will exhibit the same performance.

Therefore, it is important to understand the strengths and weaknesses of common benchmark suites that are used to gauge single processor and multi-core processor performance. In addition, the increased focus on energy efficiency has led to the development of benchmark suites that gauge power performance; a description of these benchmark suites is included.

Finally, in Part 2, two practical examples of using benchmark results to estimate system and application behavior are discussed:

1) How to use benchmark data to characterize application performance on specific systems.

2) How to apply performance benchmark suites to assist in estimating performance of your application.

Single-core Performance Benchmark Suites
A number of single-core benchmark suites are available to help embedded development engineers assess the performance of single-core processors. Before considering multi-core processor performance on parallel applications, scalar application performance should be reviewed. Some popular benchmark suites commonly used to evaluate single-core embedded processor performance are:

EEMBC Benchmark Suites
BDTI Benchmark Suites
SPEC CPU2000 and CPU2006

The Embedded Microprocessor Benchmark Consortium (EEMBC), a non-profit, industry-standard organization, develops benchmark suites comprised of algorithms and applications common to embedded market segments and categorized by application area.

EEMBC benchmark suites covering the embedded market segments include:

Automotive 1.1
Consumer 1.1
Digital Entertainment 1.0
Networking 2.0
Network Storage
Office Automation 2.0
Telecom 1.1

The EEMBC benchmark suites are well suited to estimating the performance of a broad range of embedded processors, and they are flexible in the sense that results can be obtained early in the design cycle by using functional simulators to gauge performance.

The benchmarks can easily be adapted to execute on bare-metal embedded processors as well as on systems with COTS OSes. The classification of each application or algorithm into market-segment-specific benchmark suites makes it easy to obtain market-specific views of performance information.

For example, a processor vendor targeting a processor for the automotive market segment can choose to measure and report Automotive 1.1 benchmark suite performance numbers. Executing the benchmark suites results in the calculation of a metric that gauges the execution time performance of the embedded system.

For example, the aggregate performance result from an execution of the networking benchmark suite is termed NetMark and can be compared to the NetMark value obtained from an execution of the benchmark on different processors.
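To illustrate how a single aggregate figure can summarize a suite, the sketch below combines per-benchmark scores with a geometric mean. This is an assumption made for illustration: the article does not state how NetMark or any other EEMBC mark is actually calculated, and the function name and scores here are hypothetical.

```python
import math


def aggregate_score(scores):
    """Geometric mean of per-benchmark scores (e.g. iterations per second).

    Assumption: a geometric mean is used only as a common way to combine
    benchmark scores; it is not taken from EEMBC documentation.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical per-benchmark scores for two processors (not measured data).
processor_a = [120.0, 45.0, 300.0]
processor_b = [100.0, 60.0, 280.0]

ratio = aggregate_score(processor_a) / aggregate_score(processor_b)
print(f"Processor A relative to processor B: {ratio:.2f}x")
```

A geometric mean keeps one benchmark with very large raw scores from dominating the aggregate, which is why it is a frequent choice when comparing processors across a mixed suite.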

Additionally, the suites provide code size information for each benchmark, which is useful in comparing tradeoffs made between code optimization and size. Publicly disclosed benchmark suite results require certification by EEMBC, which involves inspection and reproduction of the performance results and which lends credibility to the measurements.

This is especially critical when the benchmark code is optimized to provide an implementation either in hardware, software, or both to maximize the potential of the processor subsystem. Access to the benchmark suite requires formal membership in the consortium, an academic license, or a commercial license. For further information on EEMBC, visit www.eembc.org.

The BDTI Benchmark Suites focus on digital signal processing applications, such as video processing and physical-layer communications. One valuable feature of these suites is that they are applicable to an extremely broad range of processor architectures, and therefore enable comparisons between different classes of processors.

The BDTI benchmarks define the functionality and workload required to execute the benchmark, but do not dictate a particular implementation approach. The benchmark customer has the flexibility of implementing the benchmark on any type of processor, in whatever way is natural for implementing that functionality on that processor.

The benchmark results developed by the customer are then independently verified and certified by BDTI. The rationale for this approach is that it is closer to the approach used by embedded developers.

Embedded system developers obtain source code for key functional portions and typically modify the code for best performance either by optimizing the software (e.g., using intrinsics) or offloading some work to a coprocessor. For background on BDTI and their benchmark offerings, please consult the website, www.BDTI.com.

Standard Performance Evaluation Corporation (SPEC) CPU2006 is comprised of two components, CINT2006 and CFP2006, which focus on integer and floating point code application areas, respectively.

For embedded developers, CINT2006 is more relevant than CFP2006; however, portions of CFP2006 may be applicable to embedded developers focused on C and C++ image and speech processing. CINT2006 is comprised of 9 C benchmarks and 3 C++ benchmarks, which cover application areas such as compression, optimization, artificial intelligence, and software tools.

System requirements are somewhat steep for an embedded multi-core processor, requiring at least 1 GB of main memory and at least 8 GB of disk space. Overall, SPEC CPU2006 and the recently retired SPEC CPU2000 provide good coverage of different application types.

Due to the longevity of the benchmark and the availability of public performance data, it is possible to create a model that estimates your application performance on new processor cores before you have access to the new processor. An example of this technique is detailed in Part 2. For background on SPEC and their benchmark offerings, please consult the website, www.spec.org.

Multi-core Performance Benchmarks
Good benchmarks for evaluating embedded single-core processor performance have been available for years; however, benchmarks for embedded multi-core processor performance are scarcer. Some popular benchmark programs commonly used to evaluate multi-core embedded processor performance are:

SPEC CPU2006 (and CPU2000) rate
EEMBC MultiBench software
BDTI Benchmark Suites

SPEC CPU2006 evaluates multi-core processor performance when executed in what is termed "rate" mode. SPEC CPU2006 rate tests multi-core performance by executing multiple copies of the same benchmark simultaneously and determining the throughput.

The number of copies executed is determined by the tester, but is typically set equal to the number of processor cores in the system. SPEC CPU2006 rate provides a relatively straightforward method of evaluating multi-core processor performance; however, it can be criticized for not representing typical workloads: how many users would execute multiple copies of the same application with the same data set simultaneously?
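The rate-style measurement described above can be sketched in a few lines: launch several identical copies of a workload at the same time and report completed copies per second. This is only an illustration of the idea. The function names are hypothetical, the stand-in workload is a small Python loop rather than a SPEC benchmark, and the result is a raw throughput figure rather than the official SPEC rate metric, which is normalized against a reference system.

```python
import os
import subprocess
import sys
import time


def throughput(copies, elapsed_seconds):
    """Completed copies per second for one rate-style run."""
    if copies < 1 or elapsed_seconds <= 0:
        raise ValueError("need at least one copy and a positive elapsed time")
    return copies / elapsed_seconds


def run_rate(command, copies):
    """Start `copies` identical processes at once and time until all finish."""
    start = time.perf_counter()
    procs = [subprocess.Popen(command) for _ in range(copies)]
    exit_codes = [p.wait() for p in procs]  # wait for every copy
    elapsed = time.perf_counter() - start
    if any(code != 0 for code in exit_codes):
        raise RuntimeError("a benchmark copy exited with an error")
    return throughput(copies, elapsed)


if __name__ == "__main__":
    # Stand-in workload (assumption): a short CPU-bound Python loop.
    workload = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]
    cores = os.cpu_count() or 1
    print(f"1 copy: {run_rate(workload, 1):.2f} copies/s")
    print(f"{cores} copies: {run_rate(workload, cores):.2f} copies/s")
```

Comparing the one-copy figure with the figure for one copy per core gives a rough sense of how throughput scales, subject to the criticism above that identical copies on identical data are not a typical workload.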

The EEMBC MultiBench benchmark software is a set of multi-context benchmarks based upon embedded market segment applications. Much of the work is in the creation of a platform independent harness, called the Multi-Instance-Test Harness (MITH), which can be ported to different embedded systems, some of which may not support a standard thread model such as POSIX threads or execution on a COTS OS. The group has targeted applications from several sources including creating parallel versions of the single-core EEMBC benchmarks.

Figure 3.6: MITH contexts and work items

The key benefit of MITH is that it allows the easy creation of a wide variety of workloads from predefined work items. As depicted in Figure 3.6 above, a workload can be composed of the following testing strategies:

1. A single instance of a work item: similar to the SPEC CPU2006 rate benchmark.

2. Multiple instances of the same work item, where each instance executes a different data set: provides an environment similar to embedded applications such as a multiple-channel VoIP application.

3. Multiple work items executed in parallel: emulates a complex system supporting a practically unlimited number of work items. An example of such a system would be a video conferencing application that is simultaneously running MPEG encode and decode algorithms, networking (TCP), screen updates (JPEG), and even background music (MP3).

4. Instances of multi-threaded work items: some of the algorithms are built to take advantage of concurrency to speed up processing of a single data set. For example, H.264 encoding could employ each core to process separate frames.
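The way a workload is assembled from work items and data sets can be sketched as follows. This is not the MITH API, which the article does not show; every name below is hypothetical, the two work items are toy stand-ins for real benchmark kernels, and Python threads are used only to illustrate the structure, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(data):
    """Hypothetical work item: sum of the bytes modulo 2**16."""
    return sum(data) % 65536


def run_length(data):
    """Hypothetical work item: length of the longest run of equal bytes."""
    best = current = 0
    previous = None
    for b in data:
        current = current + 1 if b == previous else 1
        best = max(best, current)
        previous = b
    return best


def run_workload(instances, contexts):
    """Run (work_item, data_set) pairs on `contexts` worker threads.

    Results are returned in the same order as `instances`.
    """
    with ThreadPoolExecutor(max_workers=contexts) as pool:
        futures = [pool.submit(item, data) for item, data in instances]
        return [f.result() for f in futures]


# Toy data sets standing in for, say, two independent channels.
channel_a = bytes([1, 1, 2, 3])
channel_b = bytes([7, 7, 7, 0])

# Strategy 2: several instances of one work item, each with its own data set.
same_item = [(checksum, channel_a), (checksum, channel_b)]
# Strategy 3: different work items executing in parallel.
mixed_items = [(checksum, channel_a), (run_length, channel_b)]

print(run_workload(same_item, contexts=2))
print(run_workload(mixed_items, contexts=2))
```

Keeping the workload as plain data (a list of work item and data set pairs) is what makes it cheap to describe many different mixes without touching the work items themselves.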

As mentioned earlier, the BDTI Benchmark Suites are designed to be applicable to a very wide range of processors, including single-core and multi-core processors, as well as many-core processors [7] and even FPGAs. Taking advantage of this flexibility, BDTI's benchmarks have been implemented on a number of multi-core processors beginning in 2003.

In particular, the BDTI Communications Benchmark (OFDM) and the BDTI Video Encoder and Decoder Benchmarks are amenable to multi-core processor implementations. Consistent with its philosophy of broad benchmark applicability, BDTI does not require the use of threads or any other particular implementation approach. It is up to the benchmark implementer to decide the best approach for implementing BDTI's benchmark suites on a given target processor.

SPEC OMP [8] was released in June 2001 and is based upon applications in SPEC CPU2000 that were parallelized using OpenMP directives. The benchmark assesses the performance of SMP systems and contains two data sets, SPEC OMPM2001 and SPEC OMPL2001.

SPEC OMPM2001 uses data sets that are smaller than those in SPEC OMPL2001 and is appropriate for evaluating the performance of small-scale multi-core processors and multiprocessors. SPEC OMP is not recommended for evaluating embedded application performance; the application focus of SPEC OMP is the high performance computing (HPC) market segment and floating point performance.

Only 2 out of the 11 applications in OMPM2001 are written in C; the rest are written in Fortran. OpenMP fits a number of embedded application areas; however, I cannot recommend SPEC OMP as a method of benchmarking multi-core processors that will be employed in an embedded project.

Power Benchmarks
The embedded processor industry's continued focus on power efficiency has led to demand for benchmarks capable of evaluating processor and system power usage. Like the multi-core processor benchmarks, this area of benchmarking is somewhat less mature than single-core processor performance benchmarks.

Only two benchmark suites (EEMBC EnergyBench and the BDTI Benchmark Suites) could be considered industry standard for the embedded market segments. Other tools and benchmarks can be employed but offer less of a fit to the embedded market segments. Four benchmark suites that are used to assess power performance in embedded systems are:

1. EEMBC EnergyBench
2. Battery Life Tool Kit (BLTK) [9]
3. BDTI Benchmark Suites
4. MobileMark [10]

EEMBC EnergyBench employs the single-core and multi-core EEMBC benchmarks and supplements them with energy usage measurements to provide simultaneous power and performance information.

The benchmark suite requires measurement hardware comprised of a host system, a target system, a data acquisition device, shielded cable, and connector block.

A number of power rails on the system board are monitored at execution time, and the metrics reported include energy (Joules per iteration) and average power used (Watts). EEMBC offers a certification process and an optional Energymark rating. EEMBC EnergyBench uses National Instruments LabVIEW to display results.
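The two reported metrics are related through simple arithmetic: energy is average power multiplied by run time, and energy per iteration divides that by the number of benchmark iterations. The sketch below shows the calculation for a single rail. It assumes evenly spaced power samples that cover the whole run; the function name is hypothetical and this is not EnergyBench's own processing code.

```python
def energy_metrics(power_samples_w, sample_interval_s, iterations):
    """Return (joules_per_iteration, average_watts) for one benchmark run.

    Assumptions: samples are evenly spaced, cover the whole run, and are
    already summed across any power rails being monitored.
    """
    if not power_samples_w or sample_interval_s <= 0 or iterations < 1:
        raise ValueError("need samples, a positive interval, and >= 1 iteration")
    average_watts = sum(power_samples_w) / len(power_samples_w)
    duration_s = len(power_samples_w) * sample_interval_s
    total_joules = average_watts * duration_s
    return total_joules / iterations, average_watts


# Hypothetical samples (Watts) taken every 10 ms over 5 benchmark iterations.
samples = [0.80, 0.95, 1.10, 0.90, 0.85, 1.05]
joules_per_iter, avg_watts = energy_metrics(samples, 0.010, 5)
print(f"{joules_per_iter:.6f} J/iteration at {avg_watts:.3f} W average")
```

Reporting both numbers matters because a configuration can draw more average power yet use less energy per iteration if it finishes each iteration sooner.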

During certification, the equipment is used to calculate the average amount of energy consumed for each pass over the input data set and for each benchmark being certified. Power samples at frequencies which are not aliased to the benchmark execution frequency are used to capture a sufficient number of data samples to ensure a statistically viable result, and the process is repeated multiple times to verify stability.

Figure 3.7: Single pass of EEMBC EnergyBench certification process

It is interesting to note that actual results show that power consumption varies significantly depending on the type of benchmark being run, even on embedded platforms.

Figure 3.7 above summarizes the process to ensure stable results with regard to power measurement. Figure 3.8 below shows published EnergyBench results for an NXP LPC3180 microcontroller.

One interesting observation is made by comparing the power utilization of the processor executing at full speed versus fractions of it with and without processor features enabled such as instruction cache and floating point unit.

Figure 3.8: EnergyBench results showing clock speed versus power

For example, when executing at 208 MHz with both cache and floating point unit on, the processor energy consumption for the basefp benchmark is actually in line with energy consumption of the same processor running at 13 MHz, and much better than running the processor without the cache and with the floating point unit off. Based on expected use, it may actually be more power efficient to run the processor at full speed with processor features enabled.

Figure 3.9 below shows power utilization results across a number of benchmarks, with a difference of up to 10% in average power consumption depending on the actual benchmark being executed. This underscores the value of employing power utilization benchmarks to gauge performance as opposed to relying upon one "typical" figure, which is often the case in designing a system today.

Figure 3.9: EnergyBench results across benchmarks

BLTK is a set of scripts and programs that allow you to monitor power usage on Linux systems that employ a battery. The toolkit estimates power usage by employing the built-in battery instrumentation, and by running a battery from full to empty while executing a set of applications.

Since the power capacity of the battery is known, it is possible to estimate average power usage taking as input the Watt hour rating of the battery and how long an application executed until the battery drains. The authors of BLTK explicitly mention that BLTK is not an industry-standard benchmark.
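The battery-drain estimate described above reduces to one division: the battery's Watt-hour rating divided by the hours it took to drain. The sketch below shows that arithmetic only. It is not BLTK code, the function name is hypothetical, and the capacities and run times are made-up values for illustration.

```python
def average_power_watts(battery_capacity_wh, runtime_hours):
    """Estimate average power draw from a full-to-empty battery run."""
    if battery_capacity_wh <= 0 or runtime_hours <= 0:
        raise ValueError("capacity and runtime must be positive")
    return battery_capacity_wh / runtime_hours


# Hypothetical comparison of two system configurations on the same battery.
capacity_wh = 50.0
runs = {"configuration A": 4.0, "configuration B": 3.2}  # hours until empty

for name, hours in runs.items():
    watts = average_power_watts(capacity_wh, hours)
    print(f"{name}: {watts:.2f} W average")
```

The estimate is only as good as the capacity figure, so an aged battery that no longer holds its rated Watt-hours would skew the result.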

However, the capabilities of the toolkit enable users to evaluate the impact of several embedded system design characteristics and their effect on power, such as employing hard drives with different speeds and employing more processor cores in the system.

The BDTI Benchmark Suites are frequently used to estimate or measure processor energy efficiency. For estimated results, BDTI uses a consistent set of conditions and assumptions to ensure meaningful estimates.

For measured results, BDTI carefully repeats processor vendors' laboratory measurements as part of its independent certification process. For example, the BDTI DSP Kernel Benchmarks have been used to evaluate the energy efficiency of a wide range of single-core processors; for results, see www.BDTI.com.

MobileMark 2005 is a notebook battery life benchmark from BAPCo and is comprised of a number of workloads common in the notebook market segment including DVD playback, Internet browsing, and office productivity applications.

The primary metric returned by the benchmark suite is a battery life rating, expressed in minutes. Like BLTK, the benchmark process requires conditioning of the battery and exercises the benchmarks until the battery is depleted. This benchmark suite requires a Windows OS, which limits its applicability to embedded systems employing other OSes.

Next in Part 2: Estimating system and application behavior

Used with permission from Newnes, a division of Elsevier, from “Software Development for Embedded Multi-core Systems” by Max Domeika. Copyright 2008. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Max Domeika is a senior staff software engineer in the Developer Products Division at Intel, creating software tools targeting the Intel Architecture market. Max currently provides technical consulting for a variety of products targeting Embedded Intel Architecture.
