Many attempts have been made to provide a single number that can totally quantify the ability of a CPU. Whether MHz, MOPS, or MFLOPS, all are simple to derive but misleading when looking at actual performance potential. Dhrystone was the first attempt to tie a performance indicator, namely DMIPS, to the execution of real code; it served the industry well for a long time but is no longer meaningful. BogoMIPS attempts to measure how fast a CPU can do nothing, for whatever that's worth.
The need still exists for a simple and standardized benchmark that provides meaningful information about the CPU core. Introducing CoreMark, available for free download from www.coremark.org. CoreMark ties a performance indicator to execution of simple code, but rather than being entirely arbitrary and synthetic, the code for the benchmark uses basic data structures and algorithms that are common in practically any application. Furthermore, in developing this benchmark, the Embedded Microprocessor Benchmark Consortium (EEMBC) carefully chose the CoreMark implementation such that all computations are driven by run-time–provided values to prevent code elimination during compile-time optimization. CoreMark also sets specific rules about how to run the code and report results, thereby eliminating inconsistencies.
To appreciate the value of CoreMark, it's worthwhile to dissect its composition, which in general consists of lists, strings, and arrays (matrices to be exact). Lists commonly exercise pointers and are also characterized by non-serial memory-access patterns. In terms of testing the core of a CPU, list processing predominantly tests how fast data can be used to scan through the list. For lists larger than the CPU's available cache, list processing can also test the efficiency of cache and memory hierarchy.
List processing consists of reversing, searching, or sorting the list according to different parameters, based on the contents of the list's data items. In particular, each list item can either contain a precomputed value or a directive to invoke a specific algorithm with specific data to provide a value during sorting. To verify correct operation, CoreMark performs a 16-bit cyclic redundancy check (CRC) based on the data contained in elements of the list. Since CRC is also a commonly used function in embedded applications, this calculation is included in the timed portion of CoreMark.
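The validation step can be sketched as follows. This is a generic bitwise CRC-16 in Python using the common reflected polynomial 0xA001 (the CRC-16/ARC variant), shown for illustration; it is not necessarily the exact CRC variant or implementation that CoreMark's C code uses:

```python
def crc16(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16, reflected polynomial 0xA001, processed LSB-first."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001  # XOR in polynomial on carry-out
            else:
                crc >>= 1
    return crc

# Standard CRC-16/ARC check value for the ASCII string "123456789"
assert crc16(b"123456789") == 0xBB3D
```

Because every list element's data feeds the CRC, a single wrong sort comparison or corrupted element changes the final checksum, which is how the benchmark detects broken runs.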
In many simple list implementations, programs allocate list items as needed with a call to malloc. However, on embedded systems with constrained memory, lists are commonly confined to specific programmer-managed memory blocks. CoreMark uses the latter approach to avoid calls to library code (malloc/free).
CoreMark partitions the available data space into two blocks, one containing the list itself and the other containing the data items. This partitioning also applies to embedded systems designs where data can accumulate in a buffer (items) and pointers to the data are kept in lists (or sometimes ring buffers). The data16 items are initialized based on data that is not available at compile time.
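The partitioning might look like the following sketch, where one statically sized block is split into a region for list headers and a region for data items. The sizes, split ratio, and helper names here are illustrative, not CoreMark's actual layout:

```python
# Illustrative sizes and split, not CoreMark's actual layout.
MEM_SIZE = 2000              # total benchmark data space (~2 Kbytes)
LIST_REGION = MEM_SIZE // 2  # first half: list headers; second half: items

memory = bytearray(MEM_SIZE)                    # one programmer-managed block
list_block = memoryview(memory)[:LIST_REGION]   # list headers live here
data_block = memoryview(memory)[LIST_REGION:]   # data items live here

def alloc_item(offset: int, size: int) -> memoryview:
    """Carve an item out of the data block -- no malloc/free involved."""
    return data_block[offset:offset + size]

item0 = alloc_item(0, 2)   # a 16-bit data item
item0[0] = 0x12
assert memory[LIST_REGION] == 0x12  # item aliases the single backing block
```

The point of the sketch is that every "allocation" is just an offset into a block the programmer already owns, so the timed benchmark code never enters the C library's allocator.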
Each data16 item actually consists of two 8-bit parts, with the upper 8 bits preserving the original value of the lower 8 bits. The benchmark code modifies the data16 item during each iteration of the benchmark.
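That layout can be sketched as follows: the upper byte shadows the lower byte's initial value, so after an iteration scrambles the lower byte, the original item can be recovered without reinitializing anything. These helper names are illustrative, not CoreMark's code:

```python
def pack(value8: int) -> int:
    """Store an 8-bit value in both halves of a 16-bit item."""
    return ((value8 & 0xFF) << 8) | (value8 & 0xFF)

def modify(item16: int, noise: int) -> int:
    """Perturb only the lower byte, as a benchmark iteration does."""
    return (item16 & 0xFF00) | ((item16 ^ noise) & 0xFF)

def restore(item16: int) -> int:
    """Recover the original lower byte from the untouched upper byte."""
    return pack(item16 >> 8)

item = pack(0x5A)               # 0x5A5A
scrambled = modify(item, 0x33)  # lower byte changed, upper byte intact
assert restore(scrambled) == item
```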
The idx item maintains the original order of the list items, so that CoreMark can recreate the original list without reinitializing the list (a requirement for systems with low memory capacity).
The list head is modified during each iteration of the benchmark and the next pointers are modified when the list is sorted or reversed. At each consecutive iteration of the benchmark, the algorithm sorts the list according to the information in the data16 member, performs the test, and then recreates the original list by sorting back to the original order and rewriting the list data. Figure 1 shows the basic structure.
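The sort-and-restore cycle can be sketched like this, using Python objects in place of CoreMark's C list items (the field names mirror the article's description; the rest is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Item:
    idx: int     # original position, used to rebuild the list
    data16: int  # sort key for the benchmark pass

items = [Item(0, 7), Item(1, 2), Item(2, 9), Item(3, 4)]
original = list(items)

# Benchmark pass: order the list by its data values...
items.sort(key=lambda it: it.data16)
# ...then recreate the original order without reinitializing anything.
items.sort(key=lambda it: it.idx)
assert items == original
```

Keeping idx alongside the data is what makes the restore step a second sort rather than a fresh initialization, which matters on systems too small to hold a pristine copy of the list.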
Since pointers on CPUs can range from 8 bits to 64 bits, the number of items initialized for the list is calculated such that the list will contain the same number of items regardless of pointer size. In other words, a CPU with 8-bit pointers will use a quarter of the memory that a 32-bit CPU uses to hold the list headers.

Matrix processing
Many algorithms use matrices and arrays, warranting significant research on optimizing this type of processing. These algorithms test the efficiency of tight loop operations as well as the ability of the CPU and associated toolchain to use instruction set architecture (ISA) accelerators such as multiply-accumulate (MAC) units and single instruction, multiple data (SIMD) instructions. These algorithms are composed of tight loops that iterate over the whole matrix. CoreMark performs simple operations on the input matrices, including multiplication with a constant, a vector, or another matrix. CoreMark also tests operating on part of the data in the matrix in the form of extracting bits from each matrix item for operations. To validate that all operations have been performed, CoreMark again computes a CRC on the results from the matrix test.
Within the matrix algorithm for CoreMark, the available data space is split into three portions: an output matrix (with a 32-bit value in each cell) and two input matrices (with 16-bit values in each cell). The input matrices are initialized based on input values that aren't available at compile time. During each iteration of the benchmark, the input matrices are modified based on values that cannot be computed at compile time. The last operation recreates the input matrices, so the same function can be invoked to repeat exactly the same processing.
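The matrix pass can be sketched as follows: simple operations over the whole matrix, plus one operation that extracts bit fields from each input element before multiplying. This is a simplified illustration, not the exact CoreMark kernels, and the particular bit fields chosen here are assumptions:

```python
def mat_mul_const(a, k):
    """Multiply every cell by a run-time constant (a tight inner loop)."""
    return [[x * k for x in row] for row in a]

def mat_mul_vect(a, v):
    """Matrix-vector product; accumulations need 32-bit-sized results."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def mat_mul_bitextract(a, b):
    """Matrix product using only selected bits of each input element
    (the field positions here are illustrative)."""
    n = len(a)
    return [[sum(((a[i][k] >> 2) & 0xF) * ((b[k][j] >> 5) & 0x7F)
                 for k in range(n))
             for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
assert mat_mul_const(A, 3) == [[3, 6], [9, 12]]
assert mat_mul_vect(A, [1, 1]) == [3, 7]
```

Loops of this shape are exactly what MAC units and SIMD instructions accelerate, which is why the matrix portion is sensitive to both the ISA and the toolchain's ability to exploit it.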
An important function of a CPU core is the ability to handle control statements other than loops. A state machine based on switch or if statements is an ideal candidate for testing that capability. The two common methods for state machines use either switch statements or a state transition table. Because CoreMark already uses the latter method in the list-processing algorithm to test load and store behavior, CoreMark uses the former method, switch and if statements, to exercise the CPU control structure.
The state machine tests an input string to detect if the input is a number; if the input is not a number, the state machine will reach the “invalid” state. Figure 2 shows a simple state machine with nine states. The input is a stream of bytes, initialized to ensure we pass all available states, based on an input that is not available at compile time. The entire input buffer is scanned with this state machine.
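A reduced sketch of such a scanner follows, in Python for brevity; the real benchmark implements this with C switch and if statements and more states (covering signs, exponents, and so on), and keeps per-state visit counts for validation:

```python
# Illustrative three-state number scanner with an absorbing "invalid" state.
START, INT, FRAC, INVALID = "start", "int", "frac", "invalid"
visits = {s: 0 for s in (START, INT, FRAC, INVALID)}

def scan_token(token: str) -> str:
    """Walk the token one byte at a time, counting state visits."""
    state = START
    for ch in token:
        visits[state] += 1
        if state == START:
            state = INT if (ch.isdigit() or ch in "+-") else INVALID
        elif state == INT:
            if ch.isdigit():
                state = INT
            elif ch == ".":
                state = FRAC
            else:
                state = INVALID
        elif state == FRAC:
            state = FRAC if ch.isdigit() else INVALID
        # INVALID is absorbing: any further input leaves it unchanged
    return state

assert scan_token("42") == INT
assert scan_token("3.14") == FRAC
assert scan_token("4x2") == INVALID
```

Each input byte forces a data-dependent branch, which is what makes this portion a test of the CPU's control structures rather than its arithmetic.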
To validate operation, CoreMark keeps count of how many times each state was visited. During each iteration of CoreMark, some of the data is corrupted based on input that is not available at compile time. At the end of processing, the data is restored based on inputs not available at compile time.
Since CoreMark contains multiple algorithms, it's interesting to observe how its behavior changes over time. For example, looking at the percentage of control code executed (samples taken every 1,000 cycles) and branch mispredictions in Figure 3, it's obvious where the matrix algorithm is being called. This shows up as a low misprediction rate combined with a high percentage of control operations, indicative of tight loops (for example, between points 330 and 390).
By default, CoreMark requires the allocation of only 2 Kbytes to accommodate all data. This minimal memory size is necessary to support operation on the smallest microcontrollers, so that it can truly be a standard performance metric for any CPU core. Figure 4 examines the memory-access pattern during the benchmark's execution. The information is represented as the percentage of memory operations that access memory within a certain distance of the previous access. It is easy to deduce that the distance peaks are caused by switching between the different algorithms (since each algorithm operates on a third of the total available data space).
More than 120 CoreMark results are available online at www.coremark.org, but Table 1 shows a few results that reveal some interesting patterns. The results' dependence on compiler version and flags makes it clear that these details must be included; otherwise it is impossible to make a useful comparison. The run and reporting rules for CoreMark therefore require that exact tool versions be reported along with any performance results.
- Blackfin results (1, 2) show a 10% increase in performance when moving from GCC 4.1.2 to GCC 4.3.3, a reasonable expectation for a newer compiler version.
- Results (8, 9) show an even more pronounced difference of 18% for a mature compiler, while results (10, 11) and (12, 13) show only minor effects, as all of those compilers are based on the GCC 4 series.
- The compiler can also balance code size vs. performance, as we can see in results (3, 4). The compiler and platform are the same, but when directed to build a smaller executable (-Os -mpa), performance drops 19% vs. optimizing the code with -O3. The distinction is even sharper with results (5, 6), which show a 30% performance difference (using the -mpa switch). Note: compiler-switch information is available in the reports on the CoreMark website.
- Other compiler options affect how much the compiler tries to optimize the code. Results (7, 8) show a typical effect from safest (-O2) vs. normal (-O3) optimizations of about 10%.
- When operating frequency is scaled up, the system memory and/or on-chip flash cannot always maintain a 1:1 ratio. It is common to see extra wait states on the flash at higher processor frequencies. When the code resides in flash, the efficiency (as expressed in CoreMark/MHz) is impacted; results (14, 15) show the efficiency dropping almost 10%. For results (16, 17), the wait-state effect is even more pronounced, as the CPU:memory ratio can only be maintained at 1:1 up to 50 MHz; when operating at the device's highest frequency (80 MHz), the ratio drops to 1:2, resulting in an efficiency drop of 15%. However, running at 80 MHz still yields an absolute performance improvement of 25% vs. running at 50 MHz.
- Results (18, 19) explore the situation where the cache is too small to contain all of the data. In result 18, the cache is exactly 2K, which barely fits all the data but leaves no room for function arguments that must be passed on the stack. This causes a small amount of bus traffic external to the cache; when the cache is enlarged (result 19), performance improves 10%, even though that implementation uses a less efficient five-stage pipeline vs. the three-stage pipeline of the first.
Overall, CoreMark is well suited to comparing embedded processors. It is small, highly portable, well understood, and highly controlled. CoreMark verifies that all computations were completed correctly during execution, which helps debug any issues that may come up. The run rules are clearly defined, and the reporting rules are enforced on the CoreMark website. In addition, EEMBC offers certification for CoreMark scores and even has a standardized method for measuring energy consumption while running the benchmark.
Markus Levy is founder and president of EEMBC. He is also president of The Multicore Association and chairman of Multicore Technical Conference and Expo. Markus was previously a senior analyst at In-Stat/MDR and an editor at EDN magazine, focusing in both roles on processors for the embedded systems industry.
Shay Gal-On is EEMBC's director of software engineering and leader of the EEMBC Technology Center. At EEMBC, Shay created the EnergyBench and MultiBench benchmarking standards. Prior to EEMBC, Shay was principal performance analyst in the Microprocessor Products Group at PMC-Sierra, where he influenced the design of new processors, including instruction set design, and optimized both hardware and software products.