Benchmarking is just the start

It's difficult to get even a suite of benchmarks to accurately represent a real chunk of critical-path code. A variety of factors are conspiring against benchmark numbers.

Benchmarking CPU performance for embedded designs has always been a fraught subject. Fundamentally there are two kinds of problems. First, the length of time a CPU requires to execute a given number of instructions generally depends on the mix of instructions. So it is difficult to get even a suite of benchmarks to represent a real chunk of critical-path code accurately. Compilers compound this problem if a seemingly inconsequential change in source code generates an entirely different number and mix of machine instructions.
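To make the instruction-mix point concrete, here is a minimal sketch in C of the arithmetic involved. It assumes a deliberately simple model in which each class of instruction costs a fixed number of cycles and nothing stalls. The struct, the function names, and every cycle cost are hypothetical and are not taken from any real processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One class of instruction in a code path. All values are hypothetical. */
struct insn_class {
    uint64_t count;           /* instructions of this class executed */
    uint32_t cycles_per_insn; /* assumed fixed cost of each, in cycles */
};

/* Total cycles for a path: sum of count * cycles over every class. */
uint64_t path_cycles(const struct insn_class *mix, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += mix[i].count * mix[i].cycles_per_insn;
    return total;
}

/* Elapsed time in nanoseconds at clock_hz, rounded down.
 * The multiplication can overflow for very long paths; this sketch
 * is only meant for short critical-path estimates. */
uint64_t path_ns(const struct insn_class *mix, size_t n, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    return path_cycles(mix, n) * 1000000000ull / clock_hz;
}
```

Under these assumed costs, a 1,000-instruction path made of 900 single-cycle operations and 100 three-cycle loads takes 1,200 cycles, or 12,000 ns at 100 MHz. A different 1,000-instruction path made of 500 single-cycle operations, 400 three-cycle loads, and 100 twelve-cycle divides takes 2,900 cycles, or 29,000 ns. The instruction count and the clock are identical, yet the elapsed time differs by more than a factor of two.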

Second, in the embedded world external events can exert hard-to-anticipate influence over the course of instruction execution. Embedded war stories are full of characters like the improperly debounced switch that creates a disabling hailstorm of interrupts, or the too-small cache that suddenly begins thrashing.
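As an illustration of the switch example, one common defense is to sample the input from a periodic timer tick and accept a new level only after several consecutive samples agree, rather than taking an interrupt on every edge. The sketch below is one way to write that in C. The type, the function name, and the threshold of five samples are assumptions chosen for illustration, and the right threshold depends on the switch and the tick rate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEBOUNCE_TICKS 5u /* assumed: 5 consecutive agreeing samples */

struct debounce {
    bool stable;   /* last accepted (debounced) level */
    uint8_t count; /* consecutive samples that disagree with stable */
};

/* Feed one raw sample, typically from a periodic timer tick.
 * Returns true only on the tick where the debounced level changes. */
bool debounce_update(struct debounce *d, bool raw)
{
    assert(d != NULL);
    if (raw == d->stable) {
        d->count = 0; /* bounce back to the old level: start over */
        return false;
    }
    if (++d->count < DEBOUNCE_TICKS)
        return false; /* not enough agreeing samples yet */
    d->stable = raw;  /* new level has held long enough: accept it */
    d->count = 0;
    return true;
}
```

With this arrangement the processor load from the switch is bounded by the tick rate no matter how badly the contacts bounce, which is exactly the property a benchmark run on a quiet bench would never exercise.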

These factors can lead benchmark numbers—especially simple numbers based on clock frequencies or, as our cover story suggests, MIPS—to fail to predict the elapsed time between the beginning and end of a code path. The error may be merely annoying if it gums up a user interface, or it may be life-threatening if it imperils a hard deadline in a real-time controller.

Recent trends are making the problem more complex. Today an MCU may be powered by a 32-bit processor core with a small local cache and a larger on-chip shared memory, all accompanied by a sophisticated interrupt controller and an even more sophisticated DMA engine. Does your critical-path code all fit in the L1 cache? The benchmark code did. Can an interrupt displace some of your cached code, or flush a critical data structure? Does your answer depend on how the interrupt controller is set up at the time, or on how you mapped the memory? And what about that DMA engine? Can you use it cleverly to cut code execution time significantly, or is it quietly invalidating the data cache you were about to dip into?

If you are employing an embedded operating system, do you know how it uses the system resources? Do you understand under what circumstances a chunk of operating system or middleware code may get executed in the middle of your critical path? What do the operating system's power management facilities do to your CPU speed and the latencies you assumed?

Finally, some designs will be adding accelerators or additional CPU cores to get more performance. But along with the added compute power comes the potential for memory and bus contention, interprocess messaging tangles, and total deadlocks. In today's world, benchmarks barely start the job of performance modeling.

Ron Wilson is the director of content/media, EE Times Group Events and Embedded. You may reach him at
