Benchmarking an ARM-based SoC using Dhrystone: A VFT perspective - Embedded.com

Benchmarking an ARM-based SoC using Dhrystone: A VFT perspective

Editor’s Note: The authors describe the development of a self-checking, result-signaling, tester pattern version of the popular Dhrystone benchmark. Their method generates useful performance numbers from an ARM-based SoC that can be used in a tester environment to correlate with the performance predicted by architectural analysis and RTL simulations.

The Dhrystone 2.1 integer benchmark is a widely-used performance benchmark. However, there is no well-defined official reference specification for it, detailing the exact procedures needed to run the program and validate its correct operation.

Most existing references discuss calculating the Dhrystone number from the core instruction cycles consumed, or they analyze the *.c (code) and *.h (header) components of the program being executed [1,2]. They often discuss the relative advantages and drawbacks of the metric compared to other standards [3], or deal with little more than basic high-level descriptions of the program [4].

It's easy to get confused by the huge amount of data available, making it difficult to develop clear and unambiguous guidelines for something as simple as getting the Dhrystone performance benchmark test up and running on a tester, much less getting it in a self-checking, result-signalling format to execute on bare-metal SoC silicon. In this context, 'bare metal' refers to a system environment without any type of kernel or operating system; the benchmark program must be wholly self-contained to allow its execution on the hardware directly. This is necessary at the start of an SoC design to establish correlations of expected integer performance with register transfer level (RTL) simulations.

This process often involves a test bench component, primarily a core instruction bus snooper, along with code simulation to track the number of core instruction cycles consumed while carrying out the iterations of the loops of the Dhrystone code. The performance numbers must then be derived through calculations using the number of instruction cycles, or in other cases by gathering the execution time and other metrics from the log files through the printfs incorporated into the code.

Despite these limitations, the Dhrystone benchmark provides a simple, easy-to-control program with a relatively short execution time per loop iteration. This is attractive in the silicon tester environment because it allows execution time to be easily measured. In the silicon tester environment described in this article, assertion and negation of package port pins can be used to signal various events. This replaces the conventional testbench core bus snoop logic used in an RTL simulation environment to extract performance metrics.

Current approaches to benchmarking ARM core-based SoCs
“ARM apps note on Dhrystone benchmarking for ARM Cortex Processors” [1] and “ARM apps note on benchmarking with ARMulator” [2] detail the use of the architecture’s instruction set simulator (ISS) to execute the Dhrystone benchmark. The ARM documentation contains a lot of information about the outputs from the ISS when running this example. Unfortunately, most of the ISS output data does not pertain to calculation of the Dhrystone benchmark performance metric in “Dhrystones per second,” which is the reciprocal of the steady-state loop iteration execution time metric needed in an SoC environment.

If you look at the code, the information from the *.c and *.h files is mainly obtained in the form of printf/info statements. All these do is print out the time taken by the loops of the code referenced to a timer within the test case, as well as the real time at start and end, and the difference between the two. Other test case variables such as the number of iterations are also printed out by the code. This is satisfactory, and useful, in a simulation environment supporting printf, where the values can be viewed in a logfile. But in a tester setup, where we want all information to be either self-checking or to be sent out on the design ports for viewing on the tester, this approach falls short.

The use of MIPS (million instructions per second) in the Dhrystone result numbers also confuses matters. But in an SoC test environment, apart from the strict bus cycles or instructions being measured per second by processes like the ISS from ARM, we are not interested in measuring the instructions per second. What is more important is the time taken by the iterations of the code loops.

The point we want to emphasize is that when running Dhrystone performance benchmarking, the only data that really matters for actual calculations in an SoC test environment is the execution time and how many loop iterations are included.

The rest of this article will describe the simple steps we developed to guide SoC designers who want to use the standard Dhrystone code to get performance numbers from their tester setup with minimal code or process overhead. These procedures are based on our work with experienced core/platform developers and on what we learned over the many iterations it took to get the Dhrystone performance patterns ported and running easily in our tester environment.

For this article we will describe the procedures we developed to run the Dhrystone benchmark using a dual-core, embedded microprocessor SoC device. This device includes two ARM Cortex processor cores: a Cortex-A5 (CA5) and a Cortex-M4 (CM4), with both cores including tightly-coupled cache memories for maximum performance.

For our typical tester environment, there are three areas of configuration that are specifically required:

Configuration #1: Dual core environment with a CA5 running concurrently with the CM4
For this exercise, the dual core configuration included two separate, independent memory images (code and data for each processor) to mimic the typical operating environment of a multicore system.

Configuration #2: Tester specific code and pad signatures required

Minimal support code for tester specific functions needs to be ported to the Dhrystone. This code includes signalling through design ports/pads, the start and end of testcase, pass/fail banner, and execution time for the specified number of loops of Dhrystone.

Configuration #3: Speed-binning to be carried out
Another goal of this effort is to use the Dhrystone benchmark as a first-pass functional pattern to assist with speed sorting of the silicon. Accordingly, the benchmark code must be extended with setup code that lets the Dhrystone run at-speed, providing the following capabilities:

  • enabling the PLL to lock at the desired frequency before executing the benchmark,
  • sourcing a high speed clock into the device through a fast design port, and
  • configuring the SoC to use this clock directly.
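As a rough illustration of the first capability, the sketch below enables a PLL and polls for lock before the benchmark is allowed to start. The register layout, the bit positions, and the names `pll_regs_t` and `pll_enable_and_wait` are hypothetical and are not taken from any specific SoC; a real device defines its own clocking registers and lock sequence.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical PLL register block; a real SoC defines its own layout. */
typedef struct {
    volatile uint32_t ctrl;    /* bit 0: PLL enable (assumed)  */
    volatile uint32_t status;  /* bit 0: PLL locked (assumed)  */
} pll_regs_t;

#define PLL_CTRL_ENABLE  (1u << 0)
#define PLL_STATUS_LOCK  (1u << 0)

/* Enable the PLL, then poll the lock bit at most max_polls times.
 * Returns true if lock was observed, false if the poll budget ran out. */
static bool pll_enable_and_wait(pll_regs_t *pll, uint32_t max_polls)
{
    pll->ctrl |= PLL_CTRL_ENABLE;
    while (max_polls-- > 0) {
        if (pll->status & PLL_STATUS_LOCK) {
            return true;
        }
    }
    return false;
}
```

Bounding the poll loop is a design choice for a tester pattern: if the PLL never locks, the pattern can signal a failure on a pad instead of hanging.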

We’re not going to go into the details of the code itself (i.e., dhry.h, dhry_1.c, dhry_2.c). Instead we will focus on the specific code areas needing modification. These modifications can be classified into three categories:

  1. removal of existing but unnecessary code by commenting it out,
  2. addition of new code to perform the required self-checking operations, and
  3. added code for signalling of the execution results.

1. Removal of unnecessary code
All time-related defines, tasks, and difference calculations
Since the tester environment uses the assertions and negations of design pins to directly signal execution events, all the code statements in the *.h and *.c files related to the time functions can simply be commented out, removing them from the compiled executable.

// Commenting below
/*
#ifndef TIME
#undef TIMES
#define TIMES
#endif
*/
/*
User_Time = End_Time - Begin_Time;
if (User_Time < Too_Small_Time)
{
  printf ("Measured time too small to obtain meaningful results\n");
  printf ("Please increase number of runs\n");
  printf ("\n");
}
else {
#ifdef TIME
  Microseconds = (float) User_Time * Mic_secs_Per_Second
                 / (float) Number_Of_Runs;
  Dhrystones_Per_Second = (float) Number_Of_Runs / (float) User_Time;
#else
  Microseconds = (float) User_Time * Mic_secs_Per_Second
                 / ((float) HZ * ((float) Number_Of_Runs));
  Dhrystones_Per_Second = ((float) HZ * (float) Number_Of_Runs)
                 / (float) User_Time;
#endif
  printf ("Microseconds for one run through Dhrystone: ");
  printf ("%6.1f \n", Microseconds);
  printf ("Dhrystones per Second: ");
  printf ("%6.1f \n", Dhrystones_Per_Second);
  printf ("\n");
}
*/

2. Add code for self-checking
The Dhrystone program normally includes a number of printf statements to standard output, after the inner loop iterations are completed, so the final states of selected variables can be visually verified. For the tester environment, we want to suppress all these printf statements and replace them with the code needed to perform the required self-checking. To easily accumulate an error status, a simple 32-bit unsigned integer is used to record any differences between the expected and the actual variable values, where each comparison sets a unique bit if there is a mis-compare. After 19 explicit data variable comparisons are completed and any differences recorded, the “results” variable is stored in system RAM for subsequent use in determining the pass (results == 0), fail (results != 0) status of the benchmark's execution.

As shown below, the commented portions are the original Dhrystone prints; the non-commented portions check each variable against its expected value and record the outcome in an integer.

// printf ("Final values of the variables used in the benchmark:\n");
// printf ("\n");
// printf ("Int_Glob: %d\n", Int_Glob);
// printf (" should be: %d\n", 5);
if (Int_Glob != 5)
  results[0] |= 1<<0;
// printf ("Bool_Glob: %d\n", Bool_Glob);
// printf (" should be: %d\n", 1);
if (Bool_Glob != 1)
  results[0] |= 1<<1;
// printf ("Ch_1_Glob: %c\n", Ch_1_Glob);
// printf (" should be: %c\n", 'A');
if (Ch_1_Glob != 'A')
  results[0] |= 1<<2;
// printf ("Ch_2_Glob: %c\n", Ch_2_Glob);
// printf (" should be: %c\n", 'B');
if (Ch_2_Glob != 'B')
  results[0] |= 1<<3;
// printf ("Arr_1_Glob[8]: %d\n", Arr_1_Glob[8]);
// printf (" should be: %d\n", 7);
if (Arr_1_Glob[8] != 7)
  results[0] |= 1<<4;
// printf ("Arr_2_Glob[8][7]: %d\n", Arr_2_Glob[8][7]);
// printf (" should be: Number_Of_Runs + 10\n");
if (Arr_2_Glob[8][7] != Number_Of_Runs + 10)
  results[0] |= 1<<5;
// printf ("Ptr_Glob->\n");
// printf (" Ptr_Comp: %d\n", (int) Ptr_Glob->Ptr_Comp);
// printf (" should be: (implementation-dependent)\n");
// printf (" Discr: %d\n", Ptr_Glob->Discr);
// printf (" should be: %d\n", 0);
if (Ptr_Glob->Discr != 0)
  results[0] |= 1<<6;
// printf (" Enum_Comp: %d\n", Ptr_Glob->variant.var_1.Enum_Comp);
// printf (" should be: %d\n", 2);
if (Ptr_Glob->variant.var_1.Enum_Comp != 2)
  results[0] |= 1<<7;
// printf (" Int_Comp: %d\n", Ptr_Glob->variant.var_1.Int_Comp);
// printf (" should be: %d\n", 17);
if (Ptr_Glob->variant.var_1.Int_Comp != 17)
  results[0] |= 1<<8;
// printf (" Str_Comp: %s\n", Ptr_Glob->variant.var_1.Str_Comp);
// printf (" should be: DHRYSTONE PROGRAM, SOME STRING\n");
// printf ("Next_Ptr_Glob->\n");
// printf (" Ptr_Comp: %d\n", (int) Next_Ptr_Glob->Ptr_Comp);
// printf (" should be: (implementation-dependent), same as above\n");
// printf (" Discr: %d\n", Next_Ptr_Glob->Discr);
// printf (" should be: %d\n", 0);
if (Next_Ptr_Glob->Discr != 0)
  results[0] |= 1<<9;
// printf (" Enum_Comp: %d\n", Next_Ptr_Glob->variant.var_1.Enum_Comp);
// printf (" should be: %d\n", 1);
if (Next_Ptr_Glob->variant.var_1.Enum_Comp != 1)
  results[0] |= 1<<10;
// printf (" Int_Comp: %d\n", Next_Ptr_Glob->variant.var_1.Int_Comp);
// printf (" should be: %d\n", 18);
if (Next_Ptr_Glob->variant.var_1.Int_Comp != 18)
  results[0] |= 1<<11;
// printf (" Str_Comp: %s\n",
// Next_Ptr_Glob->variant.var_1.Str_Comp);
// printf (" should be: DHRYSTONE PROGRAM, SOME STRING\n");
if (my_strcmp(Next_Ptr_Glob->variant.var_1.Str_Comp, "DHRYSTONE PROGRAM, SOME STRING"))
  results[0] |= 1<<12;
// printf ("Int_1_Loc: %d\n", Int_1_Loc);
// printf (" should be: %d\n", 5);
if (Int_1_Loc != 5)
  results[0] |= 1<<13;
// printf ("Int_2_Loc: %d\n", Int_2_Loc);
// printf (" should be: %d\n", 13);
if (Int_2_Loc != 13)
  results[0] |= 1<<14;
// printf ("Int_3_Loc: %d\n", Int_3_Loc);
// printf (" should be: %d\n", 7);
if (Int_3_Loc != 7)
  results[0] |= 1<<15;
// printf ("Enum_Loc: %d\n", Enum_Loc);
// printf (" should be: %d\n", 1);
if (Enum_Loc != 1)
  results[0] |= 1<<16;
// printf ("Str_1_Loc: %s\n", Str_1_Loc);
// printf (" should be: DHRYSTONE PROGRAM, 1'ST STRING\n");
if (my_strcmp(Str_1_Loc, "DHRYSTONE PROGRAM, 1'ST STRING"))
  results[0] |= 1<<17;
// printf ("Str_2_Loc: %s\n", Str_2_Loc);
// printf (" should be: DHRYSTONE PROGRAM, 2'ND STRING\n");
if (my_strcmp(Str_2_Loc, "DHRYSTONE PROGRAM, 2'ND STRING"))
  results[0] |= 1<<18;

if (CORE_TYPE == CM4)
  *(unsigned int *)SRAM_LOC1 = results[0];
else // CORE_TYPE == CA5
  *(unsigned int *)SRAM_LOC2 = results[0];

return results[0];
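The listing calls my_strcmp without showing its body. Since a bare-metal build may not link a C library, one plausible shape for such a helper is sketched below, together with a small function that mirrors the "set one bit per mis-compare" idea. Both are illustrative assumptions, not the authors' actual code; the name `record_mismatch` is hypothetical.

```c
/* Minimal string compare for a build without libc (illustrative sketch).
 * Returns 0 when both NUL-terminated strings are identical,
 * and a non-zero value otherwise. */
static int my_strcmp(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Hypothetical helper: return results with the given bit set
 * when mismatch is non-zero, and results unchanged otherwise. */
static unsigned int record_mismatch(unsigned int results,
                                    unsigned int bit, int mismatch)
{
    return mismatch ? (results | (1u << bit)) : results;
}
```

Because each comparison owns a unique bit, a non-zero result word read back from system RAM identifies exactly which variables mis-compared.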

3. Scan for the number of loops to be run
The normal benchmark program inputs the number of loop iterations from the user via standard input. In the tester environment, this user input operation is commented out and the loop counts are simply defined as system RAM variables.

// printf ("Please give the number of runs through the benchmark: ");
// {
// int n;
// scanf ("%d", &n);
// }
// printf ("\n");
//
// printf ("Execution starts, %d runs through Dhrystone\n", Number_Of_Runs);

if (CORE_TYPE == CM4) {
  Number_Of_Runs = *(int *)SRAM_LOC1;
} else { // CORE_TYPE == CA5
  Number_Of_Runs = *(int *)SRAM_LOC2;
}

Additional Code

1. Code to toggle the design port on tester at start and end of measurements and also to signal other points in the run

To generate the desired start pulse output on a design port, code must be added at the very beginning of the inner for() loop.

For the specific ARM-based SoC device being discussed, recall that it is a cache-based microprocessor. To ignore the cold start effects associated with “priming” the caches, the inner loop of the benchmark is executed a few times before the loop performance is actually measured. This allows the “steady state” performance of the benchmark to be measured. For this discussion, let the inner loop be executed a total of 10 times: the first 5 to reach the steady state, and the final 5 iterations for the actual performance measurement.

for (Run_Index = 1; Run_Index <= Number_Of_Runs; ++Run_Index)
{
  if (Run_Index == 6) {
    if (CORE_TYPE == CM4)
    {
      PIN_VALUE(PORTx,0);
      PIN_VALUE(PORTx,1);
      PIN_VALUE(PORTx,0);
    }
    else { // CORE_TYPE == CA5
      PIN_VALUE(PORTy,0);
      PIN_VALUE(PORTy,1);
      PIN_VALUE(PORTy,0);
    }
  }
  /* ... remainder of the standard Dhrystone loop body ... */
}

At the end of the Dhrystone iterations, the same pulse is generated again:

if (CORE_TYPE == CM4)
{
  PIN_VALUE(PORTx,0);
  PIN_VALUE(PORTx,1);
  PIN_VALUE(PORTx,0);
}
else { // CORE_TYPE == CA5
  PIN_VALUE(PORTy,0);
  PIN_VALUE(PORTy,1);
  PIN_VALUE(PORTy,0);
}

The PIN_VALUE function is typically inlined by the compiler into the main inner loop.

Also note that the inclusion of the code executed when Run_Index == 6 does slightly increase the execution time of the inner loop, but the effects are expected to be small enough to be ignored.
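The warm-up and marker logic can be tried on a host machine with a stand-in for the pad register. The sketch below is an assumption-laden model: `pin_value` plays the role of PIN_VALUE by setting or clearing one bit of an ordinary variable, and `run_instrumented_loop` is a hypothetical skeleton that omits the Dhrystone body and only counts the iterations that fall inside the measured window.

```c
#include <stdint.h>

#define WARMUP_RUNS 5  /* cache-priming iterations, as in the text */

/* Stand-in for PIN_VALUE: set or clear one bit of a GPIO data register. */
static void pin_value(volatile uint32_t *reg, unsigned bit, unsigned level)
{
    if (level) {
        *reg |= (1u << bit);
    } else {
        *reg &= ~(1u << bit);
    }
}

/* Emit a 0 -> 1 -> 0 marker pulse on the given bit. */
static void pulse_marker(volatile uint32_t *reg, unsigned bit)
{
    pin_value(reg, bit, 0);
    pin_value(reg, bit, 1);
    pin_value(reg, bit, 0);
}

/* Skeleton of the instrumented loop. Returns how many iterations ran
 * between the start marker and the end marker. */
static int run_instrumented_loop(volatile uint32_t *reg, unsigned bit,
                                 int number_of_runs)
{
    int measured = 0;
    int started = 0;
    for (int run_index = 1; run_index <= number_of_runs; ++run_index) {
        if (run_index == WARMUP_RUNS + 1) {
            pulse_marker(reg, bit);  /* start of the measured window */
            started = 1;
        }
        /* ... the Dhrystone loop body would execute here ... */
        if (started) {
            ++measured;
        }
    }
    pulse_marker(reg, bit);          /* end of the measured window */
    return measured;
}
```

With 10 total runs the function reports 5 measured iterations, matching the 5 warm-up plus 5 measured split described above. On silicon, the tester measures the time between the two pulses rather than counting iterations.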

2. Code to toggle the design port to signal start and end of overall tester pattern execution and the signaling of pass/fail banner

In addition to the signaling of the start/end of the inner loop execution, it is also useful to include GPIO pin signaling to identify the start and end of the entire tester pattern as well as the pass/fail status after the self-check code has been executed. In this example, we encoded a static 2-bit value on GPIO pins to provide the following output status:

if GPIO = 0x3, then tester pattern execution has started
else if GPIO = 0x2, then variable miscompare vs. expected data was detected
else if GPIO = 0x0, then execution completed successfully
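One way to keep this encoding in a single place is a small enumeration plus a function that maps the self-check word to the final banner. The three values come from the list above; the names `tester_status` and `final_status` are hypothetical.

```c
/* 2-bit status codes driven on the GPIO pins (values from the text). */
enum tester_status {
    TESTER_PASS       = 0x0,  /* execution completed successfully   */
    TESTER_MISCOMPARE = 0x2,  /* a variable differed from expected  */
    TESTER_STARTED    = 0x3   /* tester pattern execution has begun */
};

/* Map the accumulated self-check word to the final banner value:
 * zero means every comparison matched. */
static unsigned int final_status(unsigned int results)
{
    return (results == 0u) ? TESTER_PASS : TESTER_MISCOMPARE;
}
```

The pattern would drive TESTER_STARTED on entry and the value returned by final_status once the self-check has run.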

3. Code to be included in the Cortex-A5 executable to wake up the Cortex-M4 as the secondary core (if applicable)

4. Code to provide a starting instruction address for secondary core execute (if applicable)

Execution always starts on the primary core, and this code then enables the clocks for the CM4.

Information about dual/single core operation and which core is primary/secondary is provided to the design by driving a specific value into the design through fixed ports. This is done through the testbench/VCD.

5. For dual core executions, code is added to provide a CPU-to-CPU “semaphore” variable providing an indicator from the secondary core to the primary core that its execution has completed. The primary core then completes its execution and provides the appropriate GPIO indicators.

To do this, the secondary core updates a known system RAM location with a particular value when its execution is finished. The primary core continuously reads this location; once it sees the value, it knows the secondary core’s code execution is done and can terminate its own execution.

6. SoC specific code needed to configure the device's clocks correctly. This may entail getting the system to run at-speed either by locking the system Phase-locked Loop (PLL) or by providing a direct high-frequency at-speed clock on the tester.
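The semaphore handshake from item 5 can be sketched as a pair of functions operating on a shared word. This is a minimal model under stated assumptions: the marker value `SECONDARY_DONE` is arbitrary, the function names are hypothetical, and the poll budget is an addition for illustration (the text describes a continuous read). On real hardware the shared location would also need to be non-cacheable or be kept coherent by explicit cache maintenance.

```c
#include <stdint.h>
#include <stdbool.h>

#define SECONDARY_DONE 0x5A5A5A5Au  /* arbitrary marker value (assumption) */

/* Called by the secondary core when its benchmark run has finished. */
static void secondary_signal_done(volatile uint32_t *sem)
{
    *sem = SECONDARY_DONE;
}

/* Called by the primary core: poll the shared word up to max_polls times.
 * Returns true once the completion marker is seen, false on timeout. */
static bool primary_wait_done(const volatile uint32_t *sem, uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if (*sem == SECONDARY_DONE) {
            return true;
        }
    }
    return false;
}
```

The volatile qualifier forces the compiler to re-read the location on every poll; it does not by itself make the value visible across caches.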

Calculating the SoC tester benchmarks
Finally, to determine the Dhrystone measurements it is necessary to perform the following calculations:

1. Measure the time on the tester (signaled through specific pads for both cores) in seconds for the last n iterations.

2. Dhrystones/sec = 1/(time for last n loops/n).

This Dhrystones/sec metric is now typically converted into a DMIPS/MHz metric by dividing by two factors: the MHz speed of the processor and a constant that represents the performance of a DEC VAX 11/780 machine, which was widely viewed as a “1 MIPS” processor in the 1980s when the Dhrystone benchmark first appeared.

The widely quoted constant describing the VAX 11/780's performance is 1757 Dhrystones per second. Dhrystone MIPS are typically described as simply “DMIPS”, and a DMIPS per MHz performance metric is often quoted, as below:

3. Dhrystone MIPS = Dhrystones/sec * (1/1757)

4. DMIPS per MHz = Dhrystone MIPS * (1/ freq of the CPU in MHz)

So the Dhrystone number for any MCU comes out as:

DMIPS/MHz =
(1 / Execution time for last n loops) * n * (1/1757) * (1 / MCU frequency in MHz)
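Since the tester only reports a time interval, this arithmetic is normally done off-chip. The functions below restate steps 2 through 4 in C for use on a host machine; the function names are hypothetical, and the constant 1757 is the VAX 11/780 figure quoted above.

```c
/* Dhrystones per second attributed to the VAX 11/780 (from the text). */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Step 2: loop iterations per second over the measured window. */
static double dhrystones_per_sec(double seconds_for_n_loops, unsigned n)
{
    return (double)n / seconds_for_n_loops;
}

/* Step 3: normalize against the VAX 11/780. */
static double dhrystone_mips(double dhry_per_sec)
{
    return dhry_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Step 4: normalize against the CPU clock frequency in MHz. */
static double dmips_per_mhz(double seconds_for_n_loops, unsigned n,
                            double cpu_mhz)
{
    return dhrystone_mips(dhrystones_per_sec(seconds_for_n_loops, n))
           / cpu_mhz;
}
```

As a purely illustrative case with made-up numbers, 5 loops measured in 5/175700 seconds (about 28.46 microseconds) on a 100 MHz core give 175,700 Dhrystones/sec, 100 DMIPS, and 1.0 DMIPS/MHz.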

Some additional points to consider
The inherent structure of the Dhrystone inner loop benchmark code and its disproportionate execution time in ASCII string functions (strcmp, strcpy) make it very sensitive to compiler optimizations. As a result, different compiler options can produce wildly varying performance metrics. Beyond the obvious sensitivities to compiler options, there are a number of other optimizations and considerations for best performance, including:

  • Definite design port states. When running the tester pattern in an RTL simulation environment, take care that no design ports (especially the ones signaling the different simulation stages or those configured as input ports) are in an x state at any time in the run, otherwise unpredictable behavior will occur.
  • Caches for both cores. For best performance, enable all the processor architectural features, such as the caches. Depending on the SoC micro-architecture, the performance with caches disabled can be significantly less than the Dhrystone metrics quoted by suppliers.
  • Semaphore in the cache-enabled mode. For a dual core configuration, a semaphore is needed for processor-to-processor communication to signal “execution complete” from the secondary core. With caches enabled and no special handling for the communication from the secondary core, the result indicator may not be visible to the primary processor. This condition would create a situation where the SoC enters an infinite loop waiting for the appropriate signal. The use of a non-cacheable semaphore for this processor-to-processor signaling solves the problem.
  • Copy-back mode. In a cache-enabled configuration, performance is usually maximized by operating the data cache(s) in copyback (also known as writeback) mode. In copyback mode, processor writes generate a transaction only to the data cache, so lines are written to the next level memory system only when they are naturally displaced or when cache maintenance operations force them out. This typically generates considerably less system bus traffic and improved performance versus a write-through configuration, where every processor write also generates a system bus write to update the next level memory system.

In a dual core configuration with caches enabled, it may be necessary to add code to explicitly perform cache maintenance operations to push selected variables into the next level memory system so they can be visible to the other processor core.

Results of the SoC performance benchmarks
In the example described here, the performance of the Cortex-M4 operating in copyback mode was considerably better than when in write-through mode. Below is a VCD snapshot to illustrate the ports used as measurement points (start and stop for the last n loops used for measuring execution time) and the pads showing the status of the run (running/execution complete/pass/fail).


The overall results obtained from the tester provided close correlation with the performance predicted by the core platform design team based on their RTL simulations as well as those provided by the application teams running in a more robust software environment. These correlations gave us confidence that the VFT setup would provide credible results.

The configurations presented here for Dhrystone performance measurement represent only a partial subset of the possible options. The intent is to show relative performance ranges and not a definitive set of absolute Dhrystone performance metrics. The table below summarizes the performance metrics of the ARM SoC design.

As shown above, the differences in the measured cache-enabled performance are mostly due to the variations in the processor instruction set architectures and microarchitecture implementations. The numbers for copy-back mode are significantly improved over the numbers with write-through. This is because a write-through on every store instruction consumes more execution time than a copy-back from the cache at the end of the loops for the significant locations.

For the specific design example considered, the Cortex-A5 caches only do copy-back, whereas the write-through vs. copy-back option is applicable only for the CM4, as also reflected in the results.

Conclusions
Performance measurements taken from a bare-metal execution of the Dhrystone 2.1 benchmark on a dual-core SoC proved to be straightforward once we understood the specifics of the runtime environment and were able to instrument the code to provide the appropriate event signalling.

The Dhrystone performance benchmark is a fairly simple one to use but only if the user is clear about how to plug it in and run it on their individual setups. This is particularly true if it involves minor modifications for system compatibility or uses a different way of gathering the result numbers from their setup (for example, in our case, getting the numbers directly from the tester).

For designers unfamiliar with the Dhrystone reference documentation, it is easy to get lost in the small print and excessive details. While useful for in-depth architectural analysis, much of what is available is not useful in helping the designer understand what is needed for the actual execution and result calculation.

For the sake of simplicity, we’ve focused on crucial, to-the-point steps, separating them from the good-to-have (but not necessarily relevant) ones. With minor tweaks, the procedures described here can be used by developers across the execution flow and are adaptable for use on any tester (or any benchmarking setup, for that matter).

Neha Srivastava is a lead engineer at Freescale Semiconductors (Noida, India Design Centre), working in the Automotive and Industrial Solution Group (AISG) for over 5 years now. She has a Bachelor of Engineering (B.E.) degree from Birla Institute of Technology. She has worked on multiple SoCs in front-end verification and Verification for Testing domain with the areas of interest being low power designs, safety architectures and high performance systems. She can be reached at .

Aashish Mittal is a principal design engineer at Freescale Semiconductors (Noida, India Design Centre), working in the Automotive and Industrial Solution Group (AISG) for over 12 years. He has a Master of Technology from Banaras Hindu University. He has worked on multiple SoCs in front-end verification, Testbench Integration and Verification for Testing domain with the areas of interest being dual core, security, debug and low power architecture. He can be reached at .

References
1. ARM apps note on Dhrystone benchmarking for ARM Cortex Processors.
2. ARM apps note on benchmarking with ARMulator.
3. White-paper on Dhrystone Benchmark by ECL.
4. Wikipedia Reference for Dhrystone.
