Benchmarking an ARM-based SoC using Dhrystone: A VFT perspective

Neha Srivastava and Aashish Mittal, Freescale Semiconductor

November 04, 2012


Editor’s Note: The authors describe the development of a self-checking, result-signaling, tester pattern version of the popular Dhrystone benchmark. Their method generates useful performance numbers from an ARM-based SoC that can be used in a tester environment to correlate with the performance predicted by architectural analysis and RTL simulations.

The Dhrystone 2.1 integer benchmark is a widely-used performance benchmark. However, there is no well-defined official reference specification for it, detailing the exact procedures needed to run the program and validate its correct operation.

Most existing references discuss how to calculate the Dhrystone number from the core instruction cycles consumed, or they analyze the *.c (code) and *.h (header) components of the program being executed [1,2]. They often discuss the relative advantages and drawbacks of the metric compared to other standards [3], or offer little more than basic high-level descriptions of the program [4].

The sheer volume of available material makes it difficult to find clear, unambiguous guidelines for something as simple as getting the Dhrystone benchmark up and running on a tester, much less getting it into a self-checking, result-signalling format that executes on bare-metal SoC silicon. In this context, 'bare metal' refers to a system environment without any kernel or operating system; the benchmark program must be wholly self-contained so that it can execute directly on the hardware. This is necessary at the start of an SoC design to correlate expected integer performance with register transfer level (RTL) simulations.

This process often involves a test bench component, primarily a core instruction bus snooper, along with code simulation to track the number of core instruction cycles consumed while the Dhrystone loops iterate. The performance numbers must then be derived by calculation from the instruction cycle count or, in other cases, gathered from the log files, where printfs incorporated into the code report the execution time and other metrics.

Despite these limitations, the Dhrystone benchmark provides a simple, easy-to-control program with a relatively short execution time per loop iteration. This is attractive in the silicon tester environment because it allows execution time to be easily measured. In the silicon tester environment described in this article, assertion and negation of package port pins can be used to signal various events. This replaces the conventional testbench core bus snoop logic used in an RTL simulation environment to extract performance metrics.

Current approaches to benchmarking ARM core-based SoCs
"ARM apps note on Dhrystone benchmarking for ARM Cortex Processors" [1] and "ARM apps note on benchmarking with ARMulator" [2] detail the use of the architecture’s instruction set simulator (ISS) to execute the Dhrystone benchmark. The ARM documentation contains a lot of information about the outputs from the ISS when running this example. Unfortunately, most of the ISS output data does not pertain to calculation of the Dhrystone benchmark performance metric in “Dhrystones per second,” which is the reciprocal of the steady-state loop iteration execution time metric needed in an SoC environment.

If you look at the *.c and *.h code, most of its information is reported through printf/info statements. These print the time taken by the loops of the code, referenced to a timer within the test case, as well as the real time at start and end and the difference between the two. Other test case variables, such as the number of iterations, are also printed. This is satisfactory, and useful, in a simulation environment that supports printf, where the values can be viewed in a logfile. But in a tester setup, where all information must either be self-checking or be sent out on the design ports for viewing on the tester, this approach falls short.

The use of MIPS (million instructions per second) in Dhrystone result numbers also confuses matters. In an SoC test environment we are not interested in instructions per second, the kind of strict bus-cycle or instruction count that tools such as the ARM ISS report. What matters is the time taken by the iterations of the code loops.

The point we want to emphasize is that when running Dhrystone performance benchmarking, the only data that really matters for actual calculations in an SoC test environment is the execution time and how many loop iterations are included.
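As a rough illustration of that point, the calculation below needs only the two inputs named above: the execution time (for example, a pulse width measured on the tester) and the loop count. The function names are ours, not from the Dhrystone sources, and the 1757 divisor is the VAX 11/780 reference score conventionally used to express results as DMIPS; treat this as a sketch of the arithmetic rather than the authors' code.

```c
#include <stdint.h>

/* Reference score of the VAX 11/780, conventionally used to normalize
 * Dhrystones per second into DMIPS. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Dhrystones per second from a measured execution window.
 * Returns 0.0 when the window is not a positive duration. */
double dhrystones_per_second(uint32_t number_of_runs, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0) {
        return 0.0;
    }
    return (double)number_of_runs / elapsed_seconds;
}

/* Time for one pass through the Dhrystone loop, in microseconds.
 * Returns 0.0 when no runs were executed. */
double microseconds_per_run(uint32_t number_of_runs, double elapsed_seconds)
{
    if (number_of_runs == 0u) {
        return 0.0;
    }
    return elapsed_seconds * 1.0e6 / (double)number_of_runs;
}

/* Normalize a Dhrystones-per-second figure to DMIPS. */
double dmips_from_dhrystones(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SECOND;
}
```

Dividing the DMIPS figure by the core clock in MHz gives the familiar DMIPS/MHz number, which is convenient when comparing runs made at different speed bins.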

The rest of this article describes the simple steps we developed to guide SoC designers who want to use the standard Dhrystone code to get performance numbers from their tester setup with minimal code or process overhead. These procedures are based on our work with experienced core/platform developers and on what we learned over the many iterations it took to get the Dhrystone performance patterns ported and running easily in our tester environment.

For this article we will describe the procedures we developed to run the Dhrystone benchmark using a dual-core, embedded microprocessor SoC device. This device includes two ARM Cortex processor cores: a Cortex-A5 (CA5) and a Cortex-M4 (CM4), with both cores including tightly-coupled cache memories for maximum performance.

For our typical tester environment, there are three areas of configuration that are specifically required:

Configuration #1: Dual core environment with a CA5 running concurrently with the CM4
For this exercise, the dual core configuration included two separate, independent memory images (code and data for each processor) to mimic the typical operating environment of a multicore system.

Configuration #2: Tester specific code and pad signatures required

Minimal support code for tester-specific functions needs to be ported into the Dhrystone code. This code uses the design ports/pads to signal the start and end of the test case, a pass/fail banner, and the execution time for the specified number of Dhrystone loops.
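A minimal sketch of such support code follows. The pad assignments, macro names, and function names are hypothetical (the article does not give the actual register map), and the pad register is passed in as a pointer so the same code can be exercised in simulation. The tester would measure the width of the loop-timing pulse to obtain the execution time.

```c
#include <stdint.h>

/* Hypothetical pad assignments, chosen only for illustration. */
#define PAD_TEST_ACTIVE (UINT32_C(1) << 0) /* high from test start to test end */
#define PAD_LOOP_TIMING (UINT32_C(1) << 1) /* high while the Dhrystone loops run */
#define PAD_PASS        (UINT32_C(1) << 2) /* pass banner */
#define PAD_FAIL        (UINT32_C(1) << 3) /* fail banner */

/* Signal start of test case and start of the timed loops. */
void signal_loops_begin(volatile uint32_t *pads)
{
    *pads |= PAD_TEST_ACTIVE | PAD_LOOP_TIMING;
}

/* Signal that the timed loops have finished; the tester measures
 * the width of the PAD_LOOP_TIMING pulse. */
void signal_loops_end(volatile uint32_t *pads)
{
    *pads &= ~PAD_LOOP_TIMING;
}

/* Drive the pass/fail banner from the self-check mask (0 means pass),
 * then signal end of test case. */
void signal_result(volatile uint32_t *pads, uint32_t results)
{
    *pads |= (results == 0u) ? PAD_PASS : PAD_FAIL;
    *pads &= ~PAD_TEST_ACTIVE;
}
```

On silicon, `pads` would point at the memory-mapped output register for the chosen ports; keeping the self-check outside the timing pulse avoids counting the comparison code as benchmark time.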

Configuration #3: Speed-binning to be carried out
Another goal of this effort is to use the Dhrystone benchmark as a first-pass functional pattern to assist with speed sorting of the silicon. For the benchmark to run at speed, the code must be modified to support the following capabilities:

  • enabling the PLL to lock at the desired frequency before executing the benchmark,
  • sourcing a high speed clock into the device through a fast design port, and
  • configuring the SoC to use this clock directly.
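The first of these capabilities, waiting for PLL lock before switching the core clock, might be sketched as follows. The register layout, bit positions, and names here are entirely hypothetical; a real SoC's clock controller will differ, and the bounded poll count is our own addition so that a part whose PLL never locks fails the pattern instead of hanging the tester.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clock-controller layout, for illustration only. */
typedef struct {
    volatile uint32_t ctrl;    /* bit 0: PLL enable */
    volatile uint32_t status;  /* bit 0: PLL locked */
    volatile uint32_t clk_sel; /* system clock source select */
} clock_regs_t;

#define PLL_CTRL_ENABLE   (UINT32_C(1) << 0)
#define PLL_STATUS_LOCKED (UINT32_C(1) << 0)
#define CLK_SEL_PLL       UINT32_C(1)

/* Enable the PLL, poll for lock up to max_polls times, and switch the
 * system clock to the PLL only once lock is reported.
 * Returns true if the switch happened, false on timeout. */
bool switch_to_pll(clock_regs_t *regs, uint32_t max_polls)
{
    regs->ctrl |= PLL_CTRL_ENABLE;

    while (max_polls-- > 0u) {
        if ((regs->status & PLL_STATUS_LOCKED) != 0u) {
            regs->clk_sel = CLK_SEL_PLL;
            return true;
        }
    }
    return false; /* never locked: leave the clock source unchanged */
}
```

The benchmark loops would start only after this function returns true, so that the measured execution time reflects the target frequency.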

We’re not going to go into the details of the code itself (i.e. dhry.h, dhry_1.c, dhry_2.c). Instead we will focus on the specific code areas needing modification. These modifications can be classified into three categories:
  1. removal of existing but unnecessary code by commenting it out,
  2. addition of new code to perform the required self-checking operations, and
  3. added code for signalling of the execution results.

1. Removal of unnecessary code
All time-related defines, tasks, and difference calculations
Since the tester environment uses the assertions and negations of design pins to directly signal execution events, all the code statements in the *.h and *.c files related to the time functions can simply be commented out, removing them from the compiled executable.
// The timing code below is commented out for the tester build

/* Time-function selection */
#ifndef TIME
#undef TIMES
#define TIMES
#endif

/* Elapsed-time calculation and report */
User_Time = End_Time - Begin_Time;
if (User_Time < Too_Small_Time)
{
    printf ("Measured time too small to obtain meaningful results\n");
    printf ("Please increase number of runs\n");
    printf ("\n");
}
else
{
#ifdef TIME
    Microseconds = (float) User_Time * Mic_secs_Per_Second
                        / (float) Number_Of_Runs;
    Dhrystones_Per_Second = (float) Number_Of_Runs / (float) User_Time;
#else
    Microseconds = (float) User_Time * Mic_secs_Per_Second
                        / ((float) HZ * ((float) Number_Of_Runs));
    Dhrystones_Per_Second = ((float) HZ * (float) Number_Of_Runs)
                        / (float) User_Time;
#endif
    printf ("Microseconds for one run through Dhrystone: ");
    printf ("%6.1f \n", Microseconds);
    printf ("Dhrystones per Second: ");
    printf ("%6.1f \n", Dhrystones_Per_Second);
    printf ("\n");
}
2. Add code for self-checking
The Dhrystone program normally prints a number of values to standard output after the inner loop iterations are completed, so that the final states of selected variables can be visually verified. For the tester environment, we want to suppress all these printf statements and replace them with the code needed to perform the required self-checking. To accumulate an error status easily, a simple 32-bit unsigned integer records any differences between the expected and the actual variable values, where each comparison sets a unique bit if there is a mis-compare. After the 19 explicit data variable comparisons are completed and any differences recorded, the "results" variable is stored in system RAM for subsequent use in determining the pass (results == 0) or fail (results != 0) status of the benchmark's execution.

As shown below, the commented portions are the original Dhrystone prints; the non-commented portions check each variable against its expected value and record any mismatch in an integer.
// printf ("Final values of the variables used in the benchmark:\n");
// printf ("\n");
// printf ("Int_Glob: %d\n", Int_Glob);
// printf (" should be: %d\n", 5);
if (Int_Glob != 5)
results[0] |= 1<<0;
// printf ("Bool_Glob: %d\n", Bool_Glob);
// printf (" should be: %d\n", 1);
if (Bool_Glob != 1)
results[0] |= 1<<1;
// printf ("Ch_1_Glob: %c\n", Ch_1_Glob);
// printf (" should be: %c\n", 'A');
if (Ch_1_Glob != 'A')
results[0] |= 1<<2;
// printf ("Ch_2_Glob: %c\n", Ch_2_Glob);
// printf (" should be: %c\n", 'B');
if (Ch_2_Glob != 'B')
results[0] |= 1<<3;
// printf ("Arr_1_Glob[8]: %d\n", Arr_1_Glob[8]);
// printf (" should be: %d\n", 7);
if (Arr_1_Glob[8] != 7)
results[0] |= 1<<4;
// printf ("Arr_2_Glob[8][7]: %d\n", Arr_2_Glob[8][7]);
// printf (" should be: Number_Of_Runs + 10\n");
if (Arr_2_Glob[8][7] != Number_Of_Runs + 10)
results[0] |= 1<<5;
// printf ("Ptr_Glob->\n");
// printf (" Ptr_Comp: %d\n", (int) Ptr_Glob->Ptr_Comp);
// printf (" should be: (implementation-dependent)\n");
// printf (" Discr: %d\n", Ptr_Glob->Discr);
// printf (" should be: %d\n", 0);
if (Ptr_Glob->Discr != 0)
results[0] |= 1<<6;
// printf (" Enum_Comp: %d\n", Ptr_Glob->variant.var_1.Enum_Comp);
// printf (" should be: %d\n", 2);
if (Ptr_Glob->variant.var_1.Enum_Comp != 2)
results[0] |= 1<<7;
// printf (" Int_Comp: %d\n", Ptr_Glob->variant.var_1.Int_Comp);
// printf (" should be: %d\n", 17);
if (Ptr_Glob->variant.var_1.Int_Comp != 17)
results[0] |= 1<<8;
// printf (" Str_Comp: %s\n", Ptr_Glob->variant.var_1.Str_Comp);
// printf (" should be: DHRYSTONE PROGRAM, SOME STRING\n");
// printf ("Next_Ptr_Glob->\n");
// printf (" Ptr_Comp: %d\n", (int) Next_Ptr_Glob->Ptr_Comp);
// printf (" should be: (implementation-dependent), same as above\n");
// printf (" Discr: %d\n", Next_Ptr_Glob->Discr);
// printf (" should be: %d\n", 0);
if (Next_Ptr_Glob->Discr != 0)
results[0] |= 1<<9;
// printf (" Enum_Comp: %d\n", Next_Ptr_Glob->variant.var_1.Enum_Comp);
// printf (" should be: %d\n", 1);
if (Next_Ptr_Glob->variant.var_1.Enum_Comp != 1)
results[0] |= 1<<10;
// printf (" Int_Comp: %d\n", Next_Ptr_Glob->variant.var_1.Int_Comp);
// printf (" should be: %d\n", 18);
if (Next_Ptr_Glob->variant.var_1.Int_Comp != 18)
results[0] |= 1<<11;
// printf (" Str_Comp: %s\n",
// Next_Ptr_Glob->variant.var_1.Str_Comp);
// printf (" should be: DHRYSTONE PROGRAM, SOME STRING\n");
if (my_strcmp(Next_Ptr_Glob->variant.var_1.Str_Comp, "DHRYSTONE PROGRAM, SOME STRING"))
results[0] |= 1<<12;
// printf ("Int_1_Loc: %d\n", Int_1_Loc);
// printf (" should be: %d\n", 5);
if (Int_1_Loc != 5)
results[0] |= 1<<13;
// printf ("Int_2_Loc: %d\n", Int_2_Loc);
// printf (" should be: %d\n", 13);
if (Int_2_Loc != 13)
results[0] |= 1<<14;
// printf ("Int_3_Loc: %d\n", Int_3_Loc);
// printf (" should be: %d\n", 7);
if (Int_3_Loc != 7)
results[0] |= 1<<15;
// printf ("Enum_Loc: %d\n", Enum_Loc);
// printf (" should be: %d\n", 1);
if (Enum_Loc != 1)
results[0] |= 1<<16;
// printf ("Str_1_Loc: %s\n", Str_1_Loc);
// printf (" should be: DHRYSTONE PROGRAM, 1'ST STRING\n");
if (my_strcmp(Str_1_Loc, "DHRYSTONE PROGRAM, 1'ST STRING"))
results[0] |= 1<<17;
// printf ("Str_2_Loc: %s\n", Str_2_Loc);
// printf (" should be: DHRYSTONE PROGRAM, 2'ND STRING\n");
if (my_strcmp(Str_2_Loc, "DHRYSTONE PROGRAM, 2'ND STRING"))
results[0] |= 1<<18;

if (CORE_TYPE == CM4)
*(unsigned int *)SRAM_LOC1 = results[0];
else //CORE_TYPE== CA5
*(unsigned int *)SRAM_LOC2 = results[0];

return results[0];
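
The listing above calls my_strcmp, which the article does not show; on bare metal the C library's strcmp may not be available. The sketch below assumes my_strcmp follows the strcmp convention of returning 0 for equal strings, which is what the checks above rely on. The second helper, first_failed_check, is our own hypothetical addition showing how the mismatch mask read back from system RAM could be decoded into the bit number of the first failing comparison.

```c
#include <stdint.h>

/* Assumed contract (not shown in the article): returns 0 when the two
 * strings are equal and a non-zero value otherwise, like strcmp. */
int my_strcmp(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (int)(unsigned char)*a - (int)(unsigned char)*b;
}

/* Decode the self-check mask: returns the lowest bit number that is set,
 * which identifies the first failed comparison, or -1 if the mask is 0
 * (all comparisons passed). */
int first_failed_check(uint32_t results)
{
    for (int bit = 0; bit < 32; bit++) {
        if ((results & (UINT32_C(1) << bit)) != 0u) {
            return bit;
        }
    }
    return -1;
}
```

With the bit assignments used above, a decoded value of 12 would point at the Next_Ptr_Glob string comparison and 17 at Str_1_Loc.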

3. Scan for the number of loops to be run
The normal benchmark program inputs the number of loop iterations from the user via standard input. In the tester environment, this user input operation is commented out and the loop counts are simply defined as system RAM variables.
// printf ("Please give the number of runs through the benchmark: ");
// {
// int n;
// scanf ("%d", &n);
// Number_Of_Runs = n;
// }
// printf ("\n");
// printf ("Execution starts, %d runs through Dhrystone\n", Number_Of_Runs);

if (CORE_TYPE == CM4) {
    Number_Of_Runs = *(int *)SRAM_LOC1;
} else { // CORE_TYPE == CA5
    Number_Of_Runs = *(int *)SRAM_LOC2;
}
