
Debugging poor performing or unreachable code

After completing the first step of making a system work as expected, taking the next steps to optimize your code, improve performance, and eliminate unneeded instructions can be very worthwhile over the long-term life of a product.

Developers working with embedded systems face the daily challenge of optimizing resources. The processing power of the MCUs and MPUs in embedded systems can be significantly less than that available in desktops, smartphones, and servers, mostly due to cost factors. Memory is often limited as well, which means that embedded applications require full control over how the system behaves in order to fine-tune the application and get the best response and performance from the available resources.

Each application is unique, and every product has its own technical requirements and specifications. Depending on the application, you generally have a maximum time you can spend processing information and reacting to inputs; this is one accepted definition of a real-time system.

Professional compilers are very powerful and can generate code optimized either for speed and high performance or for the smallest size. However, a single strategy is often applied over the entire source code, which may not be the best solution. An embedded application is often a combination of application code, middleware, an RTOS, and BSP or HAL drivers, and it is good practice to select different optimization strategies for the different components. The BSP or HAL drivers, generally provided as chip vendor libraries, can probably be optimized for size, and the application and RTOS components for speed, to get the best result. Even then, given the resource limitations of embedded systems, you might need to fine-tune individual modules or functions in order to fit the final output into the available memory; per-function overrides such as the sketch below are one way to do this.
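
As a minimal sketch, assuming IAR Embedded Workbench for Arm and its #pragma optimize directive (other toolchains offer similar pragmas or attributes; the function names here are hypothetical), individual functions can override the project-level optimization setting:

#include <stdint.h>

// Infrequently executed setup code: favor small code size
#pragma optimize=size
void hal_uart_init(void)
{
    // hypothetical driver initialization
}

// Hot inner loop: favor execution speed
#pragma optimize=speed
void filter_block(const int16_t *in, int16_t *out, int n)
{
    for (int i = 0; i < n; ++i) {
        out[i] = (int16_t)(in[i] >> 1);  // placeholder processing
    }
}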

If you are working in a multicore environment, you also have the possibility of distributing the application load across the cores for best performance. Adjustments are likely necessary, and this brings us to the need to analyze and measure the performance of the application.

Benchmarks are commonly used to measure the performance of a specific core and how the generated code affects efficiency. The most popular benchmarks for embedded hardware are CoreMark and Dhrystone, both written mainly in C. These contain implementations of different algorithms, including list processing (find and sort), matrix manipulation (common matrix operations), and state machine operations (determining whether an input stream contains valid numbers).

However, for your own application it might be more complicated to define what good enough performance is, or whether your application is in fact performing poorly. In order to achieve the desired or specified performance, you need to be able to measure the actual time consumption of a piece of code with high precision.

This is made possible by a debugger that supports trace and can log data accesses.
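
As a point of reference, many Cortex-M devices also let you read cycle counts directly in software through the DWT unit. A minimal sketch, assuming a Cortex-M3/M4/M7 with CMSIS headers (the device header name here is hypothetical):

#include <stdint.h>
#include "device.h"  // hypothetical CMSIS device header providing DWT and CoreDebug

// Enable the DWT cycle counter (call once at startup)
static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable DWT/ITM blocks
    DWT->CYCCNT = 0;                                 // reset the counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting core cycles
}

// Measure a region of code in core clock cycles
uint32_t measure_region(void)
{
    uint32_t start = DWT->CYCCNT;
    // Code whose time you want to measure
    return DWT->CYCCNT - start;
}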

Reasons for using trace

Trace is a continuously collected sequence of executed instructions for a selected portion of the application. Trace can be collected for every single instruction, for example via the Embedded Trace Macrocell (ETM), or as discrete event trace via SWO (Serial Wire Output) in the case of Arm cores.

Full instruction trace data is mostly used for locating programming errors that have irregular symptoms and occur sporadically. By using trace, you can inspect the program flow up to a specific state, for instance an application crash, and use the trace data to locate the origin of the problem. However, trace can also provide accurate information about the application's performance for every routine and line of code executed, with cycle-level precision. Figure 1 shows a sequence of executed machine instructions collected using full instruction trace.


Figure 1: Collected sequence of executed machine instructions.

The trace information can also be displayed as a call graph in the timeline, which assists developers in analyzing the performance of a live application. Figure 2 shows an example of getting timing information for a selection of functions, with start and end times displayed both as cycle counts and as time, including the absolute start time, the stop time, and the difference between the two.


Figure 2: Example of getting timing information from the timeline.

Event graph and data logging with instrumentation

Event messages can be produced when the execution passes specific positions in your application code. To specify the position in your source code where you want to generate an event message, you use predefined preprocessor macros available in most Arm development tools supporting the Arm CoreSight features. For example, in IAR Embedded Workbench for Arm, the instrumentation macros are defined in the arm_itm.h header file, and you add the macro calls to your application source code:

#include <arm_itm.h>

void func(void)
{
    ITM_EVENT8_WITH_PC(1, 25);
    // Code whose time you want to measure
    // ...
    // end code
    ITM_EVENT32_WITH_PC(2, __get_PSP());
}

The first macro call sends an event with the value 25 to channel 1. The second sends an event with the current value of the stack pointer to channel 2, which means that the debugger can display the stack pointer at a code position of your choice. When these source lines are passed during program execution, events are generated and visualized so that you can analyze them further.

Figure 3 displays the events produced when the execution passes specific positions in your application code. This is extremely powerful when working with an RTOS, since it helps you analyze the task switches that occur during execution, and it is also very useful for measuring the time that specific parts or functions of the application take. Notice that if high message rates are generated from multiple message sources, you might see data overflow issues. This is because, while the SWO has significant bandwidth, in the range of tens of megabits per second, it is not sufficient for high-bandwidth operations such as tracing full instruction streams or high-speed interrupt and data sampling events.


Figure 3: Events produced with code instrumentation in the timeline.

Reasons for using the profiler

Profiling can help you find the functions in your source code where the most time is spent during execution, so that you can focus on them when optimizing your code. Profiling can help you fine-tune your code on a very detailed level, especially for assembler source code, and it can also help you understand where your compiled C/C++ source code spends its time, perhaps giving insight into how to rewrite it for better performance. Figure 4 shows the profiler, which follows the program flow and detects function entries and exits:

  • For the InitFib function, Flat Time 231 is the time spent inside the function itself.
  • For the InitFib function, Acc Time 487 is the time spent inside the function itself, including all functions InitFib calls. Note that 487 = 231 + 256: the flat time plus the time accumulated in the GetFib call below.
  • For the InitFib/GetFib function, Acc Time 256 is the time spent inside GetFib (but only when called from InitFib), including any functions GetFib calls.
  • Further down in the data, you can find the GetFib function separately and see all of its subfunctions (in this case none).


Figure 4: Profiler with function calls.

It is clear that the PutFib function, with 3,174 cycles, has the highest potential for performance optimization. A first step might be to split PutFib into smaller modules. Higher speed-optimization levels could also help.

Performance Monitoring Unit (PMU)

High-end Arm processors based on Cortex-A and Cortex-R include a Performance Monitoring Unit (PMU), which provides useful information about performance, for example event counts and cycle counts. On these cores, the PMU registers are accessed through coprocessor 15 (CP15). To access the coprocessor registers from code, the special instructions MCR (move from register to coprocessor) and MRC (move from coprocessor to register) are used.
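
As a hedged sketch of what such an access looks like, assuming an Armv7-A/R core and GCC-style inline assembly (the register encodings for PMCR, PMCNTENSET, and PMCCNTR follow the Armv7 architecture reference manual):

#include <stdint.h>

// Enable the PMU cycle counter via CP15
static inline void pmu_enable_cycle_counter(void)
{
    uint32_t pmcr;
    __asm volatile("MRC p15, 0, %0, c9, c12, 0" : "=r"(pmcr));      // read PMCR
    pmcr |= 1u;                                                     // E bit: enable counters
    __asm volatile("MCR p15, 0, %0, c9, c12, 0" : : "r"(pmcr));     // write PMCR back
    __asm volatile("MCR p15, 0, %0, c9, c12, 1" : : "r"(1u << 31)); // PMCNTENSET: cycle counter
}

// Read the current cycle count from PMCCNTR
static inline uint32_t pmu_read_cycle_counter(void)
{
    uint32_t cycles;
    __asm volatile("MRC p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}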

A debugger with a suitable viewer makes it possible to monitor event counters or CPU cycles through the PMU. Figure 5 shows an example of the performance monitoring registers. If you know the CPU clock frequency, the actual time elapsed is easy to calculate; for example, 3,174 cycles at 100 MHz correspond to roughly 31.7 µs.


Figure 5: Example of the performance monitoring registers.

Notice that using performance monitoring in your hardware debugger system requires a debug probe that can connect to the PMU through a debug access port (DAP), and the target must have memory-mapped PMU registers. If these requirements are not met, the values of the event counters can only be read when the application execution is stopped.

Unreachable code

Unreachable code is a part of a program’s source code that can never be executed because there is no control flow path to reach the code from anywhere in the rest of the program.

Unreachable code is sometimes referred to as dead code, although dead code can also refer to code that executes but has no effect on the output of a program. Unreachable code is generally considered undesirable for several reasons:

  • Uses program memory unnecessarily
  • Can lead to unnecessary use of the CPU instruction cache
  • Time and effort may be spent testing, maintaining, and documenting code that is never used
  • An optimizing compiler may simply eliminate it, making debugging confusing if breakpoints are set in this area

Unreachable code can exist for many reasons, such as:

  • Programming errors in complex conditional branches (see the sketch after this list)
  • Incomplete testing of new or changed code
  • Legacy code
  • Unreachable code that a programmer did not want to delete because it was mixed up with accessible code
  • Potentially reachable code that current use cases never need
  • Code only used for debugging
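
As a small, hypothetical illustration, two common shapes of unreachable code are a conditional branch that earlier logic has already made impossible, and a statement after an unconditional return:

// The second branch can never be taken: any v < 5 is also v < 10,
// so the first branch has already returned
int classify(unsigned int v)
{
    if (v < 10) {
        return 1;
    } else if (v < 5) {   // unreachable: programming error in the condition
        return 2;
    }
    return 0;
}

int check(int x)
{
    return x;
    x += 1;   // unreachable: control never continues past the return
}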

Unreachable code or unused code should not be part of a release build of the application unless there is a strong reason to keep it, such as code that handles errors or exceptions. Additionally, in order to comply with functional safety standards, you are required to prove full coverage via comprehensive testing, and unreachable code can be a blocker.

Code coverage capability helps you verify whether all parts of the application have been executed, and it helps identify parts of the code that are not reachable.

Reasons for using code coverage

Code coverage functionality is useful when you design your test procedures, to verify that all parts of the code have been exercised by at least one execution path. Figure 6 shows a typical report of the current code coverage analysis. For every program, module, and function, the analysis shows the percentage of code that has been executed from the point code coverage was turned on up to the point where the application stopped.


Figure 6: Typical code coverage analysis report.

In addition, all statements that have not been executed are listed.

Note that for inlined functions, only the statement containing the inlined function call is marked as executed. A statement is considered executed when all of its instructions have been executed. As statements are executed, the percentage is increased correspondingly and the window is updated.
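
As a hedged, hypothetical example of what such a report flags, consider an error path that the test suite never exercises:

// A coverage report would list the "return -1" statement as unexecuted
// until some test calls safe_divide() with den == 0
int safe_divide(int num, int den, int *result)
{
    if (den != 0) {
        *result = num / den;
        return 0;
    }
    return -1;   // error path: covered only when den == 0
}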

Conclusion

Taking advantage of comprehensive debugger features can provide more efficient and reliable code.

By eliminating unreachable code, the reliability of a program can be improved. In addition, using code coverage techniques to ensure that all code is executed and tested, including error-handling code, helps ensure that a system will behave as expected even when errors occur.

By using tools such as code profilers and performance analyzers to determine the “hot spots” where your application spends most of its time, you can focus your performance optimization efforts on the functions where you can get the best performance gains for your efforts.

This can not only ensure that your system will meet real-time constraints, but it can also be a very effective way to reduce overall system energy requirements.

As a result, it is clear that after the first step of making the system work as expected is completed, taking the next steps to optimize your code, improve performance, and eliminate unneeded instructions can be very worthwhile in the long term life of a product.

Note: Figure images are by IAR Systems unless otherwise noted. 

Aaron Bauch is a Senior Field Application Engineer at IAR Systems working with customers in the Eastern United States and Canada. Aaron has worked with embedded systems and software for companies including Intel, Analog Devices and Digital Equipment Corporation. His designs cover a broad range of applications including medical instrumentation, navigation and banking systems. Aaron has also taught a number of college level courses including Embedded System Design as a professor at Southern NH University. Mr. Bauch holds a Bachelor's degree in Electrical Engineering from The Cooper Union and a Master's in Electrical Engineering from Columbia University, both in New York, NY.
