Software performance engineering for embedded systems: Part 2 – The importance of performance measurements


Performance measurement is another important area of SPE. This includes planning measurement experiments to ensure that results are both representative and reproducible. Software also needs to be instrumented to facilitate SPE data collection. Finally, once the performance-critical components of the software are identified, they are measured early and often to validate the models that have been built and to verify earlier predictions. See Figures 7 and 8 for an example of the outcome of this process for the Femto basestation project.

Figure 7: Key parameters influencing performance scenarios based on cycle counts

Figure 8: Output from a “Performance Calculator” used to identify and track key performance scenarios
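
As a minimal sketch of instrumenting code so that measurements are reproducible, the following Python snippet times a function over repeated runs and reports the minimum, median, and maximum. The helper `measure` and the workload `checksum` are hypothetical names chosen for illustration and are not part of the Femto project tooling. On a real target, the timer would more likely be a hardware cycle counter than `time.perf_counter_ns`.

```python
import statistics
import time


def measure(func, *args, runs=30, warmup=3):
    """Time func(*args) over several runs and summarize in nanoseconds."""
    for _ in range(warmup):
        # Warm-up runs are executed but not recorded.
        func(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func(*args)
        samples.append(time.perf_counter_ns() - start)
    return {
        "runs": len(samples),
        "min_ns": min(samples),
        "median_ns": statistics.median(samples),
        "max_ns": max(samples),
    }


def checksum(data):
    """Hypothetical workload: a simple byte-wise checksum."""
    total = 0
    for byte in data:
        total = (total + byte) & 0xFFFF
    return total


if __name__ == "__main__":
    print(measure(checksum, bytes(range(256)) * 64))
```

Reporting the spread rather than a single number makes it easier to judge whether a result is representative before it is compared against a model or an earlier prediction.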

Step 1: Determine where you need to be
Reject nonspecific requirements or demands such as “the system should be as fast as possible”. Instead, use quantitative terms such as “Packet throughput must be 600K packets per second for IP forwarding”.

Understand potential future use cases of the system and design the necessary scalability to handle them. Figure 9 shows an example of how to define these performance goals. To do this properly, the first step is to identify the system dimension. This is the context and establishes the “what”. Then the key attributes are identified. This identifies how good the system “shall be”. The metrics are then identified that determine “how we’ll know”. These metrics should include a “should” value and a “must” value.

In the example in Figure 9, IP forwarding is the system dimension, which is a key measurement focus for a networking application. The key attribute is “fast”: the system is going to be measured based on how many packets can be forwarded through it. The key metric is thousands of packets per second (Kpps). The system should be able to achieve 600 Kpps and must reach at least 550 Kpps to meet the minimum system requirements.

Figure 9: Defining quantitative performance goals
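
One way to make such a goal checkable is to record the dimension, attribute, metric, and the “should” and “must” values together. The sketch below uses the IP forwarding numbers from the text. The class name `PerformanceGoal` and its `assess` method are hypothetical, and the sketch assumes a metric where higher values are better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceGoal:
    dimension: str   # the "what", e.g. IP forwarding
    attribute: str   # how good the system shall be, e.g. fast
    metric: str      # how we'll know, e.g. Kpps
    should: float    # target value
    must: float      # minimum acceptable value

    def assess(self, measured: float) -> str:
        """Classify a measurement, assuming higher values are better."""
        if measured >= self.should:
            return "meets target"
        if measured >= self.must:
            return "meets minimum"
        return "fails"


ip_forwarding = PerformanceGoal("IP forwarding", "fast", "Kpps", should=600, must=550)
print(ip_forwarding.assess(575))  # prints "meets minimum"
```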

Step 2: Determine where you are now
Understand which system use cases are causing performance problems. Quantify these problems using available tools and measurements. Figure 10 shows a debug architecture for a Multicore SoC that can provide the visibility “hooks” into the device for performance analysis and tuning. Figure 11 shows a strategy for using embedded profiling and analysis tools to provide visibility into a SoC in order to collect the necessary information to quantify performance problems in an embedded system.

Perform the appropriate assessment of the system to determine if the software architecture can support performance objectives. Can the performance issues be solved with standard software tuning and optimization methods? This is important because it's not desirable to spend many months tuning the application only to determine later that the goals cannot be met using these tuning approaches and more fundamental changes are required. Ultimately, this phase needs to determine whether performance improvement requires re-design or if tuning is sufficient.

Figure 10: A debug architecture for a Multicore SoC that can provide the visibility “hooks” into the device for performance analysis and tuning

Figure 11: A tools strategy for using embedded profiling and analysis tools to provide visibility into a SoC in order to collect the necessary information to quantify performance problems in an embedded system.

Step 3: Decide if you can achieve the objectives
There are several categories of performance optimization, ranging from the simple to the more complex:

Low-cost/low ROI techniques – usually these techniques involve automatic optimization options. A common approach in embedded systems is the use of compiler options to enable more aggressive optimizations for the embedded software.
Intermediate cost/intermediate ROI techniques – this category includes optimizing algorithms and data structures (for example, using an FFT instead of a DFT) as well as approaches like modifying software to use more efficient constructs.
High-cost/high ROI techniques – re-designing or re-factoring the embedded software architecture.
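
To illustrate why an algorithmic change such as FFT versus DFT can pay off, the sketch below compares order-of-growth operation counts for a transform of length n. These are rough estimates (n² for a direct DFT, n·log2(n) for a radix-2 FFT), not cycle counts for any particular processor, and the function names are hypothetical.

```python
import math


def dft_ops(n):
    """Order-of-growth operation estimate for a direct DFT of length n."""
    return n * n


def fft_ops(n):
    """Order-of-growth operation estimate for a radix-2 FFT (n a power of two)."""
    return n * int(math.log2(n))


n = 1024
print(dft_ops(n), fft_ops(n))
```

For n = 1024 the estimates differ by roughly two orders of magnitude, which is the kind of gain compiler options alone are unlikely to deliver.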

Step 4: Develop a plan for achieving the objectives
The first step is to Pareto-rank the proposed solutions based on return on investment. There are various ways to estimate resource requirements, including modeling and benchmarking. Once the performance targets have been determined, the tuning phase iterates until the targets have been met. Figure 12 shows an example of a process used in optimizing DSP embedded software. As this figure shows, there is a defined process for optimizing the application based on an iterative set of steps:

  • Understand key performance scenarios for the application
  • Set goals for key optimizations for performance, memory, and power
  • Select processor architecture to match the DSP application and performance requirements
  • Analyze key algorithms in the system and perform algorithmic transformation if necessary
  • Analyze compiler performance and output for key benchmarks
  • Write “out of box” code in a high level language (e.g., C)
  • Debug and achieve correctness and develop regression tests
  • Profile application and Pareto-rank “hot spots”
  • Turn on low level optimizations with the compiler
  • Run test regression, profile application, and re-rank
  • Tune C/C++ code to map to the hardware architecture
  • Run test regression, profile application, and re-rank
  • Instrument code to get data as close as possible to the CPU using DMA and other techniques
  • Run test regression, profile application, and re-rank
  • Instrument code to provide links to compiler with intrinsics, pragmas, keywords
  • Run test regression, profile application, and re-rank
  • Turn on higher level of optimizations using compiler directives
  • Run test regression, profile application, and re-rank
  • Re-write key inner loops using assembly languages
  • Run test regression, profile application, and re-rank
  • If goals are not met, re-partition the application in hardware and software and start over again

At each phase, if the goals are met, document and save the code build settings and compiler switch settings.

Figure 12: A Process for Managing the Performance of an embedded DSP application
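
The Pareto ranking by return on investment that opens this step can be sketched as a simple sort on estimated gain per unit of cost. The candidate list, the gain figures (percent of cycles saved), and the cost figures (engineer-days) below are hypothetical values for illustration only.

```python
def pareto_rank(candidates):
    """Sort candidate optimizations by estimated gain per unit cost, best first."""
    return sorted(candidates, key=lambda c: c["gain"] / c["cost"], reverse=True)


# Hypothetical estimates: gain in percent of cycles saved, cost in engineer-days.
candidates = [
    {"name": "re-architect data path", "gain": 50.0, "cost": 60.0},
    {"name": "replace DFT with FFT", "gain": 30.0, "cost": 10.0},
    {"name": "enable compiler optimizations", "gain": 10.0, "cost": 1.0},
]

for c in pareto_rank(candidates):
    print(f'{c["name"]}: {c["gain"] / c["cost"]:.2f} percent per day')
```

With these made-up numbers the low-cost compiler change ranks first even though its absolute gain is smallest, which matches the order of the iterative steps listed above.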

Step 5: Conduct an economic analysis of the project based on this plan
The first step is to gather data that can be used to support the analysis. This data includes, but is not limited to, time and cost to complete the performance analysis, software changes required, hardware costs if necessary, and software build and distribution costs.

The next step is to gather data on the effect of the improvements, which includes things like hardware upgrades that can be deferred and staff cost savings.

Performance engineering can be applied to each phase of the embedded software development process. For example, the Rational Unified Process (RUP) has four key phases: Inception, Elaboration, Construction, and Transition (Figure 13).

RUP is an iterative software development process framework created by the Rational Software Corporation (now IBM). RUP is an adaptable process framework rather than a single concrete prescriptive process. It is intended to be tailored by software development teams, who select the elements of the process that they need.

Figure 13: Rational Unified Process

Lloyd Williams has mapped SPE into the RUP process as follows (4):

Inception Phase
The primary objective of the inception phase is to scope the system adequately as a basis for validating initial costing and budgets. From an SPE perspective, high level risks that may impact system performance are identified and described in this phase.

Elaboration Phase
In this phase, the main objective is to mitigate the key risk items identified by analysis up to the end of this phase. This is where the problem domain analysis is made and the architecture of the project gets its basic form. It is also where the critical business processes are decomposed into critical use cases. The requirements that relate to SPE are the non-functional requirements (NFRs), which are not limited to use cases.

The primary difference between functional and non-functional requirements is that functional requirements define what the system should do, and can be expressed as “The embedded software shall (monitor, control, etc.)”, while non-functional requirements define when and/or how well the system should perform the functions, and can be expressed as “The embedded software shall be (fast, reliable, scalable, etc.)”.

One way to formulate a set of NFRs for an embedded system is to use the acronym “SCRUPLED”:

  • Security, Licensing, Installation – access privileges, security requirements, installation and licensing requirements
  • Copyright, legal notices and other items – required corporate representations and legal protections
  • Reliability – defects, Mean Time Between Failures, availability
  • Usability – ease of use requirements, presentation design guidelines, UE standards, accessibility standards, training standards, sheets, help systems
  • Performance – quantitative performance requirements
  • Localization and Internationalization – foreign language operating systems, localization enablement, specific localizations
  • Essential standards – industry, regulatory, and other externally imposed standards
  • Design Constraints – other constraints on the system or development technologies: mandated programming languages and standards, platforms, common components

Initial models are created that describe the overall system load over a specified time period, defining how many of each type of key transaction (networking packets, video frames, etc.) will be executed per unit of time.
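
A minimal sketch of such an initial load model follows. It multiplies the rate of each key transaction by an estimated cycle cost and compares the total with the cycles one core provides per second. All rates, cycle costs, and the 1 GHz clock are hypothetical figures chosen for illustration, as is the function name `core_utilization`.

```python
def core_utilization(workload, core_hz):
    """Return the fraction of one core's cycles consumed by the workload.

    workload maps a transaction type to a tuple of
    (transactions per second, estimated cycles per transaction).
    """
    cycles_per_second = sum(rate * cycles for rate, cycles in workload.values())
    return cycles_per_second / core_hz


# Hypothetical figures for illustration only.
workload = {
    "network packets": (600_000, 1_000),   # 600 Kpps at 1,000 cycles each
    "video frames": (30, 2_000_000),       # 30 frames/s at 2 M cycles each
}

print(f"{core_utilization(workload, core_hz=1_000_000_000):.0%}")  # prints "66%"
```

Even a crude model like this shows early whether the expected load leaves headroom or already exceeds what the chosen core can deliver.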

Construction Phase
The primary objective of the construction phase is to build the software system. The main focus is on the development of components and other features of the system. This is where the majority of the coding takes place. Several construction iterations may be developed in order to divide the use cases into manageable segments that produce demonstrable prototypes. SPE adds some activities to this phase. Performance tool-related activities are completed. For example, specifying a profiling tool for component development and unit testing is necessary. Automated frameworks are needed to drive the components under development and measure performance.

Transition Phase
This is where we transition the system from development into production. Activities include training the end users and maintainers, and beta testing the system to validate it against the end users' expectations. The product is also checked against the quality level set in the Inception phase. From an SPE perspective, this is when we configure operating systems, the network, and any message queuing software and other optimizations identified in the performance test plan. It's important to ensure that all necessary performance monitoring software is developed, deployed, and configured.

Latency vs. Throughput in an eNodeB application

Embedded computer performance is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used. Depending on the context, good computer performance may involve one or more of the following:

  • Short response time for a given piece of work
  • High throughput (rate of processing work)
  • Low utilization of computing resources
  • High availability of the computing system or application

It can be difficult to design a system that provides both low latency and high throughput. However, real-world systems (such as Media, eNodeB, etc.) need both. For example, see Figure 14. This eNodeB system must be able to handle two basic NFRs: low latency, with a 1 ms periodic interrupt used for scheduling important calls through the system; and maximum data throughput of 100 Mbps downlink and 50 Mbps uplink for supporting key customer use cases such as data transfer for web surfing and texting.

Figure 14: A use case involving both latency and throughput for a Femto application
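
The two NFRs can be related with simple arithmetic: if data is processed once per 1 ms scheduling interrupt, each interval must move enough bytes to sustain the required data rate. The sketch below works this out for the 100 Mbps downlink and 50 Mbps uplink figures from the text. It assumes 1 Mbps = 10^6 bits per second and that all data is handled on the periodic tick, which is a simplification. The function name is hypothetical.

```python
def bytes_per_interval(rate_mbps, interval_ms):
    """Bytes to move per interval to sustain rate_mbps (1 Mbps = 10**6 bit/s)."""
    bits = rate_mbps * 1_000_000 * interval_ms / 1_000
    return bits / 8


print(bytes_per_interval(100, 1))  # downlink: 12500.0 bytes per 1 ms tick
print(bytes_per_interval(50, 1))   # uplink: 6250.0 bytes per 1 ms tick
```

Under these assumptions, any work that delays a tick directly reduces the data that can be moved in that interval, which is why latency and throughput have to be tuned together.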

This is a case where the designer needs to tune the system for the right balance of latency and throughput. This would include the following basic decisions:

Partitioning the application between hardware cores and hardware acceleration. Embedded systems usually consist of processors that execute domain-specific programs. Much of their functionality is implemented in software, which runs on one or more generic processors. High-performance functions may be implemented in hardware. Typical examples include TV sets, cellular phones, eNodeB basestations, and printers. Most of these systems run multimedia and/or telecom applications, such as video and audio decoders.

Figure 15 shows a table summarizing the key performance-intensive functions required for an eNodeB application, which of those functions use hardware acceleration, the allocated cycle budget, and the core loading percentage.

Figure 15: An Example of using processor cores and accelerators to partition an application onto a SoC processor

Partitioning the software application across the programmable cores in order to achieve the NFRs required for the application. Figure 14 above shows a diagram of how the real-time tasks are allocated to one of the two available cores and the non-real-time functions are allocated to the other processing core.

Designing the proper software architecture to support the NFRs. Figure 16 shows additional software support for performing zero-copy transfer of required packets for eNodeB processing around the Linux software stack, avoiding the extra overhead of unnecessarily going up and down the Linux stack.

Figure 16: Bypassing the Linux software stack using a “fastpath” software bypass technology

If SPE has been properly applied at each iteration and phase of the project, this should be sufficient to enable the system to achieve the required performance goals. If there are use cases that cannot be tuned into compliance, then it will be necessary to consider re-factoring portions of the system or, in the worst case, re-partitioning them between hardware and software. In some cases the problem can be resolved with additional hardware, but adding more hardware leads quickly to diminishing returns, as Amdahl’s Law demonstrates (Figure 17).

Figure 17: Amdahl’s Law dictates that more hardware may not necessarily improve performance linearly
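
Amdahl’s Law can be stated as speedup = 1 / ((1 − p) + p / n), where p is the fraction of the work that benefits from the added hardware and n is the number of processing units. The short sketch below evaluates it for an assumed p of 0.8, a value chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when parallel_fraction of the work is spread over n_units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_units)


for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.8, n), 2))
```

With p = 0.8, going from 8 to 16 units raises the speedup only from about 3.33 to 4.0, and no amount of hardware can exceed 1 / (1 − p) = 5.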

SPE must be managed throughout the lifecycle. Crawl charts showing actual and target performance, like that shown in Figure 18, can be used to manage and report on performance status as the project executes. This can be a transparent way of communicating performance to key stakeholders and also a way to measure how well the SPE process is working. Software release iterations are possible with SPE, and these iterations can also be tracked using crawl charts showing performance goals for each of the software iterations (Figure 19).

Figure 18: A Performance “crawl chart” showing performance improvements over time

Figure 19: A Performance crawl chart showing increasing performance improvements supporting an incremental release process
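
The data behind a crawl chart can be as simple as a target and an actual value per software iteration. The sketch below computes the percentage of target achieved for each iteration. The iteration names and throughput figures (in Kpps) are hypothetical and are not taken from the figures above.

```python
def crawl_rows(iterations):
    """Return (name, target, actual, percent of target) for each iteration."""
    return [
        (name, target, actual, round(100.0 * actual / target, 1))
        for name, target, actual in iterations
    ]


# Hypothetical throughput figures in Kpps for three software iterations.
iterations = [
    ("iteration 1", 400, 350),
    ("iteration 2", 500, 480),
    ("iteration 3", 600, 610),
]

for row in crawl_rows(iterations):
    print(row)
```

Plotting the target and actual columns against the iteration gives the crawl chart, and the percentage column is a compact status figure for stakeholder reports.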


Next in Part 3:  Collecting data and using it effectively
Read Part 1: Software performance engineering for embedded systems: Part 1 – What is SPE?

Rob Oshana, author of the soon to be published “Software engineering for embedded systems,” by Elsevier, is director of software R&D, Networking Systems Group, Freescale Semiconductor.


“A Maturity Model for Application Performance Management Process Evolution, A model for evolving organization’s application performance management process”, Shyam Kumar Doddavula, Nidhi Timari, and Amit Gawande

“Five Steps to Solving Software Performance Problems”, Lloyd G. Williams, Ph.D., and Connie U. Smith, Ph.D., June 2002

“Software Performance Engineering”, in UML for Real: Design of Embedded Real-Time Systems, Luciano Lavagno, Grant Martin, and Bran Selic, eds., Kluwer, 2003

“Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software”, Lloyd G. Williams, Ph.D., and Connie U. Smith, Ph.D.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012. For more information about Software engineering for embedded systems and other similar books, visit
