Achieving better embedded software performance through memory layout optimization – Part 2

Editor’s Note: In Part 2 of a series excerpted from Software Engineering for Embedded Systems, Mike Brogioli of Polymathic Consulting provides tips on various memory layout optimization techniques that can be used to improve embedded system performance.

In order to obtain sufficient levels of performance, application developers and software systems architects must not only select the appropriate algorithms to use in their applications, but also the means by which those applications are implemented. Quite often this also crosses the line into data structure design, layout and memory partitioning for optimal system performance.

It is true that senior developers often have insight into both algorithms and their complexity, as well as a toolbox of tips and tricks for memory optimization and data structure optimization. At the same time, the scope of most embedded software engineering projects prohibits manual code and data hand optimization due to time, resource and cost constraints.

As such, developers must often rely on the tools as much as possible to optimize the general use cases, only resorting to hand-level tuning and analysis to tweak performance on those performance-critical bottlenecks after the initial round of development.

This last round of optimization often entails using various system profiling metrics to determine performance-critical bottlenecks, and then optimizing these portions of the application by hand using proprietary intrinsics or assembly code, and in some cases rewriting performance-critical kernel algorithms and/or related data structures. This article follows on from the topics discussed in Part 1 and details design decisions that may prove useful to embedded system developers concerned with the issues mentioned above.

Overview of memory optimization
Memory optimizations of various types are often beneficial to the run-time performance and even power consumption of a given embedded application. As was mentioned previously, these optimizations can often be performed to varying degrees by the application build tools such as compilers, assemblers, linkers, profilers and so forth. Alternatively, it is often valuable for developers to go into the application and either manually tune the performance or design with memory-system optimization in mind a priori, either to meet given performance targets or to make the software architecture amenable to automated-tool optimization in subsequent phases of the development cycle.

In tuning a given application, quite often the baseline or “out of box” version of the application will be developed first. Once functionality is brought online, the development team or engineers may elect to profile the application for bottlenecks that require further optimization. Often these are known without profiling, if certain kernels within the application must execute within a given number of clock cycles as determined by a spreadsheet or pen-and-paper exercise during system definition. Once these key kernels or key data structures are isolated, optimization is typically undertaken by those experts with knowledge of software optimization techniques, compiler optimizations, the hardware target and perhaps details of the hardware target’s instruction set.

Focusing optimization efforts
Amdahl’s law plays an interesting role in the optimization of full application stacks, however, and is not always appreciated by the software system developer. If only 10% of the dynamic run-time of a given application can benefit from SIMD or instruction-level parallel optimizations versus the 90% of dynamic run-time that must be executed sequentially, then inordinate amounts of effort on parallelizing the 10% portion of the code will still only result in modest performance improvements. Conversely, if 90% of the total application’s dynamic run-time is spent in code regions exhibiting large amounts of instruction-level parallelism and data-level parallelism, it may be worthwhile to focus engineering effort on parallelizing these regions to obtain improved dynamic run-time performance.
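To put numbers on this: by Amdahl’s law the overall speedup is 1 / ((1 - p) + p/s), where p is the fraction of run-time that benefits and s is the speedup applied to that fraction. Speeding up a 10% region by a factor of four yields 1 / (0.90 + 0.10/4) ≈ 1.08, roughly an 8% overall gain, whereas the same factor of four applied to a 90% region yields 1 / (0.10 + 0.90/4) ≈ 3.1, more than a threefold overall improvement.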

In determining those portions of the code which dominate the dynamic application run-time, and which may be the best candidates for either hand optimization or hand adjustment for applicability to automated-tool optimization, application developers typically use a software profiler in conjunction with either the silicon target or software-based system simulation. Intel’s VTune is one such example of a profiling framework; alternatively, the GNU GCC compiler and gprof are open-source solutions that provide dynamic run-time information. Many silicon vendors such as Freescale Semiconductor and Texas Instruments also offer their own proprietary solutions for use with their respective silicon targets, allowing for either traces collected on software-based simulation platforms, or larger application-level traces collected on the native silicon target.

Vectorization and the dynamic code-to-compute ratio
Vectorization of loops is an optimization whereby computation performed across multiple loop iterations can be combined into single vector instructions, effectively increasing the amount of computation performed per dynamically issued instruction. Consider the example in Figure 5.

     short a[16], b[16], c[16];
     for(iter=0; iter<16; ++iter)
     {
          // results in single 16-bit MPY instruction
          // generated in assembly listing
          //

          a[iter] = b[iter] * c[iter];
     }

     short a[16], b[16], c[16];
     for(iter=0; iter<16; iter+=4)
     {
          // with high level compiler vectorization,
          // results in 4-WAY parallel 4×16-BIT multiply
          // vector instruction, effectively performing the
          // computation of four iterations of the loop in
          // a single atomic SIMD4 instruction.
          //

          a[iter:iter+4] = b[iter:iter+4] * c[iter:iter+4];
     }

Figure 5: Loop level vectorization example

In the first loop nest, we can see that each iteration of the loop contains a single 16-bit by 16-bit multiply instruction whose result is written to the a[] array as output. One multiplication instruction is performed for each iteration of the loop, resulting in 16 16-bit multiplications. The second loop, however, shows pseudocode for how the compiler or application developer might vectorize the loop when targeting an architecture that supports a four-way SIMD multiply instruction over 16-bit integer elements.

In this case, the compiler has vectorized multiple iterations of the loop together into a single multiply instruction, as denoted by the array[start_range:end_range] syntax in the second loop nest. Note that the loop counter is now incremented by the vector length for each iteration of the loop. Clearly, only four iterations of the loop are now needed to compute the resulting output array, as each iteration of the loop now contains a single vector multiply instruction that computes four elements of the output vector in parallel.

There are many benefits to vectorizing code in this manner, whether by hand, if the application developer uses intrinsics that are proprietary to the target architecture, or automatically, if the compiler is able to vectorize the code. One such benefit is the increase in performance, as the code now exploits dedicated SIMD hardware, often providing an improvement over the non-vectorized loop on the order of the width of the underlying SIMD vector hardware.

Another benefit is a reduction in code size: rather than unrolling loops into long sequences of atomic scalar instructions, which inflates the code size, denser vector-format instructions are used. This may have the secondary benefit of reducing the number of instruction fetch transactions that go out to memory as well. Lastly, the amount of computation performed per dynamically issued instruction within the application increases as well.

There are a number of challenges for both the development tools and the application developers when trying to vectorize code at the loop level. One such challenge is the code shape of loop nests that are candidates for vectorization. Typically, build tools need to understand the loop iteration space of a loop, so using constant loop bounds rather than run-time-computed values may be beneficial, depending on the sophistication of the underlying compiler’s vectorization technology.
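As a minimal sketch of the point about loop shape (the function names, element type and the bound of 128 are purely illustrative), the difference can be as simple as whether the trip count is a compile-time constant or arrives as a run-time parameter:

     #define N 128   // compile-time-constant trip count (illustrative)

     // Constant trip count: straightforward for an autovectorizer.
     void scale_fixed(short *a, const short *b)
     {
          for (int i = 0; i < N; ++i)
               a[i] = b[i] * 3;
     }

     // Trip count known only at run time: the compiler may need to add
     // prologue/epilogue handling, or may decline to vectorize at all.
     void scale_runtime(short *a, const short *b, int n)
     {
          for (int i = 0; i < n; ++i)
               a[i] = b[i] * 3;
     }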

Secondly, the types of computation performed within the loop nest must be amenable to vectorization. For example, in the example above simple 16-bit integer multiplication is performed for a target architecture supporting a supposed 16-bit four-way SIMD multiply instruction. If the underlying target architecture only supports 8-bit SIMD multiplication, it may be advantageous to avoid 16-bit multiplication wherever possible if vectorization is desired.

Loop dependence analysis is another concern when vectorizing or parallelizing loop nests, as the compiler must be able to prove the safety of loop transformations. Loop dependence analysis is the means by which the compiler or dependence analyzer determines whether statements within a loop body form a dependence with respect to array accesses and data modifications, various data reduction patterns, simplification of loop-independent portions of the code and management of various conditional execution statements within the loop body.

As an example, consider the fragment of C-language code in Figure 6.

     // BOUND_A and BOUND_B are placeholder loop bounds
     for(iter_a=0; iter_a<BOUND_A; ++iter_a)
          for(iter_b=0; iter_b<BOUND_B; ++iter_b)
               a[iter_a+4-iter_b] =
                    b[2*iter_a-iter_b] + iter_a*iter_b;

Figure 6: Fragment of C language code

For the loop above, the compiler’s data dependence analyzer will attempt to find all possible dependences between the statements that write to array a[] and read from array b[]. The challenge for the data dependence analyzer is to prove independence explicitly; any dependence that cannot be proven false must be assumed to be true to ensure safety.

The data dependence analysis shows independence between references by proving that no two instances of the statements accessing array a[] and array b[] touch or modify the same memory location. In the event that a possible dependence is found, loop dependence analysis will make an attempt to characterize the dependences, as some types of optimizations over loop nests may still be possible and profitable. It may also be possible to further transform the loop nests so as to remove the dependence.

In summary, writing loop nests so that a minimum of data dependencies exists between array references will benefit vectorization and other loop transforms as much as possible. While the compiler technology used in analyzing data dependencies and autovectorizing serial code for vector hardware stems from the supercomputing community, improperly written code with troublesome data dependencies and loop structure may still thwart the vectorization efforts of the most advanced tool sets.

At a high level, simply writing code which is easiest for humans to understand usually produces code that is easiest for the vectorizer to understand as well, as the vectorizer and data dependence analyzers can easily recognize what the programmer intended. In other words, highly hand-tuned code with a priori knowledge of the underlying target architecture is not the best candidate for automated vectorization at the tools level.

There are a number of things that application developers may want to keep an eye out for when developing code with the intent of autovectorization by the build tools.

Pointer aliasing in C
One challenge for vectorizers and data dependence analysis is the use of pointers in the C language. When data is passed to a function via pointers as parameters, it is often difficult or impossible for the data dependence analyzer and subsequent vectorizer to guarantee that the memory regions pointed to by the various pointers do not overlap in the iteration spaces of the loops in which they are computed. As the C standard has evolved over time, support for the “restrict” keyword has been added, as can be seen in the example in Figure 7.

     void restrict_compute(int * restrict a, int * restrict b,
                           int * restrict c)
     {
          // the loop bound LEN is a placeholder
          for(int i=0; i<LEN; ++i)
               a[i] = b[i] * c[i];
     }

Figure 7: C Standard “restrict” keyword

By placing the restrict keyword qualifier on the pointers passed to the procedure, the developer guarantees to the compiler that data accessed through a given restrict-qualified pointer is not aliased by anything else the function may modify using another pointer. Note that this only applies to the function at hand, not the global scope of the application itself. This permits the data dependence analyzer to recognize that arrays are not aliased or modified by references with other side effects, and allows more aggressive optimization of the loop nest, including vectorization, amongst other optimizations.

Data structures, arrays of data structures: adding it all up!
Appropriate selection of data structures, before the design of the kernels which compute over them, can have a significant impact when dealing with high-performance embedded DSP codes. This is often especially true for target processors that support SIMD instruction sets and optimizing compiler technology, as was detailed previously in this series.

As an illustrative example, this section details the various trade-offs between using an array of structure elements versus a structure of array elements for commonly used data structures. As an example data structure, we’ll consider a set of six-dimensional points that are stored within a given data structure as either an array-of-structures or a structure-of-arrays, as detailed in Figure 8.

Figure 8: Array-of-structures versus structure-of-arrays
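The figure itself is not reproduced here; as a rough sketch of what it depicts, the two layouts might be declared in C as follows, using the field names that appear in the loops of Figures 9 and 10 and an assumed element count SIZE (the struct and variable names are illustrative only):

     #define SIZE 1024     // element count; the value is assumed here

     // Array-of-structures: SIZE structs, each holding the six fields.
     struct point_aos {
          float x_00, y_00, z_00;
          float x_01, y_01, z_01;
     };
     struct point_aos aos_points[SIZE];

     // Structure-of-arrays: one struct holding six arrays of SIZE each.
     struct point_soa {
          float x_00[SIZE], y_00[SIZE], z_00[SIZE];
          float x_01[SIZE], y_01[SIZE], z_01[SIZE];
     };
     struct point_soa soa_points;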

The array of structures, as depicted on the left-hand side of Figure 8, details a structure which has six fields of floating-point type, which might be the three coordinates of each of the two ends of a line in three-dimensional space. The structures are allocated as an array of SIZE elements. The structure of arrays, which is represented on the right-hand side, creates a single structure which contains six arrays of floating-point data types, each of which is of SIZE elements. It should be noted that the two data structures are functionally equivalent, but have varying system side effects with regard to memory-system performance and optimization.

Looking at the array-of-structures example above, it can be seen that for a given loop nest that is known to access all of the elements of a given struct element before moving on to the next element in the list, good locality of data will be exhibited. This is due to the fact that as cache lines of data are fetched from memory into the data caches, adjacent elements within the data structure will be fetched contiguously from memory and exhibit good local reuse.

The downside when using the array-of-structures data structure, however, is that each individual memory reference in a loop that touches all of the field elements of the data structure does not exhibit unit memory stride. For example, consider the illustrative loop in Figure 9.

     for(i=0; i<SIZE; ++i)
     {
          local_struct[i].x_00 = 0.00;
          local_struct[i].y_00 = 0.00;
          local_struct[i].z_00 = 0.00;
          local_struct[i].x_01 = 0.00;
          local_struct[i].y_01 = 0.00;
          local_struct[i].z_01 = 0.00;
     }

Figure 9: Illustrative loop

Each of the field accesses in the loop above accesses a different field within an instance of the structure, and does not exhibit the unit-stride memory access pattern which would be conducive to compiler-level autovectorization. In addition, any loop that traverses the list of structures and accesses only one or a few fields within a given structure instance will exhibit rather poor spatial locality of data within the caches, due to fetching cache lines from memory that contain data elements which will not be referenced within the loop nest.

We can contrast the rather bleak use case depicted above by migrating from the array-of-structures format to the structure-of-arrays format, as shown in the loop nest in Figure 10.

     for(i=0; i<SIZE; ++i)
     {
          local_struct.x_00[i] = 0.00;
          local_struct.y_00[i] = 0.00;
          local_struct.z_00[i] = 0.00;
          local_struct.x_01[i] = 0.00;
          local_struct.y_01[i] = 0.00;
          local_struct.z_01[i] = 0.00;
     }

Figure 10: Loop nest

By employing the structure-of-arrays data structure, each field access within the loop nest exhibits unit-stride memory references across loop iterations. This is much more conducive to autovectorization by the build tools in most cases. In addition, we still see good locality of data across the multiple array streams within the loop nest. It should also be noted that, in contrast to the previous scenario, even if only one field is accessed by a given loop nest, locality within the cache is achieved due to subsequent elements within the array being prefetched for a given cache line load.

While the examples presented previously detail the importance of selecting the data structure that best suits the application developer’s needs, it is assumed that the developer or system architect will study the overall application hot spots in driving the selection of appropriate data structures for memory-system performance.

The result may not be a clear case of black and white, however, and a solution that employs multiple data structure formats may be advised. In these cases, developers may wish to use a hybrid-type approach that mixes and matches between structure-of-arrays and array-of-structures formats.

Furthermore, for legacy code bases which are tightly coupled to their internal data structures for various reasons beyond the scope of this article, it may be worthwhile to convert between the various formats at run time as needed. While the computation required to convert from one format to another is non-trivial, there may be use cases where the conversion overhead is dramatically offset by the computational and memory-system performance enhancements achieved once the conversion is performed.
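As a rough sketch of what such a run-time conversion might look like (reusing the illustrative struct layouts sketched after Figure 8; the function name is hypothetical), the transformation is a straightforward copy loop whose cost must be amortized by the faster processing that follows:

     // Illustrative sketch only: convert an array-of-structures buffer
     // into a structure-of-arrays buffer so that a performance-critical
     // kernel can then run over unit-stride data.
     void aos_to_soa(const struct point_aos *in, struct point_soa *out, int n)
     {
          for (int i = 0; i < n; ++i) {
               out->x_00[i] = in[i].x_00;
               out->y_00[i] = in[i].y_00;
               out->z_00[i] = in[i].z_00;
               out->x_01[i] = in[i].x_01;
               out->y_01[i] = in[i].y_01;
               out->z_01[i] = in[i].z_01;
          }
     }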

Loop optimizations for memory performance
In addition to structuring loops for targetability by autovectorizing compiler technology, and tailoring data structures over which loops compute, there are loop transformations themselves which may benefit the memory-system performance of an application as well. This section details various loop transformations that can be performed either manually by the application developer, or automatically by the development tools, to improve system performance.

Data alignment’s rippling effects. The alignment of data within the memory system of an embedded target can have rippling effects on the performance of the code, as well as on the development tools’ ability to optimize certain use cases. On many embedded systems, the underlying memory system does not support unaligned memory accesses, or such accesses are supported with a certain performance penalty. If the user does not take care in aligning data properly within the memory system layout, performance can be lost. In summary, data alignment describes the manner in which data is accessed within the computer’s memory system.

When a processor reads or writes to memory, it will often do this at the resolution of the computer’s word size, which might be four bytes on a 32-bit system. Data alignment is the process of putting data elements at offsets that are some multiple of the computer’s word size so that various fields may be accessed efficiently. As such, it may be necessary for users to put padding into their data structures, or for the tools to automatically pad data structures according to the underlying ABI and data type conventions, when aligning data for a given processor target.
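As a small illustration (the struct and field names are invented, and the exact layout depends on the target ABI and compiler), the padding inserted to keep each field on its natural boundary can be observed with offsetof and sizeof:

     #include <stddef.h>
     #include <stdint.h>

     struct sample {
          uint8_t  flag;    // 1 byte of payload
                            // on a typical 32-bit ABI, 3 bytes of padding
                            // are inserted here so that 'value' starts on
                            // a 4-byte (word) boundary
          uint32_t value;   // 4-byte field, word aligned
     };

     // On such an ABI: offsetof(struct sample, value) == 4 and
     // sizeof(struct sample) == 8, although only 5 bytes carry data.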

Alignment can have an impact on compiler and loop optimizations such as vectorization. For instance, if the compiler is attempting to vectorize computation occurring over multiple arrays within a given loop body, it will need to know whether the data elements are aligned so as to make efficient use of packed SIMD move instructions, and also to know whether certain iterations of the loop nest that execute over non-aligned data elements must be peeled off.

If the compiler cannot determine whether or not the data elements are aligned, it may opt not to vectorize the loop at all, thereby leaving the loop body sequential in schedule. Clearly this is not the desired result for the best-performing executable. Alternatively, the compiler may decide to generate multiple versions of the loop nest with a run-time test to determine at loop execution time whether or not the data elements are aligned. In this case the benefits of a vectorized loop version are obtained; however, the cost of a dynamic test at run-time is incurred and the size of the executable will increase due to multiple versions of the loop nest being inserted by the compiler.

Users can often do multiple things to ensure that their data is aligned, for instance padding elements within their data structures and ensuring that various data fields lie on the appropriate word boundaries. Many compilers also support sets of pragmas to denote that a given element is aligned. Alternatively, users can put various asserts within their code to check at run-time whether or not the data fields are aligned on a given boundary before a particular version of a loop executes.
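A hand-written form of such a run-time check might look like the following sketch, in which the 16-byte alignment requirement and the function names are assumptions for illustration:

     #include <assert.h>
     #include <stdint.h>

     #define VEC_ALIGN 16   // assumed SIMD alignment requirement, in bytes

     // Returns non-zero if the pointer sits on a VEC_ALIGN-byte boundary.
     static inline int is_vec_aligned(const void *p)
     {
          return ((uintptr_t)p % VEC_ALIGN) == 0;
     }

     // Loop written with the assumption that both buffers are aligned;
     // the assert makes that assumption explicit at run time.
     void scale_vectorized(short *a, const short *b, int n)
     {
          assert(is_vec_aligned(a) && is_vec_aligned(b));
          for (int i = 0; i < n; ++i)
               a[i] = b[i] * 3;
     }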

Selecting data types for big payoffs

It is important that application developers also select the appropriate data types for their performance-critical kernels, in addition to the aforementioned optimization strategies. When the minimal acceptable data type is selected for computation, it may have a number of secondary effects that can be beneficial to the performance of the kernels. Consider, for example, a performance-critical kernel that can be implemented in either 32-bit or 16-bit integral computation due to the application programmer’s knowledge of the data range. If the application developer selects 16-bit computation using one of the built-in C/C++ language data types such as “short int”, then the following benefits may be gained at system run-time.

By selecting 16-bit over 32-bit data elements, more data elements can fit into a single data cache line. This allows fewer cache line fetches per unit of computation, and should help alleviate the compute-to-memory bottleneck when fetching data elements. In addition, if the target architecture supports SIMD-style computation, it is highly likely that a given ALU within the processor can support multiple 16-bit computations in parallel versus their 32-bit counterparts.

For example, many commercially available DSP architectures support packed 16-bit SIMD operations per ALU, effectively doubling the computational throughput when using 16-bit data elements versus 32-bit data elements. Given the packed nature of the data elements, whereby additional data elements are packed per cache line or can be placed in user-managed scratchpad memory, coupled with the increased computational efficiency, it may also be possible to improve the power efficiency of the system due to the reduced number of data memory fetches required to fill cache lines.
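A minimal sketch of the data-type point (the kernel, buffer size and the 64-byte cache line are assumptions for illustration): declaring the working buffers as 16-bit rather than 32-bit integers doubles the number of elements per cache line and exposes packed 16-bit SIMD operations to the compiler.

     #include <stdint.h>

     #define NUM_ELEMS 1024    // element count, assumed for illustration

     // With a (typical, assumed) 64-byte cache line, 32 int16_t elements
     // fit per line versus 16 int32_t elements, halving the number of
     // line fills this streaming kernel triggers per unit of work.
     void square_elements_16(int16_t *out, const int16_t *in)
     {
          for (int i = 0; i < NUM_ELEMS; ++i)
               out[i] = (int16_t)(in[i] * in[i]);
     }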

Part 1: Memory oriented code optimization

Dr. Michael C. Brogioli is principal and founder at Polymathic Consulting, as well as an adjunct professor of computer engineering at Rice University, Houston, Texas. Prior to Polymathic, he was a senior member of the technical staff and chief architect at Freescale Semiconductor. In addition, he served in several roles at TI's Advanced Architecture and Chip Technology Research and Intel's Advanced Microprocessor Research Lab. He holds a PhD/MSc in electrical and computer engineering from Rice as well as a BSc in electrical engineering from Rensselaer Polytechnic Institute.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software Engineering for Embedded Systems, written and edited by Robert Oshana and Mark Kraeling.
