CMP EMBEDDED.COM

Architecture-oriented C optimization, part 2: Memory and more
Here's how to optimize C to account for memory alignment, cache features, endianness, and application specific instructions.



DSP DesignLine
Part 1 looks at architecture-oriented C optimization. It shows how C optimizations can take advantage of zero overhead loop mechanisms, hardware saturation, modulo registers, and more.

Memory related guidelines
Alignment considerations
Architectures may allow or disallow unaligned memory access. No special guidelines are required when unaligned access is allowed, but when it is disallowed the programmer must be careful: ignoring alignment causes severe performance penalties and even malfunctions. To avoid malfunctions, every memory access must be executed with the proper alignment. To improve performance, the compiler needs to know the alignment of the pointers and arrays in the program. Optimizing compilers normally track pointer arithmetic to identify the alignment at each stage of the code, so that they can apply SIMD (Single Instruction Multiple Data) memory accesses while maintaining correctness. In some cases the compiler can tell that a pointer's alignment allows memory access optimization (for example, when a pointer to a 16-bit variable is aligned to 32 bits), and it then emits SIMD memory operations. In other cases the pointers are not aligned, and the only option is to align the data, either by copying it to aligned buffers or by placing it at aligned addresses with the linker.
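As a rough illustration of the "copy to an aligned buffer" option, the sketch below uses standard C11 `alignas` to force a buffer of 16-bit samples onto a 32-bit boundary and copies possibly unaligned input into it. The names `is_aligned`, `to_aligned`, `aligned_buf` and `BUF_LEN` are hypothetical and chosen only for this example; they do not come from any particular toolchain.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN 64  /* hypothetical buffer length, in 16-bit samples */

/* Returns 1 if p is aligned to `align` bytes (align must be a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* A buffer of 16-bit samples forced onto a 4-byte (32-bit) boundary. */
static alignas(4) int16_t aligned_buf[BUF_LEN];

/* Copies n samples from a possibly unaligned byte stream into the aligned
 * buffer. memcpy makes no alignment assumption about its source, so the
 * copy itself is safe; later accesses go through the aligned buffer. */
static const int16_t *to_aligned(const unsigned char *src, size_t n)
{
    if (n > BUF_LEN)
        return NULL;  /* input does not fit in the buffer */
    memcpy(aligned_buf, src, n * sizeof(int16_t));
    return aligned_buf;
}
```

The copy costs cycles of its own, so it tends to pay off only when the aligned data is then read many times by an optimized loop.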

In most cases, the compiler simply cannot tell the alignment. It therefore assumes the worst-case scenario and avoids memory access optimization. To overcome this lack of information, advanced compilers offer a user interface for specifying the alignment of a given pointer. The compiler then uses this information when considering memory access optimization for that pointer. For loops dominated by memory accesses (such as copy loops), this feature can yield a speedup of two or even four times.
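The article does not show the exact interface a given DSP compiler provides, so the following sketch uses the GCC/Clang builtin `__builtin_assume_aligned` as a stand-in for such an alignment hint. The function name `copy_aligned16` and the choice of 8-byte alignment are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies n 16-bit samples. The caller promises that both pointers are
 * 8-byte aligned; __builtin_assume_aligned (a GCC/Clang extension) passes
 * that promise on to the compiler, which may then use wider memory
 * accesses in the loop. Breaking the promise is undefined behavior. */
void copy_aligned16(int16_t *restrict dst, const int16_t *restrict src, size_t n)
{
    int16_t *d = __builtin_assume_aligned(dst, 8);
    const int16_t *s = __builtin_assume_aligned(src, 8);

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}
```

Whether the hint actually produces wider loads and stores depends on the target and the optimization level, so it is worth checking the generated assembly.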

Memory subsystem considerations and memory arrangement
In general, memory is divided into code memory and data memory. In small memory models, where there is very little room for code and data, various paging techniques are applied. Most modern architectures, however, are designed for large memories because of the complex applications they must run. Caches are commonly used to accelerate memory fetches, and DMA (Direct Memory Access) units are used to transfer large quantities of code or data.

The main concern when organizing code in memory is minimizing cache misses. This is a complex task involving many variables, such as the function call tree, the frequency of function execution, function size, and the test vectors used. The process can also be counterintuitive at times: optimizing one function might worsen overall performance because it changes the overall memory map.

Data memory organization involves frequency and size considerations with regard to data caches. It can also be affected by the memory architecture, which might cause access conflicts during transactions. Because architectures use numerous configurations of memories, caches, DMAs and memory subsystems, each one requires thorough research to find the optimal memory organization. Advanced profilers, like the CEVA-X and the CEVA-TeakLite-III profilers, offer memory usage diagnostics, which are highly important for this process.

Memory-based architectures (also referred to as CISC, or complex instruction set computer, architectures) can perform memory operations and computational operations in the same instruction. CISC instructions that use more than one memory operand may require each operand to reside in a different memory class. In the case of CEVA-Teak's MAC instruction, this is mandatory: one memory operand has to reside in a class named XRAM and the other in a class named YRAM. When programming the CEVA-Teak, C programmers use dedicated directives to mark the input buffers of MAC loops as XRAM and YRAM buffers. Otherwise, the compiler adds move instructions inside the loop to enforce the requirement.


Figure 5. The xram and yram directives colored purple help the CEVA-Teak compiler in generating the efficient double MAC loop on the right.

Figure 5 above demonstrates the importance of memory considerations in C coding. The implementation on the left lacks the XRAM and YRAM directives, and therefore fails to trigger efficient MAC loop generation by the compiler. The extra move instruction, which loads a multiplication operand, is shown in red. The version on the right has the XRAM and YRAM directives and benefits from an optimal MAC loop that runs as fast as half a cycle per iteration.
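For readers without the figure at hand, the general shape of such a loop might look like the sketch below. The exact syntax of the CEVA-Teak directives is not reproduced in the text, so `XRAM` and `YRAM` here are hypothetical placeholder macros, defined empty so that the example builds with any C compiler; the buffer names and contents are likewise invented for illustration.

```c
#include <stdint.h>

/* Hypothetical placeholders for the compiler-specific memory-class
 * directives. On a real toolchain they would expand to whatever syntax
 * that compiler documents; here they expand to nothing. */
#define XRAM
#define YRAM

#define N 8  /* hypothetical buffer length */

/* One MAC operand buffer is marked for each memory class. */
XRAM static const int16_t coeffs[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
YRAM static const int16_t samples[N] = {8, 7, 6, 5, 4, 3, 2, 1};

/* Multiply-accumulate over two buffers. When the two operands live in
 * different memory classes, a compiler for such an architecture can fetch
 * both in the same instruction instead of adding moves inside the loop. */
int32_t dot_product(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}
```

Calling `dot_product(coeffs, samples, N)` exercises the loop on the two marked buffers; on a host compiler the macros have no effect, so the sketch only shows where the directives would sit in the source.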
