Software performance considerations when using cache

Embedded applications are being implemented on processors that have cache areas of significant size. Popular architectures in large-scale embedded applications have variants with large amounts of on-chip cache. This paper introduces basic concepts about cache and discusses some ideas for developing software to improve performance when running on these processors.

A developer of an application that will run on a general-purpose computer cannot make any assumptions about the number of other applications that are running on the system and how the cache is shared between them. A developer of an embedded application has a higher level of control over the application environment.

This higher level of control can provide opportunities to use cache more effectively. In cases where consistent response times are important, developers may want to be aware of cases where memory accesses come from main memory instead of from cache.

The use of cache in computer systems can dramatically improve system performance. Increasing cache size is a cost-effective way to improve performance on microprocessors as the transistor count on the chip increases.

While the operation of cache is generally transparent to the programmer, some issues can arise that influence a program's performance. This paper will look at some of the issues that affect single-processor system performance arising from the tendency of programs to refer to data and instructions in locations of memory close to previously accessed locations. This tendency is called Locality of Reference and is the basic property that allows cache to improve processor performance [1].

Issues associated with cache coherency affecting multi-processor and multi-core systems, and systems with agents external to the processors such as peripherals using DMA, are not covered.

Cache Organization
One of the most common methods of cache organization is called “n-way set associative cache”. This organization of cache provides a good balance between the amount of time it takes to search the cache and the number of times the desired data will be in the cache.

The portion of cache that holds data from system memory is called a cache line. The cache line has a fixed length and holds a contiguous group of bytes. The length of the cache line is fixed by the processor architecture and is usually determined by the size of the bus used to access system memory.

Typical cache line sizes are 32 or 64 bytes but can be larger or smaller. Another part of the cache memory holds directory entries, one for each cache line, that hold a tag value and other information about the state of the cache line. An example of the state of the cache line is an indication of whether the line has been written to since it was last read from memory.

The cache memory is accessed using the memory address by splitting the memory address into three fields: tag, line, and byte. The line number is used to select which of a group or set of cache lines could be holding the data for the desired memory location. The tag is compared to directory entries, one for each line in the selected group, to find a match.

If the tag matches a directory entry, a Cache Hit occurs and the data in the cache line with the tag match is accessed using the byte number. If a match does not occur, the result is called a Cache Miss. The data must be read from system memory or a lower level of cache if the system has multiple levels of cache memory. The number of bytes read is the size of a cache line. The operation of reading the system memory into cache is called a Cache Line Fill.

The tag “associates” memory addresses with locations in cache. The cache is organized into groups or “sets” that are identified with the line number from the memory address. The number of cache lines in a set is the number of “ways”. So, if a line number points to a set of four possible lines that could hold the data for an address, that cache would be four-way set associative.

From a software perspective, the main concern is with the cache line size: the amount of data that can be accessed from cache after the cache line is filled. Your processor's cache line size can be found in the processor user manual.

The cache line will be aligned to an address that is a multiple of the cache line size. For example, in a cache with a line size of 32 bytes, a byte access to a memory location whose address ends with zeros in the least significant five bits will cause a cache line to be filled with that byte and the next 31 bytes from system memory. Operations performed on any of the 32 bytes can now be resolved by accessing cache.
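As a rough sketch of this address split, the following C fragment decomposes an address into its byte, line, and tag fields. The geometry (32-byte lines, 4096 sets) is an assumption chosen to match the 512-kilobyte, four-way example used later in this paper; a real processor's field widths come from its user manual.

#include <stdio.h>

/* Assumed geometry: 32-byte lines and 4096 sets, so the low 5 bits
   of the address are the byte field, the next 12 bits are the line
   (set) field, and the remaining upper bits are the tag. */
#define LINE_BYTES 32UL
#define NUM_SETS   4096UL

int main(void)
{
    unsigned long addr = 0x4277d6;  /* example address from the Stride section */
    unsigned long byte = addr % LINE_BYTES;
    unsigned long line = (addr / LINE_BYTES) % NUM_SETS;
    unsigned long tag  = addr / (LINE_BYTES * NUM_SETS);
    printf("tag=0x%lx line=0x%lx byte=%lu\n", tag, line, byte);
    return 0;
}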

Figure 1. Four-Way Set Associative Cache

Figure 1 above shows an example where the data being accessed starts at a location that is offset four bytes from the beginning of the cache line. The line field from the address selects the set of cache lines and directory entries at location 05. Each selected directory entry is examined to see if it contains the tag, 4356. This tag is present in Directory 1, indicating that the desired cache line is present in cache and is in Way 1. The data can then be accessed starting four bytes from the beginning of the cache line.

Figure 1 also provides an example of how data that spans more than one cache line could be stored in cache. For an array of 64 bytes in a cache with a 32-byte line size, such as the character array char arr[64], the reference arr[0] could be the first byte in one cache line and arr[32] would then be the first byte in the next cache line. Notice that in Figure 1, arr[0] and arr[32] are not in the same Way. Even though the bytes come from the same data structure, they can be in any of the cache lines in a particular set.

Stride
Since the size of cache is smaller than the size of system memory, data from multiple addresses in system memory will share the same set of cache lines. If a program has heavily used portions of code or frequently accessed data areas that map to the same cache line set, a significant degradation of performance can occur. As these heavily used cache lines are accessed, a cache line may be written back to system memory to free up space for one access, only to be read back in on a subsequent access.

The stride is the number of bytes that separate an address that uses a cache line set from the next address that will use the same cache line set. In Figure 1, if the cache line size is 32 bytes and the cache is 512 kilobytes in size, address 4356054 (0x4277d6) will use the same cache line set as address 4487126 (0x4477d6).

The stride is 128 kilobytes or 0x20000 bytes and is calculated by dividing the cache size by the number of ways, or “Cache Size / Number of Ways”; in this example, 512 kbytes / 4. In some cases, a minor change to the size of a data structure can have an unexpected impact on performance.
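To make the collision concrete, here is a hedged sketch using the same assumed geometry (512 kilobytes, four ways, 32-byte lines, hence 4096 sets): two addresses land in the same set exactly when they are a multiple of the 128-kilobyte stride apart.

#include <stdio.h>

/* Assumed geometry: 512 KB cache, 4 ways, 32-byte lines,
   so each way covers 512K / 4 = 128 KB and holds 4096 sets. */
static unsigned long set_of(unsigned long addr)
{
    return (addr / 32) % 4096;   /* the line (set) field of the address */
}

int main(void)
{
    /* The two addresses from the text are 0x20000 (128 KB) apart. */
    printf("0x4277d6 -> set 0x%lx\n", set_of(0x4277d6UL));
    printf("0x4477d6 -> set 0x%lx\n", set_of(0x4477d6UL));  /* same set: 0x3be */
    return 0;
}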

Accesses that were to the same cache line set may no longer access the same set, resulting in a performance improvement; or the converse occurs, and performance degrades because frequently accessed data share the same cache line set. A developer can calculate the stride and examine the data structures in the program to see if the change in performance resulted from this unintended alignment of cache accesses.

Prefetch
Under certain conditions, additional cache lines may be read from system memory into cache in anticipation of their use. This operation is called a prefetch. A prefetch can be done in hardware or under software control. A software developer does not usually insert prefetch instructions into the code. The compiler can do this when it is optimizing the code and can use the pattern of data accesses to determine the memory that may be accessed in the future.
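For illustration, gcc exposes this as an intrinsic that a developer could use by hand. The following is only a sketch; the prefetch distance of eight elements is an arbitrary assumption that would need tuning to the target's memory latency.

/* Sum an array while hinting that data eight elements ahead will be
   needed soon. __builtin_prefetch is a gcc extension; the arguments
   after the address mean "read access" (0) and "low temporal
   locality" (1). */
long sum(const int *data, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8], 0, 1);
        total += data[i];
    }
    return total;
}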

Hardware prefetch will also examine the pattern of accesses to determine if it should prefetch data. The prefetch will cause the data to be loaded into cache but not into the processor's execution unit. This is done before the execution unit requires the data, so the data is available in cache when the execution unit is ready for it. The prefetch should occur while the processor is processing data that has already been fetched and the system memory bus would otherwise be idle.

Figure 2. Memory access without prefetch [3a]

Figure 2 above shows a timeline where data is being accessed to support an operation in the execution unit. If this operation is sequentially moving through memory, the execution unit still must wait while the data is being read from memory. Of course, the data that was read in with the previous cache line is available without the long latency. However, as the data in the cache line is exhausted, new reads from system memory will take place as shown in the figure.

Figure 3 below shows how prefetching the data can reduce idle time for both the execution unit and the system memory bus. This example shows a sequential access pattern that can be used to infer which data will be used in the future.

Figure 3. Memory access with prefetch [3a]

It is possible to overuse prefetches. If the activity on the system memory bus due to prefetches interferes with other accesses to memory, the benefit of the prefetch is reduced and could possibly be negated.

Memory Access
If a program accesses memory in a way that skips large distances from one memory location to another, the processor can be forced to wait as each access is read in from system memory. The usefulness of the cache decreases because the program is not taking advantage of the data that has been read into cache along with the required data.

One example of this is Non-Unit Stride. It demonstrates how thinking about the way the cache memory is being filled can lead to a different method for performing a task. In this case, stride refers to the distance between accesses to data in a data structure and not the stride calculation discussed earlier.

Non-Unit Stride Memory Access [4]
An issue that can have considerable impact on performance is accessing memory in a non-unit stride fashion. This means that as the inner-most loop increments consecutively, memory is accessed from non-adjacent locations. Depending on the size of the matrix, this could result in a Cache Miss for each element in the matrix column. For example, consider the matrix multiplication code shown in Example 1 below, written in “C” (a row-major language):

Example 1. Matrix Multiply with non-unit stride
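The original listing is not reproduced here, so the following is a minimal sketch consistent with the text: NUM×NUM integer matrices (NUM and the int element type are assumptions) multiplied with k, the subscript that walks down a column of b, in the innermost loop.

#define NUM 1024

int a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];

void matmul(void)
{
    int i, j, k;
    for (i = 0; i < NUM; i++)
        for (j = 0; j < NUM; j++)
            for (k = 0; k < NUM; k++)          /* k innermost */
                c[i][j] += a[i][k] * b[k][j];  /* b[k][j] jumps NUM ints per k */
}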

Its memory access pattern for the b matrix looks like the diagram in Figure 4 below:

Figure 4. Incrementing k accesses non-consecutive addresses

Notice that c[i][j] and a[i][k] both access consecutive memory locations when the innermost loops associated with the arrays are incremented. However, the b array, which is indexed by k and j, does NOT access memory in unit stride. When the loop reads b[k=0][j=0] and then the k loop increments by one to b[k=1][j=0], a jump over NUM memory locations occurs, skipping b[0][1], b[0][2] … b[0][NUM-1]. The solution for these types of problems is a loop transformation called Loop Interchange.

Example 2. Matrix Multiply with unit stride
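Again a sketch rather than the original listing: the same loop nest with j and k interchanged so that j, the subscript that walks along a row of b, is innermost.

void matmul_interchanged(void)
{
    int i, j, k;
    for (i = 0; i < NUM; i++)
        for (k = 0; k < NUM; k++)
            for (j = 0; j < NUM; j++)          /* j innermost */
                c[i][j] += a[i][k] * b[k][j];  /* b[k][j] now advances one int per j */
}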

Example 2 above shows the same matrix multiply as Example 1, but in this case index j is incremented in the innermost loop and index k in the loop outside it. After the loop interchange, the memory access pattern looks like this:

Figure 5. Incrementing j accesses consecutive addresses

Now the program is accessing the elements of the b matrix consecutively. The data that has been read into cache will be accessed before more data is read in from memory. This pattern of access may also support prefetching.

Grouping and Alignment
When looking at how a program accesses memory, design decisions can be made that will take the most advantage of cache. If a data set that a program is working on is smaller than the cache line size of the processor, it is important to make sure the data is read into one cache line. This is done by grouping the data together in a structure and aligning that structure so it begins at the start of a cache line.

Two Indexes
For example [5], suppose a function uses local variables i and j as subscripts into a 2-dimensional array. They might be declared as follows:

int i, j;

These variables are commonly used together, but they can fall in different cache lines, which could be detrimental to performance. If the variables are used in a part of the program that is performance-critical, you could instead declare them as follows:

struct { int i, j; } sub;

This relies on the compiler's default alignment for structures. This default alignment is typically enough to ensure that the structure is aligned in cache such that both indexes are in the same cache line. Of course, i and j must now be referred to as sub.i and sub.j.

Or, the alignment of the structure can be specified if the compiler supports this feature. Here is an example using the attribute feature of gcc to align a structure on an 8-byte boundary:

struct { int i, j; } sub __attribute__ ((aligned (8)));

If many functions with such subscript pairs are used in performance-critical code, it may be more convenient to declare and use a struct type for them, as in the following example:

typedef struct { int i, j; } Sub __attribute__ ((aligned(8)));

Note that the compiler typically will have limits on the size of the alignment parameter.

This simple example can be expanded to include other data types and collections of data. The compiler alignment directive can also be used within a structure to align fields to different boundaries, as in the sketch below.
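As a hypothetical illustration of field-level alignment, again using the gcc attribute syntax shown above:

struct record {
    char flags;
    /* Assumed example: force this field to start on a 16-byte
       boundary inside the structure. */
    int  counters[4] __attribute__ ((aligned (16)));
};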

Array of Structures versus a Structure of Arrays
The following looks at two ways (Example 3 and Example 4, below) to represent the same information. Even though the Structure of Arrays would typically use less memory, it is interesting to see how the design of the representation of the data could be influenced by how the data is accessed and how it will be read into cache.

Example 3. Array of Structures
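The original listing is not shown; here is a sketch matching the text's references to elements a through e and entries separated by 100 integers:

struct s {
    int a, b, c, d, e;
};
/* One entry's elements sit next to each other in memory. */
struct s array_of_struct[100];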
Example 4. Structure of Arrays
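And the corresponding structure of arrays, under the same assumptions:

struct {
    int a[100];  /* all the a values are adjacent to one another */
    int b[100];
    int c[100];
    int d[100];
    int e[100];
} struct_of_array;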

If the elements of the structure are all accessed together, then array_of_struct will group the elements of the structure together in cache. For example, if element a is accessed and then c, in array_of_struct they would be next to each other in cache but would be separated by 100 integers in struct_of_array. Use of the compiler alignment directives could be added to refine the locality of the structure elements in cache.

However, if the data within the arrays are accessed together, the struct_of_array would provide better locality in the cache. For example, if the data in array a are accessed sequentially, the elements of array a would be read into cache together. This would be true even if other elements of the structure were also being accessed at the same time, since those array elements would also be grouped together in cache.

When the elements of the structure are not accessed with equal frequency, such as when elements of a are accessed ten times more often than the other entries, struct_of_array not only saves memory, but also prevents fetching unnecessary elements b, c, d, and e.

Conclusion
This paper has discussed some basic concepts of cache operation and organization and presented some software examples to demonstrate how an awareness of the operation of the cache on your system can help improve software performance.

The examples are probably most appropriately applied to portions of code that are performance-critical. Performance analysis tools are available that not only highlight frequently executed sections of code, but also use processor performance counters to identify code that is experiencing a high level of cache misses or other important cache events.

Steve Daily is a technical consulting engineer in the Software Products Division at Intel Corp. for Embedded C/C++ compiler products. Steve has worked in various embedded software capacities over the years, most recently as a firmware developer for system management controllers for telecommunications server blades.

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. For more information, please visit www.embedded.com/esc/sv.

References
1. H. Stone, “High-Performance Computer Architecture”, Addison-Wesley, 1993, pp. 29-30.
2. Tom Shanley, “The Unabridged Pentium 4”, Addison-Wesley, 2005, p. 404.
3. “IA-32 Intel Architecture Optimization Reference Manual”, Intel Corporation, 2004, [3a] pp. 6-20 to 6-21, [3b] p. 2-38.
4. “Intel C++ Compiler and Intel Fortran Compiler 8.x Performance Guide”, Intel Corporation, 2004, pp. 20-22.
5. “Intel C++ Compiler for Linux* Systems User's Guide”, Intel Corporation, 2004, p. 361.
