This is the second in a series of columns on how hardware architectures influence software performance. Last month (Editor's Note: 20 years ago), we examined two trade-offs in cache structures: write-through versus write-back and direct-mapped versus associative. This month we look at virtual memory and its interface with caches. This topic is particularly important to programmers working with multitasking systems. The way the hardware implements virtual memory addressing can drastically impact code performance.
Virtual memory allows a program to use a larger addressing space than is offered by the actual bit width of the processor's address bus. A processor with a 16-bit-wide address bus (for example, the Z80) can directly access 2^16, or 65,536, memory addresses. A virtual memory system is typically employed when the same processor must access a larger range of addresses, typically 2^32, or about 4,000 Mbytes.
Obviously, this is a larger addressing range than most people need. However, with virtual addressing this large space can be put to good use. One common way to realize a large addressing space is segmentation, which translates the logical address generated by a program into a linear address by adding it to a segment offset. Here, however, we'll examine a more sophisticated virtual-memory mechanism called paging.
Paging is used to translate a linear address into a physical address. Some systems don't use paging but do use segmentation, in which case the linear address generated by adding a segment offset to a logical address is equivalent to a physical address. Some systems use paging but not segmentation, in which case the linear address is the same as the logical address generated by the program.
Whether segmentation is employed or the linear address is simply the same as the logical address, paging translates the linear address into a physical address by adding an offset to it. The offset is held in a page-table entry, which is located by an indexed offset into a page-table directory. The base of the page-table directory itself is directly addressed by an on-chip control register, called a root register.
Hierarchy of offsets
That may seem a little confusing, so let's look at it from another perspective. Figure 1 shows a typical paging system (the one used by the 80386).
The program generates a 32-bit linear address divided into three parts: a directory index, a page-table reference, and a page offset. The upper 10 bits of the 32-bit linear address, the directory index, are added to the page-table directory base in the root register. The page-table directory entry contains the address of a page table. The next 10 bits of the linear address are then added to the page-table address to find the required page reference in the table. Finally, the last 12 bits of the 32-bit linear address are added to an entry in the page table to find the actual physical address.
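The two-level walk just described can be sketched in C. This is a hypothetical software model, not the 80386's actual microcode: the names `translate`, the field macros, and the array-of-pointers directory are invented for illustration, but the model follows the 10/10/12 bit split and the add at each level described above.

```c
#include <stdint.h>

/* Field extraction for the 80386-style 32-bit linear address:
   10-bit directory index, 10-bit page-table index, 12-bit offset. */
#define DIR_INDEX(lin)   (((uint32_t)(lin) >> 22) & 0x3FFu) /* top 10 bits  */
#define TABLE_INDEX(lin) (((uint32_t)(lin) >> 12) & 0x3FFu) /* next 10 bits */
#define PAGE_OFFSET(lin) ((uint32_t)(lin) & 0xFFFu)         /* low 12 bits  */

/* Hypothetical model of the walk: the "root" is a 1024-entry directory
   of page-table pointers; each page table holds 1024 page-frame bases. */
uint32_t translate(const uint32_t *dir[], uint32_t linear)
{
    const uint32_t *table = dir[DIR_INDEX(linear)];  /* first memory read  */
    uint32_t frame = table[TABLE_INDEX(linear)];     /* second memory read */
    return frame + PAGE_OFFSET(linear);              /* final add          */
}
```

Note the two memory reads: that's the overhead the translation look-aside buffer discussed below exists to avoid.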
The use of a triple-tier hierarchy (root register, page-table directory, and page table) allows a great deal of programming flexibility. Multitasking systems usually load a different value into the root register for each process, providing isolation between individual tasks. At the same time, since the page-table directory can point to different page tables, several processes can share common-area page tables for protected functions such as ROM. The page tables themselves allow a program to store and access sequential data and instructions in different physical locations.
Sadly, this process is quite complex; it involves separate table lookups in the directory and the page table to find the right physical address. If the page-table directory and all the page tables are kept in main memory, the processor needs to make two separate memory accesses to find the byte. To reduce the overhead of directory and page-table lookups, some processors use a translation look-aside buffer. A TLB is a cache, usually fully associative, that's used to hold active page-table entries. The 80386, for example, uses a 32-word, fully associative TLB.
When a page-table entry is found in an on-chip TLB (a TLB hit), the page-table lookup doesn't normally add to the address translation time. When a TLB miss occurs, the processor must look again in the page-table directory, calculate the offset in the indexed page table, look up the page-table entry, and calculate the physical address by adding the linear address and page-table entry. The processor then replaces an older TLB entry with the newly accessed page-table entry, on the assumption that the page will be used again very soon, and continues on its way. All this activity (three adds and two memory reads before reaching the physical address) obviously results in some performance degradation. In normal operation, TLBs yield about a 98% hit rate. The hit rate usually decreases most significantly when tasks use deeply nested procedures. This is because of the way the processor decides what page-table entry to discard from the TLB when there's a miss.
The two main cache-replacement schemes are the least frequently used (LFU) and least recently used (LRU) algorithms. From the programmer's point of view, the LFU algorithm is more elegant. An occasional reference to some system variable in a distant and rarely used page, for example, has a low cache life expectancy with LFU replacement.
Unfortunately, LFU algorithms are very expensive to build in hardware. All memory systems below the mini super-computer grade currently use the LRU replacement algorithm for instruction caches, data caches, and TLBs, placing some stringent limits on the capacity of the memory system.
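To make LRU replacement concrete, here's a minimal C model of a fully associative, 32-entry TLB, the size given above for the 80386. All the type and function names are invented for illustration, and a real TLB does its tag comparison in parallel hardware rather than in a loop; only the replacement policy is the point here.

```c
#include <stdint.h>

#define TLB_ENTRIES 32  /* the 80386's fully associative TLB size */

/* Hypothetical software model of one TLB entry: a virtual page number
   (the tag), the cached page-table entry, and a logical timestamp
   used to find the least recently used victim. */
typedef struct {
    uint32_t vpn;
    uint32_t pte;
    uint32_t age;
    int      valid;
} tlb_entry;

typedef struct {
    tlb_entry e[TLB_ENTRIES];
    uint32_t  clock;
} tlb;

/* Returns 1 on a hit (writing *pte); on a miss the caller must do the
   two-read page-table walk and then call tlb_fill with the result. */
int tlb_lookup(tlb *t, uint32_t vpn, uint32_t *pte)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].vpn == vpn) {
            t->e[i].age = ++t->clock;   /* mark most recently used */
            *pte = t->e[i].pte;
            return 1;
        }
    return 0;
}

/* Replace an invalid entry if one exists, else the oldest (LRU) entry. */
void tlb_fill(tlb *t, uint32_t vpn, uint32_t pte)
{
    int victim = 0;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!t->e[i].valid) { victim = i; break; }
        if (t->e[i].age < t->e[victim].age)
            victim = i;
    }
    t->e[victim] = (tlb_entry){ vpn, pte, ++t->clock, 1 };
}
```

The 33rd distinct page filled into this model evicts whichever entry has gone longest without a lookup, which is exactly the behavior examined in the scenario below.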
The thrashing problem
Let's consider a 32-entry, fully associative LRU TLB. When 33 accesses occur to different pages, the 33rd access will cause the first accessed page to be discarded from the buffer. This may seem like a low TLB miss frequency, but consider a time-sliced multitasking system running six processes, each using different pages for instructions and data (an advisable programming practice). In this situation, at least 12 pages are active at any given time. If each process uses double-nested procedures (or jumps between three pages) in each time slice, there are 36 active pages–four more active pages than are available in the TLB.
Now consider what happens as the operating system cycles through each of the six time slices sequentially. When the processor reaches the sixth time slice, there will have been 30 sequential accesses to different pages. By the end of the sixth time slice, the operating system will have accessed 36 pages and the cache manager will have discarded from the TLB the first four page-table entries accessed in the cycle. These four page-table entries were the least recently used (they're in the first of the six time slices).
As the operating system cycles back to the first time slice, the TLB needs to resurrect those same page-table entries in itself. In so doing, it discards the next six page-table entries. The discarded entries are–you guessed it–those used by the second time slice. As the processor continues on its merry way, it continues to discard exactly those page-table entries from the TLB that it will need next.
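The arithmetic above can be checked with a short simulation (all names here are made up for the demonstration): cycle through 36 pages in a fixed order against a 32-entry LRU buffer and count the misses. After the first warm-up pass, every single access misses.

```c
#define CAPACITY 32   /* TLB entries, per the 80386 */
#define PAGES    36   /* active pages in the six-task scenario */

/* Minimal LRU model: each slot holds a page number and a logical
   timestamp; a miss evicts the least recently used slot. */
typedef struct {
    int      page[CAPACITY];
    unsigned age[CAPACITY];
    unsigned clock;
    int      used;
} lru;

static int lru_access(lru *c, int page)   /* returns 1 on hit, 0 on miss */
{
    for (int i = 0; i < c->used; i++)
        if (c->page[i] == page) { c->age[i] = ++c->clock; return 1; }
    int victim = 0;
    if (c->used < CAPACITY)
        victim = c->used++;               /* fill an empty slot first */
    else
        for (int i = 1; i < CAPACITY; i++)
            if (c->age[i] < c->age[victim])
                victim = i;               /* evict the oldest entry */
    c->page[victim] = page;
    c->age[victim] = ++c->clock;
    return 0;
}

/* Access 36 pages in the same sequential order, as the six
   round-robin time slices do, and count the misses. */
int misses_per_cycle(int cycles)
{
    lru c = { .used = 0, .clock = 0 };
    int misses = 0;
    for (int n = 0; n < cycles; n++)
        for (int p = 0; p < PAGES; p++)
            misses += !lru_access(&c, p);
    return misses;
}
```

Running this model, each cycle of 36 accesses produces 36 misses: by the time the sequence wraps around, every page it needs next has just been evicted.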
This phenomenon, called page-table thrashing, essentially negates the performance improvement that the TLB should yield. With continual thrashing, overall program speed is at least three times slower than the optimum because the processor needs to look up three memory locations in every cycle instead of just one. A threefold slowdown is enough to warrant real concern.
Thrashing can deteriorate performance even more if there's a data cache. This is because the data-cache mechanism sees no difference between page tables and any other type of data. As a result, page tables are loaded into the data cache whenever they're placed in the TLB. As each page table and page-table directory is encountered, the old contents of the affected cache lines are flushed and the lookup tables loaded instead. Since data caches customarily use an LRU algorithm, shared page tables aren't maintained in the data cache unless the cache is set-associative and there are sufficient sets to maintain the page-table directory in the data cache (as well as the page table itself and the program data). This situation is rare indeed, since most data-cache systems are two-way set-associative.
On the other hand, data-cache flushing isn't always a problem with page tables. If the TLB miss occurs when the operating system switches to a new task, the items discarded from the data cache belong to the last task anyway and the cache flush doesn't impact performance.
Page-table thrashing can cause data-cache thrashing as well. Consider, for example, a rare procedure that occurs during the processing of a block of contiguous data. One cache set is flushed and filled with the page-table directory; another cache set is flushed and filled with the indexed page table. If there's yet another available cache set holding current data, it will be the next to go (rather than the set holding the directory or page table) because it was the least recently used. And all this data-cache performance deterioration occurs even if the procedure doesn't access any data!
The impact of hardware architecture on code performance cannot be ignored, especially with paged virtual memory systems. The sizes and types of instruction caches, data caches, and TLBs can have vastly disparate effects on the performance of what might seem to be a simple application, especially in multitasking environments.
Thankfully, following some simple rules can help prevent undue thrashing. For example, it's possible to minimize data-cache performance deterioration on TLB misses, even when the page-table lookup failure occurs without a context switch. Macros should be expanded inline rather than called as procedures; long-unused variables should be addressed absolutely rather than relatively.
Other ways to prevent thrashing: avoid nesting procedure calls wherever possible, minimize the number of concurrent tasks, and don't use jumps larger than the page size unless absolutely necessary.
Next month we'll look at some special techniques to streamline instruction and data flow to avoid thrashing. We'll also examine how interleaved memory banks can result in varying performance with different data and code structures.
Back in 1989, Ernest Meyer was a contributing editor for Embedded Systems Programming, now Embedded Systems Design.