A good way to do this would be to have a table (the page table) containing an entry for each page in the whole virtual address space, with that entry containing the correct physical address.
This is clearly a fairly large data structure and is going to have to be stored in main memory. But there are two big problems:
Problem #1. we now need two references to memory to do any load or store, and that's obviously hopeless for performance. You may foresee the answer to this: We can use a high-speed cache memory to store translation entries and go to the memory-resident table only when we miss in the cache.
Since each cache entry covers 4 KB of memory space, it's plausible that we can get a satisfactorily low miss rate out of a reasonably small cache. (At the time this scheme was invented, memory caches were rare and were sometimes also called "lookaside buffers," so the memory translation cache became a translation lookaside buffer or TLB; the acronym survives.)
Problem #2: The size of the page table. for a 32-bit application address space split into 4-KB pages, there are a million entries, which will take at least 4 MB of memory. We really need to find some way to make the table smaller, or there'll be no memory left to run the programs.
We'll defer any discussion of the solution for this, beyond observing that few real programs use anything like the 4 Gbytes addressable with 32 bits. More modest programs have huge holes in their program address space, and if we can invent some scheme that avoids storing all the "nothing here" translation entries corresponding to the holes, then things are likely to get better.
![]() |
| Figure 14.2 Desirable memory translation system. |
We've now arrived, in essence, at the memory translation system DEC figured out for its VAX minicomputer, which has been extremely influential in most subsequent architectures. It's summarized in Figure 14.2 above. The sequence of steps in which the hardware works is something like this:
Step #1. A virtual address is split into two, with the least significant bits (usually 12 bits) passing through untranslated—so translation is always done in pages (usually 4 KB).
Step #2. The more significant bits, or VPN, are concatenated with the currently running thread's ASID to form a unique page address.
Step #3. We look in the TLB (translation cache) to see if we have a translation entry for the page. If we do, it gives us the high-order physical address bits and we've got the address to use.
The TLB is a special-purpose store and can match addresses in various useful ways. It may have a global flag bit that tells it to ignore the value of ASID for some entries, so that these TLB entries can be used to map some range of virtual addresses for every thread.
Similarly, the VPN may be stored with some mask bits that cause some parts of the VPN to be excluded from the match, allowing the TLB entry to map a larger range of virtual addresses.
Both of these features are available in MIPS MMUs (there's no variable size pages in some very old MIPS CPUs, though).
Step #4. There are usually extra bits (flags) stored with the PFN that are used to control which kind of access is allowed—most obviously, to permit reads but not writes. We'll discuss the MIPS architecture's flags later in this series.
If there's no matching entry in the TLB, the system must locate or build an appropriate entry (using main-memory-resident page table information) and load it into the TLB and then run the translation process again.
In the VAX minicomputer, this process was controlled by microcode and seemed to the programmer to be completely automatic. If you build the right format of page table in memory and point the hardware at it, all memory translation just works.
Origins of the MIPS Design
The MIPS designers wanted to figure out a way to offer the same
facilities as the VAX with as little hardware as possible. The
microcoded TLB refill was not acceptable, so they took the brave step
of consigning this part of the job to software.
That means that apart from a register to hold the current ASID, the MMU hardware is simply a high-speed, fixed-size table of translations. System software can (and usually does) use the hardware as a cache of entries from some kind of comprehensive memory-resident page table, so it makes sense to call the hardware table a TLB.
But there's nothing in the TLB hardware to make it a cache, except this: When presented with an address it can't translate, the TLB triggers a special exception (TLB refill) to invoke the software routine. Some care is taken with the details of the TLB design, the associated control registers, and the refill exception to help the software to be efficient.
The MIPS TLB has always been implemented on chip. The memory translation step is required even for cached references, so it's very much on the critical path of the machine. That meant it had to be small, particularly in the early days, so it makes up for its small size by being clever.
It's basically a genuine associative memory. Each entry in an associative memory consists of a key field and a data field; you present the key and the hardware returns the data of any entry the key matches. Associative memories are wonderful, but they are expensive in hardware. MIPS TLBs have had between 32 and 64 entries; a store of this size is manageable as a silicon design.
All contemporary CPUs use a TLB in which each entry is doubled up to map two consecutive VPNs to independently specified physical pages. The paired entries double the amount of memory that can be mapped by the TLB with only a little extra logic, without requiring any large-scale rethinking of TLB management.
You will see the TLB referred to as being fully associative; this emphasizes that all keys are really compared with the input value in parallel. (The common 32-entry paired TLB would be correctly, if pedantically, described as a 32-way set-associative store, with two entries per set.)
![]() |
| Figure 14.3 TLB entry fields. |
The TLB entry is shown schematically in Figure 14.3 above. For the moment, we'll assume that pages are 4 Kbytes in size. The TLB's key - the input value - consists of three fields:
Field #1: VPN2 - The page number is just the high-order bits of the virtual address - the bits left when you take out the 12 low bits that address the byte within page. The "2" in VPN2 emphasizes that each virtual entry maps 8 Kbytes because of the doubled output field. Bit 12 of the virtual address selects either the first or the second physical-side entry of the pair.
Field #2: PageMask - Controls how much of the virtual address is compared with the VPN and how much is passed through to the physical address; a match on fewer bits maps a larger region. A "1" bit causes the corresponding address bit to be ignored. Some MIPS CPUs can be set up to map as much as 16 MB with a single entry. The most significant ignored bit is used to select the even or odd entry.
Field #3: ASID - Marks the translation as belonging to a particular address space, so this entry will only be matched if the thread presenting the address has EntryHi(ASID) set equal to this value.
The G bit, if set, disables the ASID match, making the translation entry apply to all address spaces (so this part of the address map is shared between all spaces). The ASID is 8 bits: The OS-aware reader will appreciate that even 256 is too small an upper limit for the number of simultaneously active processes on a big UNIX system.
However, it's a reasonable limit so long as "active" in this context is given the special meaning of "may have translation entries in the TLB." OS software has to recycle ASIDs where necessary, which will involve purging the TLB of translation entries for any processes being downgraded from active.
It's a dirty business, but so is quite a lot of what OSs have to do; and 256 entries should be enough to make sure it doesn't have to be done so often as to constitute a performance problem.
For programming purposes, it's easiest if the G bit is kept in the kernel's page tables with the output-side fields. But when you're translating, it belongs to the input side. On MIPS32/64 CPUs, the two output-side values are AND-ed together to produce the value that is used, but the realistic outcome is that you must make sure the G bit is set the same in both halves.
The TLB's output side gives you the physical frame number and a small but sufficient bunch of flags:
Flag #1 - Physical frame number (PFN): This is the physical address with the low bits cut off (the low 12 bits if this is representing a 4-Kbyte page). Write control bit (D): Set 1 to allow stores to this page to happen. The "D" comes from this being called the dirty bit; see the next section for why.
Flag #2 - Valid bit (V): If this is 0, the entry is unusable. This seems pretty pointless: Why have a record loaded into the TLB if you don't want the translation to work? There are two reasons. The first is that the entry translates a pair of virtual pages, and maybe only one of them ought to be there.
The other is that the software routine that refills the TLB is optimized for speed and doesn't want to check for special cases. When some further processing is needed before a program can use a page referred to by the memory-held table, the memory-held entry can be left marked invalid.
After TLB refill, this will cause a different kind of trap, invoking special processing without having to put a test in every software refill event.
Flag # 3 - Cache control (C): This 3-bit field's primary purpose is to distinguish cacheable (3) from uncached (2) regions.
But that leaves six other values, used for two somewhat incompatible
purposes: In shared-memory multiprocessor systems, different values are
used to hint whether the memory is shared (when hardware will have to work hard to
keep any cached data consistent across the whole machine). In
"embedded" CPUs, different values select different local cache
management strategies: write-through versus write-back, for example.
Translating an address is now simple, and we can amplify the
description above:
CPU generates a program address: This might be an instruction fetch, a load, or store - and it's onewith an address that does not lie in the special unmapped regions of the MIPS address space.
The low 13 bits are separated off, and the resulting VPN2, together
with the current ASID (EntryHi(ASID)), is looked up in the TLB. The
match is affected by the the PageMask and by G fields of the various
TLB entries.
Valid? The V and D bits are consulted. If it isn't valid, or a store is being attempted with D unset, the CPU takes an exception. As with all translation traps, the BadVAddr register will be filled with the offending program address; as with any TLB exception, the TLB EntryHi register will be preloaded with the VPN of the offending address.
The BadVPN2 fields of the convenience registers Context (and XContext on 64-bit CPUs) will be preloaded (in part, as advertised) with the appropriate bits of the virtual address we failed to translate during a TLB refill exception. But the specification is less definite about how those address fields behave in other exceptions. It's probably a good idea to stick to BadVAddr in other exceptions.
Cached? If the C bit is set, the CPU looks in the cache for a copy of the physical location's data; if it isn't there, it will be fetched from memory and a copy left in the cache. Where the C bit is clear, the CPU neither looks in nor refills the cache.
Of course, the number of entries in the TLB permits you to translate only a relatively small number of program addresses—a few hundred KB worth. This is far from enough for most systems. The TLB is almost always going to be used as a software-maintained cache for a much larger set of translations.
When a program address lookup in the TLB fails, a TLB refill trap is taken. (Should this be called a "TLB miss" [which is what just happened] or a "TLB refill" [which is what we're going to do to sort it out]? I'm afraid we probably use both terms in MIPS documentation.)
System software has the following sequence of operations to perform:
1) It figures out whether there is a correct translation; if not, the trap will invoke the software that handles address errors.
2) If there is a correct translation, it constructs a TLB entry that will implement it.
3) If the TLB is already full (and it almost always is full in running systems), the software selects an entry that can be discarded.
4) The software writes the new entry into the TLB.
Keeping Track of Modified Pages
(Simulating "Dirty" Bits)
An operating system that provides a page for an application program to
use often wants to keep track of whether that page has been modified
since the OS last obtained it (perhaps from disk or network) or saved a
copy of it. Unmodified ("clean") pages may be quietly discarded, since
they can easily be recovered from a file system if they're ever needed
again.
In OS parlance the modified pages are called "dirty," and the OS must take care of them until the application program exits or the dirty page is cleaned by being saved away to backing store.
To help out with this process, it is common for CISC CPUs to maintain a bit in the memory-resident page table indicating that a write operation to the page has occurred.
The MIPS CPU does not support this feature, even in the TLB entries. The D bit of the page table (found in the EntryLo register) is a write-enable and of course is used to flag read-only pages.
So here's the trick:
1) When writable page is first loaded intomemory, you mark its page table entry with D clear (leaving it read-only).
2) When any write is attempted to the page, a trap will result; system software will recognize this as a legitimate write but will use the event to set a "modified" bit in the memory resident tables—which, since it's in the EntryLo(D) position, permits future writes to be done without an exception.
3) You will also want to set the D bit in the TLB entry so that the write can proceed (but since TLB entries are randomly and unpredictably replaced, this would be useless as a way of remembering the modified state).
How the Kernel Services a TLB
Refill Exception
MIPS's TLB refill exception has always had its own unique entry point
(at least as long as the CPU wasn't already in exception mode; in
Linux, that would be a fatal error, and isn't considered further).
When the exception routine is called, the hardware has set up EntryHi (VPN2) to the page number of the location we just couldn't translate: EntryHi is set exactly to what is required to create a new TLB entry to map the offending address.
The hardware has also set up a bunch of other address-related fields, but we're not going to use any of them. In particular, we're not going to use the convenient "MIPS standard" way of using Context to find the relevant page table entry.
The MIPS standard way requires that a (notionally very long) linear page table is constructed in kseg2 (it won't really take up excessive space, because kseg2 is mapped and the vast empty spaces of the table would never be mapped to real memory). But Linux really does not like to have the unique-to thread-group kernel mappings that would require.
Instead, Linux's TLB refill page tables are organized as a three-level hierarchy of tables (called "global," "middle," and just "PTE"). But cunning use of C macros allows the middle level to disappear completely without changing the code — a two-level structure is enough for 32-bit MIPS. (However, all three are required for the extended virtual memory space available with 64-bit MIPS.)
So you can find the data for any TLB entry you want - if it exists at all - by following the links, as shown in Figure 14.4 below.
![]() |
| Figure 14.4 Linux two-level page table (32-bit MIPS design). |
This structure is reasonably economical of kernel memory: A largish Linux program with a 50-MB virtual address space has about 12-K 4-KB pages, each needing four bytes of PTE: That makes 48 KB or 12 pages.
That's an underestimate, because the address space components have holes in them, but a reasonable-size thread group's mapping needs take only 15 or so pages. Kernel (kseg2) mappings take more, but they're common to all memory maps, so there is only one set of PTEs for kernel mappings.
Most 32-bit CPUs are restricted to a 32-bit physical address, too: Where that's so, there are six unused high-order bits in the EntryLo0-1 registers. Linux recycles those to remember some software state about the page table entries.
This does mean that the TLB refill handler for this 32-bit needs to do two index calculations and three loads. It's certainly longer than the tiny instruction sequence of the original magic MIPS scheme (where the index calculation is done by magic by the way the Context register works, and the only loads are from the final page table), but it's not bad.
You may have noticed that this description is very particular to a certain sort of MIPS CPU. Do all Linux systems have a unique kernel?
Well, no: But in the current Linux/MIPS kernel certain critical routines - including the TLB miss exception handler - are binaries generated by table-driven kernel software during start-up and tailored to the requirements of this particular CPU.
Care and Maintenance of the TLB
The MIPS TLB is "just" a collection of translation entries. Entries get
into it purely by the refill handler's action. Most entries get
overwritten in the same way, of course. But sometimes the kernel needs
to change a page table entry, and then it's important to invalidate any
copy in the TLB.
For a single entry, we can look up the TLB entry using the virtual address + ASID of the original entry, doing a tlbp: If the matching entry is there, we can overwrite that page's entry with an invalid one.
Sometimes, though, you need to do surgery on a larger scale. The ASID mechanism allows the TLB to hold entries for 256 different memory maps, but a Linux system will commonly run more than 256 processes between start-up and shutdown.
So there are times when a whole batch of translations must be rescinded, because the ASID is going to be recycled for a new process and the old entries would be damagingly wrong.
There's no elegant way of doing that. You just have to iterate through all the TLB entries (index values go in the Index register), read each entry out into the usual registers, check the ASID value retrieved into EntryHi(ASID), and invalidate any that match the victim ASID value.
Memory Translation and 64-Bit
Pointers
When the MIPS architecture was invented, 32-bit CPUs had been around
for a while, and the largest programs' data sets were already moving up
toward 100 MB - the address space had only 6 bits or so to spare. (Historically, application program demand
for memory space seems to have grown at about three-quarters of a bit
per year.)
There was therefore every reason to be reasonably careful with the 32-bit space and not to reduce it by profligate fragmentation; this is why application programs (running with user privilege) keep 31 bits' worth of addressing for themselves.
When the MIPS III instruction set introduced 64-bit registers in 1991, it was leading the industry and, as we discussed in section 2.7, MIPS was probably four to six years ahead of real pressure on a 32-bit address boundary.
The doubling of register size only had to yield a few bits of extra address space to be remarkably future-proof; it's been more important to be cautious about the potentially exploding size of OS data structures than to make efficient use of all address space.
The limitations to the practical address space resulting from the basic 64-bit memory map are not going to be reached for a while; they theoretically permit the mapped user and other spaces to growto 61 bits without any reorganization. But so far, a 40-bit user virtual space has been ample.
Most other 64-bit Linux systems have 8-Kbyte pages, but that would be annoying for MIPS. The MIPS TLB can map either 4 Kbyte or 16 Kbyte in a single entry, but not 8 Kbyte, so 64-bit MIPS kernels use 4-Kbyte pages, with 16-Kbyte pages a desirable option in some brave future.
If you take a look back at Figure 14.4 and imagine a set of intermediate (PMD) tables between the PGD and PTEs, you can comfortably resolve a 40-bit virtual address in a three-level table. We'll leave details to those enthusiastic enough to read the source code.
Next in Part 5: MIPS specific issues in the Linux kernelThis series of articles is based on material from "See MIPS Run Linux," by Dominic Sweetman, used with the permission of the publisher, Morgan Kaufmann/Elsevier, which retains full copyrights. It can be purchased on line.
Dominic Sweetman is a software/hardware boundary expert based in London, England, who previously served as managing director at Algorithmics Ltd.