Origins of the MIPS Design
The MIPS designers wanted to figure out a way to offer the same
facilities as the VAX with as little hardware as possible. The
microcoded TLB refill was not acceptable, so they took the brave step
of consigning this part of the job to software.
That means that apart from a register to hold the current ASID, the
MMU hardware is simply a high-speed, fixed-size table of translations.
System software can (and usually does) use the hardware as a cache of
entries from some kind of comprehensive memory-resident page table, so
it makes sense to call the hardware table a TLB.
But there's nothing in the TLB hardware to make it a cache, except
this: When presented with an address it can't translate, the TLB
triggers a special exception (TLB refill) to invoke the software
routine. Some care is taken with the details of the TLB design, the
associated control registers, and the refill exception to help the
software to be efficient.
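The division of labor described above can be sketched as a toy model. This is only an illustration: the names `ToyTLB`, `TLBRefill`, and `translate`, the four-entry size, the round-robin replacement, and the dict standing in for the memory-resident page table are all assumptions, not anything the hardware defines.

```python
# Toy model: a fixed-size translation table refilled by software on a miss.
# Sizes, names, and the replacement policy are illustrative assumptions.

PAGE_SHIFT = 12                       # 4-Kbyte pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1
TLB_SIZE = 4                          # deliberately tiny


class TLBRefill(Exception):
    """Stands in for the hardware TLB refill exception."""

    def __init__(self, vpn):
        super().__init__(f"no TLB entry for VPN {vpn:#x}")
        self.vpn = vpn


class ToyTLB:
    def __init__(self):
        self.entries = []             # (vpn, pfn) pairs, at most TLB_SIZE
        self.next_victim = 0

    def lookup(self, vaddr):
        """Hardware side: translate or raise. It never reads the page table."""
        vpn = vaddr >> PAGE_SHIFT
        for key, pfn in self.entries:
            if key == vpn:
                return (pfn << PAGE_SHIFT) | (vaddr & OFFSET_MASK)
        raise TLBRefill(vpn)

    def write(self, vpn, pfn):
        """Software side: install an entry, overwriting slots round-robin."""
        if len(self.entries) < TLB_SIZE:
            self.entries.append((vpn, pfn))
        else:
            self.entries[self.next_victim] = (vpn, pfn)
            self.next_victim = (self.next_victim + 1) % TLB_SIZE


def translate(tlb, page_table, vaddr):
    """Take the miss, run the 'refill routine', then retry the access."""
    try:
        return tlb.lookup(vaddr)
    except TLBRefill as miss:
        tlb.write(miss.vpn, page_table[miss.vpn])
        return tlb.lookup(vaddr)
```

Note that only `translate` knows about the page table. The table itself behaves as a cache purely because software chooses to refill it that way.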
The MIPS TLB has always been implemented on chip. The memory
translation step is required even for cached references, so it's very
much on the critical path of the machine. That meant it had to be
small, particularly in the early days, so it makes up for its small
size by being clever.
It's basically a genuine associative memory. Each entry in an
associative memory consists of a key field and a data field; you
present the key and the hardware returns the data of any entry the key
matches. Associative memories are wonderful, but they are expensive in
hardware. MIPS TLBs have had between 32 and 64 entries; a store of this
size is manageable as a silicon design.
All contemporary MIPS CPUs use a TLB in which each entry is doubled up to
map two consecutive VPNs to independently specified physical pages. The
paired entries double the amount of memory that can be mapped by the
TLB with only a little extra logic, without requiring any large-scale
rethinking of TLB management.
You will see the TLB referred to as being fully associative; this
emphasizes that all keys are really compared with the input value in
parallel. (The common 32-entry paired TLB would be correctly, if
pedantically, described as a 32-way set-associative store, with two
entries per set.)
Figure 14.3 TLB entry fields.
The TLB entry is shown schematically in Figure 14.3 above. For the moment,
we'll assume that pages are 4 Kbytes in size. The TLB's key - the input
value - consists of three fields:
Field #1: VPN2 - The page number is just the high-order bits of the
virtual address - the bits left when you take out the 12 low bits that
address the byte within the page. The "2" in VPN2 emphasizes that each
virtual entry maps 8 Kbytes because of the doubled output field. Bit 12
of the virtual address selects either the first or the second
physical-side entry of the pair.
Field #2: PageMask - Controls how much of the virtual address is
compared with the VPN and how much is passed through to the physical
address; a match on fewer bits maps a larger region. A "1" bit causes
the corresponding address bit to be ignored. Some MIPS CPUs can be set
up to map as much as 16 Mbytes with a single entry. The most
significant ignored bit is used to select the even or odd entry.
Field #3: ASID - Marks the translation as belonging to a particular
address space, so this entry will only be matched if the thread
presenting the address has EntryHi(ASID) set equal to this value.
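The key-side comparison built from these three fields can be sketched as follows. This is a minimal model, assuming 32-bit addresses and a page mask expressed directly in address bit positions; the function names `key_matches` and `selects_odd_half` are hypothetical, and the G bit is left out here.

```python
# Sketch of the TLB key comparison (assumptions: 32-bit addresses, and
# page_mask given in address bit positions, where a 1 means "ignore").

ADDR_BITS = 32
VPN2_SHIFT = 13                       # with 4-Kbyte pages, VPN2 starts at bit 13
LOW_BITS = (1 << VPN2_SHIFT) - 1      # bits 12..0 are never part of the key


def key_matches(vaddr, vpn2, page_mask, entry_asid, current_asid):
    """True if this entry's key matches the address and the current ASID."""
    compared = ((1 << ADDR_BITS) - 1) & ~LOW_BITS & ~page_mask
    same_pair = (vaddr & compared) == ((vpn2 << VPN2_SHIFT) & compared)
    return same_pair and entry_asid == current_asid


def selects_odd_half(vaddr, page_mask):
    """The most significant uncompared bit picks the even or odd half."""
    not_compared = LOW_BITS | page_mask
    select_bit = not_compared.bit_length() - 1   # bit 12 when page_mask == 0
    return ((vaddr >> select_bit) & 1) == 1
```

With `page_mask == 0` the entry covers an 8-Kbyte pair and bit 12 selects the half. With bits 14 and 13 masked (`0x6000`), the same entry covers a pair of 16-Kbyte pages and bit 14 does the selecting.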
The G bit, if set, disables the ASID match, making the translation
entry apply to all address spaces (so this part of the address map is
shared between all spaces). The ASID is 8 bits: The OS-aware reader
will appreciate that even 256 is too small an upper limit for the
number of simultaneously active processes on a big UNIX system.
However, it's a reasonable limit so long as "active" in this context
is given the special meaning of "may have translation entries in the
TLB." OS software has to recycle ASIDs where necessary, which will
involve purging the TLB of translation entries for any processes being
downgraded from active.
It's a dirty business, but so is quite a lot of what OSs have to do;
and 256 ASIDs should be enough to make sure it doesn't have to be
done so often as to constitute a performance problem.
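One way an OS might recycle ASIDs is sketched below. Note that this swaps in a simpler policy than the selective purge described above: when the 256 values run out, it flushes the whole TLB and starts a new "generation", so that every process has to be given a fresh ASID before it runs again. The class `AsidAllocator`, its fields, and the `flush_tlb` callback are hypothetical names for illustration.

```python
# Sketch of ASID recycling with a generation counter (an assumed policy,
# not one the text prescribes). flush_tlb is a caller-supplied callback.

class AsidAllocator:
    NUM_ASIDS = 256                   # the ASID field is 8 bits

    def __init__(self, flush_tlb):
        self.flush_tlb = flush_tlb
        self.generation = 1
        self.next_asid = 0
        self.assigned = {}            # pid -> (generation, asid)

    def asid_for(self, pid):
        """Return a usable ASID for pid, recycling when the space is full."""
        held = self.assigned.get(pid)
        if held is not None and held[0] == self.generation:
            return held[1]            # still "active" in the current generation
        if self.next_asid == self.NUM_ASIDS:
            self.flush_tlb()          # no stale entries may survive reuse
            self.generation += 1
            self.next_asid = 0
        asid = self.next_asid
        self.next_asid += 1
        self.assigned[pid] = (self.generation, asid)
        return asid
```

The flush happens only once per 256 allocations, which is the sense in which the limit need not become a performance problem.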
For programming purposes, it's easiest if the G bit is kept in the
kernel's page tables with the output-side fields. But when you're
translating, it belongs to the input side. On MIPS32/64 CPUs, the G
bits from the two output-side halves are AND-ed together to produce the
value that is used; in practice, you must make sure the G bit is set
the same in both halves.
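The effect of the AND can be shown in a couple of lines. This is a sketch with hypothetical function names; the G values are passed as plain 0/1 integers rather than extracted from real register images.

```python
# Sketch: how the two output-side G bits combine, and how the result
# relaxes the ASID check. Function names are illustrative assumptions.

def entry_is_global(g_even, g_odd):
    """The entry is global only if G is set in both halves."""
    return bool(g_even & g_odd)


def asid_check_passes(is_global, entry_asid, current_asid):
    """A global entry matches in every address space."""
    return is_global or entry_asid == current_asid
```

Setting G in only one half yields an entry that is not global at all, which is why the page-table code should keep the two halves in step.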
The TLB's output side gives you the physical frame number and a
small but sufficient bunch of flags:
Flag #1 - Physical frame number (PFN): This is the physical address
with the low bits cut off (the low 12 bits if this is representing a
4-Kbyte page).
Write control bit (D): Set 1 to allow stores to this page to happen.
The "D" comes from this being called the dirty bit; see the next
section for why.
Flag #2 - Valid bit (V): If
this is 0, the entry is unusable. This seems pretty pointless: Why have
a record loaded into the TLB if you don't want the translation to work?
There are two reasons. The first is that the entry translates a pair of
virtual pages, and maybe only one of them ought to be there.
The other is that the software routine that refills the TLB is
optimized for speed and doesn't want to check for special cases. When
some further processing is needed before a program can use a page
referred to by the memory-held table, the memory-held entry can be left
marked invalid.
After TLB refill, this will cause a different kind of trap, invoking
special processing without having to put a test in every software
refill event.
Flag #3 - Cache control (C): This 3-bit field's primary purpose is to
distinguish cacheable (3) from uncached (2) regions.
But that leaves six other values, used for two somewhat incompatible
purposes: In shared-memory multiprocessor systems, different values are
used to hint whether the memory is shared (when hardware will have to work hard to
keep any cached data consistent across the whole machine). In
"embedded" CPUs, different values select different local cache
management strategies: write-through versus write-back, for example.
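Putting the output-side fields together, what happens once an entry has matched might be modeled like this. It is a sketch for 4-Kbyte pages only; the `Half` record, the `access` function, and the status strings are hypothetical, and the traps are named descriptively rather than by their architectural exception names.

```python
# Sketch: using one half of a matched TLB entry (4-Kbyte pages assumed).
from dataclasses import dataclass

PAGE_SHIFT = 12
OFFSET_MASK = (1 << PAGE_SHIFT) - 1
UNCACHED = 2                          # C-field value for uncached, per the text


@dataclass
class Half:
    pfn: int      # physical frame number
    c: int        # 3-bit cache control field
    d: bool       # write control ("dirty") bit
    v: bool       # valid bit


def access(half, vaddr, is_store):
    """Return (status, physical address, uncached?) for a matched entry."""
    if not half.v:
        return ("invalid-entry trap", None, None)   # matched, but unusable
    if is_store and not half.d:
        return ("write-protect trap", None, None)   # stores need D set
    paddr = (half.pfn << PAGE_SHIFT) | (vaddr & OFFSET_MASK)
    return ("ok", paddr, half.c == UNCACHED)
```

The ordering of the checks reflects the text: an invalid entry traps regardless of the access type, while the D bit only matters for stores.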