Getting down to basics: Running Linux on a 32-/64-bit RISC architecture - Part 4 -

Getting down to basics: Running Linux on a 32-/64-bit RISC architecture – Part 4

The mapping mechanism discussed in Part3 must allow a program to use a particular address withinitsown process/address space and translate that efficiently into a realphysical address to access memory.

A good way to do this would be to have a table (the page table)containing an entry for each page in the whole virtual address space,with that entry containing the correct physical address.

This is clearly a fairly large data structure and is going to haveto be stored in main memory. But there are two big problems:

Problem #1. we now need two references to memory to do any load orstore, and that's obviously hopeless for performance. You may foreseethe answer to this: We can use a high-speed cache memory to storetranslation entries and go to the memory-resident table only when wemiss in the cache.

Since each cache entry covers 4 KB of memory space, it's plausiblethat we can get a satisfactorily low miss rate out of a reasonablysmall cache. (At the time this schemewas invented, memory caches were rare and were sometimes also called”lookaside buffers,” so the memory translation cache became atranslation lookaside buffer or TLB; the acronym survives .)

Problem #2: The size of the page table. for a 32-bit applicationaddress space split into 4-KB pages, there are a million entries, whichwill take at least 4 MB of memory. We really need to find some way tomake the table smaller, or there'll be no memory left to run theprograms.

We'll defer any discussion of the solution for this, beyondobserving that few real programs use anything like the 4 Gbytesaddressable with 32 bits. More modest programs have huge holes in theirprogram address space, and if we can invent some scheme that avoidsstoring all the “nothing here” translation entries corresponding to theholes, then things are likely to get better.

Figure14.2 Desirable memory translation system.

We've now arrived, in essence, at the memory translation system DECfigured out for its VAX minicomputer, which has been extremelyinfluential in most subsequent architectures. It's summarized in Figure 14.2 above . The sequence ofsteps in which the hardware works is something like this:

Step #1. A virtual addressis split into two, with the least significant bits (usually 12 bits)passing through untranslated—so translation is always done in pages(usually 4 KB).

Step #2. The moresignificant bits, or VPN, are concatenated with the currently runningthread's ASID to form a unique page address.

Step #3. We look in the TLB(translation cache) to see if we have a translation entry for the page.If we do, it gives us the high-order physical address bits and we'vegot the address to use.

The TLB is a special-purpose store and can match addresses invarious useful ways. It may have a global flag bit that tells it toignore the value of ASID for some entries, so that these TLB entriescan be used to map some range of virtual addresses for every thread.

Similarly, the VPN may be stored with some mask bits that cause someparts of the VPN to be excluded from the match, allowing the TLB entryto map a larger range of virtual addresses.

Both of these features are available in MIPS MMUs (there's no variable size pages in somevery old MIPS CPUs, though ).

Step #4. There are usuallyextra bits (flags) stored with the PFN that are used to control whichkind of access is allowed—most obviously, to permit reads but notwrites. We'll discuss the MIPS architecture's flags later in thisseries.

If there's no matching entry in the TLB, the system must locate orbuild an appropriate entry (using main-memory-resident page tableinformation) and load it into the TLB and then run the translationprocess again.

In the VAX minicomputer, this process was controlled by microcodeand seemed to the programmer to be completely automatic. If you buildthe right format of page table in memory and point the hardware at it,all memory translation just works.

Origins of the MIPS Design
The MIPS designers wanted to figure out a way to offer the samefacilities as the VAX with as little hardware as possible. Themicrocoded TLB refill was not acceptable, so they took the brave stepof consigning this part of the job to software.

That means that apart from a register to hold the current ASID, theMMU hardware is simply a high-speed, fixed-size table of translations.System software can (and usually does) use the hardware as a cache ofentries from some kind of comprehensive memory-resident page table, soit makes sense to call the hardware table a TLB.

But there's nothing in the TLB hardware to make it a cache, exceptthis: When presented with an address it can't translate, the TLBtriggers a special exception (TLB refill) to invoke the softwareroutine. Some care is taken with the details of the TLB design, theassociated control registers, and the refill exception to help thesoftware to be efficient.

The MIPS TLB has always been implemented on chip. The memorytranslation step is required even for cached references, so it's verymuch on the critical path of the machine. That meant it had to besmall, particularly in the early days, so it makes up for its smallsize by being clever.

It's basically a genuine associative memory. Each entry in anassociative memory consists of a key field and a data field; youpresent the key and the hardware returns the data of any entry the keymatches. Associative memories are wonderful, but they are expensive inhardware. MIPS TLBs have had between 32 and 64 entries; a store of thissize is manageable as a silicon design.

All contemporary CPUs use a TLB in which each entry is doubled up tomap two consecutive VPNs to independently specified physical pages. Thepaired entries double the amount of memory that can be mapped by theTLB with only a little extra logic, without requiring any large-scalerethinking of TLB management.

You will see the TLB referred to as being fully associative; thisemphasizes that all keys are really compared with the input value inparallel. (The common 32-entry paired TLB would be correctly, ifpedantically, described as a 32-way set-associative store, with twoentries per set.)

Figure14.3 TLB entry fields.

The TLB entry is shown schematically in Figure 14.3 above . For the moment,we'll assume that pages are 4 Kbytes in size. The TLB's key – the inputvalue – consists of three fields:

Field #1: VPN2 – The pagenumber is just the high-order bits of the virtual address – the bitsleft when you take out the 12 low bits that address the byte withinpage. The “2” in VPN2 emphasizes that each virtual entry maps 8 Kbytesbecause of the doubled output field. Bit 12 of the virtual addressselects either the first or the second physical-side entry of the pair.

Field #2: PageMask -Controls how much of the virtual address is compared with the VPN andhow much is passed through to the physical address; a match on fewerbits maps a larger region. A “1” bit causes the corresponding addressbit to be ignored. Some MIPS CPUs can be set up to map as much as 16 MBwith a single entry. The most significant ignored bit is used to selectthe even or odd entry.

Field #3: ASID – Marks thetranslation as belonging to a particular address space, so this entrywill only be matched if the thread presenting the address hasEntryHi(ASID) set equal to this value.

The G bit, if set, disables the ASID match, making the translationentry apply to all address spaces (so this part of the address map isshared between all spaces). The ASID is 8 bits: The OS-aware readerwill appreciate that even 256 is too small an upper limit for thenumber of simultaneously active processes on a big UNIX system.

However, it's a reasonable limit so long as “active” in this contextis given the special meaning of “may have translation entries in theTLB.” OS software has to recycle ASIDs where necessary, which willinvolve purging the TLB of translation entries for any processes beingdowngraded from active.

It's a dirty business, but so is quite a lot of what OSs have to do;and 256 entries should be enough to make sure it doesn't have to bedone so often as to constitute a performance problem.

For programming purposes, it's easiest if the G bit is kept in thekernel's page tables with the output-side fields. But when you'retranslating, it belongs to the input side. On MIPS32/64 CPUs, the twooutput-side values are AND-ed together to produce the value that isused, but the realistic outcome is that you must make sure the G bit isset the same in both halves.

The TLB's output side gives you the physical frame number and asmall but sufficient bunch of flags:

Flag #1 – Physical framenumber (PFN): This is the physical address with the low bits cut off(the low 12 bits if this is representing a 4-Kbyte page). Write controlbit (D): Set 1 to allow stores to this page to happen. The “D” comesfrom this being called the dirty bit; see the next section for why.

Flag #2 – Valid bit (V): Ifthis is 0, the entry is unusable. This seems pretty pointless: Why havea record loaded into the TLB if you don't want the translation to work?There are two reasons. The first is that the entry translates a pair ofvirtual pages, and maybe only one of them ought to be there.

The other is that the software routine that refills the TLB isoptimized for speed and doesn't want to check for special cases. Whensome further processing is needed before a program can use a pagereferred to by the memory-held table, the memory-held entry can be leftmarked invalid.

After TLB refill, this will cause a different kind of trap, invokingspecial processing without having to put a test in every softwarerefill event.

Flag # 3 – Cache control(C): This 3-bit field's primary purpose is to distinguish cacheable (3)from uncached (2) regions.

But that leaves six other values, used for two somewhat incompatiblepurposes: In shared-memory multiprocessor systems, different values areused to hint whether the memory is shared (when hardware will have to work hard tokeep any cached data consistent across the whole machine ). In”embedded” CPUs, different values select different local cachemanagement strategies: write-through versus write-back, for example.

Translating an address is now simple, and we can amplify thedescription above:

CPU generates a program address: This might be an instruction fetch, a load, or store – and it's onewithan address that does not lie in the special unmapped regions of theMIPS address space.

The low 13 bits are separated off, and the resulting VPN2, togetherwith the current ASID (EntryHi(ASID)), is looked up in the TLB. Thematch is affected by the the PageMask and by G fields of the variousTLB entries.

TLB matches key:  Ifthere's no matching entry, we'll take a TLB refill exception. But wherethere is a match, that entry is selected. Bit 12 of the virtual addressselects which physical-side half we'll use. The PFN from the TLB isglued to the low-order bits of the program address to form a completephysical address.

Valid?  The V and Dbits are consulted. If it isn't valid, or a store is being attemptedwith D unset, the CPU takes an exception. As with all translationtraps, the BadVAddr register will be filled with the offending programaddress; as with any TLB exception, the TLB EntryHi register will bepreloaded with the VPN of the offending address.

The BadVPN2 fields of the convenience registers Context (andXContext on 64-bit CPUs) will be preloaded (in part, as advertised)with the appropriate bits of the virtual address we failed to translateduring a TLB refill exception. But the specification is less definiteabout how those address fields behave in other exceptions. It'sprobably a good idea to stick to BadVAddr in other exceptions.

Cached?   If the C bitis set, the CPU looks in the cache for a copy of the physicallocation's data; if it isn't there, it will be fetched from memory anda copy left in the cache. Where the C bit is clear, the CPU neitherlooks in nor refills the cache.

Of course, the number of entries in the TLB permits you to translateonly a relatively small number of program addresses—a few hundred KBworth. This is far from enough for most systems. The TLB is almostalways going to be used as a software-maintained cache for a muchlarger set of translations.

When a program address lookup in the TLB fails, a TLB refill trap istaken. (Should this be called a “TLBmiss” [which is what just happened] or a “TLB refill” [which is whatwe're going to do to sort it out]? I'm afraid we probably use bothterms in MIPS documentation. )

System software has the following sequence of operations to perform:

1) It figures out whetherthere is a correct translation; if not, the trap will invoke thesoftware that handles address errors.

2) If there is a correcttranslation, it constructs a TLB entry that will implement it.

3) If the TLB is alreadyfull (and it almost always is full in running systems), the softwareselects an entry that can be discarded.

4) The software writes thenew entry into the TLB.

Keeping Track of Modified Pages(Simulating “Dirty” Bits)
An operating system that provides a page for an application program touse often wants to keep track of whether that page has been modifiedsince the OS last obtained it (perhaps from disk or network) or saved acopy of it. Unmodified (“clean”) pages may be quietly discarded, sincethey can easily be recovered from a file system if they're ever neededagain.

In OS parlance the modified pages are called “dirty,” and the OSmust take care of them until the application program exits or the dirtypage is cleaned by being saved away to backing store.

To help out with this process, it is common for CISC CPUs tomaintain a bit in the memory-resident page table indicating that awrite operation to the page has occurred.

The MIPS CPU does not support this feature, even in the TLB entries.The D bit of the page table (found in the EntryLo register) is awrite-enable and of course is used to flag read-only pages.

So here's the trick:

1) When writable page isfirst loaded intomemory, you mark its page table entry with D clear (leaving it read-only ).

2) When any write isattempted to the page, a trap will result; system software willrecognize this as a legitimate write but will use the event to set a”modified” bit in the memory resident tables—which, since it's in theEntryLo(D) position, permits future writes to be done without anexception.

3) You will also want toset the D bit in the TLB entry so that the write can proceed (but sinceTLB entries are randomly and unpredictably replaced, this would beuseless as a way of remembering the modified state).

How the Kernel Services a TLBRefill Exception
MIPS's TLB refill exception has always had its own unique entry point(at least as long as the CPU wasn't already in exception mode; inLinux, that would be a fatal error, and isn't considered further).

When the exception routine is called, the hardware has set upEntryHi (VPN2) to the page number of the location we just couldn'ttranslate: EntryHi is set exactly to what is required to create a newTLB entry to map the offending address.

The hardware has also set up a bunch of other address-relatedfields, but we're not going to use any of them. In particular, we'renot going to use the convenient “MIPS standard” way of using Context tofind the relevant page table entry.

The MIPS standard way requires that a (notionally very long) linearpage table is constructed in kseg2 (itwon't really take up excessive space, because kseg2 is mapped and thevast empty spaces of the table would never be mapped to real memory ).But Linux really does not like to have the unique-to thread-groupkernel mappings that would require.

Instead, Linux's TLB refill page tables are organized as athree-level hierarchy of tables (called”global,” “middle,” and just “PTE” ). But cunning use of C macrosallows the middle level to disappear completely without changing thecode — a two-level structure is enough for 32-bit MIPS. (However, all three are required for theextended virtual memory space available with 64-bit MIPS .)

So you can find the data for any TLB entry you want – if it existsatall – by following the links, as shown in Figure 14.4 below.

Figure14.4 Linux two-level page table (32-bit MIPS design).

This structure is reasonably economical of kernel memory: A largishLinux program with a 50-MB virtual address space has about 12-K 4-KBpages, each needing four bytes of PTE: That makes 48 KB or 12 pages.

That's an underestimate, because the address space components haveholes in them, but a reasonable-size thread group's mapping needs takeonly 15 or so pages. Kernel (kseg2) mappings take more, but they'recommon to all memory maps, so there is only one set of PTEs for kernelmappings.

Most 32-bit CPUs are restricted to a 32-bit physical address, too:Where that's so, there are six unused high-order bits in the EntryLo0-1registers. Linux recycles those to remember some software state aboutthe page table entries.

This does mean that the TLB refill handler for this 32-bit needs todo two index calculations and three loads. It's certainly longer thanthe tiny instruction sequence of the original magic MIPS scheme (wherethe index calculation is done by magic by the way the Context registerworks, and the only loads are from the final page table), but it's notbad.

You may have noticed that this description is very particular to acertain sort of MIPS CPU. Do all Linux systems have a unique kernel?

Well, no: But in the current Linux/MIPS kernel certain criticalroutines  – including the TLB miss exception handler – arebinariesgenerated by table-driven kernel software during start-up and tailoredto the requirements of this particular CPU.

Care and Maintenance of the TLB
The MIPS TLB is “just” a collection of translation entries. Entries getinto it purely by the refill handler's action. Most entries getoverwritten in the same way, of course. But sometimes the kernel needsto change a page table entry, and then it's important to invalidate anycopy in the TLB.

For a single entry, we can look up the TLB entry using the virtualaddress + ASID of the original entry, doing a tlbp: If the matchingentry is there, we can overwrite that page's entry with an invalid one.

Sometimes, though, you need to do surgery on a larger scale. TheASID mechanism allows the TLB to hold entries for 256 different memorymaps, but a Linux system will commonly run more than 256 processesbetween start-up and shutdown.

So there are times when a whole batch of translations must berescinded, because the ASID is going to be recycled for a new processand the old entries would be damagingly wrong.

There's no elegant way of doing that. You just have to iteratethrough all the TLB entries (index values go in the Index register),read each entry out into the usual registers, check the ASID valueretrieved into EntryHi(ASID), and invalidate any that match the victimASID value.

Memory Translation and 64-BitPointers
When the MIPS architecture was invented, 32-bit CPUs had been aroundfor a while, and the largest programs' data sets were already moving uptoward 100 MB – the address space had only 6 bits or so to spare. (Historically, application program demandfor memory space seems to have grown at about three-quarters of a bitper year.)

There was therefore every reason to be reasonably careful with the32-bit space and not to reduce it by profligate fragmentation; this iswhy application programs (running with user privilege) keep 31 bits'worth of addressing for themselves.

When the MIPS III instruction set introduced 64-bit registers in1991, it was leading the industry and, as we discussed in section 2.7,MIPS was probably four to six years ahead of real pressure on a 32-bitaddress boundary.

The doubling of register size only had to yield a few bits of extraaddress space to be remarkably future-proof; it's been more importantto be cautious about the potentially exploding size of OS datastructures than to make efficient use of all address space.

The limitations to the practical address space resulting from thebasic 64-bit memory map are not going to be reached for a while; theytheoretically permit the mapped user and other spaces to growto 61 bitswithout any reorganization. But so far, a 40-bit user virtual space hasbeen ample.

Most other 64-bit Linux systems have 8-Kbyte pages, but that wouldbe annoying for MIPS. The MIPS TLB can map either 4 Kbyte or 16 Kbytein a single entry, but not 8 Kbyte, so 64-bit MIPS kernels use 4-Kbytepages, with 16-Kbyte pages a desirable option in some brave future.

If you take a look back at Figure14.4 and imagine a set of intermediate (PMD) tables between thePGD and PTEs, you can comfortably resolve a 40-bit virtual address in athree-level table. We'll leave details to those enthusiastic enough toread the source code.

Next in Part 5: MIPS specificissues in the Linux kernel
To read Part 3, go to “WhatHappens on a System Call
To read Part 2, go to “How hardware and software work together.”
To read Part 1, go to “GNU/Linux from eight miles high”

Thisseries of articles is based on material from “SeeMIPS Run Linux,” by Dominic Sweetman, used with the permission ofthe publisher, Morgan Kaufmann/Elsevier, which retains full copyrights.It can be purchased on line.

Dominic Sweetman is asoftware/hardware boundary expert based in London, England, whopreviously served as managing director at Algorithmics Ltd.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.