
Getting down to basics: Running Linux on a 32-/64-bit RISC architecture – Part 6

The work of fixing around cache aliases is never finished. As the years go on, Linux will offer more exciting facilities. Programmers struggle to fix the places where aliases can break legitimate code, but as new facilities are added to the OS, it breaks again.

I think the CPU designers are now recognizing that relying on software workarounds for aliases creates maintenance problems: Increasingly, new CPUs have at least the D-cache alias-free.

MIPS was a very influential architecture, so some of its competitors faithfully copied its mistakes; as a result, cache aliases are found in other architectures too.

(If imitation is the sincerest form of flattery, imitation of an architecture's mistakes must be tantamount to hero-worship.)

CP0 Pipeline Hazards
As we've seen in some of the examples in Part 5, CP0 operations that depend on some CP0 register value need to be done carefully: The pipeline is not always hidden for these OS-only operations. The OS is where all the CP0 operations happen, of course.

On a modern CPU (compliant with Release 2 of the MIPS32/MIPS64 specifications), hazard barrier instructions (ehb, jalr.hb, and jr.hb) are available.

On older CPUs, only entry into exception and the eret instruction are guaranteed to act as CP0 hazard barriers. So where you're writing a code sequence where something depends on the correct completion of a CP0 operation, you may need to pad the code with a calculated number of no-op instructions (your CPU may like you to use ssnop rather than a plain old nop – read the manual).

The CPU manual for an older CPU should describe the (CPU-specific) maximum duration of any hazard and let you calculate how many instruction times might pass before you are safe.
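To make that concrete, here's a minimal sketch (not from the book, and not kernel code) of writing a TLB entry and then clearing the CP0 hazard before anything relies on the new translation. The CP0 register numbers are the architectural ones; the register choice and the ssnop count are purely illustrative – the real count comes from your CPU's hazard table.

    # Hedged sketch: update a TLB entry, then make the change safe to use.
    mtc0    t0, $10              # EntryHi: virtual page number and ASID
    mtc0    t1, $2               # EntryLo0: even physical page and flags
    mtc0    t2, $3               # EntryLo1: odd physical page and flags
    tlbwr                        # write a random TLB entry
                                 # (on older CPUs there can be a hazard between
                                 # the last mtc0 and tlbwr too - check the table)

    # MIPS32/64 Release 2 CPU: one explicit barrier is enough.
    ehb

    # Older CPU: pad with ssnops instead (the count here is a placeholder):
    #     ssnop
    #     ssnop
    #     ssnop
    # or rely on eret, which is itself a hazard barrier, if you're about to
    # return from the exception anyway.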

Multiprocessor Systems and Coherent Caches
In the 1990s, MIPS CPUs were used by Silicon Graphics, Inc. and others to build cache-coherent multiprocessor systems. SGI were not exactly pioneers in this area, but the MIPS R4000 was one of the first microprocessors designed from the outset to fit into such a system, and SGI's large MIPS multiprocessor server/supercomputers were very successful products.

However, little about this technology is really specific to MIPS, so we'll be brief. The kind of multiprocessor system we're describing here is one in which all the CPUs share the same memory and, at least for the main read/write memory, the same physical address space – that's an "SMP" system (for Symmetric MultiProcessing – it's "symmetric" because all CPUs have the same relationship with memory and work as peers).

The preferred OS structure for SMP systems is one where the CPUs cooperate to run the same Linux kernel. Such a Linux kernel is an explicitly parallel program, adapted to be executed by multiple CPUs simultaneously.

The SMP version of the scheduler finds the best next job for any CPU that calls it, effectively sharing out threads between the pool of available CPUs. That's a big upgrade for an OS kernel that started life on x86 desktops, but the problems posed by multiple CPUs running in parallel are similar to those posed by multitasking a single CPU.

Back to the hardware. That memory may be built as one giant lump shared by all CPUs, or distributed so that a CPU is "closer" to some memory regions than others.

But all multiprocessor systems have the problem that the shared memory has lower performance than a private one: Its total bandwidth is divided between the CPUs, but – usually more important – the logic that regulates sharing puts delays into the connection and increases memory latency. Performance would be appalling without large, fast CPU caches.

The trouble with the caches is that while the CPUs execute independently a lot of the time, they rely on shared memory to coordinate and communicate with each other.

Ideally, the CPUs' individual caches should be invisible: Every load and store should have exactly the same outcome as if the CPUs were directly sharing memory (but faster) – such caches are called "coherent."

But with simple caches (as featured on many MIPS CPUs), that doesn't happen – CPUs with a cached copy of data won't see any change in memory made by some other CPU, and CPUs may update their cached copy of data without writing back to memory and giving other CPUs a chance to see it.

What you need is a cache that keeps a copy of memory data in the usual way, but which automatically recognizes when some other CPU wants to access the same data, and deals with the situation (most of the time, by quietly invalidating the local copy). Since relatively few references are to shared data, this can be very efficient.

It took a while to figure out how to do it. But the system that emerged is something like this: Memory sharing is managed in units that correspond to cache lines (typically 32 bytes).

Each cache line may be copied into any number of the caches so long as they're only reading the data; but a cache line that one CPU is writing is the exclusive property of that CPU's cache.

In turn, that's implemented within each CPU cache by maintaining a well-defined state for each resident cache line. The bus protocol is organized by specifying a state machine for each cached line, with state transitions caused by local CPU reads and writes, and by messages sent between the caches.

Coherent systems can be classified by enumerating the permitted states, leading to terms like MESI, MOSI, and MOESI (from states called modified, owned, exclusive, shared, and invalid). Generally, simpler protocols lead to more expensive sharing.

The first practical systems used a single broadcast bus connecting all caches and main memory, which had the advantage that all participants could see all transactions (simultaneously, and all in the same order – which turns out to make things much simpler).

Much of the intercache communication could be achieved by letting the caches "snoop" on cache-memory refill and write-back operations. That's why coherent caches are sometimes called "snoopy caches." Single-bus systems don't scale well either to more CPUs or to very high frequencies.

Modern systems use more complex networks of cache-to-cache and cache-to-memory connections; that means you can't rely on snooping and have no one place in the system where you can figure out in which order different requests happened. Much more complicated...

This technology is migrating to the smallest systems, where multiple processors on a system-on-chip (SoC) share memory. Chip-level multiprocessing (CMP) is more attractive than you might think, because it increases compute power without very high clock rates.

The only known practical design and test methods for SoC don't deliver very high frequencies – and, in any case, a 3-GHz processor taking 70 W of electric power and dissipating it as heat is hardly practical in a consumer device. It's hard to build anything on an SoC that behaves like a single snoopable bus.

In contemporary SoCs the state of the art is a small cluster of CPUs, whose mutual coherency is maintained by a piece of logic that couples those CPUs quite tightly together, implementing a single ordering point. Future SoCs may use more loosely coupled networks of CPUs, which will need more complicated cache coherency protocols.

It was difficult to get cache-coherent SMP hardware to work, and then to work efficiently. Those struggles produced a fair amount of folklore about how to do things well, and it's worth getting an introduction to it before you work with a robust multi-CPU application like the Linux kernel.

Here are some headline issues:

1) Selecting pages that don't need coherent management: Cache coherency looks after itself, but performance will be better if the OS can tell the hardware about pages that are not shared and don't need coherence management.

Read-only pages need no management – they come from memory, and caches may make as many copies of them as required. No page is entirely read-only, of course: Data had to be written sometime and somewhere.

But in Linux, instruction pages come close enough to be worth special handling. The kernel's own instructions were safely written before anything was cached, so they're OK. Application instruction pages are generally read in from a file, and at that point all caches must have any copies invalidated; but after that they're OK too.

Quite a lot of data is per thread. Process stack is always per thread, and for single-threaded applications, so is all the user-space data. Unfortunately, while there's only one CPU interested in a single-threaded process at any point in time, an application can migrate from CPU to CPU when it's rescheduled (and if you want a multiprocessor system to work well, you need few constraints on migration).

It's a matter of judgment whether the benefit of more relaxed coherency management on known single-threaded data pages is worthwhile, even though you'll need to do some error-prone, manual cache-cleaning when a thread migrates.
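For a feel of what that manual cleaning involves, here's a minimal sketch (not kernel code) that writes back and invalidates one page's worth of D-cache lines by virtual address. The 32-byte line size, the 4,096-byte page, and the cache-op encoding are assumptions; a real kernel reads the line size from the CPU configuration and takes its cache-op constants from its own headers.

    # Hedged sketch: flush one page from the D-cache before a migrated
    # thread's data is touched by another CPU. a0 = page base address.
    # The 0x15 encoding is what Linux's asm/cacheops.h calls
    # Hit_Writeback_Inv_D - treat it as an assumption and check your headers.
#define Hit_Writeback_Inv_D 0x15

    li      t1, 4096                      # assumed PAGE_SIZE
    addu    t1, a0, t1                    # t1 = first address past the page
1:  cache   Hit_Writeback_Inv_D, 0(a0)    # write back and invalidate one line
    addiu   a0, a0, 32                    # assumed 32-byte cache line
    bne     a0, t1, 1b
    nop                                   # branch delay slot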

2) Atomicity, critical regions, and multiprocessor locks: SMP systems in which multiple CPUs run the same code and negotiate shared data structures don't work without some attention to making data reads and updates atomic in the sense found in Part 2.

The MIPS architecture's ll/sc (load-linked/store-conditional) instructions are its primitive building blocks for atomicity and locks and were designed to scale well to large multiprocessors.
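As an illustration – a minimal sketch, not the kernel's actual atomic-operation source – here's the classic ll/sc retry loop for an atomic increment. The register names are arbitrary, and a0 is assumed to hold the address of a word-aligned counter.

    # Hedged sketch: atomically increment the word at (a0) using ll/sc.
1:  ll      t0, 0(a0)        # load-linked: read the counter and open a "link"
    addiu   t0, t0, 1        # compute the new value locally
    sc      t0, 0(a0)        # store-conditional: succeeds only if no other CPU
                             # touched the location; t0 becomes 1 on success,
                             # 0 on failure
    beqz    t0, 1b           # lost the race, so reload and try again
    nop                      # branch delay slot

The same pattern – plus a memory barrier once the store-conditional succeeds – is the basis of spinlocks and the kernel's heavier locking primitives.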

3) Uncertain memory ordering and memory barriers: In a cache-coherent multiprocessor, when any CPU writes a memory location, another CPU reading that location will (sometime a bit later) see it updated.

But once you get away from the systems built round a single common bus, it's hard to make sure even that reads and writes stay in the same relative order.

Most of the time that doesn't matter – the threads running on the different CPUs have to cope with the fact that they don't know much about how other threads are progressing. But software using shared memory to communicate between threads is always vulnerable.

The MIPS answer is the sync instruction, a memory barrier: loads and stores issued before the sync are guaranteed to be seen before any issued after it. sync has another life, too. In many CPUs it has additional or stronger CPU-family-specific semantics such as "all data has cleared the system interface," or "wait until all my load and store queues are empty."
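Here's a minimal sketch of where the barrier matters: one CPU publishes a message and then sets a flag; another reads the flag and then the message. The message and flag labels are made-up symbols, not kernel names.

    # Hedged sketch: producer side.
    sw      t0, message      # 1. write the payload
    sync                     # barrier: make the payload visible...
    li      t1, 1
    sw      t1, flag         # 2. ...before the "message ready" flag

    # Consumer side (on another CPU): read the flag, then use a barrier
    # before trusting the payload.
    #     lw    t1, flag
    #     sync
    #     lw    t0, message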

Demon Tweaks for a Critical Routine
Before ending this series, it's worth looking quickly at some optimized Linux code to get a feel for how far it's worth taking architecture- or CPU-specific tuning.

The clear_page() routine is heavily used in the Linux kernel. Pages filled with zero are directly required for the "uninitialized" portion of application data spaces, and zero-filling is also used as a way of cleaning ex-application pages to avoid accidental leakage of data from one application to another (which would violate the security policy).

This implementation of clear_page() uses several ideas. It unrolls loops – each iteration of the loop clears a whole cache line (in this case, that's 32 bytes, eight words).

It also uses the MIPS-specific pref operation to avoid waiting on cache traffic. If the prefetch really does need to read data from memory, that's going to take a long time compared with the CPU execution rate: So we'll prefetch a long way ahead. In this case, "a long way" is 512 bytes, 16 cache lines.

But if the CPU can understand it, the prefetch hint is the specialized "prefetch for store" version, which makes a cache entry for a line without actually reading any data in (it relies on the programmer's promise to overwrite the whole cache line).

Here's the function with added commentary. Bear in mind that PAGE_SIZE is usually 4,096.

#define PREF_AHEAD 512

clear_page:
    # The first loop (main_loop) stops short so as not to prefetch off
    # the end of the page.
    addiu   a2, a0, PAGE_SIZE - PREF_AHEAD

main_loop:
    # The prefetch. Bring in a cache line, but with luck we won't read
    # memory. But if all this CPU offers is a simple prefetch, that
    # should work too.
    pref    Pref_PrepareForStore, PREF_AHEAD(a0)

    # Now we're going to do eight stores.
    sw      zero, 0(a0)
    sw      zero, 4(a0)
    sw      zero, 8(a0)
    sw      zero, 12(a0)
    addiu   a0, a0, 32       # some CPUs choke on too many writes at
                             # full speed, so increment the loop pointer
                             # in the middle to give it a break
    sw      zero, -16(a0)
    sw      zero, -12(a0)
    sw      zero, -8(a0)
    bne     a2, a0, main_loop
    sw      zero, -4(a0)     # last store in the branch delay slot

    # The second (end) loop does the rest, and has no prefetch which
    # would overrun the end of the page.
    addiu   a2, a0, PREF_AHEAD
end_loop:
    sw      zero, 0(a0)
    sw      zero, 4(a0)
    sw      zero, 8(a0)
    sw      zero, 12(a0)
    addiu   a0, a0, 32
    sw      zero, -16(a0)
    sw      zero, -12(a0)
    sw      zero, -8(a0)
    bne     a2, a0, end_loop
    sw      zero, -4(a0)     # last store in the branch delay slot
    jr      ra               # return to caller
    nop                      # branch delay slot

To read Part 5, go to "MIPS-specific issues in the Linux Kernel."
To read Part 4, go to "What we really want."
To read Part 3, go to "What Happens on a System Call."
To read Part 2, go to "How hardware and software work together."
To read Part 1, go to "GNU/Linux from eight miles high."

This series of articles is based on material from "See MIPS Run Linux," by Dominic Sweetman, used with the permission of the publisher, Morgan Kaufmann/Elsevier, which retains full copyrights. It can be purchased online.

Dominic Sweetman is a software/hardware boundary expert based in London, England, who previously served as managing director at Algorithmics Ltd.
