Getting down to basics: Running Linux on a 32-/64-bit RISC architecture - Part 6

The work of fixing problems caused by cache aliases is never finished. As the years go on, Linux will offer more exciting facilities. Programmers fix the places where aliases can break legitimate code, but as new facilities are added to the OS, it breaks again.
I think the CPU designers are now recognizing that relying on software workarounds for aliases creates maintenance problems: Increasingly, new CPUs have at least the D-cache alias-free.
MIPS was a very influential architecture, so some of its competitors faithfully copied its mistakes; as a result, cache aliases are found in other architectures too.
(If imitation is the sincerest form of flattery, imitation of an architecture's mistakes must be tantamount to hero-worship.)
CP0 Pipeline Hazards
As we saw in some of the examples in Part 5, CP0 operations that depend on some CP0 register value need to be done carefully: The pipeline is not always hidden for these OS-only operations. (The OS is where all the CP0 operations happen, of course.)
On a modern CPU (compliant with Release 2 of the MIPS32/MIPS64 specifications), hazard barrier instructions (ehb, jalr.hb, and jr.hb) are available.
On older CPUs, only entry into exception and the eret instruction are guaranteed to act as CP0 hazard barriers. So when you're writing a code sequence in which something depends on the correct completion of a CP0 operation, you may need to pad the code with a calculated number of no-op instructions (your CPU may like you to use the ssnop rather than a plain old nop - read the manual).
The CPU manual for an older CPU should describe the (CPU-specific) maximum duration of any hazard and let you calculate how many instruction times must pass before you are safe.
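To make that concrete, here is a minimal sketch of hand-padded hazard handling: a hypothetical routine writing the EntryHi register (CP0 register 10) on a pre-Release-2 CPU whose longest CP0 hazard we assume to be three cycles - the real count must come from your CPU's manual. It uses GCC-style MIPS inline assembly, so it only compiles with a MIPS-targeted toolchain; on a Release 2 CPU you would use an ehb instead of the pad.

    #include <stdint.h>

    /* Write CP0 EntryHi, then pad by hand so a following TLB operation
     * sees the new value. The three-ssnop pad is an assumption; consult
     * your CPU manual for the real worst-case hazard length. */
    static inline void write_entryhi(uint32_t val)
    {
        __asm__ __volatile__(
            "mtc0  %0, $10\n\t"   /* EntryHi is CP0 register 10 */
            "ssnop\n\t"           /* the mtc0 result is not yet visible... */
            "ssnop\n\t"
            "ssnop\n\t"           /* ...so stall for the assumed hazard time */
            : /* no outputs */
            : "r"(val));
    }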
Multiprocessor Systems and Coherent Caches
In the 1990s, MIPS CPUs were used by Silicon Graphics, Inc. and others to build cache-coherent multiprocessor systems. SGI were not exactly pioneers in this area, but the MIPS R4000 was one of the first microprocessors designed from the outset to fit into such a system, and SGI's large MIPS multiprocessor server/supercomputers were very successful products.
However, little about this technology is really specific to MIPS, so we'll be brief. The kind of multiprocessor system we're describing here is one in which all the CPUs share the same memory and, at least for the main read/write memory, the same physical address space - that's an "SMP" system (for Symmetric MultiProcessing - it's "symmetric" because all CPUs have the same relationship with memory and work as peers).
The preferred OS structure for SMP systems is one where the CPUs cooperate to run the same Linux kernel. Such a Linux kernel is an explicitly parallel program, adapted to be executed by multiple CPUs simultaneously.
The SMP version of the scheduler finds the best next job for any CPU that calls it, effectively sharing out threads between the pool of available CPUs. That's a big upgrade for an OS kernel that started life on x86 desktops, but the problems posed by multiple CPUs running in parallel are similar to those posed by multitasking a single CPU.
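As an illustration of what "explicitly parallel" means in practice, here is a sketch (not kernel code - C11 atomics stand in for the kernel's own spinlocks, and the names are hypothetical): two CPUs may run the same kernel path at once, so even bumping a shared counter must be protected.

    #include <stdatomic.h>

    static atomic_flag sched_lock = ATOMIC_FLAG_INIT;
    static unsigned long nr_switches;   /* hypothetical shared statistic */

    static void account_switch(void)
    {
        /* Spin until we own the lock; acquire/release ordering makes the
         * update visible to the other CPUs in the right order. */
        while (atomic_flag_test_and_set_explicit(&sched_lock,
                                                 memory_order_acquire))
            ;
        nr_switches++;                  /* safe: only one CPU is in here */
        atomic_flag_clear_explicit(&sched_lock, memory_order_release);
    }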
Back to the hardware. That memory may be built as one giant lump shared by all CPUs, or distributed so that a CPU is "closer" to some memory regions than others.
But all multiprocessor systems have the problem that the shared memory has lower performance than a private one: Its total bandwidth is divided between the CPUs, but - usually more important - the logic that regulates sharing puts delays into the connection and increases memory latency. Performance would be appalling without large, fast CPU caches.
The trouble with the caches is that while the CPUs execute independently a lot of the time, they rely on shared memory to coordinate and communicate with each other.
Ideally, the CPU's individual caches should be invisible: Every load and store should have exactly the same outcome as if the CPUs were directly sharing memory (but faster) - such caches are called "coherent."
But with simple caches (as featured on many MIPS CPUs), that doesn't happen - CPUs with a cached copy of data won't see any change in memory made by some other CPU, and CPUs may update their cached copy of data without writing back to memory and giving other CPUs a chance to see it.
What you need is a cache that keeps a copy of memory data in the usual way, but which automatically recognizes when some other CPU wants to access the same data, and deals with the situation (most of the time, by quietly invalidating the local copy). Since relatively few references are to shared data, this can be very efficient.
It took a while to figure out how to do it. But the system that emerged is something like this: Memory sharing is managed in units that correspond to cache lines (typically 32 bytes).
Each cache line may be copied in any number of the caches so long as they're reading the data; but a cache line that one CPU is writing is the exclusive property of one cache.
In turn, that's implemented within each CPU cache by maintaining a well-defined state for each resident cache line. The bus protocol is organized by specifying a state machine for each cached line, with state transitions caused by local CPU reads and writes, and by messages sent between the caches.
Coherent systems can be classified by enumerating the permitted states, leading to terms like MESI, MOSI, and MOESI (from states called modified, owned, exclusive, shared, and invalid). Generally, simpler protocols lead to more expensive sharing.
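A toy sketch may make the per-line state machine concrete. The transition function below follows the four MESI states for one cache line; it is only illustrative - a real protocol also specifies the data movement, write-backs, and ordering rules this ignores, and whether a read miss lands in SHARED or EXCLUSIVE really depends on whether another cache holds the line (we assume SHARED here).

    /* Toy MESI transition function for a single cache line. */
    enum mesi  { INVALID, SHARED, EXCLUSIVE, MODIFIED };
    enum event { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE };

    static enum mesi next_state(enum mesi s, enum event e)
    {
        switch (e) {
        case LOCAL_READ:
            /* A miss fetches the line; assume some other cache holds it,
             * so we come in SHARED (EXCLUSIVE if no one else does). */
            return (s == INVALID) ? SHARED : s;
        case LOCAL_WRITE:
            return MODIFIED;            /* we take exclusive ownership */
        case SNOOP_READ:
            /* Another CPU reads: a MODIFIED line is written back first,
             * then everyone holding the line is demoted to SHARED. */
            return (s == INVALID) ? INVALID : SHARED;
        case SNOOP_WRITE:
            return INVALID;             /* another CPU writes: drop our copy */
        }
        return s;
    }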
The first practical systems used a single broadcast bus connecting all caches and main memory, which had the advantage that all participants could see all transactions (simultaneously, and all in the same order - which turns out to make things much simpler).
Much of the intercache communication could be achieved by letting the caches "snoop" on cache-memory refill and write-back operations. That's why coherent caches are sometimes called "snoopy caches." But single-bus systems don't scale well, either to more CPUs or to very high frequencies.
Modern systems use more complex networks of cache-to-cache and cache-to-memory connections; that means you can't rely on snooping, and there is no one place in the system where you can figure out the order in which different requests happened. Much more complicated...
This technology is migrating to the smallest systems, where multiple processors on a system-on-chip (SoC) share memory. Chip-level multiprocessing (CMP) is more attractive than you might think, because it increases compute power without very high clock rates.
The only known practical design and test methods for SoCs don't deliver very high frequencies - and, in any case, a 3-GHz processor taking 70 W of electric power and dissipating it as heat is hardly practical in a consumer device. It's hard to build anything on an SoC that behaves like a single snoopable bus.