The work of fixing around cache aliases is never finished. As the years
go on, Linux will offer more exciting facilities. Programmers struggle
to fix the places where aliases can break legitimate code, but as new
facilities are added to the OS, that code breaks again.
I think the CPU designers are now recognizing that relying on
software workarounds for aliases creates maintenance problems:
Increasingly, new CPUs have at least the D-cache alias-free.
MIPS was a very influential
architecture, so some of its competitors faithfully copied its
mistakes; as a result, cache aliases are found in other architectures
too.
(If imitation is the sincerest form of flattery, imitation of an
architecture's mistakes must be tantamount to hero-worship.)
CP0 Pipeline Hazards
As we've seen in some of the examples previously in Part 5,
CP0 operations which are
dependent on some CP0 register value need to be done carefully: The
pipeline is not always hidden for these OS-only operations. The OS is
where all the CP0 operations happen, of course.
On a modern CPU (compliant with Release 2 of the MIPS32/MIPS64
specifications), hazard barrier instructions (ehb, jalr.hb, and jr.hb)
are available.
On older CPUs, only entry into exception and the eret instruction
are guaranteed to act as CP0 hazard barriers. So where you're writing a
code sequence where something depends on the correct completion of a
CP0 operation, you may need to pad the code with a calculated number of
no-op instructions (your CPU may like
you to use the ssnop rather than a plain old nop - read the manual).
The CPU manual for an older CPU should describe the (CPU-specific)
maximum duration of any hazard and let you calculate how many
instruction times might pass before you are safe.
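The two cases might be wrapped up in C along the following lines. This is only a sketch, assuming GCC-style inline assembly. The helper names (cp0_hazard_barrier, write_cp0_status, read_cp0_status, enable_interrupts) and the figure of four ssnops are illustrative assumptions, not taken from any CPU manual. The non-MIPS branches are stand-ins so that the sketch can be compiled and run off-target.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__mips__)
static uint32_t fake_cp0_status;    /* host stand-in for CP0 Status */
#endif

/* Wait until an earlier CP0 write has taken effect. */
static inline void cp0_hazard_barrier(void)
{
#if defined(__mips__) && defined(__mips_isa_rev) && (__mips_isa_rev >= 2)
    /* Release 2 or later: a single explicit hazard barrier. */
    __asm__ __volatile__("ehb" ::: "memory");
#elif defined(__mips__)
    /* Older CPU: pad with no-ops. Four is an assumed figure; the
     * real count must come from that CPU's manual. */
    __asm__ __volatile__("ssnop\n\tssnop\n\tssnop\n\tssnop" ::: "memory");
#endif
}

/* Write CP0 Status (register 12), then wait out the hazard. */
static inline void write_cp0_status(uint32_t value)
{
#if defined(__mips__)
    __asm__ __volatile__("mtc0 %0, $12" : : "r"(value) : "memory");
#else
    fake_cp0_status = value;
#endif
    cp0_hazard_barrier();
}

/* Read CP0 Status (register 12). */
static inline uint32_t read_cp0_status(void)
{
    uint32_t value;
#if defined(__mips__)
    __asm__ __volatile__("mfc0 %0, $12" : "=r"(value));
#else
    value = fake_cp0_status;
#endif
    return value;
}

/* Example use: set IE (bit 0 of Status). Once the barrier has run,
 * the new value must be visible to the read that follows. */
static inline void enable_interrupts(void)
{
    write_cp0_status(read_cp0_status() | 0x1u);
    assert(read_cp0_status() & 0x1u);
}
```

On a real MIPS target this code only makes sense in kernel mode, since CP0 accesses are privileged.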
Multiprocessor Systems and Coherent Caches
In the 1990s, MIPS CPUs were used by Silicon Graphics, Inc.
and others to build cache-coherent multiprocessor systems. SGI were not
exactly pioneers in this area, but the MIPS R4000 was one of the first
microprocessors designed from the outset to fit into such a system, and
SGI's large MIPS multiprocessor server/supercomputers were very
successful products.
However, little about this technology is really specific to MIPS, so
we'll be brief. The kind of multiprocessor system we're describing here
is one in which all the CPUs share the same memory and, at least for
the main read/write memory, the same physical address space - that's an
"SMP" system (for Symmetric MultiProcessing - it's
"symmetric" because all CPUs have the same relationship with memory and
work as peers).
The preferred OS structure for SMP systems is one where the CPUs
cooperate to run the same Linux kernel. Such a Linux kernel is an
explicitly parallel program, adapted to be executed by multiple CPUs
simultaneously.
The SMP version of the scheduler finds the best next job for any CPU
that calls it, effectively sharing out threads between the pool of
available CPUs. That's a big upgrade for an OS kernel that started life
on x86 desktops, but the problems posed by multiple CPUs running in
parallel are similar to those posed by multitasking a single CPU.
Back to the hardware. That memory may be built as one giant lump
shared by all CPUs, or distributed so that a CPU is "closer" to some
memory regions than others.
But all multiprocessor systems have the problem that the shared
memory has lower performance than a private one: Its total bandwidth is
divided between the CPUs, but - usually more important - the logic that
regulates sharing puts delays into the connection and increases memory
latency. Performance would be appalling without large, fast CPU caches.
The trouble with the caches is that while the CPUs execute
independently a lot of the time, they rely on shared memory to
coordinate and communicate with each other.
Ideally, the CPUs' individual caches should be invisible: Every load
and store should have exactly the same outcome as if the CPUs were
directly sharing memory (but faster) - such caches are called
"coherent."
But with simple caches (as featured on many MIPS CPUs), that doesn't
happen - CPUs with a cached copy of data won't see any change in memory
made by some other CPU, and CPUs may update their cached copy of data
without writing back to memory and giving other CPUs a chance to see
it.
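The failure is easy to reproduce in a toy model. The sketch below is a simulation in plain C, not real cache hardware: each "CPU" has a one-entry write-back cache with no coherency logic, and all the names (simple_cache, cache_load, stale_read_demo, and so on) are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t shared_memory;      /* one word of shared RAM */

/* A one-entry write-back cache with no coherency logic at all. */
struct simple_cache {
    bool valid;
    bool dirty;
    uint32_t data;
};

static uint32_t cache_load(struct simple_cache *c)
{
    if (!c->valid) {                /* miss: refill from memory */
        c->data = shared_memory;
        c->valid = true;
        c->dirty = false;
    }
    return c->data;                 /* hit: memory is not consulted */
}

static void cache_store(struct simple_cache *c, uint32_t value)
{
    c->data = value;                /* write-back: memory not updated yet */
    c->valid = true;
    c->dirty = true;
}

static void cache_writeback_invalidate(struct simple_cache *c)
{
    if (c->valid && c->dirty)
        shared_memory = c->data;
    c->valid = false;
    c->dirty = false;
}

/* Returns the value CPU 0 sees after CPU 1 has stored 42. */
static uint32_t stale_read_demo(void)
{
    struct simple_cache cpu0 = {0}, cpu1 = {0};

    shared_memory = 7;
    (void)cache_load(&cpu0);            /* CPU 0 caches the value 7 */
    cache_store(&cpu1, 42);             /* CPU 1's store stays in its cache */
    assert(shared_memory == 7);         /* memory has not changed */
    cache_writeback_invalidate(&cpu1);  /* now memory holds 42 ... */
    return cache_load(&cpu0);           /* ... but CPU 0 still hits on 7 */
}
```

Even after CPU 1 writes its data back, CPU 0 keeps returning 7 until something invalidates its copy. With simple caches, that something has to be software.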
What you need is a cache that keeps a copy of memory data in the
usual way, but which automatically recognizes when some other CPU wants
to access the same data, and deals with the situation (most of the
time, by quietly invalidating the local copy). Since relatively few
references are to shared data, this can be very efficient.
It took a while to figure out how to do it. But the system that
emerged is something like this: Memory sharing is managed in units that
correspond to cache lines (typically 32 bytes).
Each cache line may be copied in any number of the caches so long as
they're reading the data; but a cache line that one CPU is writing is
the exclusive property of one cache.
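To make the rule concrete, here is a small sketch in C. It assumes the 32-byte line size mentioned as typical; line_base, enum access, and sharing_is_legal are hypothetical names chosen for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* assumed line size; real CPUs vary */

/* Start address of the line containing addr: the unit of sharing. */
static uint32_t line_base(uint32_t addr)
{
    return addr & ~(LINE_BYTES - 1u);
}

/* What one cache is currently doing with a given line. */
enum access { NO_COPY, READING, WRITING };

/* True if the caches' use of one line obeys the rule: any number of
 * readers, or exactly one writer and no other copies. */
static bool sharing_is_legal(const enum access *caches, size_t ncaches)
{
    size_t readers = 0, writers = 0;

    assert(caches != NULL || ncaches == 0);
    for (size_t i = 0; i < ncaches; i++) {
        if (caches[i] == READING)
            readers++;
        if (caches[i] == WRITING)
            writers++;
    }
    return writers == 0 || (writers == 1 && readers == 0);
}
```

Note that two unrelated variables at 0x1000 and 0x101c have the same line_base, so as far as the hardware is concerned they are shared together.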
In turn, that's implemented within each CPU cache by maintaining a
well defined state for each resident cache line. The bus protocol is
organized by specifying a state machine for each cached line, with
state transitions caused by local CPU reads and writes, and by messages
sent between the caches.
Coherent systems can be classified by enumerating the permitted
states, leading to terms like MESI,
MOSI, and MOESI (from states called modified, owned,
exclusive, shared, and invalid). Generally, simpler protocols
lead to more expensive sharing.
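As an illustration, the per-line state machine for a textbook MESI protocol can be sketched as a single transition function. This is the generic scheme, not the protocol of any particular MIPS CPU. The names mesi_next and enum event, and the others_have_copy flag (standing in for whatever the bus reports on a refill), are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

enum event {
    LOCAL_READ,     /* this CPU loads from the line */
    LOCAL_WRITE,    /* this CPU stores to the line */
    REMOTE_READ,    /* message: another cache wants to read the line */
    REMOTE_WRITE    /* message: another cache wants to write the line */
};

/* Next state of one cache line. others_have_copy matters only when an
 * INVALID line is read and must be fetched. */
static enum mesi mesi_next(enum mesi state, enum event ev,
                           bool others_have_copy)
{
    switch (ev) {
    case LOCAL_READ:
        if (state != INVALID)
            return state;               /* hit: no change */
        return others_have_copy ? SHARED : EXCLUSIVE;
    case LOCAL_WRITE:
        /* From INVALID or SHARED the other caches must first be told
         * to invalidate; from EXCLUSIVE the upgrade is silent. */
        return MODIFIED;
    case REMOTE_READ:
        /* A MODIFIED line supplies its data first; any valid copy
         * ends up SHARED. */
        return state == INVALID ? INVALID : SHARED;
    case REMOTE_WRITE:
        /* The writer takes exclusive ownership; drop our copy. */
        return INVALID;
    }
    assert(0 && "unknown event");
    return INVALID;
}
```

For example, mesi_next(INVALID, LOCAL_READ, false) gives EXCLUSIVE, and a later LOCAL_WRITE moves the line to MODIFIED without any bus traffic.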
The first practical systems used a single broadcast bus connecting
all caches and main memory, which had the advantage that all
participants could see all transactions (simultaneously, and all in the
same order - which turns out to make things much simpler).
Much of the intercache communication could be achieved by letting
the caches "snoop" on cache-memory refill and write-back operations.
That's why coherent caches are sometimes called "snoopy caches."
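A toy model of snooping might look like the following. To keep it short it is deliberately simpler than the write-back schemes described above: every store is written through to memory over the bus, and the other caches invalidate their copy when they see it. All names (snoopy_cache, bus_write, cpu_read, NCPUS) are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 3

static uint32_t memory_word;            /* the one shared location */

struct snoopy_cache {
    bool valid;
    uint32_t data;
};

static struct snoopy_cache caches[NCPUS];

/* A store goes out on the single bus. Every other cache sees the
 * transaction and drops its now-stale copy. */
static void bus_write(int writer, uint32_t value)
{
    assert(writer >= 0 && writer < NCPUS);
    memory_word = value;                /* write-through to memory */
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu != writer)
            caches[cpu].valid = false;  /* snooped: invalidate */
    }
    caches[writer].data = value;
    caches[writer].valid = true;
}

/* A load hits in the local cache if it can, else refills over the bus. */
static uint32_t cpu_read(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    if (!caches[cpu].valid) {
        caches[cpu].data = memory_word;
        caches[cpu].valid = true;
    }
    return caches[cpu].data;
}
```

Because there is only one bus, the loop in bus_write stands in for the fact that every cache observes every store, and in the same order.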
Single-bus systems don't scale well either to more CPUs or to very high
frequencies.
Modern systems use more complex networks of cache-to-cache and
cache-to-memory connections; that means you can't rely on snooping and
have no one place in the system where you can figure out in which order
different requests happened. Much more complicated . . .
This technology is migrating to the smallest systems, where multiple
processors on a system-on-chip (SoC) share memory. Chip-level
multiprocessing (CMP) is more attractive than you might think, because
it increases compute power without very high clock rates.
The only known practical design and test methods for SoC don't
deliver very high frequencies - and, in any case, a 3-GHz
processor taking 70 W of electric power and dissipating it as heat is
hardly practical in a consumer device. It's hard to build anything on
an SoC that behaves like a single snoopable bus.