Achieving cache coherence in a MIPS32 multicore design

Historically, memory coherence in multiprocessor systems was often achieved through bus “snooping,” where each core was connected to a common multitier bus and was able to snoop on memory-access traffic of processor peers to regulate the coherence status of individual cache lines. For that, each core maintained the coherence status of L1 cache lines locally and posted status changes to peers via the common bus.

The increasing size and complexity of the system-on-a-chip (SoC) led to restructuring of the multitier-bus philosophy in favor of localized point-to-point connections with centralized traffic routing. This configuration enabled speed and power improvements on now localized bus segments due to reduced load and segment length. Also, bus-contention problems eased, and throughput increased for the localized data exchange. In response to this trend in system architecture, the Open Core Protocol (OCP) standard emerged to consolidate this design philosophy. Further, emergence of IP-provider business models catalyzed the standardization of IP interconnect and design methods to facilitate design reuse centered on an open standard.

However, localized bus transactions, as conducted through OCP interconnect segments, decouple the processors of a multicore cluster. Coherence schemes can no longer be based directly on bus snooping, nor can they rely on bus arbitration to ensure access ordering. Different methods of communication are needed to ensure that data is accessed consistently, and additional challenges arise in ordering competing L1-line data requests. One way to address these challenges is to add coherence-message communication to each processing element, as depicted in Figure 1, in a system MIPS calls a Coherent Processing System (CPS). This system provides the means for snoop-type cache coherence.

Coherence messages embody a new type of command within the OCP protocol. Members of the processor system send coherence messages to a centralized coherence manager, which provides access ordering (serialization) and routes the messages to give snoop-type access to peer members. Each peer looks up its individual L1-line status and posts a message response. Depending on the responses, the coherence manager initiates movement of coherent data among cores and funnels accesses toward higher levels of the memory hierarchy, such as L2 and L3 caches. I/O coherence units provide a means to phase data into and out of the coherent address space and also take part in the coherent-message exchange.

In addition to new message-type commands within the OCP protocol, individual processors are required to respond to coherent status requests and are therefore not solely initiators (masters) of bus transactions. The CPS might address this requirement by providing an OCP slave port to receive and respond to messages initiated by the coherence manager. Coherent requests by a processor will use the OCP master port. Within the processing cluster, coherence-message exchanges between cores and the coherence manager are dubbed interventions. OCP slave ports of processors receiving interventions are therefore called intervention ports.

As depicted in Figure 1, each individual processor of the MIPS 1004K system is based on our multithreaded processor architecture, providing two independent threads and processing context within the envelope of a single-scalar, 9-stage pipeline. Level 1 data-cache tag arrays are duplicated to be accessible simultaneously for CPU operation and intervention lookup. MESI-style cache-line coherency is supported.

The coherence manager of the processing system receives and serializes incoming messages through its request unit, whose OCP slave ports are driven by each CPU and I/O-coherence unit. Serialized messages are routed, depending on their address space and context, either to higher-level cache hierarchies using the memory interface unit, or toward processor peers and I/O-coherence units using the snoop agent. The snoop agent initiates OCP master transactions (interventions) to look up the coherent L1 cache-line status for each processor. Interventions returned to the initiator of a message, called self-interventions, allow the initiator to provide access ordering. Responses to coherent messages initiated by CPUs, as well as data responses, are formulated within the response unit and routed to individual CPUs.
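The flow through the coherence manager can be pictured with a small Python model. This is only a sketch under stated assumptions, not the 1004K implementation: the class and field names, the single contiguous coherent address range, and the rule "commands starting with Coh inside that range go to the snoop agent" are all hypothetical simplifications.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    source: str    # initiator name, e.g. "cpu0" (hypothetical naming)
    command: str   # e.g. "CohReadShared" or "Read"
    address: int


class CoherenceManager:
    def __init__(self, cores, coherent_range):
        self.cores = list(cores)
        self.coherent_range = coherent_range  # assumption: one (low, high) range
        self.queue = deque()                  # request unit: the serialization point

    def post(self, msg):
        """A CPU or I/O-coherence unit posts a message to the request unit."""
        self.queue.append(msg)

    def step(self):
        """Take the oldest message and decide which unit handles it."""
        msg = self.queue.popleft()
        low, high = self.coherent_range
        if msg.command.startswith("Coh") and low <= msg.address < high:
            # Snoop agent: one intervention per core. The flag marks the
            # self-intervention that returns to the initiator.
            targets = [(core, core == msg.source) for core in self.cores]
            return "snoop_agent", targets
        # Everything else goes toward the higher levels of the memory hierarchy.
        return "memory_interface_unit", []
```

Posting messages from several cores and calling `step()` repeatedly returns them in arrival order, which is the serialization property the text describes.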

Coherent OCP commands
OCP commands used within the 1004K CPS can be classified into three categories:

Coherent messages maintain a MESI-style cache-line status. These messages are a result of CPU load and store operations and can initiate data movement between CPUs and the memory subsystem. All peer CPUs of the CPS will receive coherent messages posted by an initiator and respond according to their cache-line coherent state. The coherence manager will initiate data movement as required.

Coherent cache-manipulation commands are used for cache-line maintenance within the coherent address space. I/O traffic will bring new coherent lines into the domain or remove coherent context from cache lines. This category also includes operations that synchronize the memory hierarchy.

Noncoherent commands perform OCP main-port transactions on memory regions outside the coherent address space. These represent OCP read and write commands.
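As a quick reference, the three categories can be written down as a lookup, using the command names introduced in the sections that follow. The set and function names are illustrative only.

```python
# Command names are the ones used in this article; the grouping mirrors
# its three categories. Set and function names are hypothetical.
COHERENT_MESSAGES = {"CohReadOwn", "CohReadShared", "CohUpgrade", "CohWriteBack"}
CACHE_MANIPULATION = {
    "CohCopyBack",
    "CohInvalidate",
    "CohWriteInvalidate",
    "CohReadInvalidate",
    "CohCompletionSync",
}
NONCOHERENT = {"Read", "Write"}


def classify(command):
    """Return the category of an OCP command name."""
    if command in COHERENT_MESSAGES:
        return "coherent message"
    if command in CACHE_MANIPULATION:
        return "coherent cache manipulation"
    if command in NONCOHERENT:
        return "noncoherent"
    raise ValueError(f"unknown command: {command}")
```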

Coherent messages
The CPS may implement four coherent messages that are caused by L1 cache-line-status changes due to CPU load and store activity. The initiating CPU sends each message as an OCP master-port command. Peer CPUs of the system receive interventions based on this line-status change and respond with their local cache-line status.

The first message type is the CohReadOwn, denoting a cache miss that occurred through an attempt to modify a cache line. As Figure 2 shows, peer cores encountering this line in status “Modified” will force a write-back into the memory subsystem and perform a local invalidate. As an optimization, locally encountered line data will be forwarded to the requester CPU to reduce access latency. The requester CPU will install this line as “Exclusive” and perform the line-modifying instruction. Then the cache-line status will change to “Modified.” While waiting for line refill, the requester CPU will continue execution of another thread.
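A minimal sketch of the CohReadOwn effect on line states follows, with states written as the letters M, E, S and I. The function name and return shape are hypothetical. The text only describes peers holding the line as “Modified”; invalidating peers in other valid states is an assumption borrowed from textbook MESI.

```python
def coh_read_own(peer_states):
    """peer_states maps a peer core name to 'M', 'E', 'S' or 'I' for the line."""
    new_peers = {}
    write_back = False
    forwarded_from = None
    for core, state in peer_states.items():
        if state == "M":
            write_back = True  # Modified peer writes the line back to memory
        if state != "I" and forwarded_from is None:
            forwarded_from = core  # a hitting peer forwards its line data
        # Assumption: every valid peer copy is invalidated, not only 'M' ones.
        new_peers[core] = "I"
    # The requester installs the line as Exclusive, then the store makes it Modified.
    requester_states = ["E", "M"]
    return new_peers, write_back, forwarded_from, requester_states
```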

The CohReadShared message indicates that a cache miss occurred through a line read operation. No line modification is intended. As Figure 3 shows, peer cores encountering this line in status “Modified” will force a write-back into the memory subsystem. Hitting peer lines will migrate to “Shared” status. Hit data is forwarded to the requester core and installed in state “Shared.” Then the line read operation is performed. While waiting for line refill, the requester CPU will continue execution of another thread.
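The same style of sketch applies to CohReadShared. The function name is hypothetical, and one case goes beyond the text: when no peer holds the line, the sketch installs it as Exclusive, which is the textbook MESI choice and an assumption here.

```python
def coh_read_shared(peer_states):
    """peer_states maps a peer core name to 'M', 'E', 'S' or 'I' for the line."""
    new_peers = dict(peer_states)
    write_back = False
    hit = False
    for core, state in peer_states.items():
        if state == "I":
            continue  # this peer does not hold the line
        hit = True
        if state == "M":
            write_back = True  # Modified peer writes the line back to memory
        new_peers[core] = "S"  # hitting peer lines migrate to Shared
    # Hit data is installed as Shared. Assumption: with no peer hit the
    # line comes from memory and is installed as Exclusive.
    requester_state = "S" if hit else "E"
    return new_peers, write_back, requester_state
```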

CohUpgrade indicates that a line-modifying instruction encountered a cache hit on a “Shared” line. As Figure 4 shows, peer cores will be notified to invalidate hitting lines. The “Shared” line is then upgraded to “Modified” after the modifying instruction is executed.
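CohUpgrade moves no data, so a sketch only needs to track states. The function name and the error check are illustrative.

```python
def coh_upgrade(requester_state, peer_states):
    """Return (new requester state, new peer states) for a store hit on a Shared line."""
    if requester_state != "S":
        raise ValueError("CohUpgrade applies to a store hit on a Shared line")
    # Peers are told to invalidate hitting lines; peers already Invalid stay so.
    new_peers = {core: "I" for core in peer_states}
    # After the modifying instruction executes, the requester's line is Modified.
    return "M", new_peers
```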

Finally, the CohWriteBack message signifies eviction of a coherent cache line. The coherence manager will initiate data movement through the intervention port and forward data to the memory subsystem. The evicted cache line is then replaced by a new (possibly coherent) address. In this case, a CohReadOwn or CohReadShared has caused the eviction.
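The pairing of an eviction with the miss that caused it can be sketched as the list of messages a CPU posts. Two points are assumptions rather than statements from the text: that only a “Modified” victim triggers a CohWriteBack, and that the write-back is posted before the miss command. The function name is hypothetical.

```python
def messages_for_miss(victim, miss_command, miss_address):
    """victim is None or an (address, state) pair for the line being replaced."""
    if miss_command not in ("CohReadOwn", "CohReadShared"):
        raise ValueError("expected a coherent miss command")
    messages = []
    if victim is not None:
        victim_address, victim_state = victim
        if victim_state == "M":
            # Assumption: only a dirty victim has data to move to memory.
            messages.append(("CohWriteBack", victim_address))
    # Assumption: the miss command follows the write-back of the victim.
    messages.append((miss_command, miss_address))
    return messages
```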

Coherent cache manipulation commands
In response to cache manipulations, coherence messages are initiated and sent to peers.

CohCopyBack: write back a coherent cache line to the memory subsystem. Cache-line hits in state “Modified” will be written back. Line status migrates to “Shared.” CopyBack data movement will be initiated by the coherence manager using the intervention port.

CohInvalidate: purge a coherent cache line without writing back its contents to the memory subsystem. This command is always data-less and is posted to each peer of the CPS. Invalidate-type cache operations cause a CohInvalidate message.

CohWriteInvalidate: an I/O coherence unit injects a new cache line into the coherent domain. Existing peer line data will be invalidated throughout the CPS.

CohReadInvalidate: an I/O coherence unit notifies the system about a cache line leaving the coherent domain. Existing peer line data will be invalidated throughout the CPS.

CohCompletionSync: a data-less command to maintain ordering. Local buffers of CPS peers are flushed toward the memory subsystem. The CPU SYNC instruction causes a CohCompletionSync for CPUs participating in the coherent domain. SYNC command arguments (sync types) help control the depth of flush operations throughout the memory hierarchy. The CPS reserves certain argument encodings to support low-overhead access ordering.
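The effect of these five commands on a single peer's line state can be summarized in one function. This is a sketch: the function name is hypothetical. Leaving non-Modified lines unchanged on CohCopyBack, and leaving line state untouched on CohCompletionSync, are assumptions, since the descriptions above do not spell out those cases.

```python
def peer_state_after(command, state):
    """Return the peer's line state ('M', 'E', 'S' or 'I') after a command."""
    if command == "CohCopyBack":
        # A Modified hit is written back and migrates to Shared.
        # Assumption: lines in other states keep their state.
        return "S" if state == "M" else state
    if command in ("CohInvalidate", "CohWriteInvalidate", "CohReadInvalidate"):
        return "I"  # existing peer line data is invalidated
    if command == "CohCompletionSync":
        # Ordering only: buffers are flushed. Assumption: line state is untouched.
        return state
    raise ValueError(f"not a cache-manipulation command: {command}")
```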

Noncoherent commands
Traditional OCP commands such as “Read” and “Write” are supported throughout the CPS to handle data access for noncoherent memory access. The Read command is issued when a miss within a cached, noncoherent address or an uncached access causes a read operation from the memory subsystem. Response data, if cacheable, will be installed as noncoherent, whereas uncached data are consumed directly. Fetch as well as load and store activity causes Read transactions. The Write command is issued when cached, noncoherent eviction data or uncached-address-range stores will be written back to the memory subsystem. The OCP main port of a core performs the command and data phases of the transaction.
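The choice between Read and Write can be sketched as a lookup from the triggering event. The event names below are hypothetical labels for the cases the paragraph lists. They are not OCP or MIPS identifiers.

```python
# Hypothetical event labels for the cases described above.
READ_EVENTS = {"cached_noncoherent_miss", "uncached_fetch", "uncached_load"}
WRITE_EVENTS = {"cached_noncoherent_eviction", "uncached_store"}


def noncoherent_command(event):
    """Return the OCP command issued for a noncoherent access event."""
    if event in READ_EVENTS:
        return "Read"
    if event in WRITE_EVENTS:
        return "Write"
    raise ValueError(f"unknown event: {event}")


def handle_read_response(cacheable):
    """Cacheable response data is installed as noncoherent; uncached data is consumed directly."""
    return "install_noncoherent" if cacheable else "consume_directly"
```

A store that misses a cached, noncoherent line falls under `cached_noncoherent_miss` in this sketch, which matches the statement that store activity can also cause Read transactions.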

OCP works well for CPS
The OCP interconnect lends itself well to supporting message-based coherence implementations. A centralized coherence manager serializes the coherence messages emanating from individual cores and inquires about the coherence status of peer cores. Data forwarding between cores decreases access latency and reduces traffic to higher levels of the memory hierarchy. Individual cores possess an OCP master port to initiate data access and an OCP slave port to receive inquiries from the coherence manager.

Matthias Knoth is a design engineer for MIPS Technologies, Inc., responsible for low-power micro-architecture and 1004K processor implementation. Knoth has more than 13 years' experience in the semiconductor industry with companies including Siemens Research, Siemens Microelectronics, Infineon Technologies and Quicksilver Technology. Knoth holds a master's degree in electronics from the University of Technology, Chemnitz, Germany. You may reach him at
