Cache Coherence Issues for Real-Time Multiprocessing


Much has been published on cache organization and cache coherence in the context of general purpose computing. This article discusses these issues as they relate to real-time applications and embedded systems in particular.

As more embedded applications become complex enough to require multiprocessing, cache coherence technology needs to be evaluated and adapted to these applications. The context for this article is real-time applications, with emphasis on high-end tasks that require multiple processors. Rather than attempt to precisely define real-time computing, I'll describe the key relevant application issues and discuss several broad classes of real-time applications.

Periodic real-time computing is generally controlled by a fixed time base that often takes the form of a main time frame and smaller divisions within the frame. The processors are made aware of this base through repetitive interrupts that are either derived from a very accurate internal timer or supplied from an external source that may be common to other equipment in the system. Typically hundreds of tasks are executed within the frame. Some tasks are executed every frame, and some are executed only when a condition occurs; but the conditional tasks have a pre-determined slot in which to execute. The dominant concern in this model is deterministic execution times. Task overruns can seriously disrupt the complete process.
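To make this model concrete, the following C sketch shows a minimal frame-based periodic executive. The timer hook, task table, and overrun handling are illustrative assumptions for exposition, not the interface of any particular real-time kernel.

/* Minimal sketch of a frame-based periodic executive.  The timer hook,
 * task table, and overrun handling are illustrative, not taken from a
 * specific real-time kernel. */
#include <stdint.h>
#include <stdio.h>

#define MINOR_FRAMES   8      /* minor frames per major frame        */
#define TASKS_PER_SLOT 4      /* tasks that may run in a minor frame */

typedef void (*task_fn)(void);

/* Task table: which tasks run in which minor frame (NULL = empty slot). */
static task_fn schedule[MINOR_FRAMES][TASKS_PER_SLOT];

static volatile uint32_t tick;   /* advanced by the frame-timer interrupt */

void frame_timer_isr(void) { tick++; }

void periodic_executive(void)
{
    uint32_t last = tick;

    for (;;) {
        while (tick == last)
            ;                              /* wait for the next minor frame */
        last = tick;

        uint32_t slot = last % MINOR_FRAMES;
        for (int i = 0; i < TASKS_PER_SLOT; i++)
            if (schedule[slot][i])
                schedule[slot][i]();       /* run this slot's tasks */

        if (tick != last)                  /* tasks ran past the next tick: */
            printf("overrun in frame %u\n", (unsigned)slot);
    }
}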

Continuous real-time applications deal with monitoring and controlling physical variables as they evolve in time. In the early days of computing, these applications were handled with analog computers. In some cases the analog computers were directly replaced with digital computers, but the system is still organized as if analog computers were being used. Most continuous systems are implemented as periodic systems with very fast frames.

Event-driven real-time applications are characterized by essentially unpredictable combinations of events requiring handling by a computer system. A good example of this is the system used by Raytheon to detect wind shear at airports. With the use of radar technology, atmospheric conditions are continuously monitored to detect the deadly rapid shifts of winds known as microbursts. The detection takes extensive computational analysis structured as hundreds of tasks that are activated as required by the varying conditions. The key parameter is the speed of response to events. Because there will be conflicts among events, an efficient priority scheme with many levels is required to assure that the most critical events are analyzed first. Due to the large number of tasks and the unpredictable sequences, systems like this one often need to be organized as general purpose computers, and they can take advantage of some of the technology developments in general purpose computing.

Clearly many actual systems are combinations of the above cases. Highly periodic systems are more suitable for specialized system designs. Systems requiring more flexibility may need to use more general purpose architectures and techniques, but generally at a higher cost.

Embedded systems
Embedded systems are computing systems dedicated to specific tasks. In many cases, the work being done was originally done by custom logic. Simple microprocessors or digital signal processors were then used for a more flexible implementation. Now 32-bit RISC microprocessors provide the performance and features of a full computer system at a reasonable cost. This capability makes the use of fuller-functionality operating systems and high-level languages practical and desirable. These systems are easier to develop and maintain. More complex algorithms and intelligence can now be implemented, yielding more sophisticated products. For larger applications, multiple processors can be used as a coherent system, providing further flexibility and scalability. As discussed above, event-driven systems often require high resource flexibility.

Multiprocessing architectures
The earliest multiprocessor systems used master/slave configurations in which a master processor executed the operating system and the slave processors executed applications. Later multiprocessors used a symmetric architecture in which any of the processors could temporarily become the “master” by executing a shared copy of the operating system, avoiding master bottlenecks. Another advantage of symmetric multiprocessing is load leveling: applications can “float” to any of the processors, balancing the load across the system.

In practice, symmetric multiprocessing has often been identified with systems in which all of the memory of the system is fully shared. This situation offers the most flexibility. However, these systems tend to suffer from scalability problems unless the bandwidth of the interconnection scheme can sustain the memory traffic required by the largest configurable number of processors. Although impressive improvements continue to be achieved in bus bandwidth technology, improvements in the speed of silicon within an integrated circuit (such as a processor) will always outpace the speed achievable through inter-component connections. Solutions to this problem generally limit flexibility in some way. In embedded systems, the flexibility can be tailored to the task, rather than follow a general-purpose computing model completely.

NUMA architectures
Researchers have explored the advantages of symmetric multiprocessors with “unsymmetric” memory. The term Non-Uniform Memory Access (NUMA) was coined in conjunction with this research. Many of the newer commercial multiprocessor systems embrace the NUMA concept. NUMA architectures generally partition memory into local memories that are placed close to one or more processors. The memories that are not the closest to a processor or processor cluster are considered to be remote memories (from the viewpoint of the processor or cluster). Shared memory is either allocated into one of the local memories or distributed among the local memories. In the case of single local allocation, the “host” processor for this memory may be penalized by the shared memory accesses from the other processors. An example of a local/remote memory architecture is the Stanford University Dash system, which is illustrated in Figure 1. The Dash architecture actually used multiple buses for interconnection, and a great deal of the Stanford research dealt with the use of directories to implement cache coherence. Many of the commercial NUMA systems use this type of organization.

A logical extension of the local-remote architecture is to add one or more global memory modules. Shared data or code can be placed in global memory and private or slightly shared data or code can be placed in local memories.

Although these concepts have been formulated in the context of general purpose computers, they can be usefully applied to embedded systems. In fact, embedded systems are better suited to NUMA concepts than general purpose systems are, because knowledge of the application can be used to partition memories to optimize system performance. Memory subsystems can be added in proximity to the processors that will be using the data most often. If several processors use mostly shared data, their memories can be structured to reflect this. Other processors that make occasional use of this data can be more remote but still have access to it.

NUMA memory management
A useful way to organize memory in a NUMA system is in terms of memory pools. A memory pool is a homogeneous block of physical memory. The operating system is aware of these pools and their relationship to processors, including any characteristics that affect access performance and cache coherence. The operating system can use these characteristics to optimize process-to-processor allocations. Some NUMA systems also allow users to control these allocations.

The local memories in NUMA systems can be organized into local memory pools. These pools are located in physical proximity to one or more processors. A local memory pool is a remote memory pool from the viewpoint of processors located farther from it. Systems with global memories also have global memory pools. A global memory pool is equidistant from all processors in the system.

Kernel code may be loaded into global memory pools or replicated into the local memory pools. Kernel data structures must be loaded into global memory or into one of the local pools, becoming a remote memory pool to some processors. Likewise, shared application read-only pages can be loaded into local/remote pools, global pools, or replicated. Application writable pages that are shared must be in global memory pools or in local/remote pools. Private data can simply be loaded into the appropriate local pool.
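To make these placement rules concrete, the following C sketch shows how pool descriptors and a simple placement choice might look. The structures, field names, and policy are assumptions for illustration, not the interface of any particular NUMA operating system.

/* Illustrative NUMA memory-pool descriptors and placement policy. */
#include <stdint.h>
#include <stddef.h>

typedef enum { POOL_LOCAL, POOL_GLOBAL } pool_kind;

typedef struct {
    pool_kind kind;
    uint32_t  near_cpus;   /* bitmask of processors close to this pool */
    int       coherent;    /* hardware keeps this pool coherent        */
} mem_pool;

typedef enum {
    PAGE_KERNEL_TEXT,      /* kernel code: may also be replicated locally */
    PAGE_SHARED_RO,        /* shared read-only data or code               */
    PAGE_SHARED_RW,        /* writable data shared by several processors  */
    PAGE_PRIVATE           /* data used by a single processor             */
} page_class;

/* Choose a pool for a page, given a bitmask of the processors expected
 * to use it.  Returns NULL if no pool fits, so a default policy applies. */
const mem_pool *place_page(page_class cls, uint32_t users,
                           const mem_pool *pools, size_t npools)
{
    for (size_t i = 0; i < npools; i++) {
        const mem_pool *p = &pools[i];
        switch (cls) {
        case PAGE_PRIVATE:
        case PAGE_KERNEL_TEXT:
        case PAGE_SHARED_RO:
            /* prefer a pool that is local to all of the expected users */
            if (p->kind == POOL_LOCAL && (p->near_cpus & users) == users)
                return p;
            break;
        case PAGE_SHARED_RW:
            /* writable shared pages go to global pools, or to coherent
             * local/remote pools near at least one of the users */
            if (p->kind == POOL_GLOBAL ||
                (p->coherent && (p->near_cpus & users) != 0))
                return p;
            break;
        }
    }
    return NULL;
}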

NUMA operating systems have system default allocation policies, or application engineers can declare their own. Different policies can be declared for each of the memory regions of a task or process. The more dynamic applications can take advantage of automatic operating system assignment and migration capabilities.

Cache issues
In the early days of caching, many real-time customers objected to the use of caches in computer systems due to the indeterminism that they introduced. Acceptance tests often required that caches be turned off to observe the worst case behavior. The speed of processors compared to memories has since made this viewpoint impractical. This level of indeterminism must simply be allowed for; it cannot be ignored. This allowance is easier to make in periodic applications than in event-driven applications, because the caches tend to fall into repetitive patterns.

In systems using multiple processors with caches, the cache coherence of shared data needs to be addressed. The subject of cache coherence algorithms has been extensively researched. Modern processors designed for multiprocessing provide cache coherence support in hardware. But this hardware support still incurs some timing overhead in maintaining coherence. Hardware coherence algorithms require examination of stored cache tags or directories. These examinations can interfere with processor access to caches and buses. For this reason, it is desirable to identify non-shared data so as to eliminate its coherence maintenance impact. The concept of memory pools is still applicable, but the level of cache coherence for each pool needs to be identified. Global memory pools are often declared to be coherent for all processors that have access to them. Remote pools may or may not be declared to be coherent, providing design trade-off opportunities. Clearly, knowledge of an embedded system application can be used to optimize these coherency decisions in addition to the actual connection structure. Modern processors allow cache coherency assignments to be made on a page basis, although the system design needs to support the necessary connectivity and address propagation. The key cache coherence schemes are discussed below.

Write-through caches
The simplest cache coherence scheme is the use of a write-through protocol. All writes to a shareable memory location are written through the cache to the actual memory location, so any modified data is always kept up to date in memory. The problem with this scheme is that the caches do not reduce the write traffic at all, which can easily become excessive as the number of processors increases. Often a memory location is updated several times before any other processor reads it; this is wasted traffic. For these reasons most modern processors implement some other protocol, although some implement a write-through protocol in addition. In applications where there is a moderate amount of shared data writing, it is practical to use this protocol for the shared data only, due to its simplicity. Most processors that implement multiple protocols allow protocol selection on a page basis.
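As a concrete example of per-page protocol selection, PowerPC processors such as the 604 carry WIMG attribute bits with each page translation (write-through, caching-inhibited, memory coherence required, guarded). The sketch below is an illustrative C rendering of such attributes; the structure layout and names are assumptions for exposition, not the actual page-table-entry format.

/* Illustrative per-page cache attributes, loosely modeled on the
 * PowerPC WIMG bits. */
typedef struct {
    unsigned write_through : 1;  /* W: writes go through the cache to memory */
    unsigned cache_inhibit : 1;  /* I: do not cache this page at all         */
    unsigned coherent      : 1;  /* M: hardware must keep this page coherent */
    unsigned guarded       : 1;  /* G: no speculative accesses               */
} page_attr;

/* Private scratch data: copy-back, no coherence maintenance needed. */
static const page_attr attr_private   = { 0, 0, 0, 0 };

/* Moderately written shared data: write-through plus coherence. */
static const page_attr attr_shared_wt = { 1, 0, 1, 0 };

/* Memory-mapped device registers: uncached and guarded. */
static const page_attr attr_device    = { 0, 1, 0, 1 };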

Copy-back caches
In this scheme, the write-traffic problem is addressed by not updating memory on every write. Modified data is kept in the caches, and memory reads are monitored to detect requests for data that is held in one of the caches. This complicates the cache logic due to the additional state maintenance. Many variations of the state logic are possible, but one technique has become very popular. This technique is identified by the initials of its states: Modified-Exclusive-Shared-Invalid, or MESI. Each cache maintains the state of its own lines, and these states must be kept consistent across the caches.

On copy-back caches, all operations are normally on a cache line basis, with lines transferred in bursts between the cache and memory. Common line sizes are four and eight 64-bit words. The system design is optimized for these burst operations. Before any portion of a cache line is written, the whole line is read into the cache. These reads with intent to modify are identified as different from regular reads. The following description is a simplified explanation of the MESI protocol, illustrated in Figure 2. A cache line starts as invalid. A read makes it exclusive. A read from the same address by a different processor makes it shared. A read with intent to modify makes it modified and causes any other processor caches that may contain this line to invalidate their copies. A read of a modified line by another processor causes the data to be written into memory before it is read by the requesting processor. The logic becomes complicated due to exceptional sequences and the fact that write-through and no-cache protocols are often intermixed with the copy-back protocol.
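The transitions just described can be summarized in code. The following C sketch models the per-line MESI state changes for the simplified cases in the text only; it omits the exceptional sequences and the intermixing with write-through and uncached pages, and it assumes a plain read of an uncached line lands in the exclusive state.

/* Simplified MESI state transitions for one cache line, covering only
 * the cases described above. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef enum {
    LOCAL_READ,      /* this cache reads the line                       */
    LOCAL_RWITM,     /* this cache issues a read-with-intent-to-modify  */
    REMOTE_READ,     /* another cache reads the same address            */
    REMOTE_RWITM     /* another cache issues read-with-intent-to-modify */
} mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e, int *write_back)
{
    *write_back = 0;
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? EXCLUSIVE : s;   /* line now cached       */
    case LOCAL_RWITM:
        return MODIFIED;                         /* we hold the dirty copy */
    case REMOTE_READ:
        if (s == MODIFIED)
            *write_back = 1;                     /* supply the data first  */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_RWITM:
        if (s == MODIFIED)
            *write_back = 1;                     /* flush, then give up    */
        return INVALID;                          /* another cache owns it  */
    }
    return s;
}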

The advantage of this protocol is the reduction in write data traffic. Modified cached data is not written into memory until it is needed by another processor. The penalty is that all caches that may share data must monitor all transactions dealing with the potentially shared memory locations.

If all processors are sharing a common bus, this monitoring can be done efficiently. But if there are multiple buses in the system, address propagation can become complicated and consume time. Some systems use duplicate copies of cache tags or directory schemes to reduce these “snooping” problems. In a periodic or well-understood embedded system, the snooping overhead can be minimized by not snooping data known to be private.

Cache directories
Directory schemes keep track of the usage of shared memory regions in order to limit the coherence maintenance to the processors that are actually using the shared data. The description below is for a directory used in conjunction with the copy-back protocol described above. In a directory, for every shareable and cacheable memory line, there is a “presence” bit for each coherent processor cache in the system, along with a “dirty” bit whose role is explained below. Whenever a line is requested for a cache, the presence bit corresponding to that cache is set. Multiple presence bits for a line will be set as different processors read this data. The dirty bit remains false while all of the requests for the line have been reads. If any of the processors caching the line intends to write into it, it must indicate as much to the directory with a read-with-intent-to-modify command. This command causes the presence bits to be used to invalidate the cache lines of all of the caches that previously contained the line. All of the presence bits for the line are then cleared, the presence bit corresponding to the processor that intends to write is set, and the dirty bit for the line is set. Note that the processor did not need to write the line out from its cache. If any other processor requests access to this line, the cache that contains the modified data has to write it into memory first, which resets the dirty bit and restarts the whole sequence. Thus processor caches do not need to “snoop” each other, because the directory does it for them.
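The following C sketch shows a full-map directory entry of this kind: one presence bit per cache plus a dirty bit for each shareable memory line. The fixed-width bitmask is why the maximum number of caches must be known in advance. The field names and helper functions are illustrative.

/* Sketch of a full-map directory entry and the two operations described
 * above.  Invalidation and write-back messages are noted but not shown. */
#include <stdint.h>

#define MAX_CACHES 32

typedef struct {
    uint32_t present;   /* bit i set: cache i holds a copy of this line        */
    uint8_t  dirty;     /* set while exactly one cache holds a modified copy    */
} dir_entry;

/* A cache issues a plain read: record it as a sharer. */
void dir_read(dir_entry *e, unsigned cache_id)
{
    if (e->dirty) {
        /* ask the current owner to write the line back first (not shown) */
        e->dirty = 0;
    }
    e->present |= 1u << cache_id;
}

/* A cache issues read-with-intent-to-modify: invalidate all other copies,
 * then record this cache as the sole (dirty) owner. */
void dir_rwitm(dir_entry *e, unsigned cache_id)
{
    uint32_t others = e->present & ~(1u << cache_id);
    /* send invalidations to every cache whose bit is set in 'others' (not shown) */
    (void)others;
    e->present = 1u << cache_id;
    e->dirty   = 1;
}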

Clearly there is overhead cost associated with the directory storage. One problem with this type of cache directory is that the maximum number of caches in the system must be fixed in advance, because a bit is allocated for each one in every memory line's entry. This worst-case storage cost is incurred even if there is a single processor in the system, as long as the system is expected to be expandable. Thus this scheme is most practical for systems with a known, moderate number of nodes. Some systems address this problem by configuring the directory as a linked list in which each directory node points to the next node that may contain relevant information. However, the price paid is indeterminism, because the time required to follow the linked list can vary considerably.

Software cache coherence
Cache coherence in a multiprocessor can also be implemented with software procedures. While this is impractical in a general purpose system, it may be realistic in a well-understood embedded system. A simple scheme that is adequate for some systems is not to cache shared data at all. If access to this data is infrequent, this is a reasonable trade-off. Another simple software-managed scheme is to allow periodically updated data to remain incoherent during the periods when it is not being used, and to invalidate the corresponding cache lines before a task that uses the data is initiated. Flush sequences may be included as part of application programs, although some operating systems may not facilitate this. These techniques require shared data to be segregated, but in NUMA architectures segregation is often done to optimize system performance anyway. The use of software cache coherence may allow the use of simpler processors that do not support hardware cache coherence.
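The sketch below illustrates such a software-managed flush/invalidate sequence around a shared buffer. The cache_flush_range() and cache_invalidate_range() primitives are placeholders for whatever line-flush and line-invalidate operations the processor or kernel actually provides, and the buffer is assumed to live in a pool that is not kept coherent by hardware.

/* Software-managed coherence around a shared buffer (illustrative). */
#include <stddef.h>

extern void cache_flush_range(void *addr, size_t len);       /* write dirty lines back */
extern void cache_invalidate_range(void *addr, size_t len);  /* discard cached copies  */

#define BUF_WORDS 1024
static int shared_buf[BUF_WORDS];

/* Producer task: fill the buffer, then push it out to memory so the
 * consumer's reads see the new values. */
void producer_task(void)
{
    for (int i = 0; i < BUF_WORDS; i++)
        shared_buf[i] = i * i;                       /* application data */
    cache_flush_range(shared_buf, sizeof shared_buf);
}

/* Consumer task on another processor: discard any stale cached copy
 * before the data is used in this frame. */
long consumer_task(void)
{
    long sum = 0;
    cache_invalidate_range(shared_buf, sizeof shared_buf);
    for (int i = 0; i < BUF_WORDS; i++)
        sum += shared_buf[i];
    return sum;
}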

I/O cache coherence
The MESI protocol is designed for multiple processors, but it is also used for a single processor and direct-memory-access I/O. The I/O master acts as if it were a processor without a cache. If it needs data from memory that is modified in a cache, the data must be written into memory before it is read. Caches that may contain data that is being written by an I/O master must monitor these writes and invalidate accordingly. I/O operations are often structured in cache line increments to take advantage of the efficient bursting mechanisms. As with overall coherence, software I/O cache coherence is also a possibility. In embedded systems with cyclical or well understood operations, it may be more efficient to force data from caches into memory ahead of time so as not to slow down time-critical I/O. Likewise, an I/O buffer page being written into may be invalidated more efficiently as a block before the I/O transfer.
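A short sketch of software I/O coherence around DMA transfers follows. The cache primitives and the dma_start()/dma_wait() calls are placeholders standing in for the processor's and device driver's real interfaces.

/* Software I/O coherence around DMA transfers (illustrative). */
#include <stddef.h>

extern void cache_flush_range(void *addr, size_t len);
extern void cache_invalidate_range(void *addr, size_t len);

typedef enum { DEVICE_TO_MEMORY, MEMORY_TO_DEVICE } dma_dir;
extern void dma_start(dma_dir dir, void *buf, size_t len);
extern void dma_wait(void);

/* Device writes into memory: invalidate the buffer as a block first so
 * no stale dirty lines are later written back on top of the DMA data. */
void dma_input(void *buf, size_t len)
{
    cache_invalidate_range(buf, len);
    dma_start(DEVICE_TO_MEMORY, buf, len);
    dma_wait();
}

/* Device reads from memory: force modified lines out of the cache ahead
 * of time so the time-critical transfer is not slowed by coherence actions. */
void dma_output(void *buf, size_t len)
{
    cache_flush_range(buf, len);
    dma_start(MEMORY_TO_DEVICE, buf, len);
    dma_wait();
}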

Embedded applications
NUMA technology, now becoming widespread in general purpose computers, is directly applicable to embedded multiprocessors. These architectures introduce new issues on cache coherence. For the most complex systems requiring flexibility, techniques such as cache directories may be worthwhile. A directory scheme automatically implements the type of coherence partitioning that an embedded system designer could do explicitly based on his knowledge of the application. For simpler or more predictable systems, minimizing the cache coherence overhead can be accomplished through partitioning or the use of a combination of hardware and software cache coherence techniques.

Alfredo Romagosa is chief scientist at Concurrent Computer Corp., Fort Lauderdale, FL.

References:
Bolosky, William J., Michael L. Scott, and Robert P. Fitzgerald, “Simple but Effective Techniques for NUMA Memory Management,” Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, 1989, pp. 19–31.

Hausman, F.T., “Combating the Number-One Cause of U.S. Airliner Tragedies,” For Your Information, Summer, 1990.

Heuser, Mark, PowerMAX OS NUMA Concept Specification, Concurrent Computer Corporation, 1994.

Lenoski, Daniel, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam, “The Stanford Dash Multiprocessor,” Computer, March 1992, pp. 63–79.

PowerPC 604 User's Manual, IBM/Motorola, 1994.

Smith, Alan Jay, “Cache Memories,” Computing Surveys, September 1982, pp. 473–530.

Stenstrom, Per, “A Survey of Cache Coherence Schemes for Multiprocessors,” Computer, June 1990, pp. 12–24.
