Much has been published on cache organization and cache coherence in the context of general purpose computing. This article discusses these issues as they relate to real-time applications and embedded systems in particular.
As more embedded applications become complex enough to require multiprocessing, cache coherence technology needs to be evaluated and adapted to these applications. The context for this article is real-time applications, with emphasis on high-end tasks that require multiple processors. Rather than attempt to precisely define real-time computing, I'll describe the key relevant application issues and discuss several broad classes of real-time applications.

Periodic real-time computing is generally controlled by a fixed time base that often takes the form of a main time frame and smaller divisions within the frame. The processors are made aware of this base through repetitive interrupts that are either derived from a very accurate internal timer or supplied from an external source that may be common to other equipment in the system. Typically hundreds of tasks are executed within the frame. Some tasks are executed every frame, and some are executed only when a condition occurs; but the conditional tasks have a pre-determined slot in which to execute. The dominant concern in this model is deterministic execution times. Task overruns can seriously disrupt the complete process.
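The frame-based structure described above can be sketched as a cyclic executive: a timer interrupt advances through minor frames within a major frame, and every task runs in a pre-determined slot. A minimal illustration in C; the task names and frame layout are hypothetical, not from the article:

```c
/* Sketch of a frame-based (cyclic executive) scheduler.
 * In a real system run_major_frame() would be driven by a very
 * accurate timer interrupt; here it is called directly. */
#include <stddef.h>

#define MINOR_FRAMES    4   /* minor cycles per major frame */
#define SLOTS_PER_FRAME 3   /* fixed task slots per minor frame */

typedef void (*task_fn)(void);

static int samples_taken;   /* demo counter for the sensor task */

static void read_sensors(void)   { samples_taken++; }         /* every minor frame  */
static void update_control(void) { /* control-law update */ } /* alternating frames */
static void log_status(void)     { /* housekeeping */ }       /* once per major frame */

/* Each minor frame has a fixed slot list; a task that overruns its
 * slot disrupts every task scheduled after it. */
static task_fn schedule[MINOR_FRAMES][SLOTS_PER_FRAME] = {
    { read_sensors, update_control, log_status },
    { read_sensors, NULL,           NULL       },
    { read_sensors, update_control, NULL       },
    { read_sensors, NULL,           NULL       },
};

void run_major_frame(void) {
    for (int minor = 0; minor < MINOR_FRAMES; minor++)
        for (int slot = 0; slot < SLOTS_PER_FRAME; slot++)
            if (schedule[minor][slot])
                schedule[minor][slot]();
}
```

Because the slot table is fixed at build time, execution time per frame is deterministic by construction, which is the dominant concern in this model.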
Continuous real-time applications deal with monitoring and controlling physical variables as they occur in time. In the early days of computing, these applications were handled with analog computers. In some cases the analog computers were directly replaced with digital computers, but the systems were organized as if analog computers were still in use. Most continuous systems are implemented as periodic systems with very fast frames.
Event-driven real-time applications are characterized by essentially unpredictable combinations of events requiring handling by a computer system. A good example of this is the system used by Raytheon to detect wind shear bursts at airports. With the use of radar technology, atmospheric conditions are continuously monitored to detect the deadly rapid shifts of winds known as microbursts. The detection takes extensive computational analysis structured as hundreds of tasks that are activated as required by the varying conditions. The key parameter is the speed of response to events. Because there will be conflicts among events, an efficient priority scheme with many levels is required to assure that the most critical events are analyzed first. Due to the large number of tasks and the unpredictable sequences, systems like this one often need to be organized as general purpose computers; as a result, they can take advantage of some of the technology developments in general purpose computing.
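One common way to implement a many-level priority scheme with constant-time dispatch is a pending-event bitmap. A minimal sketch in C; the bitmap approach and the names here are my illustration, not a description of the Raytheon system:

```c
/* O(1) multi-level priority dispatch using a pending-event bitmap.
 * Priority 0 is the most critical; 32 levels fit in one word. */
#include <stdint.h>

static uint32_t pending;   /* bit i set => an event at priority i is pending */

void post_event(int prio) {
    pending |= (1u << prio);
}

/* Return the most critical pending priority and consume it, or -1 if
 * nothing is pending.  __builtin_ctz (count trailing zeros) is a
 * GCC/Clang builtin; many embedded CPUs have an equivalent instruction. */
int next_event(void) {
    if (pending == 0)
        return -1;
    int prio = __builtin_ctz(pending);  /* lowest set bit = highest priority */
    pending &= pending - 1;             /* clear that bit */
    return prio;
}
```

The dispatch cost is a single bit-scan regardless of how many events are queued, which keeps the response time to the most critical event predictable.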
Clearly many actual systems are combinations of the above cases. Highly periodic systems are more suitable for specialized system designs. Systems requiring more flexibility may need to use more general purpose architectures and techniques, but generally at a higher cost.
In practice, symmetric multiprocessing has often been identified with systems in which all of the memory of the system is fully shared. This situation offers the most flexibility. However, these systems tend to suffer from scalability problems unless the bandwidth of the interconnection scheme can sustain the memory traffic required by the largest configurable number of processors. Although impressive improvements continue to be achieved in bus bandwidth technology, improvements in the speed of silicon within an integrated circuit (such as a processor) will always outpace the speed achievable through inter-component connections. Solutions to this problem generally limit flexibility in some way. In embedded systems, the flexibility can be tailored to the task, rather than follow a general-purpose computing model completely.
A logical extension of the local-remote architecture is to add one or more global memory modules. Shared data or code can be placed in global memory and private or slightly shared data or code can be placed in local memories.
Although these concepts have been formulated in the context of general purpose computers, they can be usefully applied to embedded systems. As a matter of fact, embedded systems are better suited to NUMA concepts than general purpose systems are, because knowledge of the application can be used to partition memories and optimize system performance. Memory subsystems can be added in proximity to the processors that will be using the data most often. If several processors use mostly shared data, their memories can be structured to reflect this. Other processors that make occasional use of this data can be more remote but still have access to the data.
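The local/remote distinction can be made concrete with a toy allocator: each node owns a pool, allocations are steered to the requesting processor's local pool, and a remote pool serves only as a fallback. This is a sketch under assumed sizes and names, not any particular NUMA operating system's interface:

```c
/* Toy per-node memory pools with local-first allocation. */
#include <stddef.h>

#define NODES     2
#define POOL_SIZE 1024   /* illustrative pool size in bytes */

typedef struct {
    unsigned char mem[POOL_SIZE];
    size_t        used;
} pool_t;

static pool_t local_pool[NODES];

/* Allocate `size` bytes preferring the pool local to `node`; fall back
 * to the remote pools only when the local one is exhausted. */
void *pool_alloc(int node, size_t size) {
    for (int i = 0; i < NODES; i++) {
        pool_t *p = &local_pool[(node + i) % NODES];
        if (p->used + size <= POOL_SIZE) {
            void *ptr = p->mem + p->used;
            p->used += size;
            return ptr;
        }
    }
    return NULL;  /* every pool is full */
}
```

In an embedded system the sizing of each pool, and which data falls back where, can be decided from application knowledge rather than left to a general-purpose policy.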
NUMA memory management
The local memories in NUMA systems can be organized into local memory pools. These pools are located in physical proximity to one or more processors. A local memory pool is a remote memory pool from the viewpoint of other processors that are located at a further distance from the local memory pool. Systems with global memories also have global memory pools. A global memory pool is equidistant from all processors in the system.
Kernel code may be loaded into global memory pools or replicated into the local memory pools. Kernel data structures must be loaded into global memory or into one of the local pools, becoming a remote memory pool to some processors. Likewise, shared application read-only pages can be loaded into local/remote pools, global pools, or replicated. Application writable pages that are shared must be in global memory pools or in local/remote pools. Private data can simply be loaded into the appropriate local pool.
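The placement rules above amount to a mapping from the kind of page to the kinds of pool that may hold it. A sketch of that mapping as a default policy table; the enum names are hypothetical, and a NUMA operating system would let the application override these per memory region:

```c
/* Default pool placement by page kind, following the rules in the text. */
typedef enum { POOL_LOCAL, POOL_GLOBAL, POOL_REPLICATED } pool_kind;

typedef enum {
    PAGE_KERNEL_TEXT,   /* kernel code                  */
    PAGE_KERNEL_DATA,   /* kernel data structures       */
    PAGE_SHARED_RO,     /* shared read-only application */
    PAGE_SHARED_RW,     /* shared writable application  */
    PAGE_PRIVATE        /* private data                 */
} page_kind;

pool_kind default_pool(page_kind k) {
    switch (k) {
    case PAGE_KERNEL_TEXT: return POOL_REPLICATED; /* read-only: copy per node      */
    case PAGE_KERNEL_DATA: return POOL_GLOBAL;     /* or one local pool, remote to
                                                      the other processors          */
    case PAGE_SHARED_RO:   return POOL_REPLICATED; /* replication is safe           */
    case PAGE_SHARED_RW:   return POOL_GLOBAL;     /* must be a single writable copy */
    case PAGE_PRIVATE:     return POOL_LOCAL;      /* keep near its processor       */
    }
    return POOL_GLOBAL;
}
```

The key point is that only writable shared pages force a single placement; everything read-only can be replicated to keep accesses local.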
NUMA operating systems provide system default allocation policies, and application engineers can declare their own defaults. Different policies can be declared for each of the memory regions of a task or process. More dynamic applications can take advantage of automatic operating system assignment and migration capabilities.
In systems using multiple processors with caches, issues dealing with the cache coherence of shared data need to be addressed. The subject of cache coherence algorithms has been extensively researched. Modern processors designed for multiprocessing provide cache coherence support in hardware. But this hardware support still incurs timing overhead in maintaining coherence. Hardware coherence algorithms require examination of stored cache tags or directories. These examinations can interfere with processor access to caches and buses. For this reason, it is desirable to identify non-shared data so as to eliminate its coherence maintenance impact. The concept of memory pools is still applicable, but the level of cache coherence for each pool needs to be identified. Global memory pools are often declared to be coherent for all processors that have access to them. Remote pools may or may not be declared coherent, providing design trade-off opportunities. Clearly, knowledge of an embedded system application can be used to optimize these coherency decisions in addition to the actual connection structure. Modern processors allow cache coherency assignments to be made on a page basis, although the system design needs to support the necessary connectivity and address propagation. The key cache coherence schemes are discussed below.
On copy-back caches, all operations normally take place on a cache line basis; lines are transferred in bursts between the cache and memory. Common line sizes are four and eight 64-bit words, and the system design is optimized for these burst operations. Before any portion of a cache line is written, the whole line is read into the cache. These reads with intent to modify are identified as different from regular reads. The following description is a simplified explanation of the MESI protocol, illustrated in Figure 2. A cache line starts as invalid. A read makes it exclusive. A read from the same address by a different processor makes it shared. A read with intent to modify makes it modified and causes any other caches that may hold the line to invalidate their copies. A read of a modified line by another processor causes the data to be written back to memory before the reading processor obtains it. The logic becomes complicated due to exceptional sequences and the fact that write-through and no-cache protocols are often intermixed with the copy-back protocol.
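The simplified MESI transitions just described can be captured in a small state machine. This sketch covers only the four transitions in the text; the exceptional sequences and mixed write-through/no-cache traffic mentioned above are deliberately omitted:

```c
/* Simplified MESI next-state function for a single cache line, as seen
 * by one cache.  "RWITM" = read with intent to modify. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

typedef enum {
    LOCAL_READ,   /* this processor reads the line                     */
    LOCAL_RWITM,  /* this processor reads with intent to modify        */
    SNOOP_READ,   /* another processor's read is observed on the bus   */
    SNOOP_RWITM   /* another processor's RWITM is observed on the bus  */
} mesi_event_t;

mesi_t mesi_next(mesi_t s, mesi_event_t e) {
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? EXCLUSIVE : s;  /* a read makes it exclusive */
    case LOCAL_RWITM:
        return MODIFIED;                        /* other copies are invalidated */
    case SNOOP_READ:
        /* A MODIFIED line is first written back to memory, then both
         * caches hold it SHARED; an INVALID line is unaffected. */
        return (s == INVALID) ? INVALID : SHARED;
    case SNOOP_RWITM:
        return INVALID;                         /* another cache now owns it */
    }
    return s;
}
```

Tracing a line through LOCAL_READ, SNOOP_READ, LOCAL_RWITM reproduces the invalid → exclusive → shared → modified sequence in the text.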
The advantage of this protocol is the reduction in write data traffic. Modified cached data is not written into memory until it is needed by another processor. The penalty is that all caches that may share data must monitor all transactions dealing with the potentially shared memory locations.
If all processors are sharing a common bus, this monitoring can be done efficiently. But if there are multiple buses in the system, address propagation can become complicated and consume time. Some systems use duplicate copies of cache tags or directory schemes to reduce these “snooping” problems. In a periodic or well-understood embedded system, the snooping overhead can be minimized by not snooping data known to be private.
Clearly there is an overhead cost associated with directory storage. One problem with this type of cache directory is that the maximum number of caches in the system must be fixed in advance, because a bit is allocated per cache for each memory line. This worst-case storage cost is incurred even if there is a single processor in the system, as long as the system is expected to be expandable. Thus this scheme is most practical for systems with a known, moderate number of nodes. Some systems address this problem by configuring the directory as a linked list in which each directory node points to the next node that may contain relevant information. The price paid, however, is a loss of determinism, because the time required to follow the linked list can vary considerably.
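The fixed worst-case cost of a full-map (bit-per-cache) directory is easy to quantify. A small helper, with illustrative parameters of my choosing: for 64MB of memory, 32-byte lines (four 64-bit words), and a maximum of 16 caches, the directory needs 2,097,152 lines × 16 bits, i.e. 4MB of presence bits, whether or not all 16 caches are installed.

```c
/* Presence-bit storage required by a full-map directory:
 * one bit per cache for every memory line. */
#include <stdint.h>

uint64_t directory_bits(uint64_t mem_bytes, unsigned line_bytes,
                        unsigned max_caches) {
    return (mem_bytes / line_bytes) * max_caches;
}
```

Because the cost scales with memory size times the configured maximum cache count, doubling the expandability limit doubles the directory, which is why the scheme suits systems with a known, moderate number of nodes.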
Software cache coherence
I/O cache coherence
Alfredo Romagosa is chief scientist at Concurrent Computer Corp., Fort Lauderdale, FL.