Using a scheduled cache model to reduce memory latencies in multicore DSP designs

The most advanced high-end DSP cores on the market today are fully cache-based by design while maintaining low latency when accessing higher memory hierarchies (L2/L3). The performance of cache-based DSP systems is strongly affected by the cache hit ratio and by the miss penalty.
The hit ratio - the number of accesses that hit in the cache divided by the total number of accesses (hit count + miss count) - depends on the application's temporal and spatial locality. The miss penalty - the number of cycles the core waits for a miss to be served - depends on the physical location of the data in the memory system at the time of the cache miss.
Traditional systems rely on the Direct Memory Access (DMA) model in which the DMA controller is used to move data to a memory closer to the core. This method is complicated and requires precise, restrictive scheduling to achieve coherency.
As an alternative, this article describes a new software model - and the hardware mechanisms that support it - used in the Freescale SC3850 StarCore DSP subsystem residing in the MSC8156 multicore DSP. Called the scheduled cache model, it reduces the need for DMA programming and synchronization to achieve high core utilization.
The scheduled cache model relies on hardware mechanisms (some of which are controlled by software) to increase cache efficiency. Using these mechanisms can yield DMA-like performance while maintaining a relatively simple, flexible software model that reduces overall system development time and time-to-market (TTM).
Reviewing the Cache Memory Basics
A cache-based system uses small, fast memories to store short-term copies of data held in the slower main memory. A typical memory hierarchy of a modern DSP is shown in Figure 1 below.
Each DSP core has its private L1 caches, normally split into instruction and data caches. Some systems also contain an M1 memory. The larger, second level of memory can be a cache (L2) or a regular SRAM memory (M2). Additional memories are larger and slower, and may reside on chip (such as a shared M3 memory) or off chip (such as DDR).
|Figure 1: Typical Modern DSP Memory Hierarchy|
The core accesses the cached copies by their original memory locations (or by their virtual addresses, in case translation is used), which means that the core can conveniently view a static memory map. If the content of a memory location that the core needs is not in the cache, the cache automatically fetches it from the main memory.
In comparison, a DMA-based system uses a near memory with its own fixed addresses (memory locations). The programmer must configure the DMA to move data from a far memory to the near memory. Therefore, the software must know when the data has been transferred, and it must be aware that the same address in the near memory holds different contents at different times.
Hit/Miss and Types of Misses
This article uses the following definitions:
* Hit: The core accesses a memory location whose content is already in the cache. In this case, the access is serviced directly by the cache without any penalty.
* Miss: The core accesses a memory location whose content is not in the cache. The result of a cache miss is an automatic fetch from higher memory (the next level of cache or main memory). The core waits for the fetch to complete; this waiting is called the miss penalty.
To minimize core wait states, both the number of misses and the average miss penalty should be reduced.
There are three main types of cache misses:
* Compulsory miss: The first core access to a memory location that was never in the cache (we'll explain how even these "compulsory" misses can be avoided).
* Capacity miss: A memory location was in the cache at some point in time, but was evicted by another memory location (before the current core access) due to the finite size of the cache.
* Conflict miss: A memory location was in the cache at some point in time, but was evicted (replaced by the content of another memory location) before the current core access, for one of two reasons: 1) insufficient associativity, which is less frequent in architectures such as the SC3850 core subsystem with its 8-way caches; or 2) imperfect choices by the replacement mechanism, which can be avoided by cache partitioning, as explained later in this article.