Using a scheduled cache model to reduce memory latencies in multicore DSP designs -

Using a scheduled cache model to reduce memory latencies in multicore DSP designs

The most advanced high-end DSP cores in the market today are fully cache-based by concept while maintaining low latency when accessing higher memory hierarchies (L2/L3). Performance of cache-based DSP systems is highly affected by the cache hit ratio and by the miss penalty.

Hit ratio – the number of accesses that are “hit” in the cache divided by the total number of accesses (“hit” count + “miss” count) – depends on the application locality in time and place. Miss penalty – the number of cycles that the core waits for a “miss” to be served – depends on the physical location of data in the memory system at the time of a cache miss.

Traditional systems rely on the Direct Memory Access (DMA) model in which the DMA controller is used to move data to a memory closer to the core. This method is complicated and requires precise, restrictive scheduling to achieve coherency.

As an alternative, this article describes a new software model – and hardware mechanisms that support it – used in the Freescale SC3850 StarCore DSP subsystem residing in the MSC8156 multi-core DSP. Called the scheduled cache model, , it reduces the need for DMA programming and synchronization to achieve high core utilization.

The scheduled cache model relies on hardware mechanisms (some of which are controlled by software) to increase cache efficiency. Using these mechanisms can yield DMA-like performance while maintaining a relatively simple, flexible software model that reduces overall system development time (TTM).

Reviewing the Cache Memory Basics
A cache-based system uses small, fast memories to store short term copies of data stored in the slower main memory. A typical memory hierarchy of a modern DSP is shown in Figure 1 below .

Each DSP core has its private L1 caches, normally separated to Instruction and Data. Some systems also contain an M1 memory. The larger, second level of memory can be cache (L2) or a regular SRAM memory (M2). Additional memories are larger, slower and may reside on chip like shared on chip M3 or off chip like off chip DDR.

Figure 1: Typical Modern DSP Memory Hierarchy

The core accesses the cached copies by their original memory locations (rr by their virtual address, in case translation is used), which means that the core can conveniently view a static memory map. If the content from a memory location that the core needs is not in the cache, the cache automatically fetches it from the main memory.

In comparison, a DMA-based system uses a near memory with its own fixed addresses (memory locations). The programmer must configure the DMA to move data from a far memory to a near memory. Therefore, the software must know when the data is allocated and it must be aware that the same address in the near memory has different contents.

Hit/Miss and Types of Misses . This article uses the following definitions:

* Hit : The core accesses a memory location for which the content is already in the cache. In this case, the access is serviced directly by the cache without any penalties.

* Miss : The core accesses a memory location for which the content is not in the cache. The result of a cache miss is an automatic fetch from higher memory (next level of cache or main memory). The core waits for the fetch to be processed. This waiting is called Miss penalty.

To minimize core wait state, both the number of misses and the average miss penalty should be minimized.

There are three main types of cache misses:

* Compulsory miss : The first core access to a memory location that was never in the cache (we'll explain how even these “compulsory” misses can be avoided).

* Capacity miss : A memory location was in the cache at some point in time, but was thrashed by another memory location (before the current core access), due to the finite size of the cache.

* Conflict miss : A memory location was in the cache at some point in time, but was thrashed (replaced by the content of another memory location) before the current core access, due to one of two reasons: 1) insufficient associativity, which is less frequent in architectures such as the SC3850 core subsystem, which has 8 way caches, and 2) replacement mechanism made imperfect choices. This can be avoided by cache partitioning, as explained later in this article.

Dealing with Locality
Cache memory topologies use the assumption that the core accesses the memory system locally. When defining locality, we distinguish between two types of localities:

Temporal locality , which assumes that if the core accesses a memory location, it will do so again in the near future. Examples include code implementing loops for which the same lines of code are read again and again by the core and when the core accesses stack memory repeatedly when calling and returning from functions.

Spatial locality , which assumes that if the core accesses a memory location, it will access nearby memory locations as well in the near future. Examples include fetching program code in a linear sequential manner and when the core accesses member i in an array, it increases the likelihood it will also access member i+1.

Figure 2 below shows an example of a well localized access pattern generated by a core. The horizontal axis represents time, the vertical axis corresponds to memory locations, and each point is a core access. The highlighted sections show examples of temporally and spatially local access patterns.

Figure 2: Program Access Pattern Example with Good Locality

Figure 3 below shows an example of a poorly localized access pattern. It can be seen that the software accesses a relatively large memory footprint without a specific order and with little temporal reuse, making this code cache unfriendly.(The actual memory size brought to the cache at runtime for the application)

Figure 3: Program Access Pattern Example with Bad Locality

Limitations of the DMA Software Model Concept
The true DMA software model in modern DSP systems normally includes, in addition to large, slow memories, an M2 memory and an L1 cache (sometimes also an M1 memory ).

Despite the L1 assistance, the out-of-the-box performance (running from a slow memory like DDR/M3 without DMA transfers to M2) is poor, potentially making it difficult to use even as a baseline.

To achieve reasonable performance, the programmer configures the DMA controller to bring data into M2 memory, then the core accesses M2 through the L1 cache, and finally the programmer configures the DMA controller to send the updated memory locations out of M2.

There are several drawbacks to using the DMA model, including:

* DMA programming overhead is very large. Most modern DMA engines allow very elaborate transfer programming. Moreover, a modern system does not allow programming of the DMA buffer-descriptors and registers directly. Instead, the programming is done using a device driver and takes hundreds of cycles.

* DMA programming is not flexible. The transfers are not very regular, and therefore, not everything can be managed with DMA.

* On-chip memory requirements are large in DMA model, because buffers are usually allocated statically. Large on-chip memory significantly increases the device size, power consumption, and cost. One can only avoid the need for large on-chip memory in the DMA model at a cost of high system complexity, which requires intensive design effort and longer TTM (Time To Market).

The programmer must make sure that the DMA transactions are synchronized with the accesses from the core direction. For example, the software must verify that the core finishes writing to the L1 cache, then flush the L1 cache to M2, make sure that the flush is completed, and only then, send the data out of M2 using DMA.

This need for software-managed synchronization complicates the first steps of the software development. A basic application flow with DMA transfers into M2 is shown in Figure 4 below . The squares represent data of tasks A/B/C.

Figure 4: Basic DMA Software Model Flow

A pipelined, fully optimized DMA software model based on double buffering is highly efficient because the software can work on the current task while the DMA controller is bringing data for the next task and sending data from the previous task. But, it is very complicated to manage and fits well only for regularly scheduled tasks.

Moreover, it is very complicated for the core to use code that is transferred using DMA, because of alignment requirements and the fact that code branches often contain absolute addresses (rather than relative addresses).

Inaccuracies in the DMA management for both data and code might harm not only the application performance, but also its functionality. An optimized flow is shown in Figure 5 below .

Figure 5: Optimized DMA Software Model Flow

The SC3850 Memory Subsystem Architecture
The SC3850 core efficiently deploys a variable-length execution set (VLES) execution model that utilizes maximum parallelism by allowing multiple address generation and data arithmetic logic units to execute multiple instructions in a single clock cycle, while delivering very compact code, especially in DSP and control applications.

The SC3850 Core Subsystem contains an L1 data Cache, an L1 Instruction Cache, and a unified L2 Cache. The caches are 8-way set associative. The cache policies can be configured individually per memory segment.

This flexibility is achieved thanks to the Memory Management Unit (MMU), which also provides memory protection. To improve cache system performance, an automatic prefetch is initiated after each cache miss.

The L2 cache can be partially or fully used as M2 memory, allowing a DMA software model (M2 is used for providing best quality of service to critical tasks) combined with the cache software model. This L2/M2 partitioning can be changed dynamically in real time.

A block diagram for the SC3850 DSP memory subsystem is shown in Figure 6 below . The SC3850 Core Subsystem supports software-initiated background allocation (also known as touch loading) into its caches, which can increase the hit ratio of the caches and improve the performance. The use of this and other SC3850 mechanisms is described in the following sections.

Figure 6: SC3850 Memory Subsystem Diagram

The SC3850 Scheduled Cache Software Model
Before describing the performance optimizations, one should understand the cache-based software model. The SC3850 StarCore DSP Subsystem has two levels of caches:

* The L1 data cache and L1 program cache serve the core accesses directly with Zero Wait States. Because they are very close to the core, they must be relatively small. As such, they are used for short term data storage for the referenced memory locations.

* The L2 unified cache is larger and, therefore, able to store data for longer periods. This cache has Wait States. It is accessed upon a miss in L1.A naive view of the flow in the cache software model assumes accesses from the far memory through L1 cache and L2 cache with no pre-load. This means the first set of accesses take longer, due to the fetches from far memory.

However, subsequent accesses to the same memory locations are likely to remain cached in L1 and/or L2 (depending on the access pattern and rate) and are served faster. This scenario is shown in Figure 7 below .

Moreover, the caches of the SC3850 have hardware mechanisms designed to improve the performance of applications with good locality automatically. This provides good performance with minimal effort – a convenient starting point for software development.

Figure 7: Nave Cache Software Model Flow

The SC3850 subsystem has several ways for additional optimization of the software model. As seen later, immplementing these optimizations can yield DMA-like performance with lower effort and fewer risks.DMA Without the Hassle
The scheduled cache software model can achieve the performance of the DMA software model by adding software control of data transfers in the system, while maintaining the caches robustness. The following two sections demonstrate how this is done for both hierarchies:

2) M2 Equivalent in the Cache Software Model. The relatively large L2 cache in the cache software model, or the M2 memory in the DMA software model, is used for medium- to long-term storage of data from far memory locations.

To achieve DMA-like performance in the scheduled cache model, the software can preload memory locations into the L2 cache, using the L2 software prefetch mechanism. This mechanism is triggered with a single core instruction and executed in the background. It can be programmed to load one dimensional and two dimensional arrays into L2.

A higher degree of control, making the scheduled cache software model even more similar to the DMA model, controls which memory locations are allocated to which parts of the cache. Applying this method (called “cache partitioning”) can eliminate unwanted thrashing of memory locations from the cache, thus avoiding many of the conflict misses. (Not applicable for conflict misses caused by limited associativity ).

This can be applied to critical memory locations accessed repeatedly during the application, but with a possible low inter-access rate. Managing the cache in such a way, coupled with the previously discussed prefetches, is similar to performing DMA transfers into M2 memory, but it is more flexible.

Specifically, a DMA transfer of 256 KB into M2 requires that at least this size be allocated in M2. In the cache partitioning method, 256 KB of memory locations can be directed to 128 KB of the cache. This means that these 256 KB compete for cache space among them, but not with other memory locations.

In addition, this method is less error prone: Incorrect partitioning wastes cache space, thus impacting the performance, but not the functionality. The L2 prefetch and partitioning method is more general than DMA, in the sense that it can be applied not only for data, but also for code (impractical in the DMA model).

Outbound coherency between L2 cache and the main memory (resembles the DMA outbound from M2) can be done using a cache sweep mechanism that allows controllable write back, synchronization, or invalidation of the cache contents.

The main advantage of the scheduled cache software model is that it is much easier to maintain functionality through the optimization process, because the basic synchronization (among a specific core L1 and L2 caches and the main memory) is automatic.

A DMA transaction might cause a performance penalty, while the core waits for the transaction completion, or even a functional error, because data is accessed by the core using different memory locations than the original ones.

The L2 software prefetch simply attempts to load the required data into the L2 cache before it is needed, but no waiting is necessary, because if the prefetch does not complete on time, the result is a miss in the L2 cache, which leads to a regular fetch. Obviously, no functional error can occur due to the prefetch because the core accesses the cache using the addresses of the original location in the memory.

Figure 8: Cache Software Model Flow with L2 Software Prefetch and Flush

M1 Equivalent in the Cache Software Model . To complete the operation and bring data even closer to the core, the programmer or the compiler can use core instructions (d/pfetch) for fine granularity fetches into L1 caches. This operation is meant to follow a higher granularity L2 software prefetch, but also works as a standalone operation—it can automatically fetch into L2 cache, in case of an L2 miss.

The equivalent of prefetching data to a Zero Wait State memory in a classic DMA software model is loading into an M1 memory. Speeding up writes to reach M1-like performance is done by letting the cache know that a specific block of data is to be written (using the dmalloc instruction).

The cache prepares itself for the write by allocating space for this block without fetching the stale data from memory. This method assumes the writes cover the entire block and are not dependant on the previous contents of this block. Using this method provides a double gain: it removes latency on writes and it reduces traffic due to write allocation on external buses.

Table 1.DMA versus cache model comparison

In summary, as show in Table 1 above , the scheduled cache model approach used in the SC3850 has good out-of-the-box performance, which provides a convenient starting point for software development.

Optimizing the software, that is, transitioning to the scheduled cache model, involves fewer risks than traditional DMA methods and can be done incrementally, potentially reaching DMA-like performance.

Moreover, thanks to the risk free nature of the cache software model, some DMA-like transactions can also be implemented for non-ordered software task scheduling (something that is not practical in the DMA software model).

Ofer Lent is a DSP Applications Engineer in the DSP Systems and Applications team at Freescale. He holds a B.Sc. in Electrical Engineering from Ben-Gurion University (2007).

Moshe Anschel is the Project Manager of the DSP Platform Systems and Architecture for the Worldwide Design organization of the Networking and Media Group (NMG) within Freescale Semiconductor. He holds a B.Sc. in Electrical Engineering from Tel-Aviv University (1999) and an MBA from the Interdisciplinary Center Herzliya (2007).

Erez Steinberg is the Project Leader of the DSP Video Systems and Applications team for the DSP Platforms department. He holds a B.Sc. in Electrical Engineering from Ben-Gurion University (2002) and a MBA from Tel-Aviv University (2007).

Itay Peled is the Project Leader of the DSP modeling and Architecture team for the DSP Platforms department. He received a B.Sc. in Electrical Engineering in 2000 from Ben-Gurion University.

Amir Kleen is a DSP Applications Engineer in the DSP Systems and Applications team. He received a B.Sc. in Electrical Engineering in 2005 from Ben-Gurion University.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.