Tips for effective usage of the shared cache in multi-core architectures

While traditional single-core systems employ a dedicated cache, the introduction of multi-core platforms presents the opportunity to consider the shared use of cache by multiple processors. Designs which incorporate a larger shared cache bring new benefits to applications and users. One example of a shared-cache multi-core processor is the Intel Core Duo processor, in which a 2MB L2 cache is shared between two cores (Ref [1]).

One obvious benefit of the shared cache is to reduce cache underutilization since, when one core is idle, the other core can have access to the whole shared resource. Shared cache also offers faster data transfer between cores than does system memory. The new architecture simplifies the cache coherence logic and reduces the severe penalty caused by false sharing. It is well suited for facilitating multi-core application partitioning and pipelining (Ref [2]). Overall, the shared cache architecture brings more benefits and provides a better performance/cost ratio than does dedicated cache.

This article addresses specific software programming techniques which, by enabling effective use of shared cache, help users design and fine-tune applications for better performance.

Benefits of the shared cache architecture
This article will reference a dual-core/dual-processor system with shared L2 cache as shown in Figure 1. However, since the software techniques discussed are generic, the benefits can be applied to other shared cache systems. We will not discuss hardware specifics of Intel Core Duo processors, such as power-saving and pre-fetching logic; please refer to Ref [1] for those details.

Figure 1. A dual-core dual-processor system

The benefits of a shared cache system (Figure 1, above) are many:

* Reduce cache under-utilization
* Reduce cache coherency complexity
* Reduce false sharing penalty
* Reduce data storage redundancy at the L2 cache level: the same data only needs to be stored once in L2 cache
* Reduce front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory
* Provide new opportunities and flexibility to designers:
    o Faster data sharing option between the cores than using system memory
    o One core can pre/post-process data for the other core (application partitioning, pipelining)
    o Alternative communication mechanisms between cores by using shared cache

The usage models for migrating single-core applications to multi-core can be grouped into two categories. One usage model (A) is to replace multiple single-core PCs with a single multi-core system, in which case users will likely deploy each core just like an individual PC.

The other model (B) is to combine the power of multiple cores in order to get a performance boost for a single application, in which case each core does part of the task in order to achieve the overall performance gain.

To compare the private cache and shared cache systems, we further divide the usage models into four scenarios, as shown in Table 1, below. We conclude that the overall edge belongs to the shared cache architecture.

Table 1. Comparison of private cache and shared cache.

As shown in Table 1, above, the shared cache performance is the same as or better than that of dedicated cache in these scenarios.

Five performance-tuning tips for using shared cache effectively
The caches are largely transparent to users since they do not really have explicit control over cache operations. However, better programming techniques will make a difference and help users achieve an improved cache hit ratio and increased performance.

Tip #1. Use processor affinity to place applications/threads properly. The capability to place threads and applications on a specific core is essential in a multi-core environment. Without this control, the system may suffer performance degradation caused by unnecessary data contention. Processor affinity functions are provided by various operating systems to allow fine control over thread and application placement.

Windows* example:

   /* Set processor affinity */
    BOOL WINAPI SetProcessAffinityMask(HANDLE hProcess,
        DWORD_PTR dwProcessAffinityMask);

   /* Set thread affinity */
    DWORD_PTR WINAPI SetThreadAffinityMask(HANDLE hThread,
        DWORD_PTR dwThreadAffinityMask);

Linux* example:

   /* Get the CPU affinity for a task */
    extern int sched_getaffinity(
        pid_t pid, size_t cpusetsize, cpu_set_t *cpuset);

   /* Set the CPU affinity for a task */
    extern int sched_setaffinity(
        pid_t pid, size_t cpusetsize, const cpu_set_t *cpuset);
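
For illustration, here is a minimal sketch of how the Linux calls above might be used (a hedged example, not from the original article; the core number is a placeholder and the cpu_set_t macros require _GNU_SOURCE). It pins the calling thread to a single core before the thread starts its cache-sensitive work:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread/process to one core. The choice of core
       should come from your data-sharing/contention analysis. */
    static int pin_to_core(int core_id)
    {
        cpu_set_t set;

        CPU_ZERO(&set);              /* start with an empty CPU set        */
        CPU_SET(core_id, &set);      /* allow execution only on core_id    */

        /* pid 0 means "the calling thread" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

On Windows, the same placement decision would be expressed with SetThreadAffinityMask and a bit mask of the allowed cores.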

Tip #2. Use cache blocking to improve cache hit ratio. The cache blocking technique allows data to stay in the cache while being processed by data loops. By reducing the unnecessary cache traffic (fetching and evicting the same data throughout the loops), a better cache hit rate is achieved.

For example, a large data set needs to be processed in a loop many times. If the entire data set does not fit into the cache, the first elements in the cache will be evicted to fit the last elements in. Then, on subsequent iterations through the loop, loading new data will cause the older data to be evicted, possibly creating a domino effect where the entire data set needs to be loaded for each pass through the loop. By sub-dividing the large data set into smaller blocks and running all of the operations on each block before moving on to the next block, there is a likelihood that the entire block will remain in cache through all the operations.

Cache blocking sometimes requires software designers to “think outside the box” in order to choose the flow or logic that, while not the most obvious or natural implementation, offers the optimum cache utilization (Ref [3]).
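
To make the idea concrete, here is a minimal cache-blocking sketch (a hedged illustration; the data-set size, block size and the two processing passes are assumptions, not taken from Ref [3]). The blocked version runs both passes on one cache-sized block before moving to the next block, so each block is fetched from memory only once:

    #define N_ELEMS  (1 << 20)     /* total elements (assumption)               */
    #define BLOCK    (16 * 1024)   /* block chosen to fit in cache (assumption) */

    /* Unblocked: by the time the second pass starts, the first elements
       have long been evicted, so the whole array is re-fetched. */
    void process_unblocked(float *data)
    {
        for (int i = 0; i < N_ELEMS; i++) data[i] *= 2.0f;   /* pass 1 */
        for (int i = 0; i < N_ELEMS; i++) data[i] += 1.0f;   /* pass 2 */
    }

    /* Blocked: run both passes on one block while it is still resident. */
    void process_blocked(float *data)
    {
        for (int b = 0; b < N_ELEMS; b += BLOCK) {
            int end = (b + BLOCK < N_ELEMS) ? b + BLOCK : N_ELEMS;
            for (int i = b; i < end; i++) data[i] *= 2.0f;   /* pass 1 */
            for (int i = b; i < end; i++) data[i] += 1.0f;   /* pass 2 */
        }
    }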

Tip #3. Use cache-friendly multi-core application partitioning and pipelining. During the migration process from single core to multi-core, users may want to re-partition the single-core application into multiple sub-modules and place them on different cores, in order to achieve a higher degree of parallelism, hide latency and get better performance.

A recent study (Ref [2]) on SNORT (an open source intrusion detection software) reveals that supra-linear performance increase can be achieved by proper partitioning and pipelining of the processing. The shared cache is well suited to facilitate such a migration process as it provides a faster mechanism for modules running on different cores to share data and enables the cores to work efficiently together.

Understanding the shared cache architecture and application-specific cache usage scenarios can help designers to partition and pipeline the application in a cache-friendly way. For example, it may be worth separating the contention-only threads (threads that are not sharing any data but need to use cache) from each other, and assigning them to cores that are not sharing the cache (for example, core 0 and core 2 in Figure 1).

On the other hand, it is also important to consider assigning data-sharing threads to cores that are sharing the cache, or even assigning them to the same core if they need to interact with each other very frequently, to take advantage of faster data sharing in the caches.
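
As a small sketch of this placement advice (a hedged illustration using POSIX threads and the GNU pthread_setaffinity_np extension; the stage functions and core numbers are placeholders), the two pipeline stages below are pinned to the cache-sharing cores 0 and 1 of Figure 1 so that data handed from one stage to the next can stay in the shared L2:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Placeholder pipeline stages (illustrative stubs only). */
    static void *stage1_preprocess(void *arg)  { (void)arg; return NULL; }
    static void *stage2_postprocess(void *arg) { (void)arg; return NULL; }

    /* Create a pipeline stage and bind it to a specific core. */
    static void start_stage_on_core(pthread_t *t,
                                    void *(*fn)(void *), int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);

        pthread_create(t, NULL, fn, NULL);
        pthread_setaffinity_np(*t, sizeof(set), &set);
    }

    void start_pipeline(void)
    {
        pthread_t t1, t2;
        start_stage_on_core(&t1, stage1_preprocess, 0);   /* core 0                          */
        start_stage_on_core(&t2, stage2_postprocess, 1);  /* core 1: shares L2 with core 0   */
    }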

Tip #4. Minimize updating shared variables. During performance tuning, it is advisable to consider the frequency of updating the shared variables, as a disciplined approach to updating shared variables may improve the performance. One way to do that is by letting each thread maintain its own private copy of the shared data and updating the shared copy only when it is absolutely necessary.

Let us look at an OpenMP* example first. OpenMP is a collection of compiler directives and runtime libraries that help the compiler to handle threading (Ref [4]). In the following example, a variable shared_var needs to be updated during the loop by all threads. To ensure there is no data race, a “critical” directive is used to ensure that at any given time only one thread is updating the content of shared_var.

A thread will wait before entering the critical section until there is no other thread currently working on the section. Note this scheme requires each thread to update the shared variable every single time. In addition, synchronization has to be used to protect the shared data.

   /* The following loop is assigned to N threads, with each
      thread doing some of the iterations. Each thread needs to
      update shared_var during its iterations. Synchronization is
      maintained by using a "critical" section. */

   #pragma omp parallel for private(i, x)
    for (i = 0; i < 100; i++) {

        /* Do other tasks */
        x = do_some_work();

        /* Use critical section to protect shared_var */
        #pragma omp critical
        shared_var = shared_var + x;
    }

In OpenMP, there is a reduction clause which lets each thread work on its own copy and then consolidates the results after all threads have completed their tasks.

The following is the updated code using the reduction clause:

   /* With the use of the reduction clause, each thread now works
      on its private copy of shared_var and the results are only
      consolidated at the end. */

   #pragma omp parallel for private(i, x) reduction(+: shared_var)
    for (i = 0; i < 100; i++) {
        x = do_some_work();

        shared_var = shared_var + x;
    }

As we see from the above example, not only do we reduce the frequency of updating shared_var, but we also get a performance increase by avoiding per-iteration synchronization.
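
The same private-copy discipline applies outside OpenMP as well. The sketch below (a hedged illustration with POSIX threads; the thread count and the do_some_work() stub are assumptions) has each thread accumulate into a local variable and touch the shared total exactly once:

    #include <pthread.h>

    #define NUM_THREADS 4                       /* assumption for illustration */

    static long shared_var = 0;                 /* shared result               */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static long do_some_work(int i) { return i; }   /* placeholder work        */

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        long local_sum = 0;                     /* private copy                */

        /* Each thread handles its own share of the 100 iterations. */
        for (int i = id; i < 100; i += NUM_THREADS)
            local_sum += do_some_work(i);

        /* Update the shared variable once, not on every iteration. */
        pthread_mutex_lock(&lock);
        shared_var += local_sum;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        int ids[NUM_THREADS];

        for (int t = 0; t < NUM_THREADS; t++) {
            ids[t] = t;
            pthread_create(&threads[t], NULL, worker, &ids[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        return 0;
    }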

Tip #5. Take advantage of the L1 cache eviction process by adding a delay between write and read. The delayed approach takes advantage of L1 cache capacity eviction by inserting a delay between write and read. Assume two things:

1) We have a writer thread on core 0 and a reader thread on core 1, as shown in Figure 1; and
2) We insert a delay period after the writer thread modifies the data and before the reader thread requests the data.

Because of the L1 cache capacity limit, the data may be evicted to the shared L2 cache as a result of this delay. Since the data is already there, the reader thread will get an L2 cache read hit. As a comparison, in the case without the delay, the reader thread will get an L2 cache read miss because the data is still in the private L1 cache of core 0.

The delay in read access does not impact the system throughput, as it only introduces an offset to the read access. The ideal amount of delay is dependent on the L1 cache size and usage situation, as well as the amount of data being processed.

In reality, since there may be multiple contenders for the caches, it is often difficult to determine with accuracy the amount of delay needed for the shared data to be evicted into the shared L2 cache. This technique is only recommended for fine-tuning the applications and squeezing out some possible cycles.

The following is an example showing an implementation of the delay. When N=1, it is the original flow without any delay: the writer thread generates a buffer and the reader thread consumes it immediately. When N>1, a delay is applied to the first read access.

By introducing this offset between write and read access, the cache hit ratio may be improved as some of the data may already have been evicted into the shared L2 cache. As a result, the reader thread will get more L2 cache read hits than would be the case without the delay.

   Writer_thread_on_core_0()
   {
       /* N = 1: normal flow;
          N > 1: flow with delay */

       first produce N buffers;
       while (workAmount - N)
       {
           signal reader thread to process;
           write one buffer;
           wait for signal from reader thread;
       }
   }

   Reader_thread_on_core_1()
   {
       while (workAmount - N)
       {
           wait for signal from writer thread;
           read one buffer;
           signal writer thread;
       }
       read N buffers;
   }
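
For reference, here is one possible concrete rendering of that pseudocode using POSIX semaphores (a hedged sketch: the buffer count, buffer size and the produce/consume work are assumptions, and the unnamed semaphores used here are not available on every platform):

    #include <pthread.h>
    #include <semaphore.h>

    /* Sizes below are illustrative assumptions only. */
    #define WORK_AMOUNT 16                 /* total buffers to process      */
    #define BUF_ELEMS   1024               /* elements per buffer           */
    #define N           2                  /* N = 1: no delay; N > 1: delay */

    static int buffers[WORK_AMOUNT][BUF_ELEMS];
    static sem_t go_reader, go_writer;

    static void write_buffer(int i)        /* placeholder producer work */
    {
        for (int j = 0; j < BUF_ELEMS; j++) buffers[i][j] = i + j;
    }

    static void read_buffer(int i)         /* placeholder consumer work */
    {
        volatile int sum = 0;
        for (int j = 0; j < BUF_ELEMS; j++) sum += buffers[i][j];
    }

    static void *writer_thread(void *arg)  /* intended for core 0 */
    {
        (void)arg;
        for (int i = 0; i < N; i++) write_buffer(i);   /* first produce N buffers */
        for (int i = N; i < WORK_AMOUNT; i++) {
            sem_post(&go_reader);          /* signal reader thread to process   */
            write_buffer(i);               /* write one buffer                  */
            sem_wait(&go_writer);          /* wait for signal from reader       */
        }
        sem_post(&go_reader);              /* final signal: all buffers written */
        return NULL;
    }

    static void *reader_thread(void *arg)  /* intended for core 1 */
    {
        int i;
        (void)arg;
        for (i = 0; i < WORK_AMOUNT - N; i++) {
            sem_wait(&go_reader);          /* wait for signal from writer */
            read_buffer(i);                /* read one buffer             */
            sem_post(&go_writer);          /* signal writer thread        */
        }
        sem_wait(&go_reader);              /* writer is finished          */
        for (; i < WORK_AMOUNT; i++) read_buffer(i);   /* read last N buffers */
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        sem_init(&go_reader, 0, 0);
        sem_init(&go_writer, 0, 0);
        pthread_create(&w, NULL, writer_thread, NULL);
        pthread_create(&r, NULL, reader_thread, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }

In practice, the two threads would also be pinned to cores 0 and 1 with the affinity calls from Tip #1.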

Four rules on what to avoid
Rule #1: False sharing. This is a well-known problem in multi-processor environments. It happens when threads on different processors write to the same cache line, but at different locations. There is no real coherency problem, but a performance penalty is paid as a cache line status change is detected by the coherency logic.

The shared cache reduces false sharing penalties, both because there is no false sharing at the L2 level and because coherency is maintained over fewer caches. However, false sharing is inherited from the private L1 cache. Moreover, in multiprocessor systems, there may be false sharing between cores that are not sharing the cache. For example, in Figure 1, threads running on core 0 possibly will run into false sharing issues with threads running on core 2.

One way to fix false sharing issues is to allocate the data to different cache lines. Another way is to move the troubled threads to the same core. More information on false sharing can be found in several articles (Ref [5] and [6]).
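
As a quick illustration of the first fix (a hedged sketch: the 64-byte line size and the four per-thread counters are assumptions, and the alignment attribute shown is the GCC/Clang spelling), padding each per-thread counter to its own cache line removes the false sharing:

    #define CACHE_LINE_SIZE 64    /* assumption: typical cache line size */

    /* Problematic layout: adjacent counters share one cache line, so
       threads updating different counters still invalidate each other. */
    struct counters_with_false_sharing {
        long counter[4];          /* one counter per thread */
    };

    /* Fixed layout: each counter is padded and aligned to its own line. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE_SIZE - sizeof(long)];
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    struct counters_without_false_sharing {
        struct padded_counter counter[4];
    };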

Rule #2. Improper thread placement. Improper thread placement may cause unnecessary cache contention. For example, if two applications are not sharing data but both use a significant amount of cache, one should consider assigning them to cores that are not sharing cache, such as core 0 and core 2 in Figure 1.

Utilities of the Intel VTune analyzer, such as the sampling wizard, are helpful for collecting information on cache usage. A high concentration of L2 events (e.g. L2 Request Miss and L2 Read Miss) usually indicates areas for further investigation.

It is important to assign the applications in contention to cores that are not sharing the cache, while assigning the data-sharing applications to cache-sharing cores, or even to the same core, to take advantage of the L2/L1 data caches. In some cases, performance/cache trade-off decisions need to be made to ensure that the critical modules get proper treatment in terms of cache usage.

Rule #3. Avoid excessive use of synchronization mechanisms. As we partition and pipeline the single-core application to run better on multi-core systems, synchronization between threads on different cores is often needed to avoid data races.

The semaphore or mutex APIs provided by each operating system are convenient to use. However, excessive usage of these APIs may be costly in terms of performance.

One way to improve performance is to consider less expensive options such as sleep, mwait or monitor. The other way is to keep synchronization operations to a minimum, being careful not to become overly cautious about threading issues and end up protecting all publicly shared variables with synchronization. Users need to use synchronization only if necessary, as excessive use of synchronization does have performance implications.
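
One hedged sketch of keeping synchronization to a minimum (the queue layout and batch size below are assumptions for illustration) is to drain several items per lock acquisition instead of locking once per item:

    #include <pthread.h>

    #define BATCH 16                         /* items taken per lock (assumption) */

    /* Minimal shared work queue protected by one mutex. */
    typedef struct {
        int items[1024];
        int count;
        pthread_mutex_t lock;
    } work_queue_t;

    /* Costly pattern: one lock/unlock pair per item. */
    int take_one(work_queue_t *q, int *out)
    {
        int got = 0;
        pthread_mutex_lock(&q->lock);
        if (q->count > 0) { *out = q->items[--q->count]; got = 1; }
        pthread_mutex_unlock(&q->lock);
        return got;
    }

    /* Cheaper pattern: grab up to BATCH items under a single lock. */
    int take_batch(work_queue_t *q, int *out)
    {
        int n = 0;
        pthread_mutex_lock(&q->lock);
        while (q->count > 0 && n < BATCH)
            out[n++] = q->items[--q->count];
        pthread_mutex_unlock(&q->lock);
        return n;                            /* number of items taken */
    }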

Rule #4. Know when to turn off hardware pre-fetch. Hardware pre-fetch is useful for applications that have a certain degree of data spatial locality. With pre-fetching, the data is in the cache even before the request happens. When used properly, pre-fetch can hide memory access latency and improve performance.

However, for applications in which data patterns lack spatial locality, the hardware pre-fetch may have a negative impact on performance. Therefore, it is important for users to understand the data pattern of their specific applications and to consider turning off the pre-fetch option in certain situations.

Conclusion
Because there is no explicit control over the usage, caches are largely transparent to users. Nevertheless, an understanding of techniques available for optimizing shared cache (e.g. placing threads and applications properly on different cores based on data sharing or contention relationships) can help software and system designers take fuller advantage of the new multi-core systems.

Multi-core processors with a shared-cache architecture, such as the Intel Core Duo processor, take a huge step toward bringing the benefits of power-saving, dynamic cache utilization and flexibility to system designers and end-users. The shared cache architecture presents exciting design opportunities and more flexibility for efficient work-partitioning and fast data-sharing among multiple cores.

Tian Tian is a Software Technical Marketing Engineer at Intel Corp. supporting multi-core software development as well as the IXP2XXX product line. His previous experience includes firmware development for the Intel IXP4XX product line NPE (Network Processor Engine). He was also the main designer for the Intel IXP4XX product line Operating System Abstraction Layer (OSAL) in software release v1.5. Currently he is engaged in developing methods to optimize applications for multi-core processors.

References
1. Intel Smart Cache, http://www.intel.com/products/processor/coreduo/smartcache.htm
2. “Supra-linear Packet Processing Performance with Intel Multi-core Processors”, White Paper, Intel Corporation, 2006
3. Phil Kerly, Cache Blocking Techniques on Hyper-Threading Technology Enabled Processors
4. OpenMP* website, http://www.openmp.org
5. Andrew Binstock, Data Placement in Threaded Programs
6. Dean Chandler, Reduce False Sharing in .NET*
7. “IA-32 Intel Architecture Optimization Reference Manual”, Intel Corporation, 2005
8. “Developing Multithreaded Applications: A Platform Consistent Approach”, Intel Corporation, March 2003
9. Don Anderson and Tom Shanley, “Intel Pentium Processor System Architecture, 2nd Edition”, Addison Wesley, 1995
10. Shameem Akhter and Jason Roberts, “Multi-Core Programming: Increasing Performance through Software Multi-threading”
