CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)



Industrial Control Designline

So, what can one do if real-time HPC application data does not fit into cache? The answer depends on the amounts of data used for the time-critical versus non-timecritical tasks. For example, the data used in a control algorithm must be readily available at all times and should be kept in cache at all cost. The data that is used for user interface display or calculation of real-time parameters can be flushed. If the data used by the time-critical code fits into the cache, the application must overcome cache strategy that can be simply described as "use it or lose it," by, well, using the data. In other words, accessing data even when it is not required will keep it in the cache. In some cases, the CPU may offer explicit instructions for locking down parts of the cache but that is more exception than a rule. For example, if there is control data that may be used as the plant approaches some critical point, accessing data in each control step is a must. An argument that executing control code each iteration that may otherwise be required to run only once every 1000 iterations is overkill, since even in the worst case the out-of-cache execution may be only 20x slower, is correct, at least when viewed form an HPC point of view. Following the described approach would yield 50x worst CPU utilization and proportionally slower execution. Unfortunately, in the real-time HPC this line of argument is false because a 20x slower execution of control algorithm can result in serious damage— the real-time requirement states that every control action must be taken before a given deadline, which may be much shorter that the worst-case out-of-cache 20x longer execution time.

Another way to keep data in the CPU cache is to prevent any other thread from execution on the given processor. This is where ability of an RTOS to reserve certain CPUs for execution of a single task becomes extremely powerful. One does have to keep in mind that certain caches may be shared between multiple CPUs (for example, L3 cache in Intel's i7 architecture is shared between up to 8 CPUs) residing on the same physical chip so reserving a core on the processor that churns a lot of data on its other cores will be ineffective.

Finally, what can one do if the real-time data does not fit in the cache? Short of redesigning the underlying algorithm to use less data, or further prioritizing importance of different real-time tasks and devising a scheduling scheme that will keep the most important data in cache, there is not much that can be done. The penalty can be reduced if one can design an algorithm that can access data in two directions. If the data is always accessed in the same order, once it does not fit into the cache any more, each single element access will result in the cache miss. On the other hand, if the algorithm alternates between accessing data from first-to-last and last-to-first element, cache misses will be limited only to the amount of data actually not fitting into the cache: the data accessed last in the previous step is now accessed first and is thus already located in the cache. While this approach will always reduce algorithm execution time in the absolute terms, the relative performance benefit will decrease as more and more data does not fit into the cache.
1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :