CMP EMBEDDED.COM


A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)



Industrial Control Designline
Editor's Note: In this three-part series, Dr. Algosa Vrancic and Jeff Meisel present findings that demonstrate how a novel approach with Intel hardware and software technology enables real-time high-performance computing (HPC) on multi-core processors, solving engineering problems that could not be solved only five years ago.
  • Part 1 is a review of real-time concepts that are important for understanding this domain of engineering problems, and a comparison of traditional HPC with real-time HPC.
  • Part 2 outlines software architecture approaches for utilizing multi-core processors, along with cache optimizations.
  • Part 3 will consider industry examples that employ this particular methodology.


Cache Considerations
In traditional embedded systems, CPU caches are viewed as a necessary evil. The evil side shows up as nondeterministic execution time: the less of a time-critical task's code and data resides in the cache when the task is triggered, the longer the task takes to run. To illustrate, we profile cache performance using a LabVIEW structure called a timed loop, shown in Figure 12.


Figure 12: Timed loop structure (used for benchmark use-cases).

The timed loop acts as a regular while loop, but with some special characteristics that lend themselves to profiling hardware. For example, the structure executes all code within the loop in a single thread. The timed loop can be configured with microsecond granularity, and it can be assigned a relative priority that is handled by the RTOS. In addition, it can set processor affinity and react to hardware interrupts. Although the programming patterns shown in the previous section do not use the timed loop, it is also quite useful in real-time HPC applications: parallelism is harvested by running multiple timed loops and passing data between them through queue structures. The following describes benchmarks that were performed to understand cache performance.
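The timed loop is a graphical LabVIEW construct and cannot be reproduced in text. As a rough analogy only, the Python sketch below runs two periodic loops in separate threads and passes data between them through a queue. The names `periodic_loop` and `run_pipeline` and the 1 ms period are assumptions made for illustration. Threads on a general-purpose OS offer none of the priority, affinity, or determinism guarantees that the RTOS provides.

```python
import queue
import threading
import time


def periodic_loop(period_s, iterations, body):
    """Call body(i) once per period; return each iteration's start lateness in seconds."""
    lateness = []
    next_deadline = time.perf_counter()
    for i in range(iterations):
        # Sleep until an absolute deadline so timing errors do not accumulate.
        delay = next_deadline - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        lateness.append(time.perf_counter() - next_deadline)
        body(i)
        next_deadline += period_s
    return lateness


def run_pipeline(iterations=20, period_s=0.001):
    """A producer loop hands samples to a consumer loop through a FIFO queue."""
    samples = queue.Queue()
    received = []

    def produce(i):
        samples.put(i * i)  # stand-in for acquired data

    def consume(_):
        received.append(samples.get())  # blocks until a sample is available

    producer = threading.Thread(
        target=periodic_loop, args=(period_s, iterations, produce))
    consumer = threading.Thread(
        target=periodic_loop, args=(period_s, iterations, consume))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return received
```

Calling `run_pipeline(10)` returns the ten produced samples in the order they were queued. The list returned by `periodic_loop` gives a crude view of how late each iteration started relative to its deadline.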

The execution time of a single timed loop iteration as a function of the amount of cached code/data is shown in Figure 13. The loop runs every 10 milliseconds, and we flush its code/data from the cache indirectly: a lower-priority task that runs after each iteration of the loop adds 1 to each element of an increasingly large array of doubles, flushing more and more of the time-critical task's data from the CPU cache. In the worst case, execution time rises from 4 to 30 microseconds, an increase by a factor of 7.5. Figure 13 shows that decaching also increases jitter. The same graph also demonstrates the "necessary" part of the picture. Even though some embedded CPUs go as far as eliminating cache completely to increase determinism, such measures clearly also reduce performance significantly. Besides, few people are willing to go back one or two CPU generations in performance, especially as the amounts of L1/L2/L3 cache keep increasing, providing enough room for most applications to run while incurring only minor cache penalties.
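The structure of this measurement can be sketched in a few lines of Python. This is not the LabVIEW benchmark itself: the function names, the 1024-element working set, and the definition of jitter as the spread between the slowest and fastest run are all assumptions. In an interpreted language the timings are dominated by interpreter overhead, so the sketch shows how the experiment is organized rather than reproducing the numbers in Figure 13.

```python
import statistics
import time
from array import array


def time_critical_task(data):
    """Stand-in for the time-critical loop body: sum a small working set."""
    total = 0.0
    for value in data:
        total += value
    return total


def evict(buffer):
    """Lower-priority work: add 1 to every element, touching the whole buffer."""
    for i in range(len(buffer)):
        buffer[i] += 1.0


def measure(evict_elements, repeats=20, task_elements=1024):
    """Time the task after each eviction pass.

    Returns (mean, worst, jitter) in microseconds, where jitter is taken
    here to be worst minus best (an assumption, not the article's definition).
    """
    task_data = array("d", [1.0]) * task_elements
    buffer = array("d", [0.0]) * evict_elements
    times_us = []
    for _ in range(repeats):
        evict(buffer)  # push the task's data out of the cache
        start = time.perf_counter_ns()
        time_critical_task(task_data)
        times_us.append((time.perf_counter_ns() - start) / 1000.0)
    return statistics.mean(times_us), max(times_us), max(times_us) - min(times_us)
```

Sweeping `evict_elements` from zero up to a few million and plotting the returned values against the buffer size gives the same kind of curve as Figure 13, although a faithful reproduction would need compiled code running on a real-time target.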


Figure 13: Execution time of a simple time-critical task as a function of the amount of cached code/data on a 3.2 GHz Intel i7 CPU with 8 MB of L3 cache, using LabVIEW Real-Time. The initial ramp-up is due to the 256 KB L2 cache.
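To relate the cache sizes in the caption to the array of doubles used for flushing, the short calculation below converts each cache capacity into a number of 8-byte doubles. It assumes that "256K" and "8 MB" are binary units. The helper names are hypothetical, and real eviction behavior also depends on associativity and replacement policy, so the counts are only rough thresholds.

```python
DOUBLE_BYTES = 8  # size of an IEEE 754 double-precision value

L2_BYTES = 256 * 1024        # assumed: 256 KiB L2 cache
L3_BYTES = 8 * 1024 * 1024   # assumed: 8 MiB L3 cache


def doubles_to_fill(cache_bytes):
    """Number of doubles whose combined size equals the given cache capacity."""
    return cache_bytes // DOUBLE_BYTES


def slowdown(fast_us, slow_us):
    """Ratio between the slow and fast execution times."""
    return slow_us / fast_us


l2_doubles = doubles_to_fill(L2_BYTES)  # 32768
l3_doubles = doubles_to_fill(L3_BYTES)  # 1048576
worst_case = slowdown(4, 30)            # 7.5
```

Under these assumptions, an array of about 32 thousand doubles is as large as the L2 cache, and about one million doubles match the L3 cache.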
