A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)
By Dr. Aljosa Vrancic and Jeff Meisel, National Instruments
Editor's Note: In this three-part series, Dr. Aljosa Vrancic and Jeff
Meisel present findings that demonstrate how a novel approach with Intel
hardware and software technology is allowing real-time high-performance
computing (HPC) to solve engineering problems with multi-core processors
that could not be solved only five years ago.
- Part 1 is a review of real-time concepts that are important for understanding this domain of engineering problems, and a comparison of traditional HPC with real-time HPC.
- Part 2 outlines software architecture approaches for utilizing multi-core processors, along with cache optimizations.
- Part 3 will consider industry examples that employ this particular methodology.
Cache Considerations
In traditional embedded systems, CPU caches are viewed as a necessary
evil. The evil side shows up as nondeterministic execution time, which is
inversely proportional to the amount of a time-critical task's code
and/or data that is located inside the cache when the task is triggered.
For demonstration purposes, we will profile cache performance to better
understand some important characteristics. The technique uses a LabVIEW
structure called a timed loop, shown in Figure 12.

Figure 12: Timed loop structure (used for
benchmark use-cases).
The timed loop acts as a regular while loop, but with some special
characteristics
that lend themselves to profiling hardware. For example, the structure
will execute
any code within the loop in a single thread. The timed loop can be
configured
with microsecond granularity, and it can be assigned a relative
priority that will be
handled by the RTOS. In addition, it can set processor affinity, and it
can also react
to hardware interrupts. Although the programming patterns shown in the
previous section do not utilize the timed loop, it is also quite useful
for real-time HPC applications: parallelism is harvested by using
multiple timed loops, with queue structures passing data between them.
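The pattern of several periodic loops exchanging data through a queue can be sketched in a text language. The Python below is only a structural analogy and not the LabVIEW timed loop: the names `timed_loop` and `run_pipeline` are hypothetical, priorities and processor affinity are not modeled, and ordinary Python threads offer no real-time guarantees.

```python
import queue
import threading
import time


def timed_loop(period_s, iterations, body):
    """Run body(i) once per period on the calling thread.

    Returns, for each iteration, the difference between the actual and
    the scheduled start time, in seconds.
    """
    lateness = []
    next_start = time.perf_counter()
    for i in range(iterations):
        # Wait until the scheduled start of this iteration.
        delay = next_start - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        lateness.append(time.perf_counter() - next_start)
        body(i)
        next_start += period_s
    return lateness


def run_pipeline(iterations=5, period_s=0.001):
    """Two periodic loops on separate threads, linked by a FIFO queue."""
    channel = queue.Queue()
    results = []

    def produce(i):
        # First loop: generate a value and hand it to the other loop.
        channel.put(i * i)

    def consume(_i):
        # Second loop: block until a value is available, then record it.
        results.append(channel.get())

    producer = threading.Thread(
        target=timed_loop, args=(period_s, iterations, produce))
    consumer = threading.Thread(
        target=timed_loop, args=(period_s, iterations, consume))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())
```

Because the queue is first-in, first-out and there is a single producer, the consumer receives the values in the order they were produced. The lateness list returned by `timed_loop` gives a rough view of how far each iteration started from its schedule.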
The following
describes benchmarks that were performed to understand cache
performance.
The execution time of a single timed loop iteration as a function of the
amount of cached code/data is shown in Figure 13. The loop runs every 10
milliseconds, and we use an indirect method to flush the loop's code/data
from the cache: a lower-priority task that runs after each iteration of
the loop adds 1 to each element of an increasingly large array of
doubles, flushing more and more of the time-critical task's data from the
CPU cache. In the worst-case scenario, the execution time rises from 4 to
30 microseconds, an increase by a factor of 7.5. Figure 13 also shows
that evicting the task from the cache increases jitter. The same graph
can also be used to demonstrate the "necessary" part of the picture. Even
though some embedded CPUs go as far as eliminating the cache completely
to increase determinism, such measures clearly also reduce performance
significantly. Besides, few people are willing to go back one or two CPU
generations in performance, especially as the amounts of L1/L2/L3 cache
are continuously increasing, providing enough room for most applications
to run while incurring only minor cache penalties.

Figure 13: Execution time of a simple time-critical task as a function
of the amount of cached code/data on a 3.2 GHz Intel i7 CPU with 8 MB of
L3 cache, using LabVIEW Real-Time. The initial ramp-up is due to the
256 KB L2 cache.
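The measurement procedure behind Figure 13 can be outlined as a rough Python sketch: a small task is timed after a second task has added 1 to every element of an array of doubles of a chosen size. The function names (`critical_task`, `disturb`, `profile`, `slowdown`) and the 512-element working set are assumptions made for illustration. Interpreter overhead will dominate these timings, so the sketch shows the method only and should not be expected to reproduce the figure.

```python
import time
from array import array


def critical_task(data):
    """Stand-in for the time-critical work: sum a small buffer."""
    total = 0.0
    for x in data:
        total += x
    return total


def disturb(scratch):
    """Stand-in for the lower-priority task: add 1 to every element."""
    for i in range(len(scratch)):
        scratch[i] += 1.0


def profile(disturber_sizes, repeats=20):
    """Time critical_task after touching `size` doubles.

    Returns {size: (best_s, worst_s, jitter_s)}, where jitter is the
    spread between the worst and best observed times.
    """
    data = array("d", [1.0] * 512)  # hypothetical working set
    results = {}
    for size in disturber_sizes:
        # Zero-initialized array of `size` doubles (8 bytes each).
        scratch = array("d", bytes(8 * size))
        samples = []
        for _ in range(repeats):
            disturb(scratch)  # push other data through the cache
            start = time.perf_counter()
            critical_task(data)
            samples.append(time.perf_counter() - start)
        best, worst = min(samples), max(samples)
        results[size] = (best, worst, worst - best)
    return results


def slowdown(baseline, worst):
    """Ratio of worst-case to baseline execution time."""
    return worst / baseline


if __name__ == "__main__":
    # 32768 doubles = 256 KB; 1048576 doubles = 8 MB.
    for size, (best, worst, jitter) in profile([0, 32768, 1048576], 5).items():
        print(size, best, worst, jitter)
    # The article's worst case: 4 microseconds rising to 30 microseconds.
    print(slowdown(4.0, 30.0))
```

The array sizes in the usage section are chosen to match the 256 KB L2 and 8 MB L3 capacities quoted in the caption. A faithful reproduction would need a compiled language and a real-time scheduler, so that the timed task and the disturbing task run at different priorities as described in the text.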