|
A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)
By Dr. Aljosa Vrancic and Jeff Meisel, National Instruments
(06/30/09, 09:50:00 AM EDT)
So, what can one do if real-time HPC application data does not fit into
cache? The
answer depends on the amounts of data used for the time-critical versus
non-timecritical
tasks. For example, the data used in a control algorithm must be
readily
available at all times and should be kept in cache at all cost. The
data that is used for
user interface display or calculation of real-time parameters can be
flushed.
If the data used by the time-critical code fits into the cache, the
application must
overcome cache strategy that can be simply described as "use it or lose
it," by, well,
using the data. In other words, accessing data even when it is not
required will keep
it in the cache. In some cases, the CPU may offer explicit instructions
for locking
down parts of the cache but that is more exception than a rule. For
example, if there
is control data that may be used as the plant approaches some critical
point, accessing
data in each control step is a must. An argument that executing control
code
each iteration that may otherwise be required to run only once every
1000 iterations
is overkill, since even in the worst case the out-of-cache execution
may be only
20x slower, is correct, at least when viewed form an HPC point of view.
Following
the described approach would yield 50x worst CPU utilization and
proportionally
slower execution. Unfortunately, in the real-time HPC this line of
argument is false
because a 20x slower execution of control algorithm can result in
serious damage—
the real-time requirement states that every control action must be
taken before a
given deadline, which may be much shorter that the worst-case
out-of-cache 20x
longer execution time.
Another way to keep data in the CPU cache is to prevent any other
thread from
execution on the given processor. This is where ability of an RTOS to
reserve certain
CPUs for execution of a single task becomes extremely powerful. One
does have to
keep in mind that certain caches may be shared between multiple CPUs
(for
example, L3 cache in Intel's i7 architecture is shared between up to 8
CPUs)
residing on the same physical chip so reserving a core on the processor
that churns
a lot of data on its other cores will be ineffective.
Finally, what can one do if the real-time data does not fit in the
cache? Short
of redesigning the underlying algorithm to use less data, or further
prioritizing
importance of different real-time tasks and devising a scheduling
scheme that will
keep the most important data in cache, there is not much that can be
done. The
penalty can be reduced if one can design an algorithm that can access
data in two
directions. If the data is always accessed in the same order, once it
does not fit into
the cache any more, each single element access will result in the cache
miss. On the
other hand, if the algorithm alternates between accessing data from
first-to-last and
last-to-first element, cache misses will be limited only to the amount
of data actually
not fitting into the cache: the data accessed last in the previous step
is now accessed
first and is thus already located in the cache. While this approach
will always reduce
algorithm execution time in the absolute terms, the relative
performance benefit
will decrease as more and more data does not fit into the cache.
1
|
2
|
3
|
4
|
Rate this article:
|
Low
High
|
|
|