Editor's Note: In this three part series, Dr. Algosa
Vrancic and Jeff Meisel
presents
findings that
demonstrate how a novel approach with Intel hardware and software
technology
is allowing for real-time high-performance computing (HPC) in order to
solve engineering
problems with multi-core processors that were not possible only five
years
ago.
Part 1 is a review of real-time
concepts that are important for
understanding this domain of engineering problems, and a comparison of
traditional HPC with real-time HPC.
Part
2 outlines software architecture approaches for utilizing multi-core
processors, along
with cache optimizations.
Part 3 will consider industry examples
that employ
this particular methodology.
Cache Considerations
In traditional embedded systems, CPU caches are viewed as a necessary
evil. The
evil side shows up as a nondeterministic execution time inversely
proportional to
the amount of code and/or data of a time-critical task located inside
the cache when
the task execution has been triggered. For demonstration purposes, we
will profile
cache performance to better understand some important characteristics.
The
technique applied is using a structure within LabVIEW called a timed
loop, shown
in Figure 12.
Figure 12: Timed loop structure (used for
benchmark use-cases).
The timed loop acts as a regular while loop, but with some special
characteristics
that lend themselves to profiling hardware. For example, the structure
will execute
any code within the loop in a single thread. The timed loop can be
configured
with microsecond granularity, and it can be assigned a relative
priority that will be
handled by the RTOS. In addition, it can set processor affinity, and it
can also react
to hardware interrupts. Although the programming patterns shown in the
previous
section do not utilize the timed loop, it is also quite useful for
dealing with realtime
HPC applications, and parallelism is harvested through the use of
multiple
structures and queue structures to pass data between the structures.
The following
describes benchmarks that were performed to understand cache
performance.
An execution time of a single timed loop iteration as a function of the
amount of
cached code/data is shown in Figure 13. The loop runs every 10
milliseconds, and
we use an indirect way to cause the loop's code/data to be flushed from
the cache; a
lower priority task that runs after each iteration of the loop adds 1
to each element
of an increasingly larger array of doubles flushing more and more of
time critical
task's data from the CPU cache. In addition to longer runtime, in the
worst-case
scenario the time goes from 4 to 30 microseconds for an increase by a
factor of
7.5. Figure 13 also shows that decaching also increases jitter. The
same graph can
be also used to demonstrate the "necessary" part of the picture. Even
though some
embedded CPUs will go as far as completely eliminating cache to
increase determinism,
it is obvious that such measures will also significantly reduce
performance.
Besides, few people are willing to go back one or two CPU generations
in performance
especially as the amounts of L1/L2/L3 cache are continuously increasing
providing enough room for most applications to run while incurring only
minor
cache penalties.
Figure 13: Execution time of a simple timecritical
task as a function of amount of cached
code/data on 3.2 GHz/8-MB L3 cache i7 Intel
CPU using LabVIEW Real Time. Initial ramp-up
due to 256K L2 cache.
In real-time HPC, the cache is extremely important because of its
ability to keep
the CPU's computational resources busy. For example, let's assume that
we want to
add 1 to all elements in a single-precision array on a 3-GHz processor
that can
execute one single-precision floating point operation every clock
cycle—a task of
only 3 GFLOPs. The memory bandwidth required to keep the FPU busy would
have to be at least 24 GB/s (12 GB/s in each direction), bandwidth
above the latest
generation of processors with built in memory controllers. The
three-channel i7
CPU tops out at 18 GB/s. However, the CPUs can perform more than one
FLOP
per cycle because they contain multiple FPUs. Using SSE instructions,
one can add
four single precision floating point numbers every cycle so our array
example would
require at least 96 GB/s memory bandwidth to prevent stalls. The red
curve in
Figure 14 contains benchmark results for the described application.
Figure 14 shows
GFLOPs as a function of array size on 3.2 GHz i7 (3.2 GHz, 8 MB L3) for
x[i] =
x[i] + 1 (red curve) and x[i] = A*x[i]2 + B*x[i] + c (black curve)
operation on each
element of a single-precision floating point array using SSE
instructions. The
second graph zooms into first 512 KB region. Three steps are clearly
visible: L1
(32K), L2 (256K) and L3 (8M). Benchmark executed on LabVIEW Real-Time
platform thus minimum amount of jitter. When all data fits in L1, one
CPU can
achieve approximately 8.5 GFLOPs requiring 72 GB/s memory bandwidth.
When
running out of L2, CPU delivers 4.75 GFLOPs requiring 38 GB/s. Once
data does
not fit into CPU caches any more, the performance drops to 0.6 GFLOPs
completely
bounded by 4.8 GB/s memory bandwidth.
Zooming into the plot further also shows additional step at the
beginning for the
red curve, which may point to another level of cache 8K.
The ratio between maximum and minimum performance is a whopping 14x.
The
situation gets worse on a quad core CPU since the application can
easily be parallelized.
In the best case, the four CPUs can deliver 36 GFLOPs since the caches
are independent and in the worst case the performance stays at 0.6
GFLOPs since
the memory bandwidth is shared among the processors. The resulting
maximum/
minimum performance ratio jumps to 56x. As a verification, we run
another test
for which more FLOPs are performed for each array element brought into
the FPU.
Instead of only adding one to the element, we calculate a second order
polynomial,
which requires four FLOPs compared to one originally. Results are shown
in Figure
14 (black curve). AS expected, the maximum performance goes up to 15
GFLOPs
since the memory is being accessed less. For the same reason, the
performance difference
between data completely located in L1 and L2 caches, respectively,
drops.
As main memory latency and bandwidth becomes a gating factor, we again
see large
drop-off in the GFLOPs performance, though to a lesser value of "only"
8x.
Figure 14: GFLOPs as a function of array size on
3.2 GHz i7.
The above simple example demonstrates that multiple cores with their
fast caches
can deliver better-than-linear performance increase if the problem that
did not
originally fit into a single-CPU cache can fit into multiple CPU caches
after it
has been parallelized. However, Figure 14 also implies that any
unnecessary data
transfer between the CPUs can seriously degrade performance especially
if data has
to be moved to/from main memory. Causing cache misses while
parallelizing the
application can not only eat up all performance improvements resulting
from an
increased number of CPUs, but it can also cause an application to run
tens of times
slower than on single CPU.
So, what can one do if real-time HPC application data does not fit into
cache? The
answer depends on the amounts of data used for the time-critical versus
non-timecritical
tasks. For example, the data used in a control algorithm must be
readily
available at all times and should be kept in cache at all cost. The
data that is used for
user interface display or calculation of real-time parameters can be
flushed.
If the data used by the time-critical code fits into the cache, the
application must
overcome cache strategy that can be simply described as "use it or lose
it," by, well,
using the data. In other words, accessing data even when it is not
required will keep
it in the cache. In some cases, the CPU may offer explicit instructions
for locking
down parts of the cache but that is more exception than a rule. For
example, if there
is control data that may be used as the plant approaches some critical
point, accessing
data in each control step is a must. An argument that executing control
code
each iteration that may otherwise be required to run only once every
1000 iterations
is overkill, since even in the worst case the out-of-cache execution
may be only
20x slower, is correct, at least when viewed form an HPC point of view.
Following
the described approach would yield 50x worst CPU utilization and
proportionally
slower execution. Unfortunately, in the real-time HPC this line of
argument is false
because a 20x slower execution of control algorithm can result in
serious damage—
the real-time requirement states that every control action must be
taken before a
given deadline, which may be much shorter that the worst-case
out-of-cache 20x
longer execution time.
Another way to keep data in the CPU cache is to prevent any other
thread from
execution on the given processor. This is where ability of an RTOS to
reserve certain
CPUs for execution of a single task becomes extremely powerful. One
does have to
keep in mind that certain caches may be shared between multiple CPUs
(for
example, L3 cache in Intel's i7 architecture is shared between up to 8
CPUs)
residing on the same physical chip so reserving a core on the processor
that churns
a lot of data on its other cores will be ineffective.
Finally, what can one do if the real-time data does not fit in the
cache? Short
of redesigning the underlying algorithm to use less data, or further
prioritizing
importance of different real-time tasks and devising a scheduling
scheme that will
keep the most important data in cache, there is not much that can be
done. The
penalty can be reduced if one can design an algorithm that can access
data in two
directions. If the data is always accessed in the same order, once it
does not fit into
the cache any more, each single element access will result in the cache
miss. On the
other hand, if the algorithm alternates between accessing data from
first-to-last and
last-to-first element, cache misses will be limited only to the amount
of data actually
not fitting into the cache: the data accessed last in the previous step
is now accessed
first and is thus already located in the cache. While this approach
will always reduce
algorithm execution time in the absolute terms, the relative
performance benefit
will decrease as more and more data does not fit into the cache.
Figure 15 and 16 show benchmarks from a real-world application to which
we
applied all cache management methods described above. The figures show
time
required to multiply a symmetric matrix with a vector. The
multiplication is part of
a control algorithm that has to calculate 3000 actuator control values
based on 6000
sensor input values in less than 1 ms. (This industry example is from
the European
Southern Observatory and is mentioned in the final section of the
paper).
Initially, we tried to use standard libraries for matrix vector
multiplication but we
could not achieve desired performance. The algorithms were top notch
but they
were not optimized for real-time HPC. So, we developed a new
vector-matrix
multiplication algorithm that took advantage of the following:
In the control applications, the interaction matrix whose
inverse is used for
calculation of actuator values does not change often. Consequently,
expensive
offline preprocessing and repackaging of the matrix into a form that
takes
advantage of L1/L2 structure of the CPU as well as exercises SSE vector
units
to their fullest when performing online real-time calculation is
possible.
By dedicating CPUs to control a task only, the control
matrix stays in the cache
and offers the highest level of performance.
Splitting vector matrix multiplication into parallel tasks
increases the amount
of cache available for the problem.
The new algorithm can access matrix data from both ends.
Figure 15 shows a Matrix vector multiplication time (worst case) as a
function
of matrix size. Platform: Dual Quad-core 2.6-GHz Intel' Xeon
processors, with
a 12'MB cache each. The results were achieved by using only 4 cores, 2
on each
processor. Utilizing all 8 cores resulted in further multiplication
time for 3k x 3k
matrix of 500 us.
Figure 15: Matrix vector multiplication time (worst
case) as a function of matrix size.
Figure 16 depicts the Nehalem (8M L3) " cache boundary approximately
1900.
The curve is similar to that of curve for the Intel' Xeon' processor.
The difference
is due to direction toggling smaller because of a much larger increase
in memory
bandwidth compared to the increase in computation power. Excellent
memory
performance is visible for the 6K x 6K case: for 20% CPU clock
increase, there is
a 160% increase in performance (210% for non-toggling approach).
Figure 16: Nehalem (8M L3) " cache boundary
approximately 1900.
The results show that we are able to achieve 0.7 ms multiplication time
for the
required 3k x 3k matrix. The 1-millisecond limit is reached for the
matrix size
of about 3.3k x 3.3k, which is also close to the total L2 cache size (2
processors x
12 MB = 24 MB L2). Increasing the matrix size 4 times (6k x 6k) speeds
execution
time 17 times, implying a 4x degradation in GFLOPs. Using direction
toggling
approach results in up to 50 percent relative performance improvements
for data
sizes slightly larger than the available L2. The speed-up reduces in
relative terms as
the matrix is getting larger.
Watch for Part 3 - Industry Examples, coming next week.