A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)
By Dr. Aljosa Vrancic and Jeff Meisel, National Instruments
In real-time HPC, the cache is extremely important because of its
ability to keep
the CPU's computational resources busy. For example, let's assume that
we want to
add 1 to all elements in a single-precision array on a 3-GHz processor
that can
execute one single-precision floating point operation every clock
cycle—a task of
only 3 GFLOPs. The memory bandwidth required to keep the FPU busy would
have to be at least 24 GB/s (12 GB/s in each direction), bandwidth
above the latest
generation of processors with built in memory controllers. The
three-channel i7
CPU tops out at 18 GB/s. However, the CPUs can perform more than one
FLOP
per cycle because they contain multiple FPUs. Using SSE instructions,
one can add
four single precision floating point numbers every cycle so our array
example would
require at least 96 GB/s memory bandwidth to prevent stalls. The red
curve in
Figure 14 contains benchmark results for the described application.
Figure 14 shows
GFLOPs as a function of array size on 3.2 GHz i7 (3.2 GHz, 8 MB L3) for
x[i] =
x[i] + 1 (red curve) and x[i] = A*x[i]2 + B*x[i] + c (black curve)
operation on each
element of a single-precision floating point array using SSE
instructions. The
second graph zooms into first 512 KB region. Three steps are clearly
visible: L1
(32K), L2 (256K) and L3 (8M). Benchmark executed on LabVIEW Real-Time
platform thus minimum amount of jitter. When all data fits in L1, one
CPU can
achieve approximately 8.5 GFLOPs requiring 72 GB/s memory bandwidth.
When
running out of L2, CPU delivers 4.75 GFLOPs requiring 38 GB/s. Once
data does
not fit into CPU caches any more, the performance drops to 0.6 GFLOPs
completely
bounded by 4.8 GB/s memory bandwidth.
Zooming into the plot further also shows additional step at the
beginning for the
red curve, which may point to another level of cache 8K.
The ratio between maximum and minimum performance is a whopping 14x.
The
situation gets worse on a quad core CPU since the application can
easily be parallelized.
In the best case, the four CPUs can deliver 36 GFLOPs since the caches
are independent and in the worst case the performance stays at 0.6
GFLOPs since
the memory bandwidth is shared among the processors. The resulting
maximum/
minimum performance ratio jumps to 56x. As a verification, we run
another test
for which more FLOPs are performed for each array element brought into
the FPU.
Instead of only adding one to the element, we calculate a second order
polynomial,
which requires four FLOPs compared to one originally. Results are shown
in Figure
14 (black curve). AS expected, the maximum performance goes up to 15
GFLOPs
since the memory is being accessed less. For the same reason, the
performance difference
between data completely located in L1 and L2 caches, respectively,
drops.
As main memory latency and bandwidth becomes a gating factor, we again
see large
drop-off in the GFLOPs performance, though to a lesser value of "only"
8x.

Figure 14: GFLOPs as a function of array size on
3.2 GHz i7.
The above simple example demonstrates that multiple cores with their
fast caches
can deliver better-than-linear performance increase if the problem that
did not
originally fit into a single-CPU cache can fit into multiple CPU caches
after it
has been parallelized. However, Figure 14 also implies that any
unnecessary data
transfer between the CPUs can seriously degrade performance especially
if data has
to be moved to/from main memory. Causing cache misses while
parallelizing the
application can not only eat up all performance improvements resulting
from an
increased number of CPUs, but it can also cause an application to run
tens of times
slower than on single CPU.