A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)
By Dr. Aljosa Vrancic and Jeff Meisel, National Instruments
Figure 15 and 16 show benchmarks from a real-world application to which
we
applied all cache management methods described above. The figures show
time
required to multiply a symmetric matrix with a vector. The
multiplication is part of
a control algorithm that has to calculate 3000 actuator control values
based on 6000
sensor input values in less than 1 ms. (This industry example is from
the European
Southern Observatory and is mentioned in the final section of the
paper).
Initially, we tried to use standard libraries for matrix vector
multiplication but we
could not achieve desired performance. The algorithms were top notch
but they
were not optimized for real-time HPC. So, we developed a new
vector-matrix
multiplication algorithm that took advantage of the following:
- In the control applications, the interaction matrix whose
inverse is used for
calculation of actuator values does not change often. Consequently,
expensive
offline preprocessing and repackaging of the matrix into a form that
takes
advantage of L1/L2 structure of the CPU as well as exercises SSE vector
units
to their fullest when performing online real-time calculation is
possible.
- By dedicating CPUs to control a task only, the control
matrix stays in the cache
and offers the highest level of performance.
- Splitting vector matrix multiplication into parallel tasks
increases the amount
of cache available for the problem.
- The new algorithm can access matrix data from both ends.
Figure 15 shows a Matrix vector multiplication time (worst case) as a
function
of matrix size. Platform: Dual Quad-core 2.6-GHz Intel' Xeon
processors, with
a 12'MB cache each. The results were achieved by using only 4 cores, 2
on each
processor. Utilizing all 8 cores resulted in further multiplication
time for 3k x 3k
matrix of 500 us.

Figure 15: Matrix vector multiplication time (worst
case) as a function of matrix size.
Figure 16 depicts the Nehalem (8M L3) " cache boundary approximately
1900.
The curve is similar to that of curve for the Intel' Xeon' processor.
The difference
is due to direction toggling smaller because of a much larger increase
in memory
bandwidth compared to the increase in computation power. Excellent
memory
performance is visible for the 6K x 6K case: for 20% CPU clock
increase, there is
a 160% increase in performance (210% for non-toggling approach).

Figure 16: Nehalem (8M L3) " cache boundary
approximately 1900.
The results show that we are able to achieve 0.7 ms multiplication time
for the
required 3k x 3k matrix. The 1-millisecond limit is reached for the
matrix size
of about 3.3k x 3.3k, which is also close to the total L2 cache size (2
processors x
12 MB = 24 MB L2). Increasing the matrix size 4 times (6k x 6k) speeds
execution
time 17 times, implying a 4x degradation in GFLOPs. Using direction
toggling
approach results in up to 50 percent relative performance improvements
for data
sizes slightly larger than the available L2. The speed-up reduces in
relative terms as
the matrix is getting larger.
Watch for Part 3 - Industry Examples, coming next week.
Copyright 2009 Intel Corporation