CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)



Industrial Control Designline

Figure 15 and 16 show benchmarks from a real-world application to which we applied all cache management methods described above. The figures show time required to multiply a symmetric matrix with a vector. The multiplication is part of a control algorithm that has to calculate 3000 actuator control values based on 6000 sensor input values in less than 1 ms. (This industry example is from the European Southern Observatory and is mentioned in the final section of the paper).

Initially, we tried to use standard libraries for matrix vector multiplication but we could not achieve desired performance. The algorithms were top notch but they were not optimized for real-time HPC. So, we developed a new vector-matrix multiplication algorithm that took advantage of the following:
  • In the control applications, the interaction matrix whose inverse is used for calculation of actuator values does not change often. Consequently, expensive offline preprocessing and repackaging of the matrix into a form that takes advantage of L1/L2 structure of the CPU as well as exercises SSE vector units to their fullest when performing online real-time calculation is possible.
  • By dedicating CPUs to control a task only, the control matrix stays in the cache and offers the highest level of performance.
  • Splitting vector matrix multiplication into parallel tasks increases the amount of cache available for the problem.
  • The new algorithm can access matrix data from both ends.
Figure 15 shows a Matrix vector multiplication time (worst case) as a function of matrix size. Platform: Dual Quad-core 2.6-GHz Intel' Xeon processors, with a 12'MB cache each. The results were achieved by using only 4 cores, 2 on each processor. Utilizing all 8 cores resulted in further multiplication time for 3k x 3k matrix of 500 us.


Figure 15: Matrix vector multiplication time (worst case) as a function of matrix size.


Figure 16 depicts the Nehalem (8M L3) " cache boundary approximately 1900. The curve is similar to that of curve for the Intel' Xeon' processor. The difference is due to direction toggling smaller because of a much larger increase in memory bandwidth compared to the increase in computation power. Excellent memory performance is visible for the 6K x 6K case: for 20% CPU clock increase, there is a 160% increase in performance (210% for non-toggling approach).


Figure 16: Nehalem (8M L3) " cache boundary approximately 1900.


The results show that we are able to achieve 0.7 ms multiplication time for the required 3k x 3k matrix. The 1-millisecond limit is reached for the matrix size of about 3.3k x 3.3k, which is also close to the total L2 cache size (2 processors x 12 MB = 24 MB L2). Increasing the matrix size 4 times (6k x 6k) speeds execution time 17 times, implying a 4x degradation in GFLOPs. Using direction toggling approach results in up to 50 percent relative performance improvements for data sizes slightly larger than the available L2. The speed-up reduces in relative terms as the matrix is getting larger.


Watch for Part 3 - Industry Examples, coming next week.

Copyright 2009 Intel Corporation
1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :