CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

A real-time HPC approach for optimizing Intel multi-core architectures (Part 2 of 3)



Industrial Control Designline

In real-time HPC, the cache is extremely important because of its ability to keep the CPU's computational resources busy. For example, let's assume that we want to add 1 to all elements in a single-precision array on a 3-GHz processor that can execute one single-precision floating point operation every clock cycle—a task of only 3 GFLOPs. The memory bandwidth required to keep the FPU busy would have to be at least 24 GB/s (12 GB/s in each direction), bandwidth above the latest generation of processors with built in memory controllers. The three-channel i7 CPU tops out at 18 GB/s. However, the CPUs can perform more than one FLOP per cycle because they contain multiple FPUs. Using SSE instructions, one can add four single precision floating point numbers every cycle so our array example would require at least 96 GB/s memory bandwidth to prevent stalls. The red curve in Figure 14 contains benchmark results for the described application. Figure 14 shows GFLOPs as a function of array size on 3.2 GHz i7 (3.2 GHz, 8 MB L3) for x[i] = x[i] + 1 (red curve) and x[i] = A*x[i]2 + B*x[i] + c (black curve) operation on each element of a single-precision floating point array using SSE instructions. The second graph zooms into first 512 KB region. Three steps are clearly visible: L1 (32K), L2 (256K) and L3 (8M). Benchmark executed on LabVIEW Real-Time platform thus minimum amount of jitter. When all data fits in L1, one CPU can achieve approximately 8.5 GFLOPs requiring 72 GB/s memory bandwidth. When running out of L2, CPU delivers 4.75 GFLOPs requiring 38 GB/s. Once data does not fit into CPU caches any more, the performance drops to 0.6 GFLOPs completely bounded by 4.8 GB/s memory bandwidth.

Zooming into the plot further also shows additional step at the beginning for the red curve, which may point to another level of cache 8K. The ratio between maximum and minimum performance is a whopping 14x. The situation gets worse on a quad core CPU since the application can easily be parallelized. In the best case, the four CPUs can deliver 36 GFLOPs since the caches are independent and in the worst case the performance stays at 0.6 GFLOPs since the memory bandwidth is shared among the processors. The resulting maximum/ minimum performance ratio jumps to 56x. As a verification, we run another test for which more FLOPs are performed for each array element brought into the FPU. Instead of only adding one to the element, we calculate a second order polynomial, which requires four FLOPs compared to one originally. Results are shown in Figure 14 (black curve). AS expected, the maximum performance goes up to 15 GFLOPs since the memory is being accessed less. For the same reason, the performance difference between data completely located in L1 and L2 caches, respectively, drops. As main memory latency and bandwidth becomes a gating factor, we again see large drop-off in the GFLOPs performance, though to a lesser value of "only" 8x.


Figure 14: GFLOPs as a function of array size on 3.2 GHz i7.


The above simple example demonstrates that multiple cores with their fast caches can deliver better-than-linear performance increase if the problem that did not originally fit into a single-CPU cache can fit into multiple CPU caches after it has been parallelized. However, Figure 14 also implies that any unnecessary data transfer between the CPUs can seriously degrade performance especially if data has to be moved to/from main memory. Causing cache misses while parallelizing the application can not only eat up all performance improvements resulting from an increased number of CPUs, but it can also cause an application to run tens of times slower than on single CPU.
1 | 2 | 3 | 4

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :