Traditional microprocessor designs are reaching performance limits due to the power wall caused by increased frequency and circuit area, and the memory wall caused by the performance gap between CPU and memory. Diminishing returns on instruction-level parallelism have also made it difficult to continue scaling performance at the rate predicted by Moore's Law.
Recently, the programmability of add-on graphics processing units (GPUs) and streaming processors has increased, driven by demands for greater realism in 3D games and graphics applications.
General-purpose computing on these devices is therefore now possible. Accordingly, almost every computer system is now a heterogeneous platform combining a CPU with a GPU or streaming processor to provide both graphics rendering and general-purpose computation. Today, GPUs provide substantially more computational power than state-of-the-art CPUs, and the performance gap between them is expected to widen over time.
Exploiting the ever-increasing computing power of modern GPUs is thus a challenge. In the past, writing parallel programs for these high-performance heterogeneous computer systems required familiarity with graphics APIs or vendor-specific APIs.
These APIs and programming paradigms are extremely difficult to use. The most popular parallel programming paradigms, such as OpenMP and MPI, are unsuitable for these heterogeneous multicore architectures, and vendor-specific GPGPU APIs are hard to program and do not port across architectures. OpenCL was therefore proposed to ease programming and migration between diverse architectures.
Although OpenCL is computationally powerful and compatible with different platforms, fully utilizing OpenCL devices requires careful tuning of the computing kernels: despite their great potential, exploiting these GPUs is challenging because of their diverse underlying architectural characteristics. This study discusses the architectures of these highly efficient GPUs and applies the unified programming standard OpenCL to fully utilize their capabilities.
In this study, several optimization techniques are applied on OpenCL-compatible heterogeneous multicore architectures to achieve thread-level and data-level parallelism. The architectural implications of these techniques are discussed, and optimization principles for these architectures are proposed. The experimental results reveal average speedups of 24x and 430x for non-optimized and optimized kernels, respectively.
The complete paper is available from the author archives at Chung Yuan Christian University, Taiwan.