Comparison of OpenMP & OpenCL Parallel Processing Technologies - Embedded.com

Quad-core and other multi-core processors, along with GPUs, have become standard in both workstations and high-performance computers. These systems use aggressive multithreading so that whenever a thread stalls waiting for data, the hardware can efficiently switch to executing another thread.

Achieving good performance on these modern systems requires explicit structuring of the applications to exploit parallelism and data locality.

Multi-core technology offers very good performance and power efficiency, and OpenMP was designed as a programming model for taking advantage of multi-core architectures. The problem with GPUs is that their architecture is quite different from that of a conventional computer, so code must be (re)written to explicitly expose algorithmic parallelism. A variety of GPU programming models have been proposed.
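To illustrate what the OpenMP model looks like on a multi-core CPU, here is a minimal sketch in C. It is not taken from the paper: the function name parallel_sum and the choice of a summation loop are assumptions made purely for illustration. With OpenMP enabled at compile time (for example, -fopenmp with GCC), the loop iterations are divided among the available threads; without it, the pragma is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical example: sum the elements of an array.
 * The pragma asks OpenMP to split the loop iterations across threads. */
double parallel_sum(const double *data, size_t n)
{
    double total = 0.0;

    /* reduction(+:total) gives each thread a private partial sum
     * and adds the partial sums together when the loop finishes. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; ++i) {
        total += data[i];
    }

    return total;
}
```

The point of the sketch is that the serial loop is left largely intact and the parallelism is expressed through a single directive, which is the incremental style OpenMP is known for.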

The most popular development tool for scientific GPU computing has proved to be CUDA (Compute Unified Device Architecture), provided by NVIDIA for its own GPU products. However, CUDA is not designed for heterogeneous systems, whereas the OpenCL programming model, from the Khronos Group, supports cross-platform parallel programming of heterogeneous processing systems.

Given the diversity of high-performance architectures, the question is which one is the best fit for a given workload. The extent to which an application benefits from these systems depends on the availability of cores and on other workload parameters.

This paper addresses these issues by implementing parallel algorithms for four test cases and comparing their performance in terms of execution time and the percentage speed-up achieved.
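The summary does not say exactly how the paper computes its speed-up figures, so the following C sketch uses two common definitions as an assumption: the speed-up factor as the ratio of serial to parallel execution time, and the percentage improvement as the relative reduction in execution time. The function names are hypothetical.

```c
/* Assumed definition: speed-up factor = serial time / parallel time. */
double speedup_factor(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Assumed definition: percentage reduction in execution time
 * relative to the serial run. */
double time_reduction_percent(double t_serial, double t_parallel)
{
    return 100.0 * (t_serial - t_parallel) / t_serial;
}
```

For example, under these definitions a run that takes 10.0 s serially and 2.5 s in parallel has a speed-up factor of 4 and a 75% reduction in execution time.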

The focus of our study is on benchmark performance, comparing OpenMP and OpenCL. We observed that the OpenCL programming model is a good option for mapping threads onto different processing cores.

Balancing the load across all available cores and allocating a sufficient amount of work to every compute unit can lead to improved performance. In our simulation, we used a system running the Fedora operating system, with an Intel Xeon dual-core processor with a thread count of 24, coupled with an NVIDIA Quadro FX 3800 graphics processing unit.

To read more of this external content, download the paper from the author's online archives at Visvesvaraya National Institute of Technology.
