Picking the right multicore architecture for your compute-intensive application

From small, highly integrated systems-on-chip to full-blown multicore powerhouses, the multicore revolution is here. But what is the best way to address it in your systems? Leveraging all that compute power at your fingertips is a daunting task.

Some of the available hardware can get very complex. The Freescale T4240, for example, has 12 multithreading cores that schedule 24 threads, with two threads sharing each core and three clusters of four cores, each cluster sharing a 2MB cache. Is it better to run a single OS domain that schedules all cores and threads, or to divide the compute power into many individual OS domains, each controlling its own set of tasks? The answer depends on the application. Is it parallel-safe, and is it data-intensive? The shared Level 2 cache may make a good partitioning boundary.

Other hardware choices include a standard set of CPUs, such as those in the Intel Core i7, combined with a built-in GPU. Such a system can run eight hardware threads across four hyper-threaded cores in the CPU complex and also leverage the GPU for general-purpose compute. While supporting the heterogeneous CPU-GPU mix adds to system complexity, it can be worth the trouble if the system achieves higher performance for compute-intensive applications.

Once you understand how the application can be broken up, determine which methods and languages are available to build it. In multi-OS configurations, whether the CPUs are symmetric or asymmetric, shared memory is typically used to pass data and messages between the OS domains. That is not the only method, but the general pattern is to pass a command and its data to another operating system domain, where an interrupt handler picks up the message and dispatches it to the appropriate thread. But what APIs can be used?

There are several to choose from. The Multicore Association maintains MCAPI (Multicore Communication API), which is designed specifically for the multi-OS paradigm. MCAPI (Figure 1) can build on top of an adjacent specification, MRAPI (Multicore Resource API), which provides the low-level shared memory as a resource between multiple OS domains. (A sketch of MCAPI-style messaging appears after Figure 1.)

Figure 1: MCAPI is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation.
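
To make the flow concrete, here is a minimal C sketch in the style of the MCAPI 1.x message API. Exact function names and signatures vary between MCAPI versions and implementations, and the node and port identifiers here are placeholders for illustration only.

```c
#include <mcapi.h>   /* header name per the Multicore Association spec */

/* Placeholder identifiers for this sketch; real values come from
 * your system's topology definition. */
#define MY_NODE    1
#define MY_PORT    100
#define PEER_NODE  2
#define PEER_PORT  200

void send_command(void)
{
    mcapi_status_t   status;
    mcapi_version_t  version;
    mcapi_endpoint_t local, remote;
    char             cmd[] = "process-block";

    /* Join the MCAPI topology as node MY_NODE (MCAPI 1.x-style call). */
    mcapi_initialize(MY_NODE, &version, &status);

    /* Create a local endpoint and look up the peer's endpoint. */
    local  = mcapi_create_endpoint(MY_PORT, &status);
    remote = mcapi_get_endpoint(PEER_NODE, PEER_PORT, &status);

    /* Connectionless message send; the implementation typically moves
     * the payload through shared memory (via MRAPI or its own layer). */
    mcapi_msg_send(local, remote, cmd, sizeof cmd,
                   1 /* priority */, &status);

    mcapi_finalize(&status);
}
```

On the receiving OS domain, a matching endpoint would block in mcapi_msg_recv and dispatch the command to a worker thread, which is the interrupt-driven pattern described above.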

Other choices for this type of architecture include proprietary APIs that take a similar approach. Whichever is easiest to configure and maintain over the long run may be the best choice for implementation. One important attribute is the overhead of such an interface. These cores typically share memory, which is much faster than communicating over Ethernet. If one of the reasons to divide the application into several OS domains is to prevent cache thrashing (where threads of execution fight over the same cache line for their reads and writes), then an efficient implementation is necessary. The sketch below shows what that thrashing looks like in code.
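
As a minimal C illustration of the classic false-sharing case and its usual fix, consider the following. The 64-byte line size is an assumption that holds for most current x86 and Power parts; check your target's manual.

```c
#include <pthread.h>

#define CACHE_LINE 64   /* assumed line size; verify for your target */

/* BAD: adjacent counters land in the same cache line, so two threads
 * incrementing "their own" counter still ping-pong the line between
 * cores -- the thrashing described above. */
struct counters_bad {
    unsigned long a;
    unsigned long b;
};

/* BETTER: pad each counter out to its own cache line so each thread
 * owns a line outright. */
struct counter_padded {
    unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];
};

static struct counter_padded per_thread[2];

static void *worker(void *arg)
{
    struct counter_padded *c = arg;
    for (long i = 0; i < 100000000L; i++)
        c->value++;          /* hot write to a privately owned line */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, &per_thread[0]);
    pthread_create(&t1, NULL, worker, &per_thread[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```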

There are several choices for programming within an OS domain that spans several symmetric CPUs. One option is to use what the operating system's threading model already provides. Several programming models build on a standard threaded OS environment: OpenMP, OpenCL, and Cilk/Cilk++. Each has a different syntax. Some are simple to adopt but offer less control. Some require extensive changes to typical C language syntax. And some are not supported on all architectures, so confirm that the model you choose is available on your target architecture, with compiler support and an operating system that supports it as well. The OpenMP sketch below shows how lightweight the simplest approach can be.
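
Here is a minimal OpenMP sketch in C; the loop body is a stand-in for whatever per-element work your application actually does. Compile with your toolchain's OpenMP flag (e.g., -fopenmp for GCC).

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], out[N];

    /* One pragma distributes the loop iterations across the OS
     * threads that the runtime schedules onto the available cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out[i] = a[i] * b[i] + 1.0f;

    printf("out[0] = %f (up to %d threads)\n",
           out[0], omp_get_max_threads());
    return 0;
}
```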

For those who are into extreme programming and want to take full advantage of all the gates in a system, consider GPGPU (general-purpose GPU programming). There are several factors to weigh: language, drivers, and bandwidth. GPUs (graphics processing units) are specifically designed for pixel-level graphics manipulation, calculating vectors of data, and rendering ever more complex 3D images at high frame rates. Because of this, they can execute complex algorithms over many small data elements very quickly.

GPGPU requires non-trivial drivers, which must be supported by the operating system. Many GPU vendors do not release their driver source code, since their intellectual property lives not only in the GPU core but in the drivers as well. While some GPU vendors provide drivers for the more popular operating systems, not all operating systems are supported.

Then there are the languages to choose from for GPGPU: OpenCL, a Khronos standard, and CUDA, which is specific to Nvidia GPUs. Both take similar approaches to parallel programming, and benchmark results go both ways. The hardware choice may suggest that one is better suited than the other. Because OpenCL is an open standard with compiler support and drivers available on most platforms, it can target CPUs and GPUs in the same design without changing the code. This may be a good starting point (see the kernel sketch below).
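
For a flavor of the kernel side, here is a minimal SAXPY-style kernel in OpenCL C. The same source can be built for a CPU or a GPU device at run time simply by requesting CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU from clGetDeviceIDs; the host-side setup is omitted here for brevity.

```c
/* OpenCL C kernel: each work-item handles one element, so the
 * runtime fans the computation out across the device's lanes. */
__kernel void saxpy(const float alpha,
                    __global const float *x,
                    __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];
}
```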

Lastly, how much data must be moved to the remote GPUs, and across which type of bus, may influence just how much processing to push to those nodes. The more data-intensive the workload, the closer the GPU should be to the CPU complex. Putting it across a PCIe bus, where it may have to share bandwidth with other peripherals, can hurt performance. If the GPU is close to the CPU, those issues are minimized. A back-of-the-envelope check, like the one below, helps decide.
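
A simple C sketch of that check follows. All the numbers are assumptions for illustration, not measurements: a plausible effective PCIe throughput and hypothetical CPU and GPU runtimes for one batch of work.

```c
#include <stdio.h>

/* Illustrative numbers only -- measure these on real hardware. */
#define BUS_BYTES_PER_SEC  6.0e9   /* assumed effective PCIe throughput */
#define DATA_BYTES         1.0e9   /* payload shipped to/from the GPU   */
#define CPU_COMPUTE_SEC    0.50    /* hypothetical CPU-only runtime     */
#define GPU_COMPUTE_SEC    0.05    /* hypothetical GPU kernel runtime   */

int main(void)
{
    /* Offload pays only if transfer cost plus GPU time beats the CPU. */
    double transfer = 2.0 * DATA_BYTES / BUS_BYTES_PER_SEC; /* out + back */
    double offload  = transfer + GPU_COMPUTE_SEC;

    printf("CPU only: %.3f s, GPU offload: %.3f s -> %s\n",
           CPU_COMPUTE_SEC, offload,
           offload < CPU_COMPUTE_SEC ? "offload wins" : "keep it on CPU");
    return 0;
}
```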

There is no “magic” here. One needs to dive into each of the architecture choices, including hardware, software, language, and compiler, in order to assess how each component influences the chosen architecture and to optimize the particular algorithm. There is no “one size fits all” for high-performance compute systems, at least not yet.

Stephen Olsen will talk about these issues in greater detail at the Multicore Developers Conference, May 7-8, in a presentation titled How to Leverage Multicore Architectures for Compute-Intensive Applications (ME1119).

Stephen Olsen is a product line manager for VxWorks at Wind River. Prior to Wind River, Stephen worked at Mentor Graphics as a consultant, system architect, and RTOS engineering manager. He co-chaired VSIA's Hardware-dependent Software (HdS) design working group, worked on the MRAPI specification for the Multicore Association, and authored several papers on system design, USB, multicore/multi-OS design, and power management. He was awarded a patent on debugging hardware-accelerated operating systems.
