As an embedded developer with an application running on a single processor, you may want to improve performance, or performance per watt, by moving to multicore. If so, you likely want to know the answer to this question: Is there a magic bullet? In other words, can you just move your application onto a multicore platform and have it automatically run faster?
Looking for the magic bullet. If you're using a personal computer or server with multiple applications, it is possible to automatically get performance improvements because the different applications are independent.
Yes, they share the file system on the hard disk and a few other things managed by the OS, but they don't share data or need to synchronize with each other to perform their services. As long as they are independent and don't interfere with each other, you likely get performance improvements. However, with an increasing number of cores, you see diminishing returns in this type of system.
In contrast, if you are working with an embedded system, your application would have to automatically be distributed across multiple cores to achieve performance improvement. All applications have some code that is inherently sequential and they generally have areas that can be run concurrently where synchronization is required.
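The limit imposed by inherently sequential code can be estimated with Amdahl's law. The article does not name it, but it captures the same point. The following is a minimal sketch; the function name and the 20 percent sequential fraction are illustrative assumptions, not measurements from any real application.

```python
def amdahl_speedup(sequential_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only the parallel part scales with cores."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: 20 percent of the run time cannot be parallelized.
    for cores in (1, 2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.2, cores), 2))
```

With a 20 percent sequential share, the bound can never reach 5x no matter how many cores are added, which is why distributing the concurrent parts well matters more than simply adding cores.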
Different parts of an application may also share global variables and may use pointers to reference those global variables. When distributing an application across multiple cores, you gain true concurrency.
This means that global variables that could safely be accessed in a single-processor situation have to be safeguarded, to keep multiple cores from accessing the variables at the same time and corrupting the data.
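One common way to safeguard a shared global is a lock around every update. The sketch below uses Python threads as a stand-in for cores, which is an assumption made for illustration: on a real multicore target the mechanism would be whatever mutual exclusion primitive the OS or hardware provides. All names are hypothetical.

```python
import threading

counter = 0                      # shared global, touched by every worker
counter_lock = threading.Lock()  # guards all updates to counter


def add_many(iterations: int) -> None:
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(iterations):
        with counter_lock:
            counter += 1


def run_workers(workers: int, iterations: int) -> int:
    """Reset the counter, run the workers to completion, return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(iterations,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Without the lock, the read-modify-write in `counter += 1` can interleave between workers and lose updates. Message passing avoids this class of problem by giving each core its own data and exchanging it explicitly.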
You may now be wondering how you can combine speed, planning, continuous assessment and adjustment in a multicore approach to achieve desired results in the shortest period of time.
Start by selecting a programming model that is appropriate for the current project and that, if done right, can also offer long-term cost savings in future projects.
Four common models that are familiar to embedded system programmers are OpenMP (Open Multiprocessing), OpenCL (Open Computing Language), as well as two message passing-based protocols: MPI (Message Passing Interface) and MCAPI (Multicore Communications API).
OpenMP is commonly used for loop processing across multiple cores in SMP environments. OpenMP is a language extension using preprocessor directives and is used in systems with general purpose SMP Operating Systems such as Linux, as well as in systems with shared memory and homogeneous cores.
While embedded systems may include general purpose computing elements, they often have some requirements better served with asymmetrical approaches, which go beyond the focus area of OpenMP. The OpenMP Architecture Review Board (ARB), a technology consortium, manages OpenMP.
OpenCL is primarily used for GPU programming. It is a framework for writing programs that span CPUs, GPUs (graphics processing units) and potentially other processors. OpenCL provides a path to execute applications that go beyond graphics processing, for example vector and math processing, covering some areas of embedded computing. The OpenCL open standard is maintained by the Khronos Group.
Message passing: A familiar route to multicore
Message passing, an even more commonly used model, is used in networking and high-performance computing, and is often included in operating system functional areas. In other words, it's a ubiquitous model and familiar to a lot of programmers, as it is used in both SMP and AMP environments. MPI is native to high-performance computing whereas MCAPI is primarily focused on embedded computing.
Both use explicit communication functions such as send and receive. MPI provides capabilities for widely distributed computing, in dynamic networks, typically with GP OSes supporting the process model, whereas MCAPI is more focused on closely distributed (embedded) computing, in static topologies where there may be a GP OS, an RTOS, or no OS.
MPI and MCAPI have similarities as well as complementary differences and can therefore beneficially be combined in a distributed cluster of many-core computers.
What is message passing? By whatever name, message passing uses explicit communication functions such as send and receive. Message passing can be connectionless or connected, and the functions can be blocking or non-blocking.
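The difference between blocking and non-blocking receive can be sketched with a standard in-process queue. This is an analogy only: the function names are hypothetical, and Python's `queue.Queue` is not a multicore transport.

```python
import queue

mailbox = queue.Queue()   # stands in for a receive endpoint


def try_receive(box: queue.Queue):
    """Non-blocking receive: return a message, or None if nothing is waiting."""
    try:
        return box.get_nowait()
    except queue.Empty:
        return None


def receive(box: queue.Queue, timeout_s: float):
    """Blocking receive: wait up to timeout_s seconds for a message."""
    return box.get(timeout=timeout_s)
```

A non-blocking call lets the receiver do other work and check back later, at the price of having to manage completion itself. A blocking call is simpler to write but holds up the caller until a message arrives.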
The message itself consists of a header, i.e. metadata such as source and destination address, message and/or payload size, priority, etc., and sometimes payload data. This is analogous to a letter, where the address and return address represent metadata and the content is the payload. And just like a letter in the mail, the message is processed to determine the appropriate route, the priority and the latency requirements. A network packet offers another example of a message.
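The header-plus-payload layout described above might be modeled as follows. This is a sketch with hypothetical field names and is not the wire format of MCAPI, MPI or any network protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    source: int        # sender address
    destination: int   # receiver address
    payload_size: int  # payload length in bytes
    priority: int      # assumed convention here: lower value is more urgent


@dataclass(frozen=True)
class Message:
    header: Header
    payload: bytes = b""   # may be empty when only metadata is exchanged


def make_message(source: int, destination: int,
                 payload: bytes = b"", priority: int = 0) -> Message:
    """Build a message whose header always matches its payload."""
    header = Header(source, destination, len(payload), priority)
    return Message(header, payload)
```

Deriving `payload_size` from the payload inside `make_message` keeps the metadata and the data from drifting apart, which is the kind of bookkeeping a message passing layer takes off the application's hands.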
If the sender and the receiver share memory, only the metadata, including a reference or pointer to the payload, is exchanged between the sender and the receiver. The receiver can then reference the payload data without actually moving it. The advantage of sharing data by reference is that less data has to be moved, which reduces transaction time. However, when the sender and receiver cannot see the same memory, the data must be moved.
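The two cases can be illustrated in a single process, with a `bytearray` standing in for shared memory. This is an assumption made for illustration; the function names are hypothetical and no real transport is involved.

```python
shared_buffer = bytearray(b"sensor-frame-0001")  # stands in for shared memory


def send_by_reference(buffer: bytearray) -> memoryview:
    """Hand over a view of the payload; no payload bytes are copied."""
    return memoryview(buffer)


def send_by_copy(buffer: bytearray) -> bytes:
    """Hand over an independent copy, as needed without shared memory."""
    return bytes(buffer)


view = send_by_reference(shared_buffer)
copied = send_by_copy(shared_buffer)
shared_buffer[-1] = ord("2")   # the sender updates the frame in place
```

After the update, `view` reflects the new contents while `copied` still holds the old frame. That shows both the saving (nothing was moved) and the obligation that comes with it: the two sides must agree on when the referenced buffer may be modified.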
Message passing is generally platform agnostic. In networking where the sender and the receiver may be using different kinds of computers and operating systems and where the packet may go through many computers along the way, you use the sockets API, which looks the same in all of the systems involved in the communication process. Using a similar paradigm, message passing can be applied to a number of different multicore platforms.
Message passing is also scalable, as you see with networking and with MPI in high-performance computing, which uses a very large number of cores. Because it is transport agnostic, potential bus bandwidth issues, which can occur in SMP environments with many cores, can be avoided.
Something to pay attention to with message passing is the transaction cost. The transaction cost comes from processing the message and transferring the metadata and the data payload. Any time you can keep the data payload from moving, you save transaction time. Message passing is a time-tested method and expertise is widely available.
Looking at MCAPI, you see that it has three different communication modes:
1. Connectionless Messages
2. Connected Packet Channels
3. Connected Scalar Channels
Connectionless Messages is the most flexible mode. Messages are sent from a source to a destination without requiring a connection, with blocking and non-blocking functionality. Communication is buffered with per message priority.
Connected Packet Channels use connected unidirectional FIFO channels with blocking and non-blocking functionality. Communication is buffered with per channel priority.
Connected Scalar Channels are for 8-, 16-, 32- or 64-bit scalars and use connected unidirectional FIFO channels with blocking functionality and with per channel priority.
Other functional groups provide node and endpoint management, as well as management of non-blocking operations and general support functions. A 3-tuple, consisting of domain id, node id and port id, defines an endpoint.
After MCAPI has been initialized, the application creates endpoints on the local node and retrieves endpoints from remote nodes. Based on the application, the proper communication mode and blocking or non-blocking functionality is selected.
When completed, the application can send and receive the messages. If non-blocking functions are used, the proper non-blocking management functions are used to complete the communications. Beyond the MCAPI API, the application is not aware of how the communication is managed. The how is managed by the MCAPI implementation and supporting tools.
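The sequence described above (create a local endpoint, look up the destination endpoint, then send and receive) can be mimicked with a toy in-process model. The class and method names below are hypothetical and are not the MCAPI function signatures, and the rule that a lower priority number is delivered first is an assumption of this sketch.

```python
import itertools
import queue
from typing import NamedTuple


class Endpoint(NamedTuple):
    domain: int
    node: int
    port: int


class ToyTransport:
    """In-process stand-in for a message passing runtime (hypothetical names)."""

    def __init__(self) -> None:
        self._queues = {}
        self._order = itertools.count()  # keeps FIFO order within a priority

    def create_endpoint(self, domain: int, node: int, port: int) -> Endpoint:
        endpoint = Endpoint(domain, node, port)
        if endpoint in self._queues:
            raise ValueError(f"{endpoint} already exists")
        self._queues[endpoint] = queue.PriorityQueue()
        return endpoint

    def get_endpoint(self, domain: int, node: int, port: int) -> Endpoint:
        endpoint = Endpoint(domain, node, port)
        if endpoint not in self._queues:
            raise LookupError(f"{endpoint} has not been created")
        return endpoint

    def msg_send(self, dest: Endpoint, payload: bytes, priority: int = 0) -> None:
        # Buffered, connectionless send with a per message priority.
        self._queues[dest].put((priority, next(self._order), payload))

    def msg_recv(self, endpoint: Endpoint, timeout_s: float = 1.0) -> bytes:
        # Blocking receive that returns the most urgent buffered message.
        _, _, payload = self._queues[endpoint].get(timeout=timeout_s)
        return payload
```

As in the text, the caller sees only endpoints and send/receive calls. How the bytes travel (here, a Python priority queue) stays hidden behind the interface and could be swapped without touching application code.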
MCAPI Enablement. On a single processor, function parameters are passed on the call stack. Distributing functions across multiple cores using message passing involves passing the parameters and computational results by explicit communication. Functions are encapsulated by communication calls to deliver parameters and results. Parameters can be passed by reference if shared memory is used, or by moving the data otherwise.
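Encapsulating a function with communication calls can be sketched as a caller-side wrapper and a worker loop. Here a thread plays the role of the second core and two queues play the role of the transport; both are assumptions for illustration, and `checksum` is a placeholder workload rather than anything from the article.

```python
import queue
import threading

requests = queue.Queue()   # parameters travel to the worker "core"
results = queue.Queue()    # results travel back


def checksum(data: bytes) -> int:
    """The function being offloaded; a placeholder workload."""
    return sum(data) % 256


def worker() -> None:
    """Runs on the other 'core': receive parameters, compute, send result."""
    while True:
        params = requests.get()
        if params is None:      # sentinel: shut the worker down
            break
        results.put(checksum(params))


def remote_checksum(data: bytes) -> int:
    """Caller-side wrapper: same signature as checksum, but message based."""
    requests.put(data)
    return results.get(timeout=5.0)
```

To use it, start `worker` in a thread, call `remote_checksum(...)` wherever `checksum(...)` was called before, and put `None` on `requests` to stop the worker. The calling code keeps its shape, while parameters and results now cross a communication boundary instead of the call stack.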
Migrating to multicore using MCAPI
Let's look at the flow in migrating an application to multicore using message passing.
Your first step is to determine the composition and behavior of the application. In some cases, the application is well understood and in other cases it may have to be analyzed through profiling.
It is important to emphasize the functions that use the most CPU cycles. Tackling the hotspots first is the fastest path to performance improvement. Once you have established your first assumptions of how to distribute the application function on multiple cores, it's time to map the application to the multicore platform.
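Finding the hotspots usually starts with a profiler. The sketch below uses Python's standard `cProfile` and `pstats` modules on a made-up packet workload; the `parse`, `transform` and `process` functions are hypothetical, and an embedded project would use its own toolchain's profiler in the same way.

```python
import cProfile
import io
import pstats


def parse(packet: bytes) -> bytes:
    return packet[4:]                # cheap: strip an assumed 4-byte header


def transform(body: bytes) -> int:
    return sum(b * b for b in body)  # deliberately the expensive step


def process(packets) -> int:
    return sum(transform(parse(p)) for p in packets)


def profile_report(packets, top: int = 5) -> str:
    """Profile process() and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    process(packets)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

In the resulting report, the functions with the largest cumulative time are the first candidates for mapping onto other cores.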
The following step is to encapsulate the functions that are mapped to other cores. To do so, you add communication primitives, in their simplest form, by sending and receiving messages.
Now there’s the more complex task of adding the communications software. This involves moving metadata and possibly payload data, and synchronizing and safeguarding both during the communication.
The communication needs to be adapted to the transport, taking into account operating systems, drivers if applicable, and also considering that the two sides of the communication may be on different types of cores. As you add communication channels, this task becomes increasingly complex.
Once you have encapsulated, added primitives and written the communications software, it's time to compile, test, and validate it against the requirements. Chances are that you won't get it right the first time, which will lead you to the step of measuring and analyzing to discover where assumptions made can be improved. This flow may have to be repeated multiple times before you are finished.
This migration flow (Figure 1, below) can be greatly simplified with the MCAPI programming model and tools support.
MCAPI is a relatively simple API with approximately 50 different functions. Most applications do not need them all, and often require only a small subset. Even so, tools can simplify the MCAPI-enabling task.
Figure 1: Application Flow and Topology Map
The tool shown in the figure above is aware of the underlying topology map, in terms of nodes and ports. It can therefore provide information to the programmer about which nodes and ports are available.
You first select which MCAPI function to use (i.e., send or receive message, blocking or non-blocking, destination node, and which ports to use), then select parameters either from existing variables or those offered by the tool.
The template is filled in and MCAPI code is generated and inserted into the application at the location that the programmer chooses. Code generation as provided by this tool speeds up the programming effort, and because the tool is topology aware, the code templates are tested, reducing the number of errors and amount of debugging required.
Validation & Optimization
Once you have a working prototype, you need to validate it against the requirements defined at the outset. Does the programming model match the application and did you provision the system well to support this programming model?
You need to look at the mapping to see if you succeeded in smoothing out the function pipeline or if unexpected stalls and waiting occur. Of course, you need to evaluate if the multicore platform is capable of supporting the application within the requirements, and it is extremely valuable to have access to a simulator in combination with rapid prototyping tools when evaluating the hardware platform.
Rapid Prototyping & Exploration
A packet processing application (Figure 2, below) that has one extremely compute-intensive function illustrates the concept and usefulness of rapid prototyping in this context.
In this example, a programming platform is used with MCAPI message passing support, a real time OS, and a multicore capable simulator. Global shared memory moves data between processor local memories.
The simulator allows you to modify the cycles required to access global shared memory and to easily add cores. The tools let you quickly rearrange the communication topology and provision the resources in the topology.
You need to find out how memory access latency impacts performance and whether you can refactor the application to find an optimal number of cores for parallelizing this function.
What you will find is that with perfect memory (i.e., zero-cycle access latency), you can improve performance with up to five parallel cores, but will likely not see any further improvement from adding more cores beyond that.
However, if you look at batch processing multiple data sets to reduce the transaction cost, you will find that by combining multiple packets per transfer with realistic memory access latency, you can have almost linear improvement, with up to twice the number of cores.
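Why batching helps can be seen from a toy cost model: each transfer carries a fixed overhead that is shared by every packet in the batch. The function and the cycle counts used with it are illustrative assumptions, not figures from the experiment described above.

```python
def cycles_per_packet(overhead_cycles: int, payload_cycles: int, batch: int) -> float:
    """Average transfer cost per packet when `batch` packets share one transfer."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return overhead_cycles / batch + payload_cycles
```

With an assumed 400 cycles of per-transfer overhead and 100 cycles per packet, one packet per transfer costs 500 cycles per packet, while eight packets per transfer cost 150. The fixed transaction cost shrinks relative to useful work, which leaves room for additional cores to contribute.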
Figure 2: Rapid Prototyping and Exploration
Without having the capability to explore an application in this way, it's highly unlikely that a development team could have gone beyond the five cores without leaving performance on the table.
And the Winner Is?
Without a doubt, planning ahead will get you the results you are after faster in your migration to multicore. First you need to look at the application and its behavior to determine a suitable programming model.
For a function pipeline, it is suggested that you use a message passing model. This message passing can of course be used in other situations as well, since one of its attractive features is its ability to scale to very large systems with numerous cores.
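A function pipeline built on message passing can be sketched with one worker per stage and a FIFO queue between neighbors. Threads and `queue.Queue` stand in for cores and channels here, and all names are hypothetical.

```python
import queue
import threading

STOP = object()   # sentinel that flows through the pipeline to end it


def stage(func, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Receive an item, apply func, send the result to the next stage."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Run each function as its own stage, linked by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    outputs = []
    while (result := queues[-1].get()) is not STOP:
        outputs.append(result)
    for t in threads:
        t.join()
    return outputs
```

Because each stage only talks to its neighbors through a queue, a slow stage shows up as a growing inbox, which is the kind of stall the validation step is meant to uncover.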
You also need to look at standards, which help provide solutions as well as future-proof your efforts from this project to the next and beyond. Make sure that the standard applies to the target system used (e.g., MCAPI message passing is more suitable for embedded systems whereas MPI is the obvious choice for widely distributed high-performance computing).
For the multicore platform, the approach depends on whether the platform is already available or whether you need to define a suitable platform. A programming platform providing the appropriate tools and runtime software, to enable you to iteratively explore options and to reduce time to product, would be the most suitable choice for such a project.
In other words, multicore programming is not that much different than the rest of life. The more planning, proper tools and smarts applied to examine the process and ensure you’re following the optimal path, the faster your team will achieve an optimized multicore application. While this appears slower at the outset, just remember the story about the tortoise and the hare and who really wins the race.
Sven Brehmer, President & CEO of PolyCore Software, is a founding member of the Multicore Association and chairman of the Communications API working group. Prior to founding PolyCore Software, Brehmer served as senior director for Wind River's Embedded Platforms Division as a result of the acquisition of Integrated Systems in 2000, where Brehmer served as the chief operating officer and executive vice president of DIAB-SDS. Brehmer was also President and CEO of Diab Data after receiving his Master's degree in Electronics Engineering from the Royal Institute of Technology in Stockholm, Sweden.