Multicore systems are popular but problematic. The authors describe the problem with communications APIs and what the Multicore Association is doing about it.
What do firefighting and multicore programming have in common? Both are hot jobs. Firefighters and multicores both need to get the job done as quickly and effectively as possible. They both require reliable, standardized tools. Firefighters always act as a team, and the same goes for multicore. But most importantly, they both have to communicate well. Without communication, firefighters don't survive and the cores in a multicore system may as well be operating alone.
Analogies aside, it's important to point out that “excellent communication” is a relative term that depends on the application's requirements. Regardless of the implementation, however, multicore systems can be classified according to their memory architectures and their communication mechanisms. Before we go on, we should point out that when we say multicore, we're talking about systems with two or more processing elements, including homogeneous (same processor type) and heterogeneous (different processor types) multiprocessor systems, as well as coprocessors and hardware accelerators. We should also point out that this article focuses on multicore-enabled closely distributed embedded applications, but we'll take a look at the similarities and differences of the memory architectures and communication application programming interfaces (APIs) used in desktops, servers, and networks.
Memory architectures and communication APIs
In a shared-memory system, all the processors can access all available memory as one global address space. Shared memory is typically accessed through a bus and controlled by some type of locking mechanism to avoid simultaneous access of the same memory by multiple cores. This arrangement provides a straightforward programming model where each processor can access the memory directly. Shared memory permits passing data by reference without actually moving the data. On the other hand, shared memory can become a bottleneck when too many processors try to access it at the same time. This bottleneck suggests that the memory architecture doesn't scale well with an increasing number of processors.
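As a rough illustration of the shared-memory model, the following Python sketch uses threads as stand-ins for cores and a `threading.Lock` as a stand-in for the locking mechanism. The function name `run_shared_counter` and the dictionary used as the shared buffer are assumptions made for this example, not part of any API discussed here.

```python
import threading

def run_shared_counter(num_workers=4, increments=10_000):
    """Several workers update one shared object in place, guarded by a lock."""
    shared = {"count": 0}      # one object, visible to every worker
    lock = threading.Lock()    # stands in for the locking mechanism

    def worker(buf):
        # buf is a reference to the shared object; no data is copied.
        for _ in range(increments):
            with lock:         # only one worker touches the location at a time
                buf["count"] += 1

    threads = [threading.Thread(target=worker, args=(shared,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Every worker contends for the same lock, which is a small-scale picture of the bottleneck described above: adding workers adds contention rather than capacity.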
In a distributed memory system, each of the processors can only access its own local memory; no global memory address space exists across them, and communication relies on various forms of message passing. A core with its own local memory doesn't have to share the access to its memory, providing an efficient and scalable structure. When one core requires data from another core or cores need to synchronize among themselves, data must be physically moved (in other words, not by reference). Message passing can be asynchronous, meaning that while waiting for data, other computations can be performed until the data arrives. Alternatively, message passing can be synchronous; the waiting task is blocked until data arrives. If both shared and local memory are available, it's possible to create efficient communication structures by combining the best features of both.
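The difference between asynchronous and synchronous receives can be sketched in a few lines of Python. Here a `queue.Queue` stands in for the channel between two cores, and the sender copies its data before sending to mimic a physical move; `demo_message_passing` and the variable names are hypothetical.

```python
import queue
import threading

def demo_message_passing():
    mailbox = queue.Queue()    # stands in for the inter-core channel
    events = []

    # Asynchronous style: test for a message and carry on if none has arrived.
    try:
        mailbox.get_nowait()
    except queue.Empty:
        events.append("nothing yet, keep computing")

    local_data = [1, 2, 3]     # lives in the sender's "local memory"

    def sender():
        # Send a copy: the data is moved, not passed by reference.
        mailbox.put(list(local_data))

    t = threading.Thread(target=sender)
    t.start()

    # Synchronous style: block until the message arrives.
    received = mailbox.get()
    t.join()
    events.append("received")
    return events, received, received is local_data
```

The last element of the returned tuple is `False`, showing that the receiver holds its own copy rather than a reference to the sender's data.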
Traditional comms APIs
To support these different memory models and communication mechanisms, a variety of API standards have been developed over the years including OpenMP (Multi Processing), Sun Remote Procedure Call (RPC), Common Object Request Broker Architecture (CORBA), and Message Passing Interface (MPI).
For shared memory architectures that use simple communication schemes, the most widely used implementations and APIs are proprietary. For example, TI's DSP/BIOS Link is specifically designed for the company's chips. For more complex implementations and communication schemes, operating systems with symmetric multiprocessing (SMP) are commonly used.
From a standards perspective, OpenMP is probably the most widely used API for shared memory architectures, supporting multiprocessing programming in C/C++ and Fortran on many architectures, including UNIX and Microsoft Windows platforms. OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer. The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data environment management, thread synchronization, user-level run-time routines, and environmental variables.
OpenMP supports incremental parallelism: a programmer can parallelize one portion of the program at a time, and no dramatic code changes are needed. This means that OpenMP can be gradually introduced into existing applications, thereby reducing the pain incurred when transitioning from a single core to a multicore system. Because OpenMP is most useful for parallelizing loops, it may apply to only a portion of an application, which limits the opportunity to exploit all forms of parallel execution. And because embedded applications are generally more event driven than general-purpose and high-performance computing applications, the opportunity for parallel (loop) threads may be reduced further.
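OpenMP itself targets C/C++ and Fortran, so the following Python sketch is only an analogy for incremental loop parallelism, not OpenMP code. It shows one loop being handed to a worker pool while the rest of the program stays untouched; `scale`, `process_serial`, and `process_parallel` are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(sample):
    """Per-iteration work; each call is independent of the others."""
    return sample * 2 + 1

def process_serial(samples):
    # The original sequential loop.
    return [scale(s) for s in samples]

def process_parallel(samples, workers=4):
    # The same loop with its iterations shared out among workers.
    # Only this function changes; callers and scale() are untouched.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, samples))  # map preserves input order
```

In standard CPython builds, threads will not speed up CPU-bound work like this, so the sketch illustrates the structure of the change rather than any performance gain.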
Since multiprocessing implementations are generally widely distributed (as opposed to on-chip), most standard protocols and APIs are aimed at widely distributed architectures. Distributed architectures range from the Internet, wide area networks, local area networks, servers, all the way down to single-chip devices that contain a variety of processing elements. These architectures use some form of message passing to transfer data and commands. Message passing provides a portable style of parallel programming but in general is difficult to program and doesn't support incremental parallelization of an existing sequential program. Message passing was initially defined for client/server applications running across a network and includes costly semantics that aren't often required by tightly coded scientific applications running on modern scalable systems, let alone embedded applications.
MPI is the most popular message-passing API for widely distributed computing, being both portable (MPI has been implemented for many distributed memory architectures) and fast (each implementation is optimized for the hardware on which it runs). However, although MPI is powerful, it's also complex: in its full form it will likely consume more memory than a multicore chip can spare, and it may introduce too much computational overhead (latency). Nevertheless, useful features can be borrowed from MPI for an embedded application. Computer clusters are often programmed with a so-called hybrid model that uses both OpenMP and MPI.
It's obvious that we have plenty of standards available to support communication but none designed or intended to support closely distributed embedded multicore systems, especially their hardware-static nature (known number of cores per chip); their tight memory and task-execution-time constraints; or the on-chip interconnect requirements. Furthermore, in the embedded systems world, most applications are asymmetric in nature; multicore-enabled platforms may use heterogeneous cores on the same chip or asymmetric architectures with different operating systems running on homogeneous cores, or some combination of both. To cope with this shift from widely distributed systems, it's helpful for the industry to agree on common, simple, and efficient abstractions for such concurrent systems that enable us to describe key aspects of concurrency in ways that can be simply and directly represented as a set of APIs.
The resource management, communications, and synchronization required for embedded distributed systems are some specific areas of programming multicore systems that must be addressed. The reality is that such systems can't rely only on a single operating system, or even an SMP operating system, for such services. When heterogeneous multicore systems employ a range of operating systems across multiple cores, it means they have resources that can't be managed by any single operating system. This situation is exacerbated further by the presence of hardware accelerators that don't run any form of operating system but must interact with processes that are potentially running on multiple operating systems on different cores.
Before going into the details of the high-performance communication mechanisms required by embedded systems, let's examine the functions and communication requirements of some example embedded applications. In particular, let's look at an automotive application employing tens to hundreds of sensor inputs, which must be read on a periodic basis, and an application that processes network packets.
Controlling an engine
In our automotive application, sensors are continuously monitored to determine the appropriate engine parameter settings. Running the application code, one general-purpose (GP) processor can handle the data tasks responsible for polling the sensors. Another GP processor can handle the scheduled control task that reads the data collected by the data tasks and processes the data by computing values to apply to various actuators in the engine. The frequency at which the control task must be run is determined by the engine's RPM and a particular angle of the engine's crankshaft. This workload and the associated transfer of data implies that the synchronization between control and data tasks must be minimal to avoid negative impacts on the latency of the control task.
Ideally, the sum of the latencies plus message send/receive times should be less than the latency of the control loop at the current engine RPM. In general, individual tasks are expected to complete in times varying from 1ms up to 1,600ms, depending on the nature of the sensor and the type of processing required for its data. In this application, the control task must be able to determine if data is available from each data task, and if not, the task should be able to proceed in a nonblocking fashion using the last data from the sensor in question; this implies some form of nonblocking message test or select mechanism. It would also be desirable for the control task to use a lightweight communication API to send updates to actuators as “messages.” The data task should use this API to read data from the sensor, implying some sort of driver implementation underneath this API. Finally, the data task should be able to do a nonblocking message send to the control task.
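The nonblocking test-and-fall-back behavior described above might look something like the following Python sketch. `SensorChannel`, `send_nowait`, `latest`, and `control_step` are hypothetical names chosen for illustration; they don't correspond to any existing API.

```python
import queue

class SensorChannel:
    """Hypothetical channel between one data task and the control task."""

    def __init__(self, initial):
        self._q = queue.Queue()  # unbounded, so a send never blocks
        self.last = initial      # last value the control task has seen

    def send_nowait(self, value):
        # Data-task side: nonblocking message send.
        self._q.put_nowait(value)

    def latest(self):
        # Control-task side: drain whatever has arrived without blocking,
        # and fall back to the previous reading if nothing is waiting.
        while True:
            try:
                self.last = self._q.get_nowait()
            except queue.Empty:
                return self.last

def control_step(channels):
    """One pass of the control task over all sensor channels."""
    return {name: ch.latest() for name, ch in channels.items()}
```

A control step never waits on a slow sensor: a channel with no new message simply reports its last known value.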
Ideally, we'd be able to try out different ways to partition an application to distribute it optimally across multiple cores. For example, the automotive application could use one processor for control and one for data tasks; one processor for both control and data tasks plus a SIMD core for signal processing and other special-purpose processors for the remaining data processing; or simply dedicate one core per cylinder or group of cylinders. Using a standard API that's the same for the different cores and operating systems would make it much easier to try different implementations.
In our second example, the packet-processing application, assume we're working on a multicore-enabled system in which task modules perform various forms of processing on network-packet (IP packet) streams that contain packets ranging in size from 64 bytes to about 1KB. For instance, the packet processing might be a TCP engine. In this application, the multicore chip contains several cores (three to 100, depending on the bandwidth needed), each with a small amount of local private memory or cache. Some of the cores may be specialized hardware accelerators. The cores communicate with each other through some on-chip interconnect. Also, the cores can all talk to some common shared memory through the chip's interconnect, although this shared memory is not assumed to support cache coherency, as Figure 1 shows.
There are two desired modes of packet transfer between modules. In one mode, IP packets are brought in from the outside world and streamed between the modules without going through external shared memory. In another mode, packets are placed in shared memory between modules, and each module accesses the packets from the shared memory. A commonly used hybrid of these modes is one in which packets are placed in shared memory, while packet descriptors and metadata are streamed directly between cores without going through shared memory.
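The hybrid mode can be sketched as follows, assuming a `bytearray` as a stand-in for the shared memory and a `queue.Queue` as a stand-in for the direct core-to-core stream. `Descriptor` and `run_hybrid` are hypothetical names, and the producer and consumer run back to back in one function purely to keep the example short.

```python
import queue
from collections import namedtuple

# Small record streamed between modules; the payload stays in shared memory.
Descriptor = namedtuple("Descriptor", ["offset", "length"])

def run_hybrid(packets, pool_size=4096):
    shared = bytearray(pool_size)  # stands in for the common shared memory
    descriptors = queue.Queue()    # stands in for the core-to-core stream

    # Producer module: write each packet once, stream only its descriptor.
    offset = 0
    for pkt in packets:
        if offset + len(pkt) > pool_size:
            raise ValueError("shared pool exhausted")
        shared[offset:offset + len(pkt)] = pkt
        descriptors.put(Descriptor(offset, len(pkt)))
        offset += len(pkt)

    # Consumer module: locate each packet in place from its descriptor.
    view = memoryview(shared)
    seen = []
    while not descriptors.empty():
        d = descriptors.get()
        payload = view[d.offset:d.offset + d.length]  # slice without copying
        seen.append(bytes(payload))  # copied here only to return a result
    return seen
```

Each descriptor is a pair of small integers, while the packet it refers to may be up to about 1KB, which is why streaming descriptors is cheaper than streaming packets.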
In this application, although each of the modules in this system has an independent flow of control, the modules must also communicate and synchronize with each other. The modules access both private and shared data. Therefore, the modules need both local memory (preferably cache) and global shared memory. In general, it's preferable if the shared memory could be globally shared between all the cores. Much of the communication between the cores follows a stream pattern, occurring once or twice per packet. Thus, Module 3 must efficiently receive (from Module 2) and transmit (to Module 4) 20 to 50 bytes of data for every packet it processes, in other words for each 100 to 2,000 cycles of execution.
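To put the Module 3 figures in perspective, a short calculation gives the range of communication load they imply. It assumes the 20-to-50-byte figure applies to each transfer (one receive and one transmit per packet); `bytes_per_cycle` is a helper invented for this example.

```python
def bytes_per_cycle(bytes_per_packet, cycles_per_packet):
    """Average communication load per transfer direction."""
    return bytes_per_packet / cycles_per_packet

# Lightest case: small message, long computation per packet.
lightest = bytes_per_cycle(20, 2000)   # 0.01 bytes per cycle

# Heaviest case: large message, short computation per packet.
heaviest = bytes_per_cycle(50, 100)    # 0.5 bytes per cycle
```

The heaviest case is 50 times the lightest, so a communication mechanism sized for the average load could fall well short at the extremes.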
From a computational perspective, all processing in this application is both parallel and pipelined. Each packet is first processed in a pipelined manner by Module 1 and then by Module 2 (representing hardware acceleration). Modules 1 and 2 perform very little computation for each packet, but they must quickly examine the metadata before the packet is sent to Module 3. All the compute elements represented by Module 3 handle the compute-intensive parallel processing of packets. Each instance of Module 3 performs between 100 and 4,000 cycles of computation on each packet. In other words, multiple Module 3s process independent packets in parallel. Module 3 also accesses shared memory for packet data while it's computing. After Module 3 is done, it might send some data to Module 4 indicating how to process the packets before they're shipped back out on the network.
Performance is commonly scaled by adding more modules (for example, Module 3), without changing the code in Module 3. This implies several important factors. First, the communication and synchronization APIs must be flexible and scalable enough to allow simple upgrades of the system. Second, although the latency of processing a packet is important, the packet throughput through the system is the key processing metric and potential bottleneck. Hence, the communication mechanisms that support the packet transfers must be able to efficiently move the data. Another important factor relates to time-to-market and the ability to quickly port third-party software that previously ran on sequential processors. With a standard communications API, a communications module can simply be added to each functional module, simplifying the porting.
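Scaling by adding instances of Module 3 can be sketched with a pool of identical workers fed from one queue. In this hypothetical Python example, threads stand in for cores, `module3` is a placeholder computation, and `run_pipeline` plays the role of the surrounding modules; none of these names come from an existing API.

```python
import queue
import threading

def module3(packet):
    """Placeholder for the compute-intensive per-packet work."""
    return sum(packet) % 256

def run_pipeline(packets, num_module3=2):
    to_m3 = queue.Queue()  # stream from Modules 1 and 2 to Module 3
    to_m4 = queue.Queue()  # stream from Module 3 to Module 4

    def worker():
        # Every Module 3 instance runs this same code, whatever the count.
        while True:
            item = to_m3.get()
            if item is None:       # shutdown marker
                break
            seq, pkt = item
            to_m4.put((seq, module3(pkt)))

    workers = [threading.Thread(target=worker) for _ in range(num_module3)]
    for w in workers:
        w.start()
    for seq, pkt in enumerate(packets):  # packets arrive in order
        to_m3.put((seq, pkt))
    for _ in workers:
        to_m3.put(None)                  # one marker per instance
    for w in workers:
        w.join()

    # Module 4 side: restore packet order using the sequence numbers.
    results = [to_m4.get() for _ in packets]
    return [value for _, value in sorted(results)]
```

Changing `num_module3` changes the number of instances without touching `module3` or the worker code, which is the property the communication API has to preserve.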
There's an API for us
We've presented these two examples hoping they'll encourage development of an API specifically for embedded multicore systems that has properties similar to MPI (and others) but with minimal overhead and the ability to exploit the proximity of multiple cores on a single chip. The Multicore Association has started work on a message-passing API and a resource-management API (referred to within the organization as CAPI and RAPI, respectively) that will capture the basic elements of communication and synchronization required for embedded distributed systems. The target systems for such an API will span multiple dimensions of heterogeneity (such as core heterogeneity, interconnect heterogeneity, memory heterogeneity, operating-system heterogeneity, software tool-chain heterogeneity, and programming-language heterogeneity). While industry standards, such as MPI and OpenMP, exist for distributed systems programming, they have primarily focused on the needs of (1) distributed systems in the large, (2) SMP systems, or (3) specific application domains (for example scientific computing). Thus, CAPI has similar but more highly constrained goals than these existing standards with respect to scalability and fault tolerance, yet has more generality with respect to application domains.
The Multicore Association's APIs will form a layer on top of which other abstractions or applications may be built, as shown in Figure 2. For maximum performance, an application can interact directly with the API for inter-core communication, synchronization, and resource allocation. In other words, the application can avoid a series of expensive operating-system calls.
There's a long list of features and functions that these APIs could support, but in actuality these APIs represent a subset of existing APIs. The real challenge in developing such APIs is determining what functions and features not to use. In other words, to meet the stringent demands of the single-chip (or otherwise closely distributed) multicore platform, the APIs may draw upon concepts implemented in the traditional protocols (programming models, semantics, and so forth), which were originally intended for large-scale computing platforms, but within the resource constraints of an embedded multicore system.
Some of the features the Multicore Association is considering for the APIs include making the source-code portable and reusable so the architecture can be processor independent and enabling implementations to be scalable for messaging performance and memory footprint. It should also be possible to build more powerful and complex capabilities on top of this API to enable system-level control via message passing.
The embedded systems industry appears to have endorsed multicore technology, but the gap between its capabilities and the available software support for multicore implementations continues to grow. In this article, we've barely scratched the surface of the issues being resolved in the embedded multicore world, as well as the standards that are being developed within the Multicore Association. Also being explored within the Multicore Association is multicore debugging, an entire topic unto itself. More information can be found at www.multicore-association.org.
Markus Levy is founder and president of the Embedded Microprocessor Benchmark Consortium and serves as the president of the Multicore Association. He's worked for EDN and Instat/MDR and is coauthor of Designing with Flash Memory. He also worked for Intel as a senior applications engineer and customer training specialist for Intel's microprocessor and flash memory products.
Prior to founding PolyCore Software, Sven Brehmer served as senior director in charge of Wind River's Embedded Platforms Division, then home of VxWorks, pSOS, and VSPWorks. He came to Wind River through its acquisition of ISI in 2000, where Brehmer served as the COO and executive vice president of DIAB-SDS, a subsidiary of ISI. Prior to DIAB-SDS, Brehmer was the president and CEO of Diab Data. Brehmer has a master's in electronics engineering from the Royal Institute of Technology, Stockholm, Sweden.