Multicore processors have become a mainstay in the last couple of years, powering devices such as mobile phones, tablets, servers, wireless base stations and media servers. In a short span of five to seven years, the industry has seen leading silicon vendors develop multicore processors ranging from dual-core parts to devices with 128 cores.
With the rapid pace of multicore processor evolution, software developers face various challenges. Developers don’t need to worry, however, as there are solutions that help overcome the development challenges of multicore software.
Designing embedded software for multicore processors takes some special considerations. One challenge arises when the initial thinking and implementation of an application or algorithm is done in a sequential manner.
The sequential execution model works on a single-core system, where execution occurs on one CPU. But when the complexity of the application outpaces the performance offered by a single-core CPU, a move to a multicore system is forced.
Identifying the potential areas to parallelize becomes the most important factor. In some cases an application or algorithm designer identifies the parallelization manually. Commercial tools such as CriticalBlue’s Prism have come online in the last few years to ease this task for complex applications and algorithms.
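One way to judge whether a candidate region is worth parallelizing is Amdahl’s law, a standard bound that the article does not name but that fits this step. The sketch below is in Python for brevity and assumes a hypothetical profile in which a known fraction of the run time can be parallelized; the function name and the 80% figure are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the run time can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical profile: 80% of the run time sits in a parallelizable loop.
    for cores in (1, 2, 4, 8):
        print(cores, round(amdahl_speedup(0.8, cores), 2))
```

For the assumed 80% figure the bound is 2.5x on four cores and roughly 3.3x on eight, which suggests that shrinking the remaining serial part can matter more than adding cores.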
Figure 1: Iterative design process to find the optimum design for a given multicore platform
Deciding on the optimal mapping of parallelized blocks onto different cores is another challenge for developers to conquer. The factors to consider are the communication requirements between the parallelized blocks as well as the input and output interfaces.
When communication between the blocks is frequent and the chosen multicore architecture can effectively support that communication, it is acceptable to execute the blocks on different cores. But when communication between the cores becomes the bottleneck and there is no option to move to a more efficient processor architecture, the developer has to go back to the segmentation step and find other parallelization strategies.
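The mapping trade-off described above can be explored with a small model before any code is written for the target. The following Python sketch is an illustration under stated assumptions, not a tool from the article: the block names, traffic volumes and per-core capacity limit are hypothetical, and the exhaustive search is only practical for a handful of blocks.

```python
from itertools import product


def communication_cost(mapping, traffic):
    """Sum the traffic between blocks that are placed on different cores."""
    return sum(
        volume
        for (src, dst), volume in traffic.items()
        if mapping[src] != mapping[dst]
    )


def best_mapping(blocks, cores, traffic, max_blocks_per_core):
    """Exhaustively search block-to-core assignments for the lowest inter-core traffic."""
    best = None
    for assignment in product(range(cores), repeat=len(blocks)):
        # Skip assignments that place too many blocks on one core.
        if any(assignment.count(core) > max_blocks_per_core for core in range(cores)):
            continue
        mapping = dict(zip(blocks, assignment))
        cost = communication_cost(mapping, traffic)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


if __name__ == "__main__":
    # Hypothetical four-stage pipeline with traffic volumes in arbitrary units.
    blocks = ["filter", "fft", "detect", "report"]
    traffic = {("filter", "fft"): 100, ("fft", "detect"): 80, ("detect", "report"): 5}
    print(best_mapping(blocks, cores=2, traffic=traffic, max_blocks_per_core=2))
```

In this made-up case the search keeps the two heaviest communicators, "filter" and "fft", on the same core, so only the lighter "fft" to "detect" traffic crosses cores.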
Another important factor to consider is the use of shared interface peripherals. If two parallel blocks need the same input and/or output peripheral, special consideration has to be taken in the design process. Effective modeling of such behavior using a simulator or tools like Polycore Software’s Poly-Platform helps at this stage to design parallel software that is not prone to multicore communication or input/output bottlenecks.
Figure 1 above highlights the process of perfecting the design, in which a developer iterates over the design before proceeding to the development phase.
Developing multicore software
In some cases multicore software development follows the design steps, while in other cases it is done in parallel with the design. Either way, multicore processors challenge traditional development methodology. Development teams need to adopt a common programming model and resource-use conventions so that the various components can use the processor power effectively.
In the past few years, programming models such as Open Multi-Processing (OpenMP), shown in Figure 2 below, Open Computing Language (OpenCL) and the Multicore Communications API (MCAPI) have emerged. These programming models address developers’ need for a cohesive multicore programming model.
Apart from these programming models, some high-level operating systems such as Linux and OSE support multicore/multiprocessor programming through well-defined application programming interfaces such as “pthreads.”
Figure 2: Selecting the right programming model is the main challenge and can help reduce development time significantly.
As highlighted in Figure 2 above, OpenMP is best suited when an application design can be segmented into a master thread that initiates multiple parallel worker threads running independently on multiple cores. OpenMP 3.0 defines specific language constructs that allow a programmer to indicate the start and end of a parallel region. The programmer can also indicate the logical points where the worker threads are synchronized.
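The master/worker shape described above is often called fork-join. OpenMP itself is a directive set for C, C++ and Fortran, so the Python sketch below is only an analogy for that shape and not OpenMP code; the function name and the sum-of-squares workload are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join_sum_of_squares(values, workers=4):
    """Master thread forks workers over chunks, then joins and combines the results."""
    values = list(values)
    # Split the input into contiguous chunks, at most one per worker.
    chunk = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]

    def work(part):
        return sum(v * v for v in part)

    # Entering the `with` block starts the parallel region; leaving it waits
    # for every worker, which acts as the synchronization point.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(work, chunks))

    # Back on the master thread: reduce the partial results.
    return sum(partial_sums)


if __name__ == "__main__":
    print(fork_join_sum_of_squares(range(10)))
```

In standard CPython the global interpreter lock keeps these threads from running Python bytecode on several cores at once, so the sketch mirrors the structure of a parallel region rather than its speedup.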
OpenCL is best suited when an application has a master thread and needs to compute the workload in highly parallel entities, also known as work kernels. A typical application involves a general-purpose processor running a master execution thread, with high-speed connectivity to multicore processors or graphics units.
Similar to OpenMP, OpenCL defines specific language constructs that let programmers mark regions of computation, known as “kernels,” as well as programming interfaces to execute the computation. OpenCL also supports run-time compilation of compute kernels, which lets a user either use pre-written kernels or write a kernel and prototype it immediately.
MCAPI provides another option for multicore programming through well-defined interfaces. With MCAPI, a development process can be defined in terms of developing MCAPI nodes. The nodes can be discovered dynamically at run time and connected with each other using MCAPI communication channels. Based on application needs, a programmer can then construct the data and execution flow.
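The node-and-channel style of design can be pictured with a small message-passing sketch. This is not the MCAPI interface: the node functions, the end-of-stream sentinel and the use of Python threads with `queue.Queue` as channels are all hypothetical stand-ins chosen for illustration.

```python
import queue
import threading

END_OF_STREAM = object()  # sentinel marking the end of the data flow


def producer_node(channel, samples):
    """Hypothetical source node: sends each sample over its outgoing channel."""
    for sample in samples:
        channel.put(sample)
    channel.put(END_OF_STREAM)


def scaler_node(inbound, outbound, gain):
    """Hypothetical processing node: receives, scales and forwards samples."""
    while True:
        item = inbound.get()
        if item is END_OF_STREAM:
            outbound.put(END_OF_STREAM)
            return
        outbound.put(item * gain)


def run_pipeline(samples, gain=2):
    """Connect two nodes with channels and collect the output on the caller's thread."""
    a_to_b = queue.Queue()
    b_to_sink = queue.Queue()
    nodes = [
        threading.Thread(target=producer_node, args=(a_to_b, samples)),
        threading.Thread(target=scaler_node, args=(a_to_b, b_to_sink, gain)),
    ]
    for node in nodes:
        node.start()
    results = []
    while True:
        item = b_to_sink.get()
        if item is END_OF_STREAM:
            break
        results.append(item)
    for node in nodes:
        node.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Because the nodes share nothing except their channels, the data flow can be rearranged by reconnecting queues rather than by rewriting the nodes, which is the property the paragraph above attributes to a node-based design.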
Some applications may not match the capabilities of any standard programming model. In that case, developers need to ensure that the software development kit provided has the capabilities needed to use the multicore processor architecture.
Application developers must select the programming model that is most efficient for their application. In some cases the selection is closely tied to the processor architecture, as some multicore architectures are better suited to a certain programming model. For example, OpenMP lends itself very well to architectures with shared memory support.
A standards-based open programming model yields several advantages:
- Enables selection of various compliant software blocks;
- Eases migration from one processor architecture to another that supports the same programming model;
- Establishes common development guidelines across development teams for a simplified development process.
In certain cases the application needs many multicore processors on a board. For example, in telecommunications, a large number of boards use the Advanced Telecommunications Computing Architecture (ATCA) form factor. One ATCA board can house more than 10 eight-core digital signal processors (DSPs). In such situations it is important to consider communication across the multicore processors; in some cases the processors provide input and output interfaces such as Serial RapidIO (sRIO) or Ethernet to enable this communication.
The software layers provided with a multicore software development kit need to ensure that applications can use these multicore processor communication interfaces effectively. Standards-based programming models have started to take notice of such needs. MCAPI has a working group focused on extending MCAPI to use sRIO as the physical interface to connect and communicate across many multicore processors.
By definition, OpenCL does not specify a physical interface; rather, it relies on the host/master processor to distribute kernels for computation. With the right amount of processing and execution capability, it is possible to have one master/host using several different multicore chips as accelerators.
Apart from the programming model, another important design and development challenge is the ability to debug and tune the product during the system test cycle. Most of the time, debugging and tuning are an afterthought and are done on an as-needed basis. Current and emerging multicore processors employ more than eight cores interfacing with various high-performance input and output peripherals.
The difficulty of understanding the interactions increases with the number of cores and shared peripherals. In some multicore devices, a debug architecture such as Chip Tools 4 (CTools 4) is used to provide the ability to capture and store debug and trace information on-chip. Developers need to ensure that the software development kit provides an appropriate programming interface to enable the debug and trace buffers and to use them to gain valuable debug and performance information.
Debugging multicore software
Multicore devices also present unique debug and performance challenges, such as resource conflicts and synchronization, which are rarely encountered in a single-core system.
At the most basic level, multicore devices still require the same basic debug capability that single-core systems have used. This includes the ability to run, halt and step the core. But with a multicore system, the ability to load, run and step all cores synchronously is extremely useful.
Multicore systems may also feature multiple power domains, in which entire subsystems containing cores and peripherals are shut off. The debug system must be able to survive these power transitions, and even offer the user the ability to control these transitions from the debugger, so that power related features can be more easily tested.
Figure 3: Block diagram overview of multicore debugging features
As the processing on a multicore system becomes more advanced, problems such as synchronization between software threads across cores become a challenge. With standards such as OpenMP and OpenCL, these software threads may be working across similar or disparate core types.
These threads may be utilizing shared resources, which can result in race conditions and performance degradations. A debug subsystem which can provide a multicore, time correlated view of the software operations can be invaluable in debugging the synchronization and resource conflicts which can occur.
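A race condition on a shared resource typically comes from an unprotected read-modify-write. The Python sketch below is a hypothetical illustration, not code from any particular multicore SDK: several threads update one shared counter, and a lock serializes the update so that no increment is lost.

```python
import threading


class SharedCounter:
    """A counter shared by several threads, protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write below must not interleave with another thread's.
        with self._lock:
            self.value += 1


def count_in_parallel(threads=4, increments=10_000):
    """Run several workers against one counter and return the final value."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_in_parallel())
```

The lock guarantees the final count, but every worker now queues on the same resource, which is the kind of contention a time-correlated view of the cores helps expose.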
A System Trace capability, which utilizes the MIPI System Trace Protocol to provide software developers a hardware-accelerated, multicore “printf” ability, is a valuable tool for debugging multicore devices. When System Trace is used in a multicore system, a message from each core is identified and globally time stamped by the on-chip System Trace hardware. The result is a device level, time-correlated view of software execution across cores.
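The idea of a time-correlated view can be sketched in a few lines. The example below does not model the MIPI System Trace Protocol or any real trace format; it assumes a hypothetical record of (timestamp, core, message), with timestamps taken from one global time base, and merges per-core event lists into a single timeline.

```python
import heapq
from typing import Iterable, List, NamedTuple


class TraceEvent(NamedTuple):
    timestamp: int  # ticks of a shared, global time base (assumed)
    core: int
    message: str


def merge_traces(per_core: Iterable[List[TraceEvent]]) -> List[TraceEvent]:
    """Merge per-core event lists, each already in time order, into one timeline."""
    return list(heapq.merge(*per_core, key=lambda event: event.timestamp))


if __name__ == "__main__":
    # Hypothetical messages from two cores contending for the same lock.
    core0 = [TraceEvent(10, 0, "acquire lock"), TraceEvent(40, 0, "release lock")]
    core1 = [TraceEvent(25, 1, "wait for lock"), TraceEvent(45, 1, "acquire lock")]
    for event in merge_traces([core0, core1]):
        print(event.timestamp, f"core{event.core}", event.message)
```

In the merged timeline of this made-up data, core 1 begins waiting at tick 25 and acquires the lock only after core 0 releases it at tick 40, an ordering that neither per-core log shows on its own.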
Multicore systems also typically feature a multi-layered memory architecture. A multi-layered memory architecture presents unique challenges for debugging and optimization. To optimize the multi-layered memory architecture, the debug system needs to provide the necessary information for a developer to appropriately optimize placement of program and data.
Information such as the exact line of code where a cache miss occurred and forced the memory subsystem to access external memory can reduce the amount of time a developer spends optimizing memory operations. At the system level, a debug subsystem should be able to provide performance information for a shared memory or a shared interface such as the external memory interface.
The total performance of a multicore system can be gated by system interfaces, and receiving performance information from those interfaces can quickly reveal system bottlenecks, especially when this performance information is time correlated with software thread operations. In addition to measuring performance, understanding when a particular core is accessing a shared resource relative to another core can aid in debugging resource conflicts.
When hardware events such as interface performance are added to System Trace, developers gain the ability to see both software thread execution and hardware performance correlated in time, making it possible to quickly find and eliminate inefficiencies.
Deploying multicore software
The last challenge for developers is to anticipate the issues that arise once the product is deployed. It is always a good idea to ensure that the product can be monitored in the field so that issues with it can be understood. With multicore processors, it is important to be able to collect traces without disturbing normal product operation. In some cases it is very useful to download software with special instrumentation to understand the issues.
Figure 4: Example of remote monitoring in a deployed situation
Figure 4 above highlights how useful it is for developers to select a multicore processor that enables the software to collect and deliver traces over the network. On some silicon platforms, dedicated silicon blocks such as system trace monitors, together with the associated software development kit, allow customers to develop software that is ready to deploy.
Today, multicore processors are increasingly deployed across a wide range of applications. This is why it is important that software developers understand the challenges associated with designing, developing, debugging and deploying software on multicore devices.
Sanjay Bhal is responsible for the definition of multicore software solutions for the multicore processor business at Texas Instruments. He holds an MS in Electrical Engineering from the University of Mumbai and an MBA from the Smith School of Business, University of Maryland.
Stephen Lau is responsible for the definition of on-chip debug technology and associated emulator products deployed through TI’s Third-Party Emulation Developer Community. He is also responsible for marketing IEEE 1149.7 technology, and developed TI’s first commercial IP license for debug technology. Stephen holds a BS in Electrical Engineering from McMaster University in Hamilton, Ontario, Canada.