Mastering the challenges of multicore programming and debugging


This article discusses various aspects of multicore processing, including the different types of multicore processors and why these devices are becoming common today. We’ll then look at some of the challenges introduced by having more than one core on a chip, and at how modern multicore-aware debuggers can make these complex tasks more manageable.

Systems Performance

There are many ways to increase the performance of an embedded computing system, ranging from clever compiler algorithms to efficient hardware solutions. Compiler optimizations are important for getting the most efficient instruction scheduling from high-level language code that is easy to read and understand. In addition, systems can take advantage of the parallelism available in the application to process more than one thing at a time. And of course, scaling the clock frequency can be an effective way to get more performance from your computing system.

Unfortunately, the days when clock speeds could be assumed to increase geometrically have passed. And code optimization can only get you so much improvement, particularly now, after many generations of compiler technology development. This leaves us to look to parallelism as the best opportunity to continue scaling our system performance as time goes on.


Digging a well is a task that is hard to parallelize. Others can help, shoveling the dirt away, but the actual digging in the hole is typically a one-person job. As a result, adding more people in the hole will not get the job done any faster. In fact, the others may just get in the way and slow down the process. Some tasks are not suitable for parallelization.

Other tasks are easily parallelized. Digging a ditch is a task suitable for parallelization. Many people can work beside each other.

This picture shows a form of parallelism called MIMD, Multiple Instruction Multiple Data. Each digger is a separate unit and can do a different task. In this case you can imagine that four diggers may get the job done in about one quarter of the time a single digger would need.
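The well and the ditch can be put into rough numbers with Amdahl's law, which the discussion above does not name but which is the standard estimate for this kind of scaling. The sketch below assumes a fixed fraction of the job can be shared among the diggers. The function name and the 100% and 20% fractions are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when `parallel_fraction` of the work scales across `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # A ditch: assume all of the digging can be shared.
    print(f"ditch, 4 diggers: {amdahl_speedup(1.0, 4):.2f}x")
    # A well: assume only 20% of the job (hauling dirt away) can be shared.
    print(f"well, 4 diggers:  {amdahl_speedup(0.2, 4):.2f}x")
```

Note that this simple model never predicts a slowdown. It leaves out the cost of diggers getting in each other's way, which is exactly the kind of overhead that shows up later as lock contention.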

With SIMD, Single Instruction Multiple Data, a single digger might use a shovel like this one.

The SIMD unit can only do one type of computation at a time, but it can perform it on several pieces of data in parallel. These types of instructions are common in the vector processing units of many processors. This is useful if your data is very regular and you need to do the same operations over and over on a large data set, such as in image processing. However, for more general computing tasks this model lacks flexibility and often yields little performance gain.

This leads us to the choice to put multiple full CPU subsystems on a single chip, creating multicore processors. Multiple cores on one chip can scale performance. Each core is a full CPU and can work independently or in concert with other cores.

Different Types of Multicore Processing

There are different combinations of types of cores you might have on a processor chip as well as how the work is distributed among them.

Homogeneous multicore processors have two or more copies of the same processor core. Each core runs autonomously and may communicate and synchronize with other cores through a number of mechanisms like shared memory or mailbox systems. Each processor has its own registers and functional units, and may have its own local memory or cache. What makes the chip homogeneous is that all of its cores are of the same type.

Another type of multicore chip is the heterogeneous multicore, which has two or more different kinds of CPU cores. Here the cores may have very different characteristics which make them well suited for different parts of the system processing needs. One example might be a Bluetooth communications chip where one core is dedicated to managing the Bluetooth protocol stack while the other core might manage external communications, applications processing, the human interface, etc. This kind of multicore chip can be used for applications that need both dedicated real-time performance on one core and system management capabilities on the other.

Now we’ll look at how the cores are used. Symmetric multiprocessing (SMP) happens when you have more than one core, and the cores run the same project code base. Different cores may be running different parts of the code at the same time, but the code is built as a single project and is dispatched to the separate cores by some controlling program like a real-time operating system (RTOS). By necessity, the cores working this way must be of the same type since they all use the same project code compiled for one type of processor.

Asymmetric multiprocessing (AMP) happens when you have more than one core or processor, and each processor is running its own project application. The separate cores may synchronize or communicate from time to time, but they each have their own code base which they execute. Since they are each running their own project, these cores can be of different types, or heterogeneous cores. However, this is not a requirement. If two or more of the same type of cores run different project code, they are homogeneous cores, running AMP.

Notice that for SMP operation you must have multiple homogeneous cores, since they all run code from the same single project code base. AMP is more flexible: if you have multiple projects with different code bases for the different cores to run, the cores can be of different types, as in a heterogeneous system, but identical cores work just as well.

Reasons for Using Multicore

Over the last several years, Moore’s law, first stated in the mid-1960s, finally seems to be running out of steam, or at least slowing down. Clock rates, which for decades rose along with transistor density, no longer double every 2-3 years; in fact, the fastest CPUs have been stuck at a ceiling in the low single-digit GHz range for many years now.

One way to continue pushing the performance envelope is to have more CPU cores working together if you can use them efficiently.

While speeds have plateaued, transistor size has continued to shrink. Although the pace is slower than in the past, smaller transistors allow more logic to be packed on a single chip. Using these transistors to put several CPU cores on one chip also makes it possible to take advantage of much faster and wider bus interconnects between the CPU and memory subsystems.

Heterogeneous asymmetric multiprocessing is very useful where an application has two or more workloads with very different characteristics and requirements. One might be real-time and dependent on interrupt latency, while the other might depend more on throughput than on response time. For example, a device might dedicate one core to managing a communications protocol stack like Bluetooth or Zigbee, while another core acts as an application processor handling human interaction and overall system management. The communications processor, being isolated, can provide the excellent real-time response needed by the protocol stack. In addition, the communication software can be certified to a standard on its own, which makes the entire product easier to certify because functional modifications are kept separate from this part of the system.

Challenges Using Multicore

What kinds of challenges are introduced when you put more than one CPU core on a chip? Well, let’s dig into it.

A monolithic application may not be able to use the available computing resources efficiently. You need to organize the application into parallel tasks which can run at the same time in order to use the resources of more than one core. This might require software engineers to think about embedded design in an unfamiliar way. Migrating existing single-loop code might not be easy, and either too few or too many threads can become a performance barrier.

Applications that share data structures or I/O devices among multiple threads or processes can have serial bottlenecks. To maintain data integrity, access to these shared resources might have to be serialized using locking techniques, for example, read lock, read-write lock, write lock, spinlock, mutex, and so on. Inefficiently designed locks can create bottlenecks due to high lock contention between multiple threads or processes trying to acquire the lock for a shared resource, which can degrade the performance of the application. Performance can even get worse as the number of cores increases: if some cores stall while waiting on locks held by others, two cores may perform worse than one.
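As a minimal sketch of serialized access to a shared resource, the example below guards a shared counter with a mutex and keeps contention low by taking the lock once per thread rather than once per increment. The class and function names are hypothetical. Note also that threads in standard CPython do not execute Python code on several cores at once, so this shows the locking pattern rather than a real multicore speedup.

```python
import threading


class SharedCounter:
    """A counter shared by several threads, guarded by a mutex."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def add(self, amount: int) -> None:
        # The read-modify-write is the critical section; keep it short so
        # other threads spend as little time as possible waiting.
        with self._lock:
            self.value += amount


def run_workers(thread_count: int, increments_per_thread: int) -> int:
    counter = SharedCounter()

    def worker() -> None:
        # Accumulate privately, then take the lock once, instead of
        # contending for it on every single increment.
        local_total = 0
        for _ in range(increments_per_thread):
            local_total += 1
        counter.add(local_total)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(thread_count=4, increments_per_thread=100_000))
```

Calling `counter.add(1)` inside the loop would give the same total, but every increment would then compete for the same lock, which is the contention pattern described above.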

An unevenly distributed workload can be inefficient in utilizing computing resources. You might have to break large tasks into smaller ones that can be run in parallel. You might have to change serial algorithms into parallel ones for improving performance and scalability. However, if some tasks run very quickly, and others take a significant amount of time, the quick tasks may spend a significant amount of time waiting for the long tasks to complete. This results in valuable compute resources idling and poor performance scaling.
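The effect of an uneven workload can be estimated with a small scheduling model. The sketch below assigns tasks, longest first, to whichever core is currently least loaded and reports when the last core finishes and how busy the cores were overall. The task durations, the function name, and the greedy policy are assumptions chosen for illustration, not a description of any particular RTOS scheduler.

```python
import heapq


def schedule(task_durations, core_count):
    """Assign each task to the currently least-loaded core (longest tasks first).

    Returns (makespan, utilization): makespan is the finish time of the
    busiest core, utilization is total work divided by total core time.
    """
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    loads = [0.0] * core_count  # min-heap of per-core busy time
    heapq.heapify(loads)
    for duration in sorted(task_durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    makespan = max(loads)
    total_work = sum(task_durations)
    utilization = total_work / (makespan * core_count) if makespan else 1.0
    return makespan, utilization


if __name__ == "__main__":
    uneven = [8.0, 1.0, 1.0, 1.0, 1.0]                # one long task dominates
    split = [2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]  # long task broken in four
    for name, tasks in (("uneven", uneven), ("split", split)):
        makespan, utilization = schedule(tasks, core_count=4)
        print(f"{name}: finishes at {makespan:.1f}, utilization {utilization:.0%}")
```

In this model the same 12 units of work finish at time 8 when one task is long, and at time 3 once that task is split into four pieces, assuming the split itself adds no overhead.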

An RTOS will likely help you but might not solve everything. In an SMP system an RTOS is virtually a must for scheduling tasks over a number of similar cores. The work to be done can be divided by data or by function. If you divide things up by data chunks, each thread might do all of the steps in a processing pipeline on its own chunk. Alternatively, you might have one thread do one step of the processing while another does the next step, and so on. The advantages of one technique over the other will depend on the characteristics of the work to be done.
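The two ways of dividing the work can be sketched side by side. In the example below, `scale` and `offset` are hypothetical processing steps. `by_data` gives each worker a chunk of the samples and lets it run both steps, while `by_function` dedicates one thread to each step and links them with a queue. Python threads stand in for RTOS tasks here, so this is an illustration of the structure only.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


# Two hypothetical processing steps applied to every sample.
def scale(sample: int) -> int:
    return sample * 2


def offset(sample: int) -> int:
    return sample + 1


def by_data(samples, worker_count=2):
    """Data decomposition: each worker runs every step on its own chunk."""
    chunks = [samples[i::worker_count] for i in range(worker_count)]

    def process(chunk):
        return [offset(scale(s)) for s in chunk]

    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        results = list(pool.map(process, chunks))
    # Re-interleave the chunks so the output order matches the input order.
    merged = [None] * len(samples)
    for i, chunk_result in enumerate(results):
        merged[i::worker_count] = chunk_result
    return merged


_DONE = object()  # sentinel telling the next stage there is no more input


def by_function(samples):
    """Functional decomposition: one thread per step, linked by a queue."""
    link = queue.Queue()
    results = []

    def scale_stage():
        for s in samples:
            link.put(scale(s))
        link.put(_DONE)

    def offset_stage():
        while True:
            item = link.get()
            if item is _DONE:
                break
            results.append(offset(item))

    stages = [threading.Thread(target=scale_stage),
              threading.Thread(target=offset_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results


if __name__ == "__main__":
    data = list(range(6))
    print(by_data(data))
    print(by_function(data))
```

Both functions produce the same results. The data split needs no communication between workers while they run, whereas the pipeline passes every sample through a queue but lets each thread specialize in one step.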

Debugging in Multicore Environments

The first thing that is useful when debugging a multicore system is visibility of all cores. Ideally, we should be able to start and stop cores simultaneously or individually, for example single-stepping one core while the others are running or stopped. Multicore breakpoints can be very useful for controlling the operation of one core based on the state of another.

Multicore trace can be very difficult to implement. Managing the high bandwidth of trace information from several cores, as well as dealing with potentially different types of trace data from different kinds of cores, is a real challenge.

(Source: IAR Systems, diagram courtesy of Arm Ltd.)

Here is an example of a processor with both heterogeneous and homogeneous multicore implementations. There are two homogeneous core groups, one based on a dual Arm Cortex-A57 and the other on a quad Cortex-A53. These groups are homogeneous within themselves but heterogeneous with respect to each other.

The CoreSight debug architecture provides protocols and mechanisms for communicating with the debug resources on all of the cores, and it falls to the debugger to manage all of this information and parse messages from the different cores. The cross trigger interfaces and cross trigger matrix (CTI, CTM) allow simultaneous halting of cores, triggering of trace and more. The trace infrastructure includes the serial (SWO) and parallel (TPIU) trace ports, buffering used for smoothing the trace flow, and the trace funnels, which combine the trace from each source into a single flow. Compared to a dual-core part, the diagram shown represents a much more complex chip to control.

The C-SPY Debugger in IAR Embedded Workbench provides support for both symmetrical and asymmetrical multicore debugging. This is enabled through the debugger options on the multicore tab. To enable symmetric multicore debug, all that is required is that the number of cores be entered to let the debugger know how many different processors to communicate with. Other IDEs might have similar options available.

In the debugger view shown, a 4-core Cortex-A9 SMP cluster has its cores’ status displayed, with core number 2 stopped while the other three cores are executing.

An asymmetric multicore system might use a heterogeneous multicore part like the ST STM32H745/755, which has one Cortex-M7 core and a separate Cortex-M4 core. In this case, the debugger runs as two instances of the IDE (Master and Node), one for each core, since the two cores are running different project code.

In each instance of the IDE, there is status information about the core being controlled as well as the other core controlled in the other window. There are options that can be selected to control the behavior of the debugger so that starting and stopping the cores together or separately is under the control of the developer.

This full control is possible thanks to the cross trigger interfaces (CTI) and the cross trigger matrix (CTM), which together form the Arm embedded cross trigger feature. There are three CTI components: one at system level, one dedicated to the Cortex-M7 and one dedicated to the Cortex-M4. The three CTIs are connected to each other via the CTM, as illustrated in the figure below. The system-level and Cortex-M4 CTIs are accessible to the debugger via the system access port and the associated APB-D. The Cortex-M7 CTI is physically integrated in the Cortex-M7 core and is accessible via the Cortex-M7 access port.

(Source: IAR Systems, diagram courtesy of STMicroelectronics from the RM0399 reference manual)

The CTIs allow events from various sources to trigger debug and trace activity. For example, a breakpoint reached in one of the processor cores can stop the other processor, or a transition detected on an external trigger input could be set to start code trace.

In this example, a heterogeneous multicore processor has a Cortex-M7 core and a Cortex-M4 core on a single chip, and two separate programs are used: one runs on the Cortex-M4 and the other on the Cortex-M7. Each project uses FreeRTOS to manage the software running on its processor. The two cores communicate through a shared memory interface, but both applications use the FreeRTOS message passing mechanisms to communicate with the other processor, which hides the complexity of the underlying mechanisms. So, from one CPU’s perspective, it is just sending messages to or receiving messages from another task. The fact that the other task happens to be running on another CPU core is transparent.
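To give a feel for what a message passing layer can hide, here is a minimal sketch of a mailbox built on a block of shared memory. It is not the FreeRTOS API and not the code used in this example project. The class name, the memory layout, and the 32-bit message size are all assumptions, and a `bytearray` stands in for the RAM region both cores can see. A real implementation would also need memory barriers, cache maintenance, and usually an inter-core interrupt to wake the receiver.

```python
import struct


class SharedMemoryMailbox:
    """Single-producer, single-consumer ring of fixed-size messages.

    Assumed layout: [head:u32][tail:u32][slot 0][slot 1]...
    The sender only ever writes `head`; the receiver only ever writes `tail`.
    """

    HEAD_OFFSET = 0   # index of the next slot to write
    TAIL_OFFSET = 4   # index of the next slot to read
    DATA_OFFSET = 8
    WORD = struct.Struct("<I")  # each message is one 32-bit word here

    def __init__(self, slot_count: int) -> None:
        if slot_count < 2:
            raise ValueError("slot_count must be at least 2")
        self.slot_count = slot_count
        self.memory = bytearray(self.DATA_OFFSET + slot_count * self.WORD.size)

    def _read(self, offset: int) -> int:
        return self.WORD.unpack_from(self.memory, offset)[0]

    def _slot(self, index: int) -> int:
        return self.DATA_OFFSET + index * self.WORD.size

    def send(self, word: int) -> bool:
        """Called on the sending core; returns False if the ring is full."""
        head, tail = self._read(self.HEAD_OFFSET), self._read(self.TAIL_OFFSET)
        next_head = (head + 1) % self.slot_count
        if next_head == tail:
            return False  # full: one slot is always kept empty
        self.WORD.pack_into(self.memory, self._slot(head), word)
        # Publish the new head only after the payload has been written.
        self.WORD.pack_into(self.memory, self.HEAD_OFFSET, next_head)
        return True

    def receive(self):
        """Called on the receiving core; returns None if nothing is waiting."""
        head, tail = self._read(self.HEAD_OFFSET), self._read(self.TAIL_OFFSET)
        if head == tail:
            return None
        word = self._read(self._slot(tail))
        self.WORD.pack_into(self.memory, self.TAIL_OFFSET,
                            (tail + 1) % self.slot_count)
        return word


if __name__ == "__main__":
    mailbox = SharedMemoryMailbox(slot_count=4)
    for value in (10, 20, 30):
        mailbox.send(value)
    print([mailbox.receive() for _ in range(3)])
```

From the application's point of view only `send` and `receive` are visible, which mirrors how each task in the example simply exchanges messages without knowing that the peer runs on the other core.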

The image below shows the Workspace explorer window in the IDE. The overview of the two projects is displayed here so you can see the contents of both the Cortex-M7 and Cortex-M4 projects.

By selecting one of the other tabs at the bottom of the window you can switch focus to either the M4 project or the M7 project.

The Cortex-M7 project has a task which sends messages to tasks running on the Cortex-M4. The Cortex-M4 has two instances of a receive task running. The Cortex-M7 has a “check” task which runs periodically to see if things are still running correctly.

Finally, the debugger loads both projects. This means that an additional instance of Embedded Workbench for the second debugger is started.

To set up the debugger for asymmetric multiprocessing support, we need to designate one project as the “Master” and the other as the “Node” project. In fact, the selection is arbitrary and only determines which project has the capability to launch the other on startup.

The “Node” project has no special settings and is unaware that it is running as a “Node” to another project.

In this way, when the “Master” project has its debugger started, it automatically launches another instance of the IDE to accommodate a second debugger session in which the second project will run.


Multicore enables performance gains when Moore’s law runs out. However, multicore presents debugging challenges and requires specific development approaches so the application can take maximum advantage of the multicore architecture.

Once the debug setup is configured, multicore debugging is straightforward. If you have used debugging tools for single-core devices before, you will recognize everything here, and you may well wonder why other people talk about how difficult multicore debugging is for them.

Modern hardware and software tools will help you overcome multicore debugging challenges.

Note: Figure images are by IAR Systems unless otherwise noted. 

Aaron Bauch is a Senior Field Application Engineer at IAR Systems working with customers in the Eastern United States and Canada. Aaron has worked with embedded systems and software for companies including Intel, Analog Devices and Digital Equipment Corporation. His designs cover a broad range of applications including medical instrumentation, navigation and banking systems. Aaron has also taught a number of college-level courses, including Embedded System Design, as a professor at Southern NH University. Mr. Bauch holds a Bachelor’s degree in Electrical Engineering from The Cooper Union and a Master’s in Electrical Engineering from Columbia University, both in New York, NY.
