Programming heterogeneous multicore embedded SoCs -

Programming heterogeneous multicore embedded SoCs

Over the last decade, the market demand for increased processing performance with reduced power and area footprint has remained strong and embedded SoCs have stepped up to the challenge. This performance, power and area (PPA) improvement has been achieved by adding cores – both general purpose cores and specialized cores such as DSPs and GPUs among other things.  This trend has resulted in networks of heterogeneous multicore embedded SoCs. An example of this trend is shown in Figure 1 .

Figure 1 : Single core to networks of heterogeneous multicore systems

Traditional approaches to programming such complex SoCs focus on manually partitioning the application across the various cores and hand optimizing the appropriate sections of the application for a given core. This approach tends to yield the maximum entitlement but has the following drawbacks:

  • The partitioning is static and has to be redone for each system configuration.
  • Increased time to market because programmers need to develop their own dispatch, communication and synchronization mechanisms
  • The resulting application is not portable
  • Requires detailed knowledge of the SoC and network architecture
  • Make modeling “what-if” scenarios difficult because significant rework is required to move a section of the application that has been mapped and optimized for one type core to another core

An important observation is that as embedded SoCs increase in complexity, they are starting to look a lot like their desktop counterparts from a software architecture standpoint. A typical high-performance heterogeneous embedded SoC consists of the building blocks shown in Figure 2 .

Figure 2 : TI 66AK2H – Sample end equipment

This leads to the following question:  Can we re-use traditional multicore programming paradigms such as OpenCL, OpenMP and MPI in the embedded space?  The remainder of the article focuses on answering this question by using the TI 66AK2H SoC as a case study.

The main compute cores are an ARM MPCore cluster with 4 Cortex-A15s (host) and a DSP cluster with 8 C66x DSP cores (accelerator). The host and accelerator share on chip and off chip memory. The 66AK2H is used in end equipment ranging from single SoC systems to multiple networked SoCs. Figure 3 is an overview of standard multicore programming models layered on the 66AK2H. Programming models above ARM MPCore are used to program the ARM cluster. Models that span both ARM & DSP are used to dispatch from ARM to DSP.

  • OpenMP and/or OpenCL can be used to:
    • Dispatch code/data from the host to the accelerator
    • Parallelize applications across the ARM MPCore cluster
    • Parallelize regions of the application dispatched to the DSP cluster
  • Across multiple SoCs, MPI is used to partition the application and manage program execution, data transfer and synchronization

Figure 3: Tooling for 66AK2H

Dispatching to the DSP cluster
Both OpenCL and OpenMP 4.0 (Accelerator model) can be used to offload compute intensive portions of an embedded application to the accelerator. 

void add(int *in1, int *in2, int *out1, int count)
#pragma omp target   map (to: in1[0:count-1], in2[0:count-1], count,
from: out1[0:count-1])
#pragma omp parallel
int i;
#pragma omp for
for (i = 0; i < count; i++)
out1[i] = in1[i] + in2[i];

The above code snippet is an example of dispatching an OpenMP parallel region using the target construct. The map clauses on the construct indicate data movement required before and after executing the dispatched region.

std::vector< Device > devices = context.getInfo< CL_CONTEXT_DEVICES >();
Buffer bufA   (context, CL_MEM_READ_ONLY,  bufsize);
Buffer bufB   (context, CL_MEM_READ_ONLY,  bufsize);
Buffer bufDst (context, CL_MEM_WRITE_ONLY, bufsize);
Program::Sources    source(1, std::make_pair(kernelStr,strlen(kernelStr)));
Program             program = Program(context, source);;
Kernel kernel(program, "add");
kernel.setArg(0, bufA);
kernel.setArg(1, bufB);
kernel.setArg(2, bufDst);
Event ev1,ev2,ev3,ev4;
CommandQueue Q(context, devices[0]);
Q.enqueueWriteBuffer(bufA, CL_FALSE, 0, bufsize, srcA, NULL, &ev1);
Q.enqueueWriteBuffer(bufB, CL_FALSE, 0, bufsize, srcB, NULL, &ev2);
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(NumVecElements),
NDRange(WorkGroupSize), NULL, &ev3);
Q.enqueueReadBuffer (bufDst, CL_TRUE, 0, bufsize, dst, NULL, &ev4);

The above code is a similar example with OpenCL using the C++ bindings for OpenCL APIs

Using a standard programming model or API allows the programmer to quickly handle the mechanics of dispatching code and data from the host to the accelerator and focus on optimizing code for the system. The programmer can also leverage tooling (debuggers, profilers) & training ecosystems built around these standard programming models.

Programming the accelerator
Compute intensive code is typically a mix of data parallel & taskparallel regions. The 66AK2H accelerator cluster consists of eight DSPcores that share on chip and off chip memory.

Data parallel regions
Such regions can be represented as OpenCL kernels using an ND Range kernel:

kernel void VectorAdd(global const short4* a,
global const short4* b,
global short4*       c)
int id = get_global_id(0);
c[id] = a[id] + b[id];

They can also be represented with an OpenMP parallel for loop:

void VectorAdd(const short4 *a, const short4 *b, short4 *c, int size)
int i;
#pragma omp parallel for
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];

Task parallel regions
Both OpenCL and OpenMP support task parallel programming models. Thismode makes it possible to leverage libraries optimized for the DSPs.

OpenMP/OpenCL – Making the choice
OpenCL and OpenMP have their own execution and memory models for code executing on the accelerator and its associated data.

OpenMP usesa fork-join method of execution with multiple threads co-operating onimplicit or explicit tasks specified by OpenMP directives. Thesethreads have access to variables in shared memory and each thread isallowed to have a temporary view of these variables. The specificationdefines a set of synchronization points where the temporary view ofvariables is synchronized with shared memory.

OpenCL kernels execute on the accelerator. When a host submits a kernel forexecution, it is associated with a 1, 2 or 3-dimensional index space(NDRange) and the kernel executes for each point in this index space.An instance of a kernel is called a work-item. Work-items are collectedinto work-groups. There are four distinct memory regions: Global(accessible to all work-items in all work-groups), constant, local(accessible to a single work group) and private (accessible to a singlework-item).

An approach to picking one of OpenCL or OpenMP is todetermine the execution/memory model that is a better fit for theregion of application code dispatched to the accelerator. The choicealso depends on other factors:

  • Nature of existing code base
    • Is it already written to use OpenCL for dispatch?
    • Does it already use OpenMP to go parallel across threads on the host?
  • Control over data movement required - OpenCL offers more precise control over data movement between the host & accelerator
  • Programmer expertise & preference – using OpenCL APIs for dispatch vs. OpenMP target directives

With TI tooling, it’s possible to mix and match OpenMP 4.0Accelerator Model constructs and OpenCL in the same application. Thisflexibility provides significant advantages. Hence, the programmer:

  •  Does not have to make an early upfront decision on the model
  • Can mix and match the two models based on characteristics of regions in the application that are dispatched to the accelerator

Network of embedded SoCs
The 66AK2H SoC supports multiple off chip transports with varyingbandwidth and latencies – Gigabit Ethernet, Serial Rapid IO andHyperlink (a TI-based high performance chip to chip protocol). What ifyour system has a network of SoCs using one of more transports? Again,there is an existing standard to the rescue – Message Passing Interface(MPI). MPI is a standardized and portable set of APIs used fordeploying applications across distributed memory systems.

UsingMPI APIs such as MPI_Send and MPI_Recv abstract the various inter SoCtransport mechanisms and remove the need to create custom transportAPIs. Figure 8 illustrates a simple MPI program which initializes anarray of elements and sends it to another node. Programmers can alsotake advantage of the tooling ecosystem such as profilers and MPI awaredebuggers. A simple MPI program:

#include "mpi.h"
#include < stdio.h >
#define BUFFER_SIZE  (2048)
#define MASTER       (0)
#define WORKER       (1)
#define TAG          (1)
int main(int argc, char *argv[])
int rank, i;
int buffer[BUFFER_SIZE];
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == MASTER) {
for (i=0; i
} else { /* rank 1 */
for (i=0; i
TAG, MPI_COMM_WORLD, &status);
for (i=0; i
if (buffer[i] != i)
printf("Error: element in buffer[%d] = %dn", i, buffer[i]);
return 0;

Inconclusion, it is possible to program heterogeneous embedded SoCsusing multicore programming paradigms such as OpenMP, OpenCL and MPI.Over time, support for these models on high performance embedded SoCswill improve and broaden, making them the first choice of embeddedsystems programmers on such systems.

Ajay Jayaraj is a senior member of the Compiler team in the Texas Instruments ( Development Organization. He architected the TI OpenMP DSPruntime and is currently responsible for developing OpenMP tooling forTI’s KeyStone SoCs. Ajay was a key member of the architecture team forseveral DSP accelerators and has led compiler teams that co-developedtooling with chip design teams. He also has extensive experience withTI’s ARM and DSP compilers.Prior to TI, Ajay was aResearch Engineer at Georgia Tech working on the DARPA PolymorphicComputing program. He also worked at a startup that developed compilertechnology for FPGAs.


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.