Programming heterogeneous multicore embedded SoCs

Ajay Jayaraj, Texas Instruments

April 25, 2014


Over the last decade, market demand for more processing performance in a smaller power and area footprint has remained strong, and embedded SoCs have stepped up to the challenge. This performance, power and area (PPA) improvement has been achieved by adding cores: both general-purpose cores and specialized cores such as DSPs and GPUs. The trend has resulted in networks of heterogeneous multicore embedded SoCs, as shown in Figure 1.

Figure 1: Single core to networks of heterogeneous multicore systems

Traditional approaches to programming such complex SoCs focus on manually partitioning the application across the various cores and hand-optimizing the appropriate sections of the application for a given core. This approach tends to come closest to the peak performance the hardware can deliver (its entitlement), but it has the following drawbacks:

  • The partitioning is static and has to be redone for each system configuration.
  • Time to market increases because programmers need to develop their own dispatch, communication and synchronization mechanisms.
  • The resulting application is not portable.
  • It requires detailed knowledge of the SoC and network architecture.
  • It makes modeling “what-if” scenarios difficult, because significant rework is required to move a section of the application that has been mapped and optimized for one type of core to another type of core.

An important observation is that as embedded SoCs increase in complexity, they are starting to look a lot like their desktop counterparts from a software architecture standpoint. A typical high-performance heterogeneous embedded SoC consists of the building blocks shown in Figure 2.

Figure 2: TI 66AK2H - Sample end equipment

This leads to the following question:  Can we re-use traditional multicore programming paradigms such as OpenCL, OpenMP and MPI in the embedded space?  The remainder of the article focuses on answering this question by using the TI 66AK2H SoC as a case study.

The main compute cores are an ARM MPCore cluster with 4 Cortex-A15s (the host) and a DSP cluster with 8 C66x DSP cores (the accelerator). The host and accelerator share on-chip and off-chip memory. The 66AK2H is used in end equipment ranging from single-SoC systems to multiple networked SoCs. Figure 3 is an overview of standard multicore programming models layered on the 66AK2H. Programming models shown above the ARM MPCore are used to program the ARM cluster. Models that span both ARM and DSP are used to dispatch work from the ARM cluster to the DSP cluster.

  • OpenMP and/or OpenCL can be used to:
    • Dispatch code/data from the host to the accelerator
    • Parallelize applications across the ARM MPCore cluster
    • Parallelize regions of the application dispatched to the DSP cluster
  • Across multiple SoCs, MPI is used to partition the application and manage program execution, data transfer and synchronization
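As an illustration of the partitioning step across SoCs, the following is a minimal sketch of a block decomposition, assuming one MPI rank per SoC. The helper name block_range is hypothetical and not part of MPI or of the article. In an MPI program the rank and size values would come from MPI_Comm_rank and MPI_Comm_size; here they are plain parameters so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not part of MPI): returns the half-open range
// [first, second) of items owned by `rank` when `n` items are split as
// evenly as possible over `size` ranks.
std::pair<std::size_t, std::size_t> block_range(std::size_t rank,
                                                std::size_t size,
                                                std::size_t n)
{
    assert(size > 0 && rank < size);
    const std::size_t base  = n / size;  // items that every rank receives
    const std::size_t extra = n % size;  // the first `extra` ranks get one more
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t len   = base + (rank < extra ? 1 : 0);
    return {begin, begin + len};
}
```

Each rank would then process only its own range and exchange results with the other ranks through MPI calls; the data transfer and synchronization are deliberately left out of this sketch.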

Figure 3: Tooling for 66AK2H

Dispatching to the DSP cluster
Both OpenCL and OpenMP 4.0 (the accelerator model) can be used to offload compute-intensive portions of an embedded application to the accelerator.

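void add(int *in1, int *in2, int *out1, int count)
{
    /* Array sections are written [lower-bound : length], so count elements
       starting at index 0 are mapped as [0:count]. */
    #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                       map(from: out1[0:count])
    {
        #pragma omp parallel
        {
            int i;
            #pragma omp for
            for (i = 0; i < count; i++)
                out1[i] = in1[i] + in2[i];
        }
    }
}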

The above code snippet is an example of dispatching an OpenMP parallel region using the target construct. The map clauses on the construct indicate data movement required before and after executing the dispatched region.
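To make the data movement concrete, here is a minimal, self-contained sketch of the same pattern in C++. The function name offload_add is hypothetical. The to sections are copied to the device before the region runs, and the from section is copied back afterwards. If the code is built without OpenMP support, the pragmas are ignored and the loop simply runs on the host, which makes the sketch easy to try on any machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the article): element-wise sum of two
// vectors, offloaded with the OpenMP 4.0 target construct.
std::vector<int> offload_add(const std::vector<int>& a,
                             const std::vector<int>& b)
{
    assert(a.size() == b.size());
    const int count = static_cast<int>(a.size());
    std::vector<int> out(a.size(), 0);

    // Map clauses work on raw pointers, so take the underlying storage.
    const int* in1 = a.data();
    const int* in2 = b.data();
    int* out1 = out.data();

    // Array sections are [lower-bound : length].
    // to:   copied host -> device before the region executes.
    // from: copied device -> host after the region completes.
    #pragma omp target map(to: in1[0:count], in2[0:count]) \
                       map(from: out1[0:count])
    #pragma omp parallel for
    for (int i = 0; i < count; i++)
        out1[i] = in1[i] + in2[i];

    return out;
}
```

Calling offload_add with {1, 2, 3} and {10, 20, 30} yields {11, 22, 33} whether the loop runs on the accelerator or falls back to the host.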

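// Assumes `using namespace cl;`, an existing cl::Context named `context`, and
// that kernelStr, bufsize, srcA, srcB, dst, NumVecElements and WorkGroupSize
// are defined elsewhere.
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

// Device buffers for the two inputs and the result
Buffer bufA   (context, CL_MEM_READ_ONLY,  bufsize);
Buffer bufB   (context, CL_MEM_READ_ONLY,  bufsize);
Buffer bufDst (context, CL_MEM_WRITE_ONLY, bufsize);

// Create the program from source and build it before creating the kernel
Program::Sources source(1, std::make_pair(kernelStr, strlen(kernelStr)));
Program          program = Program(context, source);
program.build(devices);

Kernel kernel(program, "add");
kernel.setArg(0, bufA);
kernel.setArg(1, bufB);
kernel.setArg(2, bufDst);

Event ev1, ev2, ev3, ev4;
CommandQueue Q(context, devices[0]);

// Non-blocking writes, kernel launch, then a blocking read of the result
Q.enqueueWriteBuffer(bufA, CL_FALSE, 0, bufsize, srcA, NULL, &ev1);
Q.enqueueWriteBuffer(bufB, CL_FALSE, 0, bufsize, srcB, NULL, &ev2);
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(NumVecElements),
                       NDRange(WorkGroupSize), NULL, &ev3);
Q.enqueueReadBuffer(bufDst, CL_TRUE, 0, bufsize, dst, NULL, &ev4);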

The above code is the equivalent example in OpenCL, written with the C++ bindings for the OpenCL APIs.
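The host code refers to a kernel source string and to global and work-group sizes that the article does not show. The sketch below is an assumed stand-in: kernelStr holds a plausible "add" kernel with the same three buffer arguments (the int element type is an assumption), and work_group_count is a hypothetical helper showing how the two NDRange values relate. In OpenCL 1.x the global size must be an exact multiple of the work-group size, which the helper checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>  // the host code calls strlen(kernelStr)

// Assumed kernel source (not shown in the article). Each work-item adds one
// element; get_global_id(0) selects which element this work-item handles.
const char* const kernelStr =
    "__kernel void add(__global const int* a,\n"
    "                  __global const int* b,\n"
    "                  __global int* dst)\n"
    "{\n"
    "    int id = get_global_id(0);\n"
    "    dst[id] = a[id] + b[id];\n"
    "}\n";

// Hypothetical helper: number of work-groups launched for a 1-D NDRange.
// OpenCL 1.x requires the global size to be a multiple of the local size.
std::size_t work_group_count(std::size_t global_size, std::size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, a global size of 1024 with a work-group size of 64 launches 16 work-groups, which the runtime is then free to distribute across the DSP cores.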

Using a standard programming model or API lets the programmer quickly handle the mechanics of dispatching code and data from the host to the accelerator and then focus on optimizing code for the system. The programmer can also leverage the tooling (debuggers, profilers) and training ecosystems built around these standard programming models.
