Mobile SoCs have been multicore for some time now, both in the homogeneous sense of an array of identical (or at least similar) CPU cores and also in the heterogeneous sense of DSPs, GPUs, and other programmable and configurable processing cores on the die. With this variety of parallel processing opportunities available, what kinds of applications and use cases drive the increasing adoption of heterogeneous multicore implementations and what are the benefits available to users?
There are two broadly dissimilar classes of applications that can be crudely labeled high performance computing (HPC) and consumer. HPC apps may feature long simulations of very large data sets with extreme precision and accuracy requirements, while consumer apps have much less stringent accuracy requirements but must operate in real time or near real time, often on relatively large data sets.
The mobile context is dominated by video-rate apps requiring manipulation or analysis of visual data at a low level, with a relatively small amount of higher-level code. These apps are inherently heterogeneous: they contain layers of functions that can be divided between the CPU array and the GPU (which is classed as a single core but in fact consists of a large array in itself). By being distributed across the available resources they can achieve best efficiency, meaning higher frame rate, lower power, or more responsiveness – or all three.
One of the consequences of the emergence of this class of applications is that the purpose and nature of the camera pipeline (ISP) is changing from being primarily aimed at image production to being redefined as a vision processor, usually as part of a heterogeneous trio in cooperation with the CPU and GPU. Application examples include video conferencing with face beautification, where the majority of the workload can either be handled by the GPU or shared between the GPU and the ISP. Video encoding can be a CPU task or can be offloaded onto dedicated encoder hardware that we call a Video Processing Unit (VPU). In this scenario, the objectives are to maintain consistent frame rates while simultaneously keeping to a power budget appropriate for extended use on a mobile device.
A retail analytics application from Vadaro, while using broadly similar low-level tasks, illustrates another requirement: running multiple kernels simultaneously on the GPU in order to detect multiple customers. In another app, Find Exact/Find Similar, the vision-specific tasks are delegated to the GPU, leaving the CPU free for database searching and results manipulation.
These three outcomes – higher frame rate, lower power, and free CPU cycles – are the primary benefits sought by mobile developers and are available through heterogeneous multicore. But how can these benefits be quantified, and can every app achieve them?
Three application examples give us some data points. A basic image filtering application, run on a dual-CPU system with a PowerVR SGX540 GPU (Figure 1), shows that moving the vast majority of the work to the GPU yields a 95% performance gain with a 25% power reduction.
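The work being offloaded in a case like this is typically a per-pixel kernel with no dependencies between output pixels, which is why it partitions so cleanly onto a GPU. A minimal sketch in Python makes the pattern concrete; the 3x3 box filter and all names here are illustrative, not the actual demo code from these trials:

```python
# Illustrative sketch: a 3x3 box filter expressed as an independent
# per-pixel kernel -- the pattern that maps one GPU work-item per pixel.
# (Hypothetical example, not the demo code measured in the article.)

def box_filter_pixel(img, x, y):
    """Average the 3x3 neighbourhood of (x, y), clamping at the image edges."""
    h, w = len(img), len(img[0])
    total = count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                total += img[ny][nx]
                count += 1
    return total // count

def box_filter(img):
    # On a GPU, each (x, y) below would be one work-item; on the CPU we
    # simply loop. No output pixel depends on another output pixel, so
    # the work divides freely between processors.
    h, w = len(img), len(img[0])
    return [[box_filter_pixel(img, x, y) for x in range(w)] for y in range(h)]
```

Because every output pixel is computed independently from read-only input, the same kernel body can be run by one CPU thread per image tile or one GPU work-item per pixel without any synchronization between them.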
Edge detection, a more computationally taxing workload running on a quad-core system with a PowerVR G6200 GPU, shows the GPU maintaining frame rate at half the power consumption of the CPUs and boosting frame rate by 3x within the power budget envelope set by the device power management software. A final example, a software implementation of a VP9 video decoder, once again pairing a quad-core CPU with a PowerVR G6200 GPU, shows the heterogeneous solution maintaining the frame rate of heavily optimized CPU code at significantly, although not dramatically, lower power. The major benefit in this final case is that when the decoder is run from within a browser-based app, user interface responsiveness is significantly improved due to the greater availability, at finer granularity, of CPU cycles.
Thus the efficiency improvement available by moving to heterogeneous compute is affected by the type of app, the relative floating point performance of the GPU versus the CPU, and one other significant factor: the overhead associated with partitioning the workload between the compute units. In the image filtering app, removing an image copy (originally imposed by API requirements and not needed in the CPU-only version) resulted in approximately 25% improvement in performance together with a reduction in power consumption.
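The effect of such a redundant copy is easy to see with a back-of-envelope frame-time model. The millisecond figures below are hypothetical round numbers, chosen only so that removing the copy produces roughly the 25% gain reported above:

```python
# Illustrative frame-time model for the copy overhead described above.
# All times are in milliseconds and are hypothetical, not measurements.

def frame_time_ms(compute_ms, copy_ms):
    """Total time per frame: mandatory data copies plus GPU compute."""
    return compute_ms + copy_ms

with_copy = frame_time_ms(compute_ms=12.0, copy_ms=3.0)     # API-imposed image copy
without_copy = frame_time_ms(compute_ms=12.0, copy_ms=0.0)  # zero-copy buffer sharing

speedup = with_copy / without_copy  # 1.25x: a copy that is 20% of the
                                    # frame becomes a 25% throughput gain
```

The model also shows why data movement can dominate on systems without a zero-copy mechanism: as `compute_ms` shrinks (faster GPU, lighter kernel), the fixed `copy_ms` becomes a proportionally larger share of the frame.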
The VP9 decoder case illustrates a different overhead, that of setting up, synchronizing, and dispatching the GPU workload. Only a portion of the workload is delegated to the GPU, dividing a naturally homogeneous task into two and making the overhead dominate performance. Occupancy analysis shows that the GPU is capable of taking the whole task while meeting performance deadlines, leading to the conclusion that greater improvement results from handing over the largest possible workload to the GPU.
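The granularity effect can be captured with a similarly simple model: each dispatch pays a fixed setup and synchronization cost, so splitting a task into many small kernels lets overhead swamp the useful work. The per-dispatch cost and work figures here are hypothetical, purely to show the shape of the curve:

```python
# Illustrative model of dispatch granularity. Each GPU dispatch pays a
# fixed setup/synchronization cost, so many small kernels let the
# overhead dominate. All numbers are hypothetical.

DISPATCH_OVERHEAD_MS = 0.5   # setup + sync cost per GPU dispatch
TOTAL_WORK_MS = 20.0         # GPU compute time for the whole task

def total_time_ms(num_chunks):
    """Time to run the task split into num_chunks equal dispatches."""
    return num_chunks * DISPATCH_OVERHEAD_MS + TOTAL_WORK_MS

one_big = total_time_ms(1)      # 20.5 ms: overhead is ~2% of the total
many_small = total_time_ms(40)  # 40.0 ms: overhead now equals the work
```

In this model the compute time is fixed; only the number of dispatches changes, which is why handing the GPU the largest possible contiguous workload minimizes the total.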
As a result of these trials, we can make some basic recommendations for heterogeneous compute:
The type and degree of efficiency improvement is heavily dependent on the type of workload. Careful selection of appropriate workloads is necessary; not all apps can achieve all three benefits simultaneously.
Performance is heavily system-specific, depending especially on the relative capabilities of the CPU and GPU. These vary widely, so specific system knowledge is required to achieve consistent performance. Also, without a zero-copy mechanism, data movement through the system can dominate performance.
Optimal workload partitioning is critically important; in general it is best to divide workloads into the largest possible chunks in order to minimize overhead.
A longer version of this article will be presented at the Multicore Developers Conference, May 7-8, where Peter McGuinness will speak on Multicore Computing for a Mobile Environment (ME1138).
Peter McGuinness is a director of multimedia technology marketing at Imagination Technologies. He has an extensive background in the architecture and design of integrated circuits and systems for graphics and video, where he holds a number of patents and patent applications. He began his career as a silicon chip designer in 1980 at Plessey Research in England, leaving in 1983 for start-up Inmos, Ltd. (later part of STMicroelectronics).