Benchmarking ARM’s big-little architecture
There seems to be a certain amount of excitement brewing around Samsung's Exynos Octa versus Qualcomm's quad-core Snapdragons and MediaTek's forthcoming True Octa-ness. I thought I would share some of the "big-little" thinking of ARM that I received on a recent trip to Cambridge, England.
Big-little is the approach whereby differently optimized cores implementing the same instruction set architecture are deployed in an SoC. This means tasks can be assigned to different cores depending on the context in which the device finds itself. However, it should also be remembered that big-little is a special case and does not cover the entire heterogeneous multiprocessing (HMP) landscape. HMP also includes the possibility of running software tasks on CPU cores, graphics processing units (GPUs), and other cores besides.
But ignoring GPU-compute for the present, big-little comes in three flavors. More flavors may appear in the future; the technique may even evolve into little-big-biggest-other. For now, though, ARM identifies three modes of operating big-little.
The simplest is the clustered migration approach adopted by Samsung for the Exynos 5 Octa. This has the advantage of presenting a uniprocessor programming model to the software engineer. It keeps caching and memory management simple. The software runs on a cluster of little cores or migrates up to an equivalent cluster of big cores as an extension of a dynamic voltage and frequency scaling scheme.
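The idea of cluster switching as an extension of DVFS can be sketched in a few lines. This is an illustrative model only, not ARM's or Samsung's actual governor code; the operating-point tables and function names here are hypothetical:

```python
# Hypothetical sketch of clustered migration as a DVFS extension:
# the big cluster's operating points simply continue where the
# little cluster's table tops out, so "switch clusters" is just
# "step past the highest little-cluster frequency."

LITTLE_OPPS = [200, 400, 600, 800, 1000]   # MHz, little cluster (assumed values)
BIG_OPPS = [800, 1200, 1600, 1800]         # MHz, big cluster (assumed values)

def select_cluster_and_freq(demand_mhz):
    """Pick the lowest operating point that meets demand."""
    for f in LITTLE_OPPS:
        if f >= demand_mhz:
            return ("little", f)
    for f in BIG_OPPS:
        if f >= demand_mhz:
            return ("big", f)
    return ("big", BIG_OPPS[-1])  # saturate at the big cluster's maximum
```

In this model the software never sees two clusters, only one ladder of frequencies, which is why the programming model stays that of a uniprocessor.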
Slightly more complex is the CPU migration model. There are multiple big-little pairs. This produces an energy-saving advantage over clustered migration; more cores can be switched off, and it remains relatively simple to integrate into an operating system. Most kernel code is unaffected by the regime, and only the power management software needs to be tuned. However, it still restricts systems to having equal numbers of big and little cores. At maximum, only half the cores can be in use at any time.
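The pairing constraint can be made concrete with a toy model. Again, this is a hypothetical sketch rather than kernel code; the threshold values and class names are assumptions chosen for illustration:

```python
# Hypothetical sketch of CPU migration: each big-little pair presents
# one logical CPU, and the power manager decides which physical core
# in the pair is active. Because only one side of each pair runs,
# at most half the physical cores are ever in use at once.

class CorePair:
    def __init__(self, pair_id):
        self.pair_id = pair_id
        self.active = "little"   # start on the energy-efficient core

    def update(self, load):
        # Assumed thresholds with hysteresis to avoid ping-ponging
        if self.active == "little" and load > 0.8:
            self.active = "big"      # migrate the logical CPU up
        elif self.active == "big" and load < 0.3:
            self.active = "little"   # migrate down; big core powers off
        return self.active
```

Note that the decision is purely local to each pair, which is why only the power management software needs tuning while the rest of the kernel sees an ordinary set of CPUs.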
The third mode, which has the fewest limitations and the most complexity, is what ARM calls global task scheduling. This requires smart operating system intervention, so that any thread can run on any core. In theory, all the cores could be pressed into service together, though practically, this may have to be done sparingly and for short bursts, due to thermal considerations.
When you get into global task scheduling, you need a task scheduler that can pay attention to workloads and available resources: one that knows what runs best where, what went to sleep when, and which task holds what memory. Not only do those algorithms need to be smart and able to exercise priority, but they also need to be debugged to make sure tasks don't end up blocking each other or falling into wasteful behaviors.
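A minimal sketch can show what "any thread on any core" means in practice. This is a hypothetical toy scheduler, not the Linux scheduler or ARM's reference software; the threshold and the load figures are invented for illustration:

```python
# Hypothetical sketch of global task scheduling: any task may land on
# any core. Heavy tasks are steered toward big cores, light tasks
# toward little ones, and all cores can be populated at once.

def schedule(tasks, little_cores, big_cores, heavy_threshold=0.6):
    """tasks: list of (name, load in 0..1); returns {core: [task names]}."""
    placement = {c: [] for c in little_cores + big_cores}
    # Place the heaviest tasks first so they claim the big cores
    for name, load in sorted(tasks, key=lambda t: -t[1]):
        pool = big_cores if load >= heavy_threshold else little_cores
        # Balance within the preferred pool by current task count
        target = min(pool, key=lambda c: len(placement[c]))
        placement[target].append(name)
    return placement
```

A real scheduler would also weigh cache affinity, wake-up latency, and, as noted above, thermal limits that cap how long all cores can run flat out together.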