Multiprocessing is ubiquitous in today’s electronic systems. The key benefits are faster processing due to parallel execution and improved operating characteristics such as power, thermal and latency by engaging the right processor(s) for each activity. The same is valid for multicore systems that typically have even closer data and timing ties between the processing units.
Historically, parallel execution and the resulting performance gains have received most of the engineering community's attention, which produced progressive parallel software standards such as Open Computing Language (OpenCL), Heterogeneous System Architecture (HSA), Open Multi-Processing (OpenMP) and similar. The operating characteristics of multiprocessor systems, such as power, although vital for electronic systems, stayed locked inside OS-directed power management (OSPM), bundled with the main OS as the exclusive controller of the whole system. The first challenge to OSPM came with multiple guest OSs running on a single-processor hypervisor, followed by the multi-OS, multicore hypervisor variant and finally by heterogeneous multi-OS, multiprocessor systems. Consequently, hypervisors and dedicated cores started taking over the OSPM role, and engineers developed homegrown power APIs to coordinate power control between the various OSs of the electronic system.
The Latency/Power Tradeoff Problem
There is currently no commonly used standard for managing system power in heterogeneous multiprocessor systems. Each vendor must reinvent APIs and protocols to handle power management and spend time integrating these APIs into the codebase of every processing core in the system. To meet market windows, vendors tend to reuse the existing power management solutions in the software they run on each core and then loosely couple these cores together to create ad hoc power-management regimes. These ad hoc regimes tend to have high-latency power-state transitions. To work around this, companies create static, infrequently updated, data-driven approaches, trading off latency for power. Because of these tradeoffs, vendors have to leave power on the table.
New Power API for Heterogeneous Processors
A solution to this problem is to create an API specification that all software vendors can reasonably implement, a spec that acts as an underlying power management substrate. Because of the unique needs of heterogeneous systems, it should be possible to implement the API using a small amount of code so that even the smallest cores can participate in system-wide power management. The API should also be generic enough to represent most heterogeneous architectures, but not so generic that it becomes hard to use. Finally, the API should be compatible with existing power management schemes like ARM’s Power State Coordination Interface (PSCI).
The new eXtensible Energy Management Interface (XEMI), developed by Aggios and Xilinx over the last two years, fulfills each of these requirements.
XEMI is not revolutionary; it is not intended to be. XEMI is similar to ARM’s PSCI but, unlike PSCI, it covers heterogeneous systems. XEMI is intended to provide a common API that allows all software components to power-manage cores and peripherals. XEMI allows the user to specify a high-level power management goal, such as suspending a complex processor cluster or just a single core. The underlying implementation is then free to implement an optimal power-saving approach autonomously. This approach cuts latency because the requestor of the action specifies a high-level power goal and does not have to execute each step of the power state transition.
Message-Passing Interface Masters System Power
The XEMI API provides the mechanisms for managing the power states of components in heterogeneous multicore systems. By delegating power state control of system components to a central energy management layer, XEMI enables multiple independent processing clusters to share available slave devices in an energy-efficient manner.
XEMI assumes a system architecture consisting of one or more processing clusters, central energy management software (which itself can be distributed across multiple cores), as well as slave devices that can enter multiple power states (Figure 1). Furthermore, there may be a hierarchy of power islands and power domains, allowing groups of components to be turned off by either switching off the power locally in case of a power island or for power domains via an external regulator or power management IC (PMIC).
Figure 1. XEMI System Architecture
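The hierarchy of power domains, power islands and components described above can be pictured as a small tree. The following is only a minimal sketch under stated assumptions: the class name PowerNode, its fields and the example topology are hypothetical and are not part of XEMI. It shows the basic rule that a group can be switched off only when everything inside it is idle.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PowerNode:
    """One node of a hypothetical power hierarchy (not an XEMI type)."""
    name: str
    kind: str                      # "domain", "island", or "component"
    in_use: bool = False           # meaningful for components only
    children: List["PowerNode"] = field(default_factory=list)

    def can_power_off(self) -> bool:
        """A component can go off when idle; a group when all members can."""
        if self.kind == "component":
            return not self.in_use
        return all(child.can_power_off() for child in self.children)


# Hypothetical topology: one power domain holding two power islands.
core0 = PowerNode("core0", "component")
core1 = PowerNode("core1", "component")
dma = PowerNode("dma", "component", in_use=True)
island_a = PowerNode("island_a", "island", children=[core0, core1])
island_b = PowerNode("island_b", "island", children=[dma])
domain0 = PowerNode("domain0", "domain", children=[island_a, island_b])
```

In this sketch island_a can be powered off locally, while domain0 has to stay on until the dma component becomes idle.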
The processing clusters submit power/performance requests via XEMI. These requests are received and processed by the power-management controller. The controller is responsible for managing the power state of all slave devices, which it chooses based on the cumulative power/performance requirements asserted by the processing clusters. It is also responsible for managing the power state of the processing clusters themselves, which use XEMI to coordinate their own suspend procedures with the controller.
The suspend procedure of a processing cluster is mostly initiated and conducted by the software running on that cluster, while the power-management controller is required to perform the final steps. The controller completes the procedure by powering down the power islands and power domains the cluster resides in and, potentially, by adjusting the power state of the slave devices that the cluster requires to operate.
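One way to picture how a controller might combine the requirements asserted by several clusters is shown below. This is an illustrative sketch, not the XEMI implementation: the class PowerController, its methods and the integer capability levels (0 meaning off, higher meaning more capability) are all assumptions made for the example.

```python
class PowerController:
    """Hypothetical central controller that aggregates cluster requests."""

    def __init__(self):
        # device name -> {cluster name: requested capability level}
        self._requests = {}

    def request(self, cluster, device, level):
        """Record (or update) the level a cluster needs from a device."""
        self._requests.setdefault(device, {})[cluster] = level

    def release(self, cluster, device):
        """Drop a cluster's requirement on a device, if it had one."""
        self._requests.get(device, {}).pop(cluster, None)

    def device_level(self, device):
        """Level the device must provide: the highest outstanding request.

        With no outstanding requests the device can be turned off (level 0).
        """
        levels = self._requests.get(device, {}).values()
        return max(levels, default=0)


ctrl = PowerController()
ctrl.request("apu", "usb", 2)   # hypothetical cluster and device names
ctrl.request("rpu", "usb", 1)
```

Here the shared device stays at level 2 while both requests are active, drops to level 1 once the first cluster releases it, and can be turned off when no cluster needs it any longer.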
XEMI also includes APIs for requesting the suspend or wake-up of other processing clusters, providing a standardized mechanism to coordinate system sleep states as well as manage master/slave relationships between processing clusters.
The requirements passed in the XEMI APIs can either refer to explicit component capabilities or include latency requirements, allowing the power-management controller to choose the optimum power state for both the slave devices and the processing clusters. Because actual latencies are platform specific and depend on components like external PMICs, XEMI allows these latency details to be encapsulated in the central controller firmware rather than requiring the software on each processing cluster to be adjusted for them. Application software just needs to know its latency requirements; how these requirements map to states of the various devices is left up to the power-management controller.
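The mapping from latency requirements to a device state could work roughly as follows. This is a sketch under stated assumptions: the PowerState type, the choose_state function and every number in the table are invented for illustration, standing in for the platform-specific data that would live in the controller firmware.

```python
from typing import NamedTuple


class PowerState(NamedTuple):
    """Hypothetical description of one device power state."""
    name: str
    power_mw: float          # illustrative figure, not measured data
    wake_latency_us: float   # illustrative figure, not measured data


def choose_state(states, max_latencies_us):
    """Pick the lowest-power state that meets the tightest latency bound.

    max_latencies_us holds one wake-up latency bound per requesting
    cluster; an empty list means nobody constrains the device.
    """
    budget = min(max_latencies_us, default=float("inf"))
    eligible = [s for s in states if s.wake_latency_us <= budget]
    if not eligible:
        raise ValueError("no power state satisfies the latency requirements")
    return min(eligible, key=lambda s: s.power_mw)


# Made-up state table for a single device.
STATES = [
    PowerState("on", 100.0, 0.0),
    PowerState("retention", 10.0, 50.0),
    PowerState("off", 0.0, 5000.0),
]
```

With this table, bounds of 100 and 1000 microseconds select the retention state, a bound of 10 microseconds keeps the device on, and a device with no requirements at all can be turned off. The requesting software only supplies its bound; the table stays inside the controller.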
XEMI for Xilinx Zynq UltraScale+ MPSoC
Aggios and Xilinx have created an implementation of XEMI for the Zynq UltraScale+ MPSoC (Figure 2). This platform was ideal to build the first implementation of XEMI because the programmable logic allowed the engineering team to explore the design space efficiently. In addition, this platform will be ideal for others to continue to refine the XEMI specification because of its general availability and ease of use.
Figure 2. UltraScale+ MPSoC Architecture
The Zynq UltraScale+ MPSoC contains several processing clusters that can act independently of each other, including a quad ARM Cortex-A53 application processor unit (APU), a dual ARM Cortex-R5 real-time processor unit (RPU), and the programmable logic, which can host one or more soft-core processors. All of these processors can share many of the slave devices. Furthermore, when a processor like the APU is not running, power consumption from leakage can be further reduced by turning off the power island completely. Further reductions in power are possible by turning off the entire Full Power Domain (FPD). XEMI is used for coordinating and implementing these and other transitions.
Figure 3. Deep Sleep UML Diagram
The Unified Modeling Language (UML) diagram in Figure 3 depicts how XEMI is used to realize a typical power management use case. The diagram shows the transition from an “all-on” state to a deep-sleep state without a tight wake-up latency requirement for any element in the Full Power Domain (FPD). In deep sleep, both processing units are off, the memories are in retention, and the FPD is off.
The Real-Time Processing Unit (RPU) initiates the transition into the deep-sleep state by calling pm_request_suspend. The Platform Management Unit (PMU) then asks the APU to suspend itself with pm_init_suspend. The APU performs its own self-suspend and saves its context in Double Data-Rate (DDR) memory. Once the APU suspend procedure is completed, the PMU notifies the RPU with pm_acknowledge. Since no more devices inside the FPD are in use or have tight latency requirements, the PMU turns off the power to the FPD.
The RPU now releases the USB device via the PMU by calling pm_release_node and initiates its own suspend procedure, configuring the real-time clock (RTC) to be its wake-up source. With no more power management activity, the PMU enters a Sleep state. When a wake event occurs, the PMU knows which devices need to be awakened, and it takes care of the correct power-up sequences for power domains and power islands as necessary.
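The deep-sleep sequence can be replayed as a simple trace of calls and state changes. The sketch below is an interpretation rather than the actual firmware: only the four call names come from the description above, while the function simulate_deep_sleep, the state dictionary, the argument shown with each call and the assignment of a sender and receiver to each call are assumptions made for illustration.

```python
def simulate_deep_sleep():
    """Replay the described transition and return (final state, call trace)."""
    state = {"APU": "on", "RPU": "on", "FPD": "on",
             "USB": "in use", "PMU": "active", "wake_source": None}
    trace = []

    def call(sender, receiver, name, target=None):
        # Record who invoked which call, and about which node (if any).
        trace.append((sender, receiver, name, target))

    call("RPU", "PMU", "pm_request_suspend", "APU")  # RPU asks for APU suspend
    call("PMU", "APU", "pm_init_suspend")            # PMU asks APU to suspend
    state["APU"] = "off"        # APU saved its context to DDR and suspended
    call("PMU", "RPU", "pm_acknowledge", "APU")      # PMU reports completion
    state["FPD"] = "off"        # nothing in the FPD is needed any more
    call("RPU", "PMU", "pm_release_node", "USB")     # USB is released
    state["USB"] = "released"
    state["wake_source"] = "RTC"  # RTC configured as the wake-up source
    state["RPU"] = "off"
    state["PMU"] = "sleep"      # no power management activity remains
    return state, trace


final_state, calls = simulate_deep_sleep()
```

The final state matches the deep-sleep definition given earlier: both processing units and the FPD are off, and the RTC remains available to wake the system.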
The XEMI API solves heterogeneous multiprocessing power management challenges without many of the tradeoffs necessary in traditional OSPM approaches. It gives software vendors the freedom to build an underlying power management substrate optimized for their platform using efficient implementations. The substrate approach lets designers reclaim power that traditional implementations leave on the table. Efforts that required a great deal of cross-system coordination, such as powering off many heterogeneous cores, are made easier with the XEMI API's high-level, goal-centric approach. Over the last two years, Aggios and Xilinx have worked to make the vision of XEMI a reality. With the recent introduction of the latest heterogeneous programmable processing SoC from Xilinx, the Zynq UltraScale+ MPSoC, Aggios and Xilinx have created a platform that the ecosystem can use to continue to refine and improve the XEMI API.