Performance exploration of the Accelerator Coherency Port using Xilinx ZYNQ
As the energy efficiency requirements (e.g. GOPS/W) of silicon chips are growing exponentially, computer architects are seeking solutions to continue application performance scaling.
One emerging solution is to use specialized functional units (accelerators) at different levels of a heterogeneous architecture. These specialized units cannot be used as general-purpose compute engines. However, they provide enhanced execution speed and power efficiency for their specific computational workloads.
There are numerous applications for accelerators in both the embedded and high-performance computing markets. Examples include video processing, software-defined radio, network traffic management, DNA computing and fully programmable hardware acceleration platforms.
Efficient sharing of data in a heterogeneous MPSoC that contains different types of integrated computational elements is a challenging task. Especially when private caches of CPU cores and dedicated memory of accelerators are used to store local copies of data, it is crucial to ensure that every processing element has a consistent view of the shared memory space.
The Accelerator Coherency Port (ACP) was developed by ARM as a hardware solution to simplify cache coherency when new accelerator blocks are introduced into a multi-core system. ACP enables hardware accelerators to issue coherent requests to the CPU subsystem memory space.
The Xilinx ZYNQ all-programmable SoC provides designers with an ARM Cortex-A9 MPCore sub-system along with a high-performance DRAM memory controller and various peripherals. It also implements a complete FPGA fabric. The CPU sub-system and the FPGA are connected through AXI interfaces, which allow the logic on the fabric either to perform cache-coherent accesses to the memory through ACP or to access the DRAM directly.
In this paper, we use the ZYNQ device and build a complete infrastructure to evaluate the performance and energy efficiency of different processor-accelerator memory sharing schemes. The ACP is one possible solution, since it enables hardware accelerators to issue coherent accesses to the memory space.
In this paper, we quantify the advantages of using ACP over the traditional method of sharing data on the DRAM. We select the Xilinx ZYNQ as target and develop an infrastructure to stress the ACP and high-performance (HP) AXI interfaces of the ZYNQ device. Hardware accelerators on both the HP and ACP AXI interfaces reach a full-duplex data processing bandwidth of over 1.6 GBytes/s running at 125 MHz on an XC7Z020-1C device.
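As a rough sanity check on the bandwidth figure, the following sketch computes the theoretical peak of an AXI port clocked at 125 MHz and compares it with the reported 1.6 GBytes/s. The 64-bit data width is an assumption (the text does not state the port configuration used), as is reading "GBytes" as 10^9 bytes; the function and variable names are illustrative only.

```python
def axi_peak_bandwidth(clock_hz, data_width_bits=64, full_duplex=True):
    """Theoretical peak bytes/s of an AXI port moving one beat per clock.

    data_width_bits=64 is an assumed port width, not a value from the text.
    """
    bytes_per_beat = data_width_bits // 8
    per_direction = clock_hz * bytes_per_beat
    # Full duplex: the read and write channels transfer simultaneously.
    return per_direction * (2 if full_duplex else 1)


clock_hz = 125_000_000        # accelerator clock from the text
reported = 1_600_000_000      # "over 1.6 GBytes/s", taken as decimal gigabytes

peak = axi_peak_bandwidth(clock_hz)
utilization = reported / peak

print(f"theoretical peak: {peak / 1e9:.1f} GB/s")
print(f"reported / peak:  {utilization:.0%}")
```

Under these assumptions the reported figure corresponds to at least about 80% of the theoretical full-duplex peak of a single port.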
We also analyze the effect of background DRAM and cache traffic on the performance of the accelerators. For a sample image filtering task, the cooperative operation of the CPU and an ACP accelerator (CPU-ACP) gains a speed-up of 1.2x over CPU and HP acceleration (CPU-HP). In terms of energy efficiency, an improvement of 2.5 nJ (> 20%) is shown for each byte of processed data.
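To put the speed-up and energy numbers in perspective, the short sketch below derives two quantities they imply. It assumes that the "> 20%" figure is measured relative to the CPU-HP energy per byte, which the text does not state explicitly; the helper names are hypothetical.

```python
def runtime_reduction(speedup):
    """Fraction of execution time saved for a given speed-up factor."""
    return 1.0 - 1.0 / speedup


def max_baseline_energy(saving_nj_per_byte, min_fraction):
    """Upper bound on the baseline energy per byte.

    If saving / baseline > min_fraction, then baseline < saving / min_fraction.
    Assumes the percentage is relative to the baseline (CPU-HP) energy.
    """
    return saving_nj_per_byte / min_fraction


time_saved = runtime_reduction(1.2)              # CPU-ACP vs. CPU-HP
baseline_bound = max_baseline_energy(2.5, 0.20)  # nJ per byte

print(f"runtime reduction: {time_saved:.1%}")
print(f"CPU-HP energy per byte is below {baseline_bound:.1f} nJ")
```

Under this reading, a 1.2x speed-up corresponds to roughly one sixth less execution time, and the CPU-HP scheme would consume less than 12.5 nJ per processed byte.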
This is the first work that provides detailed practical comparisons of the speed and energy efficiency of various processor-accelerator memory sharing techniques in a configurable heterogeneous platform.