There is growing interest in replacing traditional servers with low-power multicore systems such as the ARM Cortex-A9. However, such systems are typically provisioned for mobile applications, which have lower memory and I/O requirements than server applications. Thus, the impact and extent of the imbalance between application and system resources on the energy-efficient execution of server workloads is unclear.
This paper proposes a trace-driven analytical model for understanding the energy performance of server workloads on ARM Cortex-A9 multicore systems. Key to our approach is modeling the degree of overlap among CPU core, memory, and I/O resources, and estimating the number of cores and the clock frequency that optimize energy performance without compromising execution time.
Since energy usage is the product of power drawn and execution time, the model first estimates the execution time of a program. CPU time, which accounts for both core and memory response times, is modeled as an M/G/1 queueing system.
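For an M/G/1 queue, the mean response time follows from the standard Pollaczek-Khinchine result. The sketch below is illustrative only; the function name and parameter values are ours, not the paper's calibrated model.

```python
def mg1_mean_response_time(arrival_rate, mean_service, second_moment_service):
    """Mean response time of an M/G/1 queue via the Pollaczek-Khinchine
    formula: T = E[S] + lambda * E[S^2] / (2 * (1 - rho)), with rho = lambda * E[S]."""
    rho = arrival_rate * mean_service  # server utilization; must be < 1
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization >= 1")
    waiting = arrival_rate * second_moment_service / (2.0 * (1.0 - rho))
    return mean_service + waiting

# Toy check: with exponential service (E[S^2] = 2 * E[S]^2) the M/G/1
# result reduces to the familiar M/M/1 response time 1 / (mu - lambda).
t = mg1_mean_response_time(0.5, 1.0, 2.0)  # -> 2.0, matching 1 / (1 - 0.5)
```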
Workload characterization of high-performance computing, web-hosting, and financial computing applications shows that bursty memory traffic fits a Pareto distribution, while non-bursty memory traffic is exponentially distributed.
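One simple way to separate the two regimes is the coefficient of variation of the traffic trace: it is close to 1 for exponential inter-arrivals and well above 1 for heavy-tailed (Pareto) ones. This sketch uses synthetic samples in place of real memory-traffic traces; the thresholds and parameters are assumptions for illustration.

```python
import random
import statistics

def coeff_of_variation(samples):
    """Sample coefficient of variation (stdev / mean). CV is close to 1 for
    exponential data; CV well above 1 suggests a heavy (Pareto-like) tail,
    i.e. bursty traffic."""
    return statistics.stdev(samples) / statistics.mean(samples)

random.seed(42)
smooth = [random.expovariate(1.0) for _ in range(20000)]    # non-bursty: exponential
bursty = [random.paretovariate(1.5) for _ in range(20000)]  # bursty: Pareto, alpha = 1.5
```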
Our analysis using these server workloads reveals that not all of them benefit from a higher number of cores or a higher clock frequency. Applying our model, we predict configurations that increase energy efficiency by 10% without turning off cores, and by up to one third when unutilized cores are shut down.
For memory-bound programs, we show that the limited memory bandwidth can increase both execution time and energy usage, to the point where the energy cost may exceed that of a typical x64 multicore system. Lastly, we show that increasing memory and I/O bandwidth can improve both the execution time and the energy usage of server workloads on ARM Cortex-A9 systems.
We propose a hybrid measurement-analytical approach to characterize the energy performance of server workloads on state-of-the-art low-power multicores.
First, we introduce a general analytical model for predicting the execution time and energy usage of parallel applications as a function of the number of active cores and the core clock frequency.
The key idea behind the model is to characterize the overlap among the response times of three key types of resources: processor cores, memory, and network I/O.
Using a simple queueing model, we account for the overlap between CPU work cycles, CPU memory cycles, and network I/O execution time, and identify the system bottleneck. Furthermore, we model the impact of changing the number of cores on each type of response time.
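The overlap idea can be sketched as subtracting, from the serial sum of the three response times, the fractions of memory and I/O time hidden behind CPU work. The function and overlap parameters below are hypothetical placeholders, not the paper's fitted model.

```python
def predict_time(t_cpu, t_mem, t_io, overlap_mem, overlap_io):
    """Illustrative overlap model: only the non-overlapped fractions of
    memory and I/O response time add to total execution time.
    overlap_mem and overlap_io are fractions in [0, 1]."""
    return t_cpu + (1.0 - overlap_mem) * t_mem + (1.0 - overlap_io) * t_io

# Example: 10 s of CPU work, 4 s of memory time half-hidden behind it,
# 2 s of I/O fully overlapped with computation.
total = predict_time(10.0, 4.0, 2.0, overlap_mem=0.5, overlap_io=1.0)  # -> 12.0
```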
To apply this general model, we perform a series of baseline runs of an application, during which we collect traces of the total number of cycles, the total stall cycles, the total last-level cache misses, the time distribution of last-level cache misses, and the I/O request profile.
Using a static power profile of the system together with these collected metrics, we can predict the execution time and energy usage of an application for different numbers of cores and clock frequencies, and thus select the configuration that maximizes performance without wasting energy.
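Configuration selection can be sketched as a sweep over (cores, frequency) pairs, minimizing energy E = P * T among configurations whose execution time stays close to the fastest one. The `power` and `time` callables and the toy functional forms below are our assumptions standing in for the paper's fitted model.

```python
def pick_config(configs, power, time, slack=2.0):
    """Choose the (cores, freq) pair minimizing energy E = P * T among
    configurations whose predicted time is within `slack` times the best.
    `power(n, f)` and `time(n, f)` are model callables (hypothetical here)."""
    t_best = min(time(n, f) for n, f in configs)
    feasible = [(n, f) for n, f in configs if time(n, f) <= slack * t_best]
    return min(feasible, key=lambda nf: power(*nf) * time(*nf))

# Toy model: time shrinks with n * f, power grows with n and f^2.
toy_time = lambda n, f: 100.0 / (n * f)
toy_power = lambda n, f: 0.5 + n * f ** 2
best = pick_config([(1, 1.0), (2, 1.0), (4, 1.0), (4, 1.5)], toy_power, toy_time)
```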
We validate the model against direct measurement of execution time and energy on a low-power server based on a quad-core Exynos 4412 ARM Cortex-A9 multicore processor.
Validation on three types of parallel applications spanning high-performance computing, web hosting, and financial computing indicates that the relative error of our model's predictions averages 9%, with 70% of the experiments under 5% error.
The second contribution of our work is an analysis of execution performance and energy usage for a series of workloads covering high-performance computing, web hosting, and financial applications.