Locality-Driven Power Optimization Techniques for High-Performance Parallel Systems
The computational capabilities of high-performance computing (HPC) systems continue to improve, but at the cost of increased electrical power consumption. The environmental and economic impact of this increasing power consumption is motivating research into techniques that can reduce HPC power consumption without significantly impacting overall system performance. As processor frequency increases have plateaued due to the power and thermal dissipation limits of high-density electronic components, the greatest improvement in computational performance has come from increasing hardware-level parallelism at the core, processor, node, and overall system level. This trend has, in turn, driven the increased use of parallel programming paradigms that can take advantage of the greater hardware capabilities. One major parallel programming paradigm is the Partitioned Global Address Space (PGAS), which uses a Single Program Multiple Data (SPMD) model. For most algorithm types, SPMD threads are constantly communicating and synchronizing across the various levels of hardware parallelism and will intermittently stall because of the response latency from the remote thread(s) involved in communication or synchronization. This dissertation describes research to reduce the energy wasted during these stalls by leveraging the locality-awareness principle to develop power-efficient optimization techniques. Two complementary types of power optimization techniques that can be applied to many common classes of high-performance computing applications are examined.
These techniques are: i) intra-process locality-driven power optimizations, which offer programmers and system designers opportunities to control processor frequency and sleep states among threads of the same process; and ii) inter-process locality-driven power optimizations, which apply job mix co-placement (i.e., mapping running applications to CPU cores using specific affinity patterns) and co-scheduling (i.e., job ordering based on symbiosis) to threads of different and diverse processes that are executing together on an HPC cluster. The co-placement power optimization can reduce energy consumption by up to 25%. The validation of the optimization techniques relied heavily on being able to correctly measure the power utilization of the CPU and memory subsystems. At the time we began our work on this topic, most investigation into power optimization for HPC systems was done using indirect methods such as estimation based on time or CPU performance counters. Instead, we developed a precise and scalable mechanism to directly measure discrete CPU and memory power consumption, closely synchronized with program execution time. Given the evolution of embedded power sensors in later-generation Intel® microprocessors, we also integrated our initial non-intrusive measurement system with an intrusive measurement system using the embedded sensors. The two measurement systems working in tandem generated a large volume of experimental output, so we applied Big Data techniques to process the raw data and used a systematic framework to analyze the results. The measurement framework itself represents a significant contribution to the growing community of researchers in HPC power optimization.
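As an illustration of what mapping applications to CPU cores under an affinity pattern can look like, the following is a minimal Python sketch. It is not the dissertation's implementation. The pattern names "compact" and "scatter", the function names `core_ids` and `co_place`, and the assumption that cores are numbered consecutively within each socket are all hypothetical choices made for this example.

```python
from typing import Dict, List


def core_ids(sockets: int, cores_per_socket: int) -> List[List[int]]:
    """Return core IDs grouped by socket.

    Assumption: cores are numbered consecutively within each socket,
    which is not true of every real machine.
    """
    return [
        [s * cores_per_socket + c for c in range(cores_per_socket)]
        for s in range(sockets)
    ]


def co_place(
    jobs: Dict[str, int],
    sockets: int,
    cores_per_socket: int,
    pattern: str = "compact",
) -> Dict[str, List[int]]:
    """Assign each job's threads to cores under a hypothetical affinity pattern.

    jobs maps a job name to its thread count.
    "compact" fills one socket completely before using the next.
    "scatter" alternates consecutive threads across sockets.
    """
    topology = core_ids(sockets, cores_per_socket)
    if pattern == "compact":
        order = [core for socket in topology for core in socket]
    elif pattern == "scatter":
        order = [
            topology[s][c]
            for c in range(cores_per_socket)
            for s in range(sockets)
        ]
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    if sum(jobs.values()) > len(order):
        raise ValueError("more threads requested than cores available")

    # Hand out cores to jobs in the order the jobs were given.
    placement: Dict[str, List[int]] = {}
    start = 0
    for name, threads in jobs.items():
        placement[name] = order[start:start + threads]
        start += threads
    return placement


if __name__ == "__main__":
    mix = {"job_a": 2, "job_b": 2}
    print(co_place(mix, sockets=2, cores_per_socket=2, pattern="compact"))
    print(co_place(mix, sockets=2, cores_per_socket=2, pattern="scatter"))
```

On Linux, a placement like this could be applied to a running process with `os.sched_setaffinity(pid, cores)`. Which pattern saves energy for a given job mix is an empirical question that the sketch does not attempt to answer.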