Productive Machine Learning Support for Locality Optimizations in Distributed Memory Systems
Data movement, and therefore data locality optimization, has a significant impact on application performance and power consumption. Data locality optimization refers, in this work, to maximizing, exposing, and exploiting locality in the underlying application by the layers of the system stack, from the programming interface to the hardware architecture. Recent trends in HPC, such as increasing parallelism, deep memory hierarchies, and new application areas, increase both the impact of data locality and the difficulty of exploiting it. Programmers can undertake extensive hand-tuning efforts to exploit locality manually, which hinders productivity. On the other hand, compilers and runtime systems may attempt to automatically understand and exploit the locality characteristics of applications. Operating systems and architectures may contribute to data locality optimization as well. However, the benefit from such automatic techniques can be limited.

In this dissertation, it is argued that data locality optimization is a key factor for achieving performance and scalability in distributed memory architectures with minimal additional effort compared to shared memory. More importantly, it is advocated here that the majority of such extensive data locality optimizations are only possible through an end-to-end programming language environment design. To create a design that can significantly reduce programmer effort while achieving performance and scalability, we propose a new abstraction where data locality optimization is broken into a complex combination of smaller subtasks. These subtasks can then be addressed by different sublayers in the programming environment in a cooperative manner.

Therefore, first, a formal framework for codesigning locality-aware programming environments is introduced. Such a codesign framework can help deduce a tradeoff between automatic and manual techniques by engineering a productive division of labor between the user, the compiler, and the runtime system. This way, distributed memory programming languages can provide support for scalable and productive data movement.

Second, we use this framework to design a language-based optimization infrastructure called LAPPS. This novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high-performance but laborious manual prefetching. This is achieved by assigning different data locality optimization tasks to the programmer, the compiler, and the runtime, as suggested by the formal framework.

Third, a machine-learning-based automated code optimization system is introduced. This system can learn access patterns and leverage the LAPPS infrastructure to optimize data locality. Using this environment, the programmer only annotates time-critical constructs such as loops. The system then automatically collects application profiles and creates an optimized executable. In a typical programmer's workflow, this tool can be considered a compiler wrapper. This allows programmers to develop their applications assuming a shared memory architecture, yet achieve scalable performance on distributed memory.

Our extensive experimental analysis shows that the implementations developed in the context of this novel intelligent locality optimization framework can significantly reduce programmer effort and data movement in HPC systems, thereby reducing the programmability gap between shared memory and distributed memory systems.
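
To make the profile-driven workflow of the third contribution more concrete, the following is a minimal illustrative sketch, not the dissertation's implementation. It assumes a recorded trace of array indices from an annotated loop and a block-distributed array. A simple dominant-stride heuristic stands in for the learned access-pattern model, and the names `infer_stride`, `plan_prefetch`, and `block_owner` are hypothetical and not part of LAPPS.

```python
from collections import Counter


def infer_stride(indices, threshold=0.8):
    """Return the dominant stride of an access trace, or None if no
    single stride accounts for at least `threshold` of the steps."""
    deltas = [b - a for a, b in zip(indices, indices[1:])]
    if not deltas:
        return None
    stride, count = Counter(deltas).most_common(1)[0]
    return stride if count / len(deltas) >= threshold else None


def plan_prefetch(indices, owner_of, my_rank):
    """Group the remote indices of a regular trace by owning rank, so
    each group could be fetched with one bulk transfer (hypothetical)."""
    if infer_stride(indices) is None:
        return {}  # irregular pattern: leave accesses on demand
    plan = {}
    for i in indices:
        rank = owner_of(i)
        if rank != my_rank:
            plan.setdefault(rank, []).append(i)
    return plan


def block_owner(i, block_size=4):
    """Assumed block distribution: `block_size` elements per rank."""
    return i // block_size


# Assumed profile of an annotated loop on rank 0: every other element.
trace = list(range(0, 16, 2))
plan = plan_prefetch(trace, block_owner, my_rank=0)
print(plan)
```

In this sketch, a regular trace yields one index list per remote rank, while an irregular trace yields an empty plan and falls back to on-demand remote access. A real system would replace the stride heuristic with a trained model and hand the plan to the prefetching layer.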