Shoal: smart allocation and replication of memory for parallel programs
Shoal: smart allocation and replication of memory for parallel programs
08 July 2015
Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program’s access patterns. However, sub-optimal allocation can significantly impact performance of parallel programs. We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domainspecific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case-studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the highlevel program and (ii) a manually-annotated version of the PARSEC Streamcluster benchmark.
Venue : USENIX Annual Technical Conference