The problem of mapping threads, or virtual cores, to
physical cores on multicore systems has been studied for
over a decade. Despite this effort, there is still no method
that can determine, at runtime and for arbitrary workloads,
the relative performance impact of different mappings.
Prior work has made large strides in this field, but existing solutions
address a limited set of concerns (e.g., only shared
caches and memory controllers, or only asymmetric interconnects),
assume hardware with specific properties, and
leave us unable to generalize the model to other systems.
Our contribution is an abstract machine model that enables
us to automatically build a performance prediction
model for any machine with a hierarchy of shared resources.
While developing the methodology for building
predictive models, we discovered pitfalls in using hardware
performance counters, a de facto technique the community
has embraced for decades. Our new methodology avoids
hardware counters, at the cost of trying a handful of
additional workload mappings (out of many possible) at runtime.
Using this methodology, data center operators can decide
on the smallest number of NUMA (CPU+memory) nodes to
use for a target application or service (which we assume
is encapsulated in a virtual container, matching
the reality of modern cloud systems such as AWS),
while still meeting performance goals. More broadly, the
methodology empowers them to efficiently “pack” virtual
containers on the physical hardware in a data center.