Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems.

Behram Khan, Daniel Goodman, Salman Khan, Will Toms, Paolo Faraboschi, Mikel Lujan, Ian Watson

08 February 2015

To harness the compute resources of many-core systems with tens to hundreds of cores, applications have to expose parallelism to the hardware. Researchers are actively exploring program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel 'tasks' and allow an underlying system layer to schedule these tasks to different threads. Software-only schedulers can implement various scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, on large-scale multi-core systems, software schedulers suffer significant overheads as they synchronize and communicate task information over deep cache hierarchies. To reduce these overheads, hardware-only schedulers such as Carbon have been proposed, enabling task queuing and scheduling to be done in hardware. This paper presents a hardware scheduling approach in which the structure provided to programs by task-based programming models is incorporated into the scheduler, making it aware of each task's data requirements. This prior knowledge of a task's data requirements allows better task placement by the scheduler, which in turn reduces overall cache misses and memory traffic, improving the program's performance and power consumption. Simulations of this technique for a range of synthetic benchmarks and components of real applications show a reduction in the number of cache misses of up to 72% and 95% for the L1 and L2 caches, respectively, and up to a 30% improvement in overall execution time against FIFO scheduling. This results not only in faster execution but also in less data transfer, with reductions of up to 50%, reducing load on the interconnect and lowering power consumption.
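To make the placement idea concrete, the C++ sketch below illustrates, in software, the scheduling policy the abstract describes: tasks declare where their data lives, and the scheduler prefers to dispatch them to a worker on that node, stealing remotely only when local work runs out. This is a minimal illustration, not the paper's hardware design; names such as `Task`, `home_node`, and `LocalityScheduler` are assumed for this example.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

struct Task {
    int id;
    int home_node;  // NUMA node holding the task's input data, declared by the
                    // programming model (the "prior knowledge" the abstract cites)
};

// One queue per NUMA node; tasks are enqueued where their data resides.
class LocalityScheduler {
public:
    explicit LocalityScheduler(int nodes) : queues_(nodes) {}

    void submit(const Task& t) { queues_[t.home_node].push_back(t); }

    // A worker on `node` prefers local tasks; it falls back to stealing from
    // other nodes' queues only when its own is empty, trading locality for
    // load balance.
    bool next(int node, Task& out) {
        if (pop(queues_[node], out)) return true;
        for (auto& q : queues_)
            if (pop(q, out)) return true;  // remote steal: extra interconnect traffic
        return false;
    }

private:
    static bool pop(std::deque<Task>& q, Task& out) {
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }
    std::vector<std::deque<Task>> queues_;
};

int main() {
    LocalityScheduler sched(2);
    for (int i = 0; i < 6; ++i)
        sched.submit({i, i % 2});  // data spread across 2 nodes

    Task t;
    while (sched.next(/*node=*/0, t))  // worker on node 0 drains local work first
        std::printf("node 0 runs task %d (data on node %d)\n", t.id, t.home_node);
    return 0;
}
```

By contrast, the FIFO baseline in the abstract corresponds to a single shared queue in which a task's home node plays no role in placement; the per-node queues are what allow data-aware placement to reduce cache misses and remote traffic.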


Venue: N/A

File Name: Supercomputing.pdf