Computer systems increasingly feature powerful parallel devices with the advent of manycore CPUs, GPUs, and FPGAs. This offers the opportunity to solve large, computationally-intensive problems in a fraction of the time required by traditional CPUs. However, exploiting this heterogeneous hardware requires the use of low-level programming languages such as OpenCL, which is challenging even for advanced programmers.
On the application side, interpreted dynamic languages are becoming increasingly popular in many emerging domains for their simplicity, expressiveness, and flexibility. However, this creates a wide gap between the high-level abstractions offered to non-expert programmers and the low-level, hardware-specific interface. Currently, programmers must rely on specialized high-performance libraries or are forced to write parts of their application in a low-level language such as OpenCL. Ideally, programmers should be able to exploit heterogeneous hardware directly from their interpreted dynamic languages.
In this paper, we present a technique to transparently and automatically offload computations from interpreted dynamic languages to heterogeneous devices. Using just-in-time compilation, we automatically generate OpenCL code at runtime which is specialized to the actual observed data types using profiling information. We demonstrate our technique using R, a popular interpreted dynamic language predominantly used in big data analytics. Our experimental results show that execution on a GPU yields speedups of over 150x compared to the sequential FastR implementation, and that performance is competitive with manually written GPU code. We also show that, when taking into account startup time, large speedups are achievable even when the application runs for as little as a few seconds.
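To illustrate the idea of type specialization described above, the following is a minimal, hypothetical sketch (not the paper's actual FastR/Truffle implementation): a runtime profiles the observed element type of the input and emits an OpenCL kernel string specialized to that type. All names (`specialize_kernel`, `OPENCL_TYPES`) are illustrative assumptions.

```python
# Hypothetical sketch of JIT type specialization: map the runtime type
# observed via profiling to an OpenCL C type, then emit a kernel string
# specialized to that type. Names are illustrative, not the paper's API.

OPENCL_TYPES = {int: "int", float: "double"}

def specialize_kernel(sample_value, body="out[i] = in[i] * in[i];"):
    """Emit an OpenCL kernel specialized to the observed element type."""
    cl_type = OPENCL_TYPES[type(sample_value)]
    return (
        f"__kernel void mapped(__global const {cl_type}* in, "
        f"__global {cl_type}* out) {{\n"
        "    int i = get_global_id(0);\n"
        f"    {body}\n"
        "}\n"
    )

# Profiling an R double vector would observe floats, so the generated
# kernel uses the OpenCL 'double' type:
kernel_src = specialize_kernel(3.14)
```

In the actual system, this specialization happens transparently inside the JIT compiler; if a later invocation observes a different type, the specialized code is invalidated and regenerated.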