News


22-APR-2017

Pandia: comprehensive contention-sensitive thread placement.

Pandia is a system for modelling the performance of in-memory parallel workloads. It generates a description of a workload from a series of profiling runs, and combines this with a description of the machine's hardware to model the workload's performance over different thread counts and different placements of those threads.

The approach is “comprehensive” in that it accounts for contention at multiple resources such as processor functional units and memory channels. The points of contention for a workload can shift between resources as the degree of parallelism and thread placement change. Pandia accounts for these shifts and provides a close correspondence between predicted and actual performance. Testing a set of 22 benchmarks on 2-socket Intel machines fitted with chips ranging from Sandy Bridge to Haswell, we see median differences of 0% to 1.05% between the fastest predicted placement and the fastest measured placement, and median errors of 4% to 8% across all placements.

Pandia can be used to optimize the performance of a given workload, for instance by identifying whether or not multiple processor sockets should be used, and whether or not the workload benefits from using multiple threads per core. In addition, Pandia can be used to identify opportunities for reducing resource consumption where additional resources are not matched by additional performance, for instance by limiting a workload to a small number of cores when its scaling is poor.
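
As a rough illustration of how such a model is consumed (not of how Pandia builds it), the sketch below enumerates candidate placements and asks a hypothetical predict_runtime callback, standing in for Pandia's model, to score each one; the cheapest prediction wins. All names and parameters here are illustrative assumptions, not Pandia's API.

```python
from itertools import product

def best_placement(predict_runtime, sockets=2, cores_per_socket=8, threads_per_core=2):
    """Exhaustively score candidate placements with a performance model.

    predict_runtime(placement) is a hypothetical callback that returns the
    modelled execution time for a placement described as a small dict.
    """
    best = None
    for s, c, t in product(range(1, sockets + 1),
                           range(1, cores_per_socket + 1),
                           range(1, threads_per_core + 1)):
        placement = {"sockets": s, "cores_per_socket": c, "threads_per_core": t}
        predicted = predict_runtime(placement)
        if best is None or predicted < best[0]:
            best = (predicted, placement)
    return best  # (predicted_time, placement) with the lowest predicted time
```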

Read

22-APR-2017

A General Model for Placement of Workloads on Multicore NUMA Systems

The problem of mapping threads, or virtual cores, to physical cores on multicore systems has been studied for over a decade. Despite this effort, there is still no method that will help us decide in real time and for arbitrary workloads the relative impact of different mappings on performance. Prior work has made large strides in this field, but these solutions addressed a limited set of concerns (e.g., only shared caches and memory controllers, or only asymmetric interconnects), assuming hardware with specific properties and leaving us unable to generalize the model to other systems. Our contribution is an abstract machine model that enables us to automatically build a performance prediction model for any machine with a hierarchy of shared resources. In the process of developing the methodology for building predictive models we discovered pitfalls of using hardware performance counters, a de facto technique embraced by the community for decades. Our new methodology does not rely on hardware counters; the cost is trying a handful of additional workload mappings (out of many possible) at runtime. Using this methodology, data center operators can decide on the smallest number of NUMA (CPU+memory) nodes to use for the target application or service (which we assume to be encapsulated into a virtual container so as to match the reality of modern cloud systems such as AWS), while still meeting performance goals. More broadly, the methodology empowers them to efficiently “pack” virtual containers on the physical hardware in a data center.
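
The decision the methodology enables can be sketched very simply; the paper's prediction model itself is not reproduced here. In the toy sketch below, run_with_nodes is a hypothetical callback that maps the containerised workload onto a given number of NUMA nodes (or queries a predictive model for that mapping) and reports its throughput.

```python
def smallest_sufficient_nodes(run_with_nodes, max_nodes, throughput_goal):
    """Return the smallest NUMA node count whose throughput meets the goal.

    run_with_nodes(n) is a hypothetical callback that measures (or predicts)
    the workload's throughput when restricted to n NUMA nodes.
    """
    for n in range(1, max_nodes + 1):
        if run_with_nodes(n) >= throughput_goal:
            return n
    return max_nodes  # goal unreachable; fall back to using all nodes
```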

Read


15-APR-2017

Better Splittable Pseudorandom Number Generators (and Almost As Fast)

We have tested and analyzed the SplitMix pseudorandom number generator algorithm presented by Steele, Lea, and Flood (2014), and have discovered two additional classes of gamma values that produce weak pseudorandom sequences. In this paper we present a modification to the SplitMix algorithm that avoids all three classes of problematic gamma values, and also a completely new algorithm for splittable pseudorandom number generators, which we call TwinLinear. Like SplitMix, TwinLinear provides both a generate operation that returns one (64-bit) pseudorandom value and a split operation that produces a new generator instance that with very high probability behaves as if statistically independent of all other instances. Also like SplitMix, TwinLinear requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with SIMD instruction sets because it has no branches or loops. The TwinLinear algorithm is the result of a systematic exploration of a substantial space of nonlinear mixing functions that combine the output of two independent generators of (perhaps not very strong) pseudorandom number sequences. We discuss this design space and our strategy for exploring it. We used the PractRand test suite (which has provision for failing fast) to filter out poor candidates, then used TestU01 BigCrush to verify the quality of candidates that withstood PractRand. We present results of analysis and extensive testing on TwinLinear (using both TestU01 and PractRand). Single instances of TwinLinear have no known weaknesses, and TwinLinear is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. It is slightly more costly than SplitMix (10 or 11 64-bit arithmetic operations per 64 bits generated, rather than 9) but has a shorter critical path (5 or 6 operations rather than 8). We believe that TwinLinear is suitable for the same sorts of applications as SplitMix, that is, “everyday” scientific and machine-learning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
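
For readers unfamiliar with splittable generators, the sketch below shows the general SplitMix style of generate and split, using the widely published SplitMix64 mixing constants. It is only a toy: the split shown here omits the screening of gamma values that this paper is about, and it is not the TwinLinear algorithm.

```python
MASK = (1 << 64) - 1
GOLDEN_GAMMA = 0x9E3779B97F4A7C15  # default odd gamma used by SplitMix

def mix64(z):
    # MurmurHash3-style 64-bit finalizer used for the output mixing step.
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

class SplittableGen:
    """Toy splittable generator in the SplitMix style (illustrative only)."""

    def __init__(self, seed, gamma=GOLDEN_GAMMA):
        self.seed = seed & MASK
        self.gamma = gamma | 1          # the additive constant must be odd

    def _next_seed(self):
        self.seed = (self.seed + self.gamma) & MASK
        return self.seed

    def generate(self):
        # One 64-bit pseudorandom value: advance the seed, then mix it.
        return mix64(self._next_seed())

    def split(self):
        # Simplified split: derive a fresh (seed, gamma) pair from this
        # generator.  The published algorithms additionally screen the new
        # gamma so it avoids the weak classes discussed in the paper.
        new_seed = mix64(self._next_seed())
        new_gamma = mix64(self._next_seed()) | 1
        return SplittableGen(new_seed, new_gamma)
```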

Read


06-APR-2017

Polyglot programming in the cloud

A presentation of the Graal polyglot vision.

View


01-APR-2017

LabelBank: Revisiting Global Perspectives for Semantic Segmentation

Semantic segmentation requires a detailed labeling of image pixels by object category. Information derived from local image patches is necessary to describe the detailed shape of individual objects. However, this information is ambiguous and can result in noisy labels. Global inference of image content can instead capture the general semantic concepts present. We advocate that holistic inference of image concepts provides valuable information for detailed pixel labeling. We propose a generic framework to leverage holistic information in the form of a LabelBank for pixel-level segmentation. We show the ability of our framework to improve semantic segmentation performance in a variety of settings. We learn models for extracting a holistic LabelBank from visual cues, attributes, and/or textual descriptions. We demonstrate improvements in semantic segmentation accuracy on standard datasets across a range of state-of-the-art segmentation architectures and holistic inference approaches.
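
The paper's inference procedure is richer than this, but the core idea of letting a holistic label set veto per-pixel predictions can be sketched as follows; shapes and names here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def filter_with_label_bank(pixel_scores, label_bank):
    """Suppress per-pixel scores of classes absent from a holistic LabelBank.

    pixel_scores: float array of shape (H, W, C) with per-class scores.
    label_bank:   iterable of class indices predicted to be present globally.
    Returns a (H, W) array of class labels restricted to the LabelBank.
    """
    mask = np.full(pixel_scores.shape[-1], -np.inf)
    mask[list(label_bank)] = 0.0            # keep only classes in the bank
    return np.argmax(pixel_scores + mask, axis=-1)
```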

Read

29-MAR-2017

A Many-core Architecture for In-Memory Data Processing

We live in an information age, with data and analytics guiding a large portion of our daily decisions. Data is being generated at a tremendous pace from connected cars, connected homes, and connected workplaces, and extracting useful knowledge from this data is quickly becoming an impractical task. Single-threaded performance has become saturated in the last decade, and there is a growing need for custom solutions to keep pace with these workloads in a scalable and efficient manner. A large portion of the power spent in analytics workloads goes into bringing data to the processing cores, and we aim to optimize that. We present the Database Processing Unit, or DPU, a shared-memory many-core that is specifically designed for in-memory analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and preprocessing operations. The DPU also provides acceleration for core-to-core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine, or ATE. Comparing a fabricated DPU chip against state-of-the-art x86 systems on a variety of applications shows a performance/Watt advantage of 3x to 16x.

Read

28-MAR-2017

Building Reusable, Low-Overhead Tooling Support into a High-Performance Polyglot VM

Software development tools that interact with running programs, for instance debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in thinking obsoletes that presumption and enables the delivery of effective tools as a forethought, no longer an afterthought.

View

28-MAR-2017

PGX.UI: Visual Construction and Exploration of Large Property Graphs

Transforming existing data into graph formats and visualizing large graphs in a comprehensible way are two key areas of interest in information visualization. Addressing these issues requires new visualization approaches for large graphs that support users with graph construction and exploration. In addition, graph visualization is becoming more important for existing graph processing systems, which are often based on the property graph model. Therefore, this paper presents concepts for visually constructing property graphs from data sources and a summary visualization for large property graphs. Furthermore, we introduce the concept of a graph construction timeline that keeps track of changes and provides branching and merging, in a version-control-like fashion. Finally, we present a tool that visually guides users through the graph construction and exploration process.

Read

25-MAR-2017

A Two-List Framework for Accurate Detection of Frequent Items in Large Data Sets

The problem of detecting the most frequent items in large data sets and providing accurate frequency estimates for those items is becoming more and more important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. Two algorithms implementing this framework are presented: FSSAL and FSSA. An adaptive version of these algorithms is presented, which adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. Analytical comparison with the FSS algorithm shows that the FSS2L algorithms have smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. A theoretical analysis of space and time complexity is also given for all considered algorithms. Finally, we show that the FSS2L algorithms can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.
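
For context, the sketch below is the classic Space-Saving scheme that FSS and the proposed two-list framework build on; it is not the FSS2L algorithms themselves, which the abstract reports as having smaller expected estimation errors.

```python
def space_saving(stream, k):
    """Classic Space-Saving: track at most k items with (count, error) pairs."""
    counters = {}                     # item -> [count, overestimation error]
    for item in stream:
        if item in counters:
            counters[item][0] += 1
        elif len(counters) < k:
            counters[item] = [1, 0]
        else:
            # Evict the item with the smallest count; the newcomer inherits
            # that count as its potential overestimation error.
            victim, (min_count, _) = min(counters.items(),
                                         key=lambda kv: kv[1][0])
            del counters[victim]
            counters[item] = [min_count + 1, min_count]
    return counters  # frequent-item candidates with estimated counts
```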

Read

17-MAR-2017

Language-Independent Information Flow Tracking Engine for Program Comprehension Tools

Program comprehension tools are often developed for a specific programming language. Developing such a tool from scratch requires significant effort. In this paper, we report on our experience developing a language-independent framework that enables the creation of program comprehension tools, specifically tools gathering insight from deep dynamic analysis, with little effort. Our framework is language independent because it is built on top of Truffle, an open-source platform, developed in Oracle Labs, for implementing dynamic languages in the form of AST interpreters. Our framework supports the creation of a diverse variety of program comprehension techniques, such as queries, program slicing, and back-in-time debugging, because it is centered around a powerful information-flow tracking engine. Tools developed with our framework get access to the information flow through a program execution. While it is possible to develop similarly powerful tools without our framework, for example by tracking information flow through bytecode instrumentation, our approach leads to information that is closer to source-code constructs, and thus more comprehensible to the user. To demonstrate the effectiveness of our framework, we applied it to two Truffle-based languages, namely Simple Language and TruffleRuby, and we distill our experience into guidelines for developers of other Truffle-based languages who want to build program comprehension tools for their language.
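
The Truffle-based engine operates inside the AST interpreter and supports far richer analyses; the toy sketch below only illustrates the underlying idea of dynamic information-flow tracking, namely propagating a set of source labels alongside each value. Names here are invented for illustration.

```python
class Tracked:
    """A value paired with the set of source labels that influenced it."""

    def __init__(self, value, sources=frozenset()):
        self.value = value
        self.sources = frozenset(sources)

    def _lift(self, other):
        return other if isinstance(other, Tracked) else Tracked(other)

    def __add__(self, other):
        # Every operation unions the provenance of its operands into the result.
        other = self._lift(other)
        return Tracked(self.value + other.value, self.sources | other.sources)

    def __mul__(self, other):
        other = self._lift(other)
        return Tracked(self.value * other.value, self.sources | other.sources)

    def __repr__(self):
        return f"Tracked({self.value!r}, from={sorted(self.sources)})"

# Example: the result records that both inputs flowed into it.
a = Tracked(2, {"input:a"})
b = Tracked(3, {"input:b"})
print(a * b + 1)   # Tracked(7, from=['input:a', 'input:b'])
```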

Read

15-MAR-2017

An Efficient Tunable Selective Points-to Analysis for Large Codebases

Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. Although object-based context sensitivity is known to improve the precision of points-to analysis, scaling it to large Java codebases remains a challenge. In this work, we develop a tunable, client-independent, object-sensitive points-to analysis framework where heap cloning is applied selectively. This approach is aimed at large codebases where standard analysis is typically expensive. Our design includes a pre-analysis that determines the program points that contribute to the cost of an object-sensitive points-to analysis. A subsequent analysis then determines the context depth for each allocation site. While our framework can run standalone, it is also possible to tune it: the user of the framework can use knowledge of the codebase being analysed to influence the selection of expensive program points as well as the process of differentiating the required context depth. Overall, the approach determines where cloning is beneficial and where it is unlikely to be. We have implemented our approach using Soufflé (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs, including OpenJDK, show that our technique is efficient and precise. For OpenJDK, our analysis reduces runtime by 27% and memory usage by 18% for a negligible loss of precision, while for Jython from the DaCapo benchmark suite, the same analysis reduces runtime by 91% with no loss of precision.
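
The sketch below is a toy illustration of selective heap cloning: each allocation site gets its own context depth (a hypothetical dictionary standing in for the pre-analysis), and an abstract heap object is named by the site plus a truncated context. It uses a generic context tuple rather than the paper's object-sensitive contexts; all site names and depths are invented for illustration.

```python
def abstract_object(alloc_site, context, depth):
    """Name a heap object by its allocation site plus a truncated context.

    depth is chosen per allocation site (e.g. by a pre-analysis); depth 0
    collapses all contexts into one abstract object, larger depths clone
    the site per context suffix.
    """
    return (alloc_site,) + (tuple(context[-depth:]) if depth else ())

# Hypothetical per-site depths produced by a pre-analysis.
depths = {"new A@L10": 0, "new B@L42": 1}

# Two different contexts reaching each allocation site.
ctx1, ctx2 = ("main", "foo"), ("main", "bar")

a1 = abstract_object("new A@L10", ctx1, depths["new A@L10"])
a2 = abstract_object("new A@L10", ctx2, depths["new A@L10"])
assert a1 == a2   # depth 0: no cloning, one abstract object for the site

b1 = abstract_object("new B@L42", ctx1, depths["new B@L42"])
b2 = abstract_object("new B@L42", ctx2, depths["new B@L42"])
assert b1 != b2   # depth 1: the site is cloned per context, gaining precision
```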

Read

