Our Publications
Every year our researchers publish hundreds of papers to share their findings with the industry and the academic community. Our primary research areas are big data and machine learning, cloud computing, and programming languages.
Research Papers
Towards an Abstraction for Verifiable Credentials and Zero Knowledge Proofs
Most standards efforts and projects around Verifiable Credentials either do not enable use of Zero Knowledge Proofs to balance privacy and accountability, or are too tightly tied to specific cryptographic libraries, which limits choice, flexibility, progress and sustainability. For example, if a project targets a cryptographic library that stops being maintained or otherwise becomes an undesirable dependency, these events can threaten the sustainability of the whole project. We are working on an abstraction to address this problem, which has additional benefits such as making it much simpler to express and understand use case requirements, especially for people without expertise in using specific cryptography libraries. These slides share some of our observations, ideas, experience and opinions so far.
Macaron: A Logic-based Framework for Software Supply Chain Security Assurance
Many software supply chain attacks exploit the fact that what is in a source code repository may not match the artifact that is actually deployed in one’s system. This paper describes a logic-based framework that analyzes a software component and its dependencies to determine if they are built in a trustworthy fashion. The properties that are checked include the availability of build provenances and whether the build and deployment process of an artifact is tamper resistant. These properties are based on open-source community efforts, such as SLSA, that enable an incremental approach to improving supply chain security. We evaluate our tool on the top-30 Java, Python, and npm open-source projects and show that the majority still do not produce provenances. Our evaluation also shows that a large number of open-source Java and Python projects do not have a transparent build platform to produce artifacts, which is a necessary requirement to increase the trust in the published artifacts. We show that our tool fills a gap in the current software supply chain security landscape, and, by making it publicly available, we enable the open-source community to both benefit from and contribute to it.
Smoothing Entailment Graphs with Language Models
The diversity and Zipfian frequency distribution of natural language predicates in corpora leads to sparsity in Entailment Graphs (EGs) built by Open Relation Extraction (ORE). EGs are theoretically-founded and computationally efficient, but as symbolic models for natural language inference, they fail if a novel premise or hypothesis vertex is missing at test-time. We introduce a theory of optimal graph smoothing to overcome vertex sparsity by constructing transitive chains. We then demonstrate an efficient, open-domain smoothing method using an off-the-shelf Language Model to find approximations of missing premise predicates, improving recall by 25.1 and 16.3 percentage points on two difficult directional entailment datasets while raising average precision. Further, in a recent QA task, we show that EG smoothing is most useful for answering questions with less supporting text, where missing predicates are more costly. Finally, in controlled experiments with WordNet we show that hypothesis smoothing is difficult, but possible in principle.
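As a rough illustration of the smoothing idea (not the paper's actual method), the sketch below uses an off-the-shelf sentence-embedding model to map an unseen premise predicate to its nearest known graph vertex before looking up entailments. The encoder name and the toy graph are assumptions.

```python
# Illustrative sketch only: approximate a missing premise predicate with its
# nearest known Entailment Graph vertex using an off-the-shelf language model.
from sentence_transformers import SentenceTransformer
import numpy as np

# Toy entailment graph: premise predicate -> entailed hypothesis predicates (assumed data).
entailment_graph = {
    "defeat": ["play against", "compete with"],
    "acquire": ["own", "negotiate with"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed choice of encoder
known = list(entailment_graph)
known_emb = model.encode(known, normalize_embeddings=True)

def smoothed_entailments(premise: str):
    """If the premise is missing from the graph, back off to its nearest known vertex."""
    if premise in entailment_graph:
        return entailment_graph[premise]
    q = model.encode([premise], normalize_embeddings=True)[0]
    nearest = known[int(np.argmax(known_emb @ q))]   # cosine similarity via dot product
    return entailment_graph[nearest]

print(smoothed_entailments("triumph over"))  # backs off to "defeat"
```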
Diagnosing Compiler Performance by Comparing Optimization Decisions
Modern compilers apply a set of optimization passes aiming to speed up the generated code. The combined effect of individual optimizations is hard to predict. Thus, changes to a compiler’s code may hinder the performance of generated code as an unintended consequence. Performance regressions are often related to misapplied optimizations. The regressions are hard to investigate, considering the vast number of compilation units and applied optimizations. Additionally, a method may be part of several compilation units and optimized differently in each. Moreover, compiled methods and inlining decisions are not invariant across runs of the virtual machine (VM). We propose to solve the problem of diagnosing performance regressions by capturing the compiler’s optimization decisions. We do so by representing the applied optimization phases, optimization decisions, and inlining decisions in the form of trees. This paper introduces an approach utilizing tree edit distance (TED) to detect optimization differences in a semi-automated way. We present an approach to compare optimization decisions in differently-inlined methods. We employ these techniques to pinpoint the causes of performance problems in various benchmarks of the Graal compiler.
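As a small, hedged illustration of comparing optimization-decision trees with tree edit distance (not the Graal tooling itself), the following sketch uses the open-source `zss` implementation of the Zhang-Shasha algorithm; the tree contents are made up.

```python
# Illustrative only: compare two (made-up) optimization-decision trees with
# tree edit distance using the zss library (Zhang-Shasha algorithm).
from zss import Node, simple_distance

# Optimization phases/decisions recorded for the same method in two compiler versions.
before = (Node("Method foo")
          .addkid(Node("LoopPeeling: applied"))
          .addkid(Node("Inlining: bar inlined")))

after = (Node("Method foo")
         .addkid(Node("LoopPeeling: not applied"))   # decision changed
         .addkid(Node("Inlining: bar inlined")))

# A non-zero distance flags methods whose optimization decisions diverged,
# which are candidates for closer inspection in a performance regression.
print("tree edit distance:", simple_distance(before, after))
```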
Automated Machine Learning with Explainability
ML has revolutionized a wide range of industry applications with new techniques for consuming complex data modalities, such as images and text. However, for a given dataset and business use case, non-technical users face adoption-limiting questions, such as which model to use and how to set its hyper-parameters. This is challenging and time-consuming even for seasoned data scientists. The AutoMLx team at Oracle Labs has developed an automated machine learning pipeline with explainability tools built in for novice and advanced users. In this talk, we provide an overview of our current and upcoming AutoMLx features and some applications; for example, how to predict construction site delays and how to forecast CPU resource usage based on previous consumption trends.
Security Research: Program Analysis Meets Security
In this paper we present the key features of some of the security analysis tools developed at Oracle Labs. These include Parfait, a static analyser; Affogato, a dynamic analysis tool based on run-time instrumentation of Node.js applications; and Gelato, a dynamic analysis tool that inspects only the client-side code written in JavaScript. We show how these tools can be integrated at different phases of the software development life-cycle. This paper is based on the presentation at the ICTAC school in 2021.
Improving Points-to Analysis with Compiler Optimizations
Points-to analysis and compiler optimizations are often seen as separate topics: Even when points-to analysis is used to optimize an application, the analysis is not leveraging the compiler used later for compilation. We integrate a points-to analysis into the compilation process, i.e., use the compiler IR as input for the analysis and apply the analysis results back into the same compiler IR. This makes it easy to run compiler optimization phases like method inlining and constant folding before the analysis, which increases the precision of the analysis. Also, this simplifies the process of applying analysis results because it eliminates the need to map the analysis results back to the unaltered program. In order to run points-to analysis as part of every build process during development, it also needs to scale well for large applications. To improve the scalability of points-to analysis, we propose saturation to remove variables for which the analysis finds more than a certain number of types. We show that saturation significantly reduces the analysis time while having only a small impact on the precision. To show the scalability and precision of our proposed approach, we evaluate the resulting system with Java web services and benchmarks. We compare our optimized analysis with various configurations of a context-insensitive analysis and also a standard context-sensitive points-to analysis definition.
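To make the saturation idea concrete, here is a minimal, language-agnostic sketch (not the actual analysis implementation): each variable accumulates a type set, and once the set exceeds a threshold the variable is marked saturated and treated conservatively instead of being tracked further. The threshold and names are assumptions.

```python
# Minimal sketch of type-flow saturation in a points-to analysis (illustrative only).
SATURATION_THRESHOLD = 3  # assumed cutoff

class TypeFlow:
    def __init__(self, name):
        self.name = name
        self.types = set()
        self.saturated = False

    def add_types(self, new_types):
        """Propagate types into this flow; saturate once the set grows too large."""
        if self.saturated:
            return
        self.types |= set(new_types)
        if len(self.types) > SATURATION_THRESHOLD:
            self.types.clear()        # stop tracking individual types
            self.saturated = True     # from now on: "any type" (conservative)

    def possible_types(self, all_types):
        return all_types if self.saturated else self.types

flow = TypeFlow("receiver of x.toString()")
flow.add_types({"String", "Integer"})
flow.add_types({"Long", "Double"})           # pushes the flow past the threshold
print(flow.saturated)                         # True: analysed conservatively, but cheaply
```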
Vibration Resonance Spectrometry (VRS) for the Advanced Streaming Detection of Rotor Unbalance
Determination of diagnosis thresholds is crucial for the fault diagnosis of industry assets. Rotor machines under different working conditions are especially challenging because of the dynamic torque and speed. In this paper, an advanced machine-learning-based signal processing innovation termed the multivariate state estimation technique is proposed to improve the accuracy of the diagnosis thresholds. A novel preprocessing technique called vibration resonance spectrometry is also applied to achieve a low computation cost capability for real-time condition monitoring. The monitoring system that utilizes the above methods is then applied for prognostics of a fan model as an example. Different levels of radial unbalance were added to the fan and tested, and then compared with the healthy state. The results show that the proposed methodology can detect the unbalance with good accuracy and low computation cost. The proposed methodology can be applied to complex engineering assets for better predictive monitoring that could be processed with on-premise edge devices, or eventually a cloud platform, due to its capacity for lossless dimension reduction.
Better Distributed Graph Query Planning With Scouting Queries
Query planning is essential for graph query execution performance. In distributed graph processing, data partitioning and messaging significantly influence performance. However, these aspects are difficult to model analytically, which makes query planning especially challenging. This paper introduces scouting queries, a lightweight mechanism to gather runtime information about different query plans, which can then be used to choose the “best” plan. In a depth-first-oriented graph processing engine, scouting queries typically execute for a brief amount of time with negligible overhead. Partial results can be reused to avoid redundant work. We evaluate scouting queries and show that they bring speedups of up to 8.7× for heavy queries, while adding low overhead for queries that do not benefit.
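The following toy sketch (not PGX.D code) shows the basic mechanism under assumed plan generators: each candidate plan runs for a short time budget, the runtime observes how much progress it makes, and the most productive plan is chosen while its partial results are kept for reuse.

```python
# Toy illustration of scouting queries: briefly run each candidate plan,
# then pick the plan that produced the most matches per unit of time.
import time

def scout(plans, budget_s=0.05):
    """plans: dict name -> generator factory yielding partial results."""
    scores, partials = {}, {}
    for name, make_plan in plans.items():
        produced, deadline = [], time.perf_counter() + budget_s
        for row in make_plan():
            produced.append(row)
            if time.perf_counter() >= deadline:
                break
        scores[name] = len(produced)      # observed throughput within the budget
        partials[name] = produced         # partial results can be reused later
    best = max(scores, key=scores.get)
    return best, partials[best]

# Hypothetical plans that enumerate the same matches in different orders.
plans = {
    "start_from_person": lambda: (i for i in range(10_000_000)),
    "start_from_city":   lambda: (i for i in range(10_000_000) if i % 2 == 0),
}
best, reusable = scout(plans)
print("chosen plan:", best, "| partial results kept:", len(reusable))
```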
Towards Intelligent Application Security
Over the past 20 years we have seen application security evolve from analysing application code through Static Application Security Testing (SAST) tools, to detecting vulnerabilities in running applications via Dynamic Application Security Testing (DAST) tools. The past 10 years have seen new flavours of tools that provide combinations of static and dynamic analysis via Interactive Application Security Testing (IAST), examination of the components and libraries of the software through Software Composition Analysis (SCA), protection of web applications and APIs using signature-based Web Application Firewalls (WAF), and monitoring of the application and blocking of attacks through Runtime Application Self Protection (RASP) techniques. The past 10 years have also seen an increase in the uptake of the DevOps model, which combines software development and operations to provide continuous delivery of high-quality software. As security has become more important, the DevOps model has evolved into the DevSecOps model, where software development, operations and security are all integrated. There has also been increasing usage of learning techniques, including machine learning and program synthesis. Several tools have been developed that use machine learning to help developers make quality decisions about their code, tests, or the runtime overhead their code produces. However, such techniques have not yet been applied to application security. In this talk I discuss how to provide an automated approach to integrating security into all aspects of application development and operations, aided by learning techniques. This approach incorporates signals from code, operations and beyond, together with automation, to provide actionable intelligence to developers, security analysts, operations staff, and autonomous systems. I will also consider how malware and threat intelligence can be incorporated into this model to support Intelligent Application Security in a rapidly evolving world. Bio: https://labs.oracle.com/pls/apex/f?p=94065:11:8452080560451:21 LinkedIn: https://www.linkedin.com/in/drcristinacifuentes/ Twitter: @criscifuentes
A Reachability Index for Recursive Label-Concatenated Graph Queries
Reachability queries checking the existence of a path from a source node to a target node are fundamental operators for querying and processing graph data. Current approaches for index-based evaluation of reachability queries either focus on plain reachability or constraint-based reachability with only alternation of labels. In this paper, for the first time we study the problem of index-based processing for recursive label-concatenated reachability queries, referred to as RLC queries. These queries check the existence of a path that can satisfy the constraint defined by a concatenation of at most k edge labels under the Kleene plus. Many practical graph database and network analysis applications exhibit RLC queries. However, their evaluation remains prohibitive in current graph database engines. We introduce the RLC index, the first reachability index to efficiently process RLC queries. The RLC index checks whether the source vertex can reach an intermediate vertex that can also reach the target vertex under a recursive label-concatenated constraint. We propose an indexing algorithm to build the RLC index, which guarantees the soundness and the completeness of query execution and avoids recording redundant index entries. Comprehensive experiments on real-world graphs show that the RLC index can significantly reduce both the offline processing cost and the memory overhead of transitive closure, while improving query processing up to six orders of magnitude over online traversals. Finally, our open-source implementation of the RLC index significantly outperforms current mainstream graph engines for evaluating RLC queries.
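The sketch below is only meant to make the query semantics concrete; it is a plain online traversal over a toy labeled graph, not the RLC index. It checks whether some path from source to target has an edge-label sequence matching the concatenation of the given labels repeated one or more times (Kleene plus).

```python
# Illustrative online check for an RLC query (l1 ... lk)+ on a small labeled graph.
# The actual paper builds an index instead of traversing at query time.
from collections import deque

# Adjacency list with edge labels: node -> [(label, neighbor), ...] (toy data).
graph = {
    "a": [("knows", "b")],
    "b": [("worksAt", "c")],
    "c": [("knows", "d")],
    "d": [("worksAt", "e")],
}

def rlc_reachable(src, dst, labels):
    """Is there a path src -> dst whose label sequence matches (labels)+ ?"""
    k = len(labels)
    # State: (current node, position inside the label concatenation).
    seen, queue = {(src, 0)}, deque([(src, 0)])
    while queue:
        node, pos = queue.popleft()
        if node == dst and pos == 0 and node != src:
            return True                      # completed one or more full repetitions
        for label, nxt in graph.get(node, []):
            if label == labels[pos]:
                state = (nxt, (pos + 1) % k)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

print(rlc_reachable("a", "e", ["knows", "worksAt"]))  # True: a (knows worksAt)+ path exists
```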
Oracle AutoMLx
This presentation introduces Oracle Labs' AutoMLx package to an audience of university students.
AutoML on the Half Shell: How are our Oysters?
This is a presentation to be given at the Analytics and Data Summit 2023 (Redwood Shores, CA, March 14, 2023). It combines two public talks from CloudWorld 2022: 1. a general AutoMLx overview, and 2. a specific ML use case in which an oyster dataset (in collaboration with the University of New Orleans) is used to showcase AutoML in Oracle Machine Learning (OML). The task is to predict health risk to oysters. We are allowed to use the dataset as we have a signed DUA between Oracle and the University of New Orleans.
Control Flow Duplication for Columnar Arrays in a Dynamic Compiler
Columnar databases are an established way to speed up online analytical processing (OLAP) queries. Nowadays, data processing (e.g., storage, visualization, and analytics) is often performed at the programming language level, hence it is desirable to also adopt columnar data structures for common language runtimes. While there are frameworks, libraries, and APIs to enable columnar data stores in programming languages, their integration into applications typically requires developer intervention. In prior work, researchers implemented an approach for automated transformation of arrays into columnar arrays in the GraalVM JavaScript runtime. However, this approach suffers from performance issues on smaller workloads as well as on more complex nested data structures. We find that the key to optimizing accesses to columnar arrays is to identify queries and apply specific optimizations to them. In this paper, we describe novel compiler optimizations in the GraalVM Compiler that optimize queries on columnar arrays. At (JIT) compile time, we identify loops that access potentially columnar arrays and duplicate them in order to specifically optimize accesses to columnar arrays. Additionally, we describe a new approach for creating columnar arrays from arrays consisting of complex objects by performing multi-level storage transformation. We demonstrate our approach via an implementation for JavaScript Date objects. Our work shows that automatic transformation of arrays to columnar storage is feasible even for small workloads and that more complex arrays of objects could benefit from a multi-level transformation. Furthermore, we show how we can optimize methods that handle arrays in different states by the use of duplication. We evaluated our work on microbenchmarks and established data analytics workloads (TPC-H) to demonstrate that it significantly outperforms previous efforts, with speedups of up to 14x for particular queries. Queries additionally benefit from multi-level transformation, reaching speedups of up to 5x. Additionally, we show that we do not cause significant overhead on workloads not suitable for storage transformation. We argue that automatically created columnar arrays could aid developers in data-centric applications as an alternative approach to using dedicated APIs on manually created columnar arrays. Via automatic detection and optimization of queries on potentially columnar arrays, we can improve performance of data processing and further enable its use in common—particularly dynamic—programming languages.
Presentation of Prognostic and Health Management System in AeroConf 2023
Oracle has an anomaly detection solution for monitoring time-series telemetry signals for dense-sensor IoT prognostic applications. It integrates an advanced prognostic pattern recognition technique called the Multivariate State Estimation Technique (MSET) for high-sensitivity prognostic fault monitoring applications in commercial nuclear power and aerospace applications. MSET has since been spun off and met with commercial success for prognostic Machine Learning (ML) applications in a broad range of safety-critical applications, including NASA space shuttles, oil-and-gas asset prognostics, and commercial aviation streaming prognostics. MSET possesses significant advantages over conventional ML solutions including neural networks, autoassociative kernel regression, and support vector machines. The main advantages include earlier warning of incipient anomalies in complex time-series signatures, and much lower compute overhead due to the deterministic mathematical structure of MSET. Both are crucial for dense-sensor avionic IoT prognostics. In addition, Oracle has developed an extensive portfolio of data preprocessing innovations around MSET to solve the common big-data challenges that cause conventional ML algorithms to perform poorly in terms of prognostic accuracy (i.e., false/missed-alarm probabilities). Oracle's MSET-based prognostic solution helps increase avionic reliability margins and system availability objectives while reducing costly sources of “no fault found” events that have become a significant sparing-logistics issue for many industries including aerospace and avionics. Moreover, by utilizing and correlating information from all on-board telemetry sensors (e.g., distributed pressure, voltage, temperature, current, airflow and hydraulic flow), MSET is able to provide the best possible prediction of failure precursors and onset of small degradation for the electronic components used on aircraft, benefiting the aviation Prognostics and Health Management (PHM) system.
Smoothing Entailment Graphs with Language Models
The diversity and Zipfian frequency distribution of natural language predicates in corpora leads to sparsity when learning Entailment Graphs. As symbolic models for natural language inference, an EG cannot recover if missing a novel premise or hypothesis at test-time. In this paper we approach the problem of vertex sparsity by introducing a new method of graph smoothing, using a Language Model to find the nearest approximations of missing predicates. We improve recall by 25.1 and 16.3 absolute percentage points on two difficult directional entailment datasets while exceeding average precision, and show a complementarity with other improvements to edge sparsity. On an extrinsic QA task, we show that smoothing benefits the lower-resource questions, those with less available context. We further analyze language model embeddings and discuss why they are naturally suitable for premise-smoothing, but not hypothesis smoothing. Finally, we formalize a theory for smoothing a symbolic inference method by constructing transitive chains to smooth both the premise and hypothesis.
Introduction to graph processing with PGX (guest lecture at ENSIMAG)
Graph processing is already an integral part of big-data analytics, mainly because graphs can naturally represent data that capture fine-grained relationships among entities. Graph analysis can provide valuable insights about such data by examining these relationships. In this presentation, we will first introduce the concept of graphs and illustrate why and how graph processing can be a valuable tool for data scientists. We will then describe the differences between graph analytics/algorithms (such as Pagerank [1]) and graph queries (such as `(:person)-[:friend]->(:person)`). Second, we will summarize the different tools and technologies included in our Oracle Labs PGX [2] project and show how they provide efficient solutions to the main graph-processing problems. Finally, we will describe a few current and future directions in graph processing, including graph machine learning and distributed graphs (that could potentially lead to great topics for internships).
Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle
Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).
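The idea can be sketched with ONNX Runtime's Python API (the paper's implementation lives inside the OnnxRuntime framework itself; the model path, input name, and core counts below are assumptions): split an inference job into independent parts and give each part a session restricted to a few cores, instead of letting one session occupy every core.

```python
# Hedged sketch of the divide-and-conquer idea with ONNX Runtime's Python API.
# Assumptions: "model.onnx" exists, its input tensor is called "input",
# and the machine has enough cores for two 4-thread sessions.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import onnxruntime as ort

def make_session(n_threads):
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = n_threads      # cap the cores this part may use
    return ort.InferenceSession("model.onnx", sess_options=opts)

batch = np.random.rand(8, 3, 224, 224).astype(np.float32)
parts = np.array_split(batch, 2)               # break the job into independent parts
sessions = [make_session(4), make_session(4)]  # e.g. 2 parts x 4 cores instead of 1 x 8

def run(sess, x):
    return sess.run(None, {"input": x})[0]

with ThreadPoolExecutor(max_workers=len(parts)) as pool:
    outputs = list(pool.map(run, sessions, parts))

result = np.concatenate(outputs)               # same result as one big run, often faster on CPUs
```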
Exploring topic models to discern zero-day vulnerabilities on Twitter through a case study on log4shell
Twitter has demonstrated advantages in providing timely information about zero-day vulnerabilities and exploits. The large volume of unstructured tweets, on the other hand, makes it difficult for cybersecurity professionals to perform manual analysis and investigation into critical cyberattack incidents. To improve the efficiency of data processing on Twitter, we propose a novel vulnerability discovery and monitoring framework that can collect and organize unstructured tweets into semantically related topics with temporal dynamic patterns. Unlike existing supervised machine learning methods that process tweets based on a labelled dataset, our framework is unsupervised, making it better suited for analyzing emerging cyberattack and vulnerability incidents when no prior knowledge is available (e.g., zero-day vulnerabilities and incidents). The proposed framework compares three topic modeling techniques (Latent Dirichlet Allocation, Non-negative Matrix Factorization and Contextualized Topic Modeling) in combination with different text representation methods (bag-of-words and contextualized pre-trained language models) on a Twitter dataset collected from 47 influential users in the cybersecurity community. We show how the proposed framework can be used to analyze a critical zero-day vulnerability incident (Log4shell) in the Apache Log4j Java library in order to understand its temporal evolution and dynamic patterns across its vulnerability life-cycle. Results show that our proposed framework can be used to effectively analyze vulnerability-related topics and their dynamic patterns. Twitter can reveal valuable information regarding early indicators of exploits and user behaviors. The pre-trained contextualized text representation shows advantages for the unstructured, domain-dependent, sparse Twitter textual data in the cybersecurity domain.
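A minimal sketch of the kind of pipelines the framework compares (not the authors' code; the tweets below are placeholders): bag-of-words with LDA and TF-IDF with NMF, both from scikit-learn.

```python
# Minimal sketch of two of the compared topic models (LDA and NMF) using scikit-learn.
# The tweets below are placeholders, not the paper's dataset.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

tweets = [
    "log4shell exploit observed in the wild against log4j",
    "patch your log4j deployments now, CVE-2021-44228 is critical",
    "new phishing campaign targets cloud credentials",
]

# Bag-of-words + LDA
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_bow)

# TF-IDF + NMF
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(tweets)
nmf = NMF(n_components=2, random_state=0).fit(X_tfidf)

def top_words(model, vocab, n=5):
    return [[vocab[i] for i in comp.argsort()[-n:]] for comp in model.components_]

print(top_words(lda, bow.get_feature_names_out()))
print(top_words(nmf, tfidf.get_feature_names_out()))
```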
Distributed Graph Processing with PGX.D (2022)
Graph processing is one of the top data analytics trends. In particular, graph processing comprises two main styles of analysis, namely graph algorithms and graph pattern-matching queries. Classic graph algorithms, such as Pagerank, repeatedly traverse the vertices and edges of the graph and calculate some desired (mathematical) function. Graph queries enable the interactive exploration and pattern matching of graphs. For example, queries like `SELECT p1.name, p2.name FROM MATCH (p1:person)-[:friend]->(p2:person) WHERE p1.country = p2.country` combine the classic operations found in SQL with graph patterns. Both algorithms and queries are very challenging workloads, especially in a distributed setting, where very large graphs are partitioned across multiple machines. In this lecture, I will present how the distributed PGX [1] engine (known as PGX.D; developed at Oracle Labs [2] Zurich) implements efficient algorithms and queries and solves problems, such as data skew and intermediate-result explosion. In brief, for graph algorithms, PGX.D offers the functionality to compile simple sequential textbook-style GreenMarl [3] algorithms to efficient distributed execution. For queries, PGX.D includes a depth-first asynchronous computation runtime [4] that enables limiting the amount of intermediate data during query execution to essentially support "any-size" patterns. [1] http://www.oracle.com/technetwork/oracle-labs/parallel-graph-analytix/overview/index.html [2] https://labs.oracle.com [3] Green-Marl: A DSL for easy and efficient graph analysis, ASPLOS'12. [4] aDFS: An Almost Depth-First-Search Distributed Graph-Querying System. USENIX ATC'21.
EMNLP'22 Presentation of Proxy Clean Work: Mitigating Bias by Proxy in Pre-Trained Models
Transformer-based pre-trained models are known to encode societal biases, not only in their contextual representations but also in their downstream predictions when fine-tuned on task-specific data. We present D-BIAS, an approach that selectively eliminates stereotypical associations (e.g., co-occurrence statistics) at fine-tuning, such that the model doesn’t learn to excessively rely on those signals. D-BIAS attenuates biases from both identity words and frequently co-occurring proxies, which we select using pointwise mutual information. We apply D-BIAS to a) occupation classification, and b) toxicity classification and find that our approach substantially reduces downstream biases (> 60% in toxicity classification for identities that are most frequently flagged as toxic on online platforms). In addition, we show that D-BIAS dramatically improves upon scrubbing, i.e., removing only the identity words in question. We also demonstrate that D-BIAS easily extends to multiple identities and achieves competitive performance with two recently proposed debiasing approaches: R-LACE and INLP.
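As a toy illustration of the proxy-selection step only (pointwise mutual information between an identity word and co-occurring tokens; not the full D-BIAS method), the corpus and counts below are made up.

```python
# Toy illustration of selecting proxy words by pointwise mutual information (PMI)
# with an identity term; the corpus is made up.
import math
from collections import Counter

docs = [
    "nurse she hospital caring",
    "engineer he code deadline",
    "nurse she shift night",
    "engineer he server outage",
]
identity = "she"

doc_tokens = [set(d.split()) for d in docs]
n = len(doc_tokens)
word_count = Counter(w for toks in doc_tokens for w in toks)
joint_count = Counter(w for toks in doc_tokens if identity in toks for w in toks)

def pmi(word):
    p_w = word_count[word] / n
    p_id = word_count[identity] / n
    p_joint = joint_count[word] / n
    return math.log(p_joint / (p_w * p_id)) if p_joint else float("-inf")

proxies = sorted((w for w in word_count if w != identity), key=pmi, reverse=True)
print(proxies[:3])   # frequently co-occurring proxies (e.g. 'nurse') to attenuate alongside the identity word
```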
Feeling Validated: Constructing Validation Sets for Few-Shot Learning
We study validation set construction via data augmentation in true few-shot text classification. Empirically, we show that task-agnostic methods---known to be ineffective for improving test set accuracy for state-of-the-art models when used to augment the training set---are effective for model selection when used to build validation sets. However, accuracy on validation sets synthesized via these techniques does not provide a good estimate of test set accuracy. To support better estimates, we propose DAugSS, a generative method for domain-specific data augmentation that is trained once on task-agnostic data and then employed for augmentation on any data set, by using provided training examples and a set of guide words as a prompt. In experiments with 6 data sets, both 5 and 10 examples per class, training the last layer weights and full fine-tuning, and the choice of 4 continuous-valued hyperparameters, DAugSS is better than or competitive with other methods of validation set construction, while also facilitating better estimates of test set accuracy.
Feeling Validated: Constructing Validation Sets for Few-Shot Intent Classification
We study validation set construction via data augmentation in true few-shot intent classification. Empirically, we demonstrate that with scarce data, model selection via a moderate number of generated examples consistently leads to higher test set accuracy than either model selection via a small number of held-out training examples, or selection of the model with the lowest training loss. For each of these methods of model selection -- including validation sets built from task-agnostic data augmentation -- validation accuracy provides a significant overestimate of test set accuracy. To support better estimates and effective model selection, we propose PanGeA, a generative method for domain-specific augmentation that is trained once on out-of-domain data, and then employed for augmentation for any domain-specific dataset. In experiments with 6 datasets that have been subsampled to both 5 and 10 examples per class, we show that PanGeA is better than or competitive with other methods in terms of model selection while also facilitating higher-fidelity estimates of test set accuracy.
A Multi-Target, Multi-Paradigm DSL Compiler for Algorithmic Graph Processing
Domain-specific language compilers need to close the gap between the domain abstractions of the language and the low-level concepts of the target platform. This can be challenging to achieve for compilers targeting multiple platforms with potentially very different computing paradigms. In this paper, we present a multi-target, multi-paradigm DSL compiler for algorithmic graph processing. Our approach centers around an intermediate representation and reusable, composable transformations to be shared between the different compiler targets. These transformations embrace abstractions that align closely with the concepts of a particular target platform, and disallow abstractions that are semantically more distant. Our compiler supports four different target platforms, each involving a different computing paradigm. We report on our experience implementing the compiler and highlight some of the challenges and requirements for applying language workbenches in industrial use cases.
Subject Level Differential Privacy with Hierarchical Gradient Averaging
Subject Level Differential Privacy (DP) is a granularity of privacy recently studied in the Federated Learning (FL) setting, where a subject is defined as an individual whose private data is embodied by multiple data records that may be distributed across a multitude of federation users. This granularity is distinct from item level and user level privacy appearing in the literature. Prior work on subject level privacy in FL focuses on algorithms that are derivatives of group DP or enforce user level Local DP (LDP). In this paper, we present a new algorithm – Hierarchical Gradient Averaging (HiGradAvgDP) – that achieves subject level DP by constraining the effect of individual subjects on the federated model. We prove the privacy guarantee for HiGradAvgDP and empirically demonstrate its effectiveness in preserving model utility on the FEMNIST and Shakespeare datasets. We also report, for the first time, a unique problem of privacy loss composition, which we call horizontal composition, that is relevant only to subject level DP in FL. We show how horizontal composition can adversely affect model utility by either increasing the noise necessary to achieve the DP guarantee, or by constraining the amount of training done on the model.
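A heavily simplified numpy sketch of the hierarchical-averaging idea (not the paper's full algorithm): per-record gradients are first averaged within each subject, so every subject contributes a single clipped vector before calibrated noise is added. The clip norm, noise scale, and data are assumptions.

```python
# Simplified sketch of hierarchical gradient averaging for subject-level DP.
# Not the paper's algorithm; clip norm, noise scale, and data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
CLIP, NOISE_STD = 1.0, 0.5

# Per-record gradients in a mini-batch, grouped by the subject they belong to.
grads_by_subject = {
    "subject_A": [rng.normal(size=4) for _ in range(3)],   # 3 records from subject A
    "subject_B": [rng.normal(size=4) for _ in range(1)],
}

def clip(v, c=CLIP):
    return v * min(1.0, c / (np.linalg.norm(v) + 1e-12))

# 1) Average within each subject, so each subject contributes one vector.
subject_grads = [clip(np.mean(g, axis=0)) for g in grads_by_subject.values()]

# 2) Average across subjects and add calibrated noise before the model update.
update = np.mean(subject_grads, axis=0)
update += rng.normal(scale=NOISE_STD * CLIP / len(subject_grads), size=update.shape)
print(update)
```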
Private and Robust Federated Learning using Private Information Retrieval and Norm Bounding
Federated Learning (FL) is a distributed learning paradigm that enables mutually untrusting clients to collaboratively train a common machine learning model. Client data privacy is paramount in FL. At the same time, the model must be protected from poisoning attacks from adversarial clients. Existing solutions address these two problems in isolation. We present FedPerm, a new FL algorithm that addresses both these problems by combining norm bounding for model robustness with a novel intra-model parameter shuffling technique that amplifies data privacy by means of Private Information Retrieval (PIR) based techniques that permit cryptographic aggregation of clients’ model updates. The combination of these techniques helps the federation server constrain parameter updates from clients so as to curtail effects of model poisoning attacks by adversarial clients. We further present FedPerm’s unique hyperparameters that can be used effectively to trade off computation overheads with model utility. Our empirical evaluation on the MNIST dataset demonstrates FedPerm’s effectiveness over existing Differential Privacy (DP) enforcement solutions in FL.
Machine Learning in Java
An overview of Java and Machine Learning, covering why you might want to write ML applications in more structured languages, what ML tools are available in the Java ecosystem, and some of the recent preview features in the JDK which improve numerical performance.
Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-Level Static Analysis
JavaScript is one of the most dominant programming languages. However, despite its popularity, it is a challenging task to correctly understand the behaviors of JavaScript programs because of their highly dynamic nature. Researchers have developed various static analyzers that strive to conform to ECMA-262, the standard specification of JavaScript. Unfortunately, all the existing JavaScript static analyzers require manual updates for new language features. This problem has become more critical since 2015 because the JavaScript language itself rapidly evolves with a yearly release cadence and open development process. In this paper, we present JSAVER, the first tool that automatically derives JavaScript static analyzers from language specifications. The main idea of our approach is to extract a definitional interpreter from ECMA-262 and perform a meta-level static analysis with the extracted interpreter. A meta-level static analysis is a novel technique that indirectly analyzes programs by analyzing a definitional interpreter with the programs. We also describe how to indirectly configure abstract domains and analysis sensitivities in a meta-level static analysis. For evaluation, we derived a static analyzer from the latest ECMA-262 (ES12, 2021) using JSAVER. The derived analyzer soundly analyzed all applicable 18,556 official conformance tests with 99.0% of precision in 590 ms on average. In addition, we demonstrate the configurability and adaptability of JSAVER with several case studies.
ESEC/FSE'22 presentation: Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-Level Static Analysis
JavaScript is one of the most dominant programming languages. However, despite its popularity, it is a challenging task to correctly understand the behaviors of JavaScript programs because of their highly dynamic nature. Researchers have developed various static analyzers that strive to conform to ECMA-262, the standard specification of JavaScript. Unfortunately, all the existing JavaScript static analyzers require manual updates for new language features. This problem has become more critical since 2015 because the JavaScript language itself rapidly evolves with a yearly release cadence and open development process. In this paper, we present JSAVER, the first tool that automatically derives JavaScript static analyzers from language specifications. The main idea of our approach is to extract a definitional interpreter from ECMA-262 and perform a meta-level static analysis with the extracted interpreter. A meta-level static analysis is a novel technique that indirectly analyzes programs by analyzing a definitional interpreter with the programs. We also describe how to indirectly configure abstract domains and analysis sensitivities in a meta-level static analysis. For evaluation, we derived a static analyzer from the latest ECMA-262 (ES12, 2021) using JSAVER. The derived analyzer soundly analyzed all applicable 18,556 official conformance tests with 99.0% of precision in 590 ms on average. In addition, we demonstrate the configurability and adaptability of JSAVER with several case studies.
Property Graph Support in Relational Database
Presentation to Data Community Conference Switzerland 2022 about the Property Graph feature in Oracle DB 23c.
Industrial Strength Static Detection for Cryptographic API Misuses
We describe our experience of building an industrial-strength cryptographic vulnerability detector, which aims to detect cryptographic API misuses in Java(TM). Based on the detection algorithms of CryptoGuard, we integrated the detection into the Oracle internal code scanning platform Parfait. The goal of the Parfait-based cryptographic vulnerability detection is to provide precise and scalable cryptographic code screening for large-scale industrial projects. We discuss the needs and challenges of static cryptographic vulnerability screening in an industrial environment.
Analysing Temporality in General-Domain Entailment Graphs
Entailment Graphs based on open relation extraction run the risk of learning spurious entailments (e.g. win against ⊨ lose to) from antonymous predications that are observed with the same entities referring to different times. Previous research has demonstrated the potential of using temporality as a signal to avoid learning these entailments in the sports domain. We investigate whether this extends to the general news domain. Our method introduces a temporal window that is set dynamically for each eventuality using a temporally-informed language model. We evaluate our models on a sports-specific dataset, and ANT – a novel general-domain dataset based on WordNet antonym pairs. We find that whilst it may be useful to reinterpret the Distributional Inclusion Hypothesis to include time for the sports news domain, this does not apply to the general news domain.
RASPunzel: A Novel RASP Solution
This document presents an overview of project RASPunzel. It highlights the approach of using an allowlist (instead of a deny list) and summarises the key advantages.
ML-SOCO: Machine Learning-Based Self-Optimizing Compiler Optimizations
Compiler optimizations often involve hand-crafted heuristics to guide the optimization process. These heuristics are designed to benefit the average program and are otherwise static or only customized by profiling information. We propose machine learning-based self-optimizing compiler optimizations (ML-SOCO), a novel approach for fitting optimizations in a dynamic compiler to a specific environment. ML-SOCO explores—at run time—the impact of optimization decisions and uses this data to train or update a machine learning model. Related work, which has primarily targeted static compilers, has already shown that machine learning can outperform human-crafted heuristics. Our approach is specifically tailored to dynamic compilation and uses concepts like deoptimization for transparently switching between generating data and performing machine learning decisions during compilation. We implemented ML-SOCO in the GraalVM compiler, which is one of the most highly optimizing Java compilers on the market. When evaluating ML-SOCO by replacing a loop peeling heuristic with a learned model we encountered multiple speedups larger than 30% in established benchmarks. Apart from improving performance, ML-SOCO can also be used to assist compiler engineers when improving heuristics for specific domains.
TruffleTaint: Polyglot Dynamic Taint Analysis on GraalVM
Dynamic taint analysis tracks the propagation of specific values while a program executes. To this end, a taint label is attached to these values and dynamically propagated to any values derived from them. Frequent application of this analysis technique in many fields has led to the development of general purpose analysis platforms with taint propagation capabilities. However, these platforms generally limit analysis developers to a specific implementation language, propagation semantics or taint label representation, and they provide no tooling support for analysis development. In this paper we present a language-agnostic approach for implementing a dynamic taint analysis independently of the analysis platform that it is executed on. We implemented this approach in TruffleTaint, a platform for taint propagation in multiple programming languages. We show how our approach enables TruffleTaint to provide analysis implementers with more control over the semantics and implementation language of their taint analysis than current analysis platforms and with a more capable development environment. We further show that our approach enables the development of both tooling infrastructure for taint analysis research and data-flow enabled tools for end-users.
Automatic Array Transformation to Columnar Storage at Run Time
Today’s huge memories make it possible to store and process large data structures in memory instead of in a database. Hence, accesses to this data should be optimized, which is normally relegated either to the runtimes and compilers or is left to the developers, who often lack the knowledge about optimization strategies. As arrays are often part of the language, developers frequently use them as an underlying storage mechanism. Thus, optimization of arrays may be vital to improve performance of data-intensive applications. While compilers can apply numerous optimizations to speed up accesses, it would also be beneficial to adapt the actual layout of the data in memory to improve cache utilization. However, runtimes and compilers typically do not perform such memory layout optimizations. In this work, we present an approach to dynamically perform memory layout optimizations on arrays of objects to transform them into a columnar memory layout, a storage layout frequently used in analytical applications that enables faster processing of read-intensive workloads. By integration into a state-of-the-art JavaScript runtime, our approach can speed up queries for large workloads by up to 7x, where the initial transformation overhead is amortized over time.
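The layout change itself is easy to picture; the plain-Python sketch below shows the row-to-columnar transformation on toy data (the paper's contribution is performing it transparently and dynamically inside the JavaScript runtime).

```python
# Plain-Python picture of the array-of-objects -> columnar (struct-of-arrays) change.
# The paper performs this transparently inside the JavaScript runtime at run time.
rows = [
    {"price": 10.0, "quantity": 3, "flag": "A"},
    {"price": 12.5, "quantity": 1, "flag": "B"},
    {"price":  7.0, "quantity": 4, "flag": "A"},
]

# Columnar layout: one contiguous array per property.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A read-intensive query now scans a single dense column instead of every object,
# which is what makes the columnar layout cache friendly.
revenue = sum(p * q for p, q in zip(columns["price"], columns["quantity"]))
print(columns["price"], revenue)
```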
Efficient Property Projections of Graph Queries over Relational Data
Specialized graph data management systems have made significant advances in storing and analyzing graph-structured data. However, a large fraction of the data of interest still resides in relational database systems (RDBMS) due to their maturity and for security reasons. Recent studies, in view of composability, show that the execution of graph queries over relational databases (i.e., a graph layer on top of an RDBMS) can provide competitive performance compared to specialized graph databases. While using the standard property graph model for graph querying, one of the main bottlenecks for efficient query processing, under memory constraints, is property projections, i.e., projecting properties of nodes along paths matching a given pattern. This is because graph queries produce a large number of matching paths, resulting in many requests to the data storage, or a large memory footprint, to access their properties. In this paper, we propose a set of novel techniques exploiting the inherent structure of the graph (a graph projection cache manager) to provide efficient property projections. The controlled memory footprint of our solution makes it practical in multi-tenant database deployments. The empirical results on a social graph show that our solution reduces the number of accesses to the data storage by more than an order of magnitude, resulting in graph queries being up to 3.1X faster than the baseline.
Proof Engineering with Predicate Transformer Semantics
We present a lightweight, open source Agda framework for manually verifying effectful programs using predicate transformer semantics. We represent the abstract syntax trees (AST) of effectful programs with a generalized algebraic datatype (GADT) AST, whose generality enables even complex operations to be primitive AST nodes. Users can then assign bespoke predicate transformers to such operations to aid the proof effort, for example by automatically decomposing proof obligations for branching code. Our framework codifies and generalizes a proof engineering methodology used by the authors to reason about a prototype implementation of LibraBFT, a Byzantine fault tolerant consensus protocol in which code executed by participants may have effects such as updating state and sending messages. Successful use of our framework in this context demonstrates its practical applicability.
FedPerm: Private and Robust Federated Learning by Parameter Permutation
Federated Learning (FL) is a distributed learning paradigm that enables mutually untrusting clients to collaboratively train a common machine learning model. Client data privacy is paramount in FL. At the same time, the model must be protected from poisoning attacks from adversarial clients. Existing solutions address these two problems in isolation. We present FedPerm, a new FL algorithm that addresses both these problems by combining a novel intra-model parameter shuffling technique that amplifies data privacy, with Private Information Retrieval (PIR) based techniques that permit cryptographic aggregation of clients’ model updates. The combination of these techniques further helps the federation server constrain parameter updates from clients so as to curtail effects of model poisoning attacks by adversarial clients. We further present FedPerm’s unique hyperparameters that can be used effectively to trade off computation overheads with model utility. Our empirical evaluation on the MNIST dataset demonstrates FedPerm’s effectiveness over existing Differential Privacy (DP) enforcement solutions in FL.
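A toy numpy sketch of the intra-model parameter-shuffling ingredient only (the PIR-based cryptographic aggregation and the norm bounding are omitted): clients permute their flattened update with a shared secret seed, the server averages the permuted vectors, and clients invert the permutation locally.

```python
# Toy sketch of intra-model parameter shuffling (one FedPerm ingredient);
# the PIR-based cryptographic aggregation and norm bounding are omitted.
import numpy as np

SHARED_SEED = 1234                      # known to clients, hidden from the server
perm = np.random.default_rng(SHARED_SEED).permutation(6)
inv_perm = np.argsort(perm)

def client_shuffle(update):
    return update[perm]                 # server sees parameters in permuted order

def client_unshuffle(aggregate):
    return aggregate[inv_perm]

client_updates = [np.arange(6, dtype=float), np.arange(6, dtype=float) * 2]
server_view = [client_shuffle(u) for u in client_updates]
aggregate = np.mean(server_view, axis=0)   # server aggregates without knowing positions
print(client_unshuffle(aggregate))          # clients recover the usable averaged update
```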
N-1 Experts: Unsupervised Anomaly Detection Model Selection
Manually finding the best combination of machine learning training algorithm, model and hyper-parameters can be challenging. In supervised settings, this burden has been alleviated with the introduction of automated machine learning (AutoML) methods. However, similar methods are noticeably absent for fully unsupervised applications, such as anomaly detection. We introduce one of the first such methods, N-1 Experts, which we compare to a recent state-of-the-art baseline, MetaOD, and show favourable performance.
Experimental Procedures for Exploiting Structure in AutoML Loss Landscapes
Recent observations regarding the structural simplicity of algorithm configuration landscapes have spurred the development of new configurators that obtain provably and empirically better performance. Inspired by these observations, we recently performed a similar analysis of AutoML Loss Landscapes – that is, the relationship between hyper-parameter configurations and machine learning model performance. In this study, we propose two new variations of an existing, state-of-the-art hyper-parameter configuration procedure. We designed each method to exploit a specific property that we observed common among most AutoML loss landscapes; however, we demonstrate that neither are competitive with existing baselines. In light of this result, we construct artificial algorithm configuration scenarios that allow us to show when the two new methods can be expected to outperform their baselines and when they cannot, thereby providing additional insights into AutoML loss landscapes.
Distinct Value Estimation from a Sample: Statistical Methods vs. Machine Learning
Estimating the number of distinct values (NDV) in a dataset is an important operation in modern database systems for many tasks, including query optimization. In large-scale systems, tables often contain billions of rows and wrong optimizer decisions can cause severe deterioration in query performance. Additionally, in many situations, such as having large tables or NDV estimation after the application of filters, it is not feasible to scan the entire dataset to compute the number of distinct values. In such cases, the only available option is to use a dataset sample to estimate the NDV. This, however, is not trivial as data properties of the sample usually do not mirror the properties of the full dataset. Approaches in related work have shown that this kind of estimation is connected to large errors. In this paper, we present two novel approaches for the problem of estimating the number of distinct values from a dataset sample. Our first approach presents a novel statistical estimator that shows good and robust results across a broad range of datasets. The second approach is based on Machine Learning (ML), hence being the first time that ML is applied to this problem. Both approaches outperform the state-of-the-art, with the ML approach reducing the average error by 3x for real-world datasets. Beyond pure prediction quality, both our approaches have their own set of advantages and disadvantages, and we show that the right approach actually depends on the specific application scenario.
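For orientation, here is one classical sample-based baseline, the GEE estimator (a well-known reference point, not one of the paper's two new approaches): it scales the number of values seen exactly once by the square root of the sampling ratio's inverse and adds the values seen more often.

```python
# Classical GEE baseline for estimating distinct values (NDV) from a sample.
# This is a standard reference estimator, not the paper's statistical or ML approach.
import math
from collections import Counter

def gee_ndv(sample, population_size):
    """sample: list of sampled values; population_size: total number of rows N."""
    n = len(sample)
    freq = Counter(Counter(sample).values())      # f_i = #values occurring exactly i times
    f1 = freq.get(1, 0)
    rest = sum(count for i, count in freq.items() if i >= 2)
    return math.sqrt(population_size / n) * f1 + rest

sample = ["a", "b", "a", "c", "d", "d", "e"]      # made-up 7-row sample
print(gee_ndv(sample, population_size=1_000_000))
```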
Pruning Networks During Training via Auxiliary Parameters
Neural networks have perennially been limited by the physical constraints of implementation on real hardware, and the desire for improved accuracy often drives the model size to the breaking point. The task of reducing the size of a neural network, whether to meet memory constraints, inference-time speed, or generalization capabilities, is therefore well-studied. In this work, we present an extremely simple scheme to reduce model size during training, by introducing auxiliary parameters to the inputs of each layer of the neural network, and a regularization penalty that encourages the network to eliminate unnecessary variables from the computation graph. Though related to many prior works, this scheme offers several advantages: it is extremely simple to implement; the network eliminates unnecessary variables as part of training, without requiring any back-and-forth between training and pruning; and it dramatically reduces the number of parameters in the networks while maintaining high accuracy.
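A compact PyTorch sketch of the scheme as described in the abstract (per-input auxiliary gate parameters plus an L1 penalty that pushes unnecessary gates toward zero); the layer sizes, penalty weight, and data are assumptions, not the paper's configuration.

```python
# Sketch of pruning-during-training with auxiliary gate parameters and an L1 penalty.
# Layer sizes, penalty weight, and data are placeholders.
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(in_features))   # auxiliary parameter per input
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.linear(x * self.gate)                    # gated inputs feed the layer

model = nn.Sequential(GatedLinear(16, 32), nn.ReLU(), GatedLinear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3

x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
for _ in range(100):
    loss = nn.functional.cross_entropy(model(x), y)
    # The penalty encourages the network to zero out gates for unneeded inputs,
    # pruning those variables from the computation graph during training.
    loss = loss + l1_weight * sum(m.gate.abs().sum() for m in model if isinstance(m, GatedLinear))
    opt.zero_grad(); loss.backward(); opt.step()

print((model[0].gate.abs() < 1e-2).sum().item(), "of 16 inputs effectively pruned")
```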
Subject Membership Inference Attacks in Federated Learning
Privacy in Federated Learning (FL) is studied at two different granularities - item-level, which protects individual data points, and user-level, which protects each user (participant) in the federation. Nearly all of the private FL literature is dedicated to the study of privacy attacks and defenses alike at these two granularities. More recently, subject-level privacy has emerged as an alternative privacy granularity to protect the privacy of individuals whose data is spread across multiple (organizational) users in cross-silo FL settings. However, the research community lacks a good understanding of the practicality of this threat, as well as various factors that may influence subject-level privacy. A systematic study of these patterns requires complete control over the federation, which is impossible with real-world datasets. We design a simulator for generating various synthetic federation configurations, enabling us to study how properties of the data, model design and training, and the federation itself impact subject privacy risk. We propose three inference attacks for subject-level privacy and examine the interplay between all factors within a federation. Our takeaways generalize to real-world datasets like FEMNIST, giving credence to our findings.
Internship PFE Report by Omar Heddi
The following is a report about the internship I did at Oracle Corporation, during which I worked on my graduation project as part of the program organized by the National School of Applied Science of Fez. In this document, we will get to learn about Oracle, graph databases, and the solution Oracle provides for working with graph databases, namely PGX. More importantly, this report summarizes the work I did to improve the Oracle PGX compiler, as well as the challenges I faced and the new things I learned.
Synthesis of Java Deserialisation Filters from Examples (Presentation Slides)
Java natively supports serialisation and deserialisation, features that are necessary to enable distributed systems to exchange Java objects. Deserialisation of data from malicious sources can lead to security exploits including remote code execution because by default Java does not validate deserialised data. In the absence of validation, a carefully crafted payload can trigger arbitrary functionality. The state-of-the-art general mitigation strategy for deserialisation exploits in Java is deserialisation filtering that validates the contents of an object input stream before the object is deserialised using user-provided filters. In this paper we describe a novel technique called ds-prefix for automatic synthesis of deserialisation filters (as regular expressions) from examples. We focus on synthesis of allowlists (permitted behaviours) as they provide a better level of security. Ds-prefix is based on deserialisation heuristics and specifically targets synthesis of deserialisation allowlists. We evaluate our approach by executing ds-prefix on popular open-source systems and show that ds-prefix can produce filters preventing real CVEs using a small number of training examples. We also compare our approach with other synthesis tools which demonstrates that ds-prefix outperforms existing tools and achieves better F1-score.
Synthesis of Java Deserialisation Filters from Examples (Conference Video)
Java natively supports serialisation and deserialisation, features that are necessary to enable distributed systems to exchange Java objects. Deserialisation of data from malicious sources can lead to security exploits including remote code execution because by default Java does not validate deserialised data. In the absence of validation, a carefully crafted payload can trigger arbitrary functionality. The state-of-the-art general mitigation strategy for deserialisation exploits in Java is deserialisation filtering that validates the contents of an object input stream before the object is deserialised using user-provided filters. In this paper we describe a novel technique called ds-prefix for automatic synthesis of deserialisation filters (as regular expressions) from examples. We focus on synthesis of allowlists (permitted behaviours) as they provide a better level of security. Ds-prefix is based on deserialisation heuristics and specifically targets synthesis of deserialisation allowlists. We evaluate our approach by executing ds-prefix on popular open-source systems and show that ds-prefix can produce filters preventing real CVEs using a small number of training examples. We also compare our approach with other synthesis tools which demonstrates that ds-prefix outperforms existing tools and achieves better precision.
Synthesis of Java Deserialisation Filters from Examples
Java natively supports serialisation and deserialisation, features that are necessary to enable distributed systems to exchange Java objects. Deserialisation of data from malicious sources can lead to security exploits including remote code execution because by default Java does not validate deserialised data. In the absence of validation, a carefully crafted payload can trigger arbitrary functionality. The state-of-the-art general mitigation strategy for deserialisation exploits in Java is deserialisation filtering that validates the contents of an object input stream before the object is deserialised using user-provided filters. In this paper we describe a novel technique called ds-prefix for automatic synthesis of deserialisation filters (as regular expressions) from examples. We focus on synthesis of allowlists (permitted behaviours) as they provide a better level of security. Ds-prefix is based on deserialisation heuristics and specifically targets synthesis of deserialisation allowlists. We evaluate our approach by executing ds-prefix on popular open-source systems and show that ds-prefix can produce filters preventing real CVEs using a small number of training examples. We also compare our approach with other synthesis tools which demonstrates that ds-prefix outperforms existing tools and achieves better precision.
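A toy illustration of prefix-based allowlist synthesis (not the actual ds-prefix algorithm or its deserialisation heuristics): from positive example class names, derive shared package prefixes and turn them into an allowlist regular expression. The example class names are made up.

```python
# Toy sketch of synthesising a deserialisation allowlist regex from positive examples.
# Not the ds-prefix algorithm; it only shows the flavour of prefix-based generalisation.
import re

allowed_examples = [
    "com.example.billing.Invoice",
    "com.example.billing.LineItem",
    "com.example.users.Account",
]

# Generalise each example to its package prefix (everything before the class name).
prefixes = sorted({name.rsplit(".", 1)[0] + "." for name in allowed_examples})
pattern = "|".join(re.escape(p) + r"[^.]+" for p in prefixes)
allowlist = re.compile(f"^({pattern})$")

print(allowlist.pattern)
print(bool(allowlist.match("com.example.billing.Refund")))   # accepted: same package
print(bool(allowlist.match("org.evil.gadget.Exploit")))      # rejected: not allowlisted
```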
ONNX and the JVM
Integrating machine learning into enterprises requires building and deploying ML models in the environments enterprises build their software in. Frequently this is in Java, or another language running on the JVM. In this talk we'll cover some of our recent work bringing the ONNX ecosystem to Java. We'll discuss uses of ONNX Runtime from Java, and also our work writing model converters from our Java ML library into ONNX format.
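As a rough illustration of what using ONNX Runtime from Java looks like, the sketch below loads an exported model and runs one inference with the ai.onnxruntime API; the model path, the input name "input", and the tensor shape are assumptions that depend on how the model was exported.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Collections;

public class OnnxScoring {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // "model.onnx" and the input name "input" are placeholders for a real exported model.
        try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions())) {
            float[][] features = {{0.1f, 0.2f, 0.3f, 0.4f}};
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, features);
                 OrtSession.Result result = session.run(Collections.singletonMap("input", tensor))) {
                float[][] scores = (float[][]) result.get(0).getValue();
                System.out.println("score[0][0] = " + scores[0][0]);
            }
        }
    }
}
```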
Experience: Model-Based, Feedback-Driven, Greybox Web Fuzzing with BackREST
Slides for the corresponding ECOOP 2022 paper.
Subject Granular Differential Privacy in Federated Learning
This paper introduces subject granular privacy in the Federated Learning (FL) setting, where a subject is an individual whose private information is embodied by several data items either confined within a single federation user or distributed across multiple federation users. We formally define the notion of subject level differential privacy for FL. We propose three new algorithms that enforce subject level DP. Two of these algorithms are based on notions of user level local differential privacy (LDP) and group differential privacy respectively. The third algorithm is based on a novel idea of hierarchical gradient averaging (HiGradAvgDP) for subjects participating in a training mini-batch. We also introduce horizontal composition of privacy loss for a subject across multiple federation users. We show that horizontal composition is equivalent to sequential composition in the worst case. We prove the subject level DP guarantee for all our algorithms and empirically analyze them using the FEMNIST and Shakespeare datasets. Our evaluation shows that, of our three algorithms, HiGradAvgDP delivers the best model performance, approaching that of a model trained using a DP-SGD based algorithm that provides a weaker item level privacy guarantee.
Distinct Value Estimation from a Sample: Statistical Methods vs. Machine Learning
Estimating the number of distinct values (NDV) in a dataset is an important operation in modern database systems for many tasks, including query optimization. In large-scale systems, tables often contain billions of rows and wrong optimizer decisions can cause severe deterioration in query performance. Additionally, in many situations, such as having large tables or NDV estimation after the application of filters, it is not feasible to scan the entire dataset to compute the number of distinct values. In such cases, the only available option is to use a dataset sample to estimate the NDV. This, however, is not trivial as data properties of the sample usually do not mirror the properties of the full dataset. Approaches in related work have shown that this kind of estimation is prone to large errors. In this paper, we present two novel approaches for the problem of estimating the number of distinct values from a dataset sample. Our first approach presents a novel statistical estimator that shows good and robust results across a broad range of datasets. The second approach is based on Machine Learning (ML), making it the first time that ML is applied to this problem. Both approaches outperform the state-of-the-art, with the ML approach reducing the average error by 3x for real-world datasets. Beyond pure prediction quality, both our approaches have their own set of advantages and disadvantages, and we show that the right approach actually depends on the specific application scenario.
Automatic Root Cause Quantification for Missing Edges in JavaScript Call Graphs
Building sound and precise static call graphs for real-world JavaScript applications poses an enormous challenge, due to many hard-to-analyze language features. Further, the relative importance of these features may vary depending on the call graph algorithm being used and the class of applications being analyzed. In this paper, we present a technique to automatically quantify the relative importance of different root causes of call graph unsoundness for a set of target applications. The technique works by identifying the dynamic function data flows relevant to each call edge missed by the static analysis, correctly handling cases with multiple root causes and inter-dependent calls. We apply our approach to perform a detailed study of the recall of a state-of-the-art call graph construction technique on a set of framework-based web applications. The study yielded a number of useful insights. We found that while dynamic property accesses were the most common root cause of missed edges across the benchmarks, other root causes varied in importance depending on the benchmark, potentially useful information for an analysis designer. Further, with our approach, we could quickly identify and fix a recall issue in the call graph builder we studied, and also quickly assess whether a recent analysis technique for Node.js-based applications would be helpful for browser-based code. All of our code and data is publicly available, and many components of our technique can be re-used to facilitate future studies.
Experience: Model-Based, Feedback-Driven, Greybox Web Fuzzing with BackREST
Following the advent of the American Fuzzy Lop (AFL), fuzzing had a surge in popularity, and modern-day fuzzers range from simple blackbox random input generators to complex whitebox concolic frameworks that are capable of deep program introspection. Web application fuzzers, however, did not benefit from the tremendous advancements in fuzzing for binary programs and remain largely blackbox in nature. In this experience paper, we show how techniques like state-aware crawling, type inference, coverage and taint analysis can be integrated with a black-box fuzzer to find more critical vulnerabilities, faster (speedups between 7.4x and 25.9x). Comparing BackREST against three other web fuzzers on five large (>500 KLOC) Node.js applications shows how it consistently achieves comparable coverage while reporting more vulnerabilities than the state of the art. Finally, using BackREST, we uncovered eight 0-days, out of which six were not reported by any other fuzzer. All the 0-days have been disclosed and most are now public, including two in the highly popular Sequelize and Mongodb libraries.
Anomaly Detection for Cybersecurity and the Need for Explainable AI
Machine learning is increasingly applied in the cybersecurity domain in order to build solutions capable of protecting against attacks that escape rule-based systems. Attacks are nowadays constantly evolving, since adversaries are always creating new approaches or tweaking existing ones: it is thus not possible to rely exclusively on supervised techniques. This talk will focus on the role of anomaly detection techniques in real-world security applications, and how explainability is necessary in order to translate the anomalies detected by the system into actionable events.
AutoML Loss Landscapes
As interest in machine learning and its applications continues to increase, how to choose the best models and hyper-parameter settings becomes more important. This problem is known to be challenging for human experts, and consequently, a growing number of methods have been proposed for solving it, giving rise to the area of automated machine learning (AutoML). Many of the most popular AutoML methods are based on Bayesian optimization, which makes only weak assumptions about how modifying hyper-parameters affects the loss of a model. This is a safe assumption that yields robust methods, as the AutoML loss landscapes that relate hyper-parameter settings to loss are poorly understood. We build on recent work on the study of one-dimensional slices of algorithm configuration landscapes by introducing new methods that test n-dimensional landscapes for statistical deviations from uni-modality and convexity, and we use them to show that a diverse set of AutoML loss landscapes are highly structured. We introduce a method for assessing the significance of hyper-parameter partial derivatives, which reveals that most (but not all) AutoML loss landscapes have only a small number of hyper-parameters that interact strongly. To further assess hyper-parameter interactions, we introduce a simplistic optimization procedure that assumes each hyper-parameter can be optimized independently, a single time in sequence, and we show that it obtains configurations that are statistically tied with optimal in all of the n-dimensional AutoML loss landscapes that we studied. Our results suggest many possible new directions for substantially improving the state of the art in AutoML.
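A minimal sketch of the sequential procedure mentioned above, under the assumption of a generic loss function and per-hyper-parameter candidate grids (both placeholders): each hyper-parameter is optimized independently, a single time, in a fixed order, while the others are held at their current values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Sketch of a one-pass, per-hyper-parameter search: tune each hyper-parameter once,
 *  in sequence, holding the others fixed. The loss function and grids are placeholders. */
public class SequentialHyperparameterSearch {
    public static Map<String, Double> optimize(Map<String, List<Double>> grids,
                                                Map<String, Double> defaults,
                                                ToDoubleFunction<Map<String, Double>> loss) {
        Map<String, Double> config = new LinkedHashMap<>(defaults);
        for (Map.Entry<String, List<Double>> entry : grids.entrySet()) {
            String param = entry.getKey();
            double bestValue = config.get(param);
            double bestLoss = loss.applyAsDouble(config);
            for (double candidate : entry.getValue()) {
                config.put(param, candidate);
                double l = loss.applyAsDouble(config);
                if (l < bestLoss) {
                    bestLoss = l;
                    bestValue = candidate;
                }
            }
            config.put(param, bestValue); // fix this hyper-parameter and move on
        }
        return config;
    }
}
```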
Towards Formal Verification of HotStuff-based Byzantine Fault Tolerant Consensus in Agda
LibraBFT is a Byzantine Fault Tolerant (BFT) consensus protocol based on HotStuff. We present an abstract model of the protocol underlying HotStuff / LibraBFT, and formal, machine-checked proofs of their core correctness (safety) property and an extended condition that enables non-participating parties to verify committed results. (Liveness properties would be proved for specific implementations, not for the abstract model presented in this paper.) A key contribution is precisely defining assumptions about the behavior of honest peers, in an abstract way, independent of any particular implementation. Therefore, our work is an important step towards proving correctness of an entire class of concrete implementations, without repeating the hard work of proving correctness of the underlying protocol. The abstract proofs are for a single configuration (epoch); extending these proofs across configuration changes is future work. Our models and proofs are expressed in Agda, and are available in open source.
Runtime Prevention of Deserialization Attacks
Untrusted deserialization exploits, where a serialised object graph is used to achieve denial-of-service or arbitrary code execution, have become so prominent that they were introduced in the 2017 OWASP Top 10. In this paper, we present a novel and lightweight approach for runtime prevention of deserialization attacks using Markov chains. The intuition behind our work is that the features and ordering of classes in malicious object graphs make them distinguishable from benign ones. Preliminary results indeed show that our approach achieves an F1-score of 0.94 on a dataset of 264 serialised payloads, collected from an industrial Java EE application server and a repository of deserialization exploits.
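To make the intuition concrete, here is a minimal sketch (not the paper's implementation) of scoring the class sequence of an incoming object graph against transition log-probabilities learned from benign streams; the smoothing constant, class names, and threshold are illustrative only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: score the class sequence of an incoming object graph with a Markov chain
 *  trained on benign streams; low average log-likelihood suggests a malicious payload. */
public class DeserializationChainScorer {
    private final Map<String, Map<String, Double>> transitionLogProb = new HashMap<>();
    private static final double UNSEEN_LOG_PROB = Math.log(1e-6); // smoothing for unseen transitions

    public void addTransition(String from, String to, double logProb) {
        transitionLogProb.computeIfAbsent(from, k -> new HashMap<>()).put(to, logProb);
    }

    /** Average per-transition log-likelihood of the observed class sequence. */
    public double score(List<String> classSequence) {
        double total = 0.0;
        for (int i = 1; i < classSequence.size(); i++) {
            total += transitionLogProb
                    .getOrDefault(classSequence.get(i - 1), Map.of())
                    .getOrDefault(classSequence.get(i), UNSEEN_LOG_PROB);
        }
        return classSequence.size() > 1 ? total / (classSequence.size() - 1) : 0.0;
    }

    public boolean looksMalicious(List<String> classSequence, double threshold) {
        return score(classSequence) < threshold;
    }
}
```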
Towards formal verification of HotStuff-based BFT consensus in Agda
LibraBFT is a Byzantine Fault Tolerant (BFT) consensus protocol based on HotStuff. We present an abstract model of the protocol underlying HotStuff/LibraBFT, and formal, machine-checked proofs of their core correctness (safety) property and an extended condition that enables non-participating parties to verify committed results. (Liveness properties would be proved for specific implementations, not for the abstract model presented in this paper.) A key contribution is precisely defining assumptions about the behavior of honest peers, in an abstract way, independent of any particular implementation. Therefore, our work is an important step towards proving correctness of an entire class of concrete implementations, without repeating the hard work of proving correctness of the underlying protocol. The abstract proofs are for a single configuration (epoch); extending these proofs across configuration changes is future work. Our models and proofs are expressed in Agda, and are available in open source.
An approach to translating Haskell programs to Agda and reasoning about them
We are using the Agda programming language and proof assistant to formally verify correctness of a Byzantine Fault Tolerant consensus implementation based on HotStuff / DiemBFT. The Agda implementation is a translation of our Haskell implementation, which is based on DiemBFT. This short paper focuses on one aspect of this work. We have developed a library that enables the translated Agda implementation to closely mirror the Haskell code on which it is based, making review and maintenance easier and more efficient, and reducing the risk of translation errors. We also explain how we assign semantics to the syntactic features provided by our library, thus enabling formal reasoning about programs that use them; details of how we reason about the resulting Agda implementation will be presented in a future paper. The library we present is independent of our particular verification project, and is available in open source for others to use and extend.
Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models
A few large, homogeneous, pre-trained models undergird many machine learning systems — and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: the theory that social biases (such as stereotypes) internalized by large language models during pre-training transfer into harmful task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.
Upstream Mitigation Is Not All You Need
A few large, homogeneous pre-trained models undergird many machine learning systems — and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis, the possibility that social biases (such as stereotypes) internalized by large language models during pre-training could also affect task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.
Oracle Cloud Advanced ML Prognostics Innovations for Enterprise Computing Servers
Oracle has a portfolio of Machine Learning (ML) offerings for monitoring time-series telemetry signals for anomaly detection. The product suite is called the Multivariate State Estimation Technique (MSET2), which integrates an advanced prognostic pattern recognition technique with a collection of intelligent data preprocessing (IDP) innovations for high-sensitivity prognostic applications. One important application is monitoring dynamic computer power and catching the early incipience of mechanisms that cause servers to fail, using server telemetry signals. Telemetry signals in computing servers typically include many physical variables (e.g., voltages, currents, temperatures, fan speeds, and power levels) that correlate with system IO traffic, memory utilization, and system throughput. By utilizing the telemetry signals, MSET2 improves power efficiency by monitoring, reporting and forecasting energy consumption, cooling requirements and load utilization of servers. However, the common challenge in the computing server industry is that telemetry signals are never perfect. For example, enterprise-class servers have disparate sampling rates and are often not synchronized in time, resulting in a lead-lag phase change among the various signals. In addition, the enterprise computing industry often uses 8-bit A/D conversion chips for physical sensors. This makes it difficult to discern small variations in the physical variables that are severely quantized because of the use of low-resolution chips. Moreover, missing values often exist in the streaming telemetry signals, which can be caused by a saturated system bus or data transmission errors. This paper describes some features of key IDP algorithms for optimal ML solutions to the aforementioned challenges across the enterprise computing industry. It assures optimal ML performance for prognostics, optimal energy efficiency of Enterprise Servers, and streaming analytics.
Temporality in General-Domain Entailment Graph Induction
Entailment Graphs based on open relation extraction run the risk of learning spurious entailments (e.g. win against ⊨ lose to) from antonymous predications that are observed with the same entities referring to different times. Previous research has demonstrated the potential of using temporality as a signal to avoid learning these entailments in the sports domain. We investigate whether this extends to the general news domain. Our method introduces a temporal window that is set dynamically for each eventuality using a temporally informed language model. We evaluate our models on a sports-specific dataset, and ANT – a novel general-domain dataset based on WordNet antonym pairs. We find that whilst it may be useful to reinterpret the Distributional Inclusion Hypothesis to include time for the sports news domain, this does not apply to the general news domain.
Challenges in adopting Machine Learning for Cybersecurity
Machine learning can be a powerful ally in fighting Cybercrime provided that few challenges in its application can be solved. The Keybridge team at Oracle Labs has experience with developing ML solutions for security use cases. In this talk we would like to share those experiences and discuss three challenges - selecting an ML model, handling of input data (specifically system logs) and transferring to security teams. In the latter challenge, we are particularly interested in bridging the two-way gap in understanding between security teams and ML practitioners.
Runtime Prevention of Deserialization Attacks
Untrusted deserialization exploits, where a serialised object graph is used to achieve denial-of-service or arbitrary code execution, have become so prominent that they were introduced in the 2017 OWASP Top 10. In this paper, we present a novel and lightweight approach for runtime prevention of deserialization attacks using Markov chains. The intuition behind our work is that the features and ordering of classes in malicious object graphs make them distinguishable from benign ones. Preliminary results indeed show that our approach achieves an F1-score of 0.94 on a dataset of 264 serialised payloads, collected from an industrial Java EE application server and a repository of deserialization exploits.
Constant Blinding on GraalVM
With the advent of JIT-compilers, code-injection attacks have seen a revival in the form of JIT-spraying. JIT-spraying enables an attacker to inject gadgets into executable memory, effectively bypassing W^X and ASLR. In response to JIT-spraying, constant blinding has emerged as a conceptually simple and performance friendly defense. Unfortunately, a number of increasingly sophisticated attacks has pinpointed the shortcomings of existing constant blinding implementations. In this paper, we present our constant blinding implementation for the GraalVM, taking into account the insights from the last decade regarding the security of constant blinding. We discuss important design decisions and tradeoffs as well as the practical implementation issues encountered when implementing constant blinding for GraalVM. We evaluate the performance impact of our implementation with different configurations and demonstrate its effectiveness by fuzzing for unblinded constants.
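For readers unfamiliar with the defense, the following sketch illustrates the basic idea of XOR-based constant blinding in plain Java; it is a conceptual model only, not the GraalVM implementation, and the constants are made up.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Conceptual sketch of XOR-based constant blinding: instead of embedding an
 *  attacker-chosen immediate directly in the generated code, emit the blinded
 *  value and recover the original at run time with one extra XOR. */
public class ConstantBlindingSketch {
    static final class BlindedConstant {
        final long blinded; // value that ends up as an immediate in the emitted code
        final long key;     // random key emitted alongside it

        BlindedConstant(long constant) {
            this.key = ThreadLocalRandom.current().nextLong();
            this.blinded = constant ^ key;
        }

        long materialize() {
            // corresponds to the extra XOR instruction the compiler emits
            return blinded ^ key;
        }
    }

    public static void main(String[] args) {
        long attackerControlled = 0x9090909090909090L; // e.g. a NOP-sled-like immediate
        BlindedConstant c = new BlindedConstant(attackerControlled);
        System.out.printf("emitted immediate: %016x, recovered: %016x%n", c.blinded, c.materialize());
    }
}
```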
"Static Java": The GraalVM Native Image Programming Model
In this talk we will present our vision for “Static Java”: the programming model enabled by GraalVM Native Image. Applications are initialized at image build time, to allow fast startup time and low memory footprint at run time. Counterintuitively, the ahead-of-time compilation of Java bytecode to machine code is not part of the programming model. But since it is an important implementation detail, we will also talk about the benefits and problems of ahead-of-time compilation. We will show where static analysis helps, what the limitations of static analysis are, which compiler optimizations work well both for JIT and AOT compilation, and where additional compiler phases for AOT compilation are necessary.
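To make the build-time initialization idea concrete, here is a small hedged example: a class whose static initializer is intended to run during the image build when the application is built with the native-image option --initialize-at-build-time. The option is real; the class, property values, and command line are illustrative only.

```java
/** Illustrative class whose static initializer runs at image build time when built with, e.g.:
 *    native-image --initialize-at-build-time=com.example.Config -cp app.jar com.example.Main
 *  The heap produced by the initializer is snapshotted into the image, so the work
 *  is not repeated at run time. Names and values are examples only. */
public class Config {
    // Computed once during the image build; available immediately at run time.
    static final java.util.Properties SETTINGS = load();

    private static java.util.Properties load() {
        java.util.Properties p = new java.util.Properties();
        p.setProperty("mode", "production");
        return p;
    }
}
```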
GraalVM: State of AArch64
While always the de facto choice of the mobile domain, recently machines using Arm's AArch64 ISA have also become prevalent within the laptop, desktop, and server marketplaces. Because of this, it is imperative for the GraalVM ecosystem to not only perform well on AArch64, but to treat AArch64 as an equal peer of AMD64. In my talk, I will give an overview of the current state of GraalVM on AArch64. This includes (i) describing the work involved in creating the GraalVM AArch64 port, (ii) providing an overview of current GraalVM AArch64 features, (iii) explaining the code architecture of the AArch64 backend and how to navigate it, and (iv) presenting some current performance numbers on AArch64. Beyond this overview, I also plan to discuss in detail some of the main challenges in getting AArch64 running on GraalVM, such as adding patching support, abiding by the Java Memory Model, and utilizing AArch64's different addressing modes and branch instructions. I'll also present some of our future plans for the continued improvement of the AArch64 backend.
Toward Just-in-time and Language-agnostic Mutation Testing
Mutation Testing is a popular approach to determine the quality of a suite of unit tests. It is based on the idea that introducing faults into a system-under-test (SUT) should cause tests to fail, otherwise, the test suite might be of insufficient quality. In the language of mutation testing, such a fault is referred to as "mutation", and an instance of the SUT's code that contains the mutation is referred to as "mutant". Mutation testing is computationally expensive and time-consuming. Reasons for this include, for example, a high number of mutations to consider, interrelations between these mutations, and mutant-associated costs such as the cost of mutant creation or the cost of checking whether any tests fail in response. Furthermore, implementing a reliable tool for automatic mutation testing is a significant effort for any language. As a result, mutation testing is only available for some languages. Present mutation tools often rely on modifying code or binary executables. We refer to this as "ahead-of-time" mutation testing. Oftentimes, they neither take dynamic information that is only available at run-time into account nor alter program behavior at run-time. However, mutating via the latter could save costs on mutant creation: if the corresponding module of code is compiled, only the mutated section of code needs to be recompiled. Additional run-time information (like previous execution results of the mutated section), selected by an initial test run, could also help to determine the utility of a mutant. Skipping mutants of low utility could have an impact on mutation testing efficiency. We propose to refer to this approach as just-in-time mutation testing. In this paper, we provide a proof of concept for just-in-time and language-agnostic mutation testing. We present preliminary results of a feasibility study that explores the implementation of just-in-time mutation testing based on Truffle's instrumentation API. Based on these results, future research can evaluate the implications of just-in-time and language-agnostic mutation testing.
Autonomous Memory Sizing Formularization for Cloud-based IoT ML Customers
Machine learning IoT use cases involve thousands of sensor signals, and the demand on the cloud is high. One challenge for all cloud companies who seek to deal with big data use cases is the fact that the peak memory utilization scales non-linearly with the number of sensors, and sizing cloud shapes properly and autonomously prior to the program run is complicated. To address this issue, Oracle developed an autonomous formularization tool with OCI Anomaly Detection’s patented MSET2 algorithm so RAM capacity and/or VRAM capacity can be optimally sized—which helps developers gain a perception of the required computing resources beforehand and avoid the out-of-memory error. It also avoids excessively conservative RAM pre-allocations which saves cost for customers.
Gelato: Feedback-driven and Guided Security Analysis of Client-side Web Applications
Modern web applications are getting more sophisticated by using frameworks that make development easy, but pose challenges for security analysis tools. New analysis techniques are needed to handle such frameworks that grow in number and popularity. In this paper, we describe Gelato that addresses the most crucial challenges for a security-aware client-side analysis of highly dynamic web applications. In particular, we use a feedback-driven and state-aware crawler that is able to analyze complex framework-based applications automatically, and is guided to maximize coverage of security-sensitive parts of the program. Moreover, we propose a new lightweight client-side taint analysis that outperforms the state-of-the-art tools, requires no modification to browsers, and reports non-trivial taint flows on modern JavaScript applications. Gelato reports vulnerabilities with higher accuracy than existing tools and achieves significantly better coverage on 12 applications of which three are used in production.
Private Federated Learning with Domain Adaptation
Federated learning (FL) was originally motivated by communication bottlenecks in training models from data stored across millions of devices, but the paradigm of distributed training is attractive for models built on sensitive data, even when the number of users is relatively small, such as collaborations between organizations. For example, when training machine learning models from health records, the raw data may be limited in size, too sensitive to be aggregated directly, and concerns about data reconstruction must be addressed. Differential privacy (DP) offers a guarantee about the difficulty of reconstructing individual data points, but achieving reasonable privacy guarantees on small datasets can significantly degrade model accuracy. Data heterogeneity across users may also be more pronounced with smaller numbers of users in the federation pool. We provide a theoretical argument that model personalization offers a practical way to address both of these issues, and demonstrate its effectiveness with experimental results on a variety of domains, including spam detection, named entity recognition on case narratives from the Vaccine Adverse Event Reporting System (VAERS) and image classification using the federated MNIST dataset (FEMNIST).
Industrial Experience of Finding Cryptographic Vulnerabilities in Large-scale Codebases
Enterprise environments often screen large-scale (millions of lines of code) codebases with static analysis tools to find bugs and vulnerabilities. Parfait is a static code analysis tool used in Oracle to find security vulnerabilities in industrial codebases. Recently, many studies show that there are complicated cryptographic vulnerabilities caused by misusing cryptographic APIs in Java. In this paper, we describe how we realize a precise and scalable detection of these complicated cryptographic vulnerabilities based on the Parfait framework. The key challenge in the detection of cryptographic vulnerabilities is the high false alarm rate caused by pseudo-influences. Pseudo-influences happen if security-irrelevant constants are used in constructing security-critical values. Static analysis is usually unable to distinguish them from hard-coded constants that expose sensitive information. We tackle this problem by specializing the backward dataflow analysis used in Parfait with refinement insights, an idea from the tool CryptoGuard [20]. We evaluate our analyzer on a comprehensive Java cryptographic vulnerability benchmark and eleven large real-world applications. The results show that the Parfait-based cryptographic vulnerability detector can find real-world cryptographic vulnerabilities in large-scale codebases with high true-positive rates and low runtime cost.
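For illustration, the snippet below shows the kind of finding such an analysis targets: hard-coded key material flowing into a javax.crypto key object, alongside a security-irrelevant constant ("AES") that acts as a pseudo-influence. The example is constructed for this description, not taken from the paper's benchmark.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class CryptoMisuseExample {
    // Hard-coded key material: the kind of finding a cryptographic vulnerability
    // detector should report, because the secret is exposed in the codebase.
    private static final byte[] HARD_CODED_KEY =
            "0123456789abcdef".getBytes(StandardCharsets.UTF_8);

    public static byte[] encrypt(byte[] plaintext) throws Exception {
        // "AES" is a security-irrelevant constant (a pseudo-influence): it flows into
        // the key object but does not itself leak a secret, so it should not be flagged.
        SecretKeySpec key = new SecretKeySpec(HARD_CODED_KEY, "AES");
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }
}
```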
I have data and a business problem; now what?
In the last few decades, machine learning has made many great leaps and bounds, thereby substantially improving the state of the art in a diverse range of industry applications. However, for a given dataset and a business use case, non-technical users are faced by many questions that limit the adoption of a machine learning solution. For example: • Which machine learning model should I use? • How should I set its hyper-parameters? • Can I trust what my model learned? • Does my model discriminate against a marginalized, protected group? Even for seasoned data scientists, answering these questions can be tedious and time consuming. To address these barriers, the AutoMLx team at Oracle Labs has developed an automated machine learning (AutoML) pipeline that performs automated feature engineering, preprocessing and selection, and then selects a suitable machine learning model and hyper-parameter configuration. To help users understand and trust their "magic" and opaque machine learning models, the AutoMLx package supports a variety of methods that can help explain what the model has learned. In this talk, we will provide an overview of our current AutoMLx methods; we will comment on open questions and our active areas of research; and we will briefly review the projects of our sister teams at Oracle Labs. Finally, in this talk we will briefly reflect on some of the key differences between research in a cutting-edge industry lab compared with research in an academic setting.
Online Selection with Cumulative Fairness Constraints
We propose and study the problem of online selection with cumulative fairness constraints. In this problem, candidates arrive online, i.e., one at a time, and the decision maker must choose to accept or reject each candidate subject to a constraint on the history of decisions made thus far. We introduce deterministic, randomized, and learned policies for selection in this setting. Empirically, we demonstrate that our learned policies achieve the highest utility. However, we also show—using 700 synthetically generated datasets—that the simple, greedy algorithm is often competitive with the optimal sequence of decisions, obviating the need for complex (and often inscrutable) learned policies, in many cases. Theoretically, we analyze the limiting behavior of our randomized approach and prove that it satisfies the fairness constraint with high probability.
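A minimal sketch of a greedy policy of the kind described above, under an assumed constraint form (keep the share of accepted protected-group candidates at or above a target fraction); the exact constraint, utility model, and thresholds in the paper may differ.

```java
/** Sketch of a greedy online-selection policy under a cumulative fairness constraint:
 *  accept a candidate only if doing so keeps the share of accepted protected-group
 *  candidates at or above a target fraction. The constraint form is illustrative. */
public class GreedyFairSelector {
    private final double targetShare; // e.g. 0.4
    private int accepted = 0;
    private int acceptedProtected = 0;

    public GreedyFairSelector(double targetShare) {
        this.targetShare = targetShare;
    }

    public boolean decide(double utility, boolean isProtected, double utilityThreshold) {
        if (utility < utilityThreshold) {
            return false; // not worth accepting regardless of fairness
        }
        int newAccepted = accepted + 1;
        int newProtected = acceptedProtected + (isProtected ? 1 : 0);
        // Greedy check: would accepting keep the cumulative constraint satisfied?
        if (!isProtected && (double) newProtected / newAccepted < targetShare) {
            return false;
        }
        accepted = newAccepted;
        acceptedProtected = newProtected;
        return true;
    }
}
```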
Scalable Static Analysis to Detect Security Vulnerabilities: Challenges and Solutions
Parfait is a static analysis tool originally developed to find defects in C/C++ systems code. It has since been extended to detect injection attacks in Java and PL/SQL applications. Parfait has been deployed internally at Oracle, is used by thousands of developers, and can be integrated at commit-time, in the nightly build or used standalone. Commit-time integration brings security closer to developers, and provides them with the opportunity to fix defects before they are merged. This poster presents some of the challenges we encountered in the process of extending Parfait from a defect analyser for C/C++ to a security analyser for Java and PL/SQL, and the solutions that enabled us to analyse a variety of commercial enterprise applications in a fast and precise way.
Poster: Unacceptable Behavior: Robust PDF Malware Detection Using Abstract Interpretation
The popularity of the PDF format and the rich JavaScript environment that PDF viewers offer make PDF documents an attractive attack vector for malware developers. Because machine learning-based approaches are subject to adversarial attacks that mimic the structure of benign documents, we propose to detect malicious code inside a PDF by statically reasoning about its possible behaviours using abstract interpretation. A comparison with state-of-the-art PDF malware detection tools shows that our conservative abstract interpretation approach achieves similar accuracy, is more resilient to evasion attacks, and provides explainable reports.
Clonefiles
Explores the concept of clonefiles (aka Linux reflinks) and describes various tools and techniques for efficient introspection and processing.
Montsalvat: Intel SGX Shielding for GraalVM Native Images
The rapid growth of the Java programming language has led to its wide adoption in cloud computing infrastructures. However, Java applications running in untrusted clouds are susceptible to various forms of privileged attacks. The emergence of trusted execution environments (TEEs), i.e., Intel SGX, mitigates this problem. TEEs protect code and data in secure enclaves inaccessible to untrusted software, including the kernel or hypervisors. To efficiently use TEEs, developers are required to manually partition their applications into trusted and untrusted parts. This decreases the trusted computing base (TCB) and minimizes security vulnerabilities. However, partitioning Java applications poses two important challenges: (1) ensuring efficient object communication between the partitioned components, and (2) ensuring garbage collection consistency between them. We present Montsalvat, a tool which provides a practical and intuitive annotation-based partitioning approach for Java applications using secure enclaves. Montsalvat provides an RMI-like mechanism to ensure inter-object communication, as well as consistent garbage collection across the partitioned components. We implement Montsalvat with GraalVM Native Image, a tool which ahead-of-time compiles Java applications into standalone native executables which do not require a JVM at runtime. We perform extensive evaluations of Montsalvat using micro and macro benchmarks, and show that our partitioning approach can lead to up to 6.6× and 2.9× performance boosts in real-world applications (i.e., PalDB and GraphChi) respectively as compared to solutions that naively include the entire applications in the enclave.
Distributed Graph Processing with PGX.D
Graph processing is one of the top data analytics trends. In particular, graph processing comprises two main styles of analysis, namely graph algorithms and graph pattern-matching queries. Classic graph algorithms, such as Pagerank, repeatedly traverse the vertices and edges of the graph and calculate some desired (mathematical) function. Graph queries enable the interactive exploration and pattern matching of graphs. For example, queries like `SELECT p1.name, p2.name FROM MATCH (p1:person)-[:friend]->(p2:person) WHERE p1.country = p2.country` combine the classic operations found in SQL with graph patterns. Both algorithms and queries are very challenging workloads, especially in a distributed setting, where very large graphs are partitioned across multiple machines. In this lecture, I will present how the distributed PGX [1] engine (known as PGX.D; developed at Oracle Labs [2] Zurich) implements efficient algorithms and queries and solves problems, such as data skew and intermediate-result explosion. In brief, for graph algorithms, PGX.D offers the functionality to compile simple sequential textbook-style GreenMarl [3] algorithms to efficient distributed execution. For queries, PGX.D includes a depth-first asynchronous computation runtime [4] that enables limiting the amount of intermediate data during query execution to essentially support "any-size" patterns. [1] http://www.oracle.com/technetwork/oracle-labs/parallel-graph-analytix/overview/index.html [2] https://labs.oracle.com [3] Green-Marl: A DSL for easy and efficient graph analysis, ASPLOS'12. [4] aDFS: An Almost Depth-First-Search Distributed Graph-Querying System. USENIX ATC'21.
Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation
Sequence-to-Sequence (Seq2Seq) neural text generation models, especially the pre-trained ones (e.g., BART and T5), have exhibited compelling performance on various natural language generation tasks. However, the black-box nature of these models limits their application in tasks where specific rules (e.g., controllable constraints, prior knowledge) need to be executed. Previous works either design specific model structures (e.g., Copy Mechanism corresponding to the rule “the generated output should include certain words in the source input”) or implement specialized inference algorithms (e.g., Constrained Beam Search) to execute particular rules during text generation. These methods require careful case-by-case design and make it difficult to support multiple rules concurrently. In this paper, we propose a novel module named Neural Rule-Execution Tracking Machine, i.e., NRETM, that can be equipped into various transformer-based generators to leverage multiple rules simultaneously to guide the neural generation model for superior generation performance in a unified and scalable way. Extensive experiments on several benchmarks verify the effectiveness of our proposed model in both controllable and general text generation tasks.
Security Research at Oracle Labs, Australia
This is a broad-brush overview of the relevant projects (both past and present) at Oracle Labs, Australia. It also outlines some of the security ideas and software engineering principles that are relevant to tool development and deployment.
Bitemporal Property Graphs to Organize Evolving Systems
This work is a summarized view on the results of a one-year cooperation between Oracle Corp. and the University of Leipzig. The goal was to research the organization of relationships within multi-dimensional time-series data, such as sensor data from the IoT area. We showed in this project that temporal property graphs with some extensions are a prime candidate for this organizational task that combines the strengths of both data models (graph and time-series). The outcome of the cooperation includes four achievements: (1) a bitemporal property graph model, (2) a temporal graph query language, (3) a conception of continuous event detection, and (4) a prototype of a bitemporal graph database that supports the model, language and event detection.
Diverse Data Augmentation via Unscrambling Text with Missing Words
We present the Diverse Augmentation using Scrambled Seq2Seq (DAugSS) algorithm, a fully automated data augmentation mechanism that leverages a model to generate examples in a semi-controllable fashion. The main component of DAugSS is a training procedure in which the generative model is trained to transform a class label and a sequence of tokens into a well-formed sentence of the specified class that contains the specified tokens. Empirically, we show that DAugSS is competitive with or outperforms state-of-the-art, generative models for data augmentation in terms of test set accuracy on 4 datasets. We show that the flexibility of our approach yields datasets with expansive vocabulary, and that models trained on these datasets are more resilient to adversarial attacks than when trained on datasets augmented by competing methods.
Searching Near and Far for Examples in Data Augmentation
In this work, we demonstrate that augmenting a dataset with examples that are far from the initial training set can lead to significant improvements in test set accuracy. We draw on the similarity of deep neural networks and nearest neighbor models. Like a nearest neighbor classifier, we show that, for any test example, augmentation with a single, nearby training example of the same label--followed by retraining--is often sufficient for a BERT-based model to correctly classify the test example. In light of this result, we devise FRaNN, an algorithm that attempts to cover the embedding space defined by the trained model with training examples. Empirically, we show that FRaNN, and its variant FRaNNK, construct augmented datasets that lead to models with higher test set accuracy than either uncertainty sampling or a random augmentation baseline.
Multivalent Entailment Graphs for Question Answering
Drawing inferences between open-domain natural language predicates is a necessity for true language understanding. There has been much progress in unsupervised learning of entailment graphs for this purpose. We make three contributions: (1) we reinterpret the Distributional Inclusion Hypothesis to model entailment between predicates of different valencies, like DEFEAT(Biden, Trump) |= WIN(Biden); (2) we actualize this theory by learning unsupervised Multivalent Entailment Graphs of open-domain predicates; and (3) we demonstrate the capabilities of these graphs on a novel question answering task. We show that directional entailment is more helpful for inference than non-directional similarity on questions of fine-grained semantics. We also show that drawing on evidence across valencies answers more questions than by using only the same valency evidence.
Open-Domain Contextual Link Prediction and its Complementarity with Entailment Graphs
An open-domain knowledge graph (KG) has entities as nodes and natural language relations as edges, and is constructed by extracting (subject, relation, object) triples from text. The task of open-domain link prediction is to infer missing relations in the KG. Previous work has used standard link prediction for the task. Since triples are extracted from text, we can ground them in the larger textual context in which they were originally found. However, standard link prediction methods only rely on the KG structure and ignore the textual context of the triples. In this paper, we introduce the new task of open-domain contextual link prediction which has access to both the textual context and the KG structure to perform link prediction. We build a dataset for the task and propose a model for it. Our experiments show that context is crucial in predicting missing relations. We also demonstrate the utility of contextual link prediction in discovering out-of-context entailments between relations, in the form of entailment graphs (EG), in which the nodes are the relations. The reverse holds too: out-of-context EGs assist in predicting relations in context.
GraalVM, Python, and Polyglot Programming
Presentation at HPI graduate school, the PhD school at the HPI in Potsdam.
LXM: Better Splittable Pseudorandom Number Generators (and Almost as Fast)
Paper to be submitted to ACM OOPSLA 2021. Abstract: In 2014, Steele, Lea, and Flood presented SplitMix, an object-oriented pseudorandom number generator (PRNG) that is quite fast (9 64-bit arithmetic/logical operations per 64 bits generated) and also splittable. A conventional PRNG object provides a generate method that returns one pseudorandom value and updates the state of the PRNG; a splittable PRNG object also has a second operation, split, that replaces the original PRNG object with two (seemingly) independent PRNG objects, by creating and returning a new such object and updating the state of the original object. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs structured using fork-join parallelism. This overall strategy still appears to be sound, but the specific arithmetic calculation used for generate in the SplitMix algorithm has some detectable weaknesses, and the period of any one generator is limited to 2^64. Here we present the LXM family of PRNG algorithms. The idea is an old one: combine the outputs of two independent PRNG algorithms, then (optionally) feed the result to a mixing function. An LXM algorithm uses a linear congruential subgenerator and an F2-linear subgenerator; the examples studied in this paper use an LCG of period 2^16, 2^32, 2^64, or 2^128 with one of the multipliers recommended by L'Ecuyer or by Steele and Vigna, and an F2-linear generator of the xoshiro family or xoroshiro family as described by Blackman and Vigna. Mixing functions studied in this paper include the MurmurHash3 finalizer function, David Stafford's variants, Doug Lea's variants, and the null (identity) mixing function. Like SplitMix, LXM provides both a generate operation and a split operation. Also like SplitMix, LXM requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with SIMD instruction sets because it has no branches or loops. We analyze the period and equidistribution properties of LXM generators, and present the results of thorough testing of specific members of this family, using the TestU01 and PractRand test suites, not only on single instances of the algorithm but also for collections of instances, used in parallel, ranging in size from 2 to 2^27. Single instances of LXM that include a strong mixing function appear to have no major weaknesses, and LXM is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. We believe that LXM is suitable for the same sorts of applications as SplitMix, that is, "everyday" scientific and machine-learning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
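For readers who want to try the generators: the LXM family shipped in JDK 17's java.util.random package (JEP 356) under names such as L64X128MixRandom. The snippet below shows the generate and split operations through the standard SplittableGenerator interface; it is a usage illustration, not part of the paper.

```java
import java.util.random.RandomGenerator;

public class SplittablePrngExample {
    public static void main(String[] args) {
        // "L64X128MixRandom" combines a 64-bit LCG, a 128-bit F2-linear (xoroshiro-style)
        // subgenerator, and a mixing function, as in the LXM design described above.
        RandomGenerator.SplittableGenerator root =
                RandomGenerator.SplittableGenerator.of("L64X128MixRandom");

        // split() yields a statistically independent child stream, convenient for
        // fork-join tasks: each subtask gets its own generator without locking.
        RandomGenerator.SplittableGenerator child = root.split();

        System.out.println("parent: " + root.nextLong());
        System.out.println("child:  " + child.nextLong());
    }
}
```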
Run-time Data Analysis to Drive Compiler Optimizations
Throughout program execution, types may stabilize, variables may become constant, and code sections may turn out to be redundant - all information that is used by just-in-time (JIT) compilers to achieve peak performance. Yet, since JIT compilation is done on demand for individual code parts, global observations cannot be made. Moreover, global data analysis is an inherently expensive process, that collects information over large data sets. Thus, it is infeasible in dynamic compilers. With this project, we propose integrating data analysis into a dynamic runtime to speed up big data applications. The goal is to use the detailed run-time information for speculative compiler optimizations based on the shape and complexion of the data to improve performance.
Run-Time Data Analysis in Dynamic Runtimes
Databases are typically faster in processing huge amounts of data than applications with hand-coded data access. Even though modern dynamic runtimes optimize applications intensively, they cannot perform certain optimizations that are traditionally used by database systems as they lack the required information. Thus, we propose to extend the capabilities of dynamic runtimes to allow them to collect fine-grained information of the processed data at run time and use it to perform database-like optimizations. By doing so, we want to enable dynamic runtimes to significantly boost the performance of data-processing workloads. Ideally, applications should be as fast as databases in data-processing workloads by detecting the data schema at run time. To show the feasibility of our approach, we are implementing it in a polyglot dynamic runtime.
LXM: Better Splittable Pseudorandom Number Generators (and Almost as Fast)
Video for a conference presentation at ACM OOPSLA 2021. The video file is 1280x720. An associated SRT file contains the subtitle (closed caption) information separately. The corresponding paper is Archivist 2021-0405. The slides are available in PDF and PowerPoint formats as Archivist 2021-1004.
GraalVM Native Image: Large-scale static analysis for Java
GraalVM Native Image combines static analysis, heap snapshotting, and ahead-of-time compilation to produce a highly optimized standalone executable for a Java application. In this talk, we first introduce the overall architecture of GraalVM Native Image: instead of “just” compiling Java bytecode ahead of time, it also initializes part of the application at build time. This reduces the startup time and memory footprint of the application at run time. In the second part of the talk, we dive into details of the points-to analysis. We show which of our original research ideas worked or did not work when analyzing large production applications; and we show the benefits of tightly integrating the static analysis with the ahead-of-time compiler.
Lightweight On-Stack Replacement in Languages with Unstructured Loops
On-stack replacement (OSR) is a popular technique used by just in time (JIT) compilers. A JIT can use OSR to transfer from interpreted to compiled code in the middle of execution, immediately reaping the performance benefits of compilation. This technique typically relies on loop counters, so it cannot be easily applied to languages with unstructured control flow. It is possible to reconstruct the high-level loop structures of an unstructured language using a control flow analysis, but such an analysis can be complicated, expensive, and language-specific. In this paper, we present a more lightweight strategy for OSR in unstructured languages which relies only on detecting backward jumps. We design a simple, language-agnostic API around this strategy for language interpreters. We then discuss our implementation of the API in the Truffle framework, and the design choices we made to make it efficient and correct. In our evaluation, we integrate the API with Truffle’s LLVM bitcode interpreter, and find the technique is effective at improving start-up performance without harming warmed-up performance.
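The following is a minimal, language-agnostic sketch of the backward-jump heuristic, assuming a bytecode-style interpreter with an explicit program counter; the threshold and the compile/transfer hooks are placeholders, not the Truffle API used in the paper.

```java
/** Sketch of the backward-jump heuristic for OSR in an interpreter for a language
 *  with unstructured control flow: no loop reconstruction, just a counter that is
 *  bumped whenever the jump target precedes the current position. */
public class BackwardJumpOsr {
    private static final int OSR_THRESHOLD = 100_000;
    private int backEdgeCount = 0;

    /** Called by the interpreter dispatch loop on every taken jump. */
    public int onJump(int currentPc, int targetPc) {
        if (targetPc <= currentPc) {             // backward jump: likely part of a loop
            if (++backEdgeCount >= OSR_THRESHOLD) {
                backEdgeCount = 0;
                requestOsrCompilation(targetPc); // placeholder hook
            }
        }
        return targetPc;
    }

    private void requestOsrCompilation(int entryPc) {
        // In a real runtime this would hand the current frame state to the JIT and
        // continue execution in compiled code at entryPc.
        System.out.println("OSR requested at pc " + entryPc);
    }
}
```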
CompGen: Generation of Fast Compilers in a Multi-Language VM
The first Futamura projection enables compilation and high performance code generation of user programs by partial evaluation of language interpreters. Previous work has shown that it is sufficient to leverage profiling information and use partial evaluation directives in interpreters as hints to drive partial evaluation towards compiled code efficiency. However, this comes with the downside of additional application warm-up time: partial evaluation of language interpreters has to specialize interpreter code on the fly to the dynamic types used at run time to create efficient target code. As a result, the time spent on partial evaluation itself is a significant contributor to the overall compile time of a method. The second Futamura projection solves this problem by self-applying partial evaluation on the partial evaluation algorithm, effectively generating language-specific compilers from interpreters. This typically reduces compilation time compared to the first projection. Previous work employed the second projection to some extent, however to this day, no generic second Futamura projection approach is used in a state-of-the-art language runtime. Ultimately, the problems of code-size explosion for compiler generation and warm-up time increases are unsolved problems subject to research to this day. To solve the problems of code-size explosion and self-application warm-up, this paper proposes CompGen, an approach based on code generation of subsets of language interpreters which is loosely based upon the idea of the second Futamura projection. We implemented a prototype of CompGen for GraalVM and show that our usage of a novel code-generation algorithm, incorporating interpreter directives, makes it possible to generate efficient compilers that emit fast target programs which easily outperform the first Futamura projection in compilation time. We evaluated our approach with GraalJS, an ECMAScript-compliant interpreter, and standard JavaScript benchmarks, showing that our approach achieves 2-3x speedups of partial evaluation.
Tribuo: Machine Learning with Provenance in Java
Machine Learning models are deployed across a wide range of industries, performing a wide range of tasks. Tracking these models and ensuring they behave appropriately is becoming increasingly difficult as the number of models increases. Current ML monitoring systems provide provenance and tracking by layering on top of the library that performs the ML computation, allowing room for developer confusion and mistakes. In this paper we introduce Tribuo, a Java ML library which integrates model training, inference, strong type-safety, runtime checking, and automatic provenance recording into a single framework. All Tribuo’s models and evaluations record the full data pipeline of training and testing data, along with the training algorithms, hyperparameters and data transformation steps automatically. This data lives inside the model object and can be persisted separately using common markup formats. Tribuo implements many popular ML algorithms for classification, regression, clustering, multi-label classification and anomaly detection, along with interfaces to XGBoost, TensorFlow and ONNX Runtime. Tribuo’s source code is available at https://github.com/oracle/tribuo under an Apache 2.0 license with documentation and tutorials available at https://tribuo.org.
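A small usage sketch based on Tribuo's public classification API as described in its documentation and tutorials; the CSV path and response column are placeholders, and the provenance is printed directly from the trained model object.

```java
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

import java.nio.file.Paths;

public class TribuoProvenanceExample {
    public static void main(String[] args) throws Exception {
        // "data.csv" and the response column "label" are placeholders.
        var source = new CSVLoader<>(new LabelFactory()).loadDataSource(Paths.get("data.csv"), "label");
        var split = new TrainTestSplitter<>(source, 0.7, 1L);
        var train = new MutableDataset<>(split.getTrain());
        var test = new MutableDataset<>(split.getTest());

        Model<Label> model = new LogisticRegressionTrainer().train(train);
        LabelEvaluation evaluation = new LabelEvaluator().evaluate(model, test);
        System.out.println(evaluation);

        // The training pipeline (data source, transformations, trainer, hyperparameters)
        // is recorded inside the model object itself.
        System.out.println(model.getProvenance());
    }
}
```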
Low-Overhead Multi-Language Dynamic Taint Analysis on Managed Runtimes through Speculative Optimization
Conference presentation of the paper http://ol-archivist.us.oracle.com/archivist/document/2021-0512
Searching Near and Far for Examples in Data Augmentation
In this work, we demonstrate that augmenting a dataset with examples that are far from the initial training set can lead to significant improvements in test set accuracy. We draw on the similarity of deep neural networks and nearest neighbor models. Like a nearest neighbor classifier, we show that, for any test example, augmentation with a single, nearby training example of the same label--followed by retraining--is often sufficient for a BERT-based model to correctly classify the test example. In light of this result, we devise FRaNN, an algorithm that attempts to cover the embedding space defined by the trained model with training examples. Empirically, we show that FRaNN, and its variant FRaNNk, construct augmented datasets that lead to models with higher test set accuracy than either uncertainty sampling or a random augmentation baseline.
Private Cross-Silo Federated Learning for Extracting Vaccine Adverse Event Mentions
Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for users to jointly train a global model without physically sharing their data. Users can indirectly contribute to, and directly benefit from a much larger aggregate data corpus used to train the global model. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying a FL based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass scale vaccination programs. We present a comprehensive empirical analysis of various dimensions of benefits gained with FL based training. Furthermore, we investigate effects of tighter Differential Privacy (DP) constraints in highly sensitive settings where federation users must enforce Local DP to ensure strict privacy guarantees. We show that local DP can severely cripple the global model’s prediction accuracy, thus disincentivizing users from participating in the federation. In response, we demonstrate how recent innovation on personalization methods can help significantly recover the lost accuracy.
Just-in-Time Compiling Ruby Regexps on TruffleRuby
Just-in-Time Compiling Ruby Regexps on TruffleRuby, a presentation about the performance benefits gained by the adoption of TRegex in TruffleRuby.
ICDAR 2021 Scientific Literature Parsing Competition
Documents in Portable Document Format (PDF) are ubiquitous with over 2.5 trillion documents. PDF format is human readable but not easily understood by machines and the large number of different styles makes it difficult to process the large variety of documents effectively. Our ICDAR 2021 Scientific Literature Parsing Competition offers participants a large number of training and evaluation examples compared to previous competitions. Top competition results show a significant increase in performance compared to previously reported results on the competition data sets. Most of the current methods for document understanding rely on deep learning, which requires a large number of training examples. We have generated large data sets that have been used in this competition. Our competition is split into two tasks to understand document layouts (Task A) and tables (Task B). In Task A, Document Layout Recognition, submissions with the highest performance combine object detection and specialised solutions for the different categories. In Task B, Table Recognition, top submissions rely on methods to identify table components and post-processing methods to generate the table structure and content. Results from both tasks show an impressive performance and open up the possibility of high performance practical applications.
The Future Is Big Graphs: A Community View on Graph Processing Systems
Graphs are, by nature, 'unifying abstractions' that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
Exploring Time-Space trade-offs for "synchronized" in Lilliput
In the context of Project Lilliput, which attempts to reduce the size of the object header in the HotSpot Java Virtual Machine (JVM), we explore a curated set of synchronization algorithms. Each of the algorithms could serve as a potential replacement implementation for the “synchronized” construct in HotSpot. Collectively, the algorithms illuminate trade-offs in space-time properties. The key design decisions are where to locate synchronization metadata (monitor fields), how to map from an object to those fields, and the lifecycle of the monitor information. The reader is assumed to be familiar with the current HotSpot implementation of “synchronized” as well as the Compact Java Monitors (CJM) design.
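For readers less familiar with the construct under discussion, a minimal example of Java monitor-based synchronization follows. It is purely illustrative; where the monitor metadata backing these operations lives (header bits, side tables, or inflated monitors) is exactly the design space the talk explores.

```java
// Illustrative only: every Java object can act as a monitor. Project Lilliput
// is concerned with where the metadata for these monitor operations is stored,
// not with how application code uses the construct.
public class Counter {
    private long value;

    public synchronized void increment() {   // acquires the monitor of 'this'
        value++;
    }

    public long read() {
        synchronized (this) {                 // explicit monitor enter/exit
            return value;
        }
    }
}
```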
Private Cross-Silo Federated Learning for Extracting Vaccine Adverse Event Mentions
Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for users to jointly train a global model without physically sharing their data. Users can indirectly contribute to, and directly benefit from, a much larger aggregate data corpus used to train the global model. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying an FL-based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass-scale vaccination programs. We present a comprehensive empirical analysis of various dimensions of benefits gained with FL-based training. Furthermore, we investigate the effects of tighter Differential Privacy (DP) constraints in highly sensitive settings where federation users must enforce Local DP to ensure strict privacy guarantees. We show that local DP can severely cripple the global model’s prediction accuracy, thus disincentivizing users from participating in the federation. In response, we demonstrate how recent innovation on personalization methods can help significantly recover the lost accuracy. We focus our analysis on the Federated Fine-Tuning algorithm, FedFT, and prove that it is not PAC Identifiable, thus making it even more attractive for FL-based training.
Mention Flags (MF): Constraining Transformer-based Text Generators
This paper focuses on Seq2Seq (S2S) constrained text generation, where the text generator is constrained to mention specific words, which are inputs to the encoder, in the generated outputs. Pre-trained S2S models or a Copy Mechanism are trained to copy the surface tokens from encoders to decoders, but they cannot guarantee constraint satisfaction. Constrained decoding algorithms always produce hypotheses satisfying all constraints. However, they are computationally expensive and can lower the generated text quality. In this paper, we propose Mention Flags (MF), which trace whether lexical constraints are satisfied in the generated outputs in an S2S decoder. The MF models are trained to generate tokens until all constraints are satisfied, guaranteeing high constraint satisfaction. Our experiments on the Common Sense Generation task (CommonGen) (Lin et al., 2020), the End2end Restaurant Dialog task (E2ENLG) (Dušek et al., 2020) and the Novel Object Captioning task (nocaps) (Agrawal et al., 2019) show that the MF models maintain higher constraint satisfaction and text quality than the baseline models and other constrained decoding algorithms, achieving state-of-the-art performance on all three tasks. These results are achieved with a much lower run-time than constrained decoding algorithms. We also show that the MF models work well in the low-resource setting.
aDFS: An Almost Depth-First-Search Distributed Graph-Querying System
Graph processing is an invaluable tool for data analytics. In particular, pattern-matching queries enable flexible graph exploration and analysis, similar to what SQL provides for relational databases. Graph queries focus on following connections in the data; they are a challenging workload because even seemingly trivial queries can easily produce billions of intermediate results and irregular data access patterns. In this paper, we introduce aDFS: a distributed graph-querying system that can process practically any query fully in memory, while maintaining bounded runtime memory consumption. To achieve this behavior, aDFS relies on (i) almost depth-first (aDFS) graph exploration with some breadth-first characteristics for performance, and (ii) non-blocking dispatching of intermediate results to remote edges. We evaluate aDFS against state-of-the-art systems for graph querying (Neo4J and GraphFrames for Apache Spark), graph mining (G-Miner, Fractal, and Peregrine), as well as dataflow joins (BiGJoin), and show that aDFS significantly outperforms prior work on a diverse selection of workloads.
aDFS: An Almost Depth-First-Search Distributed Graph-Querying System (Presentation Slides)
Presentation slides for the paper "aDFS: An Almost Depth-First-Search Distributed Graph-Querying System" accepted at USENIX ATC 2021.
Doing More with Less: Characterizing Dataset Downsampling for AutoML
Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric, subject to a limited execution time budget. While this approach to generating candidates works well for small tabular datasets, the same procedure does not directly scale to larger tabular datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. We carry out an extensive empirical evaluation of the impact that downsampling – reducing the number of rows in the input tabular dataset – has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.
Retail markdown price optimization and inventory allocation under demand parameter uncertainty
This paper discusses a prescriptive analytics approach to solving a joint markdown pricing and inventory allocation optimization problem under demand parameter uncertainty. We consider a retailer capable of price differentiation among multiple customer groups with different demand parameters that are supplied from multiple warehouses or fulfillment centers at different costs. In particular, we consider a situation when the retailer has a limited amount of inventory that must be sold by a certain exit date. Since in most practical situations the demand parameters cannot be estimated exactly, we propose an approach to optimize the expected value of the profit based on the given distribution of the demand parameters and analyze the properties of the solution. We also describe a predictive demand model to estimate the distribution of the demand parameters based on the historical sales data. Since the sales data usually include multiple similar products embedded into a hierarchical structure, we suggest an approach to the demand modeling that takes advantage of the merchandise and location hierarchies.
Scalable String Analysis: An Experience Report (Presentation slides)
Presentation slides for the paper "Scalable String Analysis: An Experience Report" accepted at SOAP'21
Towards Intelligent Application Security
Over the past 20 years we have seen application security evolve from analysing application code through Static Application Security Testing (SAST) tools, to detecting vulnerabilities in running applications via Dynamic Application Security Testing (DAST) tools. The past 10 years have seen new flavours of tools to provide combinations of static and dynamic tools via Interactive Application Security Testing (IAST), examination of the components and libraries of the software called Software Composition Analysis (SCA), protection of web applications and APIs using signature-based Web Application Firewalls (WAF), and monitoring the application and blocking attacks through Runtime Application Self Protection (RASP) techniques. The past 10 years have also seen an increase in the uptake of the DevOps model that combines software development and operations to provide continuous delivery of high quality software. As security has become more important, the DevOps model has evolved to the DevSecOps model where software development, operations and security are all integrated. There has also been increasing usage of learning techniques, including machine learning and program synthesis. Several tools have been developed that make use of machine learning to help developers make quality decisions about their code, their tests, or the runtime overhead their code produces. However, such techniques have not yet been applied to application security. In this talk I discuss how to provide an automated approach to integrate security into all aspects of application development and operations, aided by learning techniques. This incorporates signals from code, operations, and beyond, together with automation, to provide actionable intelligence to developers, security analysts, operations staff, and autonomous systems. I will also consider how malware and threat intelligence can be incorporated into this model to support Intelligent Application Security in a rapidly evolving world.
Scalable String Analysis: An Experience Report
Static string analysis underpins many security-related analyses, including detection of SQL injections and cross-site scripting. Even though string analysis has received much attention, none of the known techniques are effective on large codebases. In this paper we present OLSA -- a tool for scalable static string analysis of large Java programs. OLSA's analysis is based on intra-procedural string value flow graphs connected via call-graph edges. Formally, this uses a context-sensitive grammar to generate the set of possible strings. We evaluate our approach by using OLSA to detect SQL injections and unsafe uses of reflection in the DaCapo benchmarks and a large internal Java codebase, and compare the performance of OLSA with the state-of-the-art string analyser JSA. The results of this experimentation indicate that our approach can analyse industrial-scale codebases in a matter of hours, whereas JSA does not scale to many DaCapo programs. The set of potential strings generated by our string analysis can be used for checking the validity of the reported potential vulnerabilities.
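To make the analysis target concrete, here is a hypothetical example (not taken from the paper) of the kind of string flow such an analysis tracks: a user-controlled value reaching a SQL query through string concatenation.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical example of the pattern a static string analysis flags:
// an attacker-controlled string flows into a dynamically built SQL query.
class UserLookup {
    ResultSet findUser(Connection conn, String userInput) throws SQLException {
        String query = "SELECT * FROM users WHERE name = '" + userInput + "'";
        Statement stmt = conn.createStatement();
        return stmt.executeQuery(query);   // potential SQL injection sink
    }
}
```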
Compiler-Assisted Object Inlining with Value Fields
Object Oriented Programming has flourished in many areas ranging from web-oriented microservices and data processing to databases. However, while representing domain entities as objects is appealing to developers, it leads to high data fragmentation as data is loaded into applications as large collections of data objects, resulting in a high memory footprint and poor locality. To minimize memory footprint and increase memory locality, embedding the payload of an object into another object (object inlining) has been considered before, but existing techniques present severe limitations that prevent it from becoming a widely adopted technique. We argue that object inlining is mostly useful to optimize objects in the application data-path and that such objects have value semantics, unlocking great potential for inlining objects. We propose value fields, an abstraction which allows fields to be marked as having value semantics. We take advantage of the closed-world assumption provided by GraalVM Native Image to implement object inlining as a compiler phase that modifies both object layouts and accesses to inlined fields. Experimental evaluation shows that using value fields in real-world frameworks such as Apache Spark, Spring Boot, and Micronaut requires minimal to no effort from developers. Results show improvements in throughput of up to 3x, memory footprint reductions of up to 40%, and reduced GC pause times of up to 35%.
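A hedged sketch of the kind of data-path object the paper targets: a field with value semantics that never escapes its holder and could therefore be flattened into the enclosing object's layout. The names are illustrative, and the actual marking mechanism is specific to the authors' GraalVM Native Image implementation.

```java
// Illustrative only: 'position' behaves like a value -- it is never shared or
// compared by identity -- so its payload (x, y) could in principle be inlined
// into Particle's layout, removing one object header and one pointer hop.
final class Point {
    final double x;
    final double y;
    Point(double x, double y) { this.x = x; this.y = y; }
}

final class Particle {
    private Point position;          // candidate value field

    Particle(double x, double y) { this.position = new Point(x, y); }

    void move(double dx, double dy) {
        position = new Point(position.x + dx, position.y + dy);
    }
}
```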
Modeling memory bandwidth patterns on NUMA machines with performance counters
Modern computers used for data analytics are often NUMA systems with multiple sockets per machine, multiple cores per socket, and multiple thread contexts per core. To get the peak performance out of these machines requires the correct number of threads to be placed in the correct positions on the machine. One particularly interesting element of the placement of memory and threads is the way it affects the movement of data around the machine, and the increased latency this can introduce to reads and writes. In this paper we describe work on modeling the bandwidth requirements of an application on a NUMA compute node based on the placement of threads. The model is constructed by sampling performance counters while the application runs with two carefully chosen thread placements. The results of this modeling can be used in a number of ways, ranging from performance debugging during development, where the programmer can be alerted to potentially problematic memory access patterns; to systems such as Pandia, which take an application and predict the performance and system load of a proposed thread count and placement; to libraries of data structures, such as Parallel Collections and Smart Arrays, that can abstract memory placement and thread placement issues away from the user when parallelizing code.
The Flavour of Real World Vulnerability Detection and Intelligent Configuration
The Parfait static code analysis tool focuses on detecting vulnerabilities that matter in C, C++, Java and Python languages. Its focus has been on key items expected out of a commercial tool that lives in a commercial organisation, namely, precision of results (i.e., high true positive rate), scalability (i.e., being able to run quickly over millions of lines of code), incremental analysis (i.e., being able to run over deltas of the code quickly), and usability (i.e., ease of integration into standard build processes, reporting of traces to the vulnerable location, etc). Today, Parfait is used by thousands of developers at Oracle worldwide on a day-to-day basis. In this presentation we’ll sample a flavour of Parfait — we explore some real world challenges faced in the creation of a robust vulnerability detection tool, look into two examples of vulnerabilities that severely affected the Java platform in 2012/2013 and most machines since 2017, and conclude by recounting what matters to developers for integration into today’s continuous integration and continuous delivery (CI/CD) pipelines. Key to deployment of static code analysis tools is configuration of the tool itself - we present our experiences with use of machine learning to automatically configure the tool, providing users with a better out-of-the-box experience.
Intelligent Application Security
Over the past 20 years we have seen application security evolve from analysing application code through Static Application Security Testing tools, to detecting vulnerabilities in running applications via Dynamic Application Security Testing tools. The past 10 years have seen new flavours of tools: Software Composition Analysis, Web Application Firewalls, and Runtime Application Self Protection. The past 10 years have also seen an increase in the uptake of the DevOps model that combines software development and operations. Several tools have been developed that make use of machine learning to help developers make quality decisions about their code, their tests, or the runtime overhead their code produces. However, little has been done to address application security. This talk focuses on a vision for Intelligent Application Security in the context of the DevSecOps model, where security is integrated into DevOps, by informing program analysis with learning techniques including program synthesis, and keeping track of a knowledge base. What is Intelligent Application Security? Intelligent Application Security aims to provide an automated approach to integrate security into all aspects of application development and operation, at scale, using learning techniques that incorporate signals from the code and beyond, to provide actionable intelligence to developers, security analysts, operations staff, and autonomous systems.
RASPunzel for deserialization in 5 min
In this talk, we show how data-driven allowlist synthesis can help prevent deserialization vulnerabilities, which often lead to remote code execution attacks. Serialization is the process of converting an in-memory object to a persistent format (e.g., byte stream, JSON, XML, binary) and re-creating it from that format. Serialization is present in many languages like Java, Python, Ruby, and C#, and it is commonly used to exchange data in distributed systems or across different languages. In many cases, however, it can be exploited by crafting serialized payloads that trigger arbitrary code upon deserialization. The most common, and insufficient, defences against deserialization attacks are blocklists, which prevent deserialization of known malicious code. Allowlists instead restrict deserialization to known benign code, but shift the burden of creating and maintaining the list from security practitioners to developers. In this talk, we show how data-driven allowlist synthesis combined with runtime application self-protection greatly simplifies the creation and enforcement of allowlists while significantly improving security. Through a demo, we will show how a runtime application self-protection (RASP) agent enforcing a synthesized allowlist prevents real-world deserialization attacks without the need to alter or re-compile application code.
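For readers unfamiliar with allowlist enforcement, a minimal sketch using the JDK's standard ObjectInputFilter mechanism is shown below. The RASPunzel agent synthesizes and enforces its lists without such explicit application code, and the package patterns here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;

// Minimal sketch: restrict deserialization to an allowlist of classes using
// the JDK's ObjectInputFilter (Java 9+). The synthesized allowlist described
// in the talk is enforced by a RASP agent instead of hand-written code.
class SafeDeserializer {
    static Object read(byte[] payload) throws IOException, ClassNotFoundException {
        ObjectInputFilter allowlist =
            ObjectInputFilter.Config.createFilter("com.example.model.*;java.util.*;!*");
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            in.setObjectInputFilter(allowlist);   // reject anything not on the list
            return in.readObject();
        }
    }
}
```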
Private Cross-Silo Federated Learning for Extracting Vaccine Adverse Event Mentions
Automatically extracting mentions of suspected drug or vaccine adverse events (potential side effects) from unstructured text is critical in the current pandemic, but small amounts of labeled training data remain silo-ed across organizations due to privacy concerns. Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for such users to jointly train a more accurate global model without physically sharing their data. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying an FL-based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass-scale vaccination programs. Furthermore, we show that Differential Privacy (DP), which offers stronger privacy guarantees, severely cripples the global model’s prediction accuracy, thus dis-incentivizing users from participating in the federation. We demonstrate how recent innovation on personalization methods can help significantly recover the lost accuracy.
Automated GPU Out-of-Bound Access Detection and Prevention in a Managed Environment
GPUs have proven extremely effective at accelerating general-purpose workloads in fields from numerical simulation to deep learning and finance. However, even code written by experienced GPU programmers often offers little robustness, limiting the adoption of GPUs for accelerating critical applications. Out-of-bounds array accesses are one of the most common sources of errors and vulnerabilities on GPUs and can be hard to detect and prevent due to the architectural characteristics of GPUs. This work presents an automated technique ensuring detection of and protection against out-of-bounds array accesses inside CUDA GPU kernels. We compile kernels ahead of time, invoke them at run time using the Graal polyglot Virtual Machine, and execute them on the GPU. Our technique is transparent to the user and operates on the LLVM Intermediate Representation. It adds boundary checks for array accesses based on array size knowledge, available at run time thanks to the managed execution environment, and optimizes the resulting code to minimize the impact of our modifications. We test our technique on 16 different GPU kernels extracted from common GPU workloads and show that we can prevent out-of-bounds array accesses in arbitrary GPU kernels without any statistically significant execution time overhead.
Optimizing Inference Performance of Transformers on CPUs
Slides to be presented at the EuroMLSys'21 workshop
Optimizing Inference Performance of Transformers on CPUs
The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question-answering. While enormous research attention is paid to the training of those models, relatively little effort is made to improve their inference performance. This paper addresses this gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose an Adaptive Linear Module Optimization (ALMO) to speed them up. The optimization is evaluated using the inference benchmark from HuggingFace, and is shown to achieve a speedup of up to 1.71x. Notably, ALMO does not require any changes to the implementation of the models nor does it affect their accuracy.
Vate: Runtime Adaptable Probabilistic Programming in Java
Inspired by earlier work on Augur, Vate is a probabilistic programming language for the construction of JVM-based models with an object-oriented interface. As a compiled language, it is able to examine the dependency graph of the model to produce optimised code that can be dynamically targeted to different platforms.
CLAMH Introduction
The Cross-Language Microbenchmark Harness (CLAMH) provides a unique environment for running software benchmarks. It is unique in that it allows comparison across different platforms and across different languages. For example, it allows the comparison of clang, gcc, llvm, and GraalVM Sulong on the same benchmark, and can also be used to compare the Java counterparts of the same benchmark running on any JVM. CLAMH allows users to verify vendor benchmark performance claims, baseline benchmark performance in their own compute environment, compare with other compute environments, and, by so doing, identify areas where performance can be improved. CLAMH has been released as open source in the GraalVM repository: https://github.com/graalvm/CLAMH
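As a point of reference, the Java side of such cross-language comparisons typically looks like a JMH-style microbenchmark. The sketch below is illustrative of the kind of kernel a harness would pair with a C/C++ counterpart; it is not CLAMH's own benchmark format.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative JMH-style kernel; a cross-language harness would run an
// equivalent C/C++ version of the same loop for a like-for-like comparison.
@State(Scope.Thread)
public class DotProductBenchmark {
    private double[] a;
    private double[] b;

    @Setup
    public void setUp() {
        a = new double[4096];
        b = new double[4096];
        for (int i = 0; i < a.length; i++) { a[i] = i; b[i] = a.length - i; }
    }

    @Benchmark
    public double dot() {
        double sum = 0;
        for (int i = 0; i < a.length; i++) { sum += a[i] * b[i]; }
        return sum;
    }
}
```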
MSET2 Streaming Prognostics for IoT Telemetry on Oracle Roving Edge Infrastructure
Critical applications needed in real-world environments would be difficult or impossible to execute on the public cloud alone because of the massive bandwidth required to transmit and process vast amounts of data and the need to offer instant responses to the results of that analysis. Oracle's MSET2 prognostic ML algorithm, implemented on Roving Edge Clusters with NVIDIA Tesla T4 GPUs, attains unprecedented reductions in computational latencies and breakthrough throughput acceleration factors for large-scale ML streaming prognostics from dense-sensor fleets of assets in such fields as U.S. Department of Defense assets, utilities, oil & gas, commercial aviation, and prognostic cybersecurity for data center IT assets as well as DoD supervisory control and data acquisition assets and networks, and smart manufacturing.
Python auf GraalVM – eine vielfältige Welt (Python on GraalVM – A Diverse World)
Presentation at the enterPy conference (https://www.enterpy.de/), a German business-oriented Python conference. The slides are almost the same as those presented at OOW CodeOne 2019 (approved here: http://ol-archivist.us.oracle.com/archivist/document/2019-0905), updated for new features, URLs, performance numbers, and compatibility.
IFDS Taint Analysis With Access Paths
Over the years, static taint analysis has emerged as the analysis of choice to detect some of the most common web application vulnerabilities, such as SQL injection (SQLi) and cross-site scripting (XSS). Furthermore, from an implementation perspective, the IFDS dataflow framework stood out as one of the most successful vehicles to implement static taint analysis for real-world Java applications. While existing approaches scale reasonably to medium-size applications (e.g., up to one hour of analysis time for less than 100K lines of code), our experience suggests that no existing solution can scale to very large industrial code bases (e.g., more than 1M lines of code). In this paper, we present our novel IFDS-based solution to perform fast and precise static taint analysis of very large industrial Java web applications. Similar to state-of-the-art approaches to taint analysis, our IFDS-based taint analysis uses access paths to abstract objects and fields in a program. However, contrary to existing approaches, our analysis is demand-driven, which restricts the amount of code to be analyzed, and does not rely on a computationally expensive alias analysis, thereby significantly improving scalability.
Generality—or Not—in a Domain-Specific Language (A Case Study)
Slides for an invited keynote at the
Are many heaps better than one?
The recent introduction by Intel of widely available Non-Volatile RAM has reawakened interest in persistence, a hot topic of the 1980s and 90s. The most ambitious schemes of that era were not adopted; I will speculate as to why, and introduce a new approach based on multiple heaps, designed to overcome the problems. I’ll present the main features of the new persistence model, and describe a prototype implementation I’ve been working on for GraalVM Native Image. The purpose of this work-in-progress is to allow experimentation with the new model, so that the community can assess its desirability. I’ll outline the main features of the prototype and some of the remaining challenges.
Fast and Efficient Java Microservices With GraalVM @ Oracle Developer Live
Slides for the Oracle Developer Live - Java Innovations conference. The talk focuses on the benefits of Native Image and recent updates.
How to program machine learning in Java with the Tribuo library
Tribuo is a new open source library written in Java from Oracle Labs’ Machine Learning Research Group. The team’s goal for Tribuo is to build an ML library for the Java platform that is more in line with the needs of large software systems. Tribuo operates on objects rather than primitive arrays, its models are self-describing and reproducible, and it provides a uniform interface over many kinds of prediction tasks.
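A brief sketch of what training and evaluating a classifier looks like with Tribuo, based on its public classification tutorials; the CSV file name, column name, and assumption of a header row are ours, and the exact package locations should be checked against the Tribuo documentation.

```java
import java.nio.file.Paths;

import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

// Hedged sketch following Tribuo's tutorials; file and column names are assumed.
public class TribuoExample {
    public static void main(String[] args) throws Exception {
        var labelFactory = new LabelFactory();
        var csvLoader = new CSVLoader<>(labelFactory);
        // Assumes "irises.csv" has a header row and a "species" response column.
        var dataSource = csvLoader.loadDataSource(Paths.get("irises.csv"), "species");

        var splitter = new TrainTestSplitter<>(dataSource, 0.7, 1L);
        var train = new MutableDataset<>(splitter.getTrain());
        var test = new MutableDataset<>(splitter.getTest());

        var trainer = new LogisticRegressionTrainer();
        Model<Label> model = trainer.train(train);

        LabelEvaluation eval = new LabelEvaluator().evaluate(model, test);
        System.out.println("accuracy = " + eval.accuracy());
    }
}
```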
ColdPress: An Extensible Malware Analysis Platform for Threat Intelligence
Malware analysis is still largely a manual task. This slow and inefficient approach does not scale to the exponential rise in the rate of new unique malware generated. Hence, automating the process as much as possible becomes desirable. In this paper, we present ColdPress – an extensible malware analysis platform that automates the end-to-end process of malware threat intelligence gathering, with integrated output modules to generate reports in arbitrary file formats. ColdPress combines state-of-the-art tools and concepts into a modular system that aids the analyst to efficiently and effectively extract information from malware samples. It is designed as a user-friendly and extensible platform that can be easily extended with user-defined modules. We evaluated ColdPress with complex real-world malware samples (e.g., WannaCry), demonstrating its efficiency, performance and usefulness to security analysts. Our demo video is available at https://youtu.be/AwlBo1rxR1U.
Online Post-Processing in Rankings for Fair Utility Maximization
We consider the problem of utility maximization in online ranking applications while also satisfying a pre-defined fairness constraint. We consider batches of items which arrive over time, already ranked using an existing ranking model. We propose online post-processing for re-ranking these batches to enforce adherence to the pre-defined fairness constraint, while maximizing a specific notion of utility. To achieve this goal, we propose two deterministic re-ranking policies. In addition, we learn a re-ranking policy based on a novel variation of learning to search. Extensive experiments on real world and synthetic datasets demonstrate the effectiveness of our proposed policies both in terms of adherence to the fairness constraint and utility maximization. Furthermore, our analysis shows that the performance of the proposed policies depends on the original data distribution w.r.t the fairness constraint and the notion of utility.
Formal Verification of Authenticated, Append-Only Skip Lists in Agda: Extended Version
Authenticated Append-Only Skiplists (AAOSLs) enable maintenance and querying of an authenticated log (such as a blockchain) without requiring any single party to store or verify the entire log, or to trust another party regarding its contents. AAOSLs can help to enable efficient dynamic participation (e.g., in consensus) and reduce storage overhead. In this paper, we formalize an AAOSL originally described by Maniatis and Baker, and prove its key correctness properties. Our model and proofs are machine checked in Agda. Our proofs apply to a generalization of the original construction and provide confidence that instances of this generalization can be used in practice. Our formalization effort has also yielded some simplifications and optimizations.
CSR++: A Fast, Scalable, Update-Friendly Graph Data Structure
The graph model enables a broad range of analyses; thus, graph processing is an invaluable tool in data analytics. At the heart of every graph-processing system lies a concurrent graph data structure storing the graph. Such a data structure needs to be highly efficient for both graph algorithms and queries. Due to the continuous evolution, the sparsity, and the scale-free nature of real-world graphs, graph-processing systems face the challenge of providing an appropriate graph data structure that enables both fast analytic workloads and low-memory graph mutations. Existing graph structures offer a hard trade-off between read-only performance, update friendliness, and memory consumption upon updates. In this paper, we introduce CSR++, a new graph data structure that removes these trade-offs and enables both fast read-only analytics and quick and memory-friendly mutations. CSR++ combines ideas from CSR, the fastest read-only data structure, and adjacency lists to achieve the best of both worlds. We compare CSR++ to CSR, adjacency lists from the Boost Graph Library, and LLAMA, a state-of-the-art update-friendly graph structure. In our evaluation, which is based on popular graph-processing algorithms executed over real-world graphs, we show that CSR++ remains close to CSR in read-only concurrent performance (within 10% on average), while significantly outperforming CSR (by an order of magnitude) and LLAMA (by almost 2x) with frequent updates.
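For context, the CSR layout that CSR++ builds on packs a graph's adjacency into two flat arrays. The hedged sketch below shows plain CSR only; CSR++'s segmented, update-friendly extensions are not shown.

```java
// Plain CSR (Compressed Sparse Row) for a directed graph with 4 vertices and
// edges 0->1, 0->2, 1->2, 3->0. CSR++ extends this idea with per-segment
// structures so vertices and edges can be added without rebuilding the arrays.
final class CsrGraph {
    // rowPtr[v] .. rowPtr[v+1] delimit v's neighbors inside colIdx.
    final int[] rowPtr = {0, 2, 3, 3, 4};
    final int[] colIdx = {1, 2, 2, 0};

    void forEachNeighbor(int v, java.util.function.IntConsumer action) {
        for (int i = rowPtr[v]; i < rowPtr[v + 1]; i++) {
            action.accept(colIdx[i]);
        }
    }
}
```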
A Latina in Tech
Having started my Computer Science degree while growing up in Colombia and later completing it in Australia, I went from being an overrepresented Latina to being an underrepresented one. Further, the female to male ratio in CS in both countries was also rather different.
Being a mum, a wife, a teacher, a researcher, a manager and a leader, in this talk, I provide some of my lessons learnt throughout my career, with examples of successes and failures throughout my PhD, academic life, and industrial research life.
The University of Queensland and Oracle team up to develop world-class cyber security experts
The field of cyber security is coming of age, with more than a million job openings globally, including many in Australia, and a strong move from reactive to preventative security taking form. At The University of Queensland, teaming up with industry specialists like Oracle Labs – the research and development branch of global technology firm Oracle – will ensure both industry and researchers can focus on the real issues that businesses and users care about.
Coding Practices and Recommendations of Spring Security for Enterprise Applications
Spring Security is tremendously popular among practitioners for its ease of use in securing enterprise applications. In this paper, we study application framework misconfiguration vulnerabilities in the light of Spring Security, which is relatively understudied in the existing literature. Towards that goal, we identify 6 types of security anti-patterns and 4 insecure vulnerable defaults through a measurement-based study of 28 Spring applications. Our analysis shows that security risks associated with the identified security anti-patterns and insecure defaults can leave the enterprise application vulnerable to a wide range of high-risk attacks. To prevent these high-risk attacks, we also provide recommendations for practitioners. Consequently, our study has contributed one update to the official Spring Security documentation, while other security issues identified in this study are being considered for future major releases by the Spring Security community.
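One example of the kind of misconfiguration such a study looks for is shown below as a hedged illustration (it is not necessarily one of the six anti-patterns identified in the paper): disabling CSRF protection for convenience in a session-based web application, written in the older WebSecurityConfigurerAdapter style that was current when the paper was published.

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

// Hedged illustration of a common Spring Security misconfiguration:
// globally disabling CSRF protection in a stateful web application.
@Configuration
@EnableWebSecurity
public class InsecureWebConfig extends WebSecurityConfigurerAdapter {
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http.csrf().disable()                  // anti-pattern: CSRF left unprotected
            .authorizeRequests()
            .anyRequest().authenticated()
            .and()
            .httpBasic();
    }
}
```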
Private Federated Learning with Domain Adaptation
In a federated learning (FL) system, users can collaborate to build a shared model without explicitly sharing data, but model accuracy degrades if differential privacy guarantees are required during training. We hypothesize that domain adaptation techniques can effectively address this problem while increasing per-user prediction accuracy, especially when user data comes from disparate distributions. We present and analyze a mixture of experts (MoE) based domain adaptation approach that allows effective collaboration between users in a differentially private FL setting. Each user contributes to (and benefits from) a general, shared model to perform a common task, while maintaining a private model to adjust their predictions to their particular domain. Using both synthetic and real-world datasets, we empirically demonstrate that these private models can increase accuracy, while protecting against the release of users’ private data.
Example-based Live Programming for Everyone: Building Language-agnostic Tools for Live Programming With LSP and GraalVM
Our community has explored various approaches to improve the programming experience. Although many of them, such as Example-Based Live Programming (ELP), have been shown to be effective, they are still not widespread in conventional programming environments. A reason for that is the effort required to provide sophisticated tools that rely on run-time information. To target multiple language ecosystems, it is often necessary to implement the same concepts, but for different languages and runtimes. Two emerging technologies present an opportunity to reduce this effort significantly: the Language Server Protocol (LSP) and language implementation frameworks such as GraalVM's Truffle. In this paper, we show how an ELP system can be built in a language-agnostic way by leveraging these two technologies. Based on our approach, we implemented the Babylonian Programming system, an ELP system that has previously only been implemented for exploratory ecosystems. Our system, on the other hand, brings ELP for all languages supported by the GraalVM to Visual Studio Code (VS Code). Moreover, we outline what a language-agnostic infrastructure needs to provide and how the LSP could be extended to support ELP also independently from programming environments. Further, we demonstrate how our approach enables the use of ELP in the context of polyglot programming. We illustrate the consequences of our approach by discussing its advantages and limitations and by comparing the features of our system to other ELP systems. Moreover, we give an outlook of how tools that rely on run-time information could be built in the future. This in turn might motivate future tool builders and researchers to consider implementing more tools in a language-agnostic way from the start to make them available to a broader audience.
Women in CS panel
While women were among the first programmers in the 20th century, and contributed substantially to the industry, over the years both the CS industry and CS academia got dominated by men. In this social hour, we explore the opportunities and challenges women encounter in Computer Science through a panel discussion. Our panelists are women who have leading roles in industry, academia, and industrial research. By sharing stories via Q&A, we look forward to inspiring younger women to fulfill their highest potentials, understand how women can make it to senior positions, and enjoy their career.
Efficient Multi-word Compare and Swap.
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.
The NEBULA RPC-Optimized Architecture.
Large-scale online services are commonly structured as a network of software tiers, which communicate over the datacenter network using RPCs. Ongoing trends towards software decomposition have led to the prevalence of tiers receiving and generating RPCs with runtimes of only a few microseconds. With such small software runtimes, even the smallest latency overheads in RPC handling have a significant relative performance impact. In particular, we find that growing network bandwidth introduces queuing effects within a server’s memory hierarchy, considerably hurting the response latency of fine-grained RPCs. In this work we introduce NEBULA, an architecture optimized to accelerate the most challenging microsecond-scale RPCs, by leveraging two novel mechanisms to drastically improve server throughput under strict tail latency goals...
UnQuantize: Overcoming Signal Quantization Effects in IoT Time-Series Databases
Low-resolution quantized time-series signals present a challenge to big-data Machine Learning (ML) prognostics in IoT industrial and transportation applications. The challenge of detecting anomalies in monitored sensor signals is compounded by the fact that many industries today use 8-bit sample-and-hold analog-to-digital (A/D) converters for almost all physical transducers throughout the system. This results in the signal values being severely quantized, which adversely affects the predictive power of prognostic algorithms and can elevate empirical false-alarm and missed-alarm probabilities. Quantized signals are dense and indecipherable to the human eye, and ML algorithms are challenged to detect the onset of degradation in monitored assets due to the loss of information in the digitization process. This paper presents an autonomous ML framework that detects and classifies quantized signals and then instantiates one of two separate techniques (depending on the level of quantization) to efficiently unquantize digitized signals, returning high-resolution signals possessing the same accuracy as signals sampled with higher-bit A/D chips. This new “UnQuantize” framework works inline with streaming sensor signals, upstream from the core ML anomaly detection algorithm, yielding substantially higher anomaly-detection sensitivity, with much lower false-alarm and missed-alarm probabilities (FAPs/MAPs).
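To illustrate the underlying problem (not the UnQuantize algorithm itself), the sketch below simulates how an 8-bit A/D stage collapses a smooth signal onto a coarse grid of levels; the sensor range and drift amplitude are made up for the example.

```java
// Illustrative only: simulate 8-bit quantization of a smooth sensor signal.
// The quantized trace can take only 256 distinct levels, hiding the subtle
// deviations that prognostic ML algorithms rely on to detect degradation.
public class QuantizationDemo {
    public static void main(String[] args) {
        double fullScale = 10.0;                     // assumed sensor range: 0..10 V
        int levels = 256;                            // 8-bit A/D converter
        double step = fullScale / (levels - 1);

        for (int i = 0; i < 10; i++) {
            double t = i / 10.0;
            double signal = 5.0 + 0.01 * Math.sin(2 * Math.PI * t);  // tiny drift
            double quantized = Math.round(signal / step) * step;     // lost in quantization
            System.out.printf("raw=%.5f  quantized=%.5f%n", signal, quantized);
        }
    }
}
```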
Industrial Experience of Finding Cryptographic Vulnerabilities in Large-scale Codebases
Enterprise environments need to screen large-scale (millions of lines of code) codebases for vulnerability detection, resulting in high requirements for precision and scalability of a static analysis tool. At Oracle, Parfait is one such bug checker, providing precision and scalability of results, including inter-procedural analyses. CryptoGuard is a precise static analyzer for detecting cryptographic vulnerabilities in Java code, built on Soot. In this paper, we describe how we integrate CryptoGuard into Parfait, by changing the intermediate representation and relying on a demand-driven IFDS framework in Parfait, resulting in a precise and scalable tool for cryptographic vulnerability detection. We evaluate our tool on several large real-world applications and a comprehensive Java cryptographic vulnerability benchmark, CryptoAPI-Bench. Initial results show that the new cryptographic vulnerability detection in Parfait can detect real-world cryptographic vulnerabilities in large-scale codebases with few false positives and low runtime.
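For context, here is a hypothetical example (not from the paper or the benchmark) of the kind of Java cryptographic misuse such checkers report: a hard-coded key combined with AES in ECB mode.

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical example of misuses a cryptographic vulnerability checker flags:
// a hard-coded key (predictable secret) and AES in ECB mode (no semantic security).
class WeakCrypto {
    static byte[] encrypt(byte[] plaintext) throws Exception {
        byte[] hardCodedKey = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        SecretKeySpec key = new SecretKeySpec(hardCodedKey, "AES");
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");   // weak mode
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }
}
```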
Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores.
This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior works in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data loss window, due to disasters, to milliseconds.
Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores
This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior works in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data loss window, due to disasters, to milliseconds. These goals pose a significant set of challenges related to consistency of the backup site’s state, failures, and scalability. Slogger employs a combination of asynchronous log replication, intra-data center synchronized clocks, pipelining, batching, and a novel watermark service to address these challenges. Furthermore, Slogger is designed to be deployable as an “add-on” module in an existing distributed data store with few modifications to the original code base. Our evaluation, conducted on Slogger extensions to a 32-sharded version of LogCabin, an open source key-value store, shows that Slogger maintains a very small data loss window of 14.2 milliseconds, which is near the optimal value in our evaluation setup. Moreover, Slogger reduces the length of the data loss window by 50% compared to an incremental snapshotting technique without having any performance penalty on the primary data store. Furthermore, our experiments demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary data store.
Leveraging Extracted Model Adversaries for Improved Black Box Attacks
We present a method for adversarial input generation against black box models for reading-comprehension-based question answering. Our approach is composed of two steps. First, we approximate a victim black box model via model extraction. Second, we use our own white box method to generate input perturbations that cause the approximate model to fail. These perturbed inputs are used against the victim. In experiments we find that our method improves on the efficacy of AddAny, a white box attack, performed on the approximate model by 25% F1, and of the AddSent attack, a black box attack, by 11% F1.
Simplifying GPU Access: A Polyglot Binding for GPUs with GraalVM
GPU computing accelerates workloads and fuels breakthroughs across industries. There are many GPU-accelerated libraries developers can leverage, but integrating these libraries into existing software stacks can be challenging. Programming GPUs typically requires low-level programming, while high-level scripting languages have become very popular. Accelerated computing solutions are heterogeneous and inherently more complex. We'll present an open-source prototype called grCUDA that leverages Oracle’s GraalVM and exposes GPUs in polyglot environments. While GraalVM can be regarded as the "one VM to rule them all," grCUDA is the "one GPU binding to rule them all." Data is efficiently shared between GPUs and GraalVM languages (R, Python, JavaScript) while GPU kernels can be launched directly from those languages. Precompiled GPU kernels can be used, as well as kernels that are generated at runtime. We'll also show how to access GPU-accelerated libraries such as RAPIDS cuML.
Polyglot Code Finder
With the increasing complexity of software, it becomes even more important to build on the work of others. At the same time, websites, such as Stack Overflow or GitHub, are used by millions of developers to host their code, which could potentially be reused.
The process of finding the right code, however, is often time-consuming. In addition, the right solution may be written in a programming language that does not fit the developer's requirements. Current approaches to automate code search allow users to search for code based on keywords and transformation rules, but they are limited to one programming language.
Our approach enables developers to find code for reuse written in different languages, which is especially useful when building polyglot applications. In addition to conventional search filters, users can filter code by providing example input and expected output. Based on our approach, we have implemented a tool prototype in GraalSqueak. We evaluate both approach and prototype with an experience report.
Toward Presizing and Pretransitioning Strategies for GraalPython
Presizing and pretransitioning are run-time optimizations that reduce reallocations of lists. These two optimizations have previously been implemented (together with pretenuring) using Mementos in the V8 JavaScript engine. The design of Mementos, however, relies on the support of the garbage collector (GC) of the V8 runtime system.
In contrast to V8, dynamic language runtimes written for the GraalVM do not have access to the GC. Thus, the prior work cannot be applied directly. Instead, an alternative implementation approach without reliance on the GC is needed and poses different challenges.
In this paper we explore and analyze an approach for implementing these two optimizations in the context of GraalVM, using the Python implementation for GraalVM as an example. We substantiate these thoughts with rough performance numbers taken from our prototype on which we tested different presizing strategies.
User-defined Interface Mappings for the GraalVM
To improve programming productivity, the right tools are crucial. This starts with the choice of the programming language, which often predetermines the libraries and frameworks one can use. Polyglot runtime environments, such as GraalVM, provide mechanisms for exchanging objects and sending messages across language boundaries, which allow developers to combine different languages, libraries, and frameworks with each other. However, polyglot application developers are obligated to properly use the right interfaces for accessing their data and objects from different languages.
To reduce the mental complexity for developers and let them focus on the business logic, we introduce user-defined interface mappings - an approach for adapting cross-language messages at run-time to match an expected interface. Thereby, the translation strategies are defined in an exchangeable and easy-to-edit configuration file. Thus, different stakeholders ranging from library and framework developers up to application developers can use and extend these mappings for their needs.
Microsecond Consensus for Microsecond Applications.
We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail-over the system - this cuts the replication and fail-over latencies of the prior systems by at least 61% and 90%.
Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more - challenges that we address in this paper through a judicious combination of RDMA permissions and distributed algorithmic design.
We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead.
Non-blocking interpolation search trees with doubly-logarithmic running time
Balanced search trees typically use key comparisons to guide their operations, and achieve logarithmic running time. By relying on numerical properties of the keys, interpolation search achieves lower search complexity and better performance. Although interpolation-based data structures were investigated in the past, their non-blocking concurrent variants have received very little attention so far.
In this paper, we propose the first non-blocking implementation of the classic interpolation search tree (IST) data structure. For arbitrary key distributions, the data structure ensures worst-case O(log n + p) amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected O(log log n + p) time, and insertion and deletion run in expected amortized O(log log n + p) time, where p is a bound on the number of threads. To improve the scalability of concurrent insertion and deletion, we propose a novel parallel rebuilding technique, which should be of independent interest.
We evaluate whether the theoretical improvements translate to practice by implementing the concurrent interpolation search tree, and benchmarking it on uniform and nonuniform key distributions, for dataset sizes in the millions to billions of keys. Relative to the state-of-the-art concurrent data structures, the concurrent interpolation search tree achieves performance improvements of up to 15% under high update rates, and of up to 50% under moderate update rates. Further, ISTs exhibit up to 2x fewer cache misses, and consume 1.2 -- 2.6x less memory compared to the next best alternative on typical dataset sizes. We find that the results are surprisingly robust to distributional skew, which suggests that our data structure can be a promising alternative to classic concurrent search structures.
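The interpolation step that gives the data structure its doubly-logarithmic behavior is easiest to see on a sorted array. Below is a minimal sketch of plain, sequential interpolation search; it is not the concurrent IST from the paper, and it assumes keys whose range is modest enough that the interpolation arithmetic does not overflow.

```java
// Minimal sketch of interpolation search on a sorted array: instead of probing
// the middle element, probe where the key is *expected* to be under a roughly
// uniform key distribution. The paper's concurrent IST applies the same idea
// inside a non-blocking tree structure.
final class InterpolationSearch {
    static int indexOf(long[] sorted, long key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi && key >= sorted[lo] && key <= sorted[hi]) {
            if (sorted[hi] == sorted[lo]) {
                return sorted[lo] == key ? lo : -1;
            }
            // Interpolate the probe position from the key's relative value.
            int pos = lo + (int) ((key - sorted[lo]) * (long) (hi - lo)
                                  / (sorted[hi] - sorted[lo]));
            if (sorted[pos] == key) return pos;
            if (sorted[pos] < key) lo = pos + 1; else hi = pos - 1;
        }
        return -1;   // key not present
    }
}
```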
GraalVM Native Image Deep Dive - part I
This is the first part of the meetup, covering mostly the GraalVM ecosystem, an introduction to native images, framework support, and how to get started. David Leopoldseder will cover the way native images are built and compare JIT/AOT.
What is a Secure Programming Language? (POPL slides)
Our most sensitive and important software systems are written in programming languages that are inherently insecure, making the security of the systems themselves extremely challenging. It is often said that these systems were written with the best tools available at the time, so over time with newer languages will come more security. But we contend that all of today’s mainstream programming languages are insecure, including even the most recent ones that come with claims that they are designed to be “secure”. Our real criticism is the lack of a common understanding of what “secure” might mean in the context of programming language design. We propose a simple data-driven definition for a secure programming language: that it provides first-class language support to address the causes for the most common, significant vulnerabilities found in real-world software. To discover what these vulnerabilities actually are, we have analysed the National Vulnerability Database and devised a novel categorisation of the software defects reported in the database. This leads us to propose three broad categories, which account for over 50% of all reported software vulnerabilities, that as a minimum any secure language should address. While most mainstream languages address at least one of these categories, interestingly, we find that none address all three. Looking at today’s real-world software systems, we observe a paradigm shift in design and implementation towards service-oriented architectures, such as microservices. Such systems consist of many fine-grained processes, typically implemented in multiple languages, that communicate over the network using simple web-based protocols, often relying on multiple software environments such as databases. In traditional software systems, these features are the most common locations for security vulnerabilities, and so are often kept internal to the system. In microservice systems, these features are no longer internal but external, and now represent the attack surface of the software system as a whole. The need for secure programming languages is probably greater now than it has ever been.
PGX and Graal/Truffle/Active Libraries
A guest lecture in the CS4200 Compiler Construction course at Delft University of Technology (https://tudelft-cs4200-2019.github.io/) about PGX and Graal/Truffle/Active Libraries.
Computationally Easy, Spectrally Good Multipliers for Congruential Pseudorandom Number Generators
Congruential pseudorandom number generators rely on good multipliers, that is, integers that have good performance with respect to the spectral test. We provide lists of multipliers with a good lattice structure up to dimension eight for generators with typical power-of-two moduli, analyzing in detail multipliers close to the square root of the modulus, whose product can be computed quickly.
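The generators in question have the classic linear congruential form shown below; the sketch uses a power-of-two modulus of 2^64 implicitly through Java's wrapping long arithmetic. The multiplier and increment are Knuth's well-known MMIX constants, used here purely for illustration; the paper tabulates multipliers selected specifically for their spectral-test quality.

```java
// Sketch of a 64-bit linear congruential generator with modulus 2^64
// (implicit in Java's wrapping long arithmetic). Constants are Knuth's
// MMIX multiplier/increment, shown for illustration only; the paper lists
// alternative multipliers chosen for better spectral-test scores.
final class Lcg64 {
    private static final long MULTIPLIER = 6364136223846793005L;
    private static final long INCREMENT  = 1442695040888963407L;  // any odd constant works
    private long state;

    Lcg64(long seed) { this.state = seed; }

    long next() {
        state = state * MULTIPLIER + INCREMENT;   // reduction mod 2^64 via overflow
        return state;
    }
}
```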
Scalable Pointer Analysis of Data Structures using Semantic Models
Pointer analysis is widely used as a base for different kinds of static analyses and compiler optimizations. Designing a scalable pointer analysis with acceptable precision for use in production compilers is still an open question. Modern object-oriented languages like Java and Scala promote abstractions and code reuse, both of which make it difficult to achieve precision. Collection data structures are an example of a pervasively used component in such languages. But analyzing collection implementations with full context sensitivity leads to prohibitively long analysis times. We use semantic models to reduce the complex internal implementation of, e.g., a collection to a small and concise model. Analyzing the model with context sensitivity leads to precise results with only a modest increase in analysis time. The models must be written manually, which is feasible because a model method usually consists of only a few statements. Our implementation in GraalVM Native Image shows a rise in useful precision (1.35X rise in the number of checkcast statements that can be elided over the default analysis configuration) with a manageable performance cost (19% rise in analysis time).
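A hedged sketch of the semantic-model idea described above: for analysis purposes, the complex real implementation of a collection is replaced by a tiny stand-in that preserves only the flow facts a pointer analysis cares about (what is stored can later be loaded). The class below is illustrative and is not the paper's actual model API.

```java
// Illustrative "semantic model" of a list: a few statements instead of the real,
// heavily optimized implementation, so context-sensitive analysis stays cheap.
import java.util.Arrays;

final class ModeledList<E> {
    private Object[] elems = new Object[4];
    private int size;

    void add(E e) {                       // a store into one abstract "contents" location
        if (size == elems.length) elems = Arrays.copyOf(elems, size * 2);
        elems[size++] = e;
    }

    @SuppressWarnings("unchecked")
    E get(int i) {                        // a load from the same abstract location
        return (E) elems[i];
    }

    public static void main(String[] args) {
        ModeledList<String> list = new ModeledList<>();
        list.add("x");
        System.out.println(list.get(0));  // prints x
    }
}
```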
Microsecond Consensus for Microsecond Applications.
We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail over the system - this cuts the replication and fail-over latencies of prior systems by at least 61% and 90%, respectively.
Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more - challenges that we address in this paper through a judicious combination of RDMA permissions and distributed algorithmic design.
We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead.
Efficient Multi-Word Compare and Swap.
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.
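To make the interface concrete, here is a sketch of the MCAS semantics only: atomically compare-and-swap k independent words. For clarity this reference version serializes calls with a single lock; it does not reproduce the paper's lock-free algorithm or its k+1-CAS bound.

```java
// Semantics-only MCAS sketch: all-or-nothing update of k words.
import java.util.concurrent.atomic.AtomicLongArray;

final class McasSketch {
    private final AtomicLongArray words;
    private final Object lock = new Object();

    McasSketch(int n) { words = new AtomicLongArray(n); }

    /** Atomically: if words[idx[i]] == expected[i] for all i, set them to update[i]. */
    boolean mcas(int[] idx, long[] expected, long[] update) {
        synchronized (lock) {
            for (int i = 0; i < idx.length; i++) {
                if (words.get(idx[i]) != expected[i]) return false;
            }
            for (int i = 0; i < idx.length; i++) {
                words.set(idx[i], update[i]);
            }
            return true;
        }
    }

    public static void main(String[] args) {
        McasSketch m = new McasSketch(4);
        // Move a "token" into slot 3 only if slots 0 and 3 still hold their expected values.
        System.out.println(m.mcas(new int[]{0, 3}, new long[]{0, 0}, new long[]{0, 7})); // true
        System.out.println(m.mcas(new int[]{3}, new long[]{0}, new long[]{9}));          // false
    }
}
```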
AI Decision Support Prognostics for IoT Asset Health Monitoring, Failure Prediction, Time to Failure
This paper presents a novel tandem human-machine cognition approach for human-in-the-loop control of complex business-critical and mission-critical systems and processes that are monitored by Internet-of-Things (IoT) sensor networks and where it is of utmost importance to mitigate and avoid cognitive overload situations for the human operators. We present an advanced pattern recognition system, called the Multivariate State Estimation Technique-2, which possesses functional requirements designed to minimize the possibility of cognitive overload for human operators. These functional requirements include: (1) ultralow false alarm probabilities for all monitored transducers, components, machines, subsystems, and processes; (2) fastest mathematically possible decisions regarding the incipience or onset of anomalies in noisy process metrics; and (3) the ability to unambiguously differentiate between sensor degradation events and degradation in the systems/processes under surveillance. The prognostic machine learning innovation presented herein does not replace the role of the human in operation of complex engineering systems, but augments that role in a manner that minimizes cognitive overload by very rapidly processing, interpreting, and displaying final diagnostic and prognostic information to the human operator in a prioritized format that is readily perceived and comprehended.
ContainerStress: Autonomous Cloud-Node Scoping Framework for Big-Data ML Use Cases
Deploying big-data Machine Learning (ML) services in a cloud environment presents a challenge to the cloud vendor with respect to cloud container configuration sizing for any given customer use case. Oracle Labs has developed an automated framework that uses nested-loop Monte Carlo simulation to autonomously scale customer ML use cases of any size across the range of cloud CPU-GPU “Shapes” (configurations of CPUs and/or GPUs in cloud containers available to end customers). Moreover, the Oracle Labs and NVIDIA authors have collaborated on an ML benchmark study which analyzes the compute cost and GPU acceleration of any ML prognostic algorithm and assesses the reduction of compute cost in a cloud container comprising conventional CPUs and NVIDIA GPUs.
A DSL-based framework for performance assessment
Performance assessment is an essential verification practice in both research and industry for software quality assurance. Experiment setups for performance assessment tend to be complex. A typical experiment needs to be run for a variety of involved hardware, software versions, system settings and input parameters. Typical approaches for performance assessment are based on scripts. They do not document all variants explicitly, which makes it hard to analyze and reproduce experiment results correctly. In general they tend to be monolithic, which makes it hard to extend experiment setups systematically and to reuse features such as result storage and analysis consistently across experiments. In this paper, we present a generic approach and a DSL-based framework for performance assessment. The DSL helps the user to set and organize the variants in an experiment setup explicitly. The Runtime module in our framework executes experiments, after which results are stored together with the corresponding setups in a database. Database queries provide easy access to the results of previous experiments and the correct analysis of experiment results in the context of the experiment setup. Furthermore, we describe operations for common problems in performance assessment such as outlier detection. At Oracle, we successfully instantiate the framework and use it to nightly assess the performance of PGX [12, 6], a toolkit for parallel graph analytics.
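As a purely hypothetical illustration of the "variants declared explicitly" point above (the paper's actual DSL and API are not reproduced here), an experiment setup can be thought of as an explicit cross product of declared dimensions that is stored alongside every result:

```java
// Hypothetical shape of an explicit experiment description; names and dimensions
// are made up for illustration, not taken from the paper's framework.
import java.util.List;

final class ExperimentSketch {
    public static void main(String[] args) {
        List<String> jvms     = List.of("jdk-21", "graalvm");
        List<Integer> threads = List.of(1, 8, 32);
        List<String> datasets = List.of("graph-small", "graph-large");

        // Every variant is enumerated explicitly, so any run can be reproduced
        // and analyzed in the context of its exact setup record.
        for (String jvm : jvms)
            for (int t : threads)
                for (String data : datasets)
                    System.out.printf("run: jvm=%s threads=%d dataset=%s%n", jvm, t, data);
    }
}
```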
GraalVM Performance Talk
Second talk for Devoxx Ukraine, with a focus on performance and native images; the first one will be a basic introduction. Heavily relying on the Code One talk by Thomas.
GraalSqueak: Toward a Smalltalk-based Tooling Platform for Polyglot Programming
Polyglot programming provides software developers with a broader choice in terms of software libraries and frameworks available for building applications. Previous research and engineering activities have focused on language interoperability and the design and implementation of fast polyglot runtimes. To make polyglot programming more approachable for developers, novel software development tools are needed that help them build polyglot applications. We believe a suitable prototyping platform helps to more quickly evaluate new ideas for such tools. In this paper we present GraalSqueak, a Squeak/Smalltalk virtual machine implementation for the GraalVM. We report our experience implementing GraalSqueak, evaluate the performance of the language and the programming environment, and discuss how the system can be used as a tooling platform for polyglot programming.
Design Space Exploration of Power Delivery For Advanced Packaging Technologies
***Note: this is work the VLSI Research group did with Prof. Bakir back in 2017. The student is now graduating, and wants to finalize/publish this work.*** In this paper, a design space exploration of power delivery networks is performed for multi-chip 2.5-D and 3-D IC technologies. The focus of the paper is the effective placement of the voltage regulator modules (VRMs) for power supply noise (PSN) suppression. Multiple on-package VRM configurations have been analyzed and compared. Additionally, 3D IC chip-on-VRM and backside-of-the-package VRM configurations are studied. From the PSN perspective, the 3D IC chip-on-VRM case suppresses the PSN the most, even with high current density hotspots. The paper also studies the impact on the PSN of different parameters such as VRM-chip distance on the package, on-chip decoupling capacitor density, etc.
GraalVM Slides for JAX London
These are intro-level slides to be presented at https://jaxlondon.com
Computers and Hacking: A 50-Year View
[Slides for a 20-minute keynote talk for the MIT HackMIT hackathon weekend, Saturday, September 14, 2019.] Fifty years ago, computers were expensive, institutional rather than personal, and hard to get access to. Today computers are relatively inexpensive, personal as well as institutional, and ubiquitous. Some of the best hacking—and engineering—today involves creative use of relatively limited (and therefore cost-effective) computer resources.
Correct, Fast Remote Persistence.
Persistence of updates to remote byte-addressable persistent memory (PM), using RDMA operations (RDMA updates), is a poorly understood subject. Visibility of RDMA updates on the remote server is not the same as persistence of those updates. The remote server's configuration has significant implications for what it means for RDMA updates to be persistent on the remote server, and thus for the methods needed to correctly persist those updates. This paper presents a comprehensive taxonomy of system configurations and the corresponding methods to ensure correct remote persistence of RDMA updates. We show that the methods for correct, fast remote persistence vary dramatically, with corresponding performance trade-offs, between different remote server configurations.
The Impact of RDMA on Agreement.
Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n ≥ f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.
Vandal: A scalable security analysis framework for smart contracts
The rise of modern blockchains has facilitated the emergence of smart contracts: autonomous programs that live and run on the blockchain. Smart contracts have seen a rapid climb to prominence, with applications predicted in law, business, commerce, and governance. Smart contracts are commonly written in a high-level language such as Ethereum's Solidity, and translated to compact low-level bytecode for deployment on the blockchain. Once deployed, the bytecode is autonomously executed, usually by a Turing-complete virtual machine. As with all programs, smart contracts can be highly vulnerable to malicious attacks due to deficient programming methodologies, languages, and toolchains, including buggy compilers. At the same time, smart contracts are also high-value targets, often commanding large amounts of cryptocurrency. Hence, developers and auditors need security frameworks capable of analysing low-level bytecode to detect potential security vulnerabilities.
Renaissance: Benchmarking Suite for Parallel Applications on the JVM
Established benchmark suites for the Java Virtual Machine (JVM), such as DaCapo, ScalaBench, and SPECjvm2008, lack workloads that take advantage of the parallel programming abstractions and concurrency primitives offered by the JVM and the Java Class Library. However, such workloads are fundamental for understanding the way in which modern applications and data-processing frameworks use the JVM's concurrency features, and for validating new just-in-time (JIT) compiler optimizations that enable more efficient execution of such workloads. We present Renaissance, a new benchmark suite composed of modern, real-world, concurrent, and object-oriented workloads that exercise various concurrency primitives of the JVM. We show that the use of concurrency primitives in these workloads reveals optimization opportunities that were not visible with the existing workloads. We use Renaissance to compare performance of two state-of-the-art, production-quality JIT compilers (HotSpot C2 and Graal), and show that the performance differences are more significant than on existing suites such as DaCapo and SPECjvm2008. We also use Renaissance to expose four new compiler optimizations, and we analyze the behavior of several existing ones.
Commit-time incremental analysis
Most changes to large systems that have been deployed are quite small compared to the size of the entire system. While standard summary-based analyses reduce the code that is reanalysed, they nevertheless analyse code that has not changed. For example, a backward summary-based analysis will examine all the callers of the changed code even if the callers themselves have not changed. In this paper we present a novel approach of having summaries of the callers (called forward summaries) that enables one to analyse only the changed code. An evaluation of this approach on two representative examples demonstrates that the overheads associated with the generation of the forward summaries are recovered by performing just one or two incremental analyses. Thus this technique can be used at commit-time, where only the changed code is available.
What is a Secure Programming Language? (lecture + tutorial)
Lecture and tutorial using GraalVM and Simple Language.
What is a Secure Programming Language?
Our most sensitive and important software systems are written in programming languages that are inherently insecure, making the security of the systems themselves extremely challenging. It is often said that these systems were written with the best tools available at the time, so over time with newer languages will come more security. But we contend that all of today's mainstream programming languages are insecure, including even the most recent ones that come with claims that they are designed to be "secure". Our real criticism is the lack of a common understanding of what "secure" might mean in the context of programming language design. We propose a simple data-driven definition for a secure programming language: that it provides first-class language support to address the causes for the most common, significant vulnerabilities found in real-world software. To discover what these vulnerabilities actually are, we have analysed the National Vulnerability Database and devised a novel categorisation of the software defects reported in the database. This leads us to propose three broad categories, which account for over 50% of all reported software vulnerabilities, that as a minimum any secure language should address. While most mainstream languages address at least one of these categories, interestingly, we find that none address all three. Looking at today's real-world software systems, we observe a paradigm shift in design and implementation towards service-oriented architectures, such as microservices. Such systems consist of many fine-grained processes, typically implemented in multiple languages, that communicate over the network using simple web-based protocols, often relying on multiple software environments such as databases. In traditional software systems, these features are the most common locations for security vulnerabilities, and so are often kept internal to the system. In microservice systems, these features are no longer internal but external, and now represent the attack surface of the software system as a whole. The need for secure programming languages is probably greater now than it has ever been.
Non-Volatile Memory and Java: Part 3
A series of short articles about the impact of non-volatile memory (NVM) on the Java platform. In the first two articles I described the main hardware and software characteristics of Intel’s new Optane persistent memory. In this article I will discuss the implications of these characteristics on how we build software.
Non-volatile memory and Java, part 2: the view from software
In the first article I described the main hardware characteristics of Intel’s new Optane persistent memory. In this article I will discuss several software issues.
Non-volatile memory and Java: 1. Introducing NVM
Non-volatile RAM (NVRAM) has arrived into the computing mainstream. This development is likely to be highly disruptive: it will change the economics of the memory hierarchy by providing a new, intermediate level between DRAM and flash, but fully exploiting the new technology will require widespread changes in how we architect and write software. Despite this, there is surprisingly little awareness on the part of programmers (and their management) of the technology and its likely impact, and relatively little activity in academia (compared to the magnitude of the paradigm shift) in developing techniques and tools which programmers will need to respond to the change. In this series I will discuss the possible impact of NVRAM on the Java ecosystem. Java is the most widely used programming language: there are millions of Java developers and billions of lines of Java code in daily use.
PolyJuS: A Squeak/Smalltalk-based Polyglot Notebook System for the GraalVM
Jupyter notebooks are used by data scientists to publish their research in an executable format. These notebooks are usually limited to a single programming language. Current polyglot notebooks extend this concept by allowing multiple languages per notebook, but this comes at the cost of having to externalize and to import data across languages. Our approach for polyglot notebooks is able to provide a more direct programming experience by executing notebooks on top of a polyglot execution environment, allowing each code cell to directly access foreign data structures and to call foreign functions and methods. We implemented this approach using GraalSqueak, a Squeak/Smalltalk implementation for the GraalVM. To prototype the programming experience and experiment with further polyglot tool support, we build a Squeak/Smalltalk-based notebook UI that is compatible with the Jupyter notebook file format. We evaluate PolyJuS by demonstrating an example polyglot notebook and discuss advantages and limitations of our approach.
An Optimization-Driven Incremental Inline Substitution Algorithm for Just-in-Time Compilers
Inlining is one of the most important compiler optimizations. It reduces call overheads and widens the scope of other optimizations. But, inlining is somewhat of a black art of an optimizing compiler, and was characterized as a computationally intractable problem. Intricate heuristics, tuned during countless hours of compiler engineering, are often at the core of an inliner implementation. And despite decades of research, well established inlining heuristics are still missing. In this paper, we describe a novel inlining algorithm for JIT compilers that incrementally explores a program's call graph, and alternates between inlining and optimizations. We devise three novel heuristics that guide our inliner: adaptive decision thresholds, callsite clustering, and deep inlining trials. We implement the algorithm inside Graal, a dynamic JIT compiler for the HotSpot JVM. We evaluate our algorithm on a set of industry-standard benchmarks, including Java DaCapo, Scalabench, Spark-Perf, STMBench7 and other benchmarks, and we conclude that it significantly improves performance, surpassing state-of-the-art inlining approaches with speedups ranging from 5% up to 3×.
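A toy sketch of just one of the ingredients named above, adaptive decision thresholds: the benefit required to admit a callsite grows as more code gets inlined into the caller. The numbers and policy below are illustrative and are not Graal's actual inliner.

```java
// Illustrative adaptive-threshold inlining decision loop (not Graal's implementation).
import java.util.List;

final class InlinerSketch {
    record Callsite(String callee, int calleeSize, double relativeFrequency) {}

    static void inline(List<Callsite> candidates, int initialBudget) {
        int budget = initialBudget;
        double threshold = 1.0;                    // lenient at first
        for (Callsite c : candidates) {
            double benefit = c.relativeFrequency() / Math.max(1, c.calleeSize());
            if (budget >= c.calleeSize() && benefit * budget >= threshold) {
                System.out.println("inline " + c.callee());
                budget -= c.calleeSize();          // spend code-size budget
                threshold *= 1.5;                  // adaptive: get pickier as code grows
            }
        }
    }

    public static void main(String[] args) {
        inline(List.of(new Callsite("hotGetter", 5, 0.9),
                       new Callsite("coldHelper", 40, 0.01),
                       new Callsite("mediumLoop", 20, 0.4)),
               64);
    }
}
```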
Private Federated Learning with Domain Adaptation.
Federated Learning (FL) is a distributed machine learning (ML) paradigm that enables multiple parties to jointly re-train a shared model without sharing their data with any other parties, offering advantages in both scale and privacy. We propose a framework to augment this collaborative model-building with per-user domain adaptation. We show that this technique improves model accuracy for all users, using both real and synthetic data, and that this improvement is much more pronounced when differential privacy bounds are imposed on the FL model.
Telemetry Parameter Synthesis System for Enhanced Tuning and Validation of Machine Learning Algorithmics
Advanced machine learning (ML) prognostics are leading to increasing Return-on-Investment (ROI) for dense-sensor Internet-of-Things (IoT) applications across multiple industries including Utilities, Oil-and-Gas, Manufacturing, Transportation, and for business-critical assets in enterprise and cloud data centers. For all of these IoT prognostic applications, a nontrivial challenge for data scientists is acquiring enough time series data from executing assets with which to evaluate, tune, optimize, and validate important prognostic functional requirements that include false-alarm and missed-alarm probabilities (FAPs, MAPs), time-to-detect (TTD) metrics for early-warning of incipient issues in monitored components and systems, and overhead compute cost (CC) for real-time stream ML prognostics. In this paper we present a new data synthesis methodology called the Telemetry Parameter Synthesis System (TPSS) that can take any limited chunk of real sensor telemetry from monitored assets, decompose the sensor signals into deterministic and stochastic components, and then generate millions of hours of high-fidelity synthesized telemetry signals that possess exactly the same serial correlation structure and statistical idiosyncrasies (resolution, variance, skewness, kurtosis, auto-correlation content, and spikiness) as the real telemetry signals from the IoT monitored critical assets. The synthesized signals bring significant value-add for ML data science researchers for evaluation and tuning of candidate ML algorithmics and for offline validation of important prognostic functional requirements including sensitivity, false alarm avoidance, and overhead compute cost. The TPSS has become an indispensable tool in Oracle’s ongoing development of innovative diagnostic/prognostic algorithms for dense-sensor predictive maintenance applications in multiple industries.
Real Time Empirical Synchronization of IoT Signals for Improved AI Prognostics
A significant challenge for Machine Learning (ML) prognostic analyses of large-scale time series databases is variable clock skew between/among multiple data acquisition (DAQ) systems across assets in a fleet of monitored assets, and even inside individual assets, where the sheer numbers of sensors being deployed are so large that multiple individual DAQs, each with their own internal clocks, can create significant clock-mismatch issues. For Big Data prognostic anomaly detection, we have discovered and amply demonstrated that variable clock skew issues in the timestamps for time series telemetry signatures cause poor performance for ML prognostics, resulting in high false-alarm and missed-alarm probabilities (FAPs and MAPs). This paper describes a new Analytical Resampling Process (ARP) that embodies novel techniques in the time domain and frequency domain for interpolative online normalization and optimal phase coherence so that all system telemetry time series outputs are available in a uniform format and aligned with a common sampling frequency. More importantly, the “optimality” of the proposed technique gives end users the ability to select between “ultimate accuracy” or “lowest overhead compute cost”, for automated coherence synchronization of collections of time series signatures, whether from a few sensors, or hundreds of thousands of sensors, and regardless of the sampling rates and signal-to-noise (S/N) ratios for those sensors.
Brief Announcement: Persistent Multi-Word Compare-and-Swap
This brief announcement presents a fundamental concurrent primitive for persistent memory – a persistent atomic multi-word compare-and-swap (PMCAS). We present a novel algorithm carefully crafted to ensure that atomic updates to a multitude of words modified by the PMCAS are persisted correctly. Our algorithm leverages hardware transactional memory (HTM) for concurrency control, and has a total of 3 persist barriers in its critical path. We also overview variants based on just the compare-and-swap (CAS) instruction and a hybrid of CAS and HTM.
A Two-List Framework for Accurate Detection of Frequent Items in Data Streams
The problem of detecting the most frequent items in large data sets and providing accurate frequency estimates for those items is becoming more and more important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. An algorithm called FSSA giving an efficient array-based implementation of this framework is presented. An adaptive version of this algorithm is also presented, which adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. Analytical comparison with the FSS algorithm showed that FSSA has smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. A theoretical analysis of space and time complexity for FSSA and its benchmark algorithms was performed. Finally, we showed that the FSS2L framework can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.
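For context, a sketch of the classic Space-Saving scheme that the FSS/FSSA family builds on: keep a bounded set of counters, and let an unmonitored item evict the current minimum and inherit its count as an overestimate. This is the textbook baseline, not the paper's two-list FSSA structure.

```java
// Classic Space-Saving sketch for frequent-item detection with bounded memory.
import java.util.HashMap;
import java.util.Map;

final class SpaceSaving {
    private final int capacity;
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSaving(int capacity) { this.capacity = capacity; }

    void offer(String item) {
        Long c = counts.get(item);
        if (c != null) {
            counts.put(item, c + 1);
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);
        } else {
            // Evict the item with the smallest count; the newcomer takes min + 1.
            String minKey = null; long min = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < min) { min = e.getValue(); minKey = e.getKey(); }
            }
            counts.remove(minKey);
            counts.put(item, min + 1);
        }
    }

    public static void main(String[] args) {
        SpaceSaving ss = new SpaceSaving(2);
        for (String s : new String[]{"a", "b", "a", "c", "a", "a", "b"}) ss.offer(s);
        System.out.println(ss.counts);   // "a" dominates; estimates are upper bounds
    }
}
```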
Closing the Performance Gap Between Volatile and Persistent Key-Value Stores Using Cross-Referencing Logs
Key-Value (K-V) stores are an integral building block of modern datacenter applications. With byte addressable persistent memory (PM) technologies, such as Intel/Micron’s 3D XPoint, on the horizon, there has been an influx of new high performance K-V stores that leverage PM for performance. However, there remains a significant performance gap between PM optimized K-V stores and DRAM resident ones, largely reflecting the gap between projected PM latency relative to that of DRAM. We address that performance gap with Bullet, a K-V store that leverages both the byte-addressability of PM and the lower latency of DRAM, using a technique called cross-referencing logs (CRLs) to keep most PM updates off the critical path. Bullet delivers performance approaching that of DRAM resident K-V stores by maintaining two hash tables, one in the slower (backend) PM and the other in the faster (frontend) DRAM. CRLs are a scalable persistent logging mechanism that keeps the two copies mutually consistent. Bullet also incorporates several critical optimizations, such as dynamic load balancing between frontend and backend threads, support for nonblocking Gets, and opportunistic omission of stale updates in the backend. This combination of implementation techniques delivers performance within 5% of that of DRAM-only key-value stores for realistic (read-heavy) workloads. Our general approach, based on CRLs, is “universal” in that it can be used to turn any volatile K-V store into a persistent one (or vice-versa, provide a fast cache for a persistent K-V store).
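A loose, hedged sketch of the front-end/back-end split described above: an update is recorded in a log (standing in for the persistent cross-referencing log) and applied to a fast front-end table, while a background step later drains the log into the back-end table. All names are illustrative; this is not Bullet's code and it does not model persistent-memory ordering.

```java
// Illustrative two-tier store: fast front end, logged updates drained to a back end.
import java.util.Map;
import java.util.concurrent.*;

final class FrontBackStore {
    private final Map<String, String> front = new ConcurrentHashMap<>(); // stands in for DRAM
    private final Map<String, String> back  = new ConcurrentHashMap<>(); // stands in for PM
    private final BlockingQueue<String[]> log = new LinkedBlockingQueue<>(); // stands in for a CRL

    void put(String k, String v) {
        log.add(new String[]{k, v});   // record the update first (off the critical read path)
        front.put(k, v);               // immediately visible to readers
    }

    String get(String k) { return front.getOrDefault(k, back.get(k)); }

    void drainOnce() throws InterruptedException {     // one back-end worker step
        String[] e = log.take();
        back.put(e[0], e[1]);
    }

    public static void main(String[] args) throws Exception {
        FrontBackStore s = new FrontBackStore();
        s.put("answer", "42");
        System.out.println(s.get("answer"));   // 42, served from the front end
        s.drainOnce();                          // later applied to the back end
    }
}
```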
Placement of Virtual Containers on NUMA systems: A Practical and Comprehensive Model
Our work addresses the problem of placement of threads, or virtual cores, onto physical cores in a multicore NUMA system. Different placements result in varying degrees of contention for shared resources, so choosing the right placement can have a large effect on performance. Prior work has studied this problem, but either addressed hardware with specific properties, leaving us unable to generalize the models to other systems, or modeled much simpler effects than the actual performance in different placements. Our contribution is a general framework for reasoning about workload placement on machines with shared resources. It enables us to build an accurate performance model for any machine with a hierarchy of known shared resources automatically, with only minimal input from the user. Using our methodology, data center operators can minimize the number of NUMA (CPU+memory) nodes allocated for an application or a service, while ensuring that it meets performance objectives.
PerfIso: Performance Isolation for Commercial Latency-Sensitive Services
Large commercial latency-sensitive services, such as web search, run on dedicated clusters provisioned for peak load to ensure responsiveness and tolerate data center outages. As a result, the average load is far lower than the peak load used for provisioning, leading to resource under-utilization. The idle resources can be used to run batch jobs, completing useful work and reducing overall data center provisioning costs. However, this is challenging in practice due to the complexity and stringent tail-latency requirements of latency-sensitive services. Left unmanaged, the competition for machine resources can lead to severe response-time degradation and unmet service-level objectives (SLOs). This work describes PerfIso, a performance isolation framework which has been used for nearly three years in Microsoft Bing, a major search engine, to colocate batch jobs with production latency-sensitive services on over 90,000 servers. We discuss the design and implementation of PerfIso, and conduct an experimental evaluation in a production environment. We show that colocating CPU-intensive jobs with latency-sensitive services increases average CPU utilization from 21% to 66% for off-peak load without impacting tail latency.
An early look at the LDBC social network benchmark's business intelligence workload
In this short paper, we provide an early look at the LDBC Social Network Benchmark's Business Intelligence (BI) workload which tests graph data management systems on a graph business analytics workload. Its queries involve complex aggregations and navigations (joins) that touch large data volumes, which is typical in BI workloads, yet they depend heavily on graph functionality such as connectivity tests and path finding. We outline the motivation for this new benchmark, which we derived from many interactions with the graph database industry and its users, and situate it in a scenario of social network analysis. The workload was designed by taking into account technical "chokepoints" identified by database system architects from academia and industry, which we also describe and map to the queries. We present reference implementations in openCypher, PGQL, SPARQL, and SQL, and preliminary results of SNB BI on a number of graph data management systems.
Analytics with smart arrays: adaptive and efficient language-independent data
This paper introduces smart arrays, an abstraction for providing adaptive and efficient language-independent data storage. Their smart functionalities include NUMA-aware data placement across sockets and bit compression. We show how our single C++ implementation can be used efficiently from both native C++ and compiled Java code. We experimentally evaluate smart arrays on a diverse set of C++ and Java analytics workloads. Further, we show how their smart functionalities affect performance and lead to differences in hardware resource demands on multi-core machines, motivating the need for adaptivity. We observe that smart arrays can significantly decrease the memory space requirements of analytics workloads, and improve their performance by up to 4×. Smart arrays are the first step towards general smart collections with various smart functionalities that enable the consumption of hardware resources to be traded-off against one another.
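One of the smart functionalities mentioned above, bit compression, can be illustrated with a minimal bit-packed array: values of a fixed bit width are stored back to back inside machine words instead of one word each. This is an illustrative Java sketch; the paper's smart arrays are a C++ library that additionally handles NUMA-aware placement and adaptivity.

```java
// Fixed-width bit-packed array: get/set values of `bits` bits inside a long[].
final class BitPackedArray {
    private final long[] words;
    private final int bits;            // width of each element, 1..63
    private final long mask;

    BitPackedArray(int length, int bits) {
        this.bits = bits;
        this.mask = (1L << bits) - 1;
        this.words = new long[(length * bits + 63) / 64];
    }

    void set(int i, long value) {
        long bitPos = (long) i * bits;
        int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
        words[w] = (words[w] & ~(mask << off)) | ((value & mask) << off);
        if (off + bits > 64) {         // element spills into the next word
            int spill = off + bits - 64;
            long spillMask = (1L << spill) - 1;
            words[w + 1] = (words[w + 1] & ~spillMask) | ((value & mask) >>> (bits - spill));
        }
    }

    long get(int i) {
        long bitPos = (long) i * bits;
        int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
        long v = words[w] >>> off;
        if (off + bits > 64) v |= words[w + 1] << (64 - off);
        return v & mask;
    }

    public static void main(String[] args) {
        BitPackedArray a = new BitPackedArray(100, 5);   // values 0..31 in 5 bits each
        a.set(0, 31); a.set(12, 1); a.set(13, 7);
        System.out.println(a.get(0) + " " + a.get(12) + " " + a.get(13)); // 31 1 7
    }
}
```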
An NVM Carol
Around 2010, we observed significant research activity around the development of non-volatile memory technologies. Shortly thereafter, other research communities began considering the implications of non-volatile memory on system design, from storage systems to data management solutions to entire systems. Finally, in July 2015, Intel and Micron Technology announced 3D XPoint. It’s now 2018; Intel is shipping its technology in SSD packages, but we’ve not yet seen the widespread availability of byte-addressable non-volatile memory that resides on the memory bus. We can view non-volatile memory technology and its impact on systems through an historical lens revealing it as the convergence of several past research trends starting with the concept of single-level store, encompassing the 1980s excitement around bubble memory, building upon persistent object systems, and leveraging recent work in transactional memory. We present this historical context, recalling past ideas that seem particularly relevant and potentially applicable and highlighting aspects that are novel.
Live Multi-language Development and Runtime Environments
Context: Software development tools should work and behave consistently across different programming languages, so that developers do not have to familiarize themselves with new tooling for new languages. Also, being able to combine multiple programming languages in a program increases reusability, as developers do not have to recreate software frameworks and libraries in the language they develop in and can reuse existing software instead.
Inquiry: However, developers often have a broad choice of tools, some of which are designed for only one specific programming language. Various Integrated Development Environments have support for multiple languages, but are usually unable to provide a consistent programming experience due to different language-specific runtime features. With regard to language integrations, common mechanisms usually use abstraction layers, such as the operating system or a network connection, which are often boundaries for tools and hence negatively affect the programming experience.
Approach: In this paper, we present a novel approach for tool reuse that aims to improve the experience with regard to working with multiple high-level dynamic, object-oriented programming languages. As part of this, we build a multi-language virtual execution environment and reuse Smalltalk’s live programming tools for other languages.
Knowledge: An important part of our approach is to retrofit and align runtime capabilities for different languages as it is a requirement for providing consistent tools. Furthermore, it provides convenient means to reuse and even mix software libraries and frameworks written in different languages without breaking tool support.
Grounding: The prototype system Squimera is an implementation of our approach and demonstrates that it is possible to reuse both development tools from a live programming system to improve the development experience as well as software artifacts from different languages to increase productivity.
Importance: In the domain of polyglot programming systems, most research has focused on the integration of different languages and corresponding performance optimizations. Our work, on the other hand, focuses on tooling and the overall programming experience.
A Parallel and Scalable Processor for JSON Data.
Increasing interest in JSON data has created a need for its efficient processing. Although JSON is a simple data exchange format, its querying is not always effective, especially in the case of large repositories of data. This work aims to integrate the JSONiq extension to the XQuery language specification into an existing query processor (Apache VXQuery) to enable it to query JSON data in parallel. VXQuery is built on top of Hyracks (a framework that generates parallel jobs) and Algebricks (a language-agnostic query algebra toolbox) and can process data on the fly, in contrast to other well-known systems which need to load data first. Thus, the extra cost of data loading is eliminated. In this paper, we implement three categories of rewrite rules which exploit the features of the above platforms to efficiently handle path expressions along with introducing intra-query parallelism. We evaluate our implementation using a large (803GB) dataset of sensor readings. Our results show that the proposed rewrite rules lead to efficient and scalable parallel processing of JSON data.
Persistent Memory Transactions
This paper presents a comprehensive analysis of performance trade-offs between implementation choices for transaction runtime systems on persistent memory. We compare three implementations of transaction runtimes: undo logging, redo logging, and copy-on-write. We also present a memory allocator that plugs into these runtimes. Our microbenchmark-based evaluation focuses on understanding the interplay between various factors that contribute to performance differences between the three runtimes: read/write access patterns of workloads, size of the persistence domain (the portion of the memory hierarchy where the data is effectively persistent), cache locality, and transaction runtime bookkeeping overheads. No single runtime emerges as a clear winner. We confirm our analysis in more realistic settings of three "real world" applications we developed with our transactional API: (i) a key-value store we implemented from scratch, (ii) a SQLite port, and (iii) a persistified version of memcached, a popular key-value store. These findings are not only consistent with our microbenchmark analysis, but also provide additional interesting insights into other factors (e.g. effects of multithreading and synchronization) that affect application performance.
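A toy sketch of the first of the three designs compared above, undo logging: before each transactional store, the old value is recorded so that an interrupted transaction can be rolled back. This is illustrative only; a real persistent-memory runtime also needs persist barriers and crash-consistent allocation, which plain Java cannot express.

```java
// Undo-logging transaction sketch over an array standing in for persistent memory.
import java.util.ArrayDeque;
import java.util.Deque;

final class UndoLogTx {
    private final long[] heap;                                 // stand-in for PM
    private final Deque<long[]> undo = new ArrayDeque<>();     // [index, oldValue] records

    UndoLogTx(long[] heap) { this.heap = heap; }

    void store(int index, long value) {
        undo.push(new long[]{index, heap[index]});   // log the old value first
        heap[index] = value;                         // then mutate in place
    }

    void commit() { undo.clear(); }                  // old values no longer needed

    void abort() {                                   // roll back in reverse order
        while (!undo.isEmpty()) {
            long[] rec = undo.pop();
            heap[(int) rec[0]] = rec[1];
        }
    }

    public static void main(String[] args) {
        long[] heap = new long[4];
        UndoLogTx tx = new UndoLogTx(heap);
        tx.store(0, 7);
        tx.store(1, 9);
        tx.abort();                                   // simulate a failure before commit
        System.out.println(heap[0] + " " + heap[1]);  // 0 0
    }
}
```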
Dominance-Based Duplication Simulation (DBDS)
Compilers perform a variety of advanced optimizations to improve the quality of the generated machine code. However, optimizations that depend on the data flow of a program are often limited by control flow merges. Code duplication can solve this problem by hoisting, i.e. duplicating, instructions from merge blocks to their predecessors. However, finding optimization opportunities enabled by duplication is a non-trivial task that requires compile-time intensive analysis. This imposes a challenge on modern (just-in-time) compilers: Duplicating instructions tentatively at every control flow merge is not feasible because excessive duplication leads to uncontrolled code growth and compile time increases. Therefore, compilers need to find out whether a duplication is beneficial enough to be performed. This paper proposes a novel approach to determine which duplication operations should be performed to increase performance. The approach is based on a duplication simulation that enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. We show how to map duplication candidates into an optimization cost model that allows us to trade-off between different success metrics including peak performance, code size and compile time. We implemented the approach on top of the GraalVM and evaluated it with the benchmarks Java DaCapo, Scala DaCapo, JavaScript Octane and a micro-benchmark suite, in terms of performance, compilation time and code size increase.
Generic Concurrency Restriction - slides for the Dagstuhl seminar
The slides provide an overview of our work on generic concurrency restriction, which has been previously cleared for publication.
Sulong, and Thanks For All the Bugs: Finding Errors in C Programs by Abstracting from the Native Execution Model
In C, memory errors such as buffer overflows are among the most dangerous software errors; as we show, they are still on the rise. Current dynamic bug finding tools that try to detect such errors are based on the low-level execution model of the machine. They insert additional checks in an ad-hoc fashion, which makes them prone to forgotten checks for corner-cases. To address this issue, we devised a novel approach to find bugs during the execution of a program. At the core of this approach lies an interpreter that is written in a high-level language that performs automatic checks (such as bounds checks, NULL checks, and type checks). By mapping C data structures to data structures of the high-level language, accesses are automatically checked and bugs are found. We implemented this approach and show that our tool (called Safe Sulong) can find bugs that have been overlooked by state-of-the-art tools, such as out-of-bounds accesses to the main function arguments. Additionally, we demonstrate that the overheads are low enough to make our tool practical, both during development and in production for safety-critical software projects.
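A minimal illustration of the checking style described above (not Sulong's implementation): when C allocations are modeled by high-level objects, NULL checks and bounds checks fall out of the host language's own safe operations.

```java
// Modeling a C allocation with a Java object so every access is checked automatically.
final class CheckedMemory {
    private final byte[] data;                 // models one C allocation

    CheckedMemory(int size) { this.data = new byte[size]; }

    static byte load(CheckedMemory pointer, int offset) {
        if (pointer == null) throw new IllegalStateException("NULL dereference");
        return pointer.data[offset];           // JVM bounds check catches out-of-bounds reads
    }

    public static void main(String[] args) {
        CheckedMemory m = new CheckedMemory(8);
        System.out.println(load(m, 3));        // 0, a valid in-bounds read
        try {
            load(m, 9);                        // would be a silent buffer overread in C
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("out-of-bounds access detected");
        }
    }
}
```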
It's Time for Secure Languages (slides)
Slides summarising data from the National Vulnerability Database for the past 4 years pointing at the need for better language design.
It's Time for Secure Languages (SPLASH-I slides)
Language designers and developers want better ways to write good code: languages designed with simpler, more powerful abstractions accessible to a larger community of developers. However, language design does not seem to take into account security, leaving developers with the onerous task of writing attack-proof code. In 20 years, we have gone from 25 reported vulnerabilities to 6,000+ vulnerabilities reported in a year. The top two types of vulnerabilities for the past few years have been known for over 15 years. I’ll summarise data on vulnerabilities during 2013-2015 and argue that our languages must take security seriously. Languages need security-oriented constructs, and compilers must let developers know when there is a problem with their code. We need to empower developers with the concept of “security for the masses” by making available languages that do not necessarily require an expert in order to determine whether the code being written is vulnerable to attack or not.
Making collection operations optimal with aggressive JIT compilation
Functional collection combinators are a neat and widely accepted data processing abstraction. However, their generic nature results in high abstraction overheads: Scala collections are known to be notoriously slow for typical tasks. We show that proper optimizations in a JIT compiler can largely eliminate the overheads imposed by these abstractions. Using the open-source Graal JIT compiler, we achieve speedups of up to 20x on collection workloads compared to the standard HotSpot C2 compiler. Consequently, a sufficiently aggressive JIT compiler allows the language compiler, such as Scalac, to focus on other concerns. In this paper, we show how optimizations, such as inlining, polymorphic inlining, and partial escape analysis, are combined in Graal to produce collections code that is optimal with respect to manually written code, or close to optimal. We argue why some of these optimizations are more effectively done by a JIT compiler. We then identify specific use-cases that most current JIT compilers do not optimize well, warranting special treatment from the language compiler.
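The kind of combinator pipeline discussed above, written as a Java stream next to the hand-written loop an aggressive JIT should make it equivalent to. The paper's measurements concern Scala collections under Graal; this Java sketch only shows the shape of the workload.

```java
// Combinator pipeline vs. manual loop computing the same sum of squared even numbers.
import java.util.stream.LongStream;

final class CombinatorsVsLoop {
    static long withCombinators(long[] xs) {
        return LongStream.of(xs).filter(x -> x % 2 == 0).map(x -> x * x).sum();
    }

    static long withLoop(long[] xs) {
        long sum = 0;
        for (long x : xs) {
            if (x % 2 == 0) sum += x * x;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] xs = LongStream.rangeClosed(1, 1_000_000).toArray();
        System.out.println(withCombinators(xs) == withLoop(xs));   // true
    }
}
```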
Design Considerations of Monolithically Integrated Voltage Regulators for Multicore Processors
Presented in this paper are design considerations for a Monolithically Integrated Voltage Regulator (MIVR) targeting a 42 mm² multicore processor test chip taped out in a TSMC 28nm process. This is the first work discussing the utilization of on-die magnetic core inductors to support >50 A of load current. 64 inductors with a switching frequency of 140 MHz are strategically grouped into 8 interleaving phases to achieve 85% efficiency and minimize on-die voltage drop.
Evaluating quality of security testing of the JDK.
In this position paper we describe how mutation testing can be used to evaluate the quality of test suites from a security viewpoint. Our focus is on measuring the quality of the test suite associated with the Java Development Kit (JDK) because it provides the core security properties for all applications. We describe the challenges associated with identifying security-specific mutation operators that are specific to the Java model and ensuring that our solution can be automated for large code-bases like the JDK.
Behavior Based Approach to Misuse Detection of a Simulated SCADA System
This paper presents the initial findings in applying a behavior-based approach for detection of unauthorized activities in a simulated Supervisory Control and Data Acquisition (SCADA) system. Misuse detection of this type utilizes fault-free system telemetry to develop empirical models that learn normal system behavior. Future monitored telemetry sources that show statistically significant deviations from this learned behavior may indicate an attack or other unwanted actions. The experimental test bed consists of a set of Linux-based enterprise servers that were isolated from a larger university research cluster. All servers are connected to a private network and simulate several components and tasks seen in a typical SCADA system. Telemetry sources included kernel statistics, resource usages and internal system hardware measurements. For this study, the Auto Associative Kernel Regression (AAKR) and Auto Associative Multivariate State Estimation Technique (AAMSET) are employed to develop empirical models. The prognostic efficacy of these methods for computer security was evaluated using several groups of signals taken from the available telemetry classes. The Sequential Probability Ratio Test (SPRT) is used along with these models for intrusion detection purposes. The intrusion types tested include host/network discovery, DoS, brute force login, privilege escalation and malicious exfiltration actions. In this study, all intrusion types displayed alterations in the residuals of much of the monitored telemetry and were detected in all signal groups used by both model types. The methods presented can be extended and applied to industries besides nuclear that use SCADA or business-critical networks.
Simulation-based Code Duplication for Enhancing Compiler Optimizations
Compiler optimizations are often limited by control flow, which prohibits optimizations across basic block boundaries. Duplicating instructions from merge blocks to their predecessors enlarges basic blocks and can thus enable further optimizations. However, duplicating too many instructions leads to excessive code growth. Therefore, an approach is necessary that avoids code explosion and still finds beneficial duplication candidates. We present a novel approach to determine which code should be duplicated to improve peak performance. To this end, we analyze duplication candidates for subsequent optimizations by simulating a duplication and analyzing its impact on the compilation unit. This allows a compiler to find those duplication candidates that have the maximum optimization potential.
SIMULATE & DUPLICATE
Poster about simulation-based code duplication (abstract from the associated DocSymp paper). The scope of compiler optimizations is often limited by control flow, which prohibits optimizations across basic block boundaries. Code duplication can solve this problem by extending basic block sizes, thus enabling subsequent optimizations. However, duplicating code for every optimization opportunity may lead to excessive code growth. Therefore, a holistic approach is required that is capable of finding optimization opportunities and classifying their impact. This paper presents a novel approach to determine which code should be duplicated in order to improve peak performance. The approach analyzes duplication candidates for subsequent optimization opportunities. It does so by simulating a duplication operation and analyzing its impact on other optimizations. This allows a compiler to weigh up multiple success metrics in order to choose the code duplication operations with the maximum optimization potential. We further show how to map code duplication opportunities to an optimization cost model that allows us to maximize performance while minimizing code size increase.
Detecting Malicious JavaScript in PDFs Using Conservative Abstract Interpretation
To mitigate the risk posed by JavaScript-based PDF malware, we propose a static analysis technique based on abstract interpretation. Our evaluation shows that our approach can identify 100% of malware with a low rate of false positives.
Improving Parallelism in Hardware Transactional Memory
Hardware transactional memory (HTM) is supported by recent processors from Intel and IBM. HTM is attractive because it can enhance concurrency while simplifying programming. Today's HTM systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to very poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits concurrency. In this paper, we propose very simple architectural changes to the existing requester-wins HTM implementations. The idea is to support a special mode of execution in HTM, called power mode, which can be used to enhance conflict resolution between regular and so-called power transactions. A power transaction can run concurrently with regular transactions that do not conflict with it. This permits higher levels of concurrency in cases when a (regular) transaction cannot make progress due to conflicts and would require a non-speculative fallback path otherwise. Our idea is backward-compatible with existing HTM systems, imposing no additional cost on transactions that do not use the power mode. Furthermore, using power transactions requires no changes to target applications that employ traditional lock synchronization. Using extensive evaluation of micro- and STAMP benchmarks in a transactional memory simulator and real hardware-based emulation, we show that our technique significantly improves the performance of the baseline that does not use power mode, and performs comparably with state-of-the-art related proposals that require more substantial architectural changes.
Persistent Memcached: Bringing Legacy Code to Byte-Addressable Persistent Memory
We report our experience building and evaluating pmemcached, a version of memcached ported to byte-addressable persistent memory. Persistent memory is expected to not only improve overall performance of applications’ persistence tier, but also vastly reduce the “warm up” time needed for applications after a restart. We decided to test this hypothesis on memcached, a popular key-value store. We took the extreme view of persisting memcached’s entire state, resulting in a virtually instantaneous warm up phase. Since memcached is already optimized for DRAM, we expected our port to be a straightforward engineering effort. However, the effort turned out to be surprisingly complex during which we encountered several non-trivial problems that challenged the boundaries of memcached’s architecture. We detail these experiences and corresponding lessons learned.
FastR update: Interoperability, Graphics, Debugging, Profiling, and other hot topics
This talk presents an overview of the current state of FastR in a number of areas that saw significant progress in the last year, including Interoperability, Graphics, Debugging, and Compatibility.
BDgen: A Universal Big Data Generator
This paper introduces BDgen, a generator of Big Data targeting various types of users, implemented as a general and easily extensible framework. It is divided into a scalable backend designed to generate Big Data on clusters and a frontend for user-friendly definition of the structure of the required data, or its automatic inference from a sample data set. In the first release we have implemented generators of two commonly used formats (JSON and CSV) and the support for general grammars. We have also performed preliminary experimental comparisons confirming the advantages and competitiveness of the solution.
Zero-overhead R and C/C++ integration with FastR
C and C++ are traditionally used to improve performance of R applications and packages. While this is usually not necessary when using FastR, because it can run R code at near-native performance, there is a large corpus of existing code that implements critical pieces of functionality in native code. Alternative implementations of R need to simulate the R native API, which is a complex API that exposes many implementation details. Simulating this API requires significant effort, incurs performance overhead, and creates a compilation and optimization barrier between languages. FastR can employ the Truffle framework to run native code, available as LLVM bitcode, inside the optimization scope of the polyglot environment, and thus have it integrated with no optimization or integration barriers.
Trace Register Allocation Policies: Compile-time vs. Performance Trade-offs
Register allocation has to be done by every compiler that targets a register machine, regardless of whether it aims for fast compilation or optimal code quality. State-of-the-art dynamic compilers often use global register allocation approaches such as linear scan. Recent results suggest that non-global trace-based register allocation approaches can compete with global approaches in terms of allocation quality. Instead of processing the whole compilation unit at once, a trace-based register allocator divides the problem into linear code segments, called traces. In this work, we present a register allocation framework that can exploit the additional flexibility of traces to select different allocation strategies based on the characteristics of a trace. This allows fine-grained control over the compile-time vs. peak-performance trade-off. Our framework features three allocation strategies: a linear-scan-based approach that achieves good code quality, a single-pass bottom-up strategy that aims for short allocation times, and an allocator for trivial traces. We present 6 allocation policies to decide which strategy to use for a given trace. The evaluation shows that this approach can reduce allocation time by 3-43% at a peak performance penalty of about 0-9% on average. For systems that do not mainly focus on peak performance, our approach allows adjusting the time spent for register allocation, and therefore the overall compilation time, finding the optimal balance between compile time and peak performance according to an application’s requirements.
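The per-trace policy idea can be illustrated with a small, hypothetical sketch; the three strategy names follow the abstract, but the trace features and thresholds below are invented for illustration.

```java
/**
 * Minimal sketch, not the production implementation: a policy that picks a
 * register-allocation strategy per trace based on simple trace features.
 */
public class TraceAllocationPolicy {

    enum Strategy { TRIVIAL, BOTTOM_UP, LINEAR_SCAN }

    /** Hypothetical trace descriptor; real traces carry much richer information. */
    static final class Trace {
        final int instructionCount;
        final boolean containsLoop;
        final double executionFrequency; // from profiling

        Trace(int instructionCount, boolean containsLoop, double executionFrequency) {
            this.instructionCount = instructionCount;
            this.containsLoop = containsLoop;
            this.executionFrequency = executionFrequency;
        }
    }

    /** Hot or loop-carrying traces get the slower, higher-quality allocator. */
    static Strategy choose(Trace trace, double hotThreshold) {
        if (trace.instructionCount <= 2) {
            return Strategy.TRIVIAL;          // e.g., a trace that is a single unconditional jump
        }
        if (trace.containsLoop || trace.executionFrequency >= hotThreshold) {
            return Strategy.LINEAR_SCAN;      // spend compile time where it pays off
        }
        return Strategy.BOTTOM_UP;            // fast single-pass allocation for cold code
    }

    public static void main(String[] args) {
        System.out.println(choose(new Trace(1, false, 0.01), 0.1));   // TRIVIAL
        System.out.println(choose(new Trace(40, true, 0.02), 0.1));   // LINEAR_SCAN
        System.out.println(choose(new Trace(25, false, 0.03), 0.1));  // BOTTOM_UP
    }
}
```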
Practical partial evaluation for high-performance dynamic language runtimes
Most high-performance dynamic language virtual machines duplicate language semantics in the interpreter, compiler, and runtime system. This violates the principle to not repeat yourself. In contrast, we define languages solely by writing an interpreter. The interpreter performs specializations, e.g., augments the interpreted program with type information and profiling information. Compiled code is derived automatically using partial evaluation while incorporating these specializations. This makes partial evaluation practical in the context of dynamic languages: It reduces the size of the compiled code while still compiling all parts of an operation that are relevant for a particular program. When a speculation fails, execution transfers back to the interpreter, the program re-specializes in the interpreter, and later partial evaluation again transforms the new state of the interpreter to compiled code. We evaluate our approach by comparing our implementations of JavaScript, Ruby, and R with best-in-class specialized production implementations. Our general-purpose compilation system is competitive with production systems even when they have been heavily optimized for the one language they support. For our set of benchmarks, our speedup relative to the V8 JavaScript VM is 0.83x, relative to JRuby is 3.8x, and relative to GNU R is 5x.
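The following toy Java class, which deliberately avoids the real Truffle APIs, illustrates the specialize/deoptimize/re-specialize cycle described above for a single add operation; the node shape and states are invented for this example.

```java
/**
 * Toy sketch of the specializing-interpreter idea (not the Truffle API):
 * a node speculates on integer operands and falls back to a generic case,
 * re-specializing itself, when the speculation fails.
 */
public class SpecializingAddNode {

    enum State { UNINITIALIZED, INT, GENERIC }

    private State state = State.UNINITIALIZED;

    Object execute(Object left, Object right) {
        switch (state) {
            case UNINITIALIZED:
                // First execution: specialize on the observed operand types.
                state = (left instanceof Integer && right instanceof Integer) ? State.INT : State.GENERIC;
                return execute(left, right);
            case INT:
                if (left instanceof Integer && right instanceof Integer) {
                    return (Integer) left + (Integer) right;   // fast path a partial evaluator can compile
                }
                state = State.GENERIC;                          // "deoptimize": speculation failed
                return execute(left, right);
            default:
                return genericAdd(left, right);
        }
    }

    private static Object genericAdd(Object left, Object right) {
        if (left instanceof Number && right instanceof Number) {
            return ((Number) left).doubleValue() + ((Number) right).doubleValue();
        }
        return String.valueOf(left) + right;                    // e.g., string concatenation
    }

    public static void main(String[] args) {
        SpecializingAddNode node = new SpecializingAddNode();
        System.out.println(node.execute(1, 2));       // specializes to INT, prints 3
        System.out.println(node.execute(1.5, 2));     // re-specializes to GENERIC, prints 3.5
    }
}
```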
SOAP 2017 Presentation - An Efficient Tunable Selective Points-to Analysis for Large Codebases
Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. Although object-based context sensitivity is known to improve precision of points-to analysis, scaling it for large Java codebases remains a challenge. In this work, we develop a tunable, client-independent, object-sensitive points-to analysis framework where heap cloning is applied selectively. This approach is aimed at large codebases where standard analysis is typically expensive. Our design includes a pre-analysis that determines program points that contribute to the cost of an object-sensitive points-to analysis. A subsequent analysis then determines the context depth for each allocation site. While our framework can run standalone, it is also possible to tune it – the user of the framework can use the knowledge of the codebase being analysed to influence the selection of expensive program points as well as the process to differentiate the required context-depth. Overall, the approach determines where the cloning is beneficial and where the cloning is unlikely to be beneficial. We have implemented our approach using Soufflé (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs, including OpenJDK, show that our technique is efficient and precise. For the OpenJDK, our analysis reduces runtime by 27% and memory usage by 18% for a negligible loss of precision, while for Jython from the DaCapo benchmark suite, the same analysis reduces runtime by 91% for no loss of precision.
Lenient Execution of C on a JVM -- How I Learned to Stop Worrying and Execute the Code
Most C programs do not strictly conform to the C standard, and often show undefined behavior, e.g., on signed integer overflow. When compiled by non-optimizing compilers, such programs often behave as the programmer intended. However, optimizing compilers may exploit undefined semantics for more aggressive optimizations, thus possibly breaking the code. Analysis tools can help to find and fix such issues. Alternatively, one could define a C dialect in which clear semantics are defined for frequent program patterns whose behavior would otherwise be undefined. In this paper, we present such a dialect, called Lenient C, that specifies semantics for behavior that the standard left open for interpretation. Specifying additional semantics enables programmers to safely rely on otherwise undefined patterns. Lenient C aims towards being executed on a managed runtime such as the JVM. We demonstrate how we implemented the dialect in Safe Sulong, a C interpreter with a dynamic compiler that runs on the JVM.
Towards Understandable Smart Contracts
Blockchains and smart contracts can facilitate trustworthy business processes between participants who do not trust each other. However, understanding smart contracts typically requires programming skills. We outline research towards enabling smart contracts that are understandable by humans too.
Inference of Security-Sensitive Entities in Libraries
Programming languages such as Java and C# execute code with different levels of trust in the same process, and rely on an access control model with fine-grained permissions to protect program code. Permissions are checked programmatically, and rely on programmer discipline. This can lead to subtle errors. To enable automatic security analysis of unauthorised access or information flow, it is necessary to reason about security-sensitive entities in libraries that must be protected by appropriate sanitization/declassification via permission checks. Unfortunately, security-sensitive entities are not clearly identified. In this paper, we investigate security-sensitive entities used in Java-like languages, and develop a static program analysis technique to identify them in large codebases by analysing the patterns of permission checks. Although the technique is generic, our focus is on Java, where checkPermission calls are used to guard potential security-sensitive entities. Our inference analysis uses two parameters, called proximity and coverage, to reduce false positive/negative reports. The usefulness of the analysis is illustrated by the results obtained while checking OpenJDK7-b147 for conformance to the Java Secure Coding Guidelines that relate to confidentiality and integrity requirements.
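The kind of pattern the inference targets can be illustrated with a small, hypothetical Java example: a permission check guarding access to a field marks that field as a candidate security-sensitive entity, while an unguarded accessor would be a potential finding. The class and permission names here are made up.

```java
import java.security.Permission;

/**
 * Illustrative example of the code pattern the analysis looks for: a getter
 * whose return value is guarded by a checkPermission call, marking the
 * underlying field as a security-sensitive entity.
 */
public class ProcessEnvironmentExample {

    private static final String SECRET_HOME = System.getProperty("user.home");

    /** The guarded accessor: the permission check "sanitizes" access to the entity. */
    public static String getUserHome() {
        SecurityManager sm = System.getSecurityManager();
        if (sm != null) {
            Permission p = new RuntimePermission("readUserHome"); // hypothetical permission name
            sm.checkPermission(p);
        }
        return SECRET_HOME;
    }

    /** An unguarded accessor like this one would be reported as a potential leak. */
    static String getUserHomeUnchecked() {
        return SECRET_HOME;
    }
}
```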
Transactional Lock Elision Meets Combining
Flat combining (FC) and transactional lock elision (TLE) are two techniques that facilitate efficient multi-thread access to a sequentially implemented data structure protected by a lock. FC allows threads to delegate their operations to another (combiner) thread, and benefit from executing multiple operations by that thread under the lock through combining and elimination optimizations tailored to the specific data structure. TLE employs hardware transactional memory (HTM) that allows multiple threads to apply their operations concurrently as long as they do not conflict. This paper explores how these two radically different techniques can complement one another, and introduces the HTM-assisted Combining Framework (HCF). HCF leverages HTM to allow multiple combiners to run concurrently with each other, as well as with other, non-combiner threads. This makes HCF a good fit for data structures and workloads in which some operations may conflict with each other while others may run concurrently without conflicts. HCF achieves all that with changes to the sequential code similar to those required by TLE and FC, and in particular, without requiring the programmer to reason about concurrency.
Polyglot Native: Scala, Kotlin, and Other JVM-Based Languages with Instant Startup and Low Footprint
Execution of JVM-based programs uses bytecode loading and interpretation, just-in-time compilation, and monolithic heaps. This causes JVM-based programs to start up slowly with a high memory footprint. In recent years, different projects were developed to address these issues: ahead-of-time compilation for the JVM (JEP 295) improves on JVM startup time while Scala Native and Kotlin/Native provide language-specific solutions by compiling code with LLVM and providing language-specific runtimes. We present Polyglot Native: an ahead-of-time compiler for Java bytecode combined with a low-footprint VM. With Polyglot Native, programs written in Kotlin, Scala, and other JVM-based languages have minimal startup time as they are compiled to native executables. The footprint of compiled programs is minimized by using a chunked heap and reducing necessary program metadata. In this talk, we show the architecture of Polyglot Native and compare it to existing projects. Then, we live-demo a project that compiles code from Kotlin, Scala, Java, and C into a single binary executable. Finally, we discuss intricacies of interoperability between Polyglot Native and C.
Evaluating Quality of Security Testing of the JDK
The document outlines the main challenges in evaluating test suites that check for security properties. Specifically, it considers testing the security properties of the JDK.
Dynamic Adaptation of User Migration Policies in Distributed Virtual Environments
A distributed virtual environment (DVE) consists of multiple network nodes (servers), each of which can host many users that consume CPU resources on that node and communicate with users on other nodes. Users can be dynamically migrated between the nodes, and the ultimate goal for the migration policy is to minimize the average system response time perceived by the users. In order to achieve this, the user migration policy should minimize network communication while balancing the load among the nodes so that CPU resources of the individual nodes are not overloaded. This paper considers a multi-player online game as an example of a DVE and presents an adaptive distributed user migration policy, which uses Reinforcement Learning to tune itself so as to minimize the average system response time perceived by the users. Performance of the self-tuning policy was compared on a simulator with the standard benchmark non-adaptive migration policy and with the optimal static user allocation policy in a variety of scenarios, and the self-tuning policy was shown to greatly outperform both benchmark policies, with performance difference increasing as the network became more overloaded.
Truffle: your favorite language on JVM
Graal/Truffle is a project that aims to build a multi-language, multi-tenant, multi-threaded, multi-node, multi-tooling and multi-system environment on top of the JVM. Imagine that in order to develop a (dynamic) language implementation all you need is to write its interpreter in Java, and immediately you get amazing peak performance, a choice of several carefully tuned garbage collectors, tooling support, high-speed interoperability with other languages and more. In this talk we'll take a look at how Truffle and Graal can achieve this and demonstrate the results on Ruby, JavaScript and R. Particular attention will be given to FastR, the Truffle-based R language implementation, its performance compared to GNU R, and its support for Java interoperability including graphics.
UMASS Data Science Talks
I'll be giving two talks at the UMASS data science event. The first talk is on our multilingual word embedding work. The second talk is on our constrained-inference approach for sequence-to-sequence neural networks. Relevant IP is covered in two patents and both pieces of work have previously been approved for publication (patent ref numbers and archivist ids provided below).
A General Model for Placement of Workloads on Multicore NUMA Systems
The problem of mapping threads, or virtual cores, to physical cores on multicore systems has been studied for over a decade. Despite this effort, there is still no method that will help us decide in real time and for arbitrary workloads the relative impact of different mappings on performance. Prior work has made large strides in this field, but these solutions addressed a limited set of concerns (e.g., only shared caches and memory controllers, or only asymmetric interconnects), assuming hardware with specific properties and leaving us unable to generalize the model to other systems. Our contribution is an abstract machine model that enables us to automatically build a performance prediction model for any machine with a hierarchy of shared resources. In the process of developing the methodology for building predictive models we discovered pitfalls of using hardware performance counters, a de facto technique embraced by the community for decades. Our new methodology does not rely on hardware counters at the expense of trying a handful of additional workload mappings (out of many possible) at runtime. Using this methodology data center operators can decide on the smallest number of NUMA (CPU+memory) nodes to use for the target application or service (which we assume to be encapsulated into a virtual container so as to match the reality of the modern cloud systems such as AWS), while still meeting performance goals. More broadly, the methodology empowers them to efficiently “pack” virtual containers on the physical hardware in a data center.
Pandia: comprehensive contention-sensitive thread placement.
Pandia is a system for modelling the performance of in-memory parallel workloads. It generates a description of a workload from a series of profiling runs, and combines this with a description of the machine's hardware to model the workload's performance over different thread counts and different placements of those threads.
The approach is “comprehensive” in that it accounts for contention at multiple resources such as processor functional units and memory channels. The points of contention for a workload can shift between resources as the degree of parallelism and thread placement changes. Pandia accounts for these changes and provides a close correspondence between predicted performance and actual performance. Testing a set of 22 benchmarks on 2-socket Intel machines fitted with chips ranging from Sandy Bridge to Haswell, we see median differences of 1.05% to 0% between the fastest predicted placement and the fastest measured placement, and median errors of 8% to 4% across all placements.
Pandia can be used to optimize the performance of a given workload, for instance identifying whether or not multiple processor sockets should be used, and whether or not the workload benefits from using multiple threads per core. In addition, Pandia can be used to identify opportunities for reducing resource consumption where additional resources are not matched by additional performance, for instance limiting a workload to a small number of cores when its scaling is poor.
Better Splittable Pseudorandom Number Generators (and Almost As Fast)
We have tested and analyzed the SplitMix pseudorandom number generator algorithm presented by Steele, Lea, and Flood, and have discovered two additional classes of gamma values that produce weak pseudorandom sequences. In this paper we present a modification to the SplitMix algorithm that avoids all three classes of problematic gamma values, and also a completely new algorithm for splittable pseudorandom number generators, which we call TwinLinear. Like SplitMix, TwinLinear provides both a generate operation that returns one (64-bit) pseudorandom value and a split operation that produces a new generator instance that with very high probability behaves as if statistically independent of all other instances. Also like SplitMix, TwinLinear requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with SIMD instruction sets because it has no branches or loops. The TwinLinear algorithm is the result of a systematic exploration of a substantial space of nonlinear mixing functions that combine the output of two independent generators of (perhaps not very strong) pseudorandom number sequences. We discuss this design space and our strategy for exploring it. We used the PractRand test suite (which has provision for failing fast) to filter out poor candidates, then used TestU01 BigCrush to verify the quality of candidates that withstood PractRand. We present results of analysis and extensive testing on TwinLinear (using both TestU01 and PractRand). Single instances of TwinLinear have no known weaknesses, and TwinLinear is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. It is slightly more costly than SplitMix (10 or 11 64-bit arithmetic operations per 64 bits generated, rather than 9) but has a shorter critical path (5 or 6 operations rather than 8). We believe that TwinLinear is suitable for the same sorts of applications as SplitMix, that is, "everyday" scientific and machine-learning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
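For readers unfamiliar with the generate/split structure the abstract refers to, here is a simplified SplitMix-style generator in Java. It is not the TwinLinear algorithm, and the real SplitMix derives and screens child gammas more carefully than this sketch does.

```java
/**
 * A SplitMix-style splittable generator, shown only to illustrate the
 * generate/split structure; gamma handling is simplified.
 */
public final class SplitMixSketch {

    private long seed;          // per-instance state
    private final long gamma;   // odd additive constant; weak gammas are what the paper addresses

    private SplitMixSketch(long seed, long gamma) {
        this.seed = seed;
        this.gamma = gamma | 1L; // gamma must be odd
    }

    public static SplitMixSketch of(long seed) {
        return new SplitMixSketch(seed, 0x9E3779B97F4A7C15L); // golden-ratio-derived gamma
    }

    /** 64-bit finalizer used by the SplitMix construction. */
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** generate: advance the state and return one pseudorandom 64-bit value. */
    public long nextLong() {
        seed += gamma;
        return mix64(seed);
    }

    /** split: derive a new generator that, with very high probability, behaves independently. */
    public SplitMixSketch split() {
        return new SplitMixSketch(nextLong(), nextLong());
    }

    public static void main(String[] args) {
        SplitMixSketch root = SplitMixSketch.of(42);
        SplitMixSketch child = root.split();   // usable by another thread without synchronization
        System.out.println(root.nextLong() + " " + child.nextLong());
    }
}
```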
Polyglot programming in the cloud
Graal polyglot vision presentation
LabelBank: Revisiting Global Perspectives for Semantic Segmentation
Semantic segmentation requires a detailed labeling of image pixels by object category. Information derived from local image patches is necessary to describe the detailed shape of individual objects. However, this information is ambiguous and can result in noisy labels. Global inference of image content can instead capture the general semantic concepts present. We advocate that holistic inference of image concepts provides valuable information for detailed pixel labeling. We propose a generic framework to leverage holistic information in the form of a LabelBank for pixel-level segmentation. We show the ability of our framework to improve semantic segmentation performance in a variety of settings. We learn models for extracting a holistic LabelBank from visual cues, attributes, and/or textual descriptions. We demonstrate improvements in semantic segmentation accuracy on standard datasets across a range of state-of-the-art segmentation architectures and holistic inference approaches.
A Many-core Architecture for In-Memory Data Processing
We live in an information age, with data and analytics guiding a large portion of our daily decisions. Data is being generated at a tremendous pace from connected cars, connected homes and connected workplaces, and extracting useful knowledge from this data is quickly becoming an impractical task. Single-threaded performance has become saturated in the last decade, and there is a growing need for custom solutions to keep pace with these workloads in a scalable and efficient manner. A large portion of the power consumed by analytics workloads goes into bringing data to the processing cores, and we aim to optimize that. We present the Database Processing Unit, or DPU, a shared-memory many-core that is specifically designed for in-memory analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and preprocessing operations. The DPU also provides acceleration for core-to-core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine or ATE. Comparison of a fabricated DPU chip with a variety of state-of-the-art x86 applications shows a performance/Watt advantage of 3x to 16x.
PGX.UI: Visual Construction and Exploration of Large Property Graphs
Transforming existing data into graph formats and visualizing large graphs in a comprehensible way are two key areas of interest of information visualization. Addressing these issues requires new visualization approaches for large graphs that support users with graph construction and exploration. In addition, graph visualization is becoming more important for existing graph processing systems, which are often based on the property graph model. Therefore, this paper presents concepts for visually constructing property graphs from data sources and a summary visualization for large property graphs. Furthermore, we introduce the concept of a graph construction time line that keeps track of changes and provides branching and merging, in a version-control-like fashion. Finally, we present a tool that visually guides users through the graph construction and exploration process.
Building Reusable, Low-Overhead Tooling Support into a High-Performance Polyglot VM
Software development tools that interact with running programs, for instance debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in thinking obsoletes that presumption and enables the delivery of effective tools as a forethought, no longer an afterthought.
A Two-List Framework for Accurate Detection of Frequent Items in Large Data Sets
The problem of detecting the most frequent items in large data sets and providing accurate frequency estimates for those items is becoming more and more important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. Two algorithms implementing this framework are presented: FSSAL and FSSA. An adaptive version of these algorithms is presented, which adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. Analytical comparison with the FSS algorithm showed that FSS2L algorithms have smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. A theoretical analysis of space and time complexity for all considered algorithms was performed. Finally, we showed that FSS2L algorithms can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.
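As background for readers unfamiliar with this family of algorithms, the sketch below shows plain Space-Saving, the scheme that FSS and the two-list framework above build on; it is not the FSS2L implementation from the paper.

```java
import java.util.*;

/**
 * A plain Space-Saving counter for frequent-items detection, included only
 * as background; the two-list algorithms in the paper refine this scheme
 * with a filtering stage.
 */
public class SpaceSaving<K> {

    private final int capacity;
    private final Map<K, Long> counts = new HashMap<>();

    public SpaceSaving(int capacity) {
        this.capacity = capacity;
    }

    /** Process one item of the stream. */
    public void offer(K key) {
        Long c = counts.get(key);
        if (c != null) {
            counts.put(key, c + 1);
        } else if (counts.size() < capacity) {
            counts.put(key, 1L);
        } else {
            // Evict the currently smallest counter and let the new key inherit
            // its count + 1, which bounds the overestimation error.
            K minKey = Collections.min(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
            long minCount = counts.remove(minKey);
            counts.put(key, minCount + 1);
        }
    }

    /** Estimated frequencies of the currently tracked (candidate heavy-hitter) keys. */
    public Map<K, Long> estimates() {
        return Collections.unmodifiableMap(counts);
    }

    public static void main(String[] args) {
        SpaceSaving<String> ss = new SpaceSaving<>(3);
        for (String s : "a a b a c d a b".split(" ")) ss.offer(s);
        System.out.println(ss.estimates());
    }
}
```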
Self-managed collections: Off-heap memory management for scalable query-dominated collections
Explosive growth in DRAM capacities and the emergence of language-integrated query enable a new class of managed applications that perform complex query processing on huge volumes of data stored as collections of objects in the memory space of the application. While more flexible in terms of schema design and application development, this approach typically experiences sub-par query execution performance when compared to specialized systems like DBMS. To address this issue, we propose self-managed collections, which utilize off-heap memory management and dynamic query compilation to improve the performance of querying managed data through language-integrated query. We evaluate self-managed collections using both microbenchmarks and enumeration-heavy queries from the TPC-H business intelligence benchmark. Our results show that self-managed collections outperform ordinary managed collections in both query processing and memory management by up to an order of magnitude and even outperform an optimized in-memory columnar database system for the vast majority of queries.
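The off-heap half of the idea can be sketched in a few lines of Java: records are laid out manually in a direct buffer rather than as individual objects, and queries scan that memory. The dynamic query compilation described in the abstract is omitted, and the record layout below is invented for illustration.

```java
import java.nio.ByteBuffer;

/**
 * Minimal sketch of the off-heap idea behind self-managed collections:
 * fixed-size records live in a direct (off-heap) buffer instead of as
 * individual Java objects, and queries scan the raw memory.
 */
public class OffHeapOrders {

    private static final int RECORD_BYTES = Integer.BYTES + Double.BYTES; // orderId, price
    private final ByteBuffer store;
    private int count;

    public OffHeapOrders(int capacity) {
        store = ByteBuffer.allocateDirect(capacity * RECORD_BYTES); // outside the GC-managed heap
    }

    public void add(int orderId, double price) {
        int offset = count * RECORD_BYTES;
        store.putInt(offset, orderId);
        store.putDouble(offset + Integer.BYTES, price);
        count++;
    }

    /** A simple query in the style of language-integrated query: count orders above a threshold. */
    public long countAbove(double threshold) {
        long matches = 0;
        for (int i = 0; i < count; i++) {
            double price = store.getDouble(i * RECORD_BYTES + Integer.BYTES);
            if (price > threshold) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        OffHeapOrders orders = new OffHeapOrders(1_000);
        orders.add(1, 19.99);
        orders.add(2, 250.00);
        orders.add(3, 75.50);
        System.out.println(orders.countAbove(50.0)); // 2
    }
}
```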
Language-Independent Information Flow Tracking Engine for Program Comprehension Tools
Program comprehension tools are often developed for a specific programming language. Developing such a tool from scratch requires significant effort. In this paper, we report on our experience developing a language-independent framework that enables the creation of program comprehension tools, specifically tools gathering insight from deep dynamic analysis, with little effort. Our framework is language-independent because it is built on top of Truffle, an open-source platform developed in Oracle Labs for implementing dynamic languages in the form of AST interpreters. Our framework supports the creation of a diverse variety of program comprehension techniques, such as query, program slicing, and back-in-time debugging, because it is centered around a powerful information-flow tracking engine. Tools developed with our framework get access to the information flow through a program execution. While it is possible to develop similarly powerful tools without our framework, for example by tracking information flow through bytecode instrumentation, our approach leads to information that is closer to source code constructs, and thus more comprehensible to the user. To demonstrate the effectiveness of our framework, we applied it to two Truffle-based languages, namely Simple Language and TruffleRuby, and we distill our experience into guidelines for developers of other Truffle-based languages who want to develop program comprehension tools for their language.
An Efficient Tunable Selective Points-to Analysis for Large Codebases
Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. Although object-based context sensitivity is known to improve precision of points-to analysis, scaling it for large Java codebases remains a challenge. In this work, we develop a tunable, client-independent, object-sensitive points-to analysis framework where heap cloning is applied selectively. This approach is aimed at large codebases where standard analysis is typically expensive. Our design includes a pre-analysis that determines program points that contribute to the cost of an object-sensitive points-to analysis. A subsequent analysis then determines the context depth for each allocation site. While our framework can run standalone, it is also possible to tune it – the user of the framework can use the knowledge of the codebase being analysed to influence the selection of expensive program points as well as the process to differentiate the required context-depth. Overall, the approach determines where the cloning is beneficial and where the cloning is unlikely to be beneficial. We have implemented our approach using Soufflé (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs, including OpenJDK, show that our technique is efficient and precise. For the OpenJDK, our analysis reduces runtime by 27% and memory usage by 18% for a negligible loss of precision, while for Jython from the DaCapo benchmark suite, the same analysis reduces runtime by 91% for no loss of precision.
Composing Durable Data Structures
This paper presents techniques for composing persistent data structure operations on machines with nonvolatile byte addressable memory. The techniques are applicable to a wide class of nonblocking algorithms.
Composing Durable Data Structures (poster)
Prior solutions for crash consistency on NVM have focused on two major areas. Data structures designed for NVM ensure that their metadata and contents are consistent after a crash, and operations become persistent in a well regulated manner (e.g. they meet the correctness condition durable linearizability [1]). In contrast, transactional systems guarantee that all changes from failure atomic sections of code (e.g. transactions) are entirely visible or entirely dropped after a crash. This work investigates composing operations on durably linearizable data structures into larger failure atomic sections (e.g. transactions). This goal can be seen as an extension of transactional boosting, a technique used in traditional (transient) transactional memory.
Dynamic Compilation and Run-Time Optimization
Lecture slides about "Dynamic Compilation and Run-Time Optimization", held at the University of Augsburg. Contents: technology pioneered in the Self VM (Inline Caching, Deoptimization, ...); Truffle and Graal tutorial.
HeadacheCoach: Towards Headache Prevention by Sensing and Making Sense of Personal Lifestyle Data
Estimates are that almost half of the world’s population has an active primary headache disorder, i.e., one with no underlying illness as its cause. These can start manifesting in early adulthood and can last for the rest of the sufferer’s life. Most specialists concur that sudden changes in daily lifestyle, such as sleep rhythm, nutrition behavior or stress experience, can be valid triggers for headache sufferers. Health care professionals recommend keeping a diary to self-monitor personal headache triggers in order to learn to avoid headache attacks. However, making sense of this data is difficult. Although existing smartphone approaches in the literature have evaluated behavior change support systems for headaches, they have failed to provide appropriate feedback on the collected daily data to show what causes or prevents an individual’s headache attacks. In this paper, we present HeadacheCoach, a smartphone app that tracks headache-triggering lifestyle data and headache attacks on a daily basis, and propose a mixed-method approach to examine which feedback method(s) can best drive the behavior change needed to prevent future headache attacks.
FAD.js: Fast JSON Data Access Using JIT-based Speculative Optimizations
JSON is one of the most popular data encoding formats, with wide adoption in Databases and BigData frameworks, and native support in popular programming languages such as JavaScript/Node.js, Python, and R. Nevertheless, JSON data manipulation can easily become a performance bottleneck in modern language runtimes due to parsing and object materialization overheads. In this paper, we introduce Fad.js, a runtime system for fast manipulation of JSON objects in data-intensive applications. Fad.js is based on speculative just-in-time compilation and on direct access to raw data. Experiments show that applications using Fad.js can achieve speedups up to 2.7x for encoding and 9.9x for decoding JSON data when compared to state-of-the-art JSON manipulation libraries.
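As a rough illustration of what direct access to raw data buys (this is not how Fad.js itself is implemented), the following Java snippet pulls a single field out of raw JSON text without materializing the whole object.

```java
/**
 * Illustrative only: extract one string field directly from raw JSON text,
 * skipping full parsing and object materialization. Works only for flat
 * objects without escaped quotes.
 */
public class RawJsonFieldAccess {

    static String getStringField(String json, String field) {
        String needle = "\"" + field + "\"";
        int i = json.indexOf(needle);
        if (i < 0) return null;
        int colon = json.indexOf(':', i + needle.length());
        int start = json.indexOf('"', colon + 1);
        int end = json.indexOf('"', start + 1);
        return (start < 0 || end < 0) ? null : json.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String record = "{\"id\":42,\"name\":\"ada\",\"city\":\"Zurich\"}";
        // Only the bytes of the requested field are ever turned into a Java object.
        System.out.println(getStringField(record, "city")); // Zurich
    }
}
```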
Machine Learning for Finding Bugs: An Initial Report
Static program analysis is a technique to analyse code without executing it, and can be used to find bugs in source code. Many open source and commercial tools have been developed in this space over the past 20 years. Scalability and precision are of importance for the deployment of static code analysis tools - numerous false positives and slow runtime both make a tool hard to use in development, where integration into a nightly build is the standard goal. This requires one to identify a suitable abstraction for the static analysis, which is typically a manual process and can be expensive. In this paper we report our findings on using machine learning techniques to detect defects in C programs. We use three off-the-shelf machine learning techniques and use a large corpus of programs available for use in both the training and evaluation of the results. We compare the results produced by the machine learning technique against the Parfait static program analysis tool used internally at Oracle by thousands of developers. While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques we used are not suitable replacements for static program analysis tools due to the low precision of the results. This could be due to a variety of reasons, including not using domain knowledge such as the semantics of the programming language, and a lack of suitable data used in the training process.
Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions
We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that for double-precision data, this technique makes an entire LDA machine-learning application about 25% faster than doing a straightforward matrix transposition after using coalesced accesses.
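The scalar baseline that the butterfly-patterned table accelerates looks roughly like the following Java sketch: build partial sums of the relative probabilities once, then draw by binary search. The GPU-specific table layout from the paper is not shown, and the weights used here are arbitrary.

```java
import java.util.SplittableRandom;

/**
 * Reference (scalar) version of the sampling step: a table of partial sums
 * of unnormalized weights, and a draw by binary search over that table.
 */
public class PartialSumSampling {

    /** Inclusive prefix sums of the (unnormalized) weights. */
    static double[] partialSums(double[] weights) {
        double[] sums = new double[weights.length];
        double running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            sums[i] = running;
        }
        return sums;
    }

    /** Draw one index: binary search for the first partial sum exceeding u * total. */
    static int draw(double[] sums, SplittableRandom rng) {
        double target = rng.nextDouble() * sums[sums.length - 1];
        int lo = 0, hi = sums.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sums[mid] <= target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] weights = {0.1, 2.0, 0.5, 1.4};          // e.g., unnormalized topic probabilities
        double[] sums = partialSums(weights);
        SplittableRandom rng = new SplittableRandom(7);
        int[] histogram = new int[weights.length];
        for (int i = 0; i < 100_000; i++) histogram[draw(sums, rng)]++;
        System.out.println(java.util.Arrays.toString(histogram)); // roughly proportional to weights
    }
}
```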
Persistent Memory Transactions
This paper presents a comprehensive analysis of implementation choices for transaction runtime systems optimized for persistent memory. Our work focuses on performance implications of persist barriers, primitives required to persist writes on persistent memory. In the process we introduce a new taxonomy of persistence domains (the portion of the memory hierarchy where data is effectively persistent) that has a significant impact on persist barrier latencies. We present algorithms for undo logging, redo logging, and copy-on-write based transactions, as well as a memory allocator, all optimized to reduce persist barriers per transaction. Our microbenchmarking does a comprehensive sweep of read-write mix ratios in transactions, showing performance trends of the transaction runtimes under different assumptions about persist barrier latencies. No single runtime dominates the rest across the board. However, we pinpoint approximate read-write ratio ranges where specific runtimes outperform the rest. Our analysis highlights the significant influence of multiple factors on performance – transaction-runtime-specific bookkeeping overheads, persist barrier latencies, and cache locality. We find similar performance trade-offs on three “real world” workloads – a key-value store of our own, SQLite, and memcached.
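A schematic undo-log transaction, with the persist barriers reduced to placeholder calls, shows where the barriers that such an analysis counts would sit; this is an illustration, not the runtime evaluated in the paper.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Schematic undo-logging transaction over a plain long[] "heap". In a real
 * persistent-memory runtime the persistBarrier() calls would flush cache
 * lines and fence; here they are placeholders.
 */
public class UndoLogTransaction {

    private final long[] heap;
    private final Deque<long[]> undoLog = new ArrayDeque<>(); // entries: {address, oldValue}

    public UndoLogTransaction(long[] heap) {
        this.heap = heap;
    }

    public void write(int address, long value) {
        undoLog.push(new long[]{address, heap[address]}); // log the old value first
        persistBarrier();                                  // log entry must be durable before the store
        heap[address] = value;
    }

    public void commit() {
        persistBarrier();   // make all in-place updates durable
        undoLog.clear();    // then discard (truncate) the log
        persistBarrier();
    }

    public void abort() {
        while (!undoLog.isEmpty()) {
            long[] entry = undoLog.pop();
            heap[(int) entry[0]] = entry[1];               // roll back in reverse order
        }
        persistBarrier();
    }

    private static void persistBarrier() {
        // Placeholder for a real persist barrier (cache-line write-back + fence).
    }

    public static void main(String[] args) {
        long[] heap = new long[4];
        UndoLogTransaction tx = new UndoLogTransaction(heap);
        tx.write(0, 42);
        tx.write(1, 7);
        tx.abort();                                        // heap is rolled back to all zeros
        System.out.println(heap[0] + " " + heap[1]);       // 0 0
    }
}
```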
Increasing the Robustness of C Libraries and Applications through Run-time Introspection
In C, low-level errors such as buffer overflow and use-after-free are a major problem since they cause security vulnerabilities and hard-to-find bugs. Libraries cannot apply defensive programming techniques since objects (e.g., arrays or structs) lack run-time information such as bounds, lifetime, and types. To address this issue, we devised introspection functions that empower C programmers to access run-time information about objects and variadic function arguments. Using these functions, we implemented a more robust, source-compatible version of the C standard library that validates parameters to its functions. The library functions react to otherwise undefined behavior; for example, when detecting an invalid argument, its functions return a special value (such as -1 or NULL) and set the errno, or attempt to still compute a meaningful result. We demonstrate by examples that using introspection in the implementation of the C standard library and other libraries prevents common low-level errors, while also complementing existing approaches.
SLIDES: It's Time for a New Old Language
Slides for an invited keynote talk at PPoPP
It's Time for a New Old Language
The most popular programming language in computer science has no compiler or interpreter. Its definition is not written down in any one place. It has changed a lot over the decades, and those changes have introduced ambiguities and inconsistencies. Today, dozens of variations are in use, and its complexity has reached the point where it needs to be re-explained, at least in part, every time it is used. Much effort has been spent in hand-translating between this language and other languages that do have compilers. The language is quite amenable to parallel computation, but this fact has gone unexploited. In this talk we will summarize the history of the language, highlight the variations and some of the problems that have arisen, and propose specific solutions. We suggest that it is high time that this language be given a complete formal specification, and that compilers, IDEs, and proof-checkers be created to support it, so that all the best tools and techniques of our trade may be applied to it also.
Dynamic Symbolic Execution for Polymorphism
Symbolic execution is an important program analysis technique that provides auxiliary execution semantics to execute programs with symbolic rather than concrete values. There has been much recent interest in symbolic execution for automatic test case generation and security vulnerability detection, resulting in various tools being deployed in academia and industry. Nevertheless, (subtype or dynamic) polymorphism has been neglected in symbolic execution of object-oriented programs: existing symbolic execution techniques can explore different targets of conditional branches but not different targets of method invocations. We address the problem of how this polymorphism can be expressed in a symbolic execution framework. We propose the notion of symbolic types, which make object types symbolic. With symbolic types, various targets of a method invocation can be explored systematically by mutating the type of the receiver object of the method during automatic test case generation. To the best of our knowledge, this is the first attempt to address polymorphism in symbolic execution. Mutation of method invocation targets is critical for effectively testing object-oriented programs, especially libraries. Our experimental results show that symbolic types are significantly more effective than existing symbolic execution techniques in achieving test coverage and finding bugs and security vulnerabilities in OpenJDK.
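One way to picture the effect of symbolic types (this is a hand-written illustration, not the symbolic execution engine itself) is that the receiver of a polymorphic call is systematically re-typed over the known concrete subtypes, so that each call target is exercised. The Shape, Circle, and Square types below are made up.

```java
import java.util.List;
import java.util.function.Supplier;

/**
 * Concretized illustration of the symbolic-types idea: to cover polymorphic
 * call targets, the receiver's type is "mutated" over the known concrete
 * subtypes and the code under test is exercised once per target.
 */
public class SymbolicTypeEnumeration {

    interface Shape { double area(); }
    static final class Circle implements Shape { public double area() { return Math.PI; } }
    static final class Square implements Shape { public double area() { return 1.0; } }

    /** Code under test: the call shape.area() has multiple possible targets. */
    static String classify(Shape shape) {
        return shape.area() > 2.0 ? "large" : "small";
    }

    public static void main(String[] args) {
        // A symbolic executor would derive this set from the class hierarchy;
        // here it is written out by hand.
        List<Supplier<Shape>> receiverTypes = List.of(Circle::new, Square::new);
        for (Supplier<Shape> make : receiverTypes) {
            Shape receiver = make.get();
            System.out.println(receiver.getClass().getSimpleName() + " -> " + classify(receiver));
        }
    }
}
```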
What makes TruffleRuby run Optcarrot 9 times faster than MRI?
TruffleRuby runs Optcarrot 9 times faster than MRI 2. TruffleRuby is a new optimizing implementation of Ruby. Optcarrot is a NES emulator. MRI 3 aims to run Optcarrot 3 times faster than MRI 2. We will explore the techniques which allow TruffleRuby to achieve high performance in Optcarrot. We'll discuss splitting, inlining, array strategies, Proc elimination, etc.
Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions
Slides for a talk to be given at ACM PPoPP on February 8, 2017. This 25-minute talk builds on the paper as accepted by PPoPP (Archivist 2016-057) and a previous version of the slides presented at NVIDIA GTC 2016 (Archivist 2016-0055). *** We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1024 topics about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double-precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.
Business Process Optimization via Reinforcement Learning
This presentation describes the theory of reinforcement learning and our first results on applying it to simulated business processes. Hopefully, it will impress some clients to give us data from some real business process, for which we will then be able to learn an improved action policy. We would like to present these slides on Feb 2 at the http://www.biwasummit.org/
Towards Scalable Provenance Generation From Points-To Information: An Initial Experiment
Points-to analysis is often used to identify potential defects in code. The usual points-to analysis does not store the justification for the presence of a specific value in the points-to relation. But for points-to analysis to meet the needs of the programmer, the analysis needs to provide the justification for its results. Programmers will use such justification to identify the cause of a defect in the code. In this paper we describe an approach to generate provenance information in the context of points-to analysis. Our solution is to define an abstract notion of data-flow traces that is computed as a post-analysis using points-to information that has already been computed. We implemented our approach in conjunction with the DOOP framework that computes points-to information. We use four benchmarks derived from two versions of the JDK, and use two realistic clients to demonstrate the effectiveness of our solution. For instance, we show that the overhead to compute these data-flow traces is only 25% when compared to the time to compute the original points-to analysis. We also discuss some of the limitations of our approach, especially in generating precise traces.
Machine Learning For Finding Bugs: An Initial Report
Static program analysis is a technique to analyse code without executing it, and can be used to find bugs in source code. Many open source and commercial tools have been developed in this space over the past 20 years. Scalability and precision are of importance for the deployment of static code analysis tools - numerous false positives and slow runtime both make a tool hard to use in development, where integration into a nightly build is the standard goal. This requires one to identify a suitable abstraction for the static analysis, which is typically a manual process and can be expensive. In this paper we report our findings on using machine learning techniques to detect defects in C programs. We use three off-the-shelf machine learning techniques and use a large corpus of programs available for use in both the training and evaluation of the results. We compare the results produced by the machine learning technique against the Parfait static program analysis tool used internally at Oracle by thousands of developers. While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques we used are not suitable replacements for static program analysis tools due to the low precision of the results. This could be due to a variety of reasons, including not using domain knowledge such as the semantics of the programming language, and a lack of suitable data used in the training process.
Fast, Flexible, Polyglot Instrumentation Support for Debuggers and other Tools
Software development tools that interact with running programs, for instance debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in thinking obsoletes that presumption and enables the delivery of effective tools as a forethought, no longer an afterthought. We have extended an open-source multi-language platform with a language-agnostic Instrumentation Framework, including (1) low-level, extremely low-overhead execution event interposition, built directly into the high-performance runtime; (2) shared language-agnostic instrumentation services, requiring minimal per-language specialization; and (3) versatile APIs for constructing many kinds of client tools without modifying the VM. A new design uses this framework to implement debugging services for arbitrary languages (possibly in combination) with little effort from the language implementor. We show that, when optimized, the service has no measurable overhead and generalizes to other kinds of tools. It is now possible for a client in a production environment, with thread safety, to dynamically insert into an executing program an instrumentation probe that incurs near-zero performance cost until actually used to access (or modify) execution state. Other applications include tracing and stepping required by some languages, as well as platform requirements such as the need to timebox script executions. Finally, opening public API access to runtime state encourages advanced tool development and experimentation with much reduced effort.
Secure Information Flow by Access Control: A Security Type System of Dual-Access Labels
Programming languages such as Java and C# execute code with different levels of trust in the same process, and rely on a fine-grained access control model for users to manage the security requirements of program code from different sources. While such a security model is simple enough to be used in practice to protect systems from many hostile programs downloaded over a network, it does not guard against information-based attacks, such as confidentiality and integrity violations. We introduce a novel security model, called Dual-Access Label (DAL), to capture information-based security requirements of programs written in these languages. DAL labels extend the access control model by specifying both the accessibility and capability of program code, and use them to constrain information flows between code from different sources. Accessibility specifies the privileges necessary to access the code while capability indicates the privileges held by the code. DAL's security policy places a two-way obligation on both ends of information flow so that they must have sufficient capability to meet the accessibility of each other. Unlike traditional lattice-based security models, our security model offers more flexible information flow relations induced by the security policy, which does not have to be transitive. It provides both confidentiality and integrity guarantees while allowing cyclic information flows among code with different security labels, as desired in many applications. We present a generic security type system to enforce possibly intransitive information flow policies, including DAL, statically at compile time. Such a security type system provides a new notion of intransitive noninterference that generalizes the standard notion of transitive noninterference in lattice-based security models.
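Under the simplifying assumption that accessibility and capability are plain permission sets and "sufficient capability" means set inclusion, the two-way DAL check can be pictured as follows; the paper's actual formulation is more general, and the privilege names here are invented.

```java
import java.util.Set;

/**
 * Toy encoding of the Dual-Access Label check: each label carries an
 * accessibility set (privileges required to access the code) and a
 * capability set (privileges the code holds).
 */
public class DualAccessLabel {

    final Set<String> accessibility; // privileges required to access this code
    final Set<String> capability;    // privileges this code holds

    DualAccessLabel(Set<String> accessibility, Set<String> capability) {
        this.accessibility = accessibility;
        this.capability = capability;
    }

    /** Two-way obligation: each end must hold enough capability to meet the other's accessibility. */
    static boolean flowAllowed(DualAccessLabel from, DualAccessLabel to) {
        return to.capability.containsAll(from.accessibility)
            && from.capability.containsAll(to.accessibility);
    }

    public static void main(String[] args) {
        DualAccessLabel client  = new DualAccessLabel(Set.of("net"), Set.of("net"));
        DualAccessLabel service = new DualAccessLabel(Set.of("net"), Set.of("net", "db"));
        DualAccessLabel dbLayer = new DualAccessLabel(Set.of("db"),  Set.of("db", "net"));
        System.out.println(flowAllowed(client, service)); // true: both obligations hold
        System.out.println(flowAllowed(dbLayer, client)); // false: client lacks the "db" privilege
    }
}
```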
Machine Learning For Finding Bugs in Source Code: An Initial Report
Static program analysis is a technique to analyse code without executing it, and can be used to find bugs in source code. Many open source and commercial tools have been developed in this space over the past 20 years. Of importance for the deployment of static code analysis tools is the precision of the technique and its scalability – numerous false positives and slow runtime both make a tool hard to use in development, where integration into a nightly build is the standard goal. In this paper we report our findings on using machine learning techniques to detect defects in C programs. We use three off-the-shelf machine learning techniques and use a large corpus of programs available for use in both the training and evaluation of the results. We compare the results produced by the machine learning technique against the Parfait static program analysis tool used internally at Oracle by thousands of developers. While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques we used are not suitable replacements for static program analysis tools due to the low precision of the results. This could be due to a variety of reasons, including not using domain knowledge and a lack of suitable data used in the training process.
Distributed Join Algorithms on Thousands of Cores
Traditional database operators such as joins are relevant not only in the context of database engines but also as a first step in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that can scale with the available hardware resources. In this paper, we explore the implementation of distributed join algorithms in systems with several thousand cores connected by a low-latency network as used in high performance computing systems or data centers. We compare radix hash join to sort-merge join algorithms and discuss their implementation at this scale. In the paper, we explain how to use MPI to implement joins, show the impact and advantages of RDMA, discuss the importance of network scheduling, and study the relative performance of sorting vs. hashing. The experimental results show that the algorithms we present scale well with the number of cores, reaching a throughput of 48.7 billion input tuples per second on 4096 cores. Furthermore, we identify opportunities for improvements, opening up important directions for future research.
Biscotti and Cannoli: An Initial Exploration into Machine Learning for the Purposes of Finding Bugs in Source Code
Initial exploration experiments and preliminary results from the collaboration with Queensland University of Technology.
Improving the Scalability of Automatic Linearizability Checking in SPIN
Concurrency in data structures is crucial to the performance of multithreaded programs in shared-memory multiprocessor environments. However, greater concurrency also increases the difficulty of verifying correctness of the data structure. Model checking has been used for verifying that concurrent data structures satisfy the correctness condition ‘linearizability’. In particular, ‘automatic’ tools achieve verification without requiring user-specified linearization points. This has several advantages, but is generally not scalable. We examine the automatic checking used by Vechev et al. in [VYY09] to understand the scalability issues of automatic checking in SPIN. We then describe a new, more scalable automatic technique based on these insights, and present the results of a proof-of-concept implementation.
Just-In-Time GPU Compilation of Interpreted Programs with Profile-Driven Specialization
Computer systems are increasingly featuring powerful parallel devices with the advent of manycore CPUs, GPUs and FPGAs. This offers the opportunity to solve large computationally-intensive problems at a fraction of the time of traditional CPUs. However, exploiting this heterogeneous hardware requires the use of low-level programming languages such as OpenCL, which is incredibly challenging, even for advanced programmers. On the application side, interpreted dynamic languages are increasingly becoming popular in many emerging domains for their simplicity, expressiveness and flexibility. However, this creates a wide gap between the nice high-level abstractions offered to non-expert programmers and the low-level hardware-specific interface. Currently, programmers have to rely on specialized high performance libraries or are forced to write parts of their application in a low-level language like OpenCL. Ideally, programmers should be able to exploit heterogeneous hardware directly from their interpreted dynamic languages. In this paper, we present a technique to transparently and automatically offload computations from interpreted dynamic languages to heterogeneous devices. Using just-in-time compilation, we automatically generate OpenCL code at runtime which is specialized to the actual observed data types using profiling information. We demonstrate our technique using R, which is a popular interpreted dynamic language predominantly used in big data analytics. Our experimental results show execution on a GPU yields speedups of over 150x when compared to the sequential FastR implementation and performance is competitive with manually written GPU code. We also show that when taking into account startup time, large speedups are achievable, even when the application runs for as little as a few seconds.
Composing Durable Data Structures
This paper presents techniques for composing persistent data structures on machines with nonvolatile byte addressable memory. The techniques are applicable to a wide class of nonblocking algorithms.
Defense against Cache-Based Side Channel Attacks for Secure Cloud Computing
Cloud computing is a combination of various established technologies, such as virtualization, dynamic elasticity, and broadband Internet, that provides configurable computing resources as a service to users. Resources are shared among many distrusting clients by abstracting the underlying infrastructure using virtualization. While cloud computing has many practical benefits, resource sharing in cloud computing raises the threat of Cache-Based Side Channel Attacks (CSCA). In this paper a solution is proposed to detect CSCA and protect guest Virtual Machines (VMs) from it. Cache miss patterns are analyzed in this solution to detect side channel attacks. A notification channel between the client and the cloud service provider (CSP) is introduced to notify the CSP of the client's consent for running the prevention mechanism. A cache decay mechanism with a random decay interval is used as the prevention mechanism in the proposed solution. The performance of the proposed solution is compared with previous solutions, and the results indicate that this solution has the lowest performance overhead, a constant detection rate, and compatibility with the existing cloud computing model.
On Dynamic Information-Flow Analysis for Object-Oriented Programs
Information-flow security vulnerabilities, such as confidentiality and integrity violations, are real and serious problems found commonly in real-world software. Static analyses for information-flow control have the advantage of providing full coverage compared to dynamic analyses, as all possible security violations in the program need to be identified. On the other hand, dynamic information-flow analyses can offer distinct advantages in precision because they are less conservative than static analyses, rejecting only insecure executions instead of whole programs, and providing additional accuracy via flow- and path-sensitivity compared to static analyses. This talk will highlight some of our attempts to detect information-based security vulnerabilities in Java programs. In particular, we will discuss our investigation of dynamic program analysis for enforcing information-flow security in object-oriented programs. Even though we are able to obtain a soundness result for the analysis by formalising a core language and a generalised operational semantics that tracks explicit and implicit information propagation at runtime, we find it is fundamentally limited and practically infeasible to develop a purely dynamic analysis for information-flow security in the presence of shared objects and aliases.
Practical Partial Evaluation for High-Performance Dynamic Language Runtimes
Most high-performance dynamic language virtual machines duplicate language semantics in the interpreter, compiler, and runtime system, violating the principle to not repeat yourself. In contrast, we define languages solely by writing an interpreter. Compiled code is derived automatically using partial evaluation (the first Futamura projection). The interpreter performs specializations, e.g., augments the interpreted program with type information and profiling information. Partial evaluation incorporates these specializations. This makes partial evaluation practical in the context of dynamic languages, because it reduces the size of the compiled code while still compiling in all parts of an operation that are relevant for a particular program. Deoptimization to the interpreter, re-specialization in the interpreter, and recompilation embrace the dynamic nature of languages. We evaluate our approach comparing newly built JavaScript, Ruby, and R runtimes with current specialized production implementations of those languages. Our general purpose compilation system is competitive with production systems even when they have been heavily specialized and optimized for one language.
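As a rough illustration of the interplay between specialization and partial evaluation, the sketch below (plain Java, not the Truffle API) shows an interpreter node that rewrites itself to an integer-specialized variant after observing its operands; partially evaluating the specialized node then yields compiled code containing only the integer fast path guarded by a deoptimization check. In a real system the rewrite would replace the node in its parent tree; that plumbing is omitted here.

```java
// Minimal self-specialising interpreter node sketch (assumed names, not Truffle's API).
abstract class AddNode {
    abstract Object execute(Object left, Object right);
}

final class UninitializedAddNode extends AddNode {
    @Override Object execute(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            // specialise: a real interpreter would replace this node in the tree
            return new IntAddNode().execute(left, right);
        }
        return new GenericAddNode().execute(left, right);
    }
}

final class IntAddNode extends AddNode {
    @Override Object execute(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            return (Integer) left + (Integer) right;   // fast path kept by partial evaluation
        }
        // speculation failed: fall back (deoptimise) to the generic case
        return new GenericAddNode().execute(left, right);
    }
}

final class GenericAddNode extends AddNode {
    @Override Object execute(Object left, Object right) {
        return ((Number) left).doubleValue() + ((Number) right).doubleValue();
    }
}
```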
Primus Inter Pares: Improving Parallelism in Hardware Transactional Memory
Hardware transactional memory (HTM) is supported by recent processors from Intel and IBM. HTM is attractive because it can enhance concurrency while simplifying programming. Today's HTM systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to very poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits concurrency. In this paper, we propose very simple architectural changes to the existing requester-wins HTM architectures. These changes permit higher levels of concurrency when transactions cannot make progress and require a fallback path. The idea is to support a special mode of execution in HTM, called power mode, which can be used to enhance conflict resolution between regular and so-called power transactions. Our idea is backward-compatible with existing HTM code, imposing no additional cost on transactions that do not use the power mode. In addition, it supports detection of undesired dynamic data sharing, indicating when the data sets of transactions that should be disjoint are not. Using extensive evaluation of micro- and STAMP benchmarks in a transactional memory simulator and real hardware-based emulation, we show that our technique significantly improves performance over the baseline that does not use power mode, and performs similarly to or better than state-of-the-art related proposals that require more substantial architectural changes.
Intrusion Detection of a Simulated SCADA System using Data-Driven Modeling
Supervisory Control and Data Acquisition (SCADA) systems have become integrated into many industries that have a need for control and automation. Examples of these industries include energy, water, transportation, and petroleum. A typical SCADA system consists of field equipment for process actuation and control, along with proprietary communication protocols. These protocols are used to communicate between the field equipment and the monitoring equipment located at a central facility. Given that distribution of vital resources is often controlled by this type of system, there is a need to secure the networked compute and control elements from users with malicious intent. This paper investigates the use of data-driven modeling techniques to identify various types of intrusions tested against a simulated SCADA system. The test bed uses three enterprise servers that were part of a university engineering Linux cluster. These were isolated so that job queries on the cluster would not be reflected in the normal behavior of the test bed, and to ensure that intrusion testing would not affect other components of the cluster. One server acts as a Master Terminal Unit (MTU), which simulates control and data acquisition processes. The other two act as Remote Terminal Units (RTUs), which simulate monitoring and telemetry transmission. All servers use Ubuntu 14.04 as the OS. A separate workstation using Kali Linux acts as a Human Machine Interface (HMI), which is used to monitor the simulation and perform intrusion testing. Monitored telemetry included network traffic, and hardware and software digitized time series signatures. The models used in this research include the Auto-Associative Kernel Regression (AAKR) and the Auto-Associative Multivariate State Estimation Technique (AAMSET) [1, 2]. This type of intrusion detection can be classified as a behavior-based technique, wherein data collected when the system exhibits normal behavior is first used to train and optimize the previously mentioned machine learning models. Any future monitored telemetry that deviates from this normal behavior can be treated as anomalous, and may indicate an attack against the system. Models were tested to evaluate the prognostic effectiveness when monitoring clusters of signals from four classes of telemetry: a combination of all telemetry signals, memory and CPU usage, disk usage, and TCP/IP statistics. Anomaly detection is performed by using the Sequential Probability Ratio Test (SPRT), which is a binary sequential statistical test developed by Wald [3]. This test determines whether the monitored observation has mean or variance shifted from defined normal behavior [4]. For the prognostic security experiments reported in this paper, we established rigorous quantitative functional requirements for evaluating the outcome of the intrusion-signature fault injection experiments. These were a high accuracy for model predictions of dynamic telemetry metrics, and ultralow False Alarm and Missed Alarm Probabilities (FAPs and MAPs)...
SimSPRT-II: Monte Carlo Simulation of Sequential Probability Ratio Test Algorithms for Optimal Prognostic Performance
New prognostic AI innovations are being developed, optimized, and productized for enhancing the reliability, availability, and serviceability of enterprise servers and data centers, a field known as Electronic Prognostics (EP). EP prognostic innovations are now being spun off for prognostic cyber-security applications, and for Internet-of-Things (IoT) prognostic applications in the industrial sectors of manufacturing, transportation, and utilities. For these applications, the function of prognostic anomaly detection is achieved by predicting what each monitored signal “should be” via highly accurate empirical nonlinear nonparametric (NLNP) regression algorithms, and then differencing the optimal signal estimates from the real measured signals to produce “residuals”. The residuals are then monitored with a Sequential Probability Ratio Test (SPRT). The advantage of the SPRT, when tuned properly, is that it provides the earliest mathematically possible annunciation of anomalies growing into time series signals for a wide range of complex engineering applications. SimSPRT-II is a comprehensive parametric Monte Carlo simulation framework for tuning, optimization, and performance evaluation of SPRT algorithms for any type of digitized time-series signal. SimSPRT-II enables users to systematically optimize SPRT performance as a multivariate function of Type-I and Type-II errors, Variance, Sampling Density, and System Disturbance Magnitude, and then quickly evaluate what we believe to be the most important overall prognostic performance metrics for real-time applications: Empirical False and Missed-Alarm Probabilities (FAPs and MAPs), SPRT Tripping Frequency as a function of anomaly severity, and Overhead Compute Cost as a function of sampling density. SimSPRT-II has become a vital tool for tuning, optimization, and formal validation of SPRT-based AI algorithms for applications in a broad range of engineering and security prognostic applications.
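For reference, the core of a Wald SPRT for a positive mean shift in Gaussian residuals fits in a few lines; the sketch below follows the standard textbook formulation (thresholds derived from the target Type-I and Type-II error probabilities) and is not the SimSPRT-II implementation itself.

```java
// Wald's sequential probability ratio test for residuals that are N(0, variance)
// under normal behavior and N(shift, variance) under a fault. The accumulated
// log-likelihood ratio is compared against thresholds derived from alpha (false
// alarm) and beta (missed alarm).
final class MeanShiftSprt {
    enum Decision { CONTINUE, ANOMALY, NORMAL }

    private final double upper, lower;   // ln((1-beta)/alpha), ln(beta/(1-alpha))
    private final double shift, variance;
    private double llr;                  // accumulated log-likelihood ratio

    MeanShiftSprt(double alpha, double beta, double shift, double variance) {
        this.upper = Math.log((1 - beta) / alpha);
        this.lower = Math.log(beta / (1 - alpha));
        this.shift = shift;
        this.variance = variance;
    }

    Decision observe(double residual) {
        llr += (shift / variance) * (residual - shift / 2);      // Gaussian LLR increment
        if (llr >= upper) { llr = 0; return Decision.ANOMALY; }  // trip and restart
        if (llr <= lower) { llr = 0; return Decision.NORMAL; }   // accept and restart
        return Decision.CONTINUE;
    }
}
```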
SimML Framework: Monte Carlo Simulation of Statistical Machine Learning Algorithms for IoT Prognostic Applications
Advanced statistical machine learning (ML) algorithms are being developed, trained, tuned, optimized, and validated for real-time prognostics for internet-of-things (IoT) applications in the fields of manufacturing, transportation, and utilities. For such applications, we have achieved greatest prognostic success with ML algorithms from a class of pattern recognition known as nonlinear, nonparametric regression. To intercompare candidate ML algorithmics to identify the “best” algorithms for IoT prognostic applications, we use three quantitative performance metrics: false alarm probability (FAP), missed alarm probability (MAP), and overhead compute cost (CC) for real-time surveillance. This paper presents a comprehensive framework, SimML, for systematic parametric evaluation of statistical ML algorithmics for IoT prognostic applications. SimML evaluates quantitative FAP, MAP, and CC performance as a parametric function of input signals’ degree of cross-correlation, signal-to-noise ratio, number of input signals, sampling rates for the input signals, and number of training vectors selected for training. Output from SimML is provided in the form of 3D response surfaces for the performance metrics that are essential for comparing candidate ML algorithms in precise, quantitative terms.
Ruby’s C Extension Problem and How We're Solving It
Ruby’s C extensions have so far been the best way to improve the performance of Ruby code. Ironically, they are now holding performance back, because they expose the internals of Ruby and mean we aren’t free to make major changes to how Ruby works. In JRuby+Truffle we have a radical solution to this problem – we’re going to interpret the source code of your C extensions, like how Ruby interprets Ruby code. Combined with a JIT this lets us optimise Ruby but keep support for C extensions.
Points-To Analysis: Provenance Generation
The usual points-to analysis does not store the justification for the presence of a tuple in the points-to result. However, this is required for many client-driven queries, as provenance information gives the client justifications that can be used in other contexts such as debugging. In this presentation, we describe our approach to generating provenance information using the results of a context-sensitive points-to analysis. It has been implemented using the DOOP framework and the Souffle Datalog engine. Our use cases demand that the approach scale to large code bases. We use four benchmarks derived from two versions of the JDK and two realistic clients to demonstrate the effectiveness of our approach.
SLIDES: How to Tell a Compiler What We Think We Know?
Slides for an invited keynote talk at 2016 ACM SPLASH-I
Optimizing R Language Execution via Aggressive Speculation
The R language, from the point of view of language design and implementation, is a unique combination of various programming language concepts. It has functional characteristics like lazy evaluation of arguments, but also allows expressions to have arbitrary side effects. Many runtime data structures, for example variable scopes and functions, are accessible and can be modified while a program executes. Several different object models allow for structured programming, but the object models can interact in surprising ways with each other and with the base operations of R. R works well in practice, but it is complex, and it is a challenge for language developers trying to improve on the current state-of-the-art, which is the reference implementation – GNU R. The goal of this work is to demonstrate that, given the right approach and the right set of tools, it is possible to create an implementation of the R language that provides significantly better performance while keeping compatibility with the original implementation. In this paper we describe novel optimizations backed up by aggressive speculation techniques and implemented within FastR, an alternative R language implementation, utilizing Truffle – a JVM-based language development framework developed at Oracle Labs. We also provide experimental evidence demonstrating effectiveness of these optimizations in comparison with GNU R, as well as Renjin and TERR implementations of the R language.
smalltalkCI: A Continuous Integration Framework for Smalltalk Projects
Continuous integration (CI) is a programming practice that reduces the risk of project failure by integrating code changes multiple times a day. This has always been important to the Smalltalk community, which operates custom integration infrastructures that allow CI testing for Smalltalk projects shared in Monticello repositories or as traditional changesets.
In the last few years, the open hosting platform GitHub has become more and more popular for Smalltalk projects. Unfortunately, there was no convenient way to enable CI testing for those projects.
We present smalltalkCI, a continuous integration framework for Smalltalk. It aims to provide a uniform way to load and test Smalltalk projects written in different Smalltalk dialects. smalltalkCI runs on Linux, macOS, and Windows, and can be used locally as well as on a remote server. In addition, it is compatible with Travis CI and AppVeyor, which allows developers to easily set up free CI testing for their GitHub projects without having to run a custom integration infrastructure.
Matriona: Class Nesting with Parameterization in Squeak/Smalltalk
We present Matriona, a module system for Squeak, a Smalltalk dialect. It supports class nesting and parameterization and is based on a hierarchical name lookup mechanism. Matriona solves a range of modularity issues in Squeak. Instead of a flat class organization, it provides a hierarchical namespace that avoids name clashes and allows for shorter local names. Furthermore, it provides a way to share behavior among classes and modules using mixins and class hierarchy inheritance (a form of inheritance that subclasses an entire class family), respectively. Finally, it allows modules to be externally configurable, which is a form of dependency management decoupling a module from the actual implementation of its dependencies. Matriona is implemented on top of Squeak by introducing a new keyword for run-time name lookups through a reflective mechanism, without modifying the underlying virtual machine. We evaluate Matriona with a series of small applications and demonstrate how its features can benefit modularity when porting a simple application written in plain Squeak to Matriona.
Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle
Although the Java platform has been used as a multi-language platform, most of the low-level languages (such as C, Fortran, and C++) cannot be executed efficiently on the JVM. We propose Sulong, a system that can execute LLVM-based languages on the JVM. By targeting LLVM IR, Sulong is able to execute C, Fortran, and other languages that can be compiled to LLVM IR. Sulong combines LLVM's static optimizations with dynamic compilation to reach a peak performance that is near to the performance achievable with static compilers. For C benchmarks, Sulong's peak runtime performance is on average 1.39x slower (0.79x to 2.45x) compared to the performance of executables compiled by Clang O3. For Fortran benchmarks, Sulong is 2.63x slower (1.43x to 4.96x) than the performance of executables compiled by GCC O3. This low overhead makes Sulong an alternative to Java's native function interfaces. More importantly, it also allows other JVM language implementations to use Sulong for implementing their native interfaces.
How to Tell a Compiler What We Think We Know?
I have been repeatedly quoted (and tweeted) as having remarked more than once over the last decade, "If it's worth telling yourself (or another programmer), it's worth telling the compiler." In this talk, I will try to explain in more detail what I meant by this. In particular, I have noticed that programming languages provide lots of ways to annotate one thing, but not very many good ways to talk about relationships among multiple things (other than regarding one as a "server" to which an annotation is attached and the others as "clients"). As a very simple example, we don't even yet have a relatively standard way to say such simple things as "Thus-and-so value is an identity for this binary operation" or "this operation distributes over that operation". Algebraic constraints are one way to express some of these relationships, but where in a program should they be placed? How can they be generalized and abstracted? Does object-oriented design make this task easier or harder? I am particularly interested in what we might want to say in the future to a compiler that incorporates a full-blown theorem prover. This talk will be a sort of oral essay, raising more questions than it answers.
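To make the gap concrete, here is one hypothetical shape such declarations could take in Java; the @Identity and @DistributesOver annotations below do not exist in any standard library and are shown purely to illustrate the kind of relationships among operations the talk argues for.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotations (assumed names, not an existing API) expressing algebraic
// facts a compiler or theorem prover could check or exploit.
@Retention(RetentionPolicy.CLASS)
@interface Identity { String value(); }          // names the identity element of the operation

@Retention(RetentionPolicy.CLASS)
@interface DistributesOver { String value(); }   // names another operation this one distributes over

final class IntAlgebra {
    @Identity("0")
    static int add(int a, int b) { return a + b; }

    @Identity("1")
    @DistributesOver("add")                      // multiply distributes over add
    static int multiply(int a, int b) { return a * b; }
}
```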
Become Polyglot by learning Java!
In a world running at breakneck speed to embrace JavaScript, it is refreshing to see a project that embraces Java to provide a solution that deals with the new world and even improves it. I describe Truffle, a project that aims to build a multi-language, multi-tenant, multi-threaded, multi-node, multi-tooling and multi-system environment on top of the Java virtual machine, with the goal of forming the fastest and most flexible execution environment on the planet! Learn about Truffle and its Java APIs to become a real polyglot, use the best language for a task, and never ask again: Do I really have to use that crummy language?
FastR - Optimizing and Enhancing R Language Implementation
The current reference implementation of the R language, namely GNU R, is very mature and extremely popular. Nevertheless, alternative implementations are under development with the goal of improving and enhancing the current state-of-the-art. FastR is one such implementation created by Oracle Labs in collaboration with academic partners. FastR aims to deliver a fully compatible R language implementation compiling R programs to efficient native code, but which at the same time constitutes an experimentation platform for enhancing some of the existing R capabilities, for example with respect to parallel execution. FastR is built upon an infrastructure consisting of an optimizing compiler called Graal and of the Truffle framework, which simplifies the creation of new language runtimes that can then interface with Graal. The infrastructure is specifically designed to support the creation of dynamic languages, such as R, by taking advantage of runtime execution profiling and aggressive optimistic optimizations during the compilation process. In this talk I will describe how the Graal/Truffle infrastructure enables some of the optimizations in FastR's runtime and demonstrate how effective these optimizations are in practice, based on an experimental performance evaluation. I will also present our work on enhancing R, in particular with respect to parallel computation capabilities, by supplanting GNU R’s process-based model (as defined in the parallel or snowfall packages) with an API-compatible thread-based model where communication between different parts of a parallel computation occurs over shared-memory channels.
Formal Verification of Division and Square Root Implementations, an Oracle Report
These are the slides that go with OL 2016-0771, a conference paper with the same title.
Dynamic Adaptation of User Migration Policies in Distributed Virtual Environments
A distributed virtual environment (DVE) consists of multiple network nodes (servers), each of which can host many users that consume CPU resources on that node and communicate with users on other nodes. Users can be dynamically migrated between the nodes, and the ultimate goal for the migration policy is to minimize the average system response time perceived by the users. In order to achieve this, the user migration policy should minimize network communication while balancing the load among the nodes so that CPU resources of the individual nodes are not overloaded. This paper considers a multi-player online game as an example of a DVE and presents an adaptive distributed user migration policy, which uses Reinforcement Learning to tune itself so as to minimize the average system response time perceived by the users. Performance of the self-tuning policy was compared on a simulator with the standard benchmark non-adaptive migration policy and with the optimal static user allocation policy in a variety of scenarios; the self-tuning policy was shown to greatly outperform both benchmark policies, with the performance difference increasing as the network becomes more overloaded.
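As one generic illustration of how such a policy can tune itself, the sketch below shows a tabular Q-learning update driven by observed response times; the paper's actual state encoding, action set, and learning algorithm are not specified here, so all names and constants are illustrative.

```java
// Generic tabular Q-learning update (illustrative only): rewards could be negated
// response times observed after a migration decision, and actions could be
// "migrate user to node i" or "keep user in place".
final class MigrationPolicyLearner {
    private final double[][] q;          // q[state][action]
    private final double alpha = 0.1;    // learning rate
    private final double gamma = 0.9;    // discount factor

    MigrationPolicyLearner(int states, int actions) { q = new double[states][actions]; }

    void update(int state, int action, double reward, int nextState) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : q[nextState]) best = Math.max(best, v);
        q[state][action] += alpha * (reward + gamma * best - q[state][action]);
    }

    int bestAction(int state) {
        int arg = 0;
        for (int a = 1; a < q[state].length; a++) if (q[state][a] > q[state][arg]) arg = a;
        return arg;
    }
}
```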
Polyglot on the JVM with Graal
What Went Wrong? Automatic Triage of Precision Loss During Static Analysis of JavaScript
Static analysis tools tend to have insufficient means to debug a complex notion such as precision, which in our experience leads to time-consuming human analysis. We propose to augment the analysis framework so that it keeps track of the loss of precision throughout the analysis. This precision-tracking information brings us one step closer to pinpointing the reason why our analysis fails. In this talk, we will detail our motivation for precision tracking and our experience with it, in the context of static analysis with the SAFE framework aimed at real-world JavaScript applications.
Flying and Decoupling Capacitance Optimization for Area-Constrained On-Chip Switched-Capacitor Voltage Regulators
Switched-capacitor (SC) voltage regulators are widely used in on-chip power management, due to their high efficiency at integer-ratio step-down and their feasibility for integration. Theoretical analysis and optimization for SC DC-DC converters have been presented in prior work; however, the optimization of the different capacitors, namely flying and input/output decoupling capacitors, in SC voltage regulators (SCVRs) under an area constraint has not been addressed. In this work, we propose a methodology to optimize flying and decoupling capacitance for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. Considering both conversion efficiency and droop voltage against fast load transients, the proposed model determines the optimal ratio between flying and decoupling capacitance for a fixed total area. These models are validated with integrated 2:1 SCVR implementations in both 65nm and 32nm CMOS. Experiments show high model accuracy on efficiency and droop modeling for a broad range of flying and decoupling capacitance. The maximum and average error of the predicted optimal ratio between flying and decoupling capacitance is 5% and 1.7%, respectively.
Who reordered my code?!
There is a hidden problem waiting as Ruby becomes 3x faster and starts to support parallel computation - reordering by JIT compilers and CPUs. In this talk, we’ll start by trying to optimize a few simple Ruby snippets. We’ll play the role of a JIT and a CPU and order operations as the rules of the system allow. Then we add a second thread to the snippets and watch it as it breaks horribly. In the second part, we’ll fix the unwanted reorderings by introducing a memory model to Ruby. We’ll discuss in detail how it fixes the snippets and how it can be used to write faster code for parallel execution.
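The failure mode the talk starts from can be reproduced on the JVM, which already has a memory model. In the Java sketch below, the two writes in publish() may be reordered by the JIT or the CPU unless `ready` is volatile, which is exactly the class of bug a Ruby memory model would rule out; this is an analogous JVM example, not one of the talk's Ruby snippets.

```java
// Classic publication race: without `volatile`, a reader may observe ready == true
// while data is still 0, because the two writes in publish() may be reordered.
final class Publication {
    int data = 0;
    volatile boolean ready = false;   // remove `volatile` and the snippet breaks

    void publish() {                  // thread 1
        data = 42;
        ready = true;
    }

    int consume() {                   // thread 2
        while (!ready) { /* spin */ }
        return data;                  // guaranteed to be 42 only because `ready` is volatile
    }
}
```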
A Tale of Two String Representations
Strings are used pervasively in Ruby. If we can make them faster, we can make many apps faster. In this talk, I will be introducing ropes: an immutable tree-based data structure for implementing strings. While an old idea, ropes provide a new way of looking at string performance and mutability in Ruby. I will describe how we replaced a byte array-oriented string representation with a rope-based one in JRuby+Truffle. Then we’ll look at how moving to ropes affects common string operations, its immediate performance impact, and how ropes can have cascading performance implications for apps.
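A minimal rope looks roughly like the Java sketch below: strings are immutable trees of leaf and concatenation nodes, so concatenation allocates one node in O(1) instead of copying bytes. This is only an illustration of the data structure, not JRuby+Truffle's representation.

```java
// Minimal rope sketch: leaves hold flat text, concatenation nodes reference two
// child ropes, and all nodes are immutable.
abstract class Rope {
    abstract int length();
    abstract char charAt(int index);

    static Rope of(String s) { return new Leaf(s); }
    Rope concat(Rope right) { return new Concat(this, right); }   // O(1): no copying

    static final class Leaf extends Rope {
        private final String value;
        Leaf(String value) { this.value = value; }
        int length() { return value.length(); }
        char charAt(int i) { return value.charAt(i); }
    }

    static final class Concat extends Rope {
        private final Rope left, right;
        private final int length;
        Concat(Rope left, Rope right) {
            this.left = left; this.right = right;
            this.length = left.length() + right.length();
        }
        int length() { return length; }
        char charAt(int i) {
            return i < left.length() ? left.charAt(i) : right.charAt(i - left.length());
        }
    }
}
```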
Using LLVM and Sulong for Language C Extensions
Many languages such as Ruby, Python and JavaScript support extension modules written in C, either for speed or to create interfaces to native libraries. Ironically, these extensions can hold back performance of the languages themselves because the native interfaces expose implementation details about how the language was first implemented, such as the layout of data structures. In JRuby+Truffle, an implementation of Ruby, we are using the Sulong LLVM bitcode interpreter to run C extensions on the JVM. By combining LLVM's static optimizations with dynamic compilation, Sulong is fast, but Sulong also gives us a powerful new tool - it allows us to abstract from normal C semantics and to appear to provide the same native API while actually mapping it to our own alternative data structures and implementation. We'll demonstrate Sulong and how we're using it to implement Ruby C extensions.
One Compiler: Deoptimization to Optimized Code
Deoptimization enables speculative compiler optimizations, which are an essential part of nearly every high-performance virtual machine (VM). But it comes with a cost: a separate first-tier interpreter or baseline compiler in addition to the optimizing compiler. Because such a first-tier execution uses a fixed stack frame layout, this affects all VM components that need to walk the stack. We propose to use the optimizing compiler also to compile deoptimization target code, i.e., the non-speculative code where execution continues after a deoptimization. Deoptimization entry points are described with the same scope descriptors used to describe the origin of the deoptimization, i.e., deoptimization is a two-way matching of two scope descriptors describing the same abstract frame. We use this deoptimization approach in a high-performance JavaScript VM written in Java. It strictly uses a one-compiler approach, i.e., all frames on the stack (VM runtime, first-tier execution in a JavaScript AST interpreter, dynamic compilation, deoptimization entry points) originate from the same compiler. Code with deoptimization entry points generated by the optimizing compiler imposes a much smaller overhead than a traditional first-tier execution.
Frappé Bug Trace Overview Slides for Prof Sukyoung Ryu (KAIST University)
Overview slides of the bug trace extensions to Frappé and the Frappé architecture.
Using Domain-Specific Languages for Analytic Graph Databases
Recently, graphs have been drawing a lot of attention, both as a natural data model that captures fine-grained relationships between data entities and as a tool for powerful data analysis that considers such relationships. In this paper, we present a new graph database system that integrates a robust graph storage with an efficient graph analytics engine. Primarily, our system adopts two domain-specific languages, one for describing graph analysis algorithms and the other for graph pattern matching queries. Compared to the API-based approaches in conventional graph processing systems, the DSL-based approach provides users with more flexible and intuitive ways of expressing algorithms and queries. Moreover, the DSL-based approach has significant performance benefits as well, by skipping (remote) API invocation overhead and by applying high-level optimizations in the compiler.
Ahead-of-time Compilation of FastR Functions Using Static Analysis
The FastR project delivers high peak-performance through the use of JIT-compilation, but cannot currently provide this performance for methods on first call. This especially affects startup-performance and performance of applications that only call functions once, possibly with large inputs (i.e. data processing). This project presents an approach and the necessary patterns for implementing an AOT-compilation facility within FastR, enabling compilation of call targets just before being first called. The AOT-compilation produces code that has profiling and specialization information tailored to the expected function argument values for the first call, without needing to execute the function in full. The performance results show a clear and unambiguous performance gain for first-call performance of AOT-compiled functions (up to 4x faster, excluding compilation time). Due to constant compilation time there is the potential for overall startup performance improvement for long-running functions even when compilation time is included. While the static analysis itself imposes almost no overhead, compilation times are up to 1.4x higher than with regularly compiled code, due to the inherent imprecision of the current analysis. Although peak performance is reduced, AOT-compilation can be the solution where faster first-call performance, the possibility of offloading/remote execution, and more performance predictability are important.
Asynchronous Memory Access Chaining
In-memory databases rely on pointer-intensive data structures to quickly locate data in memory. A single lookup operation in such data structures often exhibits long-latency memory stalls due to dependent pointer dereferences. Hiding the memory latency by launching additional memory accesses for other lookups is an effective way of improving performance of pointer-chasing codes (e.g., hash table probes, tree traversals). The ability to exploit such inter-lookup parallelism is beyond the reach of modern out-of-order cores due to the limited size of their instruction window. Instead, recent work has proposed software prefetching techniques that exploit inter-lookup parallelism by arranging a set of independent lookups into a group or a pipeline, and navigating their respective pointer chains in a synchronized fashion. While these techniques work well for highly regular access patterns, they break down in the face of irregularity across lookups. Such irregularity includes variable-length pointer chains, early exit, and read/write dependencies. This work introduces Asynchronous Memory Access Chaining (AMAC), a new approach for exploiting inter-lookup parallelism to hide the memory access latency. AMAC achieves high dynamism in dealing with irregularity across lookups by maintaining the state of each lookup separately from that of other lookups. This feature enables AMAC to initiate a new lookup as soon as any of the in-flight lookups complete. In contrast, the static arrangement of lookups into a group or pipeline in existing techniques precludes such adaptivity. Our results show that AMAC matches or outperforms state-of-the-art prefetching techniques on regular access patterns, while delivering up to 2.3x higher performance under irregular data structure lookups. AMAC fully utilizes the available micro-architectural resources, generating the maximum number of memory accesses allowed by hardware in both single- and multi-threaded execution modes.
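The scheduling idea can be sketched in plain Java for chained hash-table probes: each in-flight lookup keeps its own state, the loop visits the states round-robin, advancing each pointer chain by one node per visit, and a finished slot is immediately refilled with the next key. Java offers no software prefetch instruction, so this shows only the control structure, not the memory-level parallelism the paper exploits, and all names are illustrative.

```java
// Per-lookup state machines for hash-table probes, advanced round-robin so that a
// new lookup can start as soon as any in-flight lookup finishes.
final class AmacProbe {
    static final class Node {
        final int key, value; final Node next;
        Node(int key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    // Probes all keys against a chained hash table, keeping `group` lookups in flight.
    static long probeAll(Node[] buckets, int[] keys, int group) {
        int[] key = new int[group];
        Node[] cursor = new Node[group];
        boolean[] active = new boolean[group];
        int next = 0, live = 0;
        long sum = 0;

        for (int s = 0; s < group && next < keys.length; s++) {     // start the initial group
            key[s] = keys[next]; cursor[s] = buckets[Math.floorMod(key[s], buckets.length)];
            active[s] = true; next++; live++;
        }
        while (live > 0) {
            for (int s = 0; s < group; s++) {
                if (!active[s]) continue;
                Node n = cursor[s];
                if (n != null && n.key != key[s]) {                 // one dependent load per visit
                    cursor[s] = n.next;                             // (a real AMAC kernel prefetches here)
                    continue;
                }
                if (n != null) sum += n.value;                      // hit; a null cursor means a miss
                if (next < keys.length) {                           // refill the slot immediately
                    key[s] = keys[next]; cursor[s] = buckets[Math.floorMod(key[s], buckets.length)]; next++;
                } else {
                    active[s] = false; live--;
                }
            }
        }
        return sum;
    }
}
```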
Adaptive Detection Technique for Cache Based Side Channel Attack using Bloom Filter for Secure Cloud
Security is one of the main concerns in the field of cloud computing. Different users sharing the same physical machines, or even the same software, on a frequent basis makes the cloud vulnerable to many security threats. Side channel attacks are the most probable attacks in the cloud because of physical resource sharing. In the cloud, multiple Virtual Machines (VMs) sharing the same physical machine create a great opportunity to carry out a Cache-based Side Channel Attack (CSCA). In this paper, a novel detection technique for CSCA using a Bloom Filter (BF) is designed. This technique treats a cache miss sequence as a signature of a CSCA and uses a difference mean calculator to generate these signatures. The technique is adaptive, which makes it possible to detect CSCAs with new patterns that have not been observed yet. The Bloom filter is used to reduce the performance overhead to a minimum. The solution is implemented with a cache simulator and proved very effective, as its execution time is much lower than that of a CSCA.
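A minimal Bloom filter for remembering miss-pattern signatures can be sketched as below; the hash mixing and signature encoding are placeholders rather than the paper's, and the point is only that membership queries cost a few bit probes while false positives (but never false negatives) are possible.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into a fixed-size bit array.
final class SignatureBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SignatureBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size); this.size = size; this.hashes = hashes;
    }

    private int index(long signature, int i) {
        long h = signature * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;  // cheap mixing, illustrative
        return (int) Math.floorMod(h ^ (h >>> 31), (long) size);
    }

    void add(long signature) {
        for (int i = 0; i < hashes; i++) bits.set(index(signature, i));
    }

    // false means "definitely not seen"; true means "probably seen" (false positives possible)
    boolean mightContain(long signature) {
        for (int i = 0; i < hashes; i++) if (!bits.get(index(signature, i))) return false;
        return true;
    }
}
```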
The lights in the Tunnel: Coverage Analysis for Formal Verification
As formal verification engineers, the authors always face the challenge of knowing the current status of their test benches. Many questions need to be answered at certain stages of a project: do we need more assertions? Did we over-constrain inputs and thereby drop an important design scenario? Are proof bounds for bounded proofs good enough to catch potential design bugs? For the properties that are fully proven, do they cover the design logic that they were intended to cover? These four most-asked questions don’t have answers without extracting information from formal engines, which is not feasible for general users. However, like coverage from simulation-based verification, formal verification coverage can be defined and used as metrics to measure formal verification progress and completeness. In this paper, the authors will introduce formal verification coverage models and their usages through real-life examples. The four most-asked questions finally have reasonable and acceptable answers supported by these metrics.
Self-Specialising Interpreters and Partial Evaluation
Abstract syntax trees are a simple way to represent programs and to implement language interpreters. They can also be an easy way to produce high-performance dynamic compilers, by combining them with self-specialisation and partial evaluation. Self-specialisation allows the nodes in a program tree to rewrite themselves with more specialised variants in order to increase performance, such as replacing method calls with inline caches or replacing stronger operations with weaker ones based on profiled types. Partial evaluation can then take this specialised abstract syntax tree and produce optimised machine code based on it. We’ll show how these two techniques work and how they have been implemented by Oracle Labs in Truffle and Graal and used in implementations of languages including JavaScript, C, Ruby, R and more.
Malthusian Locks
Applications running in modern multithreaded environments are sometimes overthreaded. The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse. Often, such software also has highly contended locks. We opportunistically leverage the existence of such locks by modifying the lock admission policy so as to intentionally limit the number of distinct threads circulating over the lock in a given period. Specifically, if there are more threads circulating than are necessary to keep the lock saturated (continuously held), our approach will selectively cull and passivate some of those excess threads. We borrow the concept of swapping from the field of memory management and impose concurrency restriction (CR) if a lock is oversubscribed. The resultant admission order is unfair over the short term but we explicitly provide long-term fairness by periodically shifting threads between the set of passivated threads and those actively circulating. Our approach is palliative, but often effective, and in the worst case does no harm.
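A highly simplified sketch of concurrency restriction in front of an ordinary lock is shown below: at most `maxActive` threads circulate over the lock, surplus threads are passivated by parking, and one passivated thread is promoted every `promotionPeriod` releases to provide long-term fairness. The class names, constants, and promotion rule are illustrative, not the paper's lock implementation.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Concurrency restriction sketch: an admission gate limits how many distinct threads
// circulate over the underlying lock; excess threads are parked and rotated back in
// periodically for long-term fairness.
final class ConcurrencyRestrictedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final ConcurrentLinkedDeque<Thread> passive = new ConcurrentLinkedDeque<>();
    private final AtomicInteger active = new AtomicInteger();
    private final int maxActive;
    private final int promotionPeriod;
    private int releases;                               // only touched while holding `lock`

    ConcurrencyRestrictedLock(int maxActive, int promotionPeriod) {
        this.maxActive = maxActive; this.promotionPeriod = promotionPeriod;
    }

    void lock() {
        while (true) {
            int a = active.get();
            if (a < maxActive && active.compareAndSet(a, a + 1)) break;   // admitted
            passive.add(Thread.currentThread());        // passivate: park until promoted
            LockSupport.park(this);
            passive.remove(Thread.currentThread());     // woken (possibly spuriously): retry admission
        }
        lock.lock();
    }

    void unlock() {
        boolean promote = (++releases % promotionPeriod) == 0;
        lock.unlock();
        active.decrementAndGet();
        Thread t = promote ? passive.poll() : null;     // long-term fairness: rotate one thread in
        if (t != null) LockSupport.unpark(t);
    }
}
```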
Efficient analysis using Soufflé - An experience report
Souffle is an open-source programming framework for static program analysis. It enables the analysis designer to express static program analyses on very large code bases, such as a points-to analysis for the Java Development Kit (JDK), which has more than 1.5 million variables and 600 thousand call sites. Souffle employs a Datalog-like language as a domain-specific language for static program analysis. Its finite domain semantics lends itself to efficient execution on parallel hardware using various levels of program specialisation. A specialization hierarchy is applied to a Datalog program. As a result, highly specialized and optimised C++ code is produced that harvests the computational power of modern shared-memory/multi-core computer architectures. We have been using Souffle to explore and develop vulnerability detection analyses on the Java platform, using JDK 7, 8 and 9. These vulnerability detection analyses make use of points-to analysis (reusing parts of the DOOP framework), taint analysis, escape analysis, and other data flow-based analyses. In this talk we report on the types of analyses used, the sizes of the input relations and computed relations, as well as the runtime and memory requirements for the analyses of such large codebases. For the program specialization, we use several translation steps. In each translation step, new optimisation opportunities open up that could not be exploited in the previous translation step. The first translation uses a Futamura projection to translate a declarative Datalog program to an imperative relational program for an abstract machine which we call the Relational Algebra Machine (RAM). The RAM program contains relational algebra operations to compute results produced by clauses, relation management operations to keep track of previous, current and new knowledge in the semi-naive evaluation, and imperative constructs including statement composition for sequencing the operations, and loop constructs with loop exit conditions to express fixed-point computations for recursively-defined relations. It also has support for parallelism. The next translation step translates the optimized RAM program into a C++ program that uses meta-programming techniques with templates. The last translation step is performed by a C++ compiler, which compiles the generated C++ program into an executable binary. Operations for emptiness and existence checks, range queries, insertions and unions are highly efficient because portions of the operations are pushed from runtime to compile-time using meta-programming techniques. We now outline some of the novel aspects of the implementation of Souffle. The first is related to indices. Since indices are costly, a minimal set of indices for a given relation is desired. We solve a discrete optimization problem to create only the indices required for execution, and hence avoid redundancies. The second is the choice of data structures to represent large relations...
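The "previous, current and new knowledge" bookkeeping of semi-naive evaluation, which the RAM program manages explicitly, can be illustrated with a transitive-closure computation in plain Java; Souffle's generated C++ specializes and parallelizes this pattern, so the sketch below shows only the evaluation strategy.

```java
import java.util.HashSet;
import java.util.Set;

// Semi-naive evaluation of transitive closure: `total` holds all facts derived so far,
// `delta` holds the facts new in the current iteration, and the recursive rule joins
// only against the delta so each derivation is attempted once.
final class SemiNaiveTransitiveClosure {
    record Pair(int from, int to) {}

    static Set<Pair> closure(Set<Pair> edges) {
        Set<Pair> total = new HashSet<>(edges);
        Set<Pair> delta = new HashSet<>(edges);
        while (!delta.isEmpty()) {
            Set<Pair> next = new HashSet<>();
            for (Pair p : delta) {                 // path(x, z) :- delta(x, y), edge(y, z).
                for (Pair e : edges) {
                    if (p.to() == e.from()) {
                        Pair derived = new Pair(p.from(), e.to());
                        if (total.add(derived)) next.add(derived);
                    }
                }
            }
            delta = next;                          // only genuinely new facts feed the next round
        }
        return total;
    }
}
```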
One Compiler
The stack of a running Java HotSpot VM has stack frames from multiple compilers (the C compiler, the client compiler, and the server compiler) as well as bytecode interpreter stack frames. That complicates essential VM tasks (stack walking, garbage collection, and deoptimization), increases maintenance costs, and makes porting to new hardware architectures difficult. We argue that a single compiler is sufficient: Using the Graal compiler in different configurations, we can execute Java, JavaScript, and many other languages. The stack only contains a single kind of stack frame: frames from ahead-of-time compiled code, interpreter frames (from an ahead-of-time compiled AST interpreter), frames from just-in-time compiled code, and deoptimized frames (ahead-of-time compiled code with deoptimization entry points). In this talk, we outline the necessary components of such a streamlined system: deoptimization to compiled frames (in contrast to deoptimization to interpreter frames), access to low-level OS data structures directly from Java, and writing the whole runtime system (including the garbage collector) in Java.
Testing Security Properties in Java
In this paper we describe our initial experience of using mutation testing of Java programs to evaluate the quality of test suites from a security viewpoint. Our focus is on measuring the quality of the test suite associated with the Java Development Kit (JDK) because it provides the core security properties for all applications. We define security-specific mutation operators and determine their usefulness by executing some of the test suites that are publicly available. We summarise our findings and also outline some of the key challenges that remain before mutation testing can be used in practice.
Are We Ready for Secure Languages? (CurryOn presentation)
Language designers and developers want better ways to write good code — languages designed with simpler, more powerful abstractions accessible to a larger community of developers. However, language design does not seem to take into account security, leaving developers with the onerous task of writing attack-proof code. In 20 years, we have gone from 25 reported vulnerabilities to 6,883 vulnerabilities. We see some of the most common vulnerabilities happening in commonly used software — cross-site scripting, SQL injections, and buffer overflows. Attacks are becoming sophisticated, often exploiting three or four weaknesses, making it harder for developers to reason about the source of the problem. I’ll overview some recent attacks and argue that our languages must take security seriously. Languages need security-oriented constructs, and compilers must let developers know when there is a problem with their code. We need to empower developers with the concept of “security for the masses” by making available languages that do not necessarily require an expert in order to determine whether the code being written is vulnerable to attack or not.
Efficient and Thread-Safe Objects for Dynamically-Typed Languages
We are in the multi-core era. Dynamically-typed languages are in widespread use, but their support for multithreading still lags behind. One of the reasons is that the sophisticated techniques they use to efficiently represent their dynamic object models are often unsafe in multithreaded environments. This paper defines safety requirements for dynamic object models in multithreaded environments. Based on these requirements, a language-agnostic and thread-safe object model is designed that maintains the efficiency of sequential approaches. This is achieved by ensuring that field reads do not require synchronization and field updates only need to synchronize on objects shared between threads. Basing our work on JRuby+Truffle, we show that our safe object model has zero overhead on peak performance for thread-local objects and only 3% average overhead on parallel benchmarks where field updates require synchronization. Thus, it can be a foundation for safe and efficient multithreaded VMs for a wide range of dynamic languages.
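The core safety rule can be condensed into a few lines of Java: reads of an object's storage never synchronize, while writes take the object's monitor only once the object has been marked as shared. The real object model's storage layout, shape transitions, and sharing analysis are omitted from this sketch, and the names are illustrative.

```java
// Sketch of the "synchronize only shared objects, only on writes" rule.
final class DynObject {
    private volatile boolean shared;                 // set when the object escapes to another thread
    private final Object[] storage = new Object[8];  // fixed-size storage for simplicity

    Object read(int slot) {
        return storage[slot];                        // reads never synchronise
    }

    void write(int slot, Object value) {
        if (!shared) {                               // thread-local object: plain write
            storage[slot] = value;
        } else {
            synchronized (this) {                    // shared object: updates synchronise
                storage[slot] = value;
            }
        }
    }

    void markShared() { shared = true; }             // called when the object becomes globally reachable
}
```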
Gems: shared-memory parallel programming for Node.JS
JavaScript is the most popular programming language for client-side Web applications, and Node.js has popularized the language for server-side computing, too. In this domain, however, the minimal support for parallel programming remains a major limitation. In this paper we introduce a novel parallel programming abstraction called Generic Messages (GEMS). GEMS allow one to combine message passing and shared-memory parallelism, extending the classes of parallel applications that can be built with Node.js. GEMS have customizable semantics and enable several forms of thread safety, isolation, and concurrency control. GEMS are designed as convenient JavaScript abstractions that expose high-level and safe parallelism models to the developer. Experiments show that GEMS outperform equivalent Node.js applications thanks to their usage of shared memory.
Investigating the Performance of Hardware Transactions on a Multi-Socket Machine
The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.
Are We Ready For Secure Languages? (CurryOn slides)
Language designers and developers want better ways to write good code — languages designed with simpler, more powerful abstractions accessible to a larger community of developers. However, language design does not seem to take into account security, leaving developers with the onerous task of writing attack-proof code. In 20 years, we have gone from 25 reported vulnerabilities to 6,883 vulnerabilities. We see some of the most common vulnerabilities happening in commonly used software — cross-site scripting, SQL injections, and buffer overflows. Attacks are becoming sophisticated, often exploiting three or four weaknesses, making it harder for developers to reason about the source of the problem. I’ll overview some recent attacks and argue that our languages must take security seriously. Languages need security-oriented constructs, and compilers must let developers know when there is a problem with their code. We need to empower developers with the concept of “security for the masses” by making available languages that do not necessarily require an expert in order to determine whether the code being written is vulnerable to attack or not.
Toward a More Carefully Specified Metanotation
POPL is known for, among other things, papers that present formal descriptions and rigorous analyses of programming languages. But an important language has been neglected: the metanotation of inference rules and BNF that has been used in over 40% of all POPL papers to describe all the other programming languages. This metanotation is not completely described in any one place; rather, it is a folk language that has grown over the years, as paper after paper tries out variations and extensions. We believe that it is high time that the tools of the POPL trade be applied to the tools themselves. Examination of many POPL papers suggests that as the metanotation has grown, it has diversified to the point that problems are surfacing: different notations are in use for the same operation (substitution); the same notation is in use for different operations; and in some cases, notations for repetition are ambiguous, or require the reader to apply knowledge of semantics to interpret the syntax. All three problems present substantial potential for confusion. No individual paper is at fault; rather, this is the natural result of language growth in a community, producing incompatible dialects. We back these claims by presenting statistics from a survey of all past POPL papers, 1973–2016, and examples drawn from those papers. We propose a set of design principles for metanotation, and then propose a specific version of the metanotation that can be always interpreted in a purely formal, syntactic manner and yet is reasonably compatible with past use. Our goal is to lay a foundation for complete formalization and mechanization of the metanotation.
Modeling and Design of System-in-Package Integrated Voltage Regulator with Thermal Effects
This paper demonstrates a new approach to model the impact of thermal effects on the efficiency of integrated voltage regulators (IVRs) by combining analytical efficiency evaluations with coupled electrical and thermal simulations. An application of the approach shows that a system-in-package solution avoids thermal problems typically observed in other IVR designs. While the evaluation in this paper focuses on the thermal impact on loss in the inductor wiring and the PDN, the developed approach is general enough to also model thermal impacts on the power dissipation in the inductor cores and the buck converter chip.
Parfait Lessons Learnt
Slides for presentation at DECAF'16.
FastR presentation at RIOT 2016 workshop
This is the presentation about FastR at the RIOT 2016 workshop (which is organized by us). The audience consists of members of the core R group, developers of other implementations of the R language, and people developing tooling for R. The main focus of our presence in this workshop is to build credibility and show that we know what we're doing. The contents of this presentation are a combination of the usual Truffle interoperability and Graal introduction, a bit of compiler 101 (the audience does not have a CC background), some bits from our recent paper (2016-0523) and the presentation I gave at useR! (2016-0540).
High-performance R with FastR
R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. While these are straightforward to implement in an interpreter, it is hard to compile R functions to efficient bytecode or machine code. Consequently, applications that spend a lot of time in R code often have performance problems. Common solutions are to try to apply primitives to large amounts of data at once and to convert R code to a native language like C. FastR is a novel approach to solving R’s performance problem. It makes extensive use of the dynamic optimization features provided by the Truffle framework to remove the abstractions that the R language introduces, and can use the Graal compiler to create optimized machine code on the fly. This talk introduces FastR and the basic concepts behind Truffle’s optimization features. It provides examples of the language constructs that are particularly hard to implement using traditional compiler techniques, and shows how to use FastR to improve performance without compromising on language features.
Audio/Video recording of "Zero-Overhead Integration of R, JS, Ruby and C/C++"
Presentation about FastR and language interoperability at the useR! 2016 conference. Stanford is asking for permission to record presentations and publish those recordings.
Zero-Overhead Integration of R, JS, Ruby and C/C++
Presentation about FastR and language interoperability at the useR! 2016 conference.
EPA: A Precise and Scalable Object-Sensitive Points-to Analysis for Large Programs
Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. There are several kinds of points-to analyses that trade off precision with runtime. For object-oriented languages including Java, "context-sensitivity" is key to obtaining sufficient precision. A context may be parameterizable, and may consider calls, objects, and types for its construction. Although points-to analysis research has received a lot of attention in the past, scaling object-sensitive points-to analysis for large Java code bases still remains an open research challenge. In this paper, we develop an Eclectic Points-To Analysis (EPA) framework that computes an efficient, selective, object-sensitive points-to analysis that is client independent. This framework parameterizes context sensitivities for different allocation sites in the program. The level of required sensitivity is determined by a pre-analysis. We have implemented our approach using Souffle (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs including OpenJDK and Jython show that our technique is efficient and highly precise. For the OpenJDK, an instance of the EPA-based analysis reduces runtime by 27% for a slight loss of precision, while for Jython, the same analysis reduces runtime by 82% for almost no loss of precision.
Model Checking Cache Coherence in System-Level Code
Cache coherence is a key consistency requirement between the shared main memory and individual caches for a multiprocessor framework. Several months ago, we started a project to verify the cache coherence of a system-level C codebase (50,000+ lines), which runs in an environment that does not provide hardware-level guarantees, requiring programmers to ensure correct cache coherence manually through explicit FLUSH and INVALIDATE operations. After initial evaluation and comparison of many model checking tools, we believe that SPIN is the most suitable one. However, pure model checking is not sufficiently scalable to verify such a large codebase. Therefore, we are currently investigating a hybrid model checking solution with some static analysis techniques to reduce the model size via abstraction and program slicing, and restrict the interleavings explored. In this talk, we will share our model checking experiences. In particular, we will discuss (1) our evaluation of different model checking tools, (2) the Promela model we use to verify the cache coherence, (3) initial model checking experience for verifying the coherence in concurrent quicksort algorithm, and (4) the automatic model extraction from large codebase in C.
Fortress Features and Lessons Learned
Slides for an invited keynote talk on June 22, 2016, at the 2016 JuliaCon conference to be held at MIT at the Stata Center. This is an overview of the Fortress programming language, with some comparison to Scala. Many of the slides are taken from two previously approved slide sets (Archivist 2012-0104 and 2012-0284), but some have been updated, and some new slides have been created.
Theorem Proving with ACL2 for Industry Artifacts
This is a set of slides to be presented at PSSV 2016, a workshop in St Petersburg, Russia. The slides present a small part of our formal verification work on SPARC processors and Java programs.
Fast non-intrusive memory reclamation for highly-concurrent data structures
Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. These proposals include (1) exploiting the operating system's memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard lookaside buffer (HLB) that allows a reclaiming thread to query whether there are hazardous pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.
Truffle Tutorial: One VM to Rule Them All
Forget “this language is fast”, “this language has the libraries I need”, and “this language has the tool support I need”. The Truffle framework for implementing managed languages in Java gives you native performance, multi-language integration with all other Truffle languages, and tool support - all of that by just implementing an abstract syntax tree (AST) interpreter in Java. Truffle applies AST specialization during interpretation, which enables partial evaluation to create highly optimized native code without the need to write a compiler specifically for a language. The Java VM contributes high-performance garbage collection, threads, and parallelism support. This tutorial is both for newcomers who want to learn the basic principles of Truffle, and for people with Truffle experience who want to learn about recently added features. It presents the basic principles of the partial evaluation used by Truffle and the Truffle DSL used for type specializations, as well as features that were added recently such as the language-agnostic object model, language integration, and debugging support. Oracle Labs and external research groups have implemented a variety of programming languages on top of Truffle, including JavaScript, Ruby, R, Python, and Smalltalk. Several of them already exceed the performance of the best previously existing implementations of those languages.
Using an Accurate Multi-Mode Chip Power Model to Analyze Power Integrity Differences between On-Board Voltage Regulator Modules (VRMs) and In-Package Integrated Voltage Converts (IVRs)
This is a poster for Design Automation Conference (DAC) 2016. No Oracle proprietary information is shared. This poster is based on the slides in OL# 2016-0016 that have already been approved for external publishing.
Unifying Access Control & Information Flow: A Security Model for Programs Consisting of Trusted and Untrusted Code
We introduce a security model based on dual access control labels (called DAC) that enables both confidentiality and integrity to be expressed in the same program. It is developed in the context of object-oriented languages and considers implicit flows arising from both branching and dynamic dispatch. Our DAC model overcomes the limitations of classical access control models such as those based on stack inspection. Our security model is, in general, neither transitive nor reflexive, and it considers both confidentiality and integrity. Traditional lattice-based security models are a special case of our security model. We show that our model satisfies a non-interference theorem. The theorem simultaneously guarantees that a) from a confidentiality perspective, an attacker cannot distinguish the low-level values associated with two computations that have different high-level inputs, and b) from an integrity perspective, an attacker cannot distinguish the high-level values associated with two computations that have different low-level inputs. We also show that one can give the necessary security guarantees via a static program analysis.
Sulong: Memory Safe and Efficient Execution of LLVM-Based Languages
Memory errors in C/C++ can allow an attacker to read sensitive data, corrupt the memory, or crash the executing process. The well-known list of the top 25 most dangerous software errors published by the SANS Institute, as well as recent security disasters such as Heartbleed, show how important it is to tackle memory safety for C/C++. We present Sulong, an efficient interpreter for LLVM-based languages that runs on the JVM. Sulong guarantees memory safety for C/C++ and other LLVM-based languages by using managed allocations and automatic memory management. Through dynamic compilation, Sulong aims to achieve peak performance close to that of state-of-the-art compilers such as GCC or Clang, which do not produce memory-safe code. By efficiently implementing memory safety, Sulong strives to be a real-world solution for mitigating software security problems.
Sulong - Execution of LLVM-Based Languages on the JVM
For the last decade, the Java Virtual Machine (JVM) has been a popular platform to host languages other than Java. Language implementation frameworks like Truffle allow the implementation of dynamic languages such as JavaScript or Ruby with competitive performance and completeness. However, statically typed languages are still rare under Truffle. We present Sulong, an LLVM IR interpreter that brings all LLVM-based languages, including C, C++, and Fortran, to the JVM in one stroke. Executing these languages on the JVM opens up a wide area of future research, including high-performance interoperability between high-level and low-level languages, combination of static and dynamic optimizations, and memory-safe execution of otherwise unsafe and unmanaged languages.
An Experience Report: Efficient Analysis using Souffle
This abstract summarizes the key aspects of Souffle, which is an open-source Datalog engine used for static program analysis. It describes the overall approach of translating Datalog to C++ using an abstract machine and staged compilation. The novel aspects in Souffle include auto-index generation, representation of large relations, and techniques to exploit caches and parallel cores. It also identifies the issues of query planning and improved parallelism that need further exploration. The presentation will also include our experience in using Souffle in the context of vulnerability detection using points-to and other data flow based analyses.
The Parfait Static Code Analysis Framework -- Lessons Learnt
The Parfait static code analyser was conceived at Sun Labs, now Oracle Labs, in 2007. At the time, the project focused on the detection of defects in C/C++ code. Over the next five years, Parfait matured to include detection of vulnerabilities (not just defects) in C/C++ and Java while meeting the performance and precision standards expected of a commercial tool: Parfait can analyse an operating system codebase of 11 million lines of C code for 39 of the most common defect types in 1.5 hours with a false positive rate of 10%. Today, Parfait is maintained by Oracle as an internal product and is used by thousands of developers at Oracle worldwide.
Finding Glitches Using Formal Methods
Slides for presentation in Industrial Track at Async 2016 Conference (IEEE International Symposium on Asynchronous Circuits and Systems, May 2016). Authors: Yan Peng, Ian W. Jones, and Mark Greenstreet. The increasing scale and complexity of integrated circuits leads to many departures from a pure, synchronous design methodology. Clock-domain crossings, multi-cycle paths, and circuits for test with long combinational logic delays introduce vulnerabilities for glitch-related failures. Conventional simulation techniques can miss glitches because of the large number of value and timing scenarios. We have tried several commercially available tools but have not found a comprehensive solution. This paper presents a concise statement of what it means for a logic circuit to be “glitch free”. This property can be verified using satisfiability solvers. We present our implementation using the ACL2 theorem proving system and some experimental results. This is the final version of the slides presented at Async 2016. They are a slightly trimmed-down version of the slides that have previously been cleared, OL# 2016-0105.
ICT 3612/7204 Database Systems - Graph Databases
Slides for a lecture on Graph Databases at Griffith University (Brisbane, Australia) for third year undergraduate course on Database Management (ICT 3612/7204).
Specializing Ropes for Ruby
Ropes are a data structure for representing character strings via a binary tree of operation-labeled nodes. Both the nodes and the trees constructed from them are immutable, making ropes a persistent data structure. Ropes were designed to perform well with large strings, and in particular, concatenation of large strings. We present our findings in using ropes to implement mutable strings in JRuby+Truffle, an implementation of the Ruby programming language using a self-specializing abstract syntax tree interpreter and dynamic compilation. We extend ropes to support Ruby language features such as encodings and refine operations to better support typical Ruby programs. We also use ropes to work around underlying limitations of the JVM platform in representing strings. Finally, we evaluate the performance of our implementation of ropes and demonstrate that they perform 0.9x – 9.4x as fast as byte array-based string representations in representative benchmarks.
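For readers unfamiliar with ropes, the following self-contained Java sketch shows the basic structure the abstract refers to: immutable leaf and concatenation nodes, so concatenation is O(1) and character access walks the tree. It is illustrative only and omits the Ruby-specific extensions (encodings, refined operations) discussed in the paper.

```java
// A minimal rope sketch (not the JRuby+Truffle implementation): immutable
// nodes, O(1) concatenation, and charAt by walking the tree.
abstract class Rope {
    abstract int length();
    abstract char charAt(int i);

    static Rope of(String s) { return new Leaf(s); }
    Rope concat(Rope other)  { return new Concat(this, other); }

    static final class Leaf extends Rope {
        final String s;
        Leaf(String s) { this.s = s; }
        int length() { return s.length(); }
        char charAt(int i) { return s.charAt(i); }
    }

    static final class Concat extends Rope {
        final Rope left, right; final int len;
        Concat(Rope l, Rope r) { left = l; right = r; len = l.length() + r.length(); }
        int length() { return len; }
        char charAt(int i) { return i < left.length() ? left.charAt(i) : right.charAt(i - left.length()); }
    }

    public static void main(String[] args) {
        Rope r = Rope.of("Hello, ").concat(Rope.of("world"));  // no character data is copied
        System.out.println(r.length() + " " + r.charAt(7));    // prints: 12 w
    }
}
```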
Combining speculative optimizations with flexible scheduling of side-effects
Speculative optimizations allow compilers to optimize code based on assumptions that cannot be verified at compile time. Taking advantage of the specific run-time situation opens up more optimization possibilities. Speculative optimizations are key to the implementation of high-performance language runtimes. Using them requires cooperation between the just-in-time compiler and the runtime system and influences the design and the implementation of both. New speculative optimizations, as well as their application to more dynamic languages, stress these systems far more than current implementations were designed for. We first quantify the run-time and memory-footprint costs caused by their use. We then propose a compiler structure that separates the compilation process into two stages, which helps to deal with these issues without giving up on other traditional optimizations. In the first stage, floating guards can be inserted for speculative optimizations; the guards are then fixed in the control flow at appropriate positions. In the second stage, side-effecting instructions can be moved or reordered. Using this framework we present two optimizations that help reduce the run-time costs and the memory footprint. We study the effects of both stages as well as the effects of these two optimizations in the Graal compiler. We evaluate this on classical benchmarks targeting the JVM: SPECjvm2008, DaCapo, and Scala-DaCapo. We also evaluate JavaScript benchmarks running on the Truffle platform, which uses the Graal compiler. We find that combining both stages can bring up to 84% improvement in performance (9% on average), and our optimization of memory footprint can reduce memory usage by 27% to 92% (45% on average).
Taurus: A Holistic Language Runtime System for Coordinating Distributed Managed-Language Applications
Many distributed workloads in today’s data centers are written in managed languages such as Java or Ruby. Examples include big data frameworks such as Hadoop, data stores such as Cassandra or applications such as the SOLR search engine. These workloads typically run across many independent language runtime systems on different nodes. This setup represents a source of inefficiency, as these language runtime systems are unaware of each other. For example, they may perform Garbage Collection at times that are locally reasonable but not in a distributed setting. We address these problems by introducing the concept of a Holistic Runtime System that makes runtime-level decisions for the entire distributed application rather than locally. We then present Taurus, a Holistic Runtime System prototype. Taurus is a JVM drop-in replacement, requires almost no configuration and can run unmodified off-the-shelf Java applications. Taurus enforces user-defined coordination policies and provides a DSL for writing these policies. By applying Taurus to Garbage Collection, we demonstrate the potential of such a system and use it to explore coordination strategies for the runtime systems of real-world distributed applications, to improve application performance and address tail-latencies in latency-sensitive workloads.
Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages in which the documents in the corpus are expressed, and does not rely on parallel corpora to constrain the spaces. Instead we utilize a small set of human-provided word translations, which are often freely and readily available. We can encode such word translations as hard constraints in the model’s objective functions; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code switching so that words across multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, the common structure allows an NLP model trained in a source language to generalize to multiple target languages (achieving up to 80% of the accuracy of models trained with target language data).
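As a rough illustration of what artificial code switching does to the training text (this is not the paper's pipeline; the dictionary, probability, and sentence below are made up), consider:

```java
import java.util.*;

// Illustrative sketch of artificial code switching: with some probability,
// replace a word by a dictionary translation, so words from both languages
// end up sharing contexts in the embedding training corpus.
public class CodeSwitchDemo {
    public static void main(String[] args) {
        Map<String, String> enToEs = Map.of("house", "casa", "dog", "perro", "big", "grande");
        String sentence = "the big dog sleeps in the house";
        Random rnd = new Random();
        double p = 0.5;                      // switching probability (a free parameter)

        StringBuilder out = new StringBuilder();
        for (String w : sentence.split(" ")) {
            String translation = enToEs.get(w);
            out.append(translation != null && rnd.nextDouble() < p ? translation : w).append(' ');
        }
        System.out.println(out.toString().trim());   // e.g. "the grande dog sleeps in the casa"
    }
}
```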
Nesoi: compile time checking of transactional coverage in parallel programs.
In this paper we describe our implementation of Nesoi, a tool for statically checking the transactional requirements of a program. Nesoi categorizes the fields of each instance of an object in the program and reports missing and unrequired transactions at compile time. As transactional requirements are detected at the level of object fields in independent object instances, the fields that need to be considered for possible collisions in a transaction can be cleanly identified, reducing the possibility of false collisions. Running against a set of benchmarks, these fields account for just 2.5% of reads and 17-31% of writes within a transaction. Nesoi is constructed as a plugin for the Scala compiler and is integrated with the dataflow libraries used in the Teraflux project to provide support both for conventional programming modes and for the dataflow + transactions model of the Teraflux project.
Attribute Extraction from Noisy Text Using Character-based Sequence Tagging Models
Attribute extraction is the problem of extracting structured key-value pairs from unstructured data. Similar entity recognition problems are usually solved as a sequence labeling task in which the elements of the sequence are word tokens. While word tokens are suitable for newswire, for many types of data—from social media text to product descriptions—word tokens are problematic because simple regular-expression-based word tokenizers cannot accurately tokenize text that is inconsistently spaced. Instead, we propose a character-based sequence tagging approach that jointly tokenizes and tags tokens. We find that the character-based approach is surprisingly accurate both at tokenizing words and at inferring labels. We also propose an end-to-end system that uses pairwise entity linking models for normalizing the extracted values.
Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. Our method is agnostic about the languages in which the documents in the corpus are expressed, and does not rely on parallel corpora to constrain the spaces. Instead we utilize a small set of human-provided word translations to artificially induce code switching, thus allowing words in multiple languages to appear in contexts together and share distributional information. We evaluate the embeddings on a new multilingual word analogy dataset. We also find that our embeddings allow an NLP model trained in one language to generalize to another, achieving up to 80% of the accuracy of an in-language model.
CodeSurveyor: Mapping Large-Scale Software to Aid in Code Comprehension
Large codebases — in the order of millions of lines of code (MLOC) — are incredibly complex. Whether fixing a fault, or implementing a new feature, changes to such systems often have unanticipated effects, as it is impossible for a developer to maintain a complete understanding of the code in their head. This paper presents CodeSurveyor, a spatial visualization technique that aims to support code comprehension in large codebases by allowing developers to view large-scale software at all levels of abstraction. It uses a cartographic metaphor to produce an interactive map of a codebase where users can zoom from a view of a system’s high-level architectural components, represented as continents, down to the individual source files and the entities they define, shown as countries and states, respectively. The layout of the produced code map incorporates system dependency data and sizes regions according to a user-configurable metric (line count by default), to create distinctive shapes and positions that serve as strong visual landmarks and keep users oriented. We detail the CodeSurveyor algorithm, show it generates code maps of the Linux kernel (1.4 MLOC) in 1.5 minutes, and evaluate the intuitiveness of the metaphor to software developers and its utility in navigation tasks. Results show the effectiveness of the approach with developers of varying experience levels.
An Efficient and Generic Event-based Profiler Framework for Dynamic Languages
Profilers help programmers analyze their programs and identify performance bottlenecks. We implement a profiler framework that helps compare and analyze programs that implement the same algorithms in different languages. Profiler implementers typically replicate common functionality in each language's profiler; we focus on building a generic profiler framework for dynamic languages to minimize this recurring implementation effort. We implement our profiler in a framework that optimizes abstract syntax tree (AST) interpreters using a just-in-time (JIT) compiler. We evaluate it on ZipPy and JRuby+Truffle, the Python and Ruby implementations in this framework, respectively. We show that our profiler runs faster than the existing profilers in these languages and requires modest implementation effort. Our profiler serves three purposes: 1) it helps users find the bottlenecks in their programs, 2) it helps language implementers improve the performance of their language implementations, and 3) it helps compare and evaluate different languages on cross-language benchmarks.
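A stripped-down version of the event-based idea, where every profiled construct funnels through a generic counting wrapper, might look like the plain-Java sketch below. It is illustrative only; the actual framework instruments Truffle AST nodes and relies on the JIT compiler to remove the overhead.

```java
// A minimal sketch of an event-based profiler (illustrative only): each node
// execution is funneled through a generic "profiler node" that counts
// invocations and accumulated time for its wrapped node.
public class ProfilerDemo {
    interface Node { int execute(int x); }

    static final class ProfiledNode implements Node {
        final String name; final Node child;
        long invocations, nanos;
        ProfiledNode(String name, Node child) { this.name = name; this.child = child; }
        public int execute(int x) {
            long t0 = System.nanoTime();
            int r = child.execute(x);
            nanos += System.nanoTime() - t0;
            invocations++;
            return r;
        }
    }

    public static void main(String[] args) {
        ProfiledNode square = new ProfiledNode("square", v -> v * v);
        int sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += square.execute(i & 15);
        System.out.println(sum + " | " + square.name + ": " + square.invocations
                + " calls, " + square.nanos / 1_000_000 + " ms");
    }
}
```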
Join Size Estimation Subject to Filter Conditions
In this paper, we present a new algorithm for estimating the size of equality join of multiple database tables. The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can then be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions, possibly specified over multiple columns (attributes) of each table. This algorithm makes a single pass over the data and is thus suitable for streaming scenarios. We compare this algorithm analytically to two other previously known sampling approaches (independent Bernoulli Sampling and End-Biased Sampling) and to a novel sketch-based approach. We also compare these four algorithms experimentally and show that results fully correspond to our analytical predictions based on derived expressions for the estimator variances, with Correlated Sampling giving the best estimates in a large range of situations.
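A minimal sketch of the correlated-sampling idea as we read it (illustrative, without filter conditions or the variance analysis): both tables sample a tuple when a shared hash of its join key falls below p, so tuples that would join are sampled together, and the sampled join size scaled by 1/p estimates the true join size.

```java
import java.util.*;

// Illustrative correlated-sampling sketch: a shared hash on the join key
// decides sampling in both tables, so matching tuples are kept together.
public class CorrelatedSamplingDemo {
    static double hash01(int key) {                       // shared hash mapped into [0,1)
        long h = key * 0x9E3779B97F4A7C15L;
        return (h >>> 11) / (double) (1L << 53);
    }

    static long estimateJoinSize(int[] r, int[] s, double p) {
        Map<Integer, Long> rSampleCounts = new HashMap<>();
        for (int k : r) if (hash01(k) < p) rSampleCounts.merge(k, 1L, Long::sum);
        long sampledJoin = 0;
        for (int k : s) if (hash01(k) < p) sampledJoin += rSampleCounts.getOrDefault(k, 0L);
        return Math.round(sampledJoin / p);               // scale up by the sampling rate
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] r = rnd.ints(100_000, 0, 1000).toArray();
        int[] s = rnd.ints(100_000, 0, 1000).toArray();
        // With uniform keys the true join size is about 10,000,000.
        System.out.println("estimated join size: " + estimateJoinSize(r, s, 0.1));
    }
}
```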
Breaking Payloads with Runtime Code Stripping and Image Freezing
Fighting off attacks based on memory corruption vulnerabilities is hard, and a great deal of research has been and is being conducted in this area. In our recent work we take a different approach and look into breaking the payload of an attack. Current attacks assume that they have access to every piece of code and the entire platform API. In this talk, we present a novel defensive strategy that targets this assumption. We built a system that removes unused code from an application process to prevent attacks from using code and APIs that would otherwise be present in the process memory but normally are not used by the actual application. Our system is only active during process creation time and, therefore, incurs no runtime overhead and thus no performance degradation. Our system does not modify any executable files or shared libraries, as all actions are executed in memory only. We implemented our system for Windows 8.1 and tested it on real-world applications. Besides presenting our system, we also show the results of our investigation into code overhead present in current applications.
Callisto-RTS: Fine-Grain Parallel Loops
We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops—down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need for tuning workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts), and on an 8-socket SPARC machine (1024 h/w contexts). In addition to reducing the need for tuning, on the SPARC machine we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
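The per-core distribution idea can be pictured with the following toy Java sketch. It is illustrative only; Callisto-RTS schedules batches of roughly 1K cycles and uses asynchronous request combining rather than this naive stealing loop, and all names and numbers here are invented.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy fine-grained loop scheduler: each worker claims small batches from its
// own per-worker iteration counter first, then steals leftover batches.
public class FineGrainLoopDemo {
    static void parallelFor(int totalIters, int batch, int workers, java.util.function.IntConsumer body)
            throws InterruptedException {
        int per = totalIters / workers;
        AtomicLong[] next = new AtomicLong[workers];
        long[] end = new long[workers];
        for (int w = 0; w < workers; w++) {
            next[w] = new AtomicLong((long) w * per);
            end[w] = (w == workers - 1) ? totalIters : (long) (w + 1) * per;
        }
        Thread[] ts = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int self = w;
            ts[w] = new Thread(() -> {
                for (int v = 0; v < workers; v++) {              // own range first, then steal
                    int victim = (self + v) % workers;
                    long start;
                    while ((start = next[victim].getAndAdd(batch)) < end[victim]) {
                        long stop = Math.min(start + batch, end[victim]);
                        for (long i = start; i < stop; i++) body.accept((int) i);
                    }
                }
            });
            ts[w].start();
        }
        for (Thread t : ts) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong sum = new AtomicLong();
        parallelFor(1_000_000, 1024, 4, i -> sum.addAndGet(i));
        System.out.println(sum.get()); // 499999500000
    }
}
```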
Shoal: smart allocation and replication of memory for parallel programs
Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program’s access patterns. However, sub-optimal allocation can significantly impact the performance of parallel programs. We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program and (ii) a manually-annotated version of the PARSEC Streamcluster benchmark.
Building Debuggers and Other Tools: We Can “Have it All” (Position Paper)
Software development tools that “instrument” running programs, notably debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in our thinking about such tools makes that presumption obsolete. By building instrumentation directly into the core of a high-performance language implementation framework, tool support can be always on, with confidence that optimization will apply uniformly to instrumentation and result in near zero overhead. Tools can be always available (and fast), not only for end user programmers, but also for language implementors throughout development.
Snippets: Taking the High Road to a Low Level
When building a compiler for a high-level language, certain intrinsic features of the language must be expressed in terms of the resulting low-level operations. Complex features are often expressed by explicitly weaving together bits of low-level IR, a process that is tedious, error prone, difficult to read, difficult to reason about, and machine dependent. In the Graal compiler for Java, we take a different approach: we use snippets of Java code to express semantics in a high-level, architecture-independent way. Two important restrictions make snippets feasible in practice: they are compiler specific, and they are explicitly prepared and specialized. Snippets make Graal simpler and more portable while still capable of generating machine code that can compete with other compilers of the Java HotSpot VM.
The Judgment of Forseti: Economic Utility for Dynamic Heap Sizing of Multiple Runtimes
We introduce the FORSETI system, which is a principled approach for holistic memory management. It permits a sysadmin to specify the total physical memory resource that may be shared between all concurrent virtual machines on a physical node. FORSETI models the heap size versus application throughput for each virtual machine, and seeks to maximize the combined throughput of the set of VMs based on concepts from economic utility theory. We evaluate the FORSETI system using a standard Java managed runtime, i.e. OpenJDK. Our results demonstrate that FORSETI enables dramatic reductions (up to 5x) in heap footprint without compromising application execution times.
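The economic-utility framing can be illustrated with a toy greedy allocator that hands each increment of memory to the VM whose throughput model gains the most from it. The curves, numbers, and policy below are invented for illustration; FORSETI's models and optimization are more sophisticated.

```java
import java.util.function.DoubleUnaryOperator;

// Toy utility-driven heap sizing: repeatedly give the next chunk of physical
// memory to whichever VM gains the most throughput from it (marginal utility).
public class HeapSizingDemo {
    public static void main(String[] args) {
        // Assumed throughput-vs-heap-size curves (ops/s as a function of MB).
        DoubleUnaryOperator[] throughput = {
            h -> 1000 * (1 - Math.exp(-h / 512.0)),   // VM 0: benefits from a large heap
            h -> 400  * (1 - Math.exp(-h / 128.0)),   // VM 1: saturates early
        };
        double[] heap = {128, 128};                    // minimum heaps (MB)
        double budget = 2048, step = 64;               // total memory, allocation grain

        for (double used = heap[0] + heap[1]; used + step <= budget; used += step) {
            int best = 0; double bestGain = -1;
            for (int i = 0; i < heap.length; i++) {
                double gain = throughput[i].applyAsDouble(heap[i] + step)
                            - throughput[i].applyAsDouble(heap[i]);
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            heap[best] += step;                        // greedy marginal-utility choice
        }
        System.out.printf("VM0 heap=%.0fMB, VM1 heap=%.0fMB%n", heap[0], heap[1]);
    }
}
```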
Formal Model Checking: From Oblivion to a Pillar of Success
This session will describe how formal methods went from being used opportunistically to occupying a central place in the verification methodology of the RAPID SoC, helping the Oracle team achieve its verification goals of finishing on schedule and achieving first-pass silicon success.
The Influence of Malloc Placement on TSX Hardware Transactional Memory.
The hardware transactional memory (HTM) implementation in Intel's i7-4770 "Haswell" processor tracks the transactional read-set in the L1 (level-1), L2 (level-2) and L3 (level-3) caches and the write-set in the L1 cache. Displacement or eviction of read-set entries from the cache hierarchy or write-set entries from the L1 results in an abort. We show that the placement policies of dynamic storage allocators -- such as those found in common "malloc" implementations -- can influence the conflict miss rate in the L1. Conflict misses -- sometimes called mapping misses -- arise because of less than ideal associativity and represent an imbalanced distribution of active memory blocks over the set of available L1 indices. Under transactional execution, conflict misses may manifest as aborts, representing wasted or futile effort instead of the simple stall that would occur in normal execution mode. Furthermore, when HTM is used for transactional lock elision (TLE), persistent aborts arising from conflict misses can force the offending thread through the so-called "slow path". The slow path is undesirable as the thread must acquire the lock and run the critical section in normal execution mode, precluding the concurrent execution of threads in the "fast path" that monitor that same lock and run their critical sections in transactional mode. For a given lock, multiple threads can concurrently use the transactional fast path, but at most one thread can use the non-transactional slow path at any given time. Threads in the slow path preclude safe concurrent fast path execution. Aborts arising from placement policies and L1 index imbalance can thus result in loss of concurrency and reduced aggregate throughput.
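The index-imbalance effect follows from the cache geometry. Assuming a typical L1 data cache with 64-byte lines and 64 sets (illustrative numbers, not a claim about any particular part), the set index is just a bit-field of the address, so allocations placed at 4 KiB strides all compete for the same set and its few ways:

```java
// Small sketch of why allocator placement matters: with 64-byte lines and 64
// sets, the L1 set index is (addr >> 6) & 63, so 4 KiB-strided blocks collide.
public class CacheIndexDemo {
    static int l1Set(long addr) { return (int) ((addr >>> 6) & 63); }

    public static void main(String[] args) {
        long base = 0x7f00_0000_0000L;
        for (int i = 0; i < 4; i++) {
            long a = base + i * 4096L;     // 4 KiB-strided "allocations"
            System.out.printf("block %d at 0x%x -> L1 set %d%n", i, a, l1Set(a));
        }
    }
}
```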
Trash Day: Coordinating Garbage Collection in Distributed Systems
Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications. In this paper, we show that distributed applications suffer from each node’s language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes. We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.
Making meaningful decisions about time, workload and pedagogy in the digital age: the Course Resource Appraisal Model
This article reports on a design-based research project to create a modelling tool to analyse the costs and learning benefits involved in different modes of study. The Course Resource Appraisal Model (CRAM) provides accurate cost-benefit information so that institutions are able to make more meaningful decisions about which kind of courses—online, blended or traditional face-to-face—make sense for them to provide. The tool calculates the difference between expenses and income over three iterations of the course and presents a pedagogical analysis of the learning experience provided. The article draws on a CRAM analysis of the costs and learning benefits of a massive open online course to show how the tool can illuminate the pedagogical and financial viability of a course of this kind.
Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems.
To harness the compute resources of many-core systems with tens to hundreds of cores, applications have to expose parallelism to the hardware. Researchers are aggressively looking for program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel 'tasks' and allow an underlying system layer to schedule these tasks to different threads. Software-only schedulers can implement various scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, with large-scale multi-core systems, software schedulers suffer significant overheads as they synchronize and communicate task information over deep cache hierarchies. To reduce these overheads, hardware-only schedulers like Carbon have been proposed to enable task queuing and scheduling to be done in hardware. This paper presents a hardware scheduling approach where the structure provided to programs by task-based programming models can be incorporated into the scheduler, making it aware of a task's data requirements. This prior knowledge of a task's data requirements allows for better task placement by the scheduler, which results in a reduction in overall cache misses and memory traffic, improving the program's performance and power utilization. Simulations of this technique for a range of synthetic benchmarks and components of real applications have shown a reduction in the number of cache misses by up to 72% and 95% for the L1 and L2 caches, respectively, and up to a 30% improvement in overall execution time against FIFO scheduling. This results not only in faster execution and up to 50% less data transfer, reducing the load on the interconnect, but also in lower power consumption.
Shelf space product placement optimizer
A system for optimizing shelf space placement for a product receives decision variables and constraints, and executes a Randomized Search (“RS”) using the decision variables and constraints until an RS solution is below a pre-determined improvement threshold. The system then solves a Mixed-Integer Linear Program (“MILP”) problem using the decision variables and constraints, and using the RS solution as a starting point, to generate a MILP solution. The system repeats the RS executing and MILP solving as long as the MILP solution is not within a predetermined accuracy or does not exceed a predetermined time duration. The system then, based on the final MILP solution, outputs a shelf position and a number of facings for the product.
Java-to-JavaScript translation via structured control flow reconstruction of compiler IR
We present an approach to cross-compile Java bytecodes to JavaScript, building on existing Java optimizing compiler technology. Static analysis determines which Java classes and methods are reachable. These are then translated to JavaScript using a re-configured Java just-in-time compiler with a new back end that generates JavaScript instead of machine code. Standard compiler optimizations such as method inlining and global value numbering, as well as advanced optimizations such as escape analysis, lead to compact and optimized JavaScript code. Compiler IR is unstructured, so structured control flow needs to be reconstructed before code generation is possible. We present details of our control flow reconstruction algorithm. Our system is based on Graal, an open-source optimizing compiler for the Java HotSpot VM and other VMs. The modular and VM-independent architecture of Graal allows us to reuse the intermediate representation, the bytecode parser, and the high-level optimizations. Our custom back end first performs control flow reconstruction and then JavaScript code generation. The generated JavaScript undergoes a set of optimizations to increase readability and performance. Static analysis is performed on the Graal intermediate representation as well. Benchmark results show that medium-sized Java benchmarks such as SPECjbb2005 run with acceptable performance on the V8 JavaScript VM.
Augur: Data-Parallel Probabilistic Modelling
Implementing inference procedures for each new probabilistic model is time-consuming and error-prone. Probabilistic programming addresses this problem by allowing a user to specify the model and automatically generating the inference procedure. To make this practical it is important to generate high performance inference code. In turn, on modern architectures, high performance implies parallel execution. In this paper we present Augur, a probabilistic modelling language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.
Generalized decomposition for non-linear problems
Brief descriptions of RD applications to min cut, max cut, QAP, quadratic programming, and Rosenbrock Functions.
Frappé: Using Clang to Query and Visualize Large Codebases
Frappé is a new tool to support developers with a range of code comprehension queries in multi-million line codebases, from "Does function X or something it calls write to global variable Y?" to "How much code could be affected if I change this macro?". Results are overlaid on a visualisation of the code based on a cartographic map, where the continent/country/state hierarchy corresponds to the code equivalent: high-level architectural components down to individual files and functions. This allows users to visually filter results based on their location and more immediately gauge their number and locality.
Translating Java into LLVM IR to Detect Security Vulnerabilities
Late 2012 and early 2013 saw a spike of new Java vulnerabilities reported in 0-day attacks and used in the wild, allowing bypass of the Java sandbox: unguarded caller-sensitive methods, misuse of doPrivileged, invalid deserialisation, invalid serialisation, and more. Oracle quickly reacted by making patches available and has since increased the scheduled patch update cycle to 4 releases a year. Given the lack of tools available in the market to detect these types of vulnerabilities, and the internal success of the Parfait-for-C static code analysis tool[1] within Oracle, the question was raised whether Parfait could be quickly extended to support the Java language semantics and to detect these new vulnerabilities. In this talk we describe how, in the course of one year, we have been developing and deploying Parfait-for-Java, with the first two of three deployment milestones in place. The Java translator, Jaffa, reuses the LLVM intermediate representation, which Parfait uses as its own intermediate representation, and extends it with metadata to support the semantics of the Java language. Jaffa's translation is done for analysis purposes, not for execution purposes. New analyses for detecting these vulnerabilities encode the Java Secure Coding Guidelines (http://www.oracle.com/technetwork/java/seccodeguide-139067.html). Interaction with the Java Security team is in place, in order to better understand the guidelines themselves and to obtain early feedback on results of the analyses. Staged deployment of Parfait-for-Java provides developers with timely feedback on new code being developed, and provides QA with feedback on existing code. [1] Parfait-for-C was reported at the LLVM Developer Meeting 2009
Deployment of Query Plans on Multicores
Efficient resource scheduling of multithreaded software on multicore hardware is difficult given the many parameters involved and the hardware heterogeneity of existing systems. In this paper we explore the efficient deployment of query plans over a multicore machine. We focus on shared query systems, and implement the proposed ideas using SharedDB. The goal of the paper is to explore how to deliver maximum performance and predictability, while minimizing resource utilization when deploying query plans on multicore machines. We propose to use resource activity vectors to characterize the behavior of individual database operators. We then present a novel deployment algorithm which uses these vectors together with dataflow information from the query plan to optimally assign relational operators to physical cores. Experiments demonstrate that this approach significantly reduces resource requirements while preserving performance and is robust across different server architectures.
Supporting Maintenance and Evolution of Access Control Models in Web Applications
This paper presents an approach to support the maintenance and evolution of Role-Based Access Control (RBAC) models with reverse-engineered Secure UML models. Starting from the Policy Decision Points (PDP) and Policy Enforcement Points (PEP) of an application, our approach statically reverse-engineers the implemented Secure UML model of an application. The secure UML model is then stored in an RDF triple store for easy querying and exploration. In the context of this study, we extracted the Secure UML model of the GRAND Forum, a web-based forum for the members of the GRAND (Graphics, Animation and New Media) NCE (Networks of Centers of Excellence), that is developed and maintained at the University of Alberta. Using three real use-case scenarios, we illustrate how simple queries to the extracted Secure UML can save developers significant amounts of manual work and support them in their access control related maintenance and evolution tasks.
Why Inheritance Anomaly Is Not Worth Solving
Modern computers improve on their predecessors with additional parallelism but require concurrent software to exploit it. Object-orientation is instrumental in simplifying sequential programming; however, in a concurrent setting, programmers adding new methods in a subclass typically have to modify the code of the superclass, which inhibits reuse, a problem known as inheritance anomaly. Researchers have made many efforts over the last two decades to solve the problem by deriving anomaly-free languages. Yet these proposals have not ended up as practical solutions, so one may ask why. In this article, we investigate from a theoretical perspective whether a solution of the problem would introduce extra code complexity. We model object behavior as a regular language, and show that freedom from inheritance anomaly necessitates a language where ensuring Liskov-Wing substitutability becomes a language containment problem, which in our modeling is PSPACE hard. This indicates that we cannot expect programmers to manually ensure that subtyping holds in an anomaly-free language. Anomaly freedom thus predictably leads to software bugs, and we doubt the value of providing it. From the practical perspective, the problem is already solved. Inheritance anomaly is part of the general fragile base class problem of object-oriented programming, which arises due to code coupling in implementation inheritance. In modern software practice, the fragile base class problem is circumvented by interface abstraction to avoid implementation inheritance, and by opting for composition as the means for reuse. We discuss concurrent programming issues with composition for reuse.
Hardware extensions to make lazy subscription safe
Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections “subscribe” to the lock by reading it and checking that it is available. A recent paper proposed using the tempting “lazy subscription” optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. In this extended version of our WTTM 2014 paper, we describe hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support. We also explain how nontransactional loads can be exploited, if available, to further enhance the effectiveness of lazy subscription.
Pitfalls of lazy subscription
Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections “subscribe” to the lock by reading it and checking that it is available. A recent paper proposed using the tempting “lazy subscription” optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. A longer version of this paper proposes hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support.
The Future(s) of Shared Data Structures.
This paper considers how to use futures, a well-known mechanism to manage parallel computations, to improve the performance of long-lived, mutable shared data structures in large-scale multicore systems. We show that futures can enable type-specific optimizations such as combining and elimination, improve cache locality and reduce contention. To exploit these benefits in an effective way, however, it is important to define clear notions of correctness. We propose new extensions to linearizability appropriate for method calls that return futures as results. To illustrate the utility and trade-offs of these extensions, we describe implementations of three common data structures: stacks, queues, and linked lists, designed to exploit futures. Our experimental results show that optimizations enabled by futures lead to substantial performance improvements, in some cases up to two orders of magnitude, compared to well-known lock-free alternatives.
Adaptive Integration of Hardware and Software Lock Elision Techniques.
Transactional Lock Elision (TLE) and optimistic software execution can both improve scalability of lock-based programs. The former uses hardware transactional memory (HTM) without requiring code changes; the latter involves modest code changes but does not require special hardware support. Numerous factors affect the choice of technique, including: critical section code, calling context, workload characteristics, and hardware support for synchronization. The ALE library integrates these techniques, and collects detailed, fine-grained performance data, enabling policies that decide between them at runtime for each critical section execution. We describe an adaptive policy and present experiments on three platforms, two of which support HTM, showing that—without tuning for specific platforms or workload—the adaptive policy is competitive with and often significantly better than hand-tuned static policies.
Debugging At Full Speed
Debugging support for highly optimized execution environments is notoriously difficult to implement. The Truffle/Graal platform for implementing dynamic languages offers an opportunity to resolve the apparent trade-off between debugging and high performance. Truffle/Graal-implemented languages are expressed as abstract syntax tree (AST) interpreters. They enjoy competitive performance through platform support for type specialization, partial evaluation, and dynamic optimization/deoptimization. A prototype debugger for Ruby, implemented on this platform, demonstrates that basic debugging services can be implemented with modest effort and without significant impact on program performance. Prototyped functionality includes breakpoints, both simple and conditional, at lines and at local variable assignments. The debugger interacts with running programs by inserting additional nodes at strategic AST locations; these are semantically transparent by default, but when activated can observe and interrupt execution. By becoming in effect part of the executing program, these “wrapper” nodes are subject to full runtime optimization, and they incur zero runtime overhead when debugging actions are not activated. Conditions carry no overhead beyond evaluation of the expression, which is optimized in the same way as user code, greatly improving the prospects for capturing rarely manifested bugs. When a breakpoint interrupts program execution, the platform automatically restores the full execution state of the program (expressed as Java data structures), as if running in the unoptimized AST interpreter. This then allows full introspection of the execution data structures such as the AST and method activation frames when in the interactive debugger console. Our initial evaluation indicates that such support could be permanently enabled in production environments.
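The wrapper-node idea can be sketched in plain Java (illustrative only, not the Truffle instrumentation API): the wrapper is semantically transparent until the debugger flips its breakpoint flag, and an optimizing compiler can fold the disabled check away.

```java
// Illustrative sketch of a debugger "wrapper node": inserted at a statement,
// transparent by default, and only calls out when a breakpoint is activated.
public class WrapperNodeDemo {
    interface Node { Object execute(); }

    static final class Wrapper implements Node {
        final Node child; final int line;
        volatile boolean breakpointEnabled;               // toggled by the debugger
        Wrapper(Node child, int line) { this.child = child; this.line = line; }
        public Object execute() {
            if (breakpointEnabled) {
                System.out.println("hit breakpoint at line " + line);  // hand control to the debugger
            }
            return child.execute();                        // otherwise semantically transparent
        }
    }

    public static void main(String[] args) {
        Wrapper stmt = new Wrapper(() -> 6 * 7, 3);
        System.out.println(stmt.execute());   // 42, no debugger interaction
        stmt.breakpointEnabled = true;
        System.out.println(stmt.execute());   // prints the breakpoint message first
    }
}
```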
Exploiting Implicit Parallelism in Dynamic Array Programming Languages
We have built an interpreter for the array programming language J. The interpreter exploits implicit data parallelism in the language to achieve good parallel speedups on a variety of benchmark applications. Many array programming languages operate on entire arrays without the need to write loops. Writing without loops simplifies programs. Array programs without loops allow an interpreter to parallelize the execution of the code without complex analysis or input from the programmer. The J programming language includes the usual idioms of operations on arrays of the same size and shape, where the operations can often be performed in parallel for each individual item of the operands. Another opportunity comes from J's reduction operations, where suitable operations can be performed in parallel for all the items of an operand. J has a notion of verb rank, which allows programmers to simplify programs by declaring how operations are applied to operands. The verb rank mechanism allows us to extract further parallelism. Our implementation of an implicitly parallelizing interpreter for J is written entirely in Java. We have written the interpreter in a framework that produces native code for the interpreter, giving good scalar performance. The interpreter itself is responsible for exploiting the parallelism available in the applications. Our results show we attain good parallel speed-up on a variety of benchmarks, including near perfect linear speed-up on inherently parallel benchmarks. We believe that the lessons learned from our approach to exploiting data parallelism in an interpreter can be applied to other interpreted languages as well.
Callisto: Co-Scheduling Parallel Runtime Systems
It is increasingly important for parallel applications to run together on the same machine. However, current performance is often poor: programs do not adapt well to dynamically varying numbers of cores, and the CPU time received by concurrent jobs can differ drastically. This paper introduces Callisto, a resource management layer for parallel runtime systems. We describe Callisto and the implementation of two Callisto-enabled runtime systems—one for OpenMP, and another for a task-parallel programming model. We show how Callisto eliminates almost all of the scheduler-related interference between concurrent jobs, while still allowing jobs to claim otherwise-idle cores. We use examples from two recent graph analytics projects and from SPEC OMP.
Towards Whatever-Scale Abstractions for Data-Driven Parallelism
Increasing diversity in computing systems often requires problems to be solved in quite different ways depending on the workload, data size, and resources available. This kind of diversity is becoming increasingly broad in terms of the organization, communication mechanisms, and the performance and cost characteristics of individual machines and clusters. Researchers have thus been motivated to design abstractions that allow programmers to express solutions independently of target execution platforms, enabling programs to scale from small shared memory systems to distributed systems comprising thousands of processors. We call these abstractions “Whatever-Scale Computing”. In prior work, we have found data-driven parallelism to be a promising approach for solving many problems on shared memory machines. In this paper, we describe ongoing work towards extending our previous abstractions to support data-driven parallelism for Whatever-Scale Computing. We plan to target rack-scale distributed systems. As an intermediate step, we have implemented a runtime system that treats a NUMA shared memory system as if each NUMA domain were a node in a distributed system, using shared memory to implement communication between nodes.
The Case for the Holistic Language Runtime System
We anticipate that, by 2020, the basic unit of warehouse-scale cloud computing will be a rack-sized machine instead of an individual server. At the same time, we expect a shift from commodity hardware to custom SoCs that are specifically designed for use in warehouse-scale computing. In this paper, we make the case that the software for such custom rack-scale machines should move away from the model of running managed language workloads in separate language runtimes on top of a traditional operating system and instead run a distributed language runtime system capable of handling different target languages and frameworks. All applications will execute within this runtime, which performs most traditional OS and cluster manager functionality such as resource management, scheduling and isolation.
Frappé: a code comprehension tool for large codebases
Code comprehension is an integral part of a developer’s everyday programming tasks. Today’s modern, graphical IDEs have many features that facilitate this process. Unfortunately, using these IDEs is often impractical when working with large C/C++ code bases (in the order of millions to tens of millions of lines of code) as they are difficult to integrate with the custom and often complex build systems commonly employed in such projects, and their start-up time and memory usage can be excessive due to the sheer volume of code involved. For these reasons, it has been our experience that developers often fall back to lightweight editors, such as Emacs or Vim, and use regex-based text scanning tools like Grep or CScope to aid in code comprehension tasks. But while these text scanning tools are fast, they are also largely unaware of symbol types, scopes and linking information, handle the C pre-processor poorly, and deal only with direct dependencies. We are developing a code comprehension tool called Frappé that aims to address the limitations of these tools while maintaining a comparable level of scalability and ease-of-integration. Our approach is based around building and incrementally maintaining a dependency graph of the program. The nodes in the graph represent source entities such as functions, global variables, types, macros, files, etc., and the edges represent the relations between them, e.g. calls, reads, writes, uses, contains, etc. Code comprehension questions then become graph-matching problems. This works both for questions involving direct dependencies, like finding all functions that write to a particular global variable (match all function nodes with an outgoing writes edge to the global variable of interest), and for transitive dependencies, like estimating the impact of a code change (match all functions in the transitive closure of calls edges from the function or functions to be modified). Integration with custom builds is made easy by providing wrapper scripts that serve as drop-in replacements for the most common compilers (e.g. gcc, icc, cc, clang). These scripts still execute the native compiler they wrap, but also run a modified version of the clang compiler to write out precise information on the various source entities and dependencies in the given compilation unit. Frappé reads this information in as it becomes available and incrementally constructs (or updates) the dependency graph of the system as it does so. Frappé provides several UIs for exploring the dependency data it generates, including editor plugins for Emacs and Vim and a web UI that allows users to navigate the dependencies in their code and write their own custom queries from their web browser. The web UI overlays query results on a 2D spatial visualisation of the code called a Code Map that gives an immediate general impression of the location, locality and quantity of results. The prototype version of Frappé has seen a positive initial response from internal development organisations. A formal evaluation is planned.
Partial Escape Analysis and Scalar Replacement for Java
Escape Analysis allows a compiler to determine whether an object is accessible outside the allocating method or thread. This information is used to perform optimizations such as Scalar Replacement, Stack Allocation and Lock Elision, allowing modern dynamic compilers to remove some of the abstractions introduced by advanced programming models. The all-or-nothing approach taken by most Escape Analysis algorithms prevents all these optimizations as soon as there is one branch where the object escapes, no matter how unlikely this branch is at runtime. This paper presents a new, practical algorithm that performs control flow sensitive Partial Escape Analysis in a dynamic Java compiler. It allows Escape Analysis, Scalar Replacement and Lock Elision to be performed on individual branches. We implemented the algorithm on top of Graal, an open-source Java just-in-time compiler, and it performs well on a diverse set of benchmarks. In this paper, we evaluate the effect of Partial Escape Analysis on the DaCapo, ScalaDaCapo and SpecJBB2005 benchmarks, in terms of run-time, number and size of allocations and number of monitor operations. It performs particularly well in situations with additional levels of abstraction, such as code generated by the Scala compiler. It reduces the amount of allocated memory by up to 58.5%, and improves performance by up to 33%.
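The following source-level example shows the situation Partial Escape Analysis targets (the transformation itself happens in the compiler IR, not in source; the code is illustrative): the allocation escapes only on a rarely taken branch, so the hot path can be scalar-replaced and the object materialized only where it actually escapes.

```java
// Illustration of a partial-escape situation: the Point escapes only when
// 'log' is true, so a PEA-capable compiler can keep it virtual on the hot path
// and allocate it just before the escaping use.
public class PartialEscapeDemo {
    static final java.util.List<Point> ESCAPED = new java.util.ArrayList<>();
    record Point(int x, int y) {}

    static long distanceSquared(int x, int y, boolean log) {
        Point p = new Point(x, y);        // can stay virtual (scalar-replaced) on the hot path
        if (log) {
            ESCAPED.add(p);               // escapes here only; the allocation is sunk into this branch
        }
        return (long) p.x() * p.x() + (long) p.y() * p.y();
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += distanceSquared(i, i + 1, false);
        System.out.println(sum + ", escaped objects: " + ESCAPED.size());
    }
}
```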
Finding Java Vulnerabilities with the Parfait Static Code Analysis Tool
Poster describing the aims of the Java Vulnerability Detection project, as well as its initial components: a translator of Java source code into the LLVM intermediate representation (Jaffa) and analyses in the Parfait infrastructure to support detection of the vulnerabilities of interest.
One VM to rule them all
Building high-performance virtual machines is a complex and expensive undertaking; many popular languages still have low-performance implementations. We describe a new approach to virtual machine (VM) construction that amortizes much of the effort in initial construction by allowing new languages to be implemented with modest additional effort. The approach relies on abstract syntax tree (AST) interpretation where a node can rewrite itself to a more specialized or more general node, together with an optimizing compiler that exploits the structure of the interpreter. The compiler uses speculative assumptions and deoptimization in order to produce efficient machine code. Our initial experience suggests that high performance is attainable while preserving a modular and layered architecture, and that new high-performance language implementations can be obtained by writing little more than a stylized interpreter.
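The node-rewriting idea can be sketched with a toy example: a specialized add node speculates on integer operands and rewrites itself to a more general node the first time the speculation fails. The class and method names below are hypothetical and not the actual framework's API.

```java
// Toy sketch of an AST node that rewrites itself to a more general node when a
// speculation fails. Names are illustrative, not the actual framework's API.
final class NodeRewritingDemo {
    abstract static class Node { abstract Object execute(); }

    static final class IntAddNode extends Node {
        final Object left, right;
        Node replacement;                                    // node this one rewrote itself to
        IntAddNode(Object l, Object r) { left = l; right = r; }
        @Override Object execute() {
            if (left instanceof Integer a && right instanceof Integer b)
                return a + b;                                // specialized fast path
            replacement = new GenericAddNode(left, right);   // speculation failed: generalize
            return replacement.execute();
        }
    }

    static final class GenericAddNode extends Node {
        final Object left, right;
        GenericAddNode(Object l, Object r) { left = l; right = r; }
        @Override Object execute() {                         // general case, e.g. concatenation
            if (left instanceof Integer a && right instanceof Integer b) return a + b;
            return String.valueOf(left) + right;
        }
    }

    public static void main(String[] args) {
        System.out.println(new IntAddNode(2, 3).execute());   // 5 (stays specialized)
        System.out.println(new IntAddNode("a", 1).execute()); // a1 (rewrites to generic)
    }
}
```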
An intermediate representation for speculative optimizations in a dynamic compiler
We present a compiler intermediate representation (IR) that allows dynamic speculative optimizations for high-level languages. The IR is graph-based and contains nodes fixed to control flow as well as floating nodes. Side-effecting nodes include a framestate that maps values back to the original program. Guard nodes dynamically check assumptions and, on failure, deoptimize to the interpreter that continues execution. Guards implicitly use the framestate and program position of the last side-effecting node. Therefore, they can be represented as freely floating nodes in the IR. Exception edges are modeled as explicit control flow and are subject to full optimization. We use profiling and deoptimization to speculatively reduce the number of such edges. The IR is the core of a just-in-time compiler that is integrated with the Java HotSpot VM. We evaluate the design decisions of the IR using major Java benchmark suites.
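Conceptually, a guard replaces an explicit exception edge with an assumption check that falls back to the interpreter on failure. The snippet below is only a source-level analogy of that control-flow shape (the helper interface is hypothetical; in reality the compiler emits the guard and the framestate transfer, not the programmer).

```java
// Conceptual analogy of a guard plus deoptimization instead of an exception edge.
final class GuardSketch {
    interface Deopt { int continueInInterpreter(); }     // stands in for the framestate transfer

    // Compiled code speculating that the divisor is non-zero, protected by a guard.
    static int speculativeDiv(int a, int b, Deopt deopt) {
        if (b == 0) {                                    // guard: assumption check
            return deopt.continueInInterpreter();        // deoptimize: resume in the interpreter
        }
        return a / b;                                    // fast path, no exception edge needed
    }

    public static void main(String[] args) {
        Deopt interpreter = () -> { System.out.println("deoptimized"); return 0; };
        System.out.println(speculativeDiv(10, 2, interpreter)); // 5
        System.out.println(speculativeDiv(10, 0, interpreter)); // deoptimized, then 0
    }
}
```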
A Joint Model for Discovering and Linking Entities
Entity resolution, the task of automatically determining which mentions refer to the same real-world entity, is a crucial aspect of knowledge base construction and management. However, performing entity resolution at large scales is challenging because (1) the inference algorithms must cope with unavoidable system scalability issues and (2) the search space grows exponentially in the number of mentions. Current conventional wisdom declares that performing coreference at these scales requires decomposing the problem by first solving the simpler task of entity-linking (matching a set of mentions to a known set of KB entities), and then performing entity discovery as a postprocessing step (to identify new entities not present in the KB). However, we argue that this traditional approach is harmful to both entity-linking and overall coreference accuracy. Therefore, we embrace the challenge of jointly modeling entity-linking and entity-discovery as a single entity resolution problem. In order to achieve scalability we (1) present a model that reasons over compact hierarchical entity representations, and (2) propose a novel distributed inference architecture that does not suffer from the synchronicity bottleneck which is inherent in map-reduce architectures. We demonstrate that more test-time data actually improves the accuracy of coreference, and show that the joint approach to coreference is substantially more accurate than traditional entity-linking, reducing error by over 75%.
Assessing Confidence of Knowledge Base Content with an Experimental Study in Entity Resolution
The purpose of this paper is to begin a conversation about the importance and role of confidence estimation in knowledge bases (KBs). KBs are never perfectly accurate, yet without confidence reporting their users are likely to treat them as if they were, possibly with serious real-world consequences. We define a notion of confidence based on the probability of a KB fact being true. For automatically constructed KBs we propose several algorithms for estimating this confidence from pre-existing probabilistic models of data integration and KB construction. In particular, this paper focusses on confidence estimation in entity resolution. A goal of our exposition here is to encourage creators and curators of KBs to include confidence estimates for entities and relations in their KBs.
Improved dataflow executions with user assisted scheduling.
In pure dataflow applications scheduling can have a huge effect on the memory footprint and number of active tasks in the program. However, in impure programs, scheduling not only affects the system resources, but can also affect the overall time complexity and accuracy of the program. To address both of these aspects this paper describes and analyses effective extensions to a dataflow scheduler to allow programmers to provide priority information describing the preferred execution order of a dataflow graph. We demonstrate that even very crude task priority metrics can be extremely effective, providing an average saving of 91% over the worst case scenario and 60% over the best case naive scenario. We also note that by specifying the scheduling information explicitly based on the algorithm, not the hardware, we provide portability to the application.
An Experimental Study of the Influence of Dynamic Compiler Optimizations on Scala Performance
Java Virtual Machines are optimized for performing well on traditional Java benchmarks, which consist almost exclusively of code generated by the Java source compiler (javac). Code generated by compilers for other languages has not received nearly as much attention, which results in performance problems for those languages. One important specimen of "another language" is Scala, whose syntax and features encourage a programming style that differs significantly from traditional Java code. It suffers from the same problem -- its code patterns are not optimized as well as the ones originating from Java code. JVM developers need to be aware of the differences between Java and Scala code, so that both types of code can be executed with optimal performance. This paper presents a detailed investigation of the performance impact of a large number of optimizations on the Scala DaCapo and the Java DaCapo benchmark suites. It describes the optimization techniques and analyzes the differences between traditional Java applications and Scala applications. The results help compiler engineers in understanding the characteristics of Scala. We performed these experiments on the work-in-progress Graal compiler. Graal is a new dynamic compiler for the HotSpot VM which aims to work well for a diverse set of workloads, including languages other than Java.
Constrained Data-Driven Parallelism
In data-driven parallelism, changes to data spawn new tasks, which may change more data, spawning yet more tasks. Computation propagates until no further changes occur. Benefits include increasing opportunities for fine-grained parallelism, avoiding redundant work, and supporting incremental computations on large data sets. Nonetheless, data-driven parallelism can be problematic. For example, convergence times of data-driven single-source shortest paths algorithms can vary by two orders of magnitude depending on task execution order. We propose constrained data-driven parallelism, in which programmers can impose ordering constraints on tasks. In particular, we propose new abstractions for defining groups of tasks and constraining the execution order of tasks within each group. We sketch an initial implementation and present experimental results demonstrating that our approach enables new efficient data-driven implementations of a variety of graph algorithms.
Code Maps: A Scalable Visualisation Technique for Large Codebases
Large codebases (in the order of millions to 10s of millions of lines of code) are notoriously difficult to understand, modify and maintain. They are the product of hundreds of developers working simultaneously over several decades and are often poorly documented. For new developers, building up a workable mental model of these systems can take a considerable amount of time. Software visualisation is one area that seems ideally suited to address this problem, but that in practice, does not see much use beyond high-level, hand-crafted architecture diagrams and node-link diagrams that typically scale poorly to larger systems. We are developing a scalable, spatial visualisation for large codebases based on a world-map metaphor. The core of the idea is mapping the continent/country/state/city/etc. hierarchy to the equivalent in a code base – the high-level architectural components down to the individual files and functions that comprise them – and laying these out so as to maintain the intuitive notion that the proximity of any two entities is proportional to their coupling. Our approach takes as input a dependency graph of the system and, optionally, a predefined abstraction hierarchy to group the low-level source code entities into their higher-level system components. If a hierarchy is not supplied, we recover one from the dependency graph using a graph clustering algorithm. From there we use a combination of force-directed graph layout, implicit surface generation, and Voronoi treemaps to produce a map of the codebase. Our approach allows users to browse the system at any level of detail using the familiar pan/zoom interaction model of web-based mapping services. It provides strong visual landmarks for faster navigation thanks to the distinctive shapes and positions of regions on the map, and lends itself to easy data overlay – the size and colour of regions are proportional to supplied code metrics, while bug locations, search results, and dependency edges can be superimposed. A formal evaluation of the prototype has not yet been undertaken, but initial feedback from internal development organisations has been very positive, particularly for the data overlay capabilities and intuitive zoom-for-detail interaction model.
Min Cut Results from 2013
Randomized Decomposition results against state-of-the-art packages and best known solutions in 2013
Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
Fano's inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon's as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano's result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate---we derive similar optimal thresholds for F-score and BER.
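For reference, the classical inequality being extended and the threshold observation can be written as follows. This is the standard Shannon-entropy form, not the paper's generalized-entropy version.

```latex
% Classical Fano inequality, for any classifier \hat{Y} of Y built from X,
% with P_e = \Pr(\hat{Y} \neq Y) and h the binary entropy:
H(Y \mid X) \;\le\; h(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr)

% Two-class case (|\mathcal{Y}| = 2): the last term vanishes, so the Bayes error e^*
% is lower-bounded via the binary entropy:
h(e^*) \;\ge\; H(Y \mid X), \qquad
e^* \;\ge\; h^{-1}\bigl(H(Y \mid X)\bigr) \ \text{taking the branch on } [0, \tfrac{1}{2}]

% The error-rate-optimal decision rule thresholds the posterior where it crosses 1/2:
\hat{y}(x) = \mathbb{1}\!\left[\Pr(Y = 1 \mid x) > \tfrac{1}{2}\right]
```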
Static Analysis by Elimination
In this paper we describe a program analysis technique for finding value ranges for variables in the LLVM compiler infrastructure. Range analysis has several important applications for embedded systems, including elimination of assertions in programs, automatically deducing numerical stability, eliminating array bounds checking, and integer overflow detection. Determining value ranges poses a major challenge in program analysis because it is difficult to ensure the termination and precision of the program analysis in the presence of program cycles. This work uses a technique where loops are detected intrinsically within the program analysis. Our work combines methods of elimination-based data flow analysis with abstract interpretation. We have implemented a prototype of the proposed framework in the LLVM compiler framework and have conducted experiments with a suite of test programs to show the feasibility of our approach.
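A minimal sketch of the abstract domain such a range analysis iterates over is shown below: intervals with a join operator for merging control flow and a widening operator that forces termination on loops. The names are illustrative and this is not the tool's implementation.

```java
// Minimal sketch of the interval (value-range) abstract domain: join for control-flow
// merges, widening to guarantee termination on program cycles. Illustrative only.
final class Interval {
    static final long NEG_INF = Long.MIN_VALUE, POS_INF = Long.MAX_VALUE;
    final long lo, hi;                         // represents the set {x | lo <= x <= hi}
    Interval(long lo, long hi) { this.lo = lo; this.hi = hi; }

    // Least upper bound: smallest interval covering both operands (control-flow merge).
    Interval join(Interval o) {
        return new Interval(Math.min(lo, o.lo), Math.max(hi, o.hi));
    }

    // Widening: jump straight to +/- infinity on any growing bound so loops terminate.
    Interval widen(Interval next) {
        return new Interval(next.lo < lo ? NEG_INF : lo, next.hi > hi ? POS_INF : hi);
    }

    @Override public String toString() { return "[" + lo + ", " + hi + "]"; }

    public static void main(String[] args) {
        Interval i = new Interval(0, 0);
        Interval afterOneIteration = new Interval(0, 1);  // e.g. a loop keeps growing the bound
        System.out.println(i.join(afterOneIteration));    // [0, 1]
        System.out.println(i.widen(afterOneIteration));   // [0, 9223372036854775807]
    }
}
```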
Graal IR: An Extensible Declarative Intermediate Representation
We present an intermediate representation (IR) for a Java just in time (JIT) compiler written in Java. It is a graph-based IR that models both control-flow and data-flow dependencies between nodes. We show the framework in which we developed our IR. Much care has been taken to allow the programmer to focus on compiler optimization rather than IR bookkeeping. Edges between nodes are declared concisely using Java annotations, and common properties and functions on nodes are communicated to the framework by implementing interfaces. Building upon these declarations, the graph framework automatically implements a set of useful primitives that the programmer can use to implement optimizations.
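The flavour of "edges declared with annotations, primitives derived by the framework" can be sketched in a self-contained way as follows. The annotation and class names here are illustrative only and are not the actual framework's API.

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.*;

// Self-contained sketch: IR input edges are declared as annotated fields, and a tiny
// framework discovers them reflectively. Names are illustrative, not the real API.
final class DeclarativeIrSketch {
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
    @interface Input {}                              // marks a data-flow input edge

    static class Node {
        // Generic primitive built from the declarations: enumerate this node's inputs.
        List<Node> inputs() {
            List<Node> result = new ArrayList<>();
            for (Field f : getClass().getDeclaredFields()) {
                if (f.isAnnotationPresent(Input.class)) {
                    f.setAccessible(true);
                    try { result.add((Node) f.get(this)); }
                    catch (IllegalAccessException e) { throw new AssertionError(e); }
                }
            }
            return result;
        }
    }

    static final class ConstantNode extends Node {
        final int value;
        ConstantNode(int value) { this.value = value; }
    }

    static final class AddNode extends Node {
        @Input Node x;                               // an edge is a plain field plus an annotation
        @Input Node y;
        AddNode(Node x, Node y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) {
        AddNode add = new AddNode(new ConstantNode(1), new ConstantNode(2));
        System.out.println(add.inputs().size());     // 2
    }
}
```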
NUMA-aware reader-writer locks
Non-Uniform Memory Access (NUMA) architectures are gaining importance in mainstream computing systems due to the rapid growth of multi-core multi-chip machines. Extracting the best possible performance from these new machines will require us to re-visit the design of the concurrent algorithms and synchronization primitives which form the building blocks of many of today’s applications. This paper revisits one such critical synchronization primitive – the reader-writer lock.
We present what is, to the best of our knowledge, the first family of reader-writer lock algorithms tailored to NUMA architectures. We present several variations which trade fairness between readers and writers for higher concurrency among readers and better back-to-back batching of writers from the same NUMA node. Our algorithms leverage the lock cohorting technique to manage synchronization between writers in a NUMA-friendly fashion, binary flags to coordinate readers and writers, and simple distributed reader counter implementations to enable NUMA-friendly concurrency among readers. The end result is a collection of surprisingly simple NUMA-aware algorithms that outperform the state-of-the-art reader-writer locks by up to a factor of 10 in our microbenchmark experiments. To evaluate ....
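The "distributed reader counter" ingredient can be sketched as below: readers increment a per-node counter rather than a single shared one, so read acquisitions from different NUMA nodes do not bounce the same cache line. This is a simplified sketch of that one ingredient, not the paper's cohort-based algorithms, and the thread-to-node mapping is a stand-in.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of a reader-writer lock with per-node reader counters.
// Not the paper's algorithm; the node-id mapping below is a stand-in.
final class DistributedReaderWriterLock {
    private final AtomicLong[] readers;              // one slot per NUMA node (ideally padded)
    private final ReentrantLock writerMutex = new ReentrantLock();
    private volatile boolean writerActive = false;

    DistributedReaderWriterLock(int numaNodes) {
        readers = new AtomicLong[numaNodes];
        for (int i = 0; i < numaNodes; i++) readers[i] = new AtomicLong();
    }

    private int nodeOf(Thread t) {                   // stand-in for a real thread->node mapping
        return (int) (t.getId() % readers.length);
    }

    void readLock() {
        int node = nodeOf(Thread.currentThread());
        while (true) {
            readers[node].incrementAndGet();         // announce the read on the local counter
            if (!writerActive) return;               // no writer: proceed
            readers[node].decrementAndGet();         // writer present: back off and wait
            while (writerActive) Thread.onSpinWait();
        }
    }

    void readUnlock() {
        readers[nodeOf(Thread.currentThread())].decrementAndGet();
    }

    void writeLock() {
        writerMutex.lock();                          // one writer at a time
        writerActive = true;                         // block new readers
        for (AtomicLong r : readers)                 // drain readers that got in first
            while (r.get() != 0) Thread.onSpinWait();
    }

    void writeUnlock() {
        writerActive = false;
        writerMutex.unlock();
    }

    public static void main(String[] args) {
        DistributedReaderWriterLock lock = new DistributedReaderWriterLock(4);
        lock.readLock();  lock.readUnlock();
        lock.writeLock(); lock.writeUnlock();
        System.out.println("ok");
    }
}
```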
Reliable peer-to-peer connections
Embodiments of a system and method for establishing reliable connections between peers in a peer-to-peer networking environment. In one embodiment, a reliable communications channel may use transmit and receive windows, acknowledgement of received messages, and retransmission of messages not received to provide reliable delivery of messages between peers in the peer-to-peer environment. In one embodiment, each message may include a sequence number configured for use in maintaining ordering of received messages on a receiving peer. A communications channel may make multiple hops on a network, and different hops in the connection may use different underlying network protocols. Communications channels may also pass through one or more firewalls and/or one or more gateways on the network. A communications channel may also pass through one or more router (relay) peers on the network. The peers may adjust the sizes of the transmit and receive window based upon reliability of the connection.
The Potential to Coordinate Digital Simulations for UK-wide VET:Report to the Commission on Adult Vocational Teaching and Learning
The report provides insight into and analysis of the opportunity and potential of simulation tools for education. In VET, learning and assessment are primarily practice-based. Consequently many colleges build simulations of real world locations, such as kitchens, hairdressing salons, garages, building sites, and farms in land-based colleges. A wide range of digital tools are then used to support, amplify or augment these real-world learning processes, to prepare learners for authentic workplace practice, aid reflection on practice, reinforce their practice-based learning, and to help with revision before assessments. We examine here the particular role of digital simulation technologies alongside other digital applications and conventional methods.
Communities of Practice
Communities of practice was first adopted in education as a theory of learning, and by business, particularly within organizational development, as a knowledge management approach. This chapter reviews the literature on communities of practice. It is organized into five sections which call out the major areas in which communities of practice has had an influence. Each section provides an overview of the literature on communities of practice for the following domains: communities of practice definitions and theory, identities and belonging, learning and teaching methods using the theory of communities of practice, workplace communities of practice, and virtual communities of practice. The chapter finally addresses communities of scientific practice through the use of a case study.
SIMMAT: A Metastability Analysis Tool
Presentation at the IEEE/ACM ICCAD 2012 Workshop on CAD for Multi-Synchronous and Asynchronous Circuits and Systems, 8 November 2012, Hilton San Jose, CA USA.
Compilation Queuing and Graph Caching for Dynamic Compilers
Modern virtual machines for Java use a dynamic compiler to optimize the program at run time. The compilation time therefore impacts the performance of the application in two ways: First, the compilation and the program's execution compete for CPU resources. Second, the sooner the compilation of a method finishes, the sooner the method will execute at compiled speed. In this paper, we present two strategies for mitigating the performance impact of a dynamic compiler. We introduce and evaluate a way to cache, reuse and, at the right time, evict the compiler's intermediate graph representation. This allows reuse of this graph when a method is inlined multiple times into other methods. We show that the combination of late inlining and graph caching is highly effective by evaluating the cache hit rate for several benchmarks. Additionally, we present a new mechanism for optimizing the order in which methods get compiled. We use a priority queue in order to make sure that the compiler processes the hottest methods of the program first. The machine code for hot methods is available earlier, which has a significant impact on the first benchmark. Our results show that our techniques can significantly improve the start-up performance of Java applications. The techniques are applicable to dynamic compilers for managed languages.
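The queuing idea reduces to ordering pending compile requests by a hotness metric so the hottest methods are compiled first. A minimal sketch follows; the task fields and numbers are illustrative only.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Minimal sketch of priority-based compilation queuing: the compiler thread always
// takes the hottest pending method first. Fields and counts are illustrative.
final class CompileQueueSketch {
    record CompileTask(String methodName, long invocationCount) {}

    public static void main(String[] args) throws InterruptedException {
        // Highest invocation count first.
        PriorityBlockingQueue<CompileTask> queue = new PriorityBlockingQueue<>(
                16, (a, b) -> Long.compare(b.invocationCount(), a.invocationCount()));

        queue.add(new CompileTask("List.size", 120));
        queue.add(new CompileTask("HashMap.get", 50_000));   // hottest: compiled first
        queue.add(new CompileTask("Main.init", 3));

        while (!queue.isEmpty()) {
            System.out.println("compiling " + queue.take().methodName());
        }
    }
}
```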
Intransitive noninterference in nondeterministic systems
This paper addresses the question of how TA-security, a semantics for intransitive information-flow policies in deterministic systems, can be generalized to nondeterministic systems. Various definitions are proposed, including definitions that state that the system enforces as much of the policy as possible in the context of attacks in which groups of agents collude by sharing information through channels that lie outside the system. Relationships between the various definitions proposed are characterized, and an unwinding-based proof technique is developed. Finally, it is shown that on a specific class of systems, access control systems with local non-determinism, the strongest definition can be verified by checking a simple static property.
RSSolver: A tool for solving large non-linear, non-convex discrete optimization problems
Describes the initial implementation of Randomized Decomposition (then called Randomized Search) with some numerical results on the price optimization and shelf space optimization problems
R2RML: RDB to RDF Mapping Language, W3C Recommendation
Co-editor of the W3C recommendation on describing a language to map relational databases to RDF datasets.
DFScala: high level dataflow support for Scala.
In this paper we present DFScala, a library for constructing and executing dataflow graphs in the Scala language. Through the use of Scala this library allows the programmer to construct coarse grained dataflow graphs that take advantage of functional semantics for the dataflow graph and both functional and imperative semantics within the dataflow nodes. This combination allows for very clean code which exhibits the properties of dataflow programs, but we believe is more accessible to imperative programmers. We first describe DFScala in detail, before using a number of benchmarks to evaluate both its scalability and its absolute performance relative to existing codes. DFScala has been constructed as part of the Teraflux project and is being used extensively as a basis for further research into dataflow programming.
Low-loss Low-crosstalk Silicon Rib Waveguide Crossing with Tapered Multimode-Interference Design
Abstract: We report the design and fabrication of silicon rib-waveguide crossings based on taper-integrated multimode-interference. Measured devices built in a 130nm SOI CMOS process showed an insertion loss of 0.1dB/crossing, and an extracted crosstalk below -35dB.
A physical design tool for carbon nanotube field-effect transistor circuits
In this article, we present a graphical Computer-Aided Design (CAD) environment for the design, analysis, and layout of Carbon NanoTube (CNT) Field-Effect Transistor (CNFET) circuits. This work is motivated by the fact that such a tool currently does not exist in the public domain for researchers. Our tool has been integrated within Electric, a very powerful yet free CAD system for custom design of Integrated Circuits (ICs). The tool supports CNFET schematic and layout entry, rule checking, and HSpice/VerilogA netlist generation. We provide users with a customizable CNFET technology library with the ability to specify λ-based design rules. We showcase the capabilities of our tool by demonstrating the design of a large CNFET standard cell and component library. HSPICE simulations are also presented for cell library characterization. We hope that the availability of this tool will invigorate the CAD community to explore novel ideas in CNFET circuit design.
MCMCMC: Efficient Inference by Approximate Sampling
Conditional random fields and other graphical models have achieved state of the art results in a variety of NLP and IE tasks including coreference and relation extraction. Increasingly, practitioners are using models with more complex structure—higher tree-width, larger fan-out, more features, and more data—rendering even approximate inference methods such as MCMC inefficient. In this paper we propose an alternative MCMC sampling scheme in which transition probabilities are approximated by sampling from the set of relevant factors. We demonstrate that our method converges more quickly than a traditional MCMC sampler for both marginal and MAP inference. In an author coreference task with over 5 million mentions, we achieve a 13 times speedup over regular MCMC inference.
A Discriminative Hierarchical Model for Fast Coreference at Large Scale
Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks including difficulties scaling to large numbers of mentions and limited representational power. As these drawbacks become increasingly restrictive, the need to replace the pairwise approaches with a more expressive, highly scalable alternative is becoming urgent. In this paper we propose a novel discriminative hierarchical model that recursively partitions entities into trees of latent sub-entities. These trees succinctly summarize the mentions providing a highly compact, information-rich structure for reasoning about entities and coreference uncertainty at massive scales. We demonstrate that the hierarchical model is several orders of magnitude faster than pairwise, allowing us to perform coreference on six million author mentions in under four hours on a single CPU.
Evaluating the Design of the R Language - Objects and Functions for Data Analysis.
R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we assess the success of different language features.
Combining Functional and Imperative Programming for Multicore Software: An Empirical Study Evaluating Scala and Java
Recent multi-paradigm programming languages combine functional and imperative programming styles to make software development easier. Given today’s proliferation of multicore processors, parallel programmers are supposed to benefit from this combination, as many difficult problems can be expressed more easily in a functional style while others match an imperative style. Due to a lack of empirical evidence from controlled studies, however, important software engineering questions are largely unanswered. Our paper is the first to provide thorough empirical results by using Scala and Java as a vehicle in a controlled comparative study on multicore software development. Scala combines functional and imperative programming while Java focuses on imperative shared-memory programming. We study thirteen programmers who worked on three projects, including an industrial application, in both Scala and Java. In addition to the resulting 39 Scala programs and 39 Java programs, we obtain data from an industry software engineer who worked on the same project in Scala. We analyze key issues such as effort, code, language usage, performance, and programmer satisfaction. Contrary to popular belief, the functional style does not lead to bad performance. Average Scala run-times are comparable to Java, lowest run-times are sometimes better, but Java scales better on parallel hardware. We confirm with statistical significance Scala’s claim that Scala code is more compact than Java code, but clearly refute other claims of Scala on lower programming effort and lower debugging effort. Our study also provides explanations for these observations and shows directions on how to improve multi-paradigm languages in the future.
Informative Priors for Markov Blanket Discovery
We present a novel interpretation of information theoretic feature selection as optimization of a discriminative model. We show that this formulation coincides with a group of mutual information based filter heuristics in the literature, and show how our probabilistic framework gives a well-founded extension for informative priors. We then derive a particular sparsity prior that recovers the well-known IAMB algorithm (Tsamardinos & Aliferis, 2003) and extend it to create a novel algorithm, IAMB-IP, that includes domain knowledge priors. In empirical evaluations, we find the new algorithm to improve Markov Blanket recovery even when a misspecified prior was used, in which half the prior knowledge was incorrect.
Solving retail space optimization problem using the randomized search algorithm
Solving the Retail Space Optimization Problem using the Randomized Search Algorithm. An application of RD to shelf-space optimization
Resource-bounded Information Acquisition and Learning
In many scenarios it is desirable to augment existing data with information acquired from an external source. For example, information from the Web can be used to fill missing values in a database or to correct errors. In many machine learning and data mining scenarios, acquiring additional feature values can lead to improved data quality and accuracy. However, there is often a cost associated with such information acquisition, and we typically need to operate under limited resources. In this thesis, I explore different aspects of Resource-bounded Information Acquisition and Learning. The process of acquiring information from an external source involves multiple steps, such as deciding what subset of information to obtain, locating the documents that contain the required information, acquiring relevant documents, extracting the specific piece of information, and combining it with existing information to make useful decisions. The problem of Resource-bounded Information Acquisition (RBIA) involves saving resources at each stage of the information acquisition process. I explore four special cases of the RBIA problem, propose general principles for efficiently acquiring external information in real-world domains, and demonstrate their effectiveness using extensive experiments. For example, in some of these domains I show how interdependency between fields or records in the data can also be exploited to achieve cost reduction. Finally, I propose a general framework for RBIA, that takes into account the state of the database at each point of time, dynamically adapts to the results of all the steps in the acquisition process so far, as well as the properties of each step, and carries them out striving to acquire most information with least amount of resources.
A case for exiting a transaction in the context of hardware transactional memory.
Despite the rapid growth in the area of Transactional Memory (TM), there is a lack of standardisation of certain features. The behaviour of a transactional abort is one such feature. All hardware TM and most software TM designs treat abort as a way of restarting the current transaction. However, an alternative representation of the same functionality has been expressed in some software transactional memories and programming language proposals. These allow the termination of a transaction without restarting. In this paper we argue that similar functionality is required for hardware TM as well. We call this functionality Exit Transaction, in which a programmer can explicitly ask the underlying TM system to move to the end of the transaction without committing it. We discuss how to extend a hardware TM system to support such a feature and our evaluation with two hardware TM systems shows that by using this functionality a speedup of up to 1.35X can be achieved on the benchmarks tested. This is achieved as a result of lower contention for resources and fewer false positives.
“Dual-Purpose” Remateable Conductive Ball-in-Pit Interconnects for Chip Powering and Passive Alignment in Proximity Communication Enabled Multi-Chip Packages
with Hiren Thacker, Ivan Shubin, Ying Luo, Kannan Raj, Ashok Krishnamoorthy and John Cunningham
Yes, There is an "Expertise Gap" in HPC Applications Development
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity increase in High Performance Computing (HPC), where productivity is understood to be a composite of system performance, system robustness, programmability, portability, and administrative concerns. Of these, programmability is the least well understood and perceived to be the most problematic. It has been suggested that an "expertise gap" is at the heart of the problem in HPC application development. Preliminary results from research conducted by Sun Microsystems and other participants in the HPCS program confirm that such an "expertise gap" does exist and does exert a significant confounding influence on HPC application development. Further, the nature of the "expertise gap" appears not to be amenable to previously proposed solutions such as "more education" and "more people." A productivity improvement of the scale sought by the HPCS program will require fundamental transformations in the way HPC applications are developed and maintained.
Selecting Actions for Resource-bounded Information Extraction using Reinforcement Learning
Given a database with missing or uncertain content, our goal is to correct and fill the database by extracting specific information from a large corpus such as the Web, and to do so under resource limitations. We formulate the information gathering task as a series of choices among alternative, resource-consuming actions and use reinforcement learning to select the best action at each time step. We use the temporal difference Q-learning method to train the function that selects these actions, and compare it to an online, error-driven algorithm called SampleRank. We present a system that finds information such as email, job title and department affiliation for the faculty at our university, and show that the learning-based approach accomplishes this task efficiently under a limited action budget. Our evaluations show that we can obtain 92.4% of the final F1, by only using 14.3% of all possible actions.
Grating-Coupler Based Low-Loss Optical Interlayer Coupling
IEEE Group IV Photonics
Applying dataflow and transactions to Lee routing.
Programming multicore shared-memory systems is a challenging combination of exposing parallelism in your program and communicating between the resulting parallel paths of execution. The burden of communication can introduce complexity that is hard to separate from the pure expression of the algorithm and can negate the performance that is gained from parallelism. We are extending the Scala language with dataflow for creating parallelism and transactions for the controlled mutation of shared state. We take an early look at applying this work to Lee's algorithm for routing circuit boards and consider the potential benefits of programming with this system with regard to the elegance of expression and the resulting performance. We show how our approach reduces the number of lines of code and synchronisation operations needed, at the same time as improving real-world performance.
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy than is usual in the feature selection literature; instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples.
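The family of criteria referred to here is commonly written in the following form, stated from the standard formulation in this line of work (see the paper for the exact derivation and the full mapping of published heuristics onto it).

```latex
% Score for a candidate feature X_k given the already-selected set S and class Y:
% a relevancy term corrected by redundancy and conditional-redundancy terms.
J(X_k) \;=\; I(X_k;Y)
  \;-\; \beta \sum_{X_j \in S} I(X_k;X_j)
  \;+\; \gamma \sum_{X_j \in S} I(X_k;X_j \mid Y)

% Particular (\beta, \gamma) choices recover the published heuristics; for example
% the JMI criterion amounts to scoring
J_{\mathrm{JMI}}(X_k) \;=\; \sum_{X_j \in S} I\bigl(X_k X_j ; Y\bigr)
```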
Integration and packaging of a macrochip with silicon nanophotonic links
in press, IEEE Journal of Selected Topics in Quantum Electronics, special issue on Packaging and Integration technologies for Optical MEMS/NEMS, Optoelectronic and Nanophotonic Devices, 2011.
Dense WDM Silicon Photonic Interconnects for Compact High-end Computing Systems
IEEE Photonics Society Winter Topicals, 2011.
Hybrid integrated silicon photonic bridge chips for ultralow energy inter-chip communications
Proceedings, SPIE Photonics West, 2011.
10Gbps, 530 fJ/b Optical Transceiver Circuits in 40nm CMOS
IEEE Symp. VLSI Circ.
Formal Machine-Checked Verification of a Real Transactional Memory Algorithm
Slides for talk at WTTM 2011
Learning to Select Actions for Resource-bounded Information Extraction
Given a database with missing or uncertain information, our goal is to extract specific information from a large corpus such as the Web under limited resources. We cast the information gathering task as a series of alternative, resource-consuming actions to choose from and propose a new algorithm for learning to select the best action to perform at each time step. The function that selects these actions is trained using an online, error-driven algorithm called SampleRank. We present a system that finds the faculty directory pages of top Computer Science departments in the U.S. and show that the learning-based approach accomplishes this task very efficiently under a limited action budget, obtaining approximately 90% of the overall F1 using less than 2% of actions. If we apply our method to the task of filling missing values in a large scale database with millions of rows and a large number of columns, the system can obtain just the required information from the Web very efficiently.
Architecture of the JInterval library
This is a translation from Russian of the paper for the conference "Statistics, Simulation, Optimization - 2011" to be held in Chelyabinsk, Russia. The JInterval library is an interval arithmetic library for Java. It was developed in collaboration with the Altai University, Barnaul, Russia. This paper presents the key architectural decisions made when designing JInterval library. It discusses compliance with functional requirements of the library as well as the current status of JInterval.
25Gb/s 1V-driving CMOS ring modulator with integrated thermal tuning
We report a high-speed ring modulator that has many of the ideal qualities for optical interconnect in future exascale supercomputers. The device was fabricated in a 130nm SOI CMOS process, with 7.5µm ring radius. Its high-speed section, employing a PN junction that works in carrier-depletion mode, enables 25Gb/s modulation and an extinction ratio >5dB with only 1V peak-to-peak driving. Its thermal tuning section allows the device to work in a broad wavelength range, with a tuning efficiency of 0.19nm/mW. Based on microwave characterization and circuit modeling, the modulation energy is estimated at ~7fJ/bit. The whole device fits in a compact 400µm² footprint.
Towards Formally Specifying and Verifying Transactional Memory
Over the last decade, great progress has been made in developing practical transactional memory (TM) implementations, but relatively little attention has been paid to precisely specifying what it means for them to be correct, or formally proving that they are.
In this paper, we present TMS1 (Transactional Memory Specification 1), a precise specification of correct behaviour of a TM runtime library. TMS1 targets TM runtimes used to implement transactional features in an unmanaged programming language such as C or C++. In such contexts, even transactions that ultimately abort must observe consistent states of memory; otherwise, unrecoverable errors such as divide-by-zero may occur before a transaction aborts, even in a correct program in which the error would not be possible if transactions were executed atomically.
We specify TMS1 precisely using an I/O automaton (IOA). This approach enables us to also model TM implementations using IOAs and to construct fully formal and machine-checked correctness proofs for them using well established proof techniques and tools.
We outline key requirements for a TM system. To avoid precluding any implementation that satisfies these requirements, we specify TMS1 to be as general as we can, consistent with these requirements. The cost of such generality is that the condition does not map closely to intuition about common TM implementation techniques, and thus it is difficult to prove that such implementations satisfy the condition.
To address this concern, we present TMS2, a more restrictive condition that more closely reflects intuition about common TM implementation techniques. We present a simulation proof that TMS2 implements TMS1, thus showing that to prove that an implementation satisfies TMS1, it suffices to prove that it satisfies TMS2. We have formalised and verified this proof using the PVS specification and verification system.
Relating similar terms for information retrieval
A resource analyzer selects a resource (e.g., a document) from a grouping of resources. The grouping of resources can be any type of social tagging system used for information retrieval. The selected resource has an assigned uncontrolled tag and an assigned controlled tag. The controlled tag is a term derived from a controlled vocabulary of terms. Having selected the resource for analysis, the resource analyzer identifies a first set of resources in the grouping of resources that have also been assigned the same value as the uncontrolled tag of the selected resource. Similarly, the resource analyzer identifies a second set of resources in the grouping of resources that have also been assigned the same value as the controlled tag. With this information, the resource analyzer then produces a comparison result indicative of a similarity between the first set of resources and the second set of resources.
System Considerations for Capacitive Chip-to-Chip Signaling
This paper is a submission to the IEEE Radio Frequency Integration Technology (RFIT) conference, http://www.ieee-rfit.org. This is an invited submission to be presented at a special session on "Wireless Replacement of Wireline I/O." Conference Date: Nov 30 - Dec 2, 2011; Location: Beijing, China.
The SOM Family: Virtual Machines for Teaching and Research
The talk gives an overview of the development of a family of Smalltalk virtual machine implementations called SOM (Simple Object Machine). The SOM VM, originating from the University of Aarhus, Denmark, has been ported to several programming languages, exploring different objectives. All of the VM implementations focus on providing an easily accessible workbench for teaching, but have also turned out to be a viable research platform. In this talk, each of the SOM VMs will be briefly described along with the results that were achieved in applying it in teaching at both undergraduate and graduate levels as well as research.
Simple Low-Jitter Scheduler
To appear at High Performance Switching and Routing Conference (HPSR), Cartagena, Spain
SampleRank: Training Factor Graphs with Atomic Gradients
We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleRank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23% error reduction on noun-phrase coreference).
Query-Aware MCMC
Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models.
A Network Architecture for the Web of Things
The "Web of Things" is emerging as an exciting vision for seamlessly integrating everyday objects like home appliances, digital picture frames, health monitoring devices and energy meters into the Internet using the Web's well-known stan- dards and blueprints. The key idea is to represent resources on these devices as URIs and use HTTP verbs (GET, PUT, POST, DELETE) as the uniform interface to manipulate them. Unfortunately, practical considerations such as band- width or energy constraints, rewalls/NATs and mobility pose interesting challenges in the realization of this ideal vi- sion. This paper describes these challenges, identies some potential solutions and presents the design and implemen- tation of a gateway-based network architecture to address these concerns. To the best of our knowledge, it represents the rst attempt within the Web of Things community to tackle these issues in a comprehensive manner.
Conversion of Decimal Strings to Floating-Point Numbers
Although floating-point operations are accurately-implemented in modern computers, the conversion of numbers from strings of decimal digits (text) to floating-point is difficult and often inaccurate. The difficulty of this conversion has even been used by hackers to attack weak points in systems. This report explores text-to-floating-point conversion and discusses possible performance improvements.
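One way to see why the conversion is delicate: most decimal strings have no exact binary floating-point representation, so the parser must round to the nearest representable value, and printing that value exactly exposes the gap. The snippet below illustrates this with standard Java library calls.

```java
import java.math.BigDecimal;

// Why decimal-to-binary conversion is delicate: "0.1" has no exact double
// representation, so parsing must round to the nearest representable value.
final class DecimalConversionDemo {
    public static void main(String[] args) {
        double parsed = Double.parseDouble("0.1");

        // Default printing shows the shortest string that round-trips, hiding the rounding.
        System.out.println(parsed);                  // 0.1

        // The exact value actually stored is slightly larger than one tenth.
        System.out.println(new BigDecimal(parsed));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```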
Exploiting CMOS Manufacturing to Reduce Tuning Requirements for Resonant Optical Devices
IEEE Photonics Journal
MUTS: native Scala constructs for software transactional memory.
In this paper we argue that the current approaches to implementing transactional memory in Scala, while very clean, adversely affect the programmability, readability and maintainability of transactional code. These problems occur out of a desire to avoid making modifications to the Scala compiler. As an alternative we introduce Manchester University Transactions for Scala (MUTS), which instead adds keywords to the Scala compiler to allow for the implementation of transactions through traditional block syntax such as that used in “while” statements. This allows for transactions that do not require a change of syntax style and do not restrict their granularity to whole classes or methods. While implementing MUTS does require some changes to the compiler’s parser, no further changes are required to the compiler. This is achieved by the parser describing the transactions in terms of existing constructs of the abstract syntax tree, and the use of Java Agents to rewrite the resulting class files once the compiler has completed. In addition to being an effective way of implementing transactional memory, this technique has the potential to be used as a light-weight way of adding support for additional Scala functionality to the Scala compiler.
Revisiting Condition Variables and Transactions
In the 6th ACM SIGPLAN Workshop on Transactional Computing (Transact’11), June 2011. Citations: 0.
An Evaluation of Asynchronous Stacks
We present an evaluation of some novel hardware implementations of a stack. All designs are asynchronous, fast, and energy efficient, while occupying modest area. We implemented a hybrid of two stack designs that can contain 42 data items with a family of GasP circuits. Measurements from the actual chip show that the chip functions correctly at speeds of up to 2.7 GHz in a 180 nm TSMC process at 2V. The energy consumption per stack operation depends on the number of data movements in the stack, which grows very slowly with the number of data items in the stack. We present a simple technique to measure separately the dynamic and static energy consumption of the complete chip as well as individual data movements in the stack. The average dynamic energy per move in the stack varies between 6pJ and 8pJ depending on the type of move.
Revisiting Condition Variables and Transactions
Prior condition synchronization primitives for memory transactions either force waiting transactions to abort (the retry construct), or force them to commit (also called punctuation in the literature). Although these primitives are useful in some settings, they do not enable programmers to conveniently express idioms that require synchronous communication (e.g., n-way rendezvous operations) between transactions. We present xCondition, a new form of condition variable that neither forces transactions to abort, nor to commit. Instead, an xCondition creates dependencies between the waiting and the corresponding notifying transactions such that the waiter can commit only if the corresponding notifier commits. If waiters and notifiers form dependency cycles (for instance, in synchronous communication idioms), they must commit or abort together. The xCondition construct builds on our earlier work on transaction communicators. We describe how to use xConditions in conjunction with communicators to enable effective coordination and communication between concurrent transactions. We illustrate the use of xConditions, and describe their implementation in the Maxine VM.
Synchronizer Performance in Deep Sub-Micron Technology
17th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2011), April 2011. We show that the performance characteristics of synchronizer circuits track fabrication feature size reductions in a similar manner to the fan-out-of-four, FO4, inverter delay. We compare a variety of flip-flop circuit designs to a reference cross-coupled inverter circuit and show that flip-flops specifically designed for synchronizer use outperform regular data path flip-flops with the progression of fabrication processes. However, care must be taken to compare circuits in each technology, because additional circuit features have often been added to flip-flop cells with each generation of process. These added features, for example to improve test coverage and facilitate clock selection, frequently degrade synchronizer performance. We present a new synchronizer circuit that performs almost as well as the cross-coupled inverter circuit and has reduced sensitivity to voltage supply variation.
Progress in low-power switched optical interconnects
in press, IEEE Journal of Selected Topics in Quantum Electronics, special issue on Green Photonics, 2011.
On The Power of Hardware Transactional Memory to Simplify Memory Management
Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate several new HTM-based algorithms for the “Dynamic Collect” problem that lies at the heart of many modern memory management algorithms. We demonstrate that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches. Despite recent theoretical arguments that HTM provides no worst-case advantages, our results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.
High-efficiency 25Gb/s CMOS ring modulator with integrated thermal tuning
We report a 25Gb/s ring modulator with integrated thermal tuning fabricated in a 130nm CMOS process. With 2Vpp modulation, the optical eye shows >6dB extinction ratio. Modulation energy is estimated <24fJ/bit from circuit modeling.
Max Cut Results, 2011
Results for the Max Cut problem from 2011, compared to the state of the art in 2011
A framework for reasoning about inherent parallelism in modern object-oriented languages
With the emergence of multi-core processors into the mainstream, parallel programming is no longer the specialized domain it once was. There is a growing need for systems to allow programmers to more easily reason about data dependencies and inherent parallelism in general purpose programs. Many of these programs are written in popular imperative programming languages like Java and C#. In this thesis I present a system for reasoning about side-effects of evaluation in an abstract and composable manner that is suitable for use by both programmers and automated tools such as compilers. The goal of developing such a system is both to facilitate the automatic exploitation of the inherent parallelism present in imperative programs and to allow programmers to reason about dependencies which may be limiting the parallelism available for exploitation in their applications. Previous work on languages and type systems for parallel computing has tended to focus on providing the programmer with tools to facilitate the manual parallelization of programs; programmers must decide when and where it is safe to employ parallelism without the assistance of the compiler or other automated tools. None of the existing systems combine abstraction and composition with parallelization and correctness checking to produce a framework which helps both programmers and automated tools to reason about inherent parallelism. In this work I present a system for abstractly reasoning about side-effects and data dependencies in modern, imperative, object-oriented languages using a type and effect system based on ideas from Ownership Types. I have developed sufficient conditions for the safe, automated detection and exploitation of a number of task, data and loop parallelism patterns in terms of ownership relationships. To validate my work, I have applied my ideas to the C# version 3.0 language to produce a language extension called Zal. I have implemented a compiler for the Zal language as an extension of the GPC# research compiler as a proof of concept of my system. I have used it to parallelize a number of real-world applications to demonstrate the feasibility of my proposed approach. In addition to this empirical validation, I present an argument for the correctness of the proposed type system and language semantics, as well as proof sketches for the correctness of the proposed sufficient conditions for parallelization.
A Novel MCM Package Enabling Proximity Communication I-O
I. Shubin, A. Chow, D. Popovic, H. Thacker, M. Giere, R. Hopkins, A. V. Krishnamoorthy, J. G. Mitchell, and J. E. Cunningham, Oracle, San Diego, CA, USA (M. Giere currently with Hewlett-Packard, San Diego, CA).
A novel packaging approach is described that is based on micro-machined features integrated into CMOS chips. Our solution combines two key self-alignment mechanisms for the first time: solder reflow self-alignment and a novel micro-ball and pyramidal pit for passive self-alignment. We report on the demonstration of a MCM package with large footprint semiconductor CMOS chips interconnected by Proximity Communication (PxC), characterization of their high accuracy assembly process, and metrology of the resulting chip misalignment. Our goal is to develop a scalable, lead-free packaging approach by which large NxN PxC-enabled chip arrays are assembled with high precision on organic substrates in a cost effective manner while using industry standard parts and tooling.
Analytical Cache Replacement for Large Caches and Multiple Block Containers
Submitted to the 23rd ACM Symposium on Operating Systems Principles (SOSP), in conjunction with the USENIX conference in Cascais, Portugal.
A 4.6 GHz MDLL with -46dBc reference spur and aperture position tuning
to appear, IEEE International Solid-State Circuits Conference, February 2011.
Experimental studies of the Franz-Keldysh effect in CVD grown GeSi epi on SOI
Electroabsorption from GeSi on silicon-on-insulator (SOI) is expected to have promising potential for optical modulation due to its low power consumption, small footprint, and more importantly, wide spectral bandwidth for wavelength division multiplexing (WDM) applications. Germanium, as a bulk crystal, has a sharp absorption edge with a strong coefficient at the direct band gap close to the C-band wavelength. Unfortunately, when integrated onto silicon, or when alloyed with dilute Si for blueshifting to C-band operation, this strong Franz-Keldysh (FK) effect in bulk Ge is expected to degrade. Here, we report experimental results for GeSi epi grown under a variety of conditions, such as different Si alloy content and selective versus non-selective growth modes, on both silicon and SOI substrates. We compare the measured FK effect to that of bulk Ge material. Reduced pressure CVD growth of GeSi heteroepitaxy with various Si content was studied by different characterization tools: X-ray diffraction (XRD), atomic force microscopy (AFM), secondary ion mass spectrometry (SIMS), Hall measurement and optical transmission/absorption to analyze performance for 1550 nm operation. State-of-the-art GeSi epi with low defect density and low root-mean-square (RMS) roughness was fabricated into p-i-n diodes and tested in a surface-normal geometry. These diodes exhibit a low dark current density of 5 mA/cm2 at 1 V reverse bias, with breakdown voltages of 45 V. Strong electroabsorption was observed in our GeSi alloy with 0.6% Si content, with a maximum absorption contrast of ∆α/α ~5 at 1580 nm at 75 kV/cm.
Using virtual worlds for online role-play
This paper explores the use of virtual worlds to support online role-play as a collaborative activity. It describes some of the challenges involved in building online role-play environments in a virtual world and presents some of the ideas being explored by the project in the role-play applications being developed. Finally, we explore how this can be used within the context of immersive education and 3D collaborative environments.
+SPACES: Serious Games for Role-Playing Government Policies
The paper explores how role-play simulations can be used to support policy discussion and refinement in virtual worlds. Although the work described is set primarily within the context of policy formulation for government, the lessons learnt are applicable to online learning and collaboration within virtual environments. The paper describes how the +Spaces project is using both 2D and 3D virtual spaces to engage with citizens to explore issues relevant to new government policies. It also focuses on the most challenging part of the project, which is to provide environments that can simulate some of the complexities of real life. Some examples of different approaches to simulation in virtual spaces are provided and the issues associated with them are further examined. We conclude that the use of role-play simulations seems to offer the most benefits in terms of providing a generalizable framework for citizens to engage with real issues arising from future policy decisions. Role-plays have also been shown to be a useful tool for engaging learners in the complexities of real-world issues, often generating insights which would not be possible using more conventional techniques.
Immersive Education Spaces using Open Wonderland From Pedagogy through to Practice
This chapter presents a case study of the use of a virtual world environment in UK Higher Education. It reports on the activities carried out as part of the SIMiLLE (System for an Immersive and Mixed reality Language Learning) project to create a culturally sensitive virtual world to support language learning (funded by the UK government JISC program). The SIMiLLE project built on an earlier project called MiRTLE, which created a mixed-reality space for teaching and learning. The aim of the SIMiLLE project was to investigate the technical feasibility and pedagogical value of using virtual environments to provide a realistic socio-cultural setting for language learning interaction. The chapter begins by providing some background information on the Wonderland platform and the MiRTLE project, and then outlines the requirements for SIMiLLE, and how these requirements were supported through the use of a virtual world based on the Open Wonderland virtual world platform. The chapter then presents the framework used for the evaluation of the system, with a particular focus on the importance of incorporating pedagogy into the design of these systems, and how to support good practice with the ever-growing use of 3D virtual environments in formalized education. Finally, the results from the formative and summative evaluations are summarized, and the lessons learnt are presented, which can help inform future uses of immersive education spaces within Higher Education.
A Power-Efficient Network On-Chip Topology
International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip, New York, NY, 2011
A sub-picojoule-per-bit CMOS silicon photonic receiver
Laser Focus World, 2010.
A CAD tool for design and analysis of CNFET circuits
In this paper, we present a graphical computer-aided design (CAD) environment for the design, analysis, and layout of carbon nanotube (CNT) field-effect transistor (CNFET) circuits. This work is motivated by the fact that such a tool currently does not exist in the public domain for researchers. Our tool has been integrated within Electric - a very powerful, yet free CAD system for custom design of integrated circuits (ICs). The tool supports CNFET schematic and layout entry, rule checking, and HSpice/VerilogA netlist generation. We provide users with a customizable CNFET technology library with the ability to specify λ-based design rules. We showcase the capabilities of our tool by demonstrating the design of a CNFET standard cell library and a 16-bit carry-select adder. We hope that the availability of this tool will invigorate the CAD community to explore novel ideas in CNFET circuit design.
Low power silicon photonic transceivers
IEEE Photonics Society Summer Topicals, 2010.
Coupled Data Communication Techniques for High Performance and Low Power
part of the Integrated Circuits and Systems series, A. Chandrakasan, series editor. Springer Verlag, ISBN 978-1-4419-6587-5, 2010.
Macrochip computer systems enabled by silicon photonic interconnects
Proceedings, SPIE Photonics West, vol. 7607: Optoelectronic Interconnects and Component Integration X, 2010.
The integration of silicon photonics and VLSI electronics for computing and switching systems
OSA Photonics in Switching Topical Meeting, 2010.
Employing coherent detection for on-chip six-axis position sensors
9th Annual IEEE Conference on Sensors, November 2010.
On-chip CMOS position sensors using coherent detection
IEEE Asian Solid-State Circuits Conference, November 2010.
Breaking the picojoule-per-bit barrier
Proceedings of the IEEE Photonics Society Annual Meeting, October 2010.
Compacting high-end computing systems with dense WDM silicon photonic interconnects
IEEE Compound Semiconductor IC Symposium, October 2010.
Dynamic Code Evolution for Java
Dynamic code evolution is a technique to update a program while it is running. In an object-oriented language such as Java, this can be seen as replacing a set of classes by new versions. We modified an existing high-performance virtual machine to allow arbitrary changes to the definition of loaded classes. Besides adding and deleting fields and methods, we also allow any kind of changes to the class and interface hierarchy. Our approach focuses on increasing developer productivity during debugging. Changes can be applied at any point a Java program can be suspended. The evaluation section shows that our modifications to the virtual machine have no negative performance impact on normal program execution. The fast in-place instance update algorithm ensures that the performance characteristics of a change are comparable with performing a full garbage collection run. Standard Java development environments are capable of using the code evolution features of our modified virtual machine, so no additional tools are required.
Efficient Coroutines for the Java Platform
Coroutines are non-preemptive light-weight processes. Their advantage over threads is that they do not have to be synchronized because they pass control to each other explicitly and deterministically. Coroutines are therefore an elegant and efficient implementation construct for numerous algorithmic problems. Many mainstream languages and runtime environments, however, do not provide a coroutine implementation. Even if they do, these implementations often have less than optimal performance characteristics because of the tradeoff between run time and memory efficiency. As more and more languages are implemented on top of the Java virtual machine (JVM), many of which provide coroutine-like language features, the need for a coroutine implementation has emerged. We present an implementation of coroutines in the JVM that efficiently handles a large range of workloads. It imposes no overhead for applications that do not use coroutines and performs well for applications that do. For evaluation purposes, we use our coroutines to implement JRuby fibers, which leads to a significant speedup of certain JRuby programs. We also present general benchmarks that show the performance of our approach and outline its run-time and memory characteristics.
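To make the control-transfer idiom in this abstract concrete, here is a minimal sketch using Python generators; it models only the explicit, deterministic hand-off between two cooperating activities and is not the JVM implementation the paper describes.

# Minimal illustration of coroutine-style control transfer using Python
# generators. This models the idiom only; it is not the JVM coroutine
# implementation described in the paper.

def producer(n):
    """Yield n items, handing control back to the consumer after each one."""
    for i in range(n):
        yield i  # explicit, deterministic transfer of control

def consume(n):
    items = []
    for item in producer(n):  # control alternates: producer, consumer, producer, ...
        items.append(item * 2)
    return items

print(consume(5))  # [0, 2, 4, 6, 8]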
Environmental considerations when measuring relative performance of graphics cards.
In this paper we examine some of the environmental conditions that have to be considered when comparing the performance of GPUs to CPUs. The range of these considerations varies greatly, from the differing ages of the hardware used to the effects of running the GPU code before the CPU code within the same binary. The latter of these has some quite surprising effects on the system as a whole. We then go on to test the different hardware performance at matrix multiplication using both their basic linear algebra libraries and hand-coded functions. This is done while respecting the considerations we have described earlier in the paper, and addressing a problem for which the use of the Intel MKL library cannot be argued to be unfair to the CPU.
Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems
This paper addresses the problem of the dynamic scheduling of data-intensive multiprocessor jobs. Each job requires some number of CPUs and some amount of data that needs to be downloaded into a local storage. The completion of each job brings some benefit (utility) to the system, and the goal is to find the optimal scheduling policy that maximizes the average utility per unit of time obtained from all completed jobs. A co-evolutionary solution methodology is proposed, where the utility-based policies for managing local storage and for scheduling jobs onto the available CPUs mutually affect each other’s environments, with both policies being adaptively tuned using the Reinforcement Learning (RL) methodology. The simulation results demonstrate that the performance of the scheduling policies increases significantly as a result of being tuned with RL, to the point that they significantly outperform the best scheduling algorithm suggested in the literature for jobs with soft-deadline utility functions.
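As a rough illustration of the utility-based, RL-tuned scheduling idea summarized above, the sketch below adjusts a value estimate from observed utility per time step and admits jobs with a simple utility-per-CPU heuristic. The state feature, numbers, and update rule are illustrative assumptions, not the paper's formulation.

# Toy sketch of utility-based scheduling tuned by reinforcement learning.
# Jobs have a CPU demand and a utility; all features and constants here are
# illustrative, not the policy proposed in the paper.

ALPHA = 0.1            # learning rate (assumption)
values = {}            # value estimate per discretized state (here: free CPUs)

def estimate(state):
    return values.get(state, 0.0)

def update(state, observed_utility_rate):
    # Move the value estimate toward the utility per time step actually observed.
    values[state] = estimate(state) + ALPHA * (observed_utility_rate - estimate(state))

def schedule(free_cpus, queue):
    """Greedily admit queued jobs while CPUs remain, preferring jobs with the
    highest utility per CPU -- a stand-in for value-guided selection."""
    queue.sort(key=lambda j: j["utility"] / j["cpus"], reverse=True)
    admitted = []
    for job in queue:
        if job["cpus"] <= free_cpus:
            admitted.append(job)
            free_cpus -= job["cpus"]
    return admitted

# One simulated decision step with made-up jobs.
queue = [{"cpus": 2, "utility": 5.0}, {"cpus": 4, "utility": 6.0}, {"cpus": 1, "utility": 3.0}]
running = schedule(8, list(queue))
observed = sum(j["utility"] for j in running) / 10.0  # toy utility per time step
update(8, observed)
print(running, values)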
Optical interconnects in the data center
18th Annual IEEE Symposium on High Performance Interconnects (HOT-I2010), August 2010.
Clocking links in multi-chip packages: a case study
18th Annual IEEE Symposium on High Performance Interconnects, August 2010.
Ultra-low power silicon photonic transceivers for inter/intra-chip interconnects
Proceedings SPIE Optics + Photonics, August 2010.
How good is a span of terms? Exploiting proximity to improve web retrieval
Ranking search results is a fundamental problem in information retrieval. In this paper we explore whether the use of proximity and phrase information can improve web retrieval accuracy. We build on existing research by incorporating novel ranking features based on flexible proximity terms with recent state-of-the-art machine learning ranking models. We introduce a method of determining the goodness of a set of proximity terms that takes advantage of the structured nature of web documents, document metadata, and phrasal information from search engine user query logs. We perform experiments on a large real-world Web data collection and show that using the goodness score of flexible proximity terms can improve ranking accuracy over state-of-the-art ranking methods by as much as 13%. We also show that we can improve accuracy on the hardest queries by as much as 9% relative to state-of-the-art approaches.
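One classic proximity signal of the kind this line of work builds on is the length of the smallest document window covering all query terms; the sketch below computes it. This is a generic illustration only, not the goodness score defined in the paper.

# A simple proximity feature: the length of the smallest window in a document
# that covers all query terms. Generic illustration, not the paper's measure.

def min_cover_window(doc_tokens, query_terms):
    query = set(query_terms)
    best = None
    for start in range(len(doc_tokens)):
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in query:
                seen.add(doc_tokens[end])
            if seen == query:
                span = end - start + 1
                best = span if best is None else min(best, span)
                break
    return best  # None if some query term never occurs

doc = "cheap flights to new york from london".split()
print(min_cover_window(doc, ["new", "york", "flights"]))  # 4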
Wafer-Testing of Optoelectronic-Gigascale CMOS Integrated Circuits
Gigascale integrated (GSI) chips with high-bandwidth, integrated optoelectronic (OE) and photonic components are an emerging technology. In this paper, we present the prospects and opportunities for wafer-testing of chips with electrical and optical I/O interconnects. The issues and requirements of testing OE-GSI chips during high-volume manufacturing are identified and discussed. Two probe substrate technologies based on microelectromechanical systems (MEMS) for simultaneously interfacing a multitude of surface-normal optical I/Os and high-density electrical I/Os are detailed. The first probe substrate comprises vertically compliant probes for contacting electrical I/Os and grating-in-waveguide optical probes for optical I/O coupling. The second MEMS probe module uses microsockets and through-substrate vias (TSVs) to contact pillar-shaped electrical and optical I/Os and to redistribute the signals, respectively.
Optical interconnect for high-end computer systems
IEEE Design and Test of Computers, Vol. 27, No. 4, July/August 2010, pp. 10-19.
Resource-bounded Information Extraction: Acquiring Missing Feature Values On Demand
We present a general framework for the task of extracting specific information "on demand" from a large corpus such as the Web under resource constraints. Given a database with missing or uncertain information, the proposed system automatically formulates queries, issues them to a search interface, selects a subset of the documents, extracts the required information from them, and fills the missing values in the original database. We also exploit inherent dependency within the data to obtain useful information with fewer computational resources. We build such a system in the citation database domain that extracts the missing publication years using limited resources from the Web. We discuss a probabilistic approach for this task and present first results. The main contribution of this paper is to propose a general, comprehensive architecture for designing a system adaptable to different domains.
Silicon photonic network architectures for scalable, power-efficient multi-chip systems
Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture (ISCA), 2010.
Flip-chip integrated silicon photonic bridge chips for sub-picojoule per bit optical links
accepted and to appear at IEEE Electronic Components and Technology Conference (ECTC2010), June 2010.
You Are Not Alone: Breaking Transaction Isolation
In the 3rd International Workshop on Multicore Software Engineering (IWMSE10).
Thesis: Debugging and Profiling of Transactional Programs
Transactional memory (TM) has become increasingly popular in recent years as a promising programming paradigm for writing correct and scalable concurrent programs. Despite its popularity, there has been very little work on how to debug and profile transactional programs. This dissertation addresses this situation by exploring the debugging and profiling needs of transactional programs, explaining how the tools should change to support these needs, and implementing preliminary infrastructure to support this change. Defense date: Tuesday, March 23rd, 4pm, Lubrano Conference Room, CIT Building, Brown University. Includes a few demos for profiling transactional programs using the T-PASS prototype.
A Package Demonstration with Solder Free Compliant Flexible Interconnects.
I. Shubin, A. Chow, J. Cunningham, M. Giere, N. Nettleton, N. Pinckney, J. Shi, J. Simons, and D. Douglas, Oracle, San Diego, CA, USA; E. M. Chow, D. Debruyker, B. Cheng, and G. Anderson, Palo Alto Research Center (PARC), Palo Alto, CA, USA. Flexible, stress-engineered spring interconnects are a novel technology potentially enabling room-temperature assembly approaches to building highly integrated and multi-chip modules (MCMs). Such interconnects are an essential solder-free technology facilitating MCM package diagnostics and rework. Previously, we demonstrated the performance, functionality, and reliability of compliant micro-spring interconnects under temperature cycling, humidity bias and high-current soak. Here, we demonstrate for the first time a package in which the first-level conventional fine-pitch C4 solder bump interconnects are replaced by arrays of microsprings. Dedicated CMOS integrated circuits (ICs) have been assembled onto substrates using these integrated microsprings. Metrology modules on the ICs are designed and used to characterize the connectivity and resistance of each micro-spring site.
A macrochip interconnection network enabled by silicon nanophotonic devices
Journal of Nanoscience and Nanotechnology, Special Issue on Nanophotonics and Nanooptics, Vol. 10, Number 3, March 2010, pp. 1616-1625.
Debugging applications at resource constrained virtual machines using dynamically installable lightweight agents
A system for debugging applications at resource-constrained virtual machines may include a target device configured to host a lightweight debug agent to obtain debug information from one or more threads of execution at a virtual machine executing at the target device, and a debug controller. The lightweight debug agent may include a plurality of independently deployable modules. The debug controller may be configured to select one or more of the modules for deployment at the virtual machine for a debug session initiated to debug a targeted thread, to deploy the selected modules at the virtual machine for the debug session, and to receive debug information related to the targeted thread from the lightweight debug agent during the session.
How Open Source and Collaboration aid Innovation in VLSI CAD
CRAW (Committee on the Status of Women in Computing Research) Distinguished Lecture, February 2010.
High bandwidth and low energy on-chip signaling with adaptive pre-emphasis in 90nm CMOS
Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC2010), February 2010, pp. 182-183.
Ultra-low-energy all-CMOS modulator integrated with driver
Optics Express, Vol. 18, Number 3, 2010, pp. 3059-3070.
Ultralow-power silicon photonic interconnect for high-performance computing systems
The Ultra-performance Nanophotonic Intrachip Communication (UNIC) project aims to achieve unprecedented high-density, low-power, large-bandwidth, and low-latency optical interconnect for highly compact supercomputer systems. This project, which started in 2008, sets extremely aggressive goals on power consumption and footprint for optical devices and the integrated VLSI circuits. In this paper we discuss our challenges and present some of our first-year achievements, including a 320 fJ/bit hybrid-bonded optical transmitter and a 690 fJ/bit hybrid-bonded optical receiver. The optical transmitter was made of a Si microring modulator flip-chip bonded to a 90nm CMOS driver with digital clocking. With only 1.6mW power consumption measured from the power supply voltages and currents, the transmitter exhibits a wide open eye with extinction ratio >7dB at 5Gb/s. The receiver was made of a Ge waveguide detector flip-chip bonded to a 90nm CMOS digitally clocked receiver circuit. With 3.45mW power consumption, the integrated receiver demonstrated -18.9dBm sensitivity at 5Gb/s for a BER of 10^-12. In addition, we discuss our Mux/Demux strategy and present our devices with small footprints and low tuning energy.
A sub-picojoule-per-bit CMOS photonic receiver for densely integrated systems
Optics Express, Vol. 18, Number 1, 2010, pp. 204-211.
A Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi-Processors
International Workshop on Network on Chip Architectures (NoCArc'09), New York, NY, Dec 12, 2009
Ultralow-Power High-Performance Si Photonic Transmitter
We report a 320fJ/bit transmitter made of a Si microring modulator flip-chip bonded to a CMOS driver. The transmitter consumes only 1.6mW power, and exhibits a wide open eye with extinction ratio >7dB at 5Gb/s.
Circuits for silicon photonics on a 'macrochip'
Digest of Technical Papers, IEEE Asian Solid-State Circuits Conference (ASSCC2009), November 2009, pp. 17-20.
A test platform for thermal, electrical, and mechanical characterization of packages
42nd International Symposium on Microelectronics (IMAPS), November 2009.
Improving Software Quality with Parfait
Parfait is a static bug-checking tool for C/C++ source code, designed to be both scalable and precise. Requirements for this tool were derived from interaction with the Solaris(TM) operating system team, where millions of lines of source code must be checked in a time-efficient manner, with minimal noise and a low cost of integration into the build process.
Internally at Sun various software organizations are using Parfait to analyse thousands to millions of lines of code, with over 500 buffer overflows found and fixed. Assisted by its graphical web-based user interface, both developers and managers are able to traverse bug data in a quick and easy way. Internal feedback from the various organizations allows us to improve the tool on a regular basis.
Externally, we and others are using Parfait to analyse open source code, including the open source operating system kernels OpenSolaris(TM), Linux and OpenBSD. Bugs found have been submitted to their respective communities and are normally fixed in a timely fashion. Presentation at the Software Assurance Forum, November 2009.
Benchmarking Static C Bug-Checking Tools
One of the problems with the large number of static bug-checking tools is that it is hard for users (developers and managers) to determine which tool best fits their organisation; quantifying precision of a tool and its scalability is necessary. Precision is the ratio of the number of bugs correctly reported to the total number of bugs reported by a tool. Scalability is the ability of a tool to scale proportionally in runtime relative to the size of the input codebase.
Another problem that quality assurance engineers have with these tools is the lack of information on what bugs are missed in the code; quantifying recall of the tool is also needed. Recall is the ratio of the number of bugs correctly reported by a tool to the total number of bugs in a codebase. Taking into account both, precision and recall, gives a measure of a tool's accuracy. Accuracy is the ability of a bug-checking tool to report correct bugs while at the same time holding back incorrect ones.
In Proceedings of "The Second Static Analysis Tool Exposition (SATE) 2009" Workshop, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-287, June, 2010.
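The precision and recall definitions above translate directly into a short computation; the counts and the F1-style combination below are purely illustrative.

# Precision, recall, and one combined accuracy figure for a bug-checking tool,
# following the definitions above. The counts are invented for illustration.

def precision(correct_reported, total_reported):
    return correct_reported / total_reported

def recall(correct_reported, total_bugs_in_codebase):
    return correct_reported / total_bugs_in_codebase

def f1(p, r):
    # A common way to fold precision and recall into one number; the papers'
    # exact accuracy measure may differ.
    return 2 * p * r / (p + r)

p = precision(correct_reported=40, total_reported=60)        # 20 false positives
r = recall(correct_reported=40, total_bugs_in_codebase=100)  # 60 bugs missed
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.67 0.4 0.5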
Early Experience with a Commercial Hardware Transactional Memory Implementation
We report on our experience with the hardware transactional memory (HTM) feature of two revisions of a prototype multicore processor. Our experience includes a number of promising results using HTM to improve performance in a variety of contexts, and also identifies some ways in which the feature could be improved to make it even better. We give detailed accounts of our experiences, sharing techniques we used to achieve the results we have, as well as describing challenges we faced in doing so. This technical report expands on our ASPLOS paper [9], providing more detail and reporting on additional work conducted since that paper was written.
An ultra-low power all CMOS Si photonic transmitter
OSA Frontiers in Optics, postdeadline session, October 2009.
Adaptive Optimization of the Sun Java Real-Time System Garbage Collector
Garbage collection (GC) is one of the largest sources of unpredictability in Java(TM) applications, and a real-time virtual machine must use garbage collection algorithms that minimize delays to real-time threads and at the same time maximize the overall application’s throughput. In order to achieve the optimal tradeoff between these conflicting objectives, the GC cycle (which needs to take place periodically in order to free the memory no longer used by the application) needs to be triggered at the optimal time: if it is triggered too soon then the application’s throughput will decrease unnecessarily, while if it is triggered too late then the application can run out of free memory and block real-time threads unnecessarily. Starting with Sun Java Real-Time System 2.0 (Java RTS), a new real-time garbage collector (RTGC) is available. One of the key RTGC parameters is the StartupMemoryThreshold, which determines how low the free memory in the system can fall before a garbage collection is triggered. This paper presents a framework for dynamically adapting the StartupMemoryThreshold for achieving the optimal balance between the application’s throughput and pause time, which was integrated into the beta release of Java RTS 2.2. An experimental evaluation of this framework using the SPECjbb2005 benchmark confirmed its effectiveness. This framework can be used in conjunction with any concurrent or time-based incremental garbage collector.
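A minimal sketch of the feedback idea behind adapting a StartupMemoryThreshold-style parameter: trigger GC earlier after a cycle in which the application blocked, and later after a cycle that left plenty of memory free. The step size and slack value are assumptions; this is not the algorithm shipped in Java RTS 2.2.

# Conceptual feedback controller for a GC trigger threshold, expressed as a
# fraction of the heap that must remain free before a collection starts.
# Illustrative only; not the Java RTS adaptation framework.

def adapt_threshold(threshold, blocked, free_at_gc_end, heap_size,
                    step=0.05, slack=0.30):
    if blocked:
        # GC started too late: trigger earlier next cycle.
        threshold = min(threshold + step, 0.9)
    elif free_at_gc_end / heap_size > slack:
        # GC started earlier than necessary: reclaim some throughput.
        threshold = max(threshold - step, 0.05)
    return threshold

t = 0.20  # start GC when free memory falls below 20% of the heap (assumed)
t = adapt_threshold(t, blocked=True, free_at_gc_end=0, heap_size=512)
t = adapt_threshold(t, blocked=False, free_at_gc_end=300, heap_size=512)
print(round(t, 2))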
A Silicon photonic WDM network for high performance macrochip communications
Proceedings, SPIE Photonics West, Vol. 7221: Photonics packaging, integration, and interconnects IX, 2009.
Lazy Continuations for Java Virtual Machines
Continuations, or 'the rest of the computation', are a concept that is most often used in the context of functional and dynamic programming languages. Implementations of such languages that work on top of the Java virtual machine (JVM) have traditionally been complicated by the lack of continuations because they must be simulated. We propose an implementation of continuations in the Java virtual machine with a lazy or on-demand approach. Our system imposes zero run-time overhead as long as no activations need to be saved and restored and performs well when continuations are used. Although our implementation can be used from Java code directly, it is mainly intended to facilitate the creation of frameworks that allow other functional or dynamic languages to be executed on a Java virtual machine. As there are no widely used benchmarks for continuation functionality on JVMs, we developed synthetic benchmarks that show the expected costs of the most important operations depending on various parameters.
Simple Fairness Protocols for Daisy Chain Interconnects
Symposium on High-Performance Interconnects (HotI'09), New York
BegBunch: Benchmarking for C Bug Detection Tools
Benchmarks for bug detection tools are still in their infancy. Though in recent years various tools and techniques were introduced, little effort has been spent on creating a benchmark suite and a harness for a consistent quantitative and qualitative performance measurement. For assessing the performance of a bug detection tool and determining which tool is better than another for the type of code to be looked at, the following questions arise: 1) how many bugs are correctly found, 2) what is the tool's average false positive rate, 3) how many bugs are missed by the tool altogether, and 4) does the tool scale. In this paper we present our contribution to the C bug detection community: two benchmark suites that allow developers and users to evaluate accuracy and scalability of a given tool. The two suites contain buggy, mature open source code; bugs are representative of "real world" bugs. A harness accompanies each benchmark suite to compute automatically qualitative and quantitative performance of a bug detection tool. BegBunch has been tested to run on the Solaris(TM), Mac OS X and Linux operating systems. We show the generality of the harness by evaluating it with our own Parfait and three publicly available bug detection tools developed by others.
Productive Petascale Computing: Requirements, Hardware, and Software
Supercomputer designers traditionally focus on low-level hardware performance criteria such as CPU cycle speed, disk bandwidth, and memory latency. The High-Performance Computing (HPC) community has more recently begun to realize that escalating hardware performance is, by itself, contributing less and less to real productivity—the ability to develop and deploy high-performance supercomputer applications at acceptable time and cost.
The Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) initiative challenged industry vendors to design a new generation of supercomputers that would deliver a 10x improvement in this newly acknowledged but poorly understood domain of real productivity. Sun Microsystems, choosing to abandon customary evolutionary approaches, responded with two revolutionary decisions. The first was to investigate the nature of supercomputer productivity in the full context of use, which includes people, organizations, goals, practices, and skills as well as processors, disks, memory, and software. The second decision was to rethink completely the design of supercomputing systems, informed by productivity-based requirements and driven by recent technological breakthroughs. Crucial to the implementation of these decisions was the establishment of multidisciplinary, closely collaborating teams that conducted research into productivity and developed the many closely intertwined design decisions needed to meet DARPA’s challenge.
Among the most significant results from Sun’s productivity research was a detailed diagnosis of software development as the dominant barrier to productivity improvements in the HPC community. The level of expertise required, combined with the amount of effort needed to develop conventional HPC codes, has already created a crisis of productivity. Even worse, there is no path forward within the existing paradigm that will significantly increase productivity as hardware systems scale up. The same issues also prevent HPC from “scaling out” to a broader class of applications. This diagnosis led to design requirements that address specific issues behind the expertise and effort bottlenecks.
Sun’s design teams explored complex, system-wide tradeoffs needed to meet these requirements in all aspects of the design, including reliability, performance, programmability, and ease of administration. These tradeoffs drew on technological advances in massive chip multithreading, extremely high-performance interconnects, resource virtualization, and programming language design. The outcome was the design for a machine to operate at petascale, with extremely high reliability and a greatly simplified programming model. Although this design supports existing codes and software technologies—crucial requirements—it also anticipates that the greatest productivity breakthroughs will follow from dramatic changes in how HPC codes are developed, changes that require a system of the type designed by Sun’s HPCS team.
A Reinforcement Learning Framework for Utility-Based Scheduling in Resource-Constrained Systems
This paper presents a general methodology for online scheduling of parallel jobs onto multi-processor servers in a soft real-time environment, where the final utility of each job decreases with the job completion time. A solution approach is presented where each server uses Reinforcement Learning for tuning its own value function, which predicts the average future utility per time step obtained from completed jobs based on the dynamically observed state information. The server then selects jobs from its job queue, possibly preempting some currently running jobs and “squeezing” some jobs into fewer CPUs than they ideally require to maximize the value of the resulting server state. The experimental results demonstrate the feasibility and benefits of the proposed approach.
3D visualization of integrated circuits in the Electric VLSI design system
User Track poster, 2009 Design Automation Conference, July 2009.
Computing microsystems based on silicon photonic interconnects
Proceedings of the IEEE, Vol. 97, Issue 7, July 2009, pp. 1337-1361.
Sun Small Programmable Object Technology (Sun SPOTs) and Sensor.Network
Presentation and demo at the Sensor Web Enablement (SWE) working group meeting of the Open Geospatial Consortium (OGC), Cambridge, MA, Jun 23, 2009.
Integrating novel packaging technologies for large-scale computer systems
ASME/Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS Conference (Interpack2009), June 2009.
Flat tree networks
ASME/Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS Conference (Interpack2009), June 2009.
JavaOne Minute with Vipul Gupta
A video demonstrating Sensor.Network filmed live during JavaOne 2009, Jun, 2009.
Generating Transparent, Steerable Recommendations from Textual Descriptions of Items
We propose a recommendation technique that works by collecting text descriptions of the items that we want to recommend and then using this textual aura to compute the similarity between items using techniques drawn from information retrieval. We show how this representation can be used to explain the similarities between items using terms from the textual aura and further how it can be used to steer the recommender. We'll describe a system that demonstrates these techniques and we'll detail some preliminary experiments aimed at evaluating the quality of the recommendations and the effectiveness of the explanations of item similarity.
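A standard information-retrieval way to compute item similarity from text descriptions, of the kind this abstract refers to, is TF-IDF weighting with cosine similarity. The sketch below uses scikit-learn (an assumed dependency) and made-up item descriptions; it is not the authors' system. The terms shared by two similar items can double as a simple explanation of why they were linked.

# Item-to-item similarity from text descriptions via TF-IDF and cosine
# similarity. Illustrative sketch with invented data; scikit-learn assumed.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = {
    "item_a": "acoustic folk guitar with warm vocals and gentle strumming",
    "item_b": "folk singer with acoustic guitar and soft vocals",
    "item_c": "loud electronic dance track with heavy bass",
}

names = list(descriptions)
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([descriptions[n] for n in names])

sims = cosine_similarity(matrix)
print(names[0], names[1], round(sims[0, 1], 2))

# Terms present in both of the two most similar items serve as a crude explanation.
terms = vectorizer.get_feature_names_out()
shared = [t for t in terms
          if matrix[0, vectorizer.vocabulary_[t]] > 0
          and matrix[1, vectorizer.vocabulary_[t]] > 0]
print("shared terms:", shared)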
Proximity interconnect flip-chip package with micron chip-to-chip alignment tolerances
IEEE Electronic Components and Technology Conference (ECTC2009), May 2009.
A modular synchronizing FIFO for NoCs
3rd ACM/IEEE International Symposium on Networks-on-Chip (NOCs2009), May 2009.
Hierarchical Filesystems Are Dead
For over forty years, we have assumed hierarchical file system namespaces. These namespaces were a rudimentary attempt at simple organization. As users have begun to interact with increasing amounts of data and are increasingly demanding search capability, such a simple hierarchical model has outlasted its usefulness. For this reason, we should design file systems whose organizations map to the ways we access and manipulate data now. We present a new file system architecture in which we replace the hierarchical namespace with a tagged, search-based one.
Novel Packaging with Rematable Spring Interconnect Chips for MCMs
IEEE Electronic Components and Technology Conference (ECTC2009), May 2009.
BGA package co-integration of electrical, optical, and capacitive interconnects
IEEE Electronic Components and Technology Conference (ECTC2009), May 2009.
Synchroniser Behaviour and Analysis
15th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2009), May 2009.
The integration of silicon photonics and VLSI electronics for computing systems intra-connect
Proceedings, SPIE Photonics West, Vol. 7220: Silicon photonics IV, 2009.
Communication in macrochips using silicon photonics for high-performance and low-energy computing
5th Annual IEEE Int'l Symposium on VLSI Design, Automation, and Test (VLSI-DAT2009), April 2009.
Enabling Technologies for Multi-Chip Integration using Proximity Communication
5th Annual IEEE Int'l Symposium on VLSI Design, Automation, and Test (VLSI-DAT2009), April 2009.
Exceptions and Transactions in C++
In the 1st USENIX Workshop on Hot Topics in Parallelism (HotPar’09).
Trends from Ten Years of Soft Error Experimentation
Proceedings, System Effects of Logic Soft Errors (SELSE2009), March 2009.
Experiments with a Solar-powered Sun SPOT
Sun SPOTs are small, battery-powered, wireless embedded devices that can autonomically sense and respond to their environment. These devices have the potential to revolutionize a broad spectrum of applications - environmental monitoring, asset tracking, proactive health care, intelligent agriculture, military surveillance, etc. Many of these require the device to run for long periods (months) using a combination of duty cycling and renewable energy sources (e.g., solar panels). This note describes lessons learned while collecting data from a solar-powered SPOT for a period of nearly four weeks.
Anatomy of a Scalable Software Transactional Memory
Existing software transactional memory (STM) implementations often exhibit poor scalability, usually because of nonscalable mechanisms for read sharing, transactional consistency, and privatization; some STMs also have nonscalable centralized commit mechanisms. We describe novel techniques to eliminate bottlenecks from all of these mechanisms, and present SkySTM, which employs these techniques. SkySTM is the first STM that supports privatization and scales on modern multicore multiprocessors with hundreds of hardware threads on multiple chips. A central theme in this work is avoiding frequent updates to centralized metadata, especially for multi-chip systems, in which the cost of accessing centralized metadata increases dramatically. A key mechanism we use to do so is a scalable nonzero indicator (SNZI), which was designed for this purpose. A secondary contribution of the paper is a new and simplified SNZI algorithm. Our scalable privatization mechanism imposes only about 4% overhead in low-contention experiments; when contention is higher, the overhead still reaches only 35% with over 250 threads. In contrast, prior approaches have been reported as imposing over 100% overhead in some cases, even with only 8 threads.
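The SNZI object mentioned above exposes arrive, depart, and query operations. The sketch below implements that interface with a single lock-protected counter purely to show the contract; the whole point of SNZI is to provide this interface without such centralized state, which this sketch deliberately does not attempt.

import threading

# A deliberately simple nonzero indicator: a lock-protected counter that
# answers "is the count nonzero?". Interface illustration only; not the
# scalable SNZI algorithm or the simplified variant presented in the paper.

class SimpleNonzeroIndicator:
    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def arrive(self):
        with self._lock:
            self._count += 1

    def depart(self):
        with self._lock:
            self._count -= 1

    def query(self):
        with self._lock:
            return self._count != 0

ind = SimpleNonzeroIndicator()
ind.arrive()
print(ind.query())  # True
ind.depart()
print(ind.query())  # False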
An Exit Hole method for Verified Solution of IVPs for ODEs using Linear Programming for the Search of Tight Bounds
In his survey [5], Nedialkov stated that "Although high-order Taylor series may be reasonably efficient for mildly stiff ODEs, we do not have an interval method suitable for stiff ODEs." This paper is an attempt to find such a method, based on building a positively invariant set in extended state space. A positively invariant set is treated as a geometric generalization of differential inequalities. We construct a positively invariant set from simpler sets which are not positively invariant, but have an exit hole instead. The exit holes of the simpler sets are suppressed during the construction. This paper considers only sets which are polytopes. Linear interval forms are used to evaluate the projection of the ODE velocity vector onto the normals of the polytope facets. This permits the use of Linear Programming in the search for a tighter positively invariant set. The Exit Hole method is illustrated on the stiff Van der Pol ODE.
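The facet check sketched in this abstract corresponds to the usual sub-tangentiality condition for a polytope; in general form (notation mine, not the paper's): given a polytope P = \{ x : a_i^\top x \le b_i,\ i = 1, \dots, m \} and the ODE \dot{x} = f(t, x), the facet F_i = \{ x \in P : a_i^\top x = b_i \} admits no exit when a_i^\top f(t, x) \le 0 for all x \in F_i. If this holds on every facet, P is positively invariant; a facet on which an interval evaluation of a_i^\top f can be positive is an exit hole that the construction must suppress.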
Modeling, Analysis and Throughput Optimization of a Generational Garbage Collector
One of the garbage collectors in Sun's HotSpot Java(TM) Virtual Machine is known as the generational throughput collector, as it was designed to have a large throughput (the fraction of time spent on the application's work rather than on garbage collection). This paper derives an analytical expression for the throughput of this collector in terms of the following key parameters: the sizes of the "Young" and "Old" memory spaces and the value of the tenuring threshold. Based on the derived throughput model, a practical algorithm ThruMax is proposed for tuning the collector's parameters so as to formally maximize its throughput. This algorithm was implemented as an optional feature in an early release of JDK(TM) 7, and its performance was evaluated for various settings of the SPECjbb2005 workload. A consistent improvement in throughput was demonstrated when the ThruMax algorithm was enabled in JDK.
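For reference, the throughput notion used above, the fraction of time spent on the application's work rather than on collection, is simply \text{throughput} = \frac{t_{\text{app}}}{t_{\text{app}} + t_{\text{GC}}}. The paper's contribution is an analytical expression for this quantity in terms of the Young and Old space sizes and the tenuring threshold, which this plain identity does not attempt to reproduce.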
MiRTLE: a mixed reality teaching & learning environment
This technical report describes a project to create a mixed reality teaching and learning environment using the virtual world toolkit Project Wonderland. The purpose of this document is to provide details about the background to the project, its goals and achievements. The intended audience for this document is educators, educational technologists, and others interested in the educational applications of virtual worlds.
Kinesis: A New Approach to Replica Placement in Distributed Storage Systems
Kinesis is a novel data placement model for distributed storage systems. It exemplifies three design principles: structure (division of servers into a few failure-isolated segments), freedom of choice (freedom to allocate the best servers to store and retrieve data based on current resource availability), and scattered distribution (independent, pseudo-random spread of replicas in the system). These design principles enable storage systems to achieve balanced utilization of storage and network resources in the presence of incremental system expansions, failures of single and shared components, and skewed distributions of data size and popularity. In turn, this ability leads to significantly reduced resource provisioning costs, good user-perceived response times, and fast, parallelized recovery from independent and correlated failures. This article validates Kinesis through theoretical analysis, simulations, and experiments on a prototype implementation. Evaluations driven by real-world traces show that Kinesis can significantly outperform the widely used Chain replica-placement strategy in terms of resource requirements, end-to-end delay, and failure recovery.
Prediction-time Active Feature-value Acquisition for Cost-Effective Customer Targeting
In general, the prediction capability of classification models can be enhanced by acquiring additional relevant features for instances. However, in many cases, there is a significant cost associated with this additional information, driving the need for an intelligent acquisition strategy. Motivated by real-world customer targeting domains, we consider the setting where a fixed set of additional features can be acquired for a subset of the instances at test time. We study different acquisition strategies of selecting instances for which to acquire more information, so as to obtain the most improvement in prediction performance per unit cost. We apply our methods to various targeting datasets and show that we can achieve a better prediction performance by actively acquiring features for only a small subset of instances, compared to a random-sampling baseline.
Dealing with Issues in VLSI Interconnect Scaling
IEEE Expert Now online learning course, ISBN 1-4244-1450-4, 2008.
To Yaeko, on the occasion of her retirement
Photo album, Dec 2008.
Fault-tolerant distributed algorithms on VLSI chips
Dagstuhl Seminar 08371 Proceedings, 2008.
A hardware-assisted concurrent & parallel GC algorithm
Tutorial on the Maxwell algorithm (hardware assistance for concurrent and parallel GC) for an external audience. This is a draft for early release to academic collaborators.
VLSI CAD research at Sun Labs
Japan-America Frontiers of Engineering, National Academy of Engineering Symposium, Nov 2008.
High-radix crossbar switches enabled by proximity communication
Proceedings of the 2008 ACM/IEEE conference on Supercomputing, November 2008.
A differential inequalities method for verified solution of IVPs for ODEs using linear programming for the search of tight bounds
13th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic, and Verified Numerical Computing, October 2008.
VLSI tutorial
Tutorial for Dagstuhl Seminar, Fault-tolerant distributed algorithms in VLSI chips, Sept 2008. (Slides.)
TPE: A network of closely coupled computational elements
Sun Microsystems Memo #SML2008-0443, Sep 2008.
Silicon photonic WDM point-to-point network for multi-chip processor interconnects
5th Annual IEEE International Conference on Group IV Photonics (GFP2008), September 2008.
Optical proximity communication in packaged SiPhotonics
5th Annual IEEE International Conference on Group IV Photonics (GFP2008), September 2008.
Synchrony and Asynchrony in VLSI
Tutorial for Dagstuhl Seminar, Fault-tolerant distributed algorithms in VLSI chips, Sept 2008. (Slides.)
Optical interconnects for present and future high-performance computer systems
16th Annual IEEE Symposium on High Performance Interconnects (HOT-I2008), August 2008.
A Mixed Reality Teaching and Learning Environment
This work-in-progress paper describes collaborative research, taking place on three continents, towards creating a 'mixed reality teaching & learning environment' (MiRTLE) that enables teachers and students participating in real-time mixed and online classes to interact with avatar representations of each other. The longer term hypothesis that will be investigated is that avatar representations of teachers and students will help create a sense of shared presence, engendering a sense of community and improving student engagement in online lessons. This paper explores the technology that will underpin such systems by presenting work on the use of a massively multi-user game server, based on Sun's Project Darkstar and Project Wonderland tools, to create a shared teaching environment, illustrating the process by describing the creation of a virtual classroom. We describe the Shanghai NEC eLearning system that will form the platform for the deployment of this work. As these systems will take on an increasingly global reach, we discuss how cross-cultural issues will affect such systems. We conclude by outlining our future plans to test our hypothesis by deploying this technology on a live system with some 15,000 online users.
Ultrascale Nanophotonic Intrachip Communication for high-performance computing systems
Optical Society of America--Integrated Photonics and Nanophotonics Research and Applications (IPNRA2008), July 2008.
Wonderland with kids
CommunityCorner talk at JavaOne, July 2008.
Introducing EclipseLink
The Eclipse Persistence Services Project, more commonly known as EclipseLink, is a comprehensive open source persistence solution. EclipseLink was started by a donation of the full source code and test suites of Oracle's TopLink product. This project brings the experience of over 12 years of commercial usage and feature development to the entire Java community. This evolution into an open source project is now complete and developers will soon have access to the EclipseLink 1.0 release.
The Energy Cost of SSL in Deeply Embedded Systems
As the number of potential applications for tiny, battery-powered, "mote"-like, deeply embedded devices grows, so does the need to simplify and secure interactions with such devices. Embedding a secure web server (capable of HTTP over SSL, aka HTTPS), enables these devices to be monitored and controlled securely via a user-friendly, browser-based interface.
This paper presents the first empirical energy analysis of the Internet's dominant security protocol, SSL, on highly constrained devices. We have enhanced Sizzle, our tiny-footprint HTTPS stack, with energy conserving features and measured its performance on a Telos mote. We show that the key exchange phase, which consumes much more energy than bulk encryption and authentication, amortizes well over the transmission of a few kilobytes of application data. Such amortization is easily attained with features like session reuse and persistent HTTP(S), both of which are supported by Sizzle. The extra energy cost of encrypting and authenticating application data with SSL is around 15%. With the addition of an application-level, duty-cycle based approach to low-power listening for incoming service requests, a pair of alkaline batteries can power Sizzle for over a year under a variety of application scenarios.
Parfait - Designing a Scalable Bug Checker
We present the design of Parfait, a static layered program analysis framework for bug checking, designed for scalability and precision by improving false positive rates and scaling to millions of lines of code. The Parfait framework is inherently parallelizable and makes use of demand-driven analyses.
In this paper we provide an example of several layers of analyses for buffer overflow, summarize our initial implementation for C, and provide preliminary results. Results are quantified in terms of correctly-reported, false positive and false negative rates against the NIST SAMATE synthetic benchmarks for C code.
In Proceedings of the ACM SIGPLAN Static Analysis Workshop, pgs 4-11, 12 June 2008.
Flow Control in Output Buffered Switch with Input Groups
High Performance Switching and Routing Conference (HPSR'08), Shanghai, China
Using Ontologies and Vocabularies for Dynamic Linking
Ontology-based linking offers a solution to some of the problems with static, restricted, and inflexible traditional Web linking. Conceptual hypermedia provides navigation between Web resources, supported by a conceptual model, in which an ontology's definitions and structure, together with the lexical labels, drive the consistency of link provision and the linking's dynamic aspects. Lightweight standard representations make it possible to use existing vocabularies to support Web navigation and browsing. In this way, the navigation and linking of diverse resources (including those not in our control) based on a community understanding of the domain can be consistently managed.
Multi-threading in Electric, a Java VLSI CAD Tool
Java User Group Presentation, Universidad Andres Bello, Chile May 2008.
Gridless wire routing using cost functions and multiple processors
Sun Microsystems memo #SML2008-0232, May 2008.
Project Sun SPOT: A Java Technology-Enabled Platform for Ubiquitous Computing
Technical Session TS-6495, JavaOne, May 2008. [The Networking section starts at 18 min 57 sec and the Security section at 22 min 46 sec into the video.]
Validated method for IVPs for Ordinary Differential Equations based on Chaplygin's inequalities
Standard numerical methods for initial value problems (IVPs) for ordinary differential equations (ODEs) return an approximate solution only. Validated (also called interval) methods for IVPs for ODEs return an approximate solution together with a rigorous enclosure of the true solution. A widely known validated method for IVPs for ODEs is the interval Hermite-Obreschkoff (IHO) method. This method runs into difficulties on stiff ODEs. The method of Chaplygin's inequalities is less well known. However, it might be more suitable for problems like an interval Spice simulator, because electrical circuits are described in Spice by stiff empirical ODEs which are not smooth enough. This memo describes the IHO and Chaplygin validated methods and studies their stability on the simple ODE dy/dt = -y.
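The bracketing idea behind Chaplygin's inequalities, written out for the test equation mentioned above (a sketch under standard scalar comparison-theorem assumptions, not a reproduction of the memo's derivation): for \dot{y} = -y, if \underline{y}(t) and \overline{y}(t) satisfy \dot{\underline{y}} \le -\underline{y}, \dot{\overline{y}} \ge -\overline{y}, and \underline{y}(0) \le y(0) \le \overline{y}(0), then \underline{y}(t) \le y(t) \le \overline{y}(t) for all t \ge 0. Taking bounds that satisfy the inequalities with equality gives the enclosure [\underline{y}(0)\, e^{-t}, \overline{y}(0)\, e^{-t}], which contracts with t, so a well-behaved validated method should not lose accuracy on this test equation.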
An abstract for the SCAN 2008 Symposium on Scientific Computing, Computer Arithmetic and Verified Numerical Computations
Sun Microsystems Memo #SML2008-0186, April 2008.
Exploiting capacitors in high-performance computer systems
4th Annual IEEE Int'l Symposium on VLSI Design, Automation, and Test (VLSI-DAT2008), April 2008.
Sun Small Programmable Object Technology
Sun Labs Open House, Apr 2008.
This presentation makes extensive use of animations which were lost in the process of converting to PDF. Watch the presentation video if you find the PDF slides confusing. The networking and security section starts roughly 33 min 15 sec into the video.
Surveying external Electric users
Sun Microsystems Memo #SML2008-0114, April 2008.
User-Input Dependence Analysis via Graph Reachability
Security vulnerabilities are software bugs that are exploited by an attacker. Systems software is at high risk of exploitation: attackers commonly exploit security vulnerabilities to gain control over a system, remotely, over the internet. Bug-checking tools have been used with fair success in recent years to automatically find bugs in software. However, for finding software bugs that can cause security vulnerabilities, a bug checking tool must determine whether the software bug can be controlled by user-input.
In this paper we introduce a static program analysis for computing user-input dependencies. This analysis is used as a pre-processing filter to our static bug checking tool, currently under development, to identify bugs that can be exploited as security vulnerabilities. Runtime speed and scalability of the user-input dependence analysis are of key importance if the analysis is used for large commercial systems software.
Our user-input dependency analysis takes both data and control dependencies into account. We extend Static Single Assignment (SSA) form by augmenting phi-nodes with the control dependencies of their arguments. A formal definition of user-input dependency is expressed in a dataflow analysis framework as a Meet-Over-all-Paths (MOP) solution. We reduce the equation system to a sparse equation system by exploiting the properties of SSA. The sparse equation system is solved as a reachability problem, which results in a fast algorithm for computing user-input dependencies. We have implemented a call-insensitive and a call-sensitive version of the analysis. The paper compares their efficiency for various systems codes.
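As a rough sketch of the reachability formulation (illustrative only; the names below are ours and not taken from the Parfait implementation), the analysis can be viewed as marking the values produced by user-input sources and propagating the marks forward over a sparse def-use graph whose edges include the control dependencies attached to phi-node arguments:

    import java.util.*;

    // Illustrative sketch: user-input dependence as forward reachability over a
    // sparse def-use graph. Names and structure are ours, not Parfait's.
    final class TaintReachability {
        private final Map<String, List<String>> defUse = new HashMap<>(); // definition -> its uses

        // Add an edge from a definition (including phi-nodes carrying the control
        // dependencies of their arguments) to one of its uses.
        void addDependence(String def, String use) {
            defUse.computeIfAbsent(def, k -> new ArrayList<>()).add(use);
        }

        // Every value reachable from a user-input source is user-input dependent.
        Set<String> userInputDependent(Collection<String> sources) {
            Set<String> tainted = new HashSet<>(sources);
            Deque<String> work = new ArrayDeque<>(sources);
            while (!work.isEmpty()) {
                String def = work.pop();
                for (String use : defUse.getOrDefault(def, Collections.emptyList())) {
                    if (tainted.add(use)) {
                        work.push(use);
                    }
                }
            }
            return tainted;
        }
    }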
Research in Industrial Labs: How Collaboration Aids Innovation (slides)
CAHSI (Computing Alliance of Hispanic-Serving Institutions) Lecture, March 2008.
OpenSPARC: An Open Platform for Hardware Reliability Experimentation
Proceedings, System Effects of Logic Soft Errors (SELSE2008), March 2008.
Research in Industrial Labs: How Collaboration Aids Innovation, (RealMedia video) (QuickTime video)
CAHSI (Computing Alliance of Hispanic-Serving Institutions) Lecture, March 2008.
Usable Security on Sun SPOTs
Lightning Talk, Java Mobile & Embedded Developer Days, Jan 23-24, 2008.
Dynamic Linking of Web Resources: Customisation and Personalisation
Conceptual Open Hypermedia Service (COHSE) provides a framework that integrates a knowledge service and the open hypermedia link service to dynamically link Web documents via knowledge resources (e.g., ontologies or controlled vocabularies). The Web can be considered a closed hypermedia system: links on the Web are unidirectional, embedded, and difficult to author and maintain. With a Semantic Web architecture, COHSE addresses these limitations by dynamically creating multi-headed links on third-party documents by integrating third-party knowledge resources and third-party services. Openness is therefore a key aspect of COHSE. This chapter first presents how the COHSE architecture is reengineered to support customisation and to create an adaptable open hypermedia system where the user explicitly provides information about himself. It then presents how this architecture is deployed in a portal and discusses how this portal architecture can be extended to turn COHSE from an adaptable system into an adaptive system where the system implicitly infers some information about the user.
High-speed and low-energy capacitively-driven wires
IEEE Journal of Solid-State Circuits, Vol. 43, Issue 1, Jan. 2008, pp. 52-60.
A Reinforcement Learning Framework for Online Data Migration in Hierarchical Storage Systems
Multi-tier storage systems are becoming more and more widespread in the industry. They have more tunable parameters and built-in policies than traditional storage systems, and an adequate configuration of these parameters and policies is crucial for achieving high performance. A very important performance indicator for such systems is the response time of the file I/O requests. The response time can be minimized if the most frequently accessed (“hot”) files are located in the fastest storage tiers. Unfortunately, it is impossible to know a priori which files are going to be hot, especially because the file access patterns change over time. This paper presents a policy-based framework for dynamically deciding which files need to be upgraded and which files need to be downgraded based on their recent access pattern and on the system’s current state. The paper also presents a reinforcement learning (RL) algorithm for automatically tuning the file migration policies in order to minimize the average request response time. A multi-tier storage system simulator was used to evaluate the migration policies tuned by RL, and such policies were shown to achieve a significant performance improvement over the best hand-crafted policies found for this domain.
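To make the control loop concrete, here is a deliberately crude stand-in (our construction, not the paper's algorithm, which tunes its policies with reinforcement learning rather than hill-climbing): files whose recent access rate exceeds a promotion threshold are placed in the fast tier, and the threshold is adjusted between epochs in whichever direction last reduced the observed response time.

    // Toy illustration only: a promotion threshold adjusted by comparing average
    // response times between epochs. It merely shows the shape of the feedback
    // between migration decisions and observed performance.
    final class MigrationPolicySketch {
        private double threshold = 10.0;   // accesses per epoch needed to promote a file
        private double step = 1.0;         // current search direction and step size
        private double lastResponseTime = Double.MAX_VALUE;

        // Decide whether a file should live in the fast tier this epoch.
        boolean promote(double recentAccessRate) {
            return recentAccessRate >= threshold;
        }

        // After each epoch: keep moving while response time improves, reverse and
        // shrink the step when it degrades.
        void observeEpoch(double avgResponseTime) {
            if (avgResponseTime > lastResponseTime) {
                step = -step / 2.0;
            }
            threshold = Math.max(0.0, threshold + step);
            lastResponseTime = avgResponseTime;
        }
    }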
In memoriam Martin Rem
Brief presentation at ASYNC 2008.
Lessons from asynchronous design
Presentation to DARPA Science Research Council, October 2007. (Slides.)
Measuring 6D chip alignment in multi-chip packages
6th Annual IEEE Conference on Sensors (SENSORS2007), October 2007. (Slides)
Research challenges for on-chip interconnection networks
IEEE Micro, Vol. 27, Issue 5, September-October 2007, pp. 96-108.
A Gradient-Based Reinforcement Learning Approach to Dynamic Pricing in Partially-Observable Environments
As more companies are beginning to adopt the e-business model, it becomes easier for buyers to compare prices at multiple sellers and choose the one that charges the best price for the same item or service. As a result, the demand for the goods of a particular seller is becoming more unstable, since other sellers are regularly offering discounts that attract large fractions of buyers. Therefore, it becomes more important for each seller to switch from static to dynamic pricing policies that take into account observable characteristics of the current demand and the state of the seller's resources. This paper presents a Reinforcement Learning algorithm that can tune parameters of a seller's dynamic pricing policy in a gradient direction (thus converging to the optimal parameter values that maximize the revenue obtained by the seller) even when the seller's environment is not fully observable. This algorithm is evaluated using a simulated Grid market environment, where customers choose a Grid Service Provider (GSP) to which they want to submit a computing job based on the posted price and expected delay information at each GSP.
Potentials of Group IV Photonics Interconnects for 'Red-shift' Computing Applications
4th Annual IEEE International Conference on Group IV Photonics (GFP2007), September 2007.
Backlog Aware Low Complexity Schedulers for Input Queued Packet Switches
Symposium on High-Performance Interconnects (Hot Interconnects), Stanford University
Multiterabit Switch Fabrics Enabled by Proximity Communication
Symposium on High-Performance Chips (Hot Chips), Stanford University
Multi-terabit switch fabrics enabled by Proximity Communication
19th Annual Hot Chips Symposium, August 2007.
Optics for next-generation computing systems
Optical Society of America IPNRA, July 2007.
Using horizontal displays for distributed and collocated agile planning
Computer-supported environments for agile project planning are often limited by the capability of the hardware to support collaborative work. We present DAP, a tool developed to aid distributed and collocated teams in agile planning meetings. Designed with a multi-client architecture, it works on standard desktop computers and digital tables. Using digital tables, DAP emulates index card based planning without requiring team members to be in the same room.
RAS by the Yard
Proceedings, International Conference on Dependable Systems and Networking (DSN2007), June 2007. (Slides)
Robust energy-efficient adder topologies
18th IEEE Symposium on Computer Arithmetic (ARITH2007), June 2007, pp. 16-28.
CMOS integration of capacitive, optical, and electrical interconnects
10th IEEE Int'l Interconnect Technology Conference (IITC2007), June 2007, pp. 78-80.
PWWFA: Parallel Wave Front Arbiter for Large Switches
High Performance Switching and Routing Conference (HPSR'07), Brooklyn, New York
Introduction and evaluation of Martlet, a scientific workflow language for abstracted parallelisation.
The workflow language Martlet described in this paper implements a new programming model that allows users to write parallel programs and analyse distributed data without having to be aware of the details of the parallelisation. Martlet abstracts the parallelisation of the computation and the splitting of the data through the inclusion of constructs inspired by functional programming. These allow programs to be written as an abstract description that can be adjusted automatically at runtime to match the data set and available resources. Using this model it is possible to write programs to perform complex calculations across a distributed data set, such as Singular Value Decomposition or Least Squares problems, as well as creating an intuitive way of working with distributed systems. Having described and evaluated Martlet against other functional languages for parallel computation, this paper goes on to look at how Martlet might develop. In doing so it covers both possible additions to the language itself, and the use of JIT compilers to increase the range of platforms it is capable of running on.
Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems
This paper addresses the problem of dynamic scheduling of data-intensive multiprocessor jobs. Each job requires some number of CPUs and some amount of data that needs to be downloaded onto a local storage space before starting the job. The completion of each job brings some benefit (utility) to the system, and the goal is to find the optimal scheduling policy that maximizes the average utility per unit of time obtained from all completed jobs. A co-evolutionary solution methodology is proposed, where the utility-based policies for managing local storage and for scheduling jobs onto the available CPUs mutually affect each other's environments, with both policies being adaptively tuned using the Reinforcement Learning methodology. Our simulation results demonstrate the feasibility of this approach and show that it performs better than the best heuristic scheduling policy we could find for this domain.
Balancing Security and Ease-of-Use on the Sun SPOTs
Sun Labs Open House, Apr, 2007.
A platform for wireless networked transducers
As computers, sensors, and wireless communication have become smaller, cheaper, and more sophisticated, wireless transducer platforms have become a focus of research and commercial interest. This report describes an investigation into such platforms. It presents a new taxonomy of transducer systems, describes the construction of prototypes of a new transducer device designed for ease of application development, and discusses commercialization issues.
Electric VLSI design system
Publicly released poster from the Sun Labs Open House, Sun Microsystems Memo #SML2007-0192, April 2007.
Optical transceiver chips based on co-integration of capacitively-coupled proximity interconnects and VCSELs
IEEE Photonics Technology Letters, Vol. 19, Number 7, April 2007, pp. 453-455.
A Reinforcement Learning Approach to Dynamic Resource Allocation
This paper presents a general framework for performing adaptive reconfiguration of a distributed system based on maximizing the long-term business value, defined as the discounted sum of all future rewards and penalties. The problem of dynamic resource allocation among multiple entities sharing a common set of resources is used as an example. A specific architecture (DRA-FRL) is presented, which uses the emerging methodology of reinforcement learning in conjunction with fuzzy rulebases to achieve the desired objective. This architecture can work in the context of existing resource allocation policies and learn the values of the states that the system encounters under these policies. Once the learning process begins to converge, the user can allow the DRA-FRL architecture to make some additional resource allocation decisions or override the ones suggested by the existing policies so as to improve the long-term business value of the system. The DRA-FRL architecture can also be deployed in an environment without any existing resource allocation policies. An implementation of the DRA-FRL architecture in Solaris 10 demonstrated a robust performance improvement in the problem of dynamically migrating CPUs and memory blocks between three resource partitions so as to match the stochastically changing workload in each partition, both in the presence and in the absence of resource migration costs.
Constrained circuit optimization using logical effort
Sun Microsystems memo #SML2007-0336.
Notes on pulse signaling
13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2007), March 2007, pp. 15-24. (Slides)
On-chip samplers for test and debug of asynchronous circuits
13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2007), March 2007, pp. 153-62. (Slides)
Deleting Files in the Celeste Peer-to-Peer Storage System
Celeste is a robust peer-to-peer object store built on top of a distributed hash table (DHT). Celeste is a working system, developed by Sun Microsystems Laboratories. During the development of Celeste, we faced the challenge of complete object deletion, and moreover, of deleting "files" composed of several different objects. This important problem is not solved by merely deleting meta-data, as there are scenarios in which all file contents must be deleted, e.g., due to a court order. Complete file deletion in a realistic peer-to-peer storage system has not been previously dealt with due to the intricacy of the problem - the system may experience high churn rates, nodes may crash or have intermittent connectivity, and the overlay network may become partitioned at times. We present an algorithm that eventually deletes all file content, data and meta-data, in the aforementioned complex scenarios. The algorithm is fully functional and has been successfully integrated into Celeste.
A configurable asynchronous pseudorandom bit sequence generator
IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2007), March 2007, pp. 143-152. (Slides)
Open Source and You
The real value of open-source software is the community it fosters.
High-speed and low-energy capacitively-driven wires
Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC2007), February 2007, pp. 412-3.
Circuit techniques to enable 430 Gb/s/mm2 Proximity Communication
Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC2007), February 2007, pp. 368-9.
Comprehensive Multivariate Extrapolation Modeling of Multiprocessor Cache Miss Rates
Cache miss rates are an important subset of system model inputs. Cache miss rate models are used for broad design space exploration in which many cache configurations cannot be simulated directly due to limitations of trace collection setups or available resources. Often it is not practical to simulate large caches. Large processor counts and the consequent potentially high degree of cache sharing are frequently not reproducible on small existing systems. In this article, we present an approach to building multivariate regression models for predicting cache miss rates beyond the range of collectible data. The extrapolation model attempts to accurately estimate the high-level trend of the existing data, which can be extended in a natural way. We extend previous work by its applicability to multiple miss rate components and its ability to model a wide range of cache parameters, including size, line size, associativity and sharing. The stability of extrapolation is recognized to be a crucial requirement. The proposed extrapolation model is shown to be stable to small data perturbations that may be introduced during data collection. We show the effectiveness of the technique by applying it to two commercial workloads. The wide design space contains configurations that are much larger than those for which miss rate data were available. The fitted data match the simulation data very well. The various curves show how a miss rate model is useful not only for estimating the performance of specific configurations, but also for providing insight into miss rate trends.
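As a much simpler instance of the same idea (ours, not the article's multivariate model), a single miss-rate component can be fitted to a power law in cache size by least squares in log space and then extrapolated beyond the largest simulated configuration:

    // Illustrative power-law fit, missRate ~ a * size^(-b), by ordinary least
    // squares on logarithms. A stand-in for the article's multivariate model.
    final class MissRateExtrapolationSketch {
        final double a, b;

        MissRateExtrapolationSketch(double[] sizes, double[] missRates) {
            int n = sizes.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                double x = Math.log(sizes[i]);
                double y = Math.log(missRates[i]);
                sx += x; sy += y; sxx += x * x; sxy += x * y;
            }
            double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double intercept = (sy - slope * sx) / n;
            b = -slope;              // miss rate falls as cache size grows
            a = Math.exp(intercept);
        }

        // Predict the miss rate of a configuration larger than any in the data.
        double predict(double size) {
            return a * Math.pow(size, -b);
        }
    }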
Resource Partitioning in a Java Operating Environment
Managing the partitioning of resources between uncooperating applications is a fundamental requirement of an operating environment. Traditional operating environments only manage low-level resources which presents an impedance mismatch for internet-facing applications with service levels defined in terms of application-level transactions. The Multi-tasking Virtual Machine (MVM) and associated Resource Management API (RM) provide basic mechanisms for managing multiple applications within a Java operating environment. RM separates mechanism and policy and takes the unusual position of delegating rate-based management of resources to the policy level. This report describes the design and implementation of policies that provide flexible resource partitioning among applications and shows their effectiveness using microbenchmarks and an application level benchmark. The latter demonstrates the partitioning of an application-specific resource among a set of application instances using exactly the same policies as used for machine-level resources.
Personalised Dynamic Links on the Web
Links on the Web are unidirectional, embedded, difficult to author and maintain. With a Semantic Web architecture, COHSE (Conceptual Open Hypermedia System) aims to address these limitations by dynamically creating links on the Web. Here we present how this architecture is extended and modified to support customisation and create an adaptable open system by using third party ontologies and services to discover resources on the Web. We then present the deployment of this in a portal and discuss possible extensions to create an adaptive system to dynamically create personalised links.
COHSE: dynamic linking of web resources
This document presents a description of the COHSE collaborative research project between Sun Microsystems Laboratories and the School of Computer Science at the University of Manchester, UK. The purpose of this document is to summarise the project in terms of the work completed and the results achieved. The focus of the project was an application to enable the dynamic creation of hypertext links between documents on the Web, thus the intended audience for this document comprises those members of academic and industrial research groups whose focus includes the Web in general and the Semantic Web and Hypertext in particular.
Software Productivity Research In High Performance Computing
The challenge of utilizing supercomputers effectively at ever increasing scale is not being met, a phenomenon perceived within the high performance computing (HPC) community as a crisis of "productivity." Acknowledging that a narrow focus on peak machine performance numbers has not served HPC goals well in the past, and that the "productivity" of a computing system is not a well-understood phenomenon, the Defense Advanced Research Projects Agency (DARPA) created the High Productivity Computing Systems (HPCS) program. Industry vendors were challenged to develop a new generation of supercomputers that are dramatically (10 times!) more productive, not just faster, and a community of vendor teams and non-vendor research institutions was challenged to develop an understanding of supercomputer productivity that will serve to guide future supercomputer development and to support productivity-based evaluation of computing systems. The HPCS Productivity Team at Sun Microsystems responded by committing to put the investigation of these phenomena on the soundest scientific basis possible, drawing on well-established research methodologies from relevant fields, many of which are unfamiliar within the HPC community.
Conscientious Software
Software needs to grow up and become responsible for itself and its own future by participating in its own installation and customization, maintaining its own health, and adapting itself to new circumstances, new users, and new uses. To create such software will require us to change some of our underlying assumptions about how we write programs. A promising approach seems to be to separate software that does the work (allopoietic) from software that keeps the system alive (autopoietic).
Research in industrial labs: How collaboration aids innovation
Grace Hopper Celebration of Women in Computing, October 2006. (Slides)
Programming the world with sun SPOTs
We describe the Sun Small Programmable Object Technology, or Sun SPOT. The Sun SPOT is a small wireless computing platform that runs Java directly, with no operating system. The system comes with an on-board set of sensors, I/O pins for easy connection to external devices, and supporting software.
Introspection of a Java Virtual Machine under Simulation
Virtual machines are commonly used in commercially-significant systems, for example, Sun Microsystems' Java and Microsoft's .NET. The virtual machine offers many advantages to the system designer and administrator, but complicates the task of workload characterization: it presents an extra abstraction layer between the application and observed hardware effects. Understanding the behavior of the virtual machine is therefore important for all levels of the system architecture.
We have constructed a tool which examines the state of a Sun Java HotSpot virtual machine running inside Virtutech's Simics execution-driven simulator. We can obtain detailed information about the virtual machine and application without disturbing the state of the simulation. For data, we can answer such questions as: Is a given address in the heap? If so, in which object? Of what class? For code, we can map program counter values back to Java methods and approximate Java source line information. Our tool allows us to relate individual events in the simulation, for example, a cache miss, to the higher-level behavior of the application and virtual machine.
In this report, we present the design of our tool, including its capabilities and limitations, and demonstrate its application on the simulation's cache contents and cache misses.
Multithreading in the Electric VLSI design system
Sun Microsystems memo #SML2006-0316, September 2006.
Martlet: A scientific workflow language for abstracted parallelisation.
This paper describes a work-flow language, ‘Martlet’, for the analysis of large quantities of distributed data. This work-flow language is fundamentally different from other languages as it implements a new programming model. Inspired by the inductive constructs of functional programming, this programming model allows it to abstract the complexities of data and processing distribution. This means the user is not required to have any knowledge of the underlying architecture or how to write distributed programs. As well as making distributed resources available to more people, this abstraction also reduces the potential for errors when writing distributed programs. While this abstraction places some restrictions on the user, it is descriptive enough to describe a large class of problems, including algorithms for solving Singular Value Decompositions and Least Squares problems. Currently this language runs on a stand-alone middleware. This middleware can, however, be adapted to run on top of a wide range of existing work-flow engines through the use of JIT compilers capable of producing other work-flow languages at run time. This makes this work applicable to a huge range of computing projects.
Enterprise Mobility
With the proliferation of wireless technologies and business globalization, mobility of people and devices has become inevitable. The experience of mobile computing in different campuses or drop-in offices also faces challenges of starting up applications and tools, synchronizing filesystems, maintaining one's desktop environment, or even finding a printer location or network services. In this document, we discuss different types of mobility, issues with mobility, and why it is important to consider these issues. Finally, this document discusses a network layer solution for IP mobility for continuous connectivity. It also sheds light on future directions of research on mobility that might be interesting for Sun Microsystems.
Dynamic Tuning of Online Data Migration Policies in Hierarchical Storage Systems using Reinforcement Learning*
Multi-tier storage systems are becoming more and more widespread in the industry. In order to minimize the request response time in such systems, the most frequently accessed ("hot") files should be located in the fastest storage tiers (which are usually smaller and more expensive than the other tiers). Unfortunately, it is impossible to know ahead of time which files are going to be "hot", especially because the file access patterns change over time. This report presents a solution approach to this problem, where each tier uses Reinforcement Learning (RL) to learn its own cost function that predicts its future request response time, and the files are then migrated between the tiers so as to decrease the sum of costs of the tiers involved.
A multi-tier storage system simulator was used to evaluate the migration policies tuned by RL, and such policies were shown to achieve a significant performance improvement over the best hand-crafted policies found for this domain.
*This material is based upon work supported by DARPA under Contract No. NBCH3039002.
Data access and analysis with distributed federated data servers in climateprediction.net.
climateprediction.net is a large public resource distributed scientific computing project. Members of the public download and run a full-scale climate model, donate their computing time to a large perturbed physics ensemble experiment to forecast the climate in the 21st century and submit their results back to the project. The amount of data generated is large, consisting of tens of thousands of individual runs each in the order of tens of megabytes. The overall dataset is, therefore, in the order of terabytes. Access and analysis of the data is further complicated by the reliance on donated, distributed, federated data servers. This paper discusses the problems encountered when the data required for even a simple analysis is spread across several servers and how web service technology can be used; how different user interfaces with varying levels of complexity and flexibility can be presented to the application scientists; how using existing web technologies such as HTTP, SOAP, XML, HTML and CGI can engender the reuse of code across interfaces; and how application scientists can be notified of their analysis' progress and results in an asynchronous architecture.
Knowledge-Driven Hyperlinks: Linking in the Wild
Since Ted Nelson coined the term “Hypertext”, there has been extensive research on non-linear documents. With the enormous success of the Web, non-linear documents have become an important part of our daily life activities. However, the underlying hypertext infrastructure of the Web still lacks many features that Hypertext pioneers envisioned. With advances in the Semantic Web, we can address and improve some of these limitations. In this paper, we discuss some of these limitations, developments in Semantic Web technologies and present a system – COHSE – that dynamically links Web pages. We conclude with remarks on future directions for semantics-based linking.
Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS)
IETF RFC 4492, May 2006.
Policy-based Management of a JDBC Connection Pool
Managing the communication between an application server and a back-end database is essential for scalability and crucial for good performance. The standard mechanism uses a variable-sized pool of connections, but typical application servers provide very rudimentary, implementation-centric pool control mechanisms. This requires administrators to manually translate service level specifications into the pool control mechanism, and adjust these as the load or machine configurations change. We describe the use of a resource management framework to automatically control connection pool parameters based on externally supplied policies. This simplifies the connection pool implementation while at the same time allowing a variety of policies to be applied, including policies that automatically adapt to changing circumstances.
The implementations of two distinct policies are discussed and performance measurements are reported for a contemporary synthetic application benchmark.
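A minimal sketch of the mechanism/policy split described above (interface and names are ours, not the framework's API): the pool reports its observable state and a pluggable policy returns the target number of connections, so policies can be swapped or made adaptive without touching the pool implementation.

    // Illustrative separation of pool mechanism from sizing policy; names are
    // invented for this sketch and do not reflect the framework in the report.
    interface PoolSizingPolicy {
        int targetSize(int busy, int idle, int waiters);
    }

    // Example policy: grow when callers are queuing for connections, shrink when
    // many connections sit idle, and always stay within configured bounds.
    final class AdaptiveSizingPolicy implements PoolSizingPolicy {
        private final int min, max;

        AdaptiveSizingPolicy(int min, int max) { this.min = min; this.max = max; }

        @Override
        public int targetSize(int busy, int idle, int waiters) {
            int target = busy + idle;
            if (waiters > 0) {
                target = busy + waiters;          // demand exceeds supply: grow
            } else if (idle > busy) {
                target = busy + busy / 2 + 1;     // mostly idle: shrink
            }
            return Math.min(max, Math.max(min, target));
        }
    }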
Suite B Enablement in TLS: A Report on Interoperability Testing Between Sun, Red Hat and Microsoft
Invited presentation at NIST's 5th Annual PKI R&D Workshop, Apr 5, 2006 (co-presenters: Robert Relyea, Red Hat and Kelvin Yiu, Microsoft).
Scientific middleware for abstracted parallelisation.
In this paper we introduce a class of problems that arise when the analysis of data split into an unknown number of pieces is attempted. Such analysis falls under the definition of Grid computing, but fails to be addressed by the current Grid computing projects, as they do not provide the appropriate abstractions. We then describe a distributed web service based middleware platform, which solves these problems by supporting construction of parallel data analysis functions for datasets with an unknown level of distribution. This analysis is achieved through the combination of Martlet, a workflow language that uses constructs from functional programming to abstract the parallelisation in computations away from the user, and the construction of supporting middleware. To construct such a supporting middleware it is necessary to provide the capability to reason about the data structures held without restricting their nature. Issues covered in the development of this supporting middleware include the ability to handle distributed data transfer and management, function deployment and execution.
Writing Solaris Device Drivers in Java
We present an experimental implementation of the Java Virtual Machine that runs inside the kernel of the Solaris operating system. The implementation was done by porting an existing small, portable JVM, Squawk, into the Solaris kernel. Our first application of this system is to allow device drivers to be written in Java. A simple device driver was ported from C to Java. Characteristics of the Java device driver and our device driver interface are described.
Design Notes for Electric's Network Consistency Checker
This technical report is a collection of the memos written by members of the VLSI Research Group in 2004 and 2005 about Electric's Network Consistency Checker, NCC. Be warned that these memos are unrefined design notes, not polished papers. For the most part, these memos were written to help us think through problems; as such, they may contain errors and conjectures that have not been fully proven. Despite that, we've created this report so that we can share our ideas and collaborate with people outside of Sun.
The Electric VLSI Design System is an open source electronic design automation system used by the VLSI Research Group to create layout and schematics for integrated circuits.
An asynchronous high-throughput control circuit for proximity communication
12th International Symposium on Asynchronous Circuits and Systems, March 2006. (Slides)
Yes, There is an "Expertise Gap" in HPC Applications Development
Third Workshop on Productivity and Performance in High-End Computing (P-PHEC), 12 February 2006, Austin, Texas
Abstract:
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity increase in High Performance Computing (HPC), where productivity is understood to be a composite of system performance, system robustness, programmability, portability, and administrative concerns. Of these, programmability is the least well understood and perceived to be the most problematic. It has been suggested that an "expertise gap" is at the heart of the problem in HPC application development. Preliminary results from research conducted by Sun Microsystems and other participants in the HPCS program confirm that such an "expertise gap" does exist and does exert a significant confounding influence on HPC application development. Further, the nature of the "expertise gap" appears not to be amenable to previously proposed solutions such as "more education" and "more people." A productivity improvement of the scale sought by the HPCS program will require fundamental transformations in the way HPC applications are developed and maintained.
Circuits without clocks: What makes them tick?
Invited talk, Canadian Undergraduate Technology Conference, January 2006. (Slides)
A Dynamic-Sized Nonblocking Work Stealing Deque
The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton [2] (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based double-ended queues (deques) with low cost synchronization among local and stealing processes. Unfortunately, the algorithm's synchronization protocol is strongly based on the use of fixed size arrays, which are prone to overflows, especially in the multiprogrammed environments for which they are designed. This is a significant drawback since, apart from memory inefficiency, it means that the size of the deque must be tailored to accommodate the effects of the hard-to-predict level of multiprogramming, and the implementation must include an expensive and application-specific overflow mechanism.
This paper presents the first dynamic memory work-stealing algorithm. It is based on a novel way of building non-blocking dynamic-sized work stealing deques by detecting synchronization conflicts based on "pointer-crossing" rather than "gaps between indexes" as in the original ABP algorithm. As we show, the new algorithm dramatically increases robustness and memory efficiency, while causing applications no observable performance penalty. We therefore believe it can replace array-based ABP work stealing deques, eliminating the need for application-specific overflow mechanisms.
*This work was conducted while Yossi Lev was a student at Tel Aviv University, and is derived from his MS thesis [1].
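For readers unfamiliar with the underlying structure, here is a heavily simplified, fixed-size ABP-style deque (our sketch, not the paper's dynamic-sized algorithm; it omits the overflow handling and the tag/ABA machinery that motivate the paper, and glosses over memory-ordering details):

    import java.util.concurrent.atomic.AtomicInteger;

    // Heavily simplified, fixed-size work-stealing deque in the ABP style.
    // The owner pushes and pops at the bottom; thieves steal from the top with a
    // CAS. This sketch has no growth or overflow handling and only shows the shape.
    final class SimpleWorkStealingDeque<T> {
        private final Object[] tasks;
        private volatile int bottom = 0;                         // written only by the owner
        private final AtomicInteger top = new AtomicInteger(0);  // thieves CAS here

        SimpleWorkStealingDeque(int capacity) { tasks = new Object[capacity]; }

        // Owner only: push a task at the bottom end.
        void pushBottom(T task) {
            if (bottom - top.get() >= tasks.length) {
                throw new IllegalStateException("full (this sketch cannot grow)");
            }
            tasks[bottom % tasks.length] = task;
            bottom = bottom + 1;
        }

        // Owner only: pop from the bottom; races with thieves only on the last task.
        @SuppressWarnings("unchecked")
        T popBottom() {
            int b = bottom - 1;
            bottom = b;
            int t = top.get();
            if (b < t) { bottom = t; return null; }              // deque was empty
            T task = (T) tasks[b % tasks.length];
            if (b > t) { return task; }                          // at least two tasks remained
            boolean won = top.compareAndSet(t, t + 1);           // compete for the last task
            bottom = t + 1;
            return won ? task : null;
        }

        // Thieves: steal from the top end with a CAS; null means retry or give up.
        @SuppressWarnings("unchecked")
        T steal() {
            int t = top.get();
            if (t >= bottom) { return null; }                    // appears empty
            T task = (T) tasks[t % tasks.length];
            return top.compareAndSet(t, t + 1) ? task : null;
        }
    }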
An Overview of the Singularity Project
Singularity is a research project in Microsoft Research that started with the question: what would a software platform look like if it was designed from scratch with the primary goal of dependability? Singularity is working to answer this question by building on advances in programming languages and tools to develop a new system architecture and operating system (named Singularity), with the aim of producing a more robust and dependable software platform. Singularity demonstrates the practicality of new technologies and architectural decisions, which should lead to the construction of more robust and dependable systems.
Improving display speed in Electric (TM)
Sun Microsystems Memo #SML2005-0523, October 2005.
A Reinforcement Learning Approach to Dynamic Resource Allocation
This paper presents a general framework for performing reconfiguration of a distributed system based on maximizing the long-term business value, defined as the discounted sum of all future rewards and penalties. The problem of dynamic resource allocation among multiple entities sharing a common set of resources is used as an example.
A specific architecture (DRA-FRL) is presented, which uses the emerging methodology of reinforcement learning in conjunction with fuzzy rulebases to achieve the desired objective. This architecture can work in the context of existing resource allocation policies and learn the values of the states that the system encounters under these policies. Once the learning process begins to converge, the user can allow the DRA-FRL architecture to make some additional resource allocation decisions or override the ones suggested by the existing policies so as to improve the long-term business value of the system. The DRA-FRL architecture can also be deployed in an environment without any existing resource allocation policies.
An implementation of the DRA-FRL architecture in Solaris™ 10 demonstrated a robust performance improvement in the problem of dynamically migrating CPUs and memory blocks between three resource partitions so as to match the stochastically changing workload in each partition, both in the presence and in the absence of resource migration costs.
*This material is based upon work supported by DARPA under Contract No. NBCH3039002.
Challenges in Building a Flat-Bandwidth Memory Hierarchy for a Large-Scale Computer with Proximity Communication
13th Annual IEEE Symposium on High Performance Interconnects, August 2005. (Slides)
Multi-Tier Checkpointing for Peta-Scale Systems
Proceedings, International Conference on Dependable Systems and Networking (DSN2005), July 2005.
Modeling Coordinated Checkpointing for Large-Scale Supercomputers
Proceedings, International Conference on Dependable Systems and Networking (DSN2005), July 2005.
Sizzle: A Standards-based End-to-End Security Architecture for the Embedded Internet
According to popular perception, public-key cryptography is beyond the capabilities of highly constrained, "mote"-like, embedded devices. We show that elliptic curve cryptography not only makes public-key cryptography feasible on these devices, it allows one to create a complete secure web server stack that runs efficiently within very tight resource constraints. Our small-footprint HTTPS stack, nicknamed Sizzle, has been implemented on multiple generations of the Berkeley/Crossbow motes where it runs in less than 4KB of RAM, completes a full SSL handshake in 1 second (session reuse takes 0.5 seconds) and transfers 1 KB of application data over SSL in 0.4 seconds. Sizzle is the world's smallest secure web server and can be embedded inside home appliances, personal medical devices, etc., allowing them to be monitored and controlled remotely via a web browser without sacrificing end-to-end security.
This report is an extended version of a paper that received the 'Mark Weiser Best Paper Award' at the Third IEEE International Conference on Pervasive Computing and Communications (PerCom), Hawaii, March 2005.
Can Software Engineering Solve the HPCS Problem?
Second International Workshop on Software Engineering for High Performance Computing System Applications, St. Louis, Missouri, May 15, 2005
Abstract:
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity improvement. Software Engineering has addressed this goal in other domains and identified many important principles that, when aligned with hardware and computer science technologies, do make dramatic improvements in productivity. Do these principles work for the HPC domain?
This case study collects data on the potential benefits of perfective maintenance in which human productivity (programmability, readability, verifiability, maintainability) is paramount. An HPC professional rewrote four FORTRAN77/MPI benchmarks in Fortran 90, removing optimizations (many improving distributed memory performance) and emphasizing clarity.
The code shrank by 5-10x and is significantly easier to read and relate to specifications. Run time performance slowed by about 2x. More studies are needed to confirm that the resulting code is easy to maintain and that the lost performance can be recovered with compiler optimization technologies, run time management techniques and scalable shared memory hardware.
HPC Needs a Tool Strategy
Second International Workshop on Software Engineering for High Performance Computing System Applications, St. Louis, Missouri, May 15, 2005
Abstract:
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity increase in High Performance Computing (HPC). A change of this magnitude in software development and maintenance demands a transformation similar to other great leaps in industrial productivity. By analogy, this requires a dramatic change to the "infrastructure" and to the way software developers use it. Software tools such as compilers, libraries, debuggers and analyzers constitute an essential part of the HPC infrastructure, without which codes cannot be efficiently developed nor production runs accomplished.
The underappreciated "HPC software infrastructure" is not up to the task and is becoming less so in the face of increasing scale, complexity, and mission importance. Infrastructure dependencies are seen as significant risks to success, and significant productivity gains remain unrealized. Support models for this infrastructure are not aligned with its strategic value.
To achieve the potential of the software infrastructure, both for stability and for productivity breakthroughs, a dedicated, long-term, client-focused support structure must be established. Goals for tools in the infrastructure would include ubiquity, portability, and longevity commensurate with the projects they support, typically decades. The strategic value of such an infrastructure necessarily transcends individual projects, laboratories, and organizations.
Electric, a VLSI CAD framework using Java technology
Invited talk, Dept. of Computer Science, Catholic University, Santiago, Chile, May 2005. (Slides)
Secure Adhoc Communication
Technical overview of the project.
Innovation Happens Elsewhere: Open Source as Business Strategy
It's a plain fact: regardless of how smart, creative, and innovative your organization is, there are more smart, creative, and innovative people outside your organization than inside. Open source offers the possibility of bringing more innovation into your business by building a creative community that reaches beyond the barriers of the business. The key is developing a web-driven community where new types of collaboration and creativity can flourish. Since 1998 Ron Goldman and Richard Gabriel have been helping groups at Sun Microsystems understand open source and advising them on how to build successful communities around open source projects. In this book the authors present lessons learned from their own experiences with open source, as well as those from other well-known projects such as Linux, Apache, and Mozilla.
Security Issues in Wireless Sensor Networks
Invited presentation at the 10th FBI Information Technology Study Group Workshop, Apr 21, 2005.
Technology Scaling and the Future of Interconnect
Invited talk, 7th IEEE Int'l Workshop on System Level Interconnect Prediction, April 2005. (Slides)
GasP control for domino circuits
IEEE International Symposium on Asynchronous Circuits and Systems, March 2005, pp. 12-22. (Slides)
Proximity communication and time
IEEE International Symposium on Asynchronous Circuits and Systems, March 2005, pp. xii. (Slides)
A Reinforcement Learning Framework for Utility-Based Scheduling in Resource-Constrained Systems
This paper presents a general methodology for scheduling jobs in soft real-time systems, where the utility of completing each job decreases over time. This scheduling problem is known to be NP-hard, requiring a heuristic solution to operate in real-time. We present a utility-based framework for making repeated scheduling decisions based on dynamically observed information about unscheduled jobs and system's resources. This framework generalizes the standard scheduling problem to a resource-constrained environment, where resource allocation (RA) decisions (how many CPUs to allocate to each job) have to be made concurrently with the scheduling decisions (when to execute each job). We then use the discrete-time Optimal Control theory to formulate the optimization problem of finding the scheduling/RA policy that maximizes the average utility per time step obtained from completed jobs. We propose a Reinforcement Learning (RL) architecture for solving the NP-hard Optimal Control problem in real-time, and our experimental results demonstrate the feasibility and benefits of the proposed approach.
A Cryptographic Processor for Arbitrary Elliptic Curves over GF(2^m)
International Journal of Embedded Systems, Feb. 2005. Extended version of the paper that won the Best Paper award at IEEE ASAP 2003.
The use of capability descriptions in a wireless transducer network
This document presents the requirements for a language to describe the capabilities of a transducer in a wireless transducer network (WTN). It provides a survey of existing technologies in this field and concludes with a framework in which the capabilities of a transducer can be employed to assist users in the configuration of a WTN. The intended audience for this paper comprises members of academic and industrial research groups whose focus is networked devices, such as those used in wireless sensor networks.
An object-aware memory architecture
Despite its dominance, object-oriented computation has received scant attention from the architecture community. We propose a novel memory architecture that supports objects and garbage collection (GC). Our architecture is co-designed with a Java Virtual Machine to improve the functionality and efficiency of heap memory management. The architecture is based on an address space for objects accessed using object IDs mapped by a translator to physical addresses. To support this, the system includes object-addressed caches, a hardware GC barrier to allow in-cache GC of objects, and an exposed cache structure cooperatively managed by the JVM. These extend a conventional architecture, without compromising compatibility or performance for legacy binaries.
Our innovations enable various improvements such as: a novel technique for parallel and concurrent garbage collection, without requiring any global synchronization; an in-cache garbage collector, which never accesses main memory; concurrent compaction of objects; and elimination of most GC store barrier overhead. We compare the behavior of our system against that of a conventional generational garbage collector, both with and without an explicit allocate-in-cache operation. Explicit allocation eliminates many write misses; our scheme additionally trades L2 misses for in-cache operations, and provides the mapping indirection required for concurrent compaction.
Sizzle -- SSL on Motes
Invited presentation at U.C. Berkeley's CENTS Retreat, Tahoe, Jan. 2005.
Experiments in Wireless Internet Security
in Statistical Methods in Computer Security, William W. S. Chen, (Editor), Dekker/CRC Press, pp. 33-47.
An 8-Gb/s/pin simultaneously bidirectional transceiver in 0.35-um CMOS
IEEE Journal of Solid-State Circuits, Vol. 39, Issue 11, Nov. 2004, pp. 1894-1908.
Partitioning of Code for a Massively Parallel Machine
Code partitioning is the problem of dividing sections of code among a set of processors for execution in parallel taking into account the communication overhead between the processors. Code partitioning of large amounts of code onto numerous processors requires variations to the classical partitioning algorithms, in part due to the memory and time requirements to partition a large set of data, but also due to the nature of the target machine and multiple constraints imposed by its architectural features.
In this paper, we present our experience in the design of enhancements to the classical multi-level k-way partitioning algorithm to deal with large graphs of over 1 million nodes, 5 constraints, and nodes of irregular size. Our algorithm was implemented to produce code for a massively parallel machine of up to 40,000 processors, and forms part of a hardware description language compiler. The algorithm and the compiler were tested on RTL designs for a next generation SPARC® processor. We present performance results and comparisons for partitioning multi-processor hardware designs.
New division algorithms by digit recurrence
38th Asilomar Conference on Signals, Systems, and Computers, Nov. 2004, pp. 1849-55. (Slides)
Circuits without clocks: What makes them tick?
Invited talk, System-on-Chip Conference (SOC2004), November 2004. (Slides)
Garbage-first garbage collection
Garbage-First is a server-style garbage collector, targeted for multi-processors with large memories, that meets a soft real-time goal with high probability, while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with mutation, to prevent interruptions proportional to heap or live-data size. Concurrent marking both provides collection "completeness" and identifies regions ripe for reclamation via compacting evacuation. This evacuation is performed in parallel on multiprocessors, to increase throughput.
Proximity Communication
IEEE Journal of Solid-State Circuits, Vol. 39, Number 9, September 2004, pp. 1529-36.
A Comparative Study of Persistence Mechanisms for the Java™ Platform
Access to persistent data is a requirement for the majority of computer applications. The Java programming language and associated run-time environment provide excellent features for the construction of reliable and robust applications, but currently these do not extend to the domain of persistent data. Many mechanisms for managing persistent data have been proposed, some of which are now included in the standard Java platforms, e.g., J2SE™ and J2EE™.
This paper defines a set of criteria by which persistence mechanisms may be compared and then applies the criteria to a representative set of widely used mechanisms. The criteria are evaluated in the context of a widely-known benchmark, which was ported to each of the mechanisms, and include performance and scalability results.
Maintaining Object Ordering in a Shared P2P Storage Environment
Modern peer-to-peer (P2P) storage systems have evolved to provide solutions to a variety of burning storage problems. While the first generation provided rather informal file sharing, more recent approaches provide more extensive security, sharing, and archive capabilities.
To be considered a viable storage solution the system must exhibit high availability and data persistence characteristics. In an attempt to provide these, most systems assume a continuously connected and available underlying communication infrastructure. But this is not necessarily the case because equipment failures, denial of service attacks, and just poor (yet common) corporate network design may cause discontinuities and interruptions in the communication service. Any proposed storage solution needs to address such issues transparently.
Storage archival systems can live with discontinuities, as long as the stored data can be uniquely identified. Continuous update systems that allow updating data by multiple writers have harder problems to overcome since the ordering of updates needs to be maintained independently of connectivity conditions. In this paper, we propose a solution for maintaining the ordering even under severe connectivity disruptions, allowing the system to continue functioning while connectivity is disrupted, and to recover from the disruption smoothly when connectivity is restored.
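The paper's ordering mechanism is not reproduced here, but as a familiar point of reference, version vectors are one standard way for replicas that reconnect after a partition to decide whether two updates are ordered or concurrent (and therefore need reconciliation):

    import java.util.HashMap;
    import java.util.Map;

    // Standard version-vector sketch, shown only as background on update ordering
    // across partitions; it is not the mechanism proposed in the paper.
    final class VersionVector {
        private final Map<String, Long> counters = new HashMap<>();

        // Record a local update made by the given replica.
        void recordUpdate(String replicaId) {
            counters.merge(replicaId, 1L, Long::sum);
        }

        // True if this vector is dominated by the other (every entry <= other's).
        boolean happenedBefore(VersionVector other) {
            for (Map.Entry<String, Long> e : counters.entrySet()) {
                if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                    return false;
                }
            }
            return !counters.equals(other.counters);
        }

        // Neither dominates: the updates were concurrent and must be reconciled.
        boolean concurrentWith(VersionVector other) {
            return !happenedBefore(other) && !other.happenedBefore(this);
        }

        // On synchronization, merge by taking the element-wise maximum.
        void merge(VersionVector other) {
            other.counters.forEach((id, c) -> counters.merge(id, c, Math::max));
        }
    }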
Accelerating Next-Generation Public-key Cryptography on General-Purpose CPUs
Hot Chips 16, Aug. 2004. Selected as one of the Best Papers.
Using gated experts in fault diagnosis and prognosis
Three individual experts have been developed, based on extended auto-associative neural networks (E-AANN), Kohonen self-organizing maps (KSOM), and radial basis function based clustering (RBFC) algorithms. An integrated method is then proposed to combine the set of individual experts, managed by a gated experts algorithm which assigns the experts based on their best performance regions. We have used a Matlab Simulink model of a chiller system and applied the individual experts and the integrated method to detect and recover sensor errors. It has been shown that the integrated method achieves better performance in diagnostics and prognostics than each individual expert.
Grid style web services for climateprediction.net.
In this paper we describe an architecture that implements call and pass by reference using asynchronous Web Services. This architecture provides a distributed data analysis environment in which functions can be dynamically described and used.
Challenges and potentials for multiterabit-per-second optical transceivers
Digest of the LEOS Summer Topical Meetings, Biophotonics/Optical Interconnects and VLSI Photonics/WMB Microcavities, June 2004, pp. 28-30.
Scaling J2EE™ Application Servers with the Multi-Tasking Virtual Machine
The Java 2 Platform, Enterprise Edition (J2EE) is established as the standard platform for hosting enterprise applications written in the Java programming language. Similar to an operating system, a J2EE server can host multiple applications, but this is rarely seen in practice due to limitations on scalability, weak inter-application isolation and inadequate resource management facilities in the underlying Java platform. This leads to a proliferation of server instances, each typically hosting a single application, with a consequent dramatic increase in the total memory footprint and more complex system administration. The Multi-tasking Virtual Machine (MVM) solves this problem by providing an efficient and scalable implementation of the isolate API for multiple, isolated tasks, enabling the co-location of multiple server instances in a single MVM process. Isolates also enable the restructuring of a J2EE server implementation as a collection of isolated components, offering increased flexibility and reliability. The resulting system is a step towards a complete and scalable operating environment for enterprise applications.
Transistor sizing: how to control the speed and energy consumption of a circuit
IEEE International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 51-61. (Slides)
Long Wires and Asynchronous Control
Digest of Technical Papers, IEEE International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 240-9. (Slides)
A fast and energy-efficient stack
IEEE International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 7-16. (Slides)
Supporting Per-processor Local-allocation Buffers Using Multi-processor Restartable Critical Sections
One challenge for runtime systems like the Java™ platform that depend on garbage collection is the ability to scale performance with the number of allocating threads. As the number of such threads grows, allocation of memory in the heap becomes a point of contention. To relieve this contention, many collectors allow threads to preallocate blocks of memory from the shared heap. These per-thread local-allocation buffers (LABs) allow threads to allocate most objects without any need for further synchronization. As the number of threads exceeds the number of processors, however, the cost of committing memory to local-allocation buffers becomes a challenge and sophisticated LAB-sizing policies must be employed.
To reduce this complexity, we implement support for local-allocation buffers associated with processors instead of threads using multiprocessor restartable critical sections (MP-RCSs). MP-RCSs allow threads to manipulate processor-local data safely. To support processor-specific transactions in dynamically generated code, we have developed a novel mechanism for implementing these critical sections that is efficient, allows preemption notification at known points in a given critical section, and does not require explicit registration of the critical sections. Finally, we analyze the performance of per-processor LABs and show that, for highly threaded applications, this approach performs better than per-thread LABs, and allows for simpler LAB-sizing policies.
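The sketch below illustrates the basic LAB discipline in plain Java over a simulated heap of addresses: bump-pointer allocation inside a buffer needs no synchronization, and only refilling touches the shared heap. The names are hypothetical, and the per-processor variant's restartable critical sections have no direct Java counterpart.

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of local-allocation-buffer (LAB) bump-pointer allocation over a
    // simulated heap of addresses (this illustrates the per-thread variant).
    class LabAllocatorSketch {
        private final AtomicLong sharedHeapTop = new AtomicLong(0);
        private final long labSize;

        LabAllocatorSketch(long labSize) { this.labSize = labSize; }

        static final class Lab {
            long cursor, limit;   // current bump pointer and end of the buffer
        }

        // Allocate 'bytes' (assumed <= labSize) from the caller's LAB,
        // refilling from the shared heap only when the buffer is exhausted;
        // the refill is the only step that contends with other threads.
        long allocate(Lab lab, long bytes) {
            if (lab.cursor + bytes > lab.limit) {
                long start = sharedHeapTop.getAndAdd(labSize); // contended refill
                lab.cursor = start;
                lab.limit = start + labSize;
            }
            long addr = lab.cursor;
            lab.cursor += bytes;   // synchronization-free bump allocation
            return addr;
        }
    }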
Supernets and snHubs: A Foundation for Public Utility Computing
The notion of procuring computer services from a utility, much the way we get water and electricity and phone service, is not new. The idea at the center of the public utility trend in computer services is to allow firms to focus less on administering and supporting their information technology and more on running their business. Supernets and their implementation as hardware devices (snHubs) are our approach to make networks part of the public utility computing (PUC) infrastructure. The infrastructure is a key to integrating and enabling such "remote access" constituencies as B2B, out-sourcing vendors, and workers who telecommute in a safe and scalable manner. We have designed, developed, and deployed a prototype whose viability is now being demonstrated by a small deployment throughout Sun Microsystems.
Electronic Alignment for Proximity Communication
Digest of Technical Papers, IEEE International Solid-State Circuits Conference, February 2004, pp. 144-5.
Shedding Light on the Hidden Web
The terms Hidden Web, Deep Web and Invisible Web describe those resources on the Web that are in some way unreachable by search engines, and are potentially unusable to other Web systems such as annotation services. These hidden resources make up a significant part of the current Web. We provide firm definitions of the ways in which information can be "hidden", and discuss the challenges that face those working with annotation in the Hidden Web. We do not attempt to provide solutions for these challenges, but a clarification of the terms involved is certainly a step in the right direction.
Circuits without a clock: what makes them tick?
Keynote presentation at the Int'l Conference on Principles of Distributed Systems, December 2003. (Slides)
Logical effort of carry propagate adders
37th Asilomar Conference on Signals, Systems, and Computers, November 2003, pp. 873-878.
Design of JFluid: A Profiling Technology and Tool Based on Dynamic Bytecode Instrumentation
Instrumentation-based profiling has many advantages and one serious disadvantage: usually high performance overhead. This overhead can be substantially reduced if only a small part of the target application (for example, one that has previously been identified as a performance bottleneck) is instrumented, while the rest of the application code runs at full speed. Such an approach can also mitigate scalability issues caused by a high volume of profiling information generated by instrumented code running on behalf of multiple threads. The value of such a profiling technology would increase further if the code could be instrumented and de-instrumented as many times as needed at run time.
In this report we describe in detail the design of an experimental profiling system called JFluid, which includes a modified Java HotSpot™ VM and a GUI tool, and addresses both of the above issues. Our JVM™ supports arbitrary on-the-fly modifications to running Java methods, and can connect with a profiling tool at any moment, without any startup time preparation. Our tool collects, processes and presents profiling data on-line. To perform CPU profiling, it instruments a group of methods defined as an arbitrary "root" method plus all methods that it calls (a call subgraph). It appears that static determination of all methods in a call subgraph is difficult in the presence of virtual methods, but fortunately, with dynamic code hotswapping available, two schemes of dynamic call subgraph revelation and instrumentation can be suggested.
Measurements that we obtained when performing full and partial program profiling using both schemes show that the overhead can be reduced substantially using this technique, and that one of the schemes generally results in a smaller number of instrumented methods and better performance, especially for large applications.
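The effect of such instrumentation can be pictured as injected entry/exit calls around a method body. The hand-written example below illustrates only that shape; JFluid injects equivalent bytecode at run time, and the ProfilerRuntime class here is hypothetical, not the tool's API.

    // Hand-written illustration of what injected entry/exit instrumentation
    // amounts to (hypothetical names, not JFluid's actual runtime classes).
    class ProfilerRuntime {
        private static final ThreadLocal<java.util.ArrayDeque<Long>> stack =
                ThreadLocal.withInitial(java.util.ArrayDeque::new);

        static void methodEntry(String method) {
            stack.get().push(System.nanoTime());
        }

        static void methodExit(String method) {
            long elapsed = System.nanoTime() - stack.get().pop();
            System.out.printf("%s took %d ns%n", method, elapsed);
        }
    }

    class InstrumentedExample {
        // Original method body wrapped by the calls the instrumenter would inject.
        static int rootMethod(int n) {
            ProfilerRuntime.methodEntry("rootMethod");
            try {
                int sum = 0;
                for (int i = 0; i < n; i++) sum += i;
                return sum;
            } finally {
                ProfilerRuntime.methodExit("rootMethod");
            }
        }
    }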
Securing the Web with the Next Generation Public-Key Cryptosystem
Stanford Networking Research Center (SNRC) industry seminar
Sketchpad: A man-machine graphical communication system (Archival reprint edition of Sutherland's 1963 MIT Ph.D. Thesis, with a new foreword by A. Blackwell and K. Rodden)
University of Cambridge Technical Report UCAM-CL-TR-574, September 2003.
Proximity Communication
Proceedings, IEEE Custom Integrated Circuits Conference, September 2003, pp. 469-472.
Co-evolutionary perception-based reinforcement learning for sensor allocation in autonomous vehicles
In this paper we study the problem of sensor allocation in Unmanned Aerial Vehicles (UAVs). Each UAV uses perception-based rules for generalizing decision strategy across similar states and reinforcement learning for adapting these rules to the uncertain, dynamic environment. A big challenge for reinforcement learning algorithms in this problem is that UAVs need to learn two complementary policies: how to allocate their individual sensors to appearing targets and how to distribute themselves as a team in space to match the density and importance of targets underneath. We address this problem using a co-evolutionary approach, where the policies are learned separately, but they use a common reward function. The applicability of our approach to the UAV domain is verified using a high-fidelity robotic simulator. Based on our results, we believe that the co-evolutionary reinforcement learning approach to reducing dimensionality of the action space presented in this paper is general enough to be applicable to many other multi-objective optimization problems, particularly those that involve a tradeoff between individual optimality and team-level optimality.
Inductive Learning for Fault Diagnosis
There is a steadily increasing need for autonomous systems that must be able to function with minimal human intervention to detect and isolate faults, and recover from such faults. In this paper we present a novel hybrid Model based and Data Clustering (MDC) architecture for fault monitoring and diagnosis, which is suitable for complex dynamic systems with continuous and discrete variables. The MDC approach allows for adaptation of both structure and parameters of identified models using supervised and reinforcement learning techniques. The MDC approach will be illustrated using the model and data from the Hybrid Combustion Facility (HCF) at the NASA Ames Research Center.
Circuits without a clock: what makes them tick?
Sun ONEDay 03 presentation, June 2003. (Slides)
A 10-mW 3.6-Gbps I/O transmitter
IEEE Symposium on VLSI Circuits, June 2003, pp. 97-98.
Securing the Web with Next Generation Cryptographic Technologies
Internetworking 2003, San Jose, Jun. 2003.
A Cryptographic Processor for Arbitrary Elliptic Curves over GF(2^m)
We describe a cryptographic processor for Elliptic Curve Cryptography (ECC). ECC is evolving as an attractive alternative to other public-key cryptosystems such as the Rivest-Shamir-Adleman algorithm (RSA) by offering the smallest key size and the highest strength per bit. The cryptographic processor performs point multiplication for elliptic curves over binary polynomial fields GF(2^m). In contrast to other designs that only support one curve at a time, our processor is capable of handling arbitrary curves without requiring reconfiguration. More specifically, it can handle both named curves as standardized by the National Institute for Standards and Technology (NIST) as well as any other generic curves up to a field degree of 255. Efficient support for arbitrary curves is particularly important for the targeted server applications that need to handle requests for secure connections generated by a multitude of heterogeneous client devices. Such requests may specify curves which are infrequently used or not even known at implementation time.
We have implemented the cryptographic processor in a field-programmable gate array (FPGA) running at a clock frequency of 66.4 MHz. Its performance is 6955 point multiplications per second for named curves over GF(2^163) and 3308 point multiplications per second for generic curves over GF(2^163). We have integrated the cryptographic processor into the open source toolkit OpenSSL, which implements the Secure Sockets Layer (SSL), today's dominant Internet security protocol.
This report is an extended version of a paper presented at the IEEE 14th International Conference on Application-specific Systems, Architectures and Processors, The Hague, June 2003 where it received the "Best Paper Award".
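For readers unfamiliar with point multiplication, the sketch below shows the classic double-and-add loop. For brevity it is written over a toy prime-field curve using BigInteger; the processor described above operates over binary polynomial fields GF(2^m), where the point formulas differ, and the curve parameters here are illustrative only.

    import java.math.BigInteger;

    // Minimal double-and-add scalar multiplication over a toy prime-field
    // curve y^2 = x^3 + 2x + 3 (mod 97), using affine coordinates.
    final class EcSketch {
        static final BigInteger P = BigInteger.valueOf(97);   // toy prime modulus
        static final BigInteger A = BigInteger.valueOf(2);    // curve coefficient a
        static final BigInteger[] INFINITY = null;            // point at infinity

        static BigInteger[] add(BigInteger[] p1, BigInteger[] p2) {
            if (p1 == INFINITY) return p2;
            if (p2 == INFINITY) return p1;
            BigInteger x1 = p1[0], y1 = p1[1], x2 = p2[0], y2 = p2[1];
            BigInteger lambda;
            if (x1.equals(x2)) {
                if (y1.add(y2).mod(P).signum() == 0) return INFINITY;   // P + (-P)
                // Doubling: lambda = (3*x1^2 + a) / (2*y1)
                lambda = x1.pow(2).multiply(BigInteger.valueOf(3)).add(A)
                           .multiply(y1.shiftLeft(1).modInverse(P)).mod(P);
            } else {
                // Addition: lambda = (y2 - y1) / (x2 - x1)
                lambda = y2.subtract(y1)
                           .multiply(x2.subtract(x1).modInverse(P)).mod(P);
            }
            BigInteger x3 = lambda.pow(2).subtract(x1).subtract(x2).mod(P);
            BigInteger y3 = lambda.multiply(x1.subtract(x3)).subtract(y1).mod(P);
            return new BigInteger[] { x3, y3 };
        }

        // Scalar multiplication k*G by the classic double-and-add loop.
        static BigInteger[] multiply(BigInteger k, BigInteger[] g) {
            BigInteger[] result = INFINITY, addend = g;
            for (int i = 0; i < k.bitLength(); i++) {
                if (k.testBit(i)) result = add(result, addend);
                addend = add(addend, addend);
            }
            return result;
        }
    }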
Congestion and starvation detection in ripple FIFOs
IEEE International Symposium on Asynchronous Circuits and Systems, May 2003, pp. 36-45. (Slides)
Project JXTA: A Loosely-Consistent DHT Rendezvous Walker
The open-source community Project JXTA defines an open set of standard protocols for ad hoc, pervasive, peer-to-peer (P2P) computing as a common platform for developing a wide variety of decentralized network applications. The following paper describes a loosely-consistent DHT walker approach for searching advertisements and routing queries in the JXTA rendezvous network. The loosely-consistent DHT walker uses a hybrid approach that combines the use of a DHT to index and locate contents with a limited-range walker to resolve inconsistency of the DHT within the dynamic rendezvous network. This proposed DHT approach does not require maintaining consistency across the rendezvous network or a stable super-peer infrastructure, and is well adapted to ad hoc P2P networks with high peer churn rates.
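A rough sketch of the hybrid lookup idea, with illustrative names rather than JXTA APIs: hash the key onto the locally known, possibly stale, ordered list of rendezvous peers, and if the expected peer does not hold the index entry, walk a limited range of its neighbors.

    import java.util.*;

    // Sketch of a hybrid DHT-plus-limited-walker lookup (illustrative only).
    class LooselyConsistentLookupSketch {
        interface RendezvousPeer {
            String id();
            boolean hasIndexEntry(String key);
        }

        // peers: this node's current (possibly stale) ordered peer view.
        static Optional<RendezvousPeer> lookup(List<RendezvousPeer> peers,
                                               String key, int walkRange) {
            if (peers.isEmpty()) return Optional.empty();
            int expected = Math.floorMod(key.hashCode(), peers.size());
            // Probe the expected peer first, then neighbours at distance 1..walkRange.
            for (int d = 0; d <= walkRange; d++) {
                for (int sign : (d == 0 ? new int[] {1} : new int[] {1, -1})) {
                    int idx = Math.floorMod(expected + sign * d, peers.size());
                    RendezvousPeer peer = peers.get(idx);
                    if (peer.hasIndexEntry(key)) return Optional.of(peer);
                }
            }
            return Optional.empty();   // fall back to broader query propagation
        }
    }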
Towards a Java™-Based Enterprise Client for Small Devices
The goal of the work reported here was to explore the use of the Java 2 Micro Edition (J2ME™) platform for applications connected to the enterprise, specifically focusing on Palm-based wireless applications. We found that the Java™ platform on the Palm is still maturing. The Palm itself has been carefully engineered to support small native applications, with a distinctive graphical user interface tuned for its display. Work remains to be done on the Palm to support more complex wireless applications and to make Java-based applications competitive. We also found that wireless enterprise applications in general are somewhat problematic, due to issues of network reliability, availability, bandwidth, and provisioning. Significantly, programming languages and their platforms are not the gating factors to large scale wireless deployment.
This work was performed in 2000 and 2001, before the current commercial deployment of Java-enabled mobile devices and faster wide-area wireless data services (such as GPRS). We hope to repeat our experiments using these technologies.
Computers without clocks
Scientific American, November 2002, pp. 62-69.
Implementation of a third-generation 1.1-GHz 64-bit microprocessor
IEEE Journal of Solid-State Circuits, Vol. 37, Issue 11, November 2002, pp. 1461-1469.
Radioport: A Radio Network for Monitoring and Diagnosing Computer Systems
A radio network is described for configuring, monitoring, and diagnosing the components of a computer system. Such a network offers several advantages: (a) It improves the robustness of the overall system by not having the monitoring functions rely on the interconnect of the monitored system; (b) by broadcasting information, it offers direct communication between the monitoring and monitored components thereby removing dependencies inherent to hierarchical and daisy-chained wired networks; (c) it does not rely on a physical interconnect thereby lowering implementation cost, offering non-intrusive monitoring, and improving reliability thanks to the lack of error- and failure-prone cables and connectors.
This report is an extended version of a paper presented at HOTI 2002, Stanford, California, August 2002. It received the Most Interesting New Topic Award.
The Least Choice First (LCF) Scheduling Method for High-speed Network Switches
We describe a novel method for scheduling high-speed network switches. The targeted architecture is an input-buffered switch with a non-blocking switch fabric. The input buffers are organized as virtual output queues to avoid head-of-line blocking. The task of the scheduler is to decide when the input ports can forward packets from the virtual output queues to the corresponding output ports. Our Least Choice First (LCF) scheduling method selects the input and output ports to be matched by prioritizing the input ports according to the number of virtual output queues that contain packets: The fewer virtual output queues with packets, the higher the scheduling priority of the input port. This way, the number of switch connections and, with it, switch throughput is maximized. Fairness is provided through the addition of a round-robin algorithm.
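A minimal sketch of one LCF matching pass follows, under the simplifying assumption that the round-robin tie-breaking used in the paper for fairness is omitted: repeatedly match the unmatched input port that has the fewest remaining choices among free outputs.

    import java.util.*;

    // Sketch of one Least-Choice-First matching pass (illustrative only).
    class LcfSchedulerSketch {
        // voqOccupied[i][j] == true if input i has packets queued for output j.
        static int[] schedule(boolean[][] voqOccupied) {
            int n = voqOccupied.length;
            int[] match = new int[n];                  // match[i] = output for input i, or -1
            Arrays.fill(match, -1);
            boolean[] outputTaken = new boolean[n];
            boolean[] inputDone = new boolean[n];
            while (true) {
                int bestInput = -1, bestChoices = Integer.MAX_VALUE;
                for (int i = 0; i < n; i++) {
                    if (inputDone[i]) continue;
                    int choices = 0;
                    for (int j = 0; j < n; j++)
                        if (voqOccupied[i][j] && !outputTaken[j]) choices++;
                    if (choices > 0 && choices < bestChoices) {
                        bestChoices = choices;
                        bestInput = i;
                    }
                }
                if (bestInput == -1) break;            // no input can still be matched
                for (int j = 0; j < n; j++) {
                    if (voqOccupied[bestInput][j] && !outputTaken[j]) {
                        match[bestInput] = j;          // grant one of its few choices
                        outputTaken[j] = true;
                        break;
                    }
                }
                inputDone[bestInput] = true;
            }
            return match;
        }
    }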
We present two alternative implementations: A central implementation intended for narrow switches and a distributed implementation based on an iterative algorithm intended for wide switches.
The simulation results show that the LCF scheduler outperforms other scheduling methods such as the parallel iterative matcher [1], iSLIP [12], and the wave front arbiter [16].
This report is an extended version of a paper presented at IPDPS 2002, Fort Lauderdale, Florida, April 2002.
Separated High-bandwidth and Low-latency Communication in the Cluster Interconnect Clint
An interconnect for a high-performance cluster has to be optimized with respect to both high throughput and low latency. To avoid the tradeoff between throughput and latency, the cluster interconnect Clint has a segregated architecture that provides two physically separate transmission channels: a bulk channel optimized for high-bandwidth traffic and a quick channel optimized for low-latency traffic. Different scheduling strategies are applied. The bulk channel uses a scheduler that globally allocates time slots on the transmission paths before packets are sent off. In this way, collisions as well as blockages are avoided. In contrast, the quick channel takes a best-effort approach by sending packets whenever they are available, thereby risking collisions and retransmissions.
Clint is targeted specifically at small- to medium-sized clusters offering a low-cost alternative to symmetric multiprocessor (SMP) systems. This design point allows for a simple and cost-effective implementation. In particular, by buffering packets only on the hosts and not requiring any buffer memory on the switches, protocols are simplified as switch forwarding delays are fixed, and throughput is optimized as the use of a global schedule is now possible.
This report is an extended version of a paper presented at SC2002, Baltimore, Maryland, November 2002.
DCAS-based Concurrent Deques Supporting Bulk Allocation
We present a lock-free implementation of a dynamically sized double-ended queue (deque) that is based on the double compare-and-swap (DCAS) instruction. This implementation improves over the best previous one by allowing storage to be allocated and freed in bulk when the size of the deque changes significantly, and to avoid invocation of the storage allocator at all while the size remains relatively stable. We achieved this implementation in two steps by first solving the easier problem of implementing the deque for a garbage-collected environment, and then applying the Lock-Free Reference Counting methodology we recently proposed in order to achieve a version independent of garbage collection.
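DCAS atomically compares and updates two independent memory locations. No mainstream hardware or standard Java API exposes it, so the sketch below only illustrates its semantics with a global lock; a real DCAS-based deque relies on the hardware primitive itself and is lock-free, which this stand-in is not.

    import java.util.concurrent.atomic.AtomicReference;

    // Illustration of DCAS *semantics* only; a lock is used here purely to
    // make the two-location compare-and-swap appear atomic.
    class DcasSemanticsSketch {
        private static final Object LOCK = new Object();

        static <T> boolean dcas(AtomicReference<T> loc1, T expect1, T new1,
                                AtomicReference<T> loc2, T expect2, T new2) {
            synchronized (LOCK) {
                if (loc1.get() == expect1 && loc2.get() == expect2) {
                    loc1.set(new1);
                    loc2.set(new2);
                    return true;    // both locations updated together
                }
                return false;       // neither location changed
            }
        }
    }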
Adaptive Coordination Among Fuzzy Reinforcement Learning Agents Performing Distributed Dynamic Load Balancing
In this paper we present an adaptive multi-agent coordination algorithm applied to the problem of distributed dynamic load balancing. As a specific example, we consider the problem of dynamic web caching in the Internet. In our general formulation of this problem, each agent represents a mirrored piece of content that tries to move itself closer to areas of the network with a high demand for this item. Each agent in our model uses a fuzzy rulebase for choosing the optimal direction of motion and adjusts the parameters of this rulebase using reinforcement learning. The resulting architecture for multi-agent coordination among fuzzy reinforcement learning agents (MAC-FRL) allows the team of agents to adaptively redistribute its members in the environment to match the changing pattern of demand. We simulate the performance of MAC-FRL and show that it significantly improves performance over non-coordinating agents.
The Repeat Offender Problem: A Mechanism for Supporting Dynamic-sized Lock-free Data Structures
We define the Repeat Offender Problem (ROP). Elsewhere, we have presented the first dynamic-sized lock-free data structures that can free memory to any standard memory allocator -- even after thread failures -- without requiring special support from the operating system, the memory allocator, or the hardware. These results depend on a solution to the ROP problem. Here we present the first solution to the ROP problem and its correctness proof. Our solution is implementable in most modern shared memory multiprocessors.
Dynamic-sized Lockfree Data Structures
We address the problem of integrating lockfree shared data structures with standard dynamic allocation mechanisms (such as malloc and free).
We have two main contributions. The first is the design and experimental analysis of two dynamic-sized lockfree FIFO queue implementations, which extend Michael and Scott's previous implementation by allowing unused memory to be freed. We compare our dynamic-sized implementations to the original on 16-processor and 64-processor multiprocessors. Our experimental results indicate that the performance penalty for making the queue dynamic-sized is modest, and is negligible when contention is not too high. These results were achieved by applying a solution to the Repeat Offender Problem (ROP), which we recently posed and solved.
Our second contribution is another application of ROP solutions. Specifically, we show how to use any ROP solution to achieve a general methodology for transforming lockfree data structures that rely on garbage collection into ones that use explicit storage reclamation.
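For orientation, the sketch below shows a Michael-and-Scott-style lock-free FIFO queue in the garbage-collected setting, where the JVM reclaims retired nodes automatically; the contribution described above is precisely how to free such nodes explicitly and safely when no collector is available.

    import java.util.concurrent.atomic.AtomicReference;

    // Minimal lock-free FIFO queue in the Michael-and-Scott style,
    // relying on the garbage collector to reclaim dequeued nodes.
    class LockFreeQueueSketch<T> {
        private static final class Node<T> {
            final T value;
            final AtomicReference<Node<T>> next = new AtomicReference<>();
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> head, tail;

        LockFreeQueueSketch() {
            Node<T> dummy = new Node<>(null);
            head = new AtomicReference<>(dummy);
            tail = new AtomicReference<>(dummy);
        }

        void enqueue(T value) {
            Node<T> node = new Node<>(value);
            while (true) {
                Node<T> last = tail.get();
                Node<T> next = last.next.get();
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node);   // swing tail (may fail harmlessly)
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next);       // help a lagging enqueuer
                }
            }
        }

        T dequeue() {
            while (true) {
                Node<T> first = head.get();
                Node<T> next = first.next.get();
                if (next == null) return null;            // queue empty
                if (head.compareAndSet(first, next)) return next.value;
            }
        }
    }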
Developing Secure Web Applications for Constrained Devices
Invited presentation at the 11th World Wide Web Conference, Hawaii, May 2002.
On the Design of a New CPU Architecture for Pedagogical Purposes
Ant-32 is a new processor architecture designed specifically to address the pedagogical needs of teaching many subjects, including assembly language programming, machine architecture, compilers, operating systems, and VLSI design. This paper discusses our motivation for creating Ant-32 and the philosophy we used to guide our design decisions and gives a high-level description of the resulting design.
Experiments in Wireless Internet Security
Proc. of IEEE Wireless Communications and Networking Conference (WCNS), Orlando, Mar. 2002.
Composing snippets
in Advances in Concurrency and Hardware Design (ACHD). Springer-Verlag's Lecture Notes, Computer Science, Vol. 2549, eds. J. Cortadella, A. Yakovlev, and G. Rozenberg. Springer-Verlag, 2002.
Experience in the Design, Implementation and Use of a Retargetable Static Binary Translation Framework
Binary translation, the process of translating binary executables, makes it possible to run code compiled for source (input) machine Ms on target (output) machine Mt. Unlike an interpreter or emulator, a binary translator makes it possible to approach the speed of native code on machine Mt. Translated code may still run slower than native code because low-level properties of machine Ms must often be modeled on machine Mt.
The University of Queensland Binary Translation (UQBT) framework is a retargetable framework for experimenting with static binary translation on CISC and RISC machines. The system was built jointly by The University of Queensland and Sun Microsystems Laboratories in order to experiment with translations to and from different machines, to understand how to migrate applications from other UNIX-based platforms to a (SPARC®, Solaris™) platform, and to experiment with translations from the current SPARC architecture to a future, not yet existing, version of the SPARC architecture.
This paper describes the overall design and architecture of the UQBT framework, the goals for the project, the resulting framework, experiences with translations across different machines, and lessons learned.
Towards a Java™-Based Enterprise Client for Small Devices
The goal of the work reported here was to explore the use of the Java 2 Micro Edition (J2ME™) platform for applications connected to the enterprise, specifically focusing on Palm-based wireless applications. We found that the Java™ platform on the Palm is still maturing. The Palm itself has been carefully engineered to support small native applications, with a distinctive graphical user interface tuned for its display. Work remains to be done on the Palm to support more complex wireless applications and to make Java-based applications competitive. We also found that wireless enterprise applications in general are somewhat problematic, due to issues of network reliability, availability, bandwidth, and provisioning. Significantly, programming languages and their platforms are not the gating factors to large scale wireless deployment.
A Transformational Approach to Binary Translation of Delayed Branches with Applications to SPARC® and PA-RISC Instruction Sets
A binary translator examines binary code for a source machine, optionally builds an intermediate representation, and generates code for a target machine. Understanding what to do with delayed branches in binary code can involve tricky case analyses, e.g., if there is a branch instruction in a delay slot. Correctness of a translation is of utmost importance. This paper presents a disciplined method for deriving such case analyses. The method identifies problematic cases, shows the translations for the non-problematic cases, and gives confidence that all cases are considered. The method supports such common architectures as SPARC®, MIPS, and PA-RISC.
We begin by writing a very simple interpreter for the source machine's code. We then transform the interpreter into an interpreter for a target machine without delayed branches. To maintain the semantics of the program being interpreted, we simultaneously transform the sequence of source-machine instructions into a sequence of target-machine instructions. The transformation of the instructions becomes our algorithm for binary translation. We show the translation is correct by reasoning about corresponding states on source and target machines.
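The simplest case of the transformation can be pictured as reordering: on a target machine without delay slots, a delayed branch and its delay-slot instruction are emitted with the slot instruction first. The toy sketch below handles only that case, and assumes the slot instruction does not modify registers the branch reads; the harder cases (annulled branches, control transfers in delay slots) are exactly what the derivation method above enumerates.

    import java.util.*;

    // Toy sketch: reorder (delayed branch, delay-slot instruction) pairs for
    // a target machine without delay slots. Non-problematic case only.
    class DelaySlotSketch {
        static boolean isDelayedBranch(String insn) {
            return insn.startsWith("b");               // toy instruction classifier
        }

        static List<String> translate(List<String> source) {
            List<String> target = new ArrayList<>();
            for (int i = 0; i < source.size(); i++) {
                String insn = source.get(i);
                if (isDelayedBranch(insn) && i + 1 < source.size()) {
                    target.add(source.get(i + 1));     // delay-slot instruction first
                    target.add(insn);                  // then the branch itself
                    i++;                               // the slot has been consumed
                } else {
                    target.add(insn);
                }
            }
            return target;
        }
    }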
Instantiation of this algorithm to the SPARC V8 and PA-RISC V1.1 architectures is shown. Of interest, these two machines share seven of 11 classes of delayed branching semantics; the PA-RISC has three classes which are not available in the SPARC architecture, and the SPARC architecture has one class which is not available in the PA-RISC architecture.
Although the delayed branch is an architectural idea whose time has come and gone, the method is significant to anyone who must write tools that deal with legacy binaries. For example, translators using this method could run PA-RISC on the new IA-64 architecture, or they may enable architects to eliminate delayed branches from a future version of the SPARC architecture.
*This report is a very extended version of TR 440, Department of Computer Science and Electrical Engineering, The University of Queensland, Dec 1998, and describes applications of the technique to translations of SPARC® and PA-RISC codes. This report fully documents the translation algorithms for these machines.
Walkabout: A Retargetable Dynamic Binary Translation Framework
Dynamic compilation techniques have found a renaissance in recent years due to their use in high-performance implementations of the Java™ language. Techniques originally developed for use in virtual machines for such object-oriented languages as Smalltalk are now commonly used in Java virtual machines (JVM™) and Java just-in-time compilers. These techniques have also been applied to binary translation in recent years, most commonly appearing in binary optimizers for a given platform that improve the performance of binary programs while they execute.
The Walkabout project investigates and develops dynamic binary translation techniques that are based on properties of retargetability, ease of experimentation, separation of machine-dependent from machine-independent concerns, and good debugging support. Walkabout is a framework for experimenting with dynamic binary translation ideas, as well as techniques in related areas such as interpreters, instrumentation tools, and optimization.
In this report, we present the design of the Walkabout framework and its initial implementation. Tools generated from this initial framework include disassemblers, machine code interpreters (emulators), and binary rewriting tools for the SPARC® and x86 architectures.
Securing the Wireless Internet
IEEE Communications Magazine, pp. 68-74.
KSSL: Experiments in Wireless Internet Security
Internet enabled wireless devices continue to proliferate and are expected to surpass traditional Internet clients in the near future. This has opened up exciting new opportunities in the mobile e-commerce market. However, data security and privacy remain major concerns in the current generation of "wireless web" offerings. All such offerings today use a security architecture that lacks end-to-end security. This unfortunate choice is driven by perceived inadequacies of standard Internet security protocols like SSL (Secure Sockets Layer) on less capable CPUs and low-bandwidth wireless links.
This report presents our experiences in implementing and using standard security mechanisms and protocols on small wireless devices. We have created new classes for the Java 2 Micro Edition (J2ME™) platform that offer fundamental cryptographic operations such as message digests and ciphers as well as higher level security protocols like SSL. Our results show that SSL is a practical solution for ensuring end-to-end security of wireless Internet transactions even within today's technological constraints.
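The KSSL classes themselves are not reproduced here; as a stand-in, the snippet below uses the standard java.security API to compute a message digest, the kind of primitive whose cost on small devices the report measures.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Stand-in example using the standard Java SE API (not the KSSL classes).
    class DigestExample {
        public static void main(String[] args) throws Exception {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest("wireless transaction".getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            System.out.println("SHA-1: " + hex);
        }
    }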
Parallel Garbage Collection For Shared Memory Multiprocessors
We present a multiprocessor "stop-the-world" garbage collection framework that provides multiple forms of load balancing. Our parallel collectors use this framework to balance the work of root scanning, using static overpartitioning, and also to balance the work of tracing the object graph, using a form of dynamic load balancing called work stealing. We describe two collectors written using this framework: pSemispaces, a parallel semispace collector, and pMarkcompact, a parallel markcompact collector.
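The work-stealing discipline for tracing can be sketched as follows: each GC thread takes work from its own end of its deque and, when that runs dry, steals from the opposite end of another thread's deque. ConcurrentLinkedDeque stands in here for the collector's own work-stealing queues; the names are illustrative, not the framework's API.

    import java.util.concurrent.ConcurrentLinkedDeque;

    // Sketch of the work-stealing discipline used to balance object tracing.
    class TraceWorkStealingSketch {
        final ConcurrentLinkedDeque<Object>[] deques;

        @SuppressWarnings("unchecked")
        TraceWorkStealingSketch(int nThreads) {
            deques = new ConcurrentLinkedDeque[nThreads];
            for (int i = 0; i < nThreads; i++) deques[i] = new ConcurrentLinkedDeque<>();
        }

        // Called by GC thread 'self' to obtain the next object to trace.
        Object nextWork(int self) {
            Object task = deques[self].pollLast();         // own end first
            if (task != null) return task;
            for (int i = 1; i < deques.length; i++) {      // otherwise try to steal
                int victim = (self + i) % deques.length;
                task = deques[victim].pollFirst();         // opposite end of the victim
                if (task != null) return task;
            }
            return null;                                   // no work available right now
        }

        // Called when tracing an object discovers more objects to scan.
        void pushWork(int self, Object obj) {
            deques[self].addLast(obj);
        }
    }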