Our Publications
Every year our researchers publish hundreds of papers to share their findings with the industry and the academic community. Our primary research areas are big data and machine learning, cloud computing, and programming languages.
Research Papers
Towards an Abstraction for Verifiable Credentials and Zero Knowledge Proofs
Most standards efforts and projects around Verifiable Credentials either do not enable use of Zero Knowledge Proofs to balance privacy and accountability, or are too tightly tied to specific cryptographic libraries, which limits choice, flexibility, progress and sustainability. For example, if a project targets a cryptographic library that stops being maintained or otherwise becomes an undesirable dependency, these events can threaten the sustainability of the whole project. We are working on an abstraction to address this problem, which has additional benefits such as making it much simpler to express and understand use case requirements, especially for people without expertise in using specific cryptography libraries. These slides share some of our observations, ideas, experience and opinions so far.
Macaron: A Logic-based Framework for Software Supply Chain Security Assurance
Many software supply chain attacks exploit the fact that what is in a source code repository may not match the artifact that is actually deployed in one’s system. This paper describes a logic-based framework that analyzes a software component and its dependencies to determine if they are built in a trustworthy fashion. The properties that are checked include the availability of build provenances and whether the build and deployment process of an artifact is tamper resistant. These properties are based on open-source community efforts, such as SLSA, that enable an incremental approach to improving supply chain security. We evaluate our tool on the top-30 Java, Python, and npm open-source projects and show that the majority still do not produce provenances. Our evaluation also shows that a large number of open-source Java and Python projects do not have a transparent build platform to produce artifacts, which is a necessary requirement to increase the trust in the published artifacts. We show that our tool fills a gap in the current software supply chain security landscape, and, by making it publicly available, we enable the open-source community to both benefit from and contribute to it.
Smoothing Entailment Graphs with Language Models
The diversity and Zipfian frequency distribution of natural language predicates in corpora leads to sparsity in Entailment Graphs (EGs) built by Open Relation Extraction (ORE). EGs are theoretically-founded and computationally efficient, but as symbolic models for natural language inference, they fail if a novel premise or hypothesis vertex is missing at test-time. We introduce a theory of optimal graph smoothing to overcome vertex sparsity by constructing transitive chains. We then demonstrate an efficient, open-domain smoothing method using an off-the-shelf Language Model to find approximations of missing premise predicates, improving recall by 25.1 and 16.3 percentage points on two difficult directional entailment datasets while raising average precision. Further, in a recent QA task, we show that EG smoothing is most useful for answering questions with less supporting text, where missing predicates are more costly. Finally, in controlled experiments with WordNet we show that hypothesis smoothing is difficult, but possible in principle.
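As a rough illustration of the smoothing idea (not the paper's actual method), the sketch below uses an off-the-shelf sentence-embedding model to map an unseen premise predicate to its nearest known graph vertex before looking up entailments. The encoder name and the toy graph are assumptions.

```python
# Illustrative sketch only: approximate a missing premise predicate with its
# nearest known Entailment Graph vertex using an off-the-shelf language model.
from sentence_transformers import SentenceTransformer
import numpy as np

# Toy entailment graph: premise predicate -> entailed hypothesis predicates (assumed data).
entailment_graph = {
    "defeat": ["play against", "compete with"],
    "acquire": ["own", "negotiate with"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed choice of encoder
known = list(entailment_graph)
known_emb = model.encode(known, normalize_embeddings=True)

def smoothed_entailments(premise: str):
    """If the premise is missing from the graph, back off to its nearest known vertex."""
    if premise in entailment_graph:
        return entailment_graph[premise]
    q = model.encode([premise], normalize_embeddings=True)[0]
    nearest = known[int(np.argmax(known_emb @ q))]   # cosine similarity via dot product
    return entailment_graph[nearest]

print(smoothed_entailments("triumph over"))  # backs off to "defeat"
```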
Diagnosing Compiler Performance by Comparing Optimization Decisions
Modern compilers apply a set of optimization passes aiming to speed up the generated code. The combined effect of individual optimizations is hard to predict. Thus, changes to a compiler’s code may hinder the performance of generated code as an unintended consequence. Performance regressions are often related to misapplied optimizations. The regressions are hard to investigate, considering the vast number of compilation units and applied optimizations. Additionally, a method may be part of several compilation units and optimized differently in each. Moreover, compiled methods and inlining decisions are not invariant across runs of the virtual machine (VM). We propose to solve the problem of diagnosing performance regressions by capturing the compiler’s optimization decisions. We do so by representing the applied optimization phases, optimization decisions, and inlining decisions in the form of trees. This paper introduces an approach utilizing tree edit distance (TED) to detect optimization differences in a semi-automated way. We present an approach to compare optimization decisions in differently-inlined methods. We employ these techniques to pinpoint the causes of performance problems in various benchmarks of the Graal compiler.
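As a small, hedged illustration of comparing optimization-decision trees with tree edit distance (not the Graal tooling itself), the following sketch uses the open-source `zss` implementation of the Zhang-Shasha algorithm; the tree contents are made up.

```python
# Illustrative only: compare two (made-up) optimization-decision trees with
# tree edit distance using the zss library (Zhang-Shasha algorithm).
from zss import Node, simple_distance

# Optimization phases/decisions recorded for the same method in two compiler versions.
before = (Node("Method foo")
          .addkid(Node("LoopPeeling: applied"))
          .addkid(Node("Inlining: bar inlined")))

after = (Node("Method foo")
         .addkid(Node("LoopPeeling: not applied"))   # decision changed
         .addkid(Node("Inlining: bar inlined")))

# A non-zero distance flags methods whose optimization decisions diverged,
# which are candidates for closer inspection in a performance regression.
print("tree edit distance:", simple_distance(before, after))
```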
Automated Machine Learning with Explainability
ML has revolutionized a wide range of industry applications with new techniques for consuming complex data modalities, such as images and text. However, for a given dataset and business use case, non-technical users face adoption-limiting questions, such as which model to use and how to set its hyper-parameters. This is challenging and time-consuming even for seasoned data scientists. The AutoMLx team at Oracle Labs has developed an automated machine learning pipeline with explainability tools built in for novice and advanced users. In this talk, we provide an overview of our current and upcoming AutoMLx features and some applications; for example, how to predict construction site delays and how to forecast CPU resource usage based on previous consumption trends.
Security Research: Program Analysis Meets Security
In this paper we present the key features of some of the security analysis tools developed at Oracle Labs. These include Parfait, a static analyser; Affogato, a dynamic analysis tool based on run-time instrumentation of Node.js applications; and Gelato, a dynamic analysis tool that inspects only the client-side code written in JavaScript. We show how these tools can be integrated at different phases of the software development life-cycle. This paper is based on the presentation at the ICTAC school in 2021.
Improving Points-to Analysis with Compiler Optimizations
Points-to analysis and compiler optimizations are often seen as separate topics: Even when points-to analysis is used to optimize an application, the analysis is not leveraging the compiler used later for compilation. We integrate a points-to analysis into the compilation process, i.e., use the compiler IR as input for the analysis and apply the analysis results back into the same compiler IR. This makes it easy to run compiler optimization phases like method inlining and constant folding before the analysis, which increases the precision of the analysis. Also, this simplifies the process of applying analysis results because it eliminates the need to map the analysis results back to the unaltered program. In order to run points-to analysis as part of every build process during development, it also needs to scale well for large applications. To improve the scalability of points-to analysis, we propose saturation to remove variables for which the analysis finds more than a certain number of types. We show that saturation significantly reduces the analysis time while having only a small impact on the precision. To show the scalability and precision of our proposed approach, we evaluate the resulting system with Java web services and benchmarks. We compare our optimized analysis with various configurations of a context-insensitive analysis and also a standard context-sensitive points-to analysis definition.
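To make the saturation idea concrete, here is a minimal, language-agnostic sketch (not the actual analysis implementation): each variable accumulates a type set, and once the set exceeds a threshold the variable is marked saturated and treated conservatively instead of being tracked further. The threshold and names are assumptions.

```python
# Minimal sketch of type-flow saturation in a points-to analysis (illustrative only).
SATURATION_THRESHOLD = 3  # assumed cutoff

class TypeFlow:
    def __init__(self, name):
        self.name = name
        self.types = set()
        self.saturated = False

    def add_types(self, new_types):
        """Propagate types into this flow; saturate once the set grows too large."""
        if self.saturated:
            return
        self.types |= set(new_types)
        if len(self.types) > SATURATION_THRESHOLD:
            self.types.clear()        # stop tracking individual types
            self.saturated = True     # from now on: "any type" (conservative)

    def possible_types(self, all_types):
        return all_types if self.saturated else self.types

flow = TypeFlow("receiver of x.toString()")
flow.add_types({"String", "Integer"})
flow.add_types({"Long", "Double"})           # pushes the flow past the threshold
print(flow.saturated)                         # True: analysed conservatively, but cheaply
```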
Vibration Resonance Spectrometry (VRS) for the Advanced Streaming Detection of Rotor Unbalance
Determination of diagnosis thresholds is crucial for the fault diagnosis of industry assets. Rotor machines under different working conditions are especially challenging because of the dynamic torque and speed. In this paper, an advanced machine-learning-based signal processing innovation termed the multivariate state estimation technique is proposed to improve the accuracy of the diagnosis thresholds. A novel preprocessing technique called vibration resonance spectrometry is also applied to achieve a low computation cost capability for real-time condition monitoring. The monitoring system that utilizes the above methods is then applied for prognostics of a fan model as an example. Different levels of radial unbalance were added to the fan and tested, and then compared with the healthy state. The results show that the proposed methodology can detect the unbalance with good accuracy and low computation cost. The proposed methodology can be applied to complex engineering assets for better predictive monitoring that could be processed with on-premise edge devices, or eventually a cloud platform, due to its capacity for lossless dimension reduction.
Better Distributed Graph Query Planning With Scouting Queries
Query planning is essential for graph query execution performance. In distributed graph processing, data partitioning and messaging significantly influence performance. However, these aspects are difficult to model analytically, which makes query planning especially challenging. This paper introduces scouting queries, a lightweight mechanism to gather runtime information about different query plans, which can then be used to choose the “best” plan. In a depth-first-oriented graph processing engine, scouting queries typically execute for a brief amount of time with negligible overhead. Partial results can be reused to avoid redundant work. We evaluate scouting queries and show that they bring speedups of up to 8.7× for heavy queries, while adding low overhead for queries that do not benefit.
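The following toy sketch (not PGX.D code) shows the basic mechanism under assumed plan generators: each candidate plan runs for a short time budget, the runtime observes how much progress it makes, and the most productive plan is chosen while its partial results are kept for reuse.

```python
# Toy illustration of scouting queries: briefly run each candidate plan,
# then pick the plan that produced the most matches per unit of time.
import time

def scout(plans, budget_s=0.05):
    """plans: dict name -> generator factory yielding partial results."""
    scores, partials = {}, {}
    for name, make_plan in plans.items():
        produced, deadline = [], time.perf_counter() + budget_s
        for row in make_plan():
            produced.append(row)
            if time.perf_counter() >= deadline:
                break
        scores[name] = len(produced)      # observed throughput within the budget
        partials[name] = produced         # partial results can be reused later
    best = max(scores, key=scores.get)
    return best, partials[best]

# Hypothetical plans that enumerate the same matches in different orders.
plans = {
    "start_from_person": lambda: (i for i in range(10_000_000)),
    "start_from_city":   lambda: (i for i in range(10_000_000) if i % 2 == 0),
}
best, reusable = scout(plans)
print("chosen plan:", best, "| partial results kept:", len(reusable))
```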
Towards Intelligent Application Security
Over the past 20 years we have seen application security evolve from analysing application code through Static Application Security Testing (SAST) tools, to detecting vulnerabilities in running applications via Dynamic Application Security Testing (DAST) tools. The past 10 years have seen new flavours of tools that provide combinations of static and dynamic analysis via Interactive Application Security Testing (IAST), examination of the components and libraries of the software through Software Composition Analysis (SCA), protection of web applications and APIs using signature-based Web Application Firewalls (WAF), and monitoring of the application and blocking of attacks through Runtime Application Self Protection (RASP) techniques. The past 10 years have also seen an increase in the uptake of the DevOps model, which combines software development and operations to provide continuous delivery of high-quality software. As security has become more important, the DevOps model has evolved into the DevSecOps model, where software development, operations and security are all integrated. There has also been increasing usage of learning techniques, including machine learning and program synthesis. Several tools have been developed that use machine learning to help developers make quality decisions about their code, tests, or the runtime overhead their code produces. However, such techniques have not yet been applied to application security. In this talk I discuss how to provide an automated approach to integrating security into all aspects of application development and operations, aided by learning techniques. This approach incorporates signals from code, operations and beyond, together with automation, to provide actionable intelligence to developers, security analysts, operations staff, and autonomous systems. I will also consider how malware and threat intelligence can be incorporated into this model to support Intelligent Application Security in a rapidly evolving world. Bio: https://labs.oracle.com/pls/apex/f?p=94065:11:8452080560451:21 LinkedIn: https://www.linkedin.com/in/drcristinacifuentes/ Twitter: @criscifuentes
A Reachability Index for Recursive Label-Concatenated Graph Queries
Reachability queries checking the existence of a path from a source node to a target node are fundamental operators for querying and processing graph data. Current approaches for index-based evaluation of reachability queries either focus on plain reachability or constraint-based reachability with only alternation of labels. In this paper, for the first time we study the problem of index-based processing for recursive label-concatenated reachability queries, referred to as RLC queries. These queries check the existence of a path that can satisfy the constraint defined by a concatenation of at most k edge labels under the Kleene plus. Many practical graph database and network analysis applications exhibit RLC queries. However, their evaluation remains prohibitive in current graph database engines. We introduce the RLC index, the first reachability index to efficiently process RLC queries. The RLC index checks whether the source vertex can reach an intermediate vertex that can also reach the target vertex under a recursive label-concatenated constraint. We propose an indexing algorithm to build the RLC index, which guarantees the soundness and the completeness of query execution and avoids recording redundant index entries. Comprehensive experiments on real-world graphs show that the RLC index can significantly reduce both the offline processing cost and the memory overhead of transitive closure, while improving query processing up to six orders of magnitude over online traversals. Finally, our open-source implementation of the RLC index significantly outperforms current mainstream graph engines for evaluating RLC queries.
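The sketch below is only meant to make the query semantics concrete; it is a plain online traversal over a toy labeled graph, not the RLC index. It checks whether some path from source to target has an edge-label sequence matching the concatenation of the given labels repeated one or more times (Kleene plus).

```python
# Illustrative online check for an RLC query (l1 ... lk)+ on a small labeled graph.
# The actual paper builds an index instead of traversing at query time.
from collections import deque

# Adjacency list with edge labels: node -> [(label, neighbor), ...] (toy data).
graph = {
    "a": [("knows", "b")],
    "b": [("worksAt", "c")],
    "c": [("knows", "d")],
    "d": [("worksAt", "e")],
}

def rlc_reachable(src, dst, labels):
    """Is there a path src -> dst whose label sequence matches (labels)+ ?"""
    k = len(labels)
    # State: (current node, position inside the label concatenation).
    seen, queue = {(src, 0)}, deque([(src, 0)])
    while queue:
        node, pos = queue.popleft()
        if node == dst and pos == 0 and node != src:
            return True                      # completed one or more full repetitions
        for label, nxt in graph.get(node, []):
            if label == labels[pos]:
                state = (nxt, (pos + 1) % k)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

print(rlc_reachable("a", "e", ["knows", "worksAt"]))  # True: a (knows worksAt)+ path exists
```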
Oracle AutoMLx
This presentation introduces Oracle Labs' AutoMLx package to an audience of university students.
AutoML on the Half Shell: How are our Oysters?
This is a presentation to be given at the Analytics and Data Summit 2023 (Redwood Shores, CA, March 14, 2023). It combines two public talks from CloudWorld 2022: 1. a general AutoMLx overview, and 2. a specific ML use case in which an oyster dataset (in collaboration with the University of New Orleans) is used to showcase AutoML in Oracle Machine Learning (OML). The task is to predict health risk to oysters. We are allowed to use the dataset as we have a signed DUA between Oracle and the University of New Orleans.
Control Flow Duplication for Columnar Arrays in a Dynamic Compiler
Columnar databases are an established way to speed up online analytical processing (OLAP) queries. Nowadays, data processing (e.g., storage, visualization, and analytics) is often performed at the programming language level, hence it is desirable to also adopt columnar data structures for common language runtimes. While there are frameworks, libraries, and APIs to enable columnar data stores in programming languages, their integration into applications typically requires developer intervention. In prior work, researchers implemented an approach for automated transformation of arrays into columnar arrays in the GraalVM JavaScript runtime. However, this approach suffers from performance issues on smaller workloads as well as on more complex nested data structures. We find that the key to optimizing accesses to columnar arrays is to identify queries and apply specific optimizations to them. In this paper, we describe novel compiler optimizations in the GraalVM Compiler that optimize queries on columnar arrays. At (JIT) compile time, we identify loops that access potentially columnar arrays and duplicate them in order to specifically optimize accesses to columnar arrays. Additionally, we describe a new approach for creating columnar arrays from arrays consisting of complex objects by performing multi-level storage transformation. We demonstrate our approach via an implementation for JavaScript Date objects. Our work shows that automatic transformation of arrays to columnar storage is feasible even for small workloads and that more complex arrays of objects could benefit from a multi-level transformation. Furthermore, we show how we can optimize methods that handle arrays in different states by the use of duplication. We evaluated our work on microbenchmarks and established data analytics workloads (TPC-H) to demonstrate that it significantly outperforms previous efforts, with speedups of up to 14x for particular queries. Queries additionally benefit from multi-level transformation, reaching speedups of up to 5x. Additionally, we show that we do not cause significant overhead on workloads not suitable for storage transformation. We argue that automatically created columnar arrays could aid developers in data-centric applications as an alternative approach to using dedicated APIs on manually created columnar arrays. Via automatic detection and optimization of queries on potentially columnar arrays, we can improve performance of data processing and further enable its use in common—particularly dynamic—programming languages.
Presentation of Prognostic and Health Management System in AeroConf 2023
Oracle has an anomaly detection solution for monitoring time-series telemetry signals for dense-sensor IoT prognostic applications. It integrates an advanced prognostic pattern recognition technique called the Multivariate State Estimation Technique (MSET) for high-sensitivity prognostic fault monitoring applications in commercial nuclear power and aerospace applications. MSET has since been spun off and met with commercial success for prognostic Machine Learning (ML) applications in a broad range of safety-critical applications, including NASA space shuttles, oil-and-gas asset prognostics, and commercial aviation streaming prognostics. MSET possesses significant advantages over conventional ML solutions including neural networks, autoassociative kernel regression, and support vector machines. The main advantages include earlier warning of incipient anomalies in complex time-series signatures, and much lower compute overhead due to the deterministic mathematical structure of MSET. Both are crucial for dense-sensor avionic IoT prognostics. In addition, Oracle has developed an extensive portfolio of data preprocessing innovations around MSET to solve the common big-data challenges that cause conventional ML algorithms to perform poorly in terms of prognostic accuracy (i.e., false/missed-alarm probabilities). Oracle's MSET-based prognostic solution helps increase avionic reliability margins and system availability objectives while reducing costly sources of “no fault found” events that have become a significant sparing-logistics issue for many industries including aerospace and avionics. Moreover, by utilizing and correlating information from all on-board telemetry sensors (e.g., distributed pressure, voltage, temperature, current, airflow and hydraulic flow), MSET is able to provide the best possible prediction of failure precursors and onset of small degradation for the electronic components used on aircraft, benefiting the aviation Prognostics and Health Management (PHM) system.
Smoothing Entailment Graphs with Language Models
The diversity and Zipfian frequency distribution of natural language predicates in corpora leads to sparsity when learning Entailment Graphs. As symbolic models for natural language inference, an EG cannot recover if missing a novel premise or hypothesis at test-time. In this paper we approach the problem of vertex sparsity by introducing a new method of graph smoothing, using a Language Model to find the nearest approximations of missing predicates. We improve recall by 25.1 and 16.3 absolute percentage points on two difficult directional entailment datasets while exceeding average precision, and show a complementarity with other improvements to edge sparsity. On an extrinsic QA task, we show that smoothing benefits the lower-resource questions, those with less available context. We further analyze language model embeddings and discuss why they are naturally suitable for premise-smoothing, but not hypothesis smoothing. Finally, we formalize a theory for smoothing a symbolic inference method by constructing transitive chains to smooth both the premise and hypothesis.
Introduction to graph processing with PGX (guest lecture at ENSIMAG)
Graph processing is already an integral part of big-data analytics, mainly because graphs can naturally represent data that capture fine-grained relationships among entities. Graph analysis can provide valuable insights about such data by examining these relationships. In this presentation, we will first introduce the concept of graphs and illustrate why and how graph processing can be a valuable tool for data scientists. We will then describe the differences between graph analytics/algorithms (such as Pagerank [1]) and graph queries (such as `(:person)-[:friend]->(:person)`). Second, we will summarize the different tools and technologies included in our Oracle Labs PGX [2] project and show how they provide efficient solutions to the main graph-processing problems. Finally, we will describe a few current and future directions in graph processing, including graph machine learning and distributed graphs (that could potentially lead to great topics for internships).
Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle
Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).
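The idea can be sketched with ONNX Runtime's Python API (the paper's implementation lives inside the OnnxRuntime framework itself; the model path, input name, and core counts below are assumptions): split an inference job into independent parts and give each part a session restricted to a few cores, instead of letting one session occupy every core.

```python
# Hedged sketch of the divide-and-conquer idea with ONNX Runtime's Python API.
# Assumptions: "model.onnx" exists, its input tensor is called "input",
# and the machine has enough cores for two 4-thread sessions.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import onnxruntime as ort

def make_session(n_threads):
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = n_threads      # cap the cores this part may use
    return ort.InferenceSession("model.onnx", sess_options=opts)

batch = np.random.rand(8, 3, 224, 224).astype(np.float32)
parts = np.array_split(batch, 2)               # break the job into independent parts
sessions = [make_session(4), make_session(4)]  # e.g. 2 parts x 4 cores instead of 1 x 8

def run(sess, x):
    return sess.run(None, {"input": x})[0]

with ThreadPoolExecutor(max_workers=len(parts)) as pool:
    outputs = list(pool.map(run, sessions, parts))

result = np.concatenate(outputs)               # same result as one big run, often faster on CPUs
```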
Exploring topic models to discern zero-day vulnerabilities on Twitter through a case study on log4shell
Twitter has demonstrated advantages in providing timely information about zero-day vulnerabilities and exploits. The large volume of unstructured tweets, on the other hand, makes it difficult for cybersecurity professionals to perform manual analysis and investigation into critical cyberattack incidents. To improve the efficiency of data processing on Twitter, we propose a novel vulnerability discovery and monitoring framework that can collect and organize unstructured tweets into semantically related topics with temporal dynamic patterns. Unlike existing supervised machine learning methods that process tweets based on a labelled dataset, our framework is unsupervised, making it better suited for analyzing emerging cyberattack and vulnerability incidents when no prior knowledge is available (e.g., zero-day vulnerabilities and incidents). The proposed framework compares three topic modeling techniques (Latent Dirichlet Allocation, Non-negative Matrix Factorization and Contextualized Topic Modeling) in combination with different text representation methods (bag-of-words and contextualized pre-trained language models) on a Twitter dataset collected from 47 influential users in the cybersecurity community. We show how the proposed framework can be used to analyze a critical zero-day vulnerability incident (Log4shell) in the Apache Log4j Java library in order to understand its temporal evolution and dynamic patterns across its vulnerability life-cycle. Results show that our proposed framework can be used to effectively analyze vulnerability-related topics and their dynamic patterns. Twitter can reveal valuable information regarding early indicators of exploits and user behaviors. The pre-trained contextualized text representation shows advantages for the unstructured, domain-dependent, sparse Twitter textual data in the cybersecurity domain.
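A minimal sketch of the kind of pipelines the framework compares (not the authors' code; the tweets below are placeholders): bag-of-words with LDA and TF-IDF with NMF, both from scikit-learn.

```python
# Minimal sketch of two of the compared topic models (LDA and NMF) using scikit-learn.
# The tweets below are placeholders, not the paper's dataset.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

tweets = [
    "log4shell exploit observed in the wild against log4j",
    "patch your log4j deployments now, CVE-2021-44228 is critical",
    "new phishing campaign targets cloud credentials",
]

# Bag-of-words + LDA
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_bow)

# TF-IDF + NMF
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(tweets)
nmf = NMF(n_components=2, random_state=0).fit(X_tfidf)

def top_words(model, vocab, n=5):
    return [[vocab[i] for i in comp.argsort()[-n:]] for comp in model.components_]

print(top_words(lda, bow.get_feature_names_out()))
print(top_words(nmf, tfidf.get_feature_names_out()))
```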
Distributed Graph Processing with PGX.D (2022)
Graph processing is one of the top data analytics trends. In particular, graph processing comprises two main styles of analysis, namely graph algorithms and graph pattern-matching queries. Classic graph algorithms, such as Pagerank, repeatedly traverse the vertices and edges of the graph and calculate some desired (mathematical) function. Graph queries enable the interactive exploration and pattern matching of graphs. For example, queries like `SELECT p1.name, p2.name FROM MATCH (p1:person)-[:friend]->(p2:person) WHERE p1.country = p2.country` combine the classic operations found in SQL with graph patterns. Both algorithms and queries are very challenging workloads, especially in a distributed setting, where very large graphs are partitioned across multiple machines. In this lecture, I will present how the distributed PGX [1] engine (known as PGX.D; developed at Oracle Labs [2] Zurich) implements efficient algorithms and queries and solves problems, such as data skew and intermediate-result explosion. In brief, for graph algorithms, PGX.D offers the functionality to compile simple sequential textbook-style GreenMarl [3] algorithms to efficient distributed execution. For queries, PGX.D includes a depth-first asynchronous computation runtime [4] that enables limiting the amount of intermediate data during query execution to essentially support "any-size" patterns. [1] http://www.oracle.com/technetwork/oracle-labs/parallel-graph-analytix/overview/index.html [2] https://labs.oracle.com [3] Green-Marl: A DSL for easy and efficient graph analysis, ASPLOS'12. [4] aDFS: An Almost Depth-First-Search Distributed Graph-Querying System. USENIX ATC'21.
EMNLP'22 Presentation of Proxy Clean Work: Mitigating Bias by Proxy in Pre-Trained Models
Transformer-based pre-trained models are known to encode societal biases, not only in their contextual representations but also in their downstream predictions when fine-tuned on task-specific data. We present D-BIAS, an approach that selectively eliminates stereotypical associations (e.g., co-occurrence statistics) at fine-tuning, such that the model doesn’t learn to excessively rely on those signals. D-BIAS attenuates biases from both identity words and frequently co-occurring proxies, which we select using pointwise mutual information. We apply D-BIAS to a) occupation classification, and b) toxicity classification and find that our approach substantially reduces downstream biases (> 60% in toxicity classification for identities that are most frequently flagged as toxic on online platforms). In addition, we show that D-BIAS dramatically improves upon scrubbing, i.e., removing only the identity words in question. We also demonstrate that D-BIAS easily extends to multiple identities and achieves competitive performance with two recently proposed debiasing approaches: R-LACE and INLP.
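As a toy illustration of the proxy-selection step only (pointwise mutual information between an identity word and co-occurring tokens; not the full D-BIAS method), the corpus and counts below are made up.

```python
# Toy illustration of selecting proxy words by pointwise mutual information (PMI)
# with an identity term; the corpus is made up.
import math
from collections import Counter

docs = [
    "nurse she hospital caring",
    "engineer he code deadline",
    "nurse she shift night",
    "engineer he server outage",
]
identity = "she"

doc_tokens = [set(d.split()) for d in docs]
n = len(doc_tokens)
word_count = Counter(w for toks in doc_tokens for w in toks)
joint_count = Counter(w for toks in doc_tokens if identity in toks for w in toks)

def pmi(word):
    p_w = word_count[word] / n
    p_id = word_count[identity] / n
    p_joint = joint_count[word] / n
    return math.log(p_joint / (p_w * p_id)) if p_joint else float("-inf")

proxies = sorted((w for w in word_count if w != identity), key=pmi, reverse=True)
print(proxies[:3])   # frequently co-occurring proxies (e.g. 'nurse') to attenuate alongside the identity word
```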
Feeling Validated: Constructing Validation Sets for Few-Shot Learning
We study validation set construction via data augmentation in true few-shot text classification. Empirically, we show that task-agnostic methods---known to be ineffective for improving test set accuracy for state-of-the-art models when used to augment the training set---are effective for model selection when used to build validation sets. However, accuracy on validation sets synthesized via these techniques does not provide a good estimate of test set accuracy. To support better estimates, we propose DAugSS, a generative method for domain-specific data augmentation that is trained once on task-agnostic data and then employed for augmentation on any data set, by using provided training examples and a set of guide words as a prompt. In experiments with 6 data sets, both 5 and 10 examples per class, training the last layer weights and full fine-tuning, and the choice of 4 continuous-valued hyperparameters, DAugSS is better than or competitive with other methods of validation set construction, while also facilitating better estimates of test set accuracy.
Feeling Validated: Constructing Validation Sets for Few-Shot Intent Classification
We study validation set construction via data augmentation in true few-shot intent classification. Empirically, we demonstrate that with scarce data, model selection via a moderate number of generated examples consistently leads to higher test set accuracy than either model selection via a small number of held-out training examples, or selection of the model with the lowest training loss. For each of these methods of model selection -- including validation sets built from task-agnostic data augmentation -- validation accuracy provides a significant overestimate of test set accuracy. To support better estimates and effective model selection, we propose PanGeA, a generative method for domain-specific augmentation that is trained once on out-of-domain data, and then employed for augmentation for any domain-specific dataset. In experiments with 6 datasets that have been subsampled to both 5 and 10 examples per class, we show that PanGeA is better than or competitive with other methods in terms of model selection while also facilitating higher-fidelity estimates of test set accuracy.
A Multi-Target, Multi-Paradigm DSL Compiler for Algorithmic Graph Processing
Domain-specific language compilers need to close the gap between the domain abstractions of the language and the low-level concepts of the target platform. This can be challenging to achieve for compilers targeting multiple platforms with potentially very different computing paradigms. In this paper, we present a multi-target, multi-paradigm DSL compiler for algorithmic graph processing. Our approach centers around an intermediate representation and reusable, composable transformations to be shared between the different compiler targets. These transformations embrace abstractions that align closely with the concepts of a particular target platform, and disallow abstractions that are semantically more distant. Our compiler supports four different target platforms, each involving a different computing paradigm. We report on our experience implementing the compiler and highlight some of the challenges and requirements for applying language workbenches in industrial use cases.
Subject Level Differential Privacy with Hierarchical Gradient Averaging
Subject Level Differential Privacy (DP) is a granularity of privacy recently studied in the Federated Learning (FL) setting, where a subject is defined as an individual whose private data is embodied by multiple data records that may be distributed across a multitude of federation users. This granularity is distinct from item level and user level privacy appearing in the literature. Prior work on subject level privacy in FL focuses on algorithms that are derivatives of group DP or enforce user level Local DP (LDP). In this paper, we present a new algorithm – Hierarchical Gradient Averaging (HiGradAvgDP) – that achieves subject level DP by constraining the effect of individual subjects on the federated model. We prove the privacy guarantee for HiGradAvgDP and empirically demonstrate its effectiveness in preserving model utility on the FEMNIST and Shakespeare datasets. We also report, for the first time, a unique problem of privacy loss composition, which we call horizontal composition, that is relevant only to subject level DP in FL. We show how horizontal composition can adversely affect model utility by either increasing the noise necessary to achieve the DP guarantee, or by constraining the amount of training done on the model.
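A heavily simplified numpy sketch of the hierarchical-averaging idea (not the paper's full algorithm): per-record gradients are first averaged within each subject, so every subject contributes a single clipped vector before calibrated noise is added. The clip norm, noise scale, and data are assumptions.

```python
# Simplified sketch of hierarchical gradient averaging for subject-level DP.
# Not the paper's algorithm; clip norm, noise scale, and data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
CLIP, NOISE_STD = 1.0, 0.5

# Per-record gradients in a mini-batch, grouped by the subject they belong to.
grads_by_subject = {
    "subject_A": [rng.normal(size=4) for _ in range(3)],   # 3 records from subject A
    "subject_B": [rng.normal(size=4) for _ in range(1)],
}

def clip(v, c=CLIP):
    return v * min(1.0, c / (np.linalg.norm(v) + 1e-12))

# 1) Average within each subject, so each subject contributes one vector.
subject_grads = [clip(np.mean(g, axis=0)) for g in grads_by_subject.values()]

# 2) Average across subjects and add calibrated noise before the model update.
update = np.mean(subject_grads, axis=0)
update += rng.normal(scale=NOISE_STD * CLIP / len(subject_grads), size=update.shape)
print(update)
```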
Private and Robust Federated Learning using Private Information Retrieval and Norm Bounding
Federated Learning (FL) is a distributed learning paradigm that enables mutually untrusting clients to collaboratively train a common machine learning model. Client data privacy is paramount in FL. At the same time, the model must be protected from poisoning attacks from adversarial clients. Existing solutions address these two problems in isolation. We present FedPerm, a new FL algorithm that addresses both these problems by combining norm bounding for model robustness with a novel intra-model parameter shuffling technique that amplifies data privacy by means of Private Information Retrieval (PIR) based techniques that permit cryptographic aggregation of clients’ model updates. The combination of these techniques helps the federation server constrain parameter updates from clients so as to curtail effects of model poisoning attacks by adversarial clients. We further present FedPerm’s unique hyperparameters that can be used effectively to trade off computation overheads with model utility. Our empirical evaluation on the MNIST dataset demonstrates FedPerm’s effectiveness over existing Differential Privacy (DP) enforcement solutions in FL.
Machine Learning in Java
An overview of Java and Machine Learning, covering why you might want to write ML applications in more structured languages, what ML tools are available in the Java ecosystem, and some of the recent preview features in the JDK which improve numerical performance.
Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-Level Static Analysis
JavaScript is one of the most dominant programming languages. However, despite its popularity, it is a challenging task to correctly understand the behaviors of JavaScript programs because of their highly dynamic nature. Researchers have developed various static analyzers that strive to conform to ECMA-262, the standard specification of JavaScript. Unfortunately, all the existing JavaScript static analyzers require manual updates for new language features. This problem has become more critical since 2015 because the JavaScript language itself rapidly evolves with a yearly release cadence and open development process. In this paper, we present JSAVER, the first tool that automatically derives JavaScript static analyzers from language specifications. The main idea of our approach is to extract a definitional interpreter from ECMA-262 and perform a meta-level static analysis with the extracted interpreter. A meta-level static analysis is a novel technique that indirectly analyzes programs by analyzing a definitional interpreter with the programs. We also describe how to indirectly configure abstract domains and analysis sensitivities in a meta-level static analysis. For evaluation, we derived a static analyzer from the latest ECMA-262 (ES12, 2021) using JSAVER. The derived analyzer soundly analyzed all applicable 18,556 official conformance tests with 99.0% of precision in 590 ms on average. In addition, we demonstrate the configurability and adaptability of JSAVER with several case studies.
ESEC/FSE'22 presentation: Automatically Deriving JavaScript Static Analyzers from Specifications using Meta-Level Static Analysis
JavaScript is one of the most dominant programming languages. However, despite its popularity, it is a challenging task to correctly understand the behaviors of JavaScript programs because of their highly dynamic nature. Researchers have developed various static analyzers that strive to conform to ECMA-262, the standard specification of JavaScript. Unfortunately, all the existing JavaScript static analyzers require manual updates for new language features. This problem has become more critical since 2015 because the JavaScript language itself rapidly evolves with a yearly release cadence and open development process. In this paper, we present JSAVER, the first tool that automatically derives JavaScript static analyzers from language specifications. The main idea of our approach is to extract a definitional interpreter from ECMA-262 and perform a meta-level static analysis with the extracted interpreter. A meta-level static analysis is a novel technique that indirectly analyzes programs by analyzing a definitional interpreter with the programs. We also describe how to indirectly configure abstract domains and analysis sensitivities in a meta-level static analysis. For evaluation, we derived a static analyzer from the latest ECMA-262 (ES12, 2021) using JSAVER. The derived analyzer soundly analyzed all applicable 18,556 official conformance tests with 99.0% of precision in 590 ms on average. In addition, we demonstrate the configurability and adaptability of JSAVER with several case studies.
Property Graph Support in Relational Database
Presentation to Data Community Conference Switzerland 2022 about the Property Graph feature in Oracle DB 23c.
Industrial Strength Static Detection for Cryptographic API Misuses
We describe our experience of building an industrial-strength cryptographic vulnerability detector, which aims to detect cryptographic API misuses in Java(TM). Based on the detection algorithms of CryptoGuard, we integrated the detection into the Oracle internal code scanning platform Parfait. The goal of the Parfait-based cryptographic vulnerability detection is to provide precise and scalable cryptographic code screening for large-scale industrial projects. We discuss the needs and challenges of static cryptographic vulnerability screening in an industrial environment.
Analysing Temporality in General-Domain Entailment Graphs
Entailment Graphs based on open relation extraction run the risk of learning spurious entailments (e.g. win against ⊨ lose to) from antonymous predications that are observed with the same entities referring to different times. Previous research has demonstrated the potential of using temporality as a signal to avoid learning these entailments in the sports domain. We investigate whether this extends to the general news domain. Our method introduces a temporal window that is set dynamically for each eventuality using a temporally-informed language model. We evaluate our models on a sports-specific dataset, and ANT – a novel general-domain dataset based on WordNet antonym pairs. We find that whilst it may be useful to reinterpret the Distributional Inclusion Hypothesis to include time for the sports news domain, this does not apply to the general news domain.
RASPunzel: A Novel RASP Solution
This document presents an overview of project RASPunzel. It highlights the approach of using an allowlist (instead of a deny list) and summarises the key advantages.
ML-SOCO: Machine Learning-Based Self-Optimizing Compiler Optimizations
Compiler optimizations often involve hand-crafted heuristics to guide the optimization process. These heuristics are designed to benefit the average program and are otherwise static or only customized by profiling information. We propose machine learning-based self-optimizing compiler optimizations (ML-SOCO), a novel approach for fitting optimizations in a dynamic compiler to a specific environment. ML-SOCO explores—at run time—the impact of optimization decisions and uses this data to train or update a machine learning model. Related work, which has primarily targeted static compilers, has already shown that machine learning can outperform human-crafted heuristics. Our approach is specifically tailored to dynamic compilation and uses concepts like deoptimization for transparently switching between generating data and performing machine learning decisions during compilation. We implemented ML-SOCO in the GraalVM compiler, which is one of the most highly optimizing Java compilers on the market. When evaluating ML-SOCO by replacing a loop peeling heuristic with a learned model we encountered multiple speedups larger than 30% in established benchmarks. Apart from improving performance, ML-SOCO can also be used to assist compiler engineers when improving heuristics for specific domains.
TruffleTaint: Polyglot Dynamic Taint Analysis on GraalVM
Dynamic taint analysis tracks the propagation of specific values while a program executes. To this end, a taint label is attached to these values and dynamically propagated to any values derived from them. Frequent application of this analysis technique in many fields has led to the development of general purpose analysis platforms with taint propagation capabilities. However, these platforms generally limit analysis developers to a specific implementation language, propagation semantics or taint label representation, and they provide no tooling support for analysis development. In this paper we present a language-agnostic approach for implementing a dynamic taint analysis independently of the analysis platform that it is executed on. We implemented this approach in TruffleTaint, a platform for taint propagation in multiple programming languages. We show how our approach enables TruffleTaint to provide analysis implementers with more control over the semantics and implementation language of their taint analysis than current analysis platforms and with a more capable development environment. We further show that our approach enables the development of both tooling infrastructure for taint analysis research and data-flow enabled tools for end-users.
Automatic Array Transformation to Columnar Storage at Run Time
Today’s huge memories make it possible to store and process large data structures in memory instead of in a database. Hence, accesses to this data should be optimized, which is normally relegated either to the runtimes and compilers or is left to the developers, who often lack the knowledge about optimization strategies. As arrays are often part of the language, developers frequently use them as an underlying storage mechanism. Thus, optimization of arrays may be vital to improve performance of data-intensive applications. While compilers can apply numerous optimizations to speed up accesses, it would also be beneficial to adapt the actual layout of the data in memory to improve cache utilization. However, runtimes and compilers typically do not perform such memory layout optimizations. In this work, we present an approach to dynamically perform memory layout optimizations on arrays of objects to transform them into a columnar memory layout, a storage layout frequently used in analytical applications that enables faster processing of read-intensive workloads. By integration into a state-of-the-art JavaScript runtime, our approach can speed up queries for large workloads by up to 7x, where the initial transformation overhead is amortized over time.
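The layout change itself is easy to picture; the plain-Python sketch below shows the row-to-columnar transformation on toy data (the paper's contribution is performing it transparently and dynamically inside the JavaScript runtime).

```python
# Plain-Python picture of the array-of-objects -> columnar (struct-of-arrays) change.
# The paper performs this transparently inside the JavaScript runtime at run time.
rows = [
    {"price": 10.0, "quantity": 3, "flag": "A"},
    {"price": 12.5, "quantity": 1, "flag": "B"},
    {"price":  7.0, "quantity": 4, "flag": "A"},
]

# Columnar layout: one contiguous array per property.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A read-intensive query now scans a single dense column instead of every object,
# which is what makes the columnar layout cache friendly.
revenue = sum(p * q for p, q in zip(columns["price"], columns["quantity"]))
print(columns["price"], revenue)
```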
Efficient Property Projections of Graph Queries over Relational Data
Specialized graph data management systems have made significant advances in storing and analyzing graph-structured data. However, a large fraction of the data of interest still resides in relational database systems (RDBMS) due to their maturity and for security reasons. Recent studies, in view of composability, show that the execution of graph queries over relational databases (i.e., a graph layer on top of an RDBMS) can provide competitive performance compared to specialized graph databases. While using the standard property graph model for graph querying, one of the main bottlenecks for efficient query processing, under memory constraints, is property projections, i.e., projecting properties of nodes along paths matching a given pattern. This is because graph queries produce a large number of matching paths, resulting in many requests to the data storage, or a large memory footprint, to access their properties. In this paper, we propose a set of novel techniques exploiting the inherent structure of the graph (a graph projection cache manager) to provide efficient property projections. The controlled memory footprint of our solution makes it practical in multi-tenant database deployments. The empirical results on a social graph show that our solution reduces the number of accesses to the data storage by more than an order of magnitude, resulting in graph queries being up to 3.1X faster than the baseline.
Proof Engineering with Predicate Transformer Semantics
We present a lightweight, open source Agda framework for manually verifying effectful programs using predicate transformer semantics. We represent the abstract syntax trees (AST) of effectful programs with a generalized algebraic datatype (GADT) AST, whose generality enables even complex operations to be primitive AST nodes. Users can then assign bespoke predicate transformers to such operations to aid the proof effort, for example by automatically decomposing proof obligations for branching code. Our framework codifies and generalizes a proof engineering methodology used by the authors to reason about a prototype implementation of LibraBFT, a Byzantine fault tolerant consensus protocol in which code executed by participants may have effects such as updating state and sending messages. Successful use of our framework in this context demonstrates its practical applicability.
FedPerm: Private and Robust Federated Learning by Parameter Permutation
Federated Learning (FL) is a distributed learning paradigm that enables mutually untrusting clients to collaboratively train a common machine learning model. Client data privacy is paramount in FL. At the same time, the model must be protected from poisoning attacks from adversarial clients. Existing solutions address these two problems in isolation. We present FedPerm, a new FL algorithm that addresses both these problems by combining a novel intra-model parameter shuffling technique that amplifies data privacy, with Private Information Retrieval (PIR) based techniques that permit cryptographic aggregation of clients’ model updates. The combination of these techniques further helps the federation server constrain parameter updates from clients so as to curtail effects of model poisoning attacks by adversarial clients. We further present FedPerm’s unique hyperparameters that can be used effectively to trade off computation overheads with model utility. Our empirical evaluation on the MNIST dataset demonstrates FedPerm’s effectiveness over existing Differential Privacy (DP) enforcement solutions in FL.
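A toy numpy sketch of the intra-model parameter-shuffling ingredient only (the PIR-based cryptographic aggregation and the norm bounding are omitted): clients permute their flattened update with a shared secret seed, the server averages the permuted vectors, and clients invert the permutation locally.

```python
# Toy sketch of intra-model parameter shuffling (one FedPerm ingredient);
# the PIR-based cryptographic aggregation and norm bounding are omitted.
import numpy as np

SHARED_SEED = 1234                      # known to clients, hidden from the server
perm = np.random.default_rng(SHARED_SEED).permutation(6)
inv_perm = np.argsort(perm)

def client_shuffle(update):
    return update[perm]                 # server sees parameters in permuted order

def client_unshuffle(aggregate):
    return aggregate[inv_perm]

client_updates = [np.arange(6, dtype=float), np.arange(6, dtype=float) * 2]
server_view = [client_shuffle(u) for u in client_updates]
aggregate = np.mean(server_view, axis=0)   # server aggregates without knowing positions
print(client_unshuffle(aggregate))          # clients recover the usable averaged update
```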
N-1 Experts: Unsupervised Anomaly Detection Model Selection
Manually finding the best combination of machine learning training algorithm, model and hyper-parameters can be challenging. In supervised settings, this burden has been alleviated with the introduction of automated machine learning (AutoML) methods. However, similar methods are noticeably absent for fully unsupervised applications, such as anomaly detection. We introduce one of the first such methods, N-1 Experts, which we compare to a recent state-of-the-art baseline, MetaOD, and show favourable performance.
Experimental Procedures for Exploiting Structure in AutoML Loss Landscapes
Recent observations regarding the structural simplicity of algorithm configuration landscapes have spurred the development of new configurators that obtain provably and empirically better performance. Inspired by these observations, we recently performed a similar analysis of AutoML Loss Landscapes – that is, the relationship between hyper-parameter configurations and machine learning model performance. In this study, we propose two new variations of an existing, state-of-the-art hyper-parameter configuration procedure. We designed each method to exploit a specific property that we observed common among most AutoML loss landscapes; however, we demonstrate that neither are competitive with existing baselines. In light of this result, we construct artificial algorithm configuration scenarios that allow us to show when the two new methods can be expected to outperform their baselines and when they cannot, thereby providing additional insights into AutoML loss landscapes.
Distinct Value Estimation from a Sample: Statistical Methods vs. Machine Learning
Estimating the number of distinct values (NDV) in a dataset is an important operation in modern database systems for many tasks, including query optimization. In large-scale systems, tables often contain billions of rows and wrong optimizer decisions can cause severe deterioration in query performance. Additionally, in many situations, such as having large tables or NDV estimation after the application of filters, it is not feasible to scan the entire dataset to compute the number of distinct values. In such cases, the only available option is to use a dataset sample to estimate the NDV. This, however, is not trivial as data properties of the sample usually do not mirror the properties of the full dataset. Approaches in related work have shown that this kind of estimation is connected to large errors. In this paper, we present two novel approaches for the problem of estimating the number of distinct values from a dataset sample. Our first approach presents a novel statistical estimator that shows good and robust results across a broad range of datasets. The second approach is based on Machine Learning (ML), hence being the first time that ML is applied to this problem. Both approaches outperform the state-of-the-art, with the ML approach reducing the average error by 3x for real-world datasets. Beyond pure prediction quality, both our approaches have their own set of advantages and disadvantages, and we show that the right approach actually depends on the specific application scenario.
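For orientation, here is one classical sample-based baseline, the GEE estimator (a well-known reference point, not one of the paper's two new approaches): it scales the number of values seen exactly once by the square root of the sampling ratio's inverse and adds the values seen more often.

```python
# Classical GEE baseline for estimating distinct values (NDV) from a sample.
# This is a standard reference estimator, not the paper's statistical or ML approach.
import math
from collections import Counter

def gee_ndv(sample, population_size):
    """sample: list of sampled values; population_size: total number of rows N."""
    n = len(sample)
    freq = Counter(Counter(sample).values())      # f_i = #values occurring exactly i times
    f1 = freq.get(1, 0)
    rest = sum(count for i, count in freq.items() if i >= 2)
    return math.sqrt(population_size / n) * f1 + rest

sample = ["a", "b", "a", "c", "d", "d", "e"]      # made-up 7-row sample
print(gee_ndv(sample, population_size=1_000_000))
```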
Pruning Networks During Training via Auxiliary Parameters
Neural networks have perennially been limited by the physical constraints of implementation on real hardware, and the desire for improved accuracy often drives the model size to the breaking point. The task of reducing the size of a neural network, whether to meet memory constraints, inference-time speed, or generalization capabilities, is therefore well-studied. In this work, we present an extremely simple scheme to reduce model size during training, by introducing auxiliary parameters to the inputs of each layer of the neural network, and a regularization penalty that encourages the network to eliminate unnecessary variables from the computation graph. Though related to many prior works, this scheme offers several advantages: it is extremely simple to implement; the network eliminates unnecessary variables as part of training, without requiring any back-and-forth between training and pruning; and it dramatically reduces the number of parameters in the networks while maintaining high accuracy.
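A compact PyTorch sketch of the scheme as described in the abstract (per-input auxiliary gate parameters plus an L1 penalty that pushes unnecessary gates toward zero); the layer sizes, penalty weight, and data are assumptions, not the paper's configuration.

```python
# Sketch of pruning-during-training with auxiliary gate parameters and an L1 penalty.
# Layer sizes, penalty weight, and data are placeholders.
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(in_features))   # auxiliary parameter per input
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.linear(x * self.gate)                    # gated inputs feed the layer

model = nn.Sequential(GatedLinear(16, 32), nn.ReLU(), GatedLinear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3

x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
for _ in range(100):
    loss = nn.functional.cross_entropy(model(x), y)
    # The penalty encourages the network to zero out gates for unneeded inputs,
    # pruning those variables from the computation graph during training.
    loss = loss + l1_weight * sum(m.gate.abs().sum() for m in model if isinstance(m, GatedLinear))
    opt.zero_grad(); loss.backward(); opt.step()

print((model[0].gate.abs() < 1e-2).sum().item(), "of 16 inputs effectively pruned")
```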
Subject Membership Inference Attacks in Federated Learning
Privacy in Federated Learning (FL) is studied at two different granularities - item-level, which protects individual data points, and user-level, which protects each user (participant) in the federation. Nearly all of the private FL literature is dedicated to the study of privacy attacks and defenses alike at these two granularities. More recently, subject-level privacy has emerged as an alternative privacy granularity to protect the privacy of individuals whose data is spread across multiple (organizational) users in cross-silo FL settings. However, the research community lacks a good understanding of the practicality of this threat, as well as various factors that may influence subject-level privacy. A systematic study of these patterns requires complete control over the federation, which is impossible with real-world datasets. We design a simulator for generating various synthetic federation configurations, enabling us to study how properties of the data, model design and training, and the federation itself impact subject privacy risk. We propose three inference attacks for subject-level privacy and examine the interplay between all factors within a federation. Our takeaways generalize to real-world datasets like FEMNIST, giving credence to our findings.
Internship PFE Report by Omar Heddi
The following is a report about the internship I did at Oracle Corporation, during which I worked on my graduation project as part of the program organized by the National School of Applied Science of Fez. In this document, we will get to learn about Oracle, graph databases, and the solution Oracle provides for working with graph databases, namely PGX. More importantly, this report summarizes the work I did to improve the Oracle PGX compiler, as well as the challenges I faced and the new things I learned.
Synthesis of Java Deserialisation Filters from Examples (Presentation Slides)
Java natively supports serialisation and deserialisation, features that are necessary to enable distributed systems to exchange Java objects. Deserialisation of data from malicious sources can lead to security exploits including remote code execution because by default Java does not validate deserialised data. In the absence of validation, a carefully crafted payload can trigger arbitrary functionality. The state-of-the-art general mitigation strategy for deserialisation exploits in Java is deserialisation filtering that validates the contents of an object input stream before the object is deserialised using user-provided filters. In this paper we describe a novel technique called ds-prefix for automatic synthesis of deserialisation filters (as regular expressions) from examples. We focus on synthesis of allowlists (permitted behaviours) as they provide a better level of security. Ds-prefix is based on deserialisation heuristics and specifically targets synthesis of deserialisation allowlists. We evaluate our approach by executing ds-prefix on popular open-source systems and show that ds-prefix can produce filters preventing real CVEs using a small number of training examples. We also compare our approach with other synthesis tools which demonstrates that ds-prefix outperforms existing tools and achieves better F1-score.
Synthesis of Java Deserialisation Filters from Examples (Conference Video)
Java natively supports serialisation and deserialisation, features that are necessary to enable distributed systems to exchange Java objects. Deserialisation of data from malicious sources can lead to security exploits including remote code execution because by default Java does not validate deserialised data. In the absence of validation, a carefully crafted payload can trigger arbitrary functionality. The state-of-the-art general mitigation strategy for deserialisation exploits in Java is deserialisation filtering that validates the contents of an object input stream before the object is deserialised using user-provided filters. In this paper we describe a novel technique called ds-prefix for automatic synthesis of deserialisation filters (as regular expressions) from examples. We focus on synthesis of allowlists (permitted behaviours) as they provide a better level of security. Ds-prefix is based on deserialisation heuristics and specifically targets synthesis of deserialisation allowlists. We evaluate our approach by executing ds-prefix on popular open-source systems and show that ds-prefix can produce filters preventing real CVEs using a small number of training examples. We also compare our approach with other synthesis tools which demonstrates that ds-prefix outperforms existing tools and achieves better precision.
Synthesis of Java Deserialisation Filters from Examples
Java natively supports serialisation and deserialisation, features that are necessary to enable distributed systems to exchange Java objects. Deserialisation of data from malicious sources can lead to security exploits including remote code execution because by default Java does not validate deserialised data. In the absence of validation, a carefully crafted payload can trigger arbitrary functionality. The state-of-the-art general mitigation strategy for deserialisation exploits in Java is deserialisation filtering that validates the contents of an object input stream before the object is deserialised using user-provided filters. In this paper we describe a novel technique called ds-prefix for automatic synthesis of deserialisation filters (as regular expressions) from examples. We focus on synthesis of allowlists (permitted behaviours) as they provide a better level of security. Ds-prefix is based on deserialisation heuristics and specifically targets synthesis of deserialisation allowlists. We evaluate our approach by executing ds-prefix on popular open-source systems and show that ds-prefix can produce filters preventing real CVEs using a small number of training examples. We also compare our approach with other synthesis tools which demonstrates that ds-prefix outperforms existing tools and achieves better precision.
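A toy illustration of prefix-based allowlist synthesis (not the actual ds-prefix algorithm or its deserialisation heuristics): from positive example class names, derive shared package prefixes and turn them into an allowlist regular expression. The example class names are made up.

```python
# Toy sketch of synthesising a deserialisation allowlist regex from positive examples.
# Not the ds-prefix algorithm; it only shows the flavour of prefix-based generalisation.
import re

allowed_examples = [
    "com.example.billing.Invoice",
    "com.example.billing.LineItem",
    "com.example.users.Account",
]

# Generalise each example to its package prefix (everything before the class name).
prefixes = sorted({name.rsplit(".", 1)[0] + "." for name in allowed_examples})
pattern = "|".join(re.escape(p) + r"[^.]+" for p in prefixes)
allowlist = re.compile(f"^({pattern})$")

print(allowlist.pattern)
print(bool(allowlist.match("com.example.billing.Refund")))   # accepted: same package
print(bool(allowlist.match("org.evil.gadget.Exploit")))      # rejected: not allowlisted
```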
ONNX and the JVM
Integrating machine learning into enterprises requires building and deploying ML models in the environments enterprises build their software in. Frequently this is in Java, or another language running on the JVM. In this talk we'll cover some of our recent work bringing the ONNX ecosystem to Java. We'll discuss uses of ONNX Runtime from Java, and also our work writing model converters from our Java ML library into ONNX format.
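As a rough illustration of what using ONNX Runtime from Java looks like, the sketch below loads an exported model and runs one inference with the ai.onnxruntime API; the model path, the input name "input", and the tensor shape are assumptions that depend on how the model was exported.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Collections;

public class OnnxScoring {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // "model.onnx" and the input name "input" are placeholders for a real exported model.
        try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions())) {
            float[][] features = {{0.1f, 0.2f, 0.3f, 0.4f}};
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, features);
                 OrtSession.Result result = session.run(Collections.singletonMap("input", tensor))) {
                float[][] scores = (float[][]) result.get(0).getValue();
                System.out.println("score[0][0] = " + scores[0][0]);
            }
        }
    }
}
```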
Experience: Model-Based, Feedback-Driven, Greybox Web Fuzzing with BackREST
Slides for the corresponding ECOOP 2022 paper.
Subject Granular Differential Privacy in Federated Learning
This paper introduces subject granular privacy in the Federated Learning (FL) setting, where a subject is an individual whose private information is embodied by several data items either confined within a single federation user or distributed across multiple federation users. We formally define the notion of subject level differential privacy for FL. We propose three new algorithms that enforce subject level DP. Two of these algorithms are based on notions of user level local differential privacy (LDP) and group differential privacy respectively. The third algorithm is based on a novel idea of hierarchical gradient averaging (HiGradAvgDP) for subjects participating in a training mini-batch. We also introduce horizontal composition of privacy loss for a subject across multiple federation users. We show that horizontal composition is equivalent to sequential composition in the worst case. We prove the subject level DP guarantee for all our algorithms and empirically analyze them using the FEMNIST and Shakespeare datasets. Our evaluation shows that, of our three algorithms, HiGradAvgDP delivers the best model performance, approaching that of a model trained using a DP-SGD based algorithm that provides a weaker item level privacy guarantee.
Distinct Value Estimation from a Sample: Statistical Methods vs. Machine Learning
Estimating the number of distinct values (NDV) in a dataset is an important operation in modern database systems for many tasks, including query optimization. In large-scale systems, tables often contain billions of rows and wrong optimizer decisions can cause severe deterioration in query performance. Additionally, in many situations, such as having large tables or NDV estimation after the application of filters, it is not feasible to scan the entire dataset to compute the number of distinct values. In such cases, the only available option is to use a dataset sample to estimate the NDV. This, however, is not trivial as data properties of the sample usually do not mirror the properties of the full dataset. Approaches in related work have shown that this kind of estimation is prone to large errors. In this paper, we present two novel approaches for the problem of estimating the number of distinct values from a dataset sample. Our first approach presents a novel statistical estimator that shows good and robust results across a broad range of datasets. The second approach is based on Machine Learning (ML), making it the first time that ML is applied to this problem. Both approaches outperform the state-of-the-art, with the ML approach reducing the average error by 3x for real-world datasets. Beyond pure prediction quality, both our approaches have their own set of advantages and disadvantages, and we show that the right approach actually depends on the specific application scenario.
Automatic Root Cause Quantification for Missing Edges in JavaScript Call Graphs
Building sound and precise static call graphs for real-world JavaScript applications poses an enormous challenge, due to many hard-to-analyze language features. Further, the relative importance of these features may vary depending on the call graph algorithm being used and the class of applications being analyzed. In this paper, we present a technique to automatically quantify the relative importance of different root causes of call graph unsoundness for a set of target applications. The technique works by identifying the dynamic function data flows relevant to each call edge missed by the static analysis, correctly handling cases with multiple root causes and inter-dependent calls. We apply our approach to perform a detailed study of the recall of a state-of-the-art call graph construction technique on a set of framework-based web applications. The study yielded a number of useful insights. We found that while dynamic property accesses were the most common root cause of missed edges across the benchmarks, other root causes varied in importance depending on the benchmark, potentially useful information for an analysis designer. Further, with our approach, we could quickly identify and fix a recall issue in the call graph builder we studied, and also quickly assess whether a recent analysis technique for Node.js-based applications would be helpful for browser-based code. All of our code and data is publicly available, and many components of our technique can be re-used to facilitate future studies.
Experience: Model-Based, Feedback-Driven, Greybox Web Fuzzing with BackREST
Following the advent of the American Fuzzy Lop (AFL), fuzzing had a surge in popularity, and modern-day fuzzers range from simple blackbox random input generators to complex whitebox concolic frameworks that are capable of deep program introspection. Web application fuzzers, however, did not benefit from the tremendous advancements in fuzzing for binary programs and remain largely blackbox in nature. In this experience paper, we show how techniques like state-aware crawling, type inference, coverage and taint analysis can be integrated with a black-box fuzzer to find more critical vulnerabilities, faster (speedups between 7.4x and 25.9x). Comparing BackREST against three other web fuzzers on five large (>500 KLOC) Node.js applications shows how it consistently achieves comparable coverage while reporting more vulnerabilities than the state of the art. Finally, using BackREST, we uncovered eight 0-days, out of which six were not reported by any other fuzzer. All the 0-days have been disclosed and most are now public, including two in the highly popular Sequelize and Mongodb libraries.
Anomaly Detection for Cybersecurity and the Need for Explainable AI
Machine learning is increasingly applied in the cybersecurity domain in order to build solutions capable of protecting against attacks that escape rule-based systems. Attacks are nowadays constantly evolving, since adversaries are always creating new approaches or tweaking existing ones: it is thus not possible to rely exclusively on supervised techniques. This talk will focus on the role of anomaly detection techniques in real-world security applications, and how explainability is necessary in order to translate the anomalies detected by the system into actionable events.
AutoML Loss Landscapes
As interest in machine learning and its applications continues to increase, how to choose the best models and hyper-parameter settings becomes more important. This problem is known to be challenging for human experts, and consequently, a growing number of methods have been proposed for solving it, giving rise to the area of automated machine learning (AutoML). Many of the most popular AutoML methods are based on Bayesian optimization, which makes only weak assumptions about how modifying hyper-parameters affects the loss of a model. This is a safe assumption that yields robust methods, as the AutoML loss landscapes that relate hyper-parameter settings to loss are poorly understood. We build on recent work on the study of one-dimensional slices of algorithm configuration landscapes by introducing new methods that test n-dimensional landscapes for statistical deviations from uni-modality and convexity, and we use them to show that a diverse set of AutoML loss landscapes are highly structured. We introduce a method for assessing the significance of hyper-parameter partial derivatives, which reveals that most (but not all) AutoML loss landscapes have only a small number of hyper-parameters that interact strongly. To further assess hyper-parameter interactions, we introduce a simplistic optimization procedure that assumes each hyper-parameter can be optimized independently, a single time in sequence, and we show that it obtains configurations that are statistically tied with optimal in all of the n-dimensional AutoML loss landscapes that we studied. Our results suggest many possible new directions for substantially improving the state of the art in AutoML.
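A minimal sketch of the sequential procedure mentioned above, under the assumption of a generic loss function and per-hyper-parameter candidate grids (both placeholders): each hyper-parameter is optimized independently, a single time, in a fixed order, while the others are held at their current values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Sketch of a one-pass, per-hyper-parameter search: tune each hyper-parameter once,
 *  in sequence, holding the others fixed. The loss function and grids are placeholders. */
public class SequentialHyperparameterSearch {
    public static Map<String, Double> optimize(Map<String, List<Double>> grids,
                                                Map<String, Double> defaults,
                                                ToDoubleFunction<Map<String, Double>> loss) {
        Map<String, Double> config = new LinkedHashMap<>(defaults);
        for (Map.Entry<String, List<Double>> entry : grids.entrySet()) {
            String param = entry.getKey();
            double bestValue = config.get(param);
            double bestLoss = loss.applyAsDouble(config);
            for (double candidate : entry.getValue()) {
                config.put(param, candidate);
                double l = loss.applyAsDouble(config);
                if (l < bestLoss) {
                    bestLoss = l;
                    bestValue = candidate;
                }
            }
            config.put(param, bestValue); // fix this hyper-parameter and move on
        }
        return config;
    }
}
```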
Towards Formal Verification of HotStuff-based Byzantine Fault Tolerant Consensus in Agda
LibraBFT is a Byzantine Fault Tolerant (BFT) consensus protocol based on HotStuff. We present an abstract model of the protocol underlying HotStuff / LibraBFT, and formal, machine-checked proofs of their core correctness (safety) property and an extended condition that enables non-participating parties to verify committed results. (Liveness properties would be proved for specific implementations, not for the abstract model presented in this paper.) A key contribution is precisely defining assumptions about the behavior of honest peers, in an abstract way, independent of any particular implementation. Therefore, our work is an important step towards proving correctness of an entire class of concrete implementations, without repeating the hard work of proving correctness of the underlying protocol. The abstract proofs are for a single configuration (epoch); extending these proofs across configuration changes is future work. Our models and proofs are expressed in Agda, and are available in open source.
Runtime Prevention of Deserialization Attacks
Untrusted deserialization exploits, where a serialised object graph is used to achieve denial-of-service or arbitrary code execution, have become so prominent that they were introduced in the 2017 OWASP Top 10. In this paper, we present a novel and lightweight approach for runtime prevention of deserialization attacks using Markov chains. The intuition behind our work is that the features and ordering of classes in malicious object graphs make them distinguishable from benign ones. Preliminary results indeed show that our approach achieves an F1-score of 0.94 on a dataset of 264 serialised payloads, collected from an industrial Java EE application server and a repository of deserialization exploits.
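To make the intuition concrete, here is a minimal sketch (not the paper's implementation) of scoring the class sequence of an incoming object graph against transition log-probabilities learned from benign streams; the smoothing constant, class names, and threshold are illustrative only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: score the class sequence of an incoming object graph with a Markov chain
 *  trained on benign streams; low average log-likelihood suggests a malicious payload. */
public class DeserializationChainScorer {
    private final Map<String, Map<String, Double>> transitionLogProb = new HashMap<>();
    private static final double UNSEEN_LOG_PROB = Math.log(1e-6); // smoothing for unseen transitions

    public void addTransition(String from, String to, double logProb) {
        transitionLogProb.computeIfAbsent(from, k -> new HashMap<>()).put(to, logProb);
    }

    /** Average per-transition log-likelihood of the observed class sequence. */
    public double score(List<String> classSequence) {
        double total = 0.0;
        for (int i = 1; i < classSequence.size(); i++) {
            total += transitionLogProb
                    .getOrDefault(classSequence.get(i - 1), Map.of())
                    .getOrDefault(classSequence.get(i), UNSEEN_LOG_PROB);
        }
        return classSequence.size() > 1 ? total / (classSequence.size() - 1) : 0.0;
    }

    public boolean looksMalicious(List<String> classSequence, double threshold) {
        return score(classSequence) < threshold;
    }
}
```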
Towards formal verification of HotStuff-based BFT consensus in Agda
LibraBFT is a Byzantine Fault Tolerant (BFT) consensus protocol based on HotStuff. We present an abstract model of the protocol underlying HotStuff/LibraBFT, and formal, machine-checked proofs of their core correctness (safety) property and an extended condition that enables non-participating parties to verify committed results. (Liveness properties would be proved for specific implementations, not for the abstract model presented in this paper.) A key contribution is precisely defining assumptions about the behavior of honest peers, in an abstract way, independent of any particular implementation. Therefore, our work is an important step towards proving correctness of an entire class of concrete implementations, without repeating the hard work of proving correctness of the underlying protocol. The abstract proofs are for a single configuration (epoch); extending these proofs across configuration changes is future work. Our models and proofs are expressed in Agda, and are available in open source.
An approach to translating Haskell programs to Agda and reasoning about them
We are using the Agda programming language and proof assistant to formally verify correctness of a Byzantine Fault Tolerant consensus implementation based on HotStuff / DiemBFT. The Agda implementation is a translation of our Haskell implementation, which is based on DiemBFT. This short paper focuses on one aspect of this work. We have developed a library that enables the translated Agda implementation to closely mirror the Haskell code on which it is based, making review and maintenance easier and more efficient, and reducing the risk of translation errors. We also explain how we assign semantics to the syntactic features provided by our library, thus enabling formal reasoning about programs that use them; details of how we reason about the resulting Agda implementation will be presented in a future paper. The library we present is independent of our particular verification project, and is available in open source for others to use and extend.
Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models
A few large, homogeneous, pre-trained models undergird many machine learning systems — and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: the theory that social biases (such as stereotypes) internalized by large language models during pre-training transfer into harmful task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.
Upstream Mitigation Is Not All You Need
A few large, homogeneous pre-trained models undergird many machine learning systems — and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis, the possibility that social biases (such as stereotypes) internalized by large language models during pre-training could also affect task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.
Oracle Cloud Advanced ML Prognostics Innovations for Enterprise Computing Servers
Oracle has a portfolio of Machine Learning (ML) offerings for monitoring time-series telemetry signals for anomaly detection. The product suite is called the Multivariate State Estimation Technique (MSET2), which integrates an advanced prognostic pattern recognition technique with a collection of intelligent data preprocessing (IDP) innovations for high-sensitivity prognostic applications. One important application is monitoring dynamic computer power and catching the early incipience of mechanisms that cause servers to fail, using server telemetry signals. Telemetry signals in computing servers typically include many physical variables (e.g., voltages, currents, temperatures, fan speeds, and power levels) that correlate with system IO traffic, memory utilization, and system throughput. By utilizing the telemetry signals, MSET2 improves power efficiency by monitoring, reporting and forecasting energy consumption, cooling requirements and load utilization of servers. However, the common challenge in the computing server industry is that telemetry signals are never perfect. For example, enterprise-class servers have disparate sampling rates and are often not synchronized in time, resulting in a lead-lag phase change among the various signals. In addition, the enterprise computing industry often uses 8-bit A/D conversion chips for physical sensors. This makes it difficult to discern small variations in the physical variables that are severely quantized because of the use of low-resolution chips. Moreover, missing values often exist in the streaming telemetry signals, which can be caused by a saturated system bus or data transmission errors. This paper describes some features of key IDP algorithms for optimal ML solutions to the aforementioned challenges across the enterprise computing industry. It assures optimal ML performance for prognostics, optimal energy efficiency of Enterprise Servers, and streaming analytics.
Temporality in General-Domain Entailment Graph Induction
Entailment Graphs based on open relation extraction run the risk of learning spurious entailments (e.g. win against ⊨ lose to) from antonymous predications that are observed with the same entities referring to different times. Previous research has demonstrated the potential of using temporality as a signal to avoid learning these entailments in the sports domain. We investigate whether this extends to the general news domain. Our method introduces a temporal window that is set dynamically for each eventuality using a temporally informed language model. We evaluate our models on a sports-specific dataset, and ANT – a novel general-domain dataset based on WordNet antonym pairs. We find that whilst it may be useful to reinterpret the Distributional Inclusion Hypothesis to include time for the sports news domain, this does not apply to the general news domain.
Challenges in adopting Machine Learning for Cybersecurity
Machine learning can be a powerful ally in fighting Cybercrime provided that few challenges in its application can be solved. The Keybridge team at Oracle Labs has experience with developing ML solutions for security use cases. In this talk we would like to share those experiences and discuss three challenges - selecting an ML model, handling of input data (specifically system logs) and transferring to security teams. In the latter challenge, we are particularly interested in bridging the two-way gap in understanding between security teams and ML practitioners.
Runtime Prevention of Deserialization Attacks
Untrusted deserialization exploits, where a serialised object graph is used to achieve denial-of-service or arbitrary code execution, have become so prominent that they were introduced in the 2017 OWASP Top 10. In this paper, we present a novel and lightweight approach for runtime prevention of deserialization attacks using Markov chains. The intuition behind our work is that the features and ordering of classes in malicious object graphs make them distinguishable from benign ones. Preliminary results indeed show that our approach achieves an F1-score of 0.94 on a dataset of 264 serialised payloads, collected from an industrial Java EE application server and a repository of deserialization exploits.
Constant Blinding on GraalVM
With the advent of JIT-compilers, code-injection attacks have seen a revival in the form of JIT-spraying. JIT-spraying enables an attacker to inject gadgets into executable memory, effectively bypassing W^X and ASLR. In response to JIT-spraying, constant blinding has emerged as a conceptually simple and performance friendly defense. Unfortunately, a number of increasingly sophisticated attacks has pinpointed the shortcomings of existing constant blinding implementations. In this paper, we present our constant blinding implementation for the GraalVM, taking into account the insights from the last decade regarding the security of constant blinding. We discuss important design decisions and tradeoffs as well as the practical implementation issues encountered when implementing constant blinding for GraalVM. We evaluate the performance impact of our implementation with different configurations and demonstrate its effectiveness by fuzzing for unblinded constants.
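For readers unfamiliar with the defense, the following sketch illustrates the basic idea of XOR-based constant blinding in plain Java; it is a conceptual model only, not the GraalVM implementation, and the constants are made up.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Conceptual sketch of XOR-based constant blinding: instead of embedding an
 *  attacker-chosen immediate directly in the generated code, emit the blinded
 *  value and recover the original at run time with one extra XOR. */
public class ConstantBlindingSketch {
    static final class BlindedConstant {
        final long blinded; // value that ends up as an immediate in the emitted code
        final long key;     // random key emitted alongside it

        BlindedConstant(long constant) {
            this.key = ThreadLocalRandom.current().nextLong();
            this.blinded = constant ^ key;
        }

        long materialize() {
            // corresponds to the extra XOR instruction the compiler emits
            return blinded ^ key;
        }
    }

    public static void main(String[] args) {
        long attackerControlled = 0x9090909090909090L; // e.g. a NOP-sled-like immediate
        BlindedConstant c = new BlindedConstant(attackerControlled);
        System.out.printf("emitted immediate: %016x, recovered: %016x%n", c.blinded, c.materialize());
    }
}
```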
"Static Java": The GraalVM Native Image Programming Model
In this talk we will present our vision for “Static Java”: the programming model enabled by GraalVM Native Image. Applications are initialized at image build time, to allow fast startup time and low memory footprint at run time. Counterintuitively, the ahead-of-time compilation of Java bytecode to machine code is not part of the programming model. But since it is an important implementation detail, we will also talk about the benefits and problems of ahead-of-time compilation. We will show where static analysis helps, what the limitations of static analysis are, which compiler optimizations work well both for JIT and AOT compilation, and where additional compiler phases for AOT compilation are necessary.
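To make the build-time initialization idea concrete, here is a small hedged example: a class whose static initializer is intended to run during the image build when the application is built with the native-image option --initialize-at-build-time. The option is real; the class, property values, and command line are illustrative only.

```java
/** Illustrative class whose static initializer runs at image build time when built with, e.g.:
 *    native-image --initialize-at-build-time=com.example.Config -cp app.jar com.example.Main
 *  The heap produced by the initializer is snapshotted into the image, so the work
 *  is not repeated at run time. Names and values are examples only. */
public class Config {
    // Computed once during the image build; available immediately at run time.
    static final java.util.Properties SETTINGS = load();

    private static java.util.Properties load() {
        java.util.Properties p = new java.util.Properties();
        p.setProperty("mode", "production");
        return p;
    }
}
```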
GraalVM: State of AArch64
While always the de facto choice of the mobile domain, recently machines using Arm's AArch64 ISA have also become prevalent within the laptop, desktop, and server marketplaces. Because of this, it is imperative for the GraalVM ecosystem to not only perform well on AArch64, but to treat AArch64 as an equal peer of AMD64. In my talk, I will give an overview of the current state of GraalVM on AArch64. This includes (i) describing the work involved in creating the GraalVM AArch64 port, (ii) providing an overview of current GraalVM AArch64 features, (iii) explaining the code architecture of the AArch64 backend and how to navigate it, and (iv) presenting some current performance numbers on AArch64. Beyond this overview, I also plan to discuss in detail some of the main challenges in getting AArch64 running on GraalVM, such as adding patching support, abiding by the Java Memory Model, and utilizing AArch64's different addressing modes and branch instructions. I'll also present some of our future plans for the continued improvement of the AArch64 backend.
Toward Just-in-time and Language-agnostic Mutation Testing
Mutation Testing is a popular approach to determine the quality of a suite of unit tests. It is based on the idea that introducing faults into a system-under-test (SUT) should cause tests to fail, otherwise, the test suite might be of insufficient quality. In the language of mutation testing, such a fault is referred to as "mutation", and an instance of the SUT's code that contains the mutation is referred to as "mutant". Mutation testing is computationally expensive and time-consuming. Reasons for this include, for example, a high number of mutations to consider, interrelations between these mutations, and mutant-associated costs such as the cost of mutant creation or the cost of checking whether any tests fail in response. Furthermore, implementing a reliable tool for automatic mutation testing is a significant effort for any language. As a result, mutation testing is only available for some languages. Present mutation tools often rely on modifying code or binary executables. We refer to this as "ahead-of-time" mutation testing. Oftentimes, they neither take dynamic information that is only available at run-time into account nor alter program behavior at run-time. However, mutating via the latter could save costs on mutant creation: if the corresponding module of code is compiled, only the mutated section of code needs to be recompiled. Additional run-time information (like previous execution results of the mutated section), selected by an initial test run, could also help to determine the utility of a mutant. Skipping mutants of low utility could have an impact on mutation testing efficiency. We propose to refer to this approach as just-in-time mutation testing. In this paper, we provide a proof of concept for just-in-time and language-agnostic mutation testing. We present preliminary results of a feasibility study that explores the implementation of just-in-time mutation testing based on Truffle's instrumentation API. Based on these results, future research can evaluate the implications of just-in-time and language-agnostic mutation testing.
Autonomous Memory Sizing Formularization for Cloud-based IoT ML Customers
Machine learning IoT use cases involve thousands of sensor signals, and the demand on the cloud is high. One challenge for all cloud companies who seek to deal with big data use cases is the fact that the peak memory utilization scales non-linearly with the number of sensors, and sizing cloud shapes properly and autonomously prior to the program run is complicated. To address this issue, Oracle developed an autonomous formularization tool with OCI Anomaly Detection’s patented MSET2 algorithm so RAM capacity and/or VRAM capacity can be optimally sized—which helps developers gain a perception of the required computing resources beforehand and avoid the out-of-memory error. It also avoids excessively conservative RAM pre-allocations which saves cost for customers.
Gelato: Feedback-driven and Guided Security Analysis of Client-side Web Applications
Modern web applications are getting more sophisticated by using frameworks that make development easy, but pose challenges for security analysis tools. New analysis techniques are needed to handle such frameworks that grow in number and popularity. In this paper, we describe Gelato that addresses the most crucial challenges for a security-aware client-side analysis of highly dynamic web applications. In particular, we use a feedback-driven and state-aware crawler that is able to analyze complex framework-based applications automatically, and is guided to maximize coverage of security-sensitive parts of the program. Moreover, we propose a new lightweight client-side taint analysis that outperforms the state-of-the-art tools, requires no modification to browsers, and reports non-trivial taint flows on modern JavaScript applications. Gelato reports vulnerabilities with higher accuracy than existing tools and achieves significantly better coverage on 12 applications of which three are used in production.
Private Federated Learning with Domain Adaptation
Federated learning (FL) was originally motivated by communication bottlenecks in training models from data stored across millions of devices, but the paradigm of distributed training is attractive for models built on sensitive data, even when the number of users is relatively small, such as collaborations between organizations. For example, when training machine learning models from health records, the raw data may be limited in size, too sensitive to be aggregated directly, and concerns about data reconstruction must be addressed. Differential privacy (DP) offers a guarantee about the difficulty of reconstructing individual data points, but achieving reasonable privacy guarantees on small datasets can significantly degrade model accuracy. Data heterogeneity across users may also be more pronounced with smaller numbers of users in the federation pool. We provide a theoretical argument that model personalization offers a practical way to address both of these issues, and demonstrate its effectiveness with experimental results on a variety of domains, including spam detection, named entity recognition on case narratives from the Vaccine Adverse Event Reporting System (VAERS) and image classification using the federated MNIST dataset (FEMNIST).
Industrial Experience of Finding Cryptographic Vulnerabilities in Large-scale Codebases
Enterprise environments often screen large-scale (millions of lines of code) codebases with static analysis tools to find bugs and vulnerabilities. Parfait is a static code analysis tool used in Oracle to find security vulnerabilities in industrial codebases. Recently, many studies show that there are complicated cryptographic vulnerabilities caused by misusing cryptographic APIs in Java. In this paper, we describe how we realize a precise and scalable detection of these complicated cryptographic vulnerabilities based on the Parfait framework. The key challenge in the detection of cryptographic vulnerabilities is the high false alarm rate caused by pseudo-influences. Pseudo-influences happen if security-irrelevant constants are used in constructing security-critical values. Static analysis is usually unable to distinguish them from hard-coded constants that expose sensitive information. We tackle this problem by specializing the backward dataflow analysis used in Parfait with refinement insights, an idea from the tool CryptoGuard [20]. We evaluate our analyzer on a comprehensive Java cryptographic vulnerability benchmark and eleven large real-world applications. The results show that the Parfait-based cryptographic vulnerability detector can find real-world cryptographic vulnerabilities in large-scale codebases with high true-positive rates and low runtime cost.
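For illustration, the snippet below shows the kind of finding such an analysis targets: hard-coded key material flowing into a javax.crypto key object, alongside a security-irrelevant constant ("AES") that acts as a pseudo-influence. The example is constructed for this description, not taken from the paper's benchmark.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class CryptoMisuseExample {
    // Hard-coded key material: the kind of finding a cryptographic vulnerability
    // detector should report, because the secret is exposed in the codebase.
    private static final byte[] HARD_CODED_KEY =
            "0123456789abcdef".getBytes(StandardCharsets.UTF_8);

    public static byte[] encrypt(byte[] plaintext) throws Exception {
        // "AES" is a security-irrelevant constant (a pseudo-influence): it flows into
        // the key object but does not itself leak a secret, so it should not be flagged.
        SecretKeySpec key = new SecretKeySpec(HARD_CODED_KEY, "AES");
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }
}
```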
I have data and a business problem; now what?
In the last few decades, machine learning has made many great leaps and bounds, thereby substantially improving the state of the art in a diverse range of industry applications. However, for a given dataset and a business use case, non-technical users are faced by many questions that limit the adoption of a machine learning solution. For example: • Which machine learning model should I use? • How should I set its hyper-parameters? • Can I trust what my model learned? • Does my model discriminate against a marginalized, protected group? Even for seasoned data scientists, answering these questions can be tedious and time consuming. To address these barriers, the AutoMLx team at Oracle Labs has developed an automated machine learning (AutoML) pipeline that performs automated feature engineering, preprocessing and selection, and then selects a suitable machine learning model and hyper-parameter configuration. To help users understand and trust their "magic" and opaque machine learning models, the AutoMLx package supports a variety of methods that can help explain what the model has learned. In this talk, we will provide an overview of our current AutoMLx methods; we will comment on open questions and our active areas of research; and we will briefly review the projects of our sister teams at Oracle Labs. Finally, in this talk we will briefly reflect on some of the key differences between research in a cutting-edge industry lab compared with research in an academic setting.
Online Selection with Cumulative Fairness Constraints
We propose and study the problem of online selection with cumulative fairness constraints. In this problem, candidates arrive online, i.e., one at a time, and the decision maker must choose to accept or reject each candidate subject to a constraint on the history of decisions made thus far. We introduce deterministic, randomized, and learned policies for selection in this setting. Empirically, we demonstrate that our learned policies achieve the highest utility. However, we also show—using 700 synthetically generated datasets—that the simple, greedy algorithm is often competitive with the optimal sequence of decisions, obviating the need for complex (and often inscrutable) learned policies, in many cases. Theoretically, we analyze the limiting behavior of our randomized approach and prove that it satisfies the fairness constraint with high probability.
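A minimal sketch of a greedy policy of the kind described above, under an assumed constraint form (keep the share of accepted protected-group candidates at or above a target fraction); the exact constraint, utility model, and thresholds in the paper may differ.

```java
/** Sketch of a greedy online-selection policy under a cumulative fairness constraint:
 *  accept a candidate only if doing so keeps the share of accepted protected-group
 *  candidates at or above a target fraction. The constraint form is illustrative. */
public class GreedyFairSelector {
    private final double targetShare; // e.g. 0.4
    private int accepted = 0;
    private int acceptedProtected = 0;

    public GreedyFairSelector(double targetShare) {
        this.targetShare = targetShare;
    }

    public boolean decide(double utility, boolean isProtected, double utilityThreshold) {
        if (utility < utilityThreshold) {
            return false; // not worth accepting regardless of fairness
        }
        int newAccepted = accepted + 1;
        int newProtected = acceptedProtected + (isProtected ? 1 : 0);
        // Greedy check: would accepting keep the cumulative constraint satisfied?
        if (!isProtected && (double) newProtected / newAccepted < targetShare) {
            return false;
        }
        accepted = newAccepted;
        acceptedProtected = newProtected;
        return true;
    }
}
```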
Scalable Static Analysis to Detect Security Vulnerabilities: Challenges and Solutions
Parfait is a static analysis tool originally developed to find defects in C/C++ systems code. It has since been extended to detect injection attacks in Java and PL/SQL applications. Parfait has been deployed internally at Oracle, is used by thousands of developers, and can be integrated at commit-time, in the nightly build or used standalone. Commit-time integration brings security closer to developers, and provides them with the opportunity to fix defects before they are merged. This poster presents some of the challenges we encountered in the process of extending Parfait from a defect analyser for C/C++ to a security analyser for Java and PL/SQL, and the solutions that enabled us to analyse a variety of commercial enterprise applications in a fast and precise way.
Poster: Unacceptable Behavior: Robust PDF Malware Detection Using Abstract Interpretation
The popularity of the PDF format and the rich JavaScript environment that PDF viewers offer make PDF documents an attractive attack vector for malware developers. Because machine learning-based approaches are subject to adversarial attacks that mimic the structure of benign documents, we propose to detect malicious code inside a PDF by statically reasoning about its possible behaviours using abstract interpretation. A comparison with state-of-the-art PDF malware detection tools shows that our conservative abstract interpretation approach achieves similar accuracy, is more resilient to evasion attacks, and provides explainable reports.
Clonefiles
Explores the concept of clonefiles (aka Linux reflinks) and describes various tools and techniques for efficient introspection and processing.
Montsalvat: Intel SGX Shielding for GraalVM Native Images
The rapid growth of the Java programming language has led to its wide adoption in cloud computing infrastructures. However, Java applications running in untrusted clouds are susceptible to various forms of privileged attacks. The emergence of trusted execution environments (TEEs), i.e., Intel SGX, mitigates this problem. TEEs protect code and data in secure enclaves inaccessible to untrusted software, including the kernel or hypervisors. To efficiently use TEEs, developers are required to manually partition their applications into trusted and untrusted parts. This decreases the trusted computing base (TCB) and minimizes security vulnerabilities. However, partitioning Java applications poses two important challenges: (1) ensuring efficient object communication between the partitioned components, and (2) ensuring garbage collection consistency between them. We present Montsalvat, a tool which provides a practical and intuitive annotation-based partitioning approach for Java applications using secure enclaves. Montsalvat provides an RMI-like mechanism to ensure inter-object communication, as well as consistent garbage collection across the partitioned components. We implement Montsalvat with GraalVM Native Image, a tool which ahead-of-time compiles Java applications into standalone native executables which do not require a JVM at runtime. We perform extensive evaluations of Montsalvat using micro and macro benchmarks, and show that our partitioning approach can lead to up to 6.6× and 2.9× performance boosts in real-world applications (i.e., PalDB and GraphChi) respectively as compared to solutions that naively include the entire applications in the enclave.
Distributed Graph Processing with PGX.D
Graph processing is one of the top data analytics trends. In particular, graph processing comprises two main styles of analysis, namely graph algorithms and graph pattern-matching queries. Classic graph algorithms, such as Pagerank, repeatedly traverse the vertices and edges of the graph and calculate some desired (mathematical) function. Graph queries enable the interactive exploration and pattern matching of graphs. For example, queries like `SELECT p1.name, p2.name FROM MATCH (p1:person)-[:friend]->(p2:person) WHERE p1.country = p2.country` combine the classic operations found in SQL with graph patterns. Both algorithms and queries are very challenging workloads, especially in a distributed setting, where very large graphs are partitioned across multiple machines. In this lecture, I will present how the distributed PGX [1] engine (known as PGX.D; developed at Oracle Labs [2] Zurich) implements efficient algorithms and queries and solves problems, such as data skew and intermediate-result explosion. In brief, for graph algorithms, PGX.D offers the functionality to compile simple sequential textbook-style GreenMarl [3] algorithms to efficient distributed execution. For queries, PGX.D includes a depth-first asynchronous computation runtime [4] that enables limiting the amount of intermediate data during query execution to essentially support "any-size" patterns. [1] http://www.oracle.com/technetwork/oracle-labs/parallel-graph-analytix/overview/index.html [2] https://labs.oracle.com [3] Green-Marl: A DSL for easy and efficient graph analysis, ASPLOS'12. [4] aDFS: An Almost Depth-First-Search Distributed Graph-Querying System. USENIX ATC'21.
Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation
Sequence-to-Sequence (Seq2Seq) neural text generation models, especially the pre-trained ones (e.g., BART and T5), have exhibited compelling performance on various natural language generation tasks. However, the black-box nature of these models limits their application in tasks where specific rules (e.g., controllable constraints, prior knowledge) need to be executed. Previous works either design specific model structures (e.g., Copy Mechanism corresponding to the rule “the generated output should include certain words in the source input”) or implement specialized inference algorithms (e.g., Constrained Beam Search) to execute particular rules during text generation. These methods require careful case-by-case design and make it difficult to support multiple rules concurrently. In this paper, we propose a novel module named Neural Rule-Execution Tracking Machine, i.e., NRETM, that can be equipped into various transformer-based generators to leverage multiple rules simultaneously to guide the neural generation model for superior generation performance in a unified and scalable way. Extensive experiments on several benchmarks verify the effectiveness of our proposed model in both controllable and general text generation tasks.
Security Research at Oracle Labs, Australia
This is a broad-brush overview of the relevant projects (both past and present) at Oracle Labs, Australia. It also outlines some of the security ideas and software engineering principles that are relevant to tool development and deployment.
Bitemporal Property Graphs to Organize Evolving Systems
This work is a summarized view on the results of a one-year cooperation between Oracle Corp. and the University of Leipzig. The goal was to research the organization of relationships within multi-dimensional time-series data, such as sensor data from the IoT area. We showed in this project that temporal property graphs with some extensions are a prime candidate for this organizational task that combines the strengths of both data models (graph and time-series). The outcome of the cooperation includes four achievements: (1) a bitemporal property graph model, (2) a temporal graph query language, (3) a conception of continuous event detection, and (4) a prototype of a bitemporal graph database that supports the model, language and event detection.
Diverse Data Augmentation via Unscrambling Text with Missing Words
We present the Diverse Augmentation using Scrambled Seq2Seq (DAugSS) algorithm, a fully automated data augmentation mechanism that leverages a model to generate examples in a semi-controllable fashion. The main component of DAugSS is a training procedure in which the generative model is trained to transform a class label and a sequence of tokens into a well-formed sentence of the specified class that contains the specified tokens. Empirically, we show that DAugSS is competitive with or outperforms state-of-the-art, generative models for data augmentation in terms of test set accuracy on 4 datasets. We show that the flexibility of our approach yields datasets with expansive vocabulary, and that models trained on these datasets are more resilient to adversarial attacks than when trained on datasets augmented by competing methods.
Searching Near and Far for Examples in Data Augmentation
In this work, we demonstrate that augmenting a dataset with examples that are far from the initial training set can lead to significant improvements in test set accuracy. We draw on the similarity of deep neural networks and nearest neighbor models. Like a nearest neighbor classifier, we show that, for any test example, augmentation with a single, nearby training example of the same label--followed by retraining--is often sufficient for a BERT-based model to correctly classify the test example. In light of this result, we devise FRaNN, an algorithm that attempts to cover the embedding space defined by the trained model with training examples. Empirically, we show that FRaNN, and its variant FRaNNK, construct augmented datasets that lead to models with higher test set accuracy than either uncertainty sampling or a random augmentation baseline.
Multivalent Entailment Graphs for Question Answering
Drawing inferences between open-domain natural language predicates is a necessity for true language understanding. There has been much progress in unsupervised learning of entailment graphs for this purpose. We make three contributions: (1) we reinterpret the Distributional Inclusion Hypothesis to model entailment between predicates of different valencies, like DEFEAT(Biden, Trump) |= WIN(Biden); (2) we actualize this theory by learning unsupervised Multivalent Entailment Graphs of open-domain predicates; and (3) we demonstrate the capabilities of these graphs on a novel question answering task. We show that directional entailment is more helpful for inference than non-directional similarity on questions of fine-grained semantics. We also show that drawing on evidence across valencies answers more questions than by using only the same valency evidence.
Open-Domain Contextual Link Prediction and its Complementarity with Entailment Graphs
An open-domain knowledge graph (KG) has entities as nodes and natural language relations as edges, and is constructed by extracting (subject, relation, object) triples from text. The task of open-domain link prediction is to infer missing relations in the KG. Previous work has used standard link prediction for the task. Since triples are extracted from text, we can ground them in the larger textual context in which they were originally found. However, standard link prediction methods only rely on the KG structure and ignore the textual context of the triples. In this paper, we introduce the new task of open-domain contextual link prediction which has access to both the textual context and the KG structure to perform link prediction. We build a dataset for the task and propose a model for it. Our experiments show that context is crucial in predicting missing relations. We also demonstrate the utility of contextual link prediction in discovering out-of-context entailments between relations, in the form of entailment graphs (EG), in which the nodes are the relations. The reverse holds too: out-of-context EGs assist in predicting relations in context.
GraalVM, Python, and Polyglot Programming
Presentation at HPI graduate school, the PhD school at the HPI in Potsdam.
LXM: Better Splittable Pseudorandom Number Generators (and Almost as Fast)
Paper to be submitted to ACM OOPSLA 2021. Abstract: In 2014, Steele, Lea, and Flood presented SplitMix, an object-oriented pseudorandom number generator (PRNG) that is quite fast (9 64-bit arithmetic/logical operations per 64 bits generated) and also splittable. A conventional PRNG object provides a generate method that returns one pseudorandom value and updates the state of the PRNG; a splittable PRNG object also has a second operation, split, that replaces the original PRNG object with two (seemingly) independent PRNG objects, by creating and returning a new such object and updating the state of the original object. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs structured using fork-join parallelism. This overall strategy still appears to be sound, but the specific arithmetic calculation used for generate in the SplitMix algorithm has some detectable weaknesses, and the period of any one generator is limited to 2^64. Here we present the LXM family of PRNG algorithms. The idea is an old one: combine the outputs of two independent PRNG algorithms, then (optionally) feed the result to a mixing function. An LXM algorithm uses a linear congruential subgenerator and an F2-linear subgenerator; the examples studied in this paper use an LCG of period 2^16, 2^32, 2^64, or 2^128 with one of the multipliers recommended by L'Ecuyer or by Steele and Vigna, and an F2-linear generator of the xoshiro family or xoroshiro family as described by Blackman and Vigna. Mixing functions studied in this paper include the MurmurHash3 finalizer function, David Stafford's variants, Doug Lea's variants, and the null (identity) mixing function. Like SplitMix, LXM provides both a generate operation and a split operation. Also like SplitMix, LXM requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with SIMD instruction sets because it has no branches or loops. We analyze the period and equidistribution properties of LXM generators, and present the results of thorough testing of specific members of this family, using the TestU01 and PractRand test suites, not only on single instances of the algorithm but also for collections of instances, used in parallel, ranging in size from 2 to 2^27. Single instances of LXM that include a strong mixing function appear to have no major weaknesses, and LXM is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. We believe that LXM is suitable for the same sorts of applications as SplitMix, that is, "everyday" scientific and machine-learning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
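For readers who want to try the generators: the LXM family shipped in JDK 17's java.util.random package (JEP 356) under names such as L64X128MixRandom. The snippet below shows the generate and split operations through the standard SplittableGenerator interface; it is a usage illustration, not part of the paper.

```java
import java.util.random.RandomGenerator;

public class SplittablePrngExample {
    public static void main(String[] args) {
        // "L64X128MixRandom" combines a 64-bit LCG, a 128-bit F2-linear (xoroshiro-style)
        // subgenerator, and a mixing function, as in the LXM design described above.
        RandomGenerator.SplittableGenerator root =
                RandomGenerator.SplittableGenerator.of("L64X128MixRandom");

        // split() yields a statistically independent child stream, convenient for
        // fork-join tasks: each subtask gets its own generator without locking.
        RandomGenerator.SplittableGenerator child = root.split();

        System.out.println("parent: " + root.nextLong());
        System.out.println("child:  " + child.nextLong());
    }
}
```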
Run-time Data Analysis to Drive Compiler Optimizations
Throughout program execution, types may stabilize, variables may become constant, and code sections may turn out to be redundant - all information that is used by just-in-time (JIT) compilers to achieve peak performance. Yet, since JIT compilation is done on demand for individual code parts, global observations cannot be made. Moreover, global data analysis is an inherently expensive process, that collects information over large data sets. Thus, it is infeasible in dynamic compilers. With this project, we propose integrating data analysis into a dynamic runtime to speed up big data applications. The goal is to use the detailed run-time information for speculative compiler optimizations based on the shape and complexion of the data to improve performance.
Run-Time Data Analysis in Dynamic Runtimes
Databases are typically faster in processing huge amounts of data than applications with hand-coded data access. Even though modern dynamic runtimes optimize applications intensively, they cannot perform certain optimizations that are traditionally used by database systems as they lack the required information. Thus, we propose to extend the capabilities of dynamic runtimes to allow them to collect fine-grained information of the processed data at run time and use it to perform database-like optimizations. By doing so, we want to enable dynamic runtimes to significantly boost the performance of data-processing workloads. Ideally, applications should be as fast as databases in data-processing workloads by detecting the data schema at run time. To show the feasibility of our approach, we are implementing it in a polyglot dynamic runtime.
LXM: Better Splittable Pseudorandom Number Generators (and Almost as Fast)
Video for a conference presentation at ACM OOPSLA 2021. The video file is 1280x720. An associated SRT file contains the subtitle (closed caption) information separately. The corresponding paper is Archivist 2021-0405. The slides are available in PDF and PowerPoint formats as Archivist 2021-1004.
GraalVM Native Image: Large-scale static analysis for Java
GraalVM Native Image combines static analysis, heap snapshotting, and ahead-of-time compilation to produce a highly optimized standalone executable for a Java application. In this talk, we first introduce the overall architecture of GraalVM Native Image: instead of “just” compiling Java bytecode ahead of time, it also initializes part of the application at build time. This reduces the startup time and memory footprint of the application at run time. In the second part of the talk, we dive into details of the points-to analysis. We show which of our original research ideas worked or did not work when analyzing large production applications; and we show the benefits of tightly integrating the static analysis with the ahead-of-time compiler.
Lightweight On-Stack Replacement in Languages with Unstructured Loops
On-stack replacement (OSR) is a popular technique used by just in time (JIT) compilers. A JIT can use OSR to transfer from interpreted to compiled code in the middle of execution, immediately reaping the performance benefits of compilation. This technique typically relies on loop counters, so it cannot be easily applied to languages with unstructured control flow. It is possible to reconstruct the high-level loop structures of an unstructured language using a control flow analysis, but such an analysis can be complicated, expensive, and language-specific. In this paper, we present a more lightweight strategy for OSR in unstructured languages which relies only on detecting backward jumps. We design a simple, language-agnostic API around this strategy for language interpreters. We then discuss our implementation of the API in the Truffle framework, and the design choices we made to make it efficient and correct. In our evaluation, we integrate the API with Truffle’s LLVM bitcode interpreter, and find the technique is effective at improving start-up performance without harming warmed-up performance.
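The following is a minimal, language-agnostic sketch of the backward-jump heuristic, assuming a bytecode-style interpreter with an explicit program counter; the threshold and the compile/transfer hooks are placeholders, not the Truffle API used in the paper.

```java
/** Sketch of the backward-jump heuristic for OSR in an interpreter for a language
 *  with unstructured control flow: no loop reconstruction, just a counter that is
 *  bumped whenever the jump target precedes the current position. */
public class BackwardJumpOsr {
    private static final int OSR_THRESHOLD = 100_000;
    private int backEdgeCount = 0;

    /** Called by the interpreter dispatch loop on every taken jump. */
    public int onJump(int currentPc, int targetPc) {
        if (targetPc <= currentPc) {             // backward jump: likely part of a loop
            if (++backEdgeCount >= OSR_THRESHOLD) {
                backEdgeCount = 0;
                requestOsrCompilation(targetPc); // placeholder hook
            }
        }
        return targetPc;
    }

    private void requestOsrCompilation(int entryPc) {
        // In a real runtime this would hand the current frame state to the JIT and
        // continue execution in compiled code at entryPc.
        System.out.println("OSR requested at pc " + entryPc);
    }
}
```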
CompGen: Generation of Fast Compilers in a Multi-Language VM
The first Futamura projection enables compilation and high performance code generation of user programs by partial evaluation of language interpreters. Previous work has shown that it is sufficient to leverage profiling information and use partial evaluation directives in interpreters as hints to drive partial evaluation towards compiled code efficiency. However, this comes with the downside of additional application warm-up time: partial evaluation of language interpreters has to specialize interpreter code on the fly to the dynamic types used at run time to create efficient target code. As a result, the time spent on partial evaluation itself is a significant contributor to the overall compile time of a method. The second Futamura projection solves this problem by self-applying partial evaluation on the partial evaluation algorithm, effectively generating language-specific compilers from interpreters. This typically reduces compilation time compared to the first projection. Previous work employed the second projection to some extent, however to this day, no generic second Futamura projection approach is used in a state-of-the-art language runtime. Ultimately, the problems of code-size explosion for compiler generation and warm-up time increases are unsolved problems subject to research to this day. To solve the problems of code-size explosion and self-application warm-up, this paper proposes CompGen, an approach based on code generation of subsets of language interpreters which is loosely based upon the idea of the second Futamura projection. We implemented a prototype of CompGen for GraalVM and show that our usage of a novel code-generation algorithm, incorporating interpreter directives, makes it possible to generate efficient compilers that emit fast target programs which easily outperform the first Futamura projection in compilation time. We evaluated our approach with GraalJS, an ECMAScript-compliant interpreter, and standard JavaScript benchmarks, showing that our approach achieves 2-3x speedups of partial evaluation.
Tribuo: Machine Learning with Provenance in Java
Machine Learning models are deployed across a wide range of industries, performing a wide range of tasks. Tracking these models and ensuring they behave appropriately is becoming increasingly difficult as the number of models increases. Current ML monitoring systems provide provenance and tracking by layering on top of the library that performs the ML computation, allowing room for developer confusion and mistakes. In this paper we introduce Tribuo, a Java ML library which integrates model training, inference, strong type-safety, runtime checking, and automatic provenance recording into a single framework. All Tribuo’s models and evaluations record the full data pipeline of training and testing data, along with the training algorithms, hyperparameters and data transformation steps automatically. This data lives inside the model object and can be persisted separately using common markup formats. Tribuo implements many popular ML algorithms for classification, regression, clustering, multi-label classification and anomaly detection, along with interfaces to XGBoost, TensorFlow and ONNX Runtime. Tribuo’s source code is available at https://github.com/oracle/tribuo under an Apache 2.0 license with documentation and tutorials available at https://tribuo.org.
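A small usage sketch based on Tribuo's public classification API as described in its documentation and tutorials; the CSV path and response column are placeholders, and the provenance is printed directly from the trained model object.

```java
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

import java.nio.file.Paths;

public class TribuoProvenanceExample {
    public static void main(String[] args) throws Exception {
        // "data.csv" and the response column "label" are placeholders.
        var source = new CSVLoader<>(new LabelFactory()).loadDataSource(Paths.get("data.csv"), "label");
        var split = new TrainTestSplitter<>(source, 0.7, 1L);
        var train = new MutableDataset<>(split.getTrain());
        var test = new MutableDataset<>(split.getTest());

        Model<Label> model = new LogisticRegressionTrainer().train(train);
        LabelEvaluation evaluation = new LabelEvaluator().evaluate(model, test);
        System.out.println(evaluation);

        // The training pipeline (data source, transformations, trainer, hyperparameters)
        // is recorded inside the model object itself.
        System.out.println(model.getProvenance());
    }
}
```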
Low-Overhead Multi-Language Dynamic Taint Analysis on Managed Runtimes through Speculative Optimization
Conference presentation of the paper http://ol-archivist.us.oracle.com/archivist/document/2021-0512
Searching Near and Far for Examples in Data Augmentation
In this work, we demonstrate that augmenting a dataset with examples that are far from the initial training set can lead to significant improvements in test set accuracy. We draw on the similarity of deep neural networks and nearest neighbor models. Like a nearest neighbor classifier, we show that, for any test example, augmentation with a single, nearby training example of the same label--followed by retraining--is often sufficient for a BERT-based model to correctly classify the test example. In light of this result, we devise FRaNN, an algorithm that attempts to cover the embedding space defined by the trained model with training examples. Empirically, we show that FRaNN, and its variant FRaNNk, construct augmented datasets that lead to models with higher test set accuracy than either uncertainty sampling or a random augmentation baseline.
Private Cross-Silo Federated Learning for Extracting Vaccine Adverse Event Mentions
Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for users to jointly train a global model without physically sharing their data. Users can indirectly contribute to, and directly benefit from a much larger aggregate data corpus used to train the global model. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying a FL based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass scale vaccination programs. We present a comprehensive empirical analysis of various dimensions of benefits gained with FL based training. Furthermore, we investigate effects of tighter Differential Privacy (DP) constraints in highly sensitive settings where federation users must enforce Local DP to ensure strict privacy guarantees. We show that local DP can severely cripple the global model’s prediction accuracy, thus disincentivizing users from participating in the federation. In response, we demonstrate how recent innovation on personalization methods can help significantly recover the lost accuracy.
Just-in-Time Compiling Ruby Regexps on TruffleRuby
Just-in-Time Compiling Ruby Regexps on TruffleRuby, a presentation about the performance benefits gained by the adoption of TRegex in TruffleRuby.
ICDAR 2021 Scientific Literature Parsing Competition
Documents in Portable Document Format (PDF) are ubiquitous with over 2.5 trillion documents. PDF format is human readable but not easily understood by machines and the large number of different styles makes it difficult to process the large variety of documents effectively. Our ICDAR 2021 Scientific Literature Parsing Competition offers participants a large number of training and evaluation examples compared to previous competitions. Top competition results show a significant increase in performance compared to previously reported results on the competition data sets. Most of the current methods for document understanding rely on deep learning, which requires a large number of training examples. We have generated large data sets that have been used in this competition. Our competition is split into two tasks to understand document layouts (Task A) and tables (Task B). In Task A, Document Layout Recognition, submissions with the highest performance combine object detection and specialised solutions for the different categories. In Task B, Table Recognition, top submissions rely on methods to identify table components and post-processing methods to generate the table structure and content. Results from both tasks show an impressive performance and open up the possibility of high performance practical applications.
The Future Is Big Graphs: A Community View on Graph Processing Systems
Graphs are, by nature, 'unifying abstractions' that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
Exploring Time-Space trade-offs for "synchronized" in Lilliput
In the context of Project Lilliput, which attempts to reduce the size of the object header in the HotSpot Java Virtual Machine (JVM), we explore a curated set of synchronization algorithms. Each of the algorithms could serve as a potential replacement implementation for the “synchronized” construct in HotSpot. Collectively, the algorithms illuminate trade-offs in space-time properties. The key design decisions are where to locate synchronization metadata (monitor fields), how to map from an object to those fields, and the lifecycle of the monitor information. The reader is assumed to be familiar with the current HotSpot implementation of “synchronized” as well as the Compact Java Monitors (CJM) design.
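For readers less familiar with the construct under discussion, a minimal example of Java monitor-based synchronization follows. It is purely illustrative; where the monitor metadata backing these operations lives (header bits, side tables, or inflated monitors) is exactly the design space the talk explores.

```java
// Illustrative only: every Java object can act as a monitor. Project Lilliput
// is concerned with where the metadata for these monitor operations is stored,
// not with how application code uses the construct.
public class Counter {
    private long value;

    public synchronized void increment() {   // acquires the monitor of 'this'
        value++;
    }

    public long read() {
        synchronized (this) {                 // explicit monitor enter/exit
            return value;
        }
    }
}
```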
Private Cross-Silo Federated Learning for Extracting Vaccine Adverse Event Mentions
Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for users to jointly train a global model without physically sharing their data. Users can indirectly contribute to, and directly benefit from, a much larger aggregate data corpus used to train the global model. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying an FL-based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass-scale vaccination programs. We present a comprehensive empirical analysis of various dimensions of benefits gained with FL-based training. Furthermore, we investigate the effects of tighter Differential Privacy (DP) constraints in highly sensitive settings where federation users must enforce Local DP to ensure strict privacy guarantees. We show that local DP can severely cripple the global model’s prediction accuracy, thus disincentivizing users from participating in the federation. In response, we demonstrate how recent innovation on personalization methods can help significantly recover the lost accuracy. We focus our analysis on the Federated Fine-Tuning algorithm, FedFT, and prove that it is not PAC Identifiable, thus making it even more attractive for FL-based training.
Mention Flags (MF): Constraining Transformer-based Text Generators
This paper focuses on Seq2Seq (S2S) constrained text generation, where the text generator is constrained to mention specific words, which are inputs to the encoder, in the generated outputs. Pre-trained S2S models or a Copy Mechanism are trained to copy the surface tokens from encoders to decoders, but they cannot guarantee constraint satisfaction. Constrained decoding algorithms always produce hypotheses satisfying all constraints. However, they are computationally expensive and can lower the generated text quality. In this paper, we propose Mention Flags (MF), which trace whether lexical constraints are satisfied in the generated outputs in an S2S decoder. The MF models are trained to generate tokens until all constraints are satisfied, guaranteeing high constraint satisfaction. Our experiments on the Common Sense Generation task (CommonGen) (Lin et al., 2020), the End2end Restaurant Dialog task (E2ENLG) (Dušek et al., 2020) and the Novel Object Captioning task (nocaps) (Agrawal et al., 2019) show that the MF models maintain higher constraint satisfaction and text quality than the baseline models and other constrained decoding algorithms, achieving state-of-the-art performance on all three tasks. These results are achieved with a much lower run-time than constrained decoding algorithms. We also show that the MF models work well in the low-resource setting.
aDFS: An Almost Depth-First-Search Distributed Graph-Querying System
Graph processing is an invaluable tool for data analytics. In particular, pattern-matching queries enable flexible graph exploration and analysis, similar to what SQL provides for relational databases. Graph queries focus on following connections in the data; they are a challenging workload because even seemingly trivial queries can easily produce billions of intermediate results and irregular data access patterns. In this paper, we introduce aDFS: a distributed graph-querying system that can process practically any query fully in memory, while maintaining bounded runtime memory consumption. To achieve this behavior, aDFS relies on (i) almost depth-first (aDFS) graph exploration with some breadth-first characteristics for performance, and (ii) non-blocking dispatching of intermediate results to remote edges. We evaluate aDFS against state-of-the-art systems for graph querying (Neo4J and GraphFrames for Apache Spark), graph mining (G-Miner, Fractal, and Peregrine), as well as dataflow joins (BiGJoin), and show that aDFS significantly outperforms prior work on a diverse selection of workloads.
aDFS: An Almost Depth-First-Search Distributed Graph-Querying System (Presentation Slides)
Presentation slides for the paper "aDFS: An Almost Depth-First-Search Distributed Graph-Querying System" accepted at USENIX ATC 2021.
Doing More with Less: Characterizing Dataset Downsampling for AutoML
Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric, subject to a limited execution time budget. While this approach to generating candidates works well for small tabular datasets, the same procedure does not directly scale to larger tabular datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. We carry out an extensive empirical evaluation of the impact that downsampling – reducing the number of rows in the input tabular dataset – has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.
Retail markdown price optimization and inventory allocation under demand parameter uncertainty
This paper discusses a prescriptive analytics approach to solving a joint markdown pricing and inventory allocation optimization problem under demand parameter uncertainty. We consider a retailer capable of price differentiation among multiple customer groups with different demand parameters that are supplied from multiple warehouses or fulfillment centers at different costs. In particular, we consider a situation when the retailer has a limited amount of inventory that must be sold by a certain exit date. Since in most practical situations the demand parameters cannot be estimated exactly, we propose an approach to optimize the expected value of the profit based on the given distribution of the demand parameters and analyze the properties of the solution. We also describe a predictive demand model to estimate the distribution of the demand parameters based on the historical sales data. Since the sales data usually include multiple similar products embedded into a hierarchical structure, we suggest an approach to the demand modeling that takes advantage of the merchandise and location hierarchies.
Scalable String Analysis: An Experience Report (Presentation slides)
Presentation slides for the paper "Scalable String Analysis: An Experience Report" accepted at SOAP'21
Towards Intelligent Application Security
Over the past 20 years we have seen application security evolve from analysing application code through Static Application Security Testing (SAST) tools, to detecting vulnerabilities in running applications via Dynamic Application Security Testing (DAST) tools. The past 10 years have seen new flavours of tools to provide combinations of static and dynamic tools via Interactive Application Security Testing (IAST), examination of the components and libraries of the software called Software Composition Analysis (SCA), protection of web applications and APIs using signature-based Web Application Firewalls (WAF), and monitoring the application and blocking attacks through Runtime Application Self Protection (RASP) techniques. The past 10 years have also seen an increase in the uptake of the DevOps model that combines software development and operations to provide continuous delivery of high quality software. As security has become more important, the DevOps model has evolved to the DevSecOps model where software development, operations and security are all integrated. There has also been increasing usage of learning techniques, including machine learning and program synthesis. Several tools have been developed that make use of machine learning to help developers make quality decisions about their code, their tests, or the runtime overhead their code produces. However, such techniques have not yet been applied to application security. In this talk I discuss how to provide an automated approach to integrate security into all aspects of application development and operations, aided by learning techniques. This incorporates signals from code, operations, and beyond, together with automation, to provide actionable intelligence to developers, security analysts, operations staff, and autonomous systems. I will also consider how malware and threat intelligence can be incorporated into this model to support Intelligent Application Security in a rapidly evolving world.
Scalable String Analysis: An Experience Report
Static string analysis underpins many security-related analyses, including detection of SQL injections and cross-site scripting. Even though string analysis has received much attention, none of the known techniques are effective on large codebases. In this paper we present OLSA -- a tool for scalable static string analysis of large Java programs. OLSA's analysis is based on intra-procedural string value flow graphs connected via call-graph edges. Formally, this uses a context-sensitive grammar to generate the set of possible strings. We evaluate our approach by using OLSA to detect SQL injections and unsafe uses of reflection in the DaCapo benchmarks and a large internal Java codebase, and compare the performance of OLSA with the state-of-the-art string analyser JSA. The results of this experimentation indicate that our approach can analyse industrial-scale codebases in a matter of hours, whereas JSA does not scale to many DaCapo programs. The set of potential strings generated by our string analysis can be used for checking the validity of the reported potential vulnerabilities.
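To make the analysis target concrete, here is a hypothetical example (not taken from the paper) of the kind of string flow such an analysis tracks: a user-controlled value reaching a SQL query through string concatenation.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical example of the pattern a static string analysis flags:
// an attacker-controlled string flows into a dynamically built SQL query.
class UserLookup {
    ResultSet findUser(Connection conn, String userInput) throws SQLException {
        String query = "SELECT * FROM users WHERE name = '" + userInput + "'";
        Statement stmt = conn.createStatement();
        return stmt.executeQuery(query);   // potential SQL injection sink
    }
}
```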
Compiler-Assisted Object Inlining with Value Fields
Object Oriented Programming has flourished in many areas ranging from web-oriented microservices and data processing to databases. However, while representing domain entities as objects is appealing to developers, it leads to high data fragmentation as data is loaded into applications as large collections of data objects, resulting in a high memory footprint and poor locality. To minimize memory footprint and increase memory locality, embedding the payload of an object into another object (object inlining) has been considered before, but existing techniques present severe limitations that prevent it from becoming a widely adopted technique. We argue that object inlining is mostly useful to optimize objects in the application data-path and that such objects have value semantics, unlocking great potential for inlining objects. We propose value fields, an abstraction which allows fields to be marked as having value semantics. We take advantage of the closed-world assumption provided by GraalVM Native Image to implement object inlining as a compiler phase that modifies both object layouts and accesses to inlined fields. Experimental evaluation shows that using value fields in real-world frameworks such as Apache Spark, Spring Boot, and Micronaut requires minimal to no effort from developers. Results show improvements in throughput of up to 3x, memory footprint reductions of up to 40%, and reduced GC pause times of up to 35%.
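A hedged sketch of the kind of data-path object the paper targets: a field with value semantics that never escapes its holder and could therefore be flattened into the enclosing object's layout. The names are illustrative, and the actual marking mechanism is specific to the authors' GraalVM Native Image implementation.

```java
// Illustrative only: 'position' behaves like a value -- it is never shared or
// compared by identity -- so its payload (x, y) could in principle be inlined
// into Particle's layout, removing one object header and one pointer hop.
final class Point {
    final double x;
    final double y;
    Point(double x, double y) { this.x = x; this.y = y; }
}

final class Particle {
    private Point position;          // candidate value field

    Particle(double x, double y) { this.position = new Point(x, y); }

    void move(double dx, double dy) {
        position = new Point(position.x + dx, position.y + dy);
    }
}
```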
Modeling memory bandwidth patterns on NUMA machines with performance counters
Modern computers used for data analytics are often NUMA systems with multiple sockets per machine, multiple cores per socket, and multiple thread contexts per core. To get the peak performance out of these machines requires the correct number of threads to be placed in the correct positions on the machine. One particularly interesting element of the placement of memory and threads is the way it affects the movement of data around the machine, and the increased latency this can introduce to reads and writes. In this paper we describe work on modeling the bandwidth requirements of an application on a NUMA compute node based on the placement of threads. The model is constructed by sampling performance counters while the application runs with two carefully chosen thread placements. The results of this modeling can be used in a number of ways, ranging from performance debugging during development, where the programmer can be alerted to potentially problematic memory access patterns; to systems such as Pandia, which take an application and predict the performance and system load of a proposed thread count and placement; to libraries of data structures, such as Parallel Collections and Smart Arrays, that can abstract memory placement and thread placement issues away from the user when parallelizing code.
The Flavour of Real World Vulnerability Detection and Intelligent Configuration
The Parfait static code analysis tool focuses on detecting vulnerabilities that matter in C, C++, Java and Python languages. Its focus has been on key items expected out of a commercial tool that lives in a commercial organisation, namely, precision of results (i.e., high true positive rate), scalability (i.e., being able to run quickly over millions of lines of code), incremental analysis (i.e., being able to run over deltas of the code quickly), and usability (i.e., ease of integration into standard build processes, reporting of traces to the vulnerable location, etc). Today, Parfait is used by thousands of developers at Oracle worldwide on a day-to-day basis. In this presentation we’ll sample a flavour of Parfait — we explore some real world challenges faced in the creation of a robust vulnerability detection tool, look into two examples of vulnerabilities that severely affected the Java platform in 2012/2013 and most machines since 2017, and conclude by recounting what matters to developers for integration into today’s continuous integration and continuous delivery (CI/CD) pipelines. Key to deployment of static code analysis tools is configuration of the tool itself - we present our experiences with use of machine learning to automatically configure the tool, providing users with a better out-of-the-box experience.
Intelligent Application Security
Over the past 20 years we have seen application security evolve from analysing application code through Static Application Security Testing tools, to detecting vulnerabilities in running applications via Dynamic Application Security Testing tools. The past 10 years have seen new flavours of tools: Software Composition Analysis, Web Application Firewalls, and Runtime Application Self Protection. The past 10 years have also seen an increase in the uptake of the DevOps model that combines software development and operations. Several tools have been developed that make use of machine learning to help developers make quality decisions about their code, their tests, or the runtime overhead their code produces. However, little has been done to address application security. This talk focuses on a vision for Intelligent Application Security in the context of the DevSecOps model, where security is integrated into DevOps, by informing program analysis with learning techniques including program synthesis, and keeping track of a knowledge base. What is Intelligent Application Security? Intelligent Application Security aims to provide an automated approach to integrate security into all aspects of application development and operation, at scale, using learning techniques that incorporate signals from the code and beyond, to provide actionable intelligence to developers, security analysts, operations staff, and autonomous systems.
RASPunzel for deserialization in 5 min
In this talk, we show how data-driven allowlist synthesis can help prevent deserialization vulnerabilities, which often lead to remote code execution attacks. Serialization is the process of converting an in-memory object to a persistent format (e.g., byte stream, JSON, XML, binary) and re-creating it from that format. Serialization is present in many languages like Java, Python, Ruby, and C#, and it is commonly used to exchange data in distributed systems or across different languages. In many cases, however, it can be exploited by crafting serialized payloads that trigger arbitrary code upon deserialization. The most common, and insufficient, defences against deserialization attacks are blocklists, which prevent deserialization of known malicious code. Allowlists instead restrict deserialization to known benign code, but shift the burden of creating and maintaining the list from security practitioners to developers. In this talk, we show how data-driven allowlist synthesis combined with runtime application self-protection greatly simplifies the creation and enforcement of allowlists while significantly improving security. Through a demo, we will show how a runtime application self-protection (RASP) agent enforcing a synthesized allowlist prevents real-world deserialization attacks without the need to alter or re-compile application code.
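For readers unfamiliar with allowlist enforcement, a minimal sketch using the JDK's standard ObjectInputFilter mechanism is shown below. The RASPunzel agent synthesizes and enforces its lists without such explicit application code, and the package patterns here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;

// Minimal sketch: restrict deserialization to an allowlist of classes using
// the JDK's ObjectInputFilter (Java 9+). The synthesized allowlist described
// in the talk is enforced by a RASP agent instead of hand-written code.
class SafeDeserializer {
    static Object read(byte[] payload) throws IOException, ClassNotFoundException {
        ObjectInputFilter allowlist =
            ObjectInputFilter.Config.createFilter("com.example.model.*;java.util.*;!*");
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            in.setObjectInputFilter(allowlist);   // reject anything not on the list
            return in.readObject();
        }
    }
}
```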
Private Cross-Silo Federated Learning for Extracting Vaccine Adverse Event Mentions
Automatically extracting mentions of suspected drug or vaccine adverse events (potential side effects) from unstructured text is critical in the current pandemic, but small amounts of labeled training data remain silo-ed across organizations due to privacy concerns. Federated Learning (FL) is quickly becoming a go-to distributed training paradigm for such users to jointly train a more accurate global model without physically sharing their data. However, literature on successful application of FL in real-world problem settings is somewhat sparse. In this paper, we describe our experience applying an FL-based solution to the Named Entity Recognition (NER) task for an adverse event detection application in the context of mass-scale vaccination programs. Furthermore, we show that Differential Privacy (DP), which offers stronger privacy guarantees, severely cripples the global model’s prediction accuracy, thus dis-incentivizing users from participating in the federation. We demonstrate how recent innovation on personalization methods can help significantly recover the lost accuracy.
Automated GPU Out-of-Bound Access Detection and Prevention in a Managed Environment
GPUs have proven extremely effective at accelerating general-purpose workloads in fields from numerical simulation to deep learning and finance. However, even code written by experienced GPU programmers often offers little robustness, limiting the adoption of GPUs for accelerating critical applications. Out-of-bounds array accesses are one of the most common sources of errors and vulnerabilities on GPUs and can be hard to detect and prevent due to the architectural characteristics of GPUs. This work presents an automated technique ensuring detection of and protection against out-of-bounds array accesses inside CUDA GPU kernels. We compile kernels ahead of time, invoke them at run time using the Graal polyglot Virtual Machine, and execute them on the GPU. Our technique is transparent to the user and operates on the LLVM Intermediate Representation. It adds boundary checks for array accesses based on array size knowledge, available at run time thanks to the managed execution environment, and optimizes the resulting code to minimize the impact of our modifications. We test our technique on 16 different GPU kernels extracted from common GPU workloads and show that we can prevent out-of-bounds array accesses in arbitrary GPU kernels without any statistically significant execution time overhead.
Optimizing Inference Performance of Transformers on CPUs
Slides to be presented at the EuroMLSys'21 workshop
Optimizing Inference Performance of Transformers on CPUs
The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question-answering. While enormous research attention is paid to the training of those models, relatively little effort is made to improve their inference performance. This paper addresses this gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose an Adaptive Linear Module Optimization (ALMO) to speed them up. The optimization is evaluated using the inference benchmark from HuggingFace, and is shown to achieve a speedup of up to 1.71x. Notably, ALMO does not require any changes to the implementation of the models nor does it affect their accuracy.
Vate: Runtime Adaptable Probabilistic Programming in Java
Inspired by earlier work on Augur, Vate is a probabilistic programming language for the construction of JVM-based models with an object-oriented interface. As a compiled language, it is able to examine the dependency graph of the model to produce optimised code that can be dynamically targeted to different platforms.
CLAMH Introduction
The Cross-Language Microbenchmark Harness (CLAMH) provides a unique environment for running software benchmarks. It is unique in that it allows comparison across different platforms and across different languages. For example, it allows the comparison of clang, gcc, llvm, and GraalVM Sulong on the same benchmark, and can also be used to compare the Java counterparts of the same benchmark running on any JVM. CLAMH allows users to verify vendor benchmark performance claims, baseline benchmark performance in their own compute environment, compare with other compute environments, and, by so doing, identify areas where performance can be improved. CLAMH has been released as open source in the GraalVM repository: https://github.com/graalvm/CLAMH
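As a point of reference, the Java side of such cross-language comparisons typically looks like a JMH-style microbenchmark. The sketch below is illustrative of the kind of kernel a harness would pair with a C/C++ counterpart; it is not CLAMH's own benchmark format.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative JMH-style kernel; a cross-language harness would run an
// equivalent C/C++ version of the same loop for a like-for-like comparison.
@State(Scope.Thread)
public class DotProductBenchmark {
    private double[] a;
    private double[] b;

    @Setup
    public void setUp() {
        a = new double[4096];
        b = new double[4096];
        for (int i = 0; i < a.length; i++) { a[i] = i; b[i] = a.length - i; }
    }

    @Benchmark
    public double dot() {
        double sum = 0;
        for (int i = 0; i < a.length; i++) { sum += a[i] * b[i]; }
        return sum;
    }
}
```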
MSET2 Streaming Prognostics for IoT Telemetry on Oracle Roving Edge Infrastructure
Critical applications needed in real-world environments would be difficult or impossible to execute on the public cloud alone because of the massive bandwidth required to transmit and process vast amounts of data and the need to offer instant responses to the results of that analysis. Oracle's MSET2 prognostic ML algorithm, implemented on Roving Edge Clusters with NVIDIA Tesla T4 GPUs, attains unprecedented reductions in computational latencies and breakthrough throughput acceleration factors for large-scale ML streaming prognostics from dense-sensor fleets of assets in such fields as U.S. Department of Defense assets, utilities, oil & gas, commercial aviation, and prognostic cybersecurity for data center IT assets as well as DoD supervisory control and data acquisition assets and networks, and smart manufacturing.
Python auf GraalVM – eine vielfältige Welt (Python on GraalVM – A Diverse World)
Presentation at the enterPy conference (https://www.enterpy.de/), a German business-oriented Python conference. The slides are almost the same as those presented at OOW CodeOne 2019 (approved here: http://ol-archivist.us.oracle.com/archivist/document/2019-0905), updated for new features, URLs, performance numbers, and compatibility.
IFDS Taint Analysis With Access Paths
Over the years, static taint analysis has emerged as the analysis of choice to detect some of the most common web application vulnerabilities, such as SQL injection (SQLi) and cross-site scripting (XSS). Furthermore, from an implementation perspective, the IFDS dataflow framework stood out as one of the most successful vehicles to implement static taint analysis for real-world Java applications. While existing approaches scale reasonably to medium-size applications (e.g., up to one hour of analysis time for less than 100K lines of code), our experience suggests that no existing solution can scale to very large industrial code bases (e.g., more than 1M lines of code). In this paper, we present our novel IFDS-based solution to perform fast and precise static taint analysis of very large industrial Java web applications. Similar to state-of-the-art approaches to taint analysis, our IFDS-based taint analysis uses access paths to abstract objects and fields in a program. However, contrary to existing approaches, our analysis is demand-driven, which restricts the amount of code to be analyzed, and does not rely on a computationally expensive alias analysis, thereby significantly improving scalability.
Generality—or Not—in a Domain-Specific Language (A Case Study)
Slides for an invited keynote at the
Are many heaps better than one?
The recent introduction by Intel of widely available Non-Volatile RAM has reawakened interest in persistence, a hot topic of the 1980s and 90s. The most ambitious schemes of that era were not adopted; I will speculate as to why, and introduce a new approach based on multiple heaps, designed to overcome the problems. I’ll present the main features of the new persistence model, and describe a prototype implementation I’ve been working on for GraalVM Native Image. The purpose of this work-in-progress is to allow experimentation with the new model, so that the community can assess its desirability. I’ll outline the main features of the prototype and some of the remaining challenges.
Fast and Efficient Java Microservices With GraalVM @ Oracle Developer Live
Slides for the Oracle Developer Live - Java Innovations conference. The talk focuses on the benefits of Native Image and recent updates.
How to program machine learning in Java with the Tribuo library
Tribuo is a new open source library written in Java from Oracle Labs’ Machine Learning Research Group. The team’s goal for Tribuo is to build an ML library for the Java platform that is more in line with the needs of large software systems. Tribuo operates on objects rather than primitive arrays, its models are self-describing and reproducible, and it provides a uniform interface over many kinds of prediction tasks.
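A brief sketch of what training and evaluating a classifier looks like with Tribuo, based on its public classification tutorials; the CSV file name, column name, and assumption of a header row are ours, and the exact package locations should be checked against the Tribuo documentation.

```java
import java.nio.file.Paths;

import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

// Hedged sketch following Tribuo's tutorials; file and column names are assumed.
public class TribuoExample {
    public static void main(String[] args) throws Exception {
        var labelFactory = new LabelFactory();
        var csvLoader = new CSVLoader<>(labelFactory);
        // Assumes "irises.csv" has a header row and a "species" response column.
        var dataSource = csvLoader.loadDataSource(Paths.get("irises.csv"), "species");

        var splitter = new TrainTestSplitter<>(dataSource, 0.7, 1L);
        var train = new MutableDataset<>(splitter.getTrain());
        var test = new MutableDataset<>(splitter.getTest());

        var trainer = new LogisticRegressionTrainer();
        Model<Label> model = trainer.train(train);

        LabelEvaluation eval = new LabelEvaluator().evaluate(model, test);
        System.out.println("accuracy = " + eval.accuracy());
    }
}
```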
ColdPress: An Extensible Malware Analysis Platform for Threat Intelligence
Malware analysis is still largely a manual task. This slow and inefficient approach does not scale to the exponential rise in the rate of new unique malware generated. Hence, automating the process as much as possible becomes desirable. In this paper, we present ColdPress – an extensible malware analysis platform that automates the end-to-end process of malware threat intelligence gathering, with integrated output modules to generate reports in arbitrary file formats. ColdPress combines state-of-the-art tools and concepts into a modular system that aids the analyst to efficiently and effectively extract information from malware samples. It is designed as a user-friendly and extensible platform that can be easily extended with user-defined modules. We evaluated ColdPress with complex real-world malware samples (e.g., WannaCry), demonstrating its efficiency, performance and usefulness to security analysts. Our demo video is available at https://youtu.be/AwlBo1rxR1U.
Online Post-Processing in Rankings for Fair Utility Maximization
We consider the problem of utility maximization in online ranking applications while also satisfying a pre-defined fairness constraint. We consider batches of items which arrive over time, already ranked using an existing ranking model. We propose online post-processing for re-ranking these batches to enforce adherence to the pre-defined fairness constraint, while maximizing a specific notion of utility. To achieve this goal, we propose two deterministic re-ranking policies. In addition, we learn a re-ranking policy based on a novel variation of learning to search. Extensive experiments on real world and synthetic datasets demonstrate the effectiveness of our proposed policies both in terms of adherence to the fairness constraint and utility maximization. Furthermore, our analysis shows that the performance of the proposed policies depends on the original data distribution w.r.t the fairness constraint and the notion of utility.
Formal Verification of Authenticated, Append-Only Skip Lists in Agda: Extended Version
Authenticated Append-Only Skiplists (AAOSLs) enable maintenance and querying of an authenticated log (such as a blockchain) without requiring any single party to store or verify the entire log, or to trust another party regarding its contents. AAOSLs can help to enable efficient dynamic participation (e.g., in consensus) and reduce storage overhead. In this paper, we formalize an AAOSL originally described by Maniatis and Baker, and prove its key correctness properties. Our model and proofs are machine checked in Agda. Our proofs apply to a generalization of the original construction and provide confidence that instances of this generalization can be used in practice. Our formalization effort has also yielded some simplifications and optimizations.
CSR++: A Fast, Scalable, Update-Friendly Graph Data Structure
The graph model enables a broad range of analyses; thus, graph processing is an invaluable tool in data analytics. At the heart of every graph-processing system lies a concurrent graph data structure storing the graph. Such a data structure needs to be highly efficient for both graph algorithms and queries. Due to the continuous evolution, the sparsity, and the scale-free nature of real-world graphs, graph-processing systems face the challenge of providing an appropriate graph data structure that enables both fast analytic workloads and low-memory graph mutations. Existing graph structures offer a hard trade-off between read-only performance, update friendliness, and memory consumption upon updates. In this paper, we introduce CSR++, a new graph data structure that removes these trade-offs and enables both fast read-only analytics and quick and memory-friendly mutations. CSR++ combines ideas from CSR, the fastest read-only data structure, and adjacency lists to achieve the best of both worlds. We compare CSR++ to CSR, adjacency lists from the Boost Graph Library, and LLAMA, a state-of-the-art update-friendly graph structure. In our evaluation, which is based on popular graph-processing algorithms executed over real-world graphs, we show that CSR++ remains close to CSR in read-only concurrent performance (within 10% on average), while significantly outperforming CSR (by an order of magnitude) and LLAMA (by almost 2x) with frequent updates.
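For context, the CSR layout that CSR++ builds on packs a graph's adjacency into two flat arrays. The hedged sketch below shows plain CSR only; CSR++'s segmented, update-friendly extensions are not shown.

```java
// Plain CSR (Compressed Sparse Row) for a directed graph with 4 vertices and
// edges 0->1, 0->2, 1->2, 3->0. CSR++ extends this idea with per-segment
// structures so vertices and edges can be added without rebuilding the arrays.
final class CsrGraph {
    // rowPtr[v] .. rowPtr[v+1] delimit v's neighbors inside colIdx.
    final int[] rowPtr = {0, 2, 3, 3, 4};
    final int[] colIdx = {1, 2, 2, 0};

    void forEachNeighbor(int v, java.util.function.IntConsumer action) {
        for (int i = rowPtr[v]; i < rowPtr[v + 1]; i++) {
            action.accept(colIdx[i]);
        }
    }
}
```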
A Latina in Tech
Having started my Computer Science degree while growing up in Colombia and later completing it in Australia, I went from being an overrepresented Latina to being an underrepresented one. Further, the female to male ratio in CS in both countries was also rather different.
Being a mum, a wife, a teacher, a researcher, a manager and a leader, in this talk, I provide some of my lessons learnt throughout my career, with examples of successes and failures throughout my PhD, academic life, and industrial research life.
The University of Queensland and Oracle team up to develop world-class cyber security experts
The field of cyber security is coming of age, with more than a million job openings globally, including many in Australia, and a strong move from reactive to preventative security taking form. At The University of Queensland, teaming up with industry specialists like Oracle Labs – the research and development branch of global technology firm Oracle – will ensure both industry and researchers can focus on the real issues that businesses and users care about.
Coding Practices and Recommendations of Spring Security for Enterprise Applications
Spring Security is tremendously popular among practitioners for its ease of use in securing enterprise applications. In this paper, we study application framework misconfiguration vulnerabilities in the light of Spring Security, which is relatively understudied in the existing literature. Towards that goal, we identify 6 types of security anti-patterns and 4 insecure vulnerable defaults through a measurement-based study of 28 Spring applications. Our analysis shows that security risks associated with the identified security anti-patterns and insecure defaults can leave the enterprise application vulnerable to a wide range of high-risk attacks. To prevent these high-risk attacks, we also provide recommendations for practitioners. Consequently, our study has contributed one update to the official Spring Security documentation, while other security issues identified in this study are being considered for future major releases by the Spring Security community.
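One example of the kind of misconfiguration such a study looks for is shown below as a hedged illustration (it is not necessarily one of the six anti-patterns identified in the paper): disabling CSRF protection for convenience in a session-based web application, written in the older WebSecurityConfigurerAdapter style that was current when the paper was published.

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

// Hedged illustration of a common Spring Security misconfiguration:
// globally disabling CSRF protection in a stateful web application.
@Configuration
@EnableWebSecurity
public class InsecureWebConfig extends WebSecurityConfigurerAdapter {
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http.csrf().disable()                  // anti-pattern: CSRF left unprotected
            .authorizeRequests()
            .anyRequest().authenticated()
            .and()
            .httpBasic();
    }
}
```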
Private Federated Learning with Domain Adaptation
In a federated learning (FL) system, users can collaborate to build a shared model without explicitly sharing data, but model accuracy degrades if differential privacy guarantees are required during training. We hypothesize that domain adaptation techniques can effectively address this problem while increasing per-user prediction accuracy, especially when user data comes from disparate distributions. We present and analyze a mixture of experts (MoE) based domain adaptation approach that allows effective collaboration between users in a differentially private FL setting. Each user contributes to (and benefits from) a general, shared model to perform a common task, while maintaining a private model to adjust their predictions to their particular domain. Using both synthetic and real-world datasets, we empirically demonstrate that these private models can increase accuracy, while protecting against the release of users’ private data.
Example-based Live Programming for Everyone: Building Language-agnostic Tools for Live Programming With LSP and GraalVM
Our community has explored various approaches to improve the programming experience. Although many of them, such as Example-Based Live Programming (ELP), have been shown to be effective, they are still not widespread in conventional programming environments. A reason for that is the effort required to provide sophisticated tools that rely on run-time information. To target multiple language ecosystems, it is often necessary to implement the same concepts, but for different languages and runtimes. Two emerging technologies present an opportunity to reduce this effort significantly: the Language Server Protocol (LSP) and language implementation frameworks such as GraalVM's Truffle. In this paper, we show how an ELP system can be built in a language-agnostic way by leveraging these two technologies. Based on our approach, we implemented the Babylonian Programming system, an ELP system that has previously only been implemented for exploratory ecosystems. Our system, on the other hand, brings ELP for all languages supported by the GraalVM to Visual Studio Code (VS Code). Moreover, we outline what a language-agnostic infrastructure needs to provide and how the LSP could be extended to support ELP also independently from programming environments. Further, we demonstrate how our approach enables the use of ELP in the context of polyglot programming. We illustrate the consequences of our approach by discussing its advantages and limitations and by comparing the features of our system to other ELP systems. Moreover, we give an outlook of how tools that rely on run-time information could be built in the future. This in turn might motivate future tool builders and researchers to consider implementing more tools in a language-agnostic way from the start to make them available to a broader audience.
Women in CS panel
While women were among the first programmers in the 20th century, and contributed substantially to the industry, over the years both the CS industry and CS academia got dominated by men. In this social hour, we explore the opportunities and challenges women encounter in Computer Science through a panel discussion. Our panelists are women who have leading roles in industry, academia, and industrial research. By sharing stories via Q&A, we look forward to inspiring younger women to fulfill their highest potentials, understand how women can make it to senior positions, and enjoy their career.
Efficient Multi-word Compare and Swap.
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.
The NEBULA RPC-Optimized Architecture.
Large-scale online services are commonly structured as a network of software tiers, which communicate over the datacenter network using RPCs. Ongoing trends towards software decomposition have led to the prevalence of tiers receiving and generating RPCs with runtimes of only a few microseconds. With such small software runtimes, even the smallest latency overheads in RPC handling have a significant relative performance impact. In particular, we find that growing network bandwidth introduces queuing effects within a server’s memory hierarchy, considerably hurting the response latency of fine-grained RPCs. In this work we introduce NEBULA, an architecture optimized to accelerate the most challenging microsecond-scale RPCs, by leveraging two novel mechanisms to drastically improve server throughput under strict tail latency goals...
UnQuantize: Overcoming Signal Quantization Effects in IoT Time-Series Databases
Low-resolution quantized time-series signals present a challenge to big-data Machine Learning (ML) prognostics in IoT industrial and transportation applications. The challenge of detecting anomalies in monitored sensor signals is compounded by the fact that many industries today use 8-bit sample-and-hold analog-to-digital (A/D) converters for almost all physical transducers throughout the system. This results in the signal values being severely quantized, which adversely affects the predictive power of prognostic algorithms and can elevate empirical false-alarm and missed-alarm probabilities. Quantized signals are dense and indecipherable to the human eye, and ML algorithms are challenged to detect the onset of degradation in monitored assets due to the loss of information in the digitization process. This paper presents an autonomous ML framework that detects and classifies quantized signals and then instantiates one of two separate techniques (depending on the level of quantization) to efficiently unquantize digitized signals, returning high-resolution signals possessing the same accuracy as signals sampled with higher-bit A/D chips. This new “UnQuantize” framework works inline with streaming sensor signals, upstream from the core ML anomaly detection algorithm, yielding substantially higher anomaly-detection sensitivity, with much lower false-alarm and missed-alarm probabilities (FAPs/MAPs).
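To illustrate the underlying problem (not the UnQuantize algorithm itself), the sketch below simulates how an 8-bit A/D stage collapses a smooth signal onto a coarse grid of levels; the sensor range and drift amplitude are made up for the example.

```java
// Illustrative only: simulate 8-bit quantization of a smooth sensor signal.
// The quantized trace can take only 256 distinct levels, hiding the subtle
// deviations that prognostic ML algorithms rely on to detect degradation.
public class QuantizationDemo {
    public static void main(String[] args) {
        double fullScale = 10.0;                     // assumed sensor range: 0..10 V
        int levels = 256;                            // 8-bit A/D converter
        double step = fullScale / (levels - 1);

        for (int i = 0; i < 10; i++) {
            double t = i / 10.0;
            double signal = 5.0 + 0.01 * Math.sin(2 * Math.PI * t);  // tiny drift
            double quantized = Math.round(signal / step) * step;     // lost in quantization
            System.out.printf("raw=%.5f  quantized=%.5f%n", signal, quantized);
        }
    }
}
```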
Industrial Experience of Finding Cryptographic Vulnerabilities in Large-scale Codebases
Enterprise environments need to screen large-scale (millions of lines of code) codebases for vulnerability detection, resulting in high requirements for precision and scalability of a static analysis tool. At Oracle, Parfait is one such bug checker, providing precision and scalability of results, including inter-procedural analyses. CryptoGuard is a precise static analyzer for detecting cryptographic vulnerabilities in Java code, built on Soot. In this paper, we describe how we integrate CryptoGuard into Parfait, by changing the intermediate representation and relying on a demand-driven IFDS framework in Parfait, resulting in a precise and scalable tool for cryptographic vulnerability detection. We evaluate our tool on several large real-world applications and a comprehensive Java cryptographic vulnerability benchmark, CryptoAPI-Bench. Initial results show that the new cryptographic vulnerability detection in Parfait can detect real-world cryptographic vulnerabilities in large-scale codebases with few false positives and low runtime.
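For context, here is a hypothetical example (not from the paper or the benchmark) of the kind of Java cryptographic misuse such checkers report: a hard-coded key combined with AES in ECB mode.

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical example of misuses a cryptographic vulnerability checker flags:
// a hard-coded key (predictable secret) and AES in ECB mode (no semantic security).
class WeakCrypto {
    static byte[] encrypt(byte[] plaintext) throws Exception {
        byte[] hardCodedKey = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        SecretKeySpec key = new SecretKeySpec(hardCodedKey, "AES");
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");   // weak mode
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }
}
```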
Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores.
This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior works in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data loss window, due to disasters, to milliseconds.
Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores
This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior works in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data loss window, due to disasters, to milliseconds. These goals pose a significant set of challenges related to consistency of the backup site’s state, failures, and scalability. Slogger employs a combination of asynchronous log replication, intra-data center synchronized clocks, pipelining, batching, and a novel watermark service to address these challenges. Furthermore, Slogger is designed to be deployable as an “add-on” module in an existing distributed data store with few modifications to the original code base. Our evaluation, conducted on Slogger extensions to a 32-sharded version of LogCabin, an open source key-value store, shows that Slogger maintains a very small data loss window of 14.2 milliseconds, which is near the optimal value in our evaluation setup. Moreover, Slogger reduces the length of the data loss window by 50% compared to an incremental snapshotting technique without having any performance penalty on the primary data store. Furthermore, our experiments demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary data store.
Leveraging Extracted Model Adversaries for Improved Black Box Attacks
We present a method for adversarial input generation against black box models for reading-comprehension-based question answering. Our approach is composed of two steps. First, we approximate a victim black box model via model extraction. Second, we use our own white box method to generate input perturbations that cause the approximate model to fail. These perturbed inputs are used against the victim. In experiments we find that our method improves on the efficacy of AddAny, a white box attack, performed on the approximate model by 25% F1, and of the AddSent attack, a black box attack, by 11% F1.
Simplifying GPU Access: A Polyglot Binding for GPUs with GraalVM
GPU computing accelerates workloads and fuels breakthroughs across industries. There are many GPU-accelerated libraries developers can leverage, but integrating these libraries into existing software stacks can be challenging. Programming GPUs typically requires low-level programming, while high-level scripting languages have become very popular. Accelerated computing solutions are heterogeneous and inherently more complex. We'll present an open-source prototype called grCUDA that leverages Oracle’s GraalVM and exposes GPUs in polyglot environments. While GraalVM can be regarded as the "one VM to rule them all," grCUDA is the "one GPU binding to rule them all." Data is efficiently shared between GPUs and GraalVM languages (R, Python, JavaScript) while GPU kernels can be launched directly from those languages. Precompiled GPU kernels can be used, as well as kernels that are generated at runtime. We'll also show how to access GPU-accelerated libraries such as RAPIDS cuML.
Polyglot Code Finder
With the increasing complexity of software, it becomes even more important to build on the work of others. At the same time, websites, such as Stack Overflow or GitHub, are used by millions of developers to host their code, which could potentially be reused.
The process of finding the right code, however, is often time-consuming. In addition, the right solution may be written in a programming language that does not fit the developer's requirements. Current approaches to automate code search allow users to search for code based on keywords and transformation rules, but they are limited to one programming language.
Our approach enables developers to find code for reuse written in different languages, which is especially useful when building polyglot applications. In addition to conventional search filters, users can filter code by providing example input and expected output. Based on our approach, we have implemented a tool prototype in GraalSqueak. We evaluate both approach and prototype with an experience report.
Toward Presizing and Pretransitioning Strategies for GraalPython
Presizing and pretransitioning are run-time optimizations that reduce reallocations of lists. These two optimizations have previously been implemented (together with pretenuring) using Mementos in the V8 JavaScript engine. The design of Mementos, however, relies on the support of the garbage collector (GC) of the V8 runtime system.
In contrast to V8, dynamic language runtimes written for the GraalVM do not have access to the GC. Thus, the prior work cannot be applied directly. Instead, an alternative implementation approach without reliance on the GC is needed and poses different challenges.
In this paper we explore and analyze an approach for implementing these two optimizations in the context of GraalVM, using the Python implementation for GraalVM as an example. We substantiate these thoughts with rough performance numbers taken from our prototype on which we tested different presizing strategies.
User-defined Interface Mappings for the GraalVM
To improve programming productivity, the right tools are crucial. This starts with the choice of the programming language, which often predetermines the libraries and frameworks one can use. Polyglot runtime environments, such as GraalVM, provide mechanisms for exchanging objects and sending messages across language boundaries, which allow developers to combine different languages, libraries, and frameworks with each other. However, polyglot application developers are obligated to properly use the right interfaces for accessing their data and objects from different languages.
To reduce the mental complexity for developers and let them focus on the business logic, we introduce user-defined interface mappings - an approach for adapting cross-language messages at run-time to match an expected interface. Thereby, the translation strategies are defined in an exchangeable and easy-to-edit configuration file. Thus, different stakeholders ranging from library and framework developers up to application developers can use and extend these mappings for their needs.
Microsecond Consensus for Microsecond Applications.
We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail-over the system - this cuts the replication and fail-over latencies of the prior systems by at least 61% and 90%.
Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more - challenges that we address in this paper through a judicious combination of RDMA permissions and distributed algorithmic design.
We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead.
Non-blocking interpolation search trees with doubly-logarithmic running time
Balanced search trees typically use key comparisons to guide their operations, and achieve logarithmic running time. By relying on numerical properties of the keys, interpolation search achieves lower search complexity and better performance. Although interpolation-based data structures were investigated in the past, their non-blocking concurrent variants have received very little attention so far.
In this paper, we propose the first non-blocking implementation of the classic interpolation search tree (IST) data structure. For arbitrary key distributions, the data structure ensures worst-case O(log n + p) amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected O(log log n + p) time, and insertion and deletion run in expected amortized O(log log n + p) time, where p is a bound on the number of threads. To improve the scalability of concurrent insertion and deletion, we propose a novel parallel rebuilding technique, which should be of independent interest.
We evaluate whether the theoretical improvements translate to practice by implementing the concurrent interpolation search tree, and benchmarking it on uniform and nonuniform key distributions, for dataset sizes in the millions to billions of keys. Relative to the state-of-the-art concurrent data structures, the concurrent interpolation search tree achieves performance improvements of up to 15% under high update rates, and of up to 50% under moderate update rates. Further, ISTs exhibit up to 2x fewer cache misses, and consume 1.2 -- 2.6x less memory compared to the next best alternative on typical dataset sizes. We find that the results are surprisingly robust to distributional skew, which suggests that our data structure can be a promising alternative to classic concurrent search structures.
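The interpolation step that gives the data structure its doubly-logarithmic behavior is easiest to see on a sorted array. Below is a minimal sketch of plain, sequential interpolation search; it is not the concurrent IST from the paper, and it assumes keys whose range is modest enough that the interpolation arithmetic does not overflow.

```java
// Minimal sketch of interpolation search on a sorted array: instead of probing
// the middle element, probe where the key is *expected* to be under a roughly
// uniform key distribution. The paper's concurrent IST applies the same idea
// inside a non-blocking tree structure.
final class InterpolationSearch {
    static int indexOf(long[] sorted, long key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi && key >= sorted[lo] && key <= sorted[hi]) {
            if (sorted[hi] == sorted[lo]) {
                return sorted[lo] == key ? lo : -1;
            }
            // Interpolate the probe position from the key's relative value.
            int pos = lo + (int) ((key - sorted[lo]) * (long) (hi - lo)
                                  / (sorted[hi] - sorted[lo]));
            if (sorted[pos] == key) return pos;
            if (sorted[pos] < key) lo = pos + 1; else hi = pos - 1;
        }
        return -1;   // key not present
    }
}
```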
GraalVM Native Image Deep Dive - part I
This is the first part of the meetup, covering mostly the GraalVM ecosystem, an introduction to native images, framework support, and how to get started. David Leopoldseder will cover the way native images are built and compare JIT/AOT.
What is a Secure Programming Language? (POPL slides)
Our most sensitive and important software systems are written in programming languages that are inherently insecure, making the security of the systems themselves extremely challenging. It is often said that these systems were written with the best tools available at the time, so over time with newer languages will come more security. But we contend that all of today’s mainstream programming languages are insecure, including even the most recent ones that come with claims that they are designed to be “secure”. Our real criticism is the lack of a common understanding of what “secure” might mean in the context of programming language design. We propose a simple data-driven definition for a secure programming language: that it provides first-class language support to address the causes for the most common, significant vulnerabilities found in real-world software. To discover what these vulnerabilities actually are, we have analysed the National Vulnerability Database and devised a novel categorisation of the software defects reported in the database. This leads us to propose three broad categories, which account for over 50% of all reported software vulnerabilities, that as a minimum any secure language should address. While most mainstream languages address at least one of these categories, interestingly, we find that none address all three. Looking at today’s real-world software systems, we observe a paradigm shift in design and implementation towards service-oriented architectures, such as microservices. Such systems consist of many fine-grained processes, typically implemented in multiple languages, that communicate over the network using simple web-based protocols, often relying on multiple software environments such as databases. In traditional software systems, these features are the most common locations for security vulnerabilities, and so are often kept internal to the system. In microservice systems, these features are no longer internal but external, and now represent the attack surface of the software system as a whole. The need for secure programming languages is probably greater now than it has ever been.
PGX and Graal/Truffle/Active Libraries
A guest lecture in the CS4200 Compiler Construction course at Delft University of Technology (https://tudelft-cs4200-2019.github.io/) about PGX and Graal/Truffle/Active Libraries.
Computationally Easy, Spectrally Good Multipliers for Congruential Pseudorandom Number Generators
Congruential pseudorandom number generators rely on good multipliers, that is, integers that have good performance with respect to the spectral test. We provide lists of multipliers with a good lattice structure up to dimension eight for generators with typical power-of-two moduli, analyzing in detail multipliers close to the square root of the modulus, whose product can be computed quickly.
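The generators in question have the classic linear congruential form shown below; the sketch uses a power-of-two modulus of 2^64 implicitly through Java's wrapping long arithmetic. The multiplier and increment are Knuth's well-known MMIX constants, used here purely for illustration; the paper tabulates multipliers selected specifically for their spectral-test quality.

```java
// Sketch of a 64-bit linear congruential generator with modulus 2^64
// (implicit in Java's wrapping long arithmetic). Constants are Knuth's
// MMIX multiplier/increment, shown for illustration only; the paper lists
// alternative multipliers chosen for better spectral-test scores.
final class Lcg64 {
    private static final long MULTIPLIER = 6364136223846793005L;
    private static final long INCREMENT  = 1442695040888963407L;  // any odd constant works
    private long state;

    Lcg64(long seed) { this.state = seed; }

    long next() {
        state = state * MULTIPLIER + INCREMENT;   // reduction mod 2^64 via overflow
        return state;
    }
}
```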
Scalable Pointer Analysis of Data Structures using Semantic Models
Pointer analysis is widely used as a base for different kinds of static analyses and compiler optimizations. Designing a scalable pointer analysis with acceptable precision for use in production compilers is still an open question. Modern object-oriented languages like Java and Scala promote abstractions and code reuse, both of which make it difficult to achieve precision. Collection data structures are an example of a pervasively used component in such languages. But analyzing collection implementations with full context sensitivity leads to prohibitively long analysis times. We use semantic models to reduce the complex internal implementation of, e.g., a collection to a small and concise model. Analyzing the model with context sensitivity leads to precise results with only a modest increase in analysis time. The models must be written manually, which is feasible because a model method usually consists of only a few statements. Our implementation in GraalVM Native Image shows a rise in useful precision (1.35X rise in the number of checkcast statements that can be elided over the default analysis configuration) with a manageable performance cost (19% rise in analysis time).
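A hedged sketch of the semantic-model idea described above: for analysis purposes, the complex real implementation of a collection is replaced by a tiny stand-in that preserves only the flow facts a pointer analysis cares about (what is stored can later be loaded). The class below is illustrative and is not the paper's actual model API.

```java
// Illustrative "semantic model" of a list: a few statements instead of the real,
// heavily optimized implementation, so context-sensitive analysis stays cheap.
import java.util.Arrays;

final class ModeledList<E> {
    private Object[] elems = new Object[4];
    private int size;

    void add(E e) {                       // a store into one abstract "contents" location
        if (size == elems.length) elems = Arrays.copyOf(elems, size * 2);
        elems[size++] = e;
    }

    @SuppressWarnings("unchecked")
    E get(int i) {                        // a load from the same abstract location
        return (E) elems[i];
    }

    public static void main(String[] args) {
        ModeledList<String> list = new ModeledList<>();
        list.add("x");
        System.out.println(list.get(0));  // prints x
    }
}
```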
Microsecond Consensus for Microsecond Applications.
We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail over the system - this cuts the replication and fail-over latencies of prior systems by at least 61% and 90%, respectively.
Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more - challenges that we address in this paper through a judicious combination of RDMA permissions and distributed algorithmic design.
We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead.
Efficient Multi-Word Compare and Swap.
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.
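To make the interface concrete, here is a sketch of the MCAS semantics only: atomically compare-and-swap k independent words. For clarity this reference version serializes calls with a single lock; it does not reproduce the paper's lock-free algorithm or its k+1-CAS bound.

```java
// Semantics-only MCAS sketch: all-or-nothing update of k words.
import java.util.concurrent.atomic.AtomicLongArray;

final class McasSketch {
    private final AtomicLongArray words;
    private final Object lock = new Object();

    McasSketch(int n) { words = new AtomicLongArray(n); }

    /** Atomically: if words[idx[i]] == expected[i] for all i, set them to update[i]. */
    boolean mcas(int[] idx, long[] expected, long[] update) {
        synchronized (lock) {
            for (int i = 0; i < idx.length; i++) {
                if (words.get(idx[i]) != expected[i]) return false;
            }
            for (int i = 0; i < idx.length; i++) {
                words.set(idx[i], update[i]);
            }
            return true;
        }
    }

    public static void main(String[] args) {
        McasSketch m = new McasSketch(4);
        // Move a "token" into slot 3 only if slots 0 and 3 still hold their expected values.
        System.out.println(m.mcas(new int[]{0, 3}, new long[]{0, 0}, new long[]{0, 7})); // true
        System.out.println(m.mcas(new int[]{3}, new long[]{0}, new long[]{9}));          // false
    }
}
```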
AI Decision Support Prognostics for IoT Asset Health Monitoring, Failure Prediction, Time to Failure
This paper presents a novel tandem human-machine cognition approach for human-in-the-loop control of complex business-critical and mission-critical systems and processes that are monitored by Internet-of-Things (IoT) sensor networks and where it is of utmost importance to mitigate and avoid cognitive overload situations for the human operators. We present an advanced pattern recognition system, called the Multivariate State Estimation Technique-2, which possesses functional requirements designed to minimize the possibility of cognitive overload for human operators. These functional requirements include: (1) ultralow false alarm probabilities for all monitored transducers, components, machines, subsystems, and processes; (2) fastest mathematically possible decisions regarding the incipience or onset of anomalies in noisy process metrics; and (3) the ability to unambiguously differentiate between sensor degradation events and degradation in the systems/processes under surveillance. The prognostic machine learning innovation presented herein does not replace the role of the human in operation of complex engineering systems, but augments that role in a manner that minimizes cognitive overload by very rapidly processing, interpreting, and displaying final diagnostic and prognostic information to the human operator in a prioritized format that is readily perceived and comprehended.
ContainerStress: Autonomous Cloud-Node Scoping Framework for Big-Data ML Use Cases
Deploying big-data Machine Learning (ML) services in a cloud environment presents a challenge to the cloud vendor with respect to cloud container configuration sizing for any given customer use case. Oracle Labs has developed an automated framework that uses nested-loop Monte Carlo simulation to autonomously scale customer ML use cases of any size across the range of cloud CPU-GPU “Shapes” (configurations of CPUs and/or GPUs in cloud containers available to end customers). Moreover, the Oracle Labs and NVIDIA authors have collaborated on an ML benchmark study which analyzes the compute cost and GPU acceleration of any ML prognostic algorithm and assesses the reduction of compute cost in a cloud container comprising conventional CPUs and NVIDIA GPUs.
A DSL-based framework for performance assessment
Performance assessment is an essential verification practice in both research and industry for software quality assurance. Experiment setups for performance assessment tend to be complex. A typical experiment needs to be run for a variety of involved hardware, software versions, system settings and input parameters. Typical approaches for performance assessment are based on scripts. They do not document all variants explicitly, which makes it hard to analyze and reproduce experiment results correctly. In general they tend to be monolithic, which makes it hard to extend experiment setups systematically and to reuse features such as result storage and analysis consistently across experiments. In this paper, we present a generic approach and a DSL-based framework for performance assessment. The DSL helps the user to set and organize the variants in an experiment setup explicitly. The Runtime module in our framework executes experiments, after which results are stored together with the corresponding setups in a database. Database queries provide easy access to the results of previous experiments and the correct analysis of experiment results in the context of the experiment setup. Furthermore, we describe operations for common problems in performance assessment such as outlier detection. At Oracle, we successfully instantiate the framework and use it to nightly assess the performance of PGX [12, 6], a toolkit for parallel graph analytics.
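As a purely hypothetical illustration of the "variants declared explicitly" point above (the paper's actual DSL and API are not reproduced here), an experiment setup can be thought of as an explicit cross product of declared dimensions that is stored alongside every result:

```java
// Hypothetical shape of an explicit experiment description; names and dimensions
// are made up for illustration, not taken from the paper's framework.
import java.util.List;

final class ExperimentSketch {
    public static void main(String[] args) {
        List<String> jvms     = List.of("jdk-21", "graalvm");
        List<Integer> threads = List.of(1, 8, 32);
        List<String> datasets = List.of("graph-small", "graph-large");

        // Every variant is enumerated explicitly, so any run can be reproduced
        // and analyzed in the context of its exact setup record.
        for (String jvm : jvms)
            for (int t : threads)
                for (String data : datasets)
                    System.out.printf("run: jvm=%s threads=%d dataset=%s%n", jvm, t, data);
    }
}
```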
GraalVM Performance Talk
Second talk for Devoxx Ukraine, with a focus on performance and native images; the first one will be a basic introduction. Heavily relying on the Code One talk by Thomas.
GraalSqueak: Toward a Smalltalk-based Tooling Platform for Polyglot Programming
Polyglot programming provides software developers with a broader choice in terms of software libraries and frameworks available for building applications. Previous research and engineering activities have focused on language interoperability and the design and implementation of fast polyglot runtimes. To make polyglot programming more approachable for developers, novel software development tools are needed that help them build polyglot applications. We believe a suitable prototyping platform helps to more quickly evaluate new ideas for such tools. In this paper we present GraalSqueak, a Squeak/Smalltalk virtual machine implementation for the GraalVM. We report our experience implementing GraalSqueak, evaluate the performance of the language and the programming environment, and discuss how the system can be used as a tooling platform for polyglot programming.
Design Space Exploration of Power Delivery For Advanced Packaging Technologies
***Note: this is work the VLSI Research group did with Prof. Bakir back in 2017. The student is now graduating, and wants to finalize/publish this work.*** In this paper, a design space exploration of power delivery networks is performed for multi-chip 2.5-D and 3-D IC technologies. The focus of the paper is the effective placement of the voltage regulator modules (VRMs) for power supply noise (PSN) suppression. Multiple on-package VRM configurations have been analyzed and compared. Additionally, 3D IC chip-on-VRM and backside-of-the-package VRM configurations are studied. From the PSN perspective, the 3D IC chip-on-VRM case suppresses the PSN the most, even with high current density hotspots. The paper also studies the impact on the PSN of different parameters such as VRM-chip distance on the package, on-chip decoupling capacitor density, etc.
GraalVM Slides for JAX London
These are intro-level slides to be presented at https://jaxlondon.com
Computers and Hacking: A 50-Year View
[Slides for a 20-minute keynote talk for the MIT HackMIT hackathon weekend, Saturday, September 14, 2019.] Fifty years ago, computers were expensive, institutional rather than personal, and hard to get access to. Today computers are relatively inexpensive, personal as well as institutional, and ubiquitous. Some of the best hacking—and engineering—today involves creative use of relatively limited (and therefore cost-effective) computer resources.
Correct, Fast Remote Persistence.
Persistence of updates to remote byte-addressable persistent memory (PM), using RDMA operations (RDMA updates), is a poorly understood subject. Visibility of RDMA updates on the remote server is not the same as persistence of those updates. The remote server's configuration has significant implications for what it means for RDMA updates to be persistent on the remote server, and thus for the methods needed to correctly persist those updates. This paper presents a comprehensive taxonomy of system configurations and the corresponding methods to ensure correct remote persistence of RDMA updates. We show that the methods for correct, fast remote persistence vary dramatically, with corresponding performance trade-offs, between different remote server configurations.
The Impact of RDMA on Agreement.
Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n ≥ f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.
Vandal: A scalable security analysis framework for smart contracts
The rise of modern blockchains has facilitated the emergence of smart contracts: autonomous programs that live and run on the blockchain. Smart contracts have seen a rapid climb to prominence, with applications predicted in law, business, commerce, and governance. Smart contracts are commonly written in a high-level language such as Ethereum's Solidity, and translated to compact low-level bytecode for deployment on the blockchain. Once deployed, the bytecode is autonomously executed, usually by a Turing-complete virtual machine. As with all programs, smart contracts can be highly vulnerable to malicious attacks due to deficient programming methodologies, languages, and toolchains, including buggy compilers. At the same time, smart contracts are also high-value targets, often commanding large amounts of cryptocurrency. Hence, developers and auditors need security frameworks capable of analysing low-level bytecode to detect potential security vulnerabilities.
Renaissance: Benchmarking Suite for Parallel Applications on the JVM
Established benchmark suites for the Java Virtual Machine (JVM), such as DaCapo, ScalaBench, and SPECjvm2008, lack workloads that take advantage of the parallel programming abstractions and concurrency primitives offered by the JVM and the Java Class Library. However, such workloads are fundamental for understanding the way in which modern applications and data-processing frameworks use the JVM's concurrency features, and for validating new just-in-time (JIT) compiler optimizations that enable more efficient execution of such workloads. We present Renaissance, a new benchmark suite composed of modern, real-world, concurrent, and object-oriented workloads that exercise various concurrency primitives of the JVM. We show that the use of concurrency primitives in these workloads reveals optimization opportunities that were not visible with the existing workloads. We use Renaissance to compare performance of two state-of-the-art, production-quality JIT compilers (HotSpot C2 and Graal), and show that the performance differences are more significant than on existing suites such as DaCapo and SPECjvm2008. We also use Renaissance to expose four new compiler optimizations, and we analyze the behavior of several existing ones.
Commit-time incremental analysis
Most changes to large systems that have been deployed are quite small compared to the size of the entire system. While standard summary-based analyses reduce the code that is reanalysed, they nevertheless analyse code that has not changed. For example, a backward summary-based analysis will examine all the callers of the changed code even if the callers themselves have not changed. In this paper we present a novel approach of having summaries of the callers (called forward summaries) that enables one to analyse only the changed code. An evaluation of this approach on two representative examples demonstrates that the overheads associated with the generation of the forward summaries are recovered by performing just one or two incremental analyses. Thus this technique can be used at commit-time, where only the changed code is available.
What is a Secure Programming Language? (lecture + tutorial)
Lecture and tutorial using GraalVM and Simple Language.
What is a Secure Programming Language?
Our most sensitive and important software systems are written in programming languages that are inherently insecure, making the security of the systems themselves extremely challenging. It is often said that these systems were written with the best tools available at the time, so over time with newer languages will come more security. But we contend that all of today's mainstream programming languages are insecure, including even the most recent ones that come with claims that they are designed to be "secure". Our real criticism is the lack of a common understanding of what "secure" might mean in the context of programming language design. We propose a simple data-driven definition for a secure programming language: that it provides first-class language support to address the causes for the most common, significant vulnerabilities found in real-world software. To discover what these vulnerabilities actually are, we have analysed the National Vulnerability Database and devised a novel categorisation of the software defects reported in the database. This leads us to propose three broad categories, which account for over 50% of all reported software vulnerabilities, that as a minimum any secure language should address. While most mainstream languages address at least one of these categories, interestingly, we find that none address all three. Looking at today's real-world software systems, we observe a paradigm shift in design and implementation towards service-oriented architectures, such as microservices. Such systems consist of many fine-grained processes, typically implemented in multiple languages, that communicate over the network using simple web-based protocols, often relying on multiple software environments such as databases. In traditional software systems, these features are the most common locations for security vulnerabilities, and so are often kept internal to the system. In microservice systems, these features are no longer internal but external, and now represent the attack surface of the software system as a whole. The need for secure programming languages is probably greater now than it has ever been.
Non-Volatile Memory and Java: Part 3
A series of short articles about the impact of non-volatile memory (NVM) on the Java platform. In the first two articles I described the main hardware and software characteristics of Intel’s new Optane persistent memory. In this article I will discuss the implications of these characteristics on how we build software.
Non-volatile memory and Java, part 2: the view from software
In the first article I described the main hardware characteristics of Intel’s new Optane persistent memory. In this article I will discuss several software issues.
Non-volatile memory and Java: 1. Introducing NVM
Non-volatile RAM (NVRAM) has arrived into the computing mainstream. This development is likely to be highly disruptive: it will change the economics of the memory hierarchy by providing a new, intermediate level between DRAM and flash, but fully exploiting the new technology will require widespread changes in how we architect and write software. Despite this, there is surprisingly little awareness on the part of programmers (and their management) of the technology and its likely impact, and relatively little activity in academia (compared to the magnitude of the paradigm shift) in developing techniques and tools which programmers will need to respond to the change. In this series I will discuss the possible impact of NVRAM on the Java ecosystem. Java is the most widely used programming language: there are millions of Java developers and billions of lines of Java code in daily use.
PolyJuS: A Squeak/Smalltalk-based Polyglot Notebook System for the GraalVM
Jupyter notebooks are used by data scientists to publish their research in an executable format. These notebooks are usually limited to a single programming language. Current polyglot notebooks extend this concept by allowing multiple languages per notebook, but this comes at the cost of having to externalize and to import data across languages. Our approach for polyglot notebooks is able to provide a more direct programming experience by executing notebooks on top of a polyglot execution environment, allowing each code cell to directly access foreign data structures and to call foreign functions and methods. We implemented this approach using GraalSqueak, a Squeak/Smalltalk implementation for the GraalVM. To prototype the programming experience and experiment with further polyglot tool support, we build a Squeak/Smalltalk-based notebook UI that is compatible with the Jupyter notebook file format. We evaluate PolyJuS by demonstrating an example polyglot notebook and discuss advantages and limitations of our approach.
An Optimization-Driven Incremental Inline Substitution Algorithm for Just-in-Time Compilers
Inlining is one of the most important compiler optimizations. It reduces call overheads and widens the scope of other optimizations. But, inlining is somewhat of a black art of an optimizing compiler, and was characterized as a computationally intractable problem. Intricate heuristics, tuned during countless hours of compiler engineering, are often at the core of an inliner implementation. And despite decades of research, well established inlining heuristics are still missing. In this paper, we describe a novel inlining algorithm for JIT compilers that incrementally explores a program's call graph, and alternates between inlining and optimizations. We devise three novel heuristics that guide our inliner: adaptive decision thresholds, callsite clustering, and deep inlining trials. We implement the algorithm inside Graal, a dynamic JIT compiler for the HotSpot JVM. We evaluate our algorithm on a set of industry-standard benchmarks, including Java DaCapo, Scalabench, Spark-Perf, STMBench7 and other benchmarks, and we conclude that it significantly improves performance, surpassing state-of-the-art inlining approaches with speedups ranging from 5% up to 3×.
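A toy sketch of just one of the ingredients named above, adaptive decision thresholds: the benefit required to admit a callsite grows as more code gets inlined into the caller. The numbers and policy below are illustrative and are not Graal's actual inliner.

```java
// Illustrative adaptive-threshold inlining decision loop (not Graal's implementation).
import java.util.List;

final class InlinerSketch {
    record Callsite(String callee, int calleeSize, double relativeFrequency) {}

    static void inline(List<Callsite> candidates, int initialBudget) {
        int budget = initialBudget;
        double threshold = 1.0;                    // lenient at first
        for (Callsite c : candidates) {
            double benefit = c.relativeFrequency() / Math.max(1, c.calleeSize());
            if (budget >= c.calleeSize() && benefit * budget >= threshold) {
                System.out.println("inline " + c.callee());
                budget -= c.calleeSize();          // spend code-size budget
                threshold *= 1.5;                  // adaptive: get pickier as code grows
            }
        }
    }

    public static void main(String[] args) {
        inline(List.of(new Callsite("hotGetter", 5, 0.9),
                       new Callsite("coldHelper", 40, 0.01),
                       new Callsite("mediumLoop", 20, 0.4)),
               64);
    }
}
```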
Private Federated Learning with Domain Adaptation.
Federated Learning (FL) is a distributed machine learning (ML) paradigm that enables multiple parties to jointly re-train a shared model without sharing their data with any other parties, offering advantages in both scale and privacy. We propose a framework to augment this collaborative model-building with per-user domain adaptation. We show that this technique improves model accuracy for all users, using both real and synthetic data, and that this improvement is much more pronounced when differential privacy bounds are imposed on the FL model.
Telemetry Parameter Synthesis System for Enhanced Tuning and Validation of Machine Learning Algorithmics
Advanced machine learning (ML) prognostics are leading to increasing Return-on-Investment (ROI) for dense-sensor Internet-of-Things (IoT) applications across multiple industries including Utilities, Oil-and-Gas, Manufacturing, Transportation, and for business-critical assets in enterprise and cloud data centers. For all of these IoT prognostic applications, a nontrivial challenge for data scientists is acquiring enough time series data from executing assets with which to evaluate, tune, optimize, and validate important prognostic functional requirements that include false-alarm and missed-alarm probabilities (FAPs, MAPs), time-to-detect (TTD) metrics for early-warning of incipient issues in monitored components and systems, and overhead compute cost (CC) for real-time stream ML prognostics. In this paper we present a new data synthesis methodology called the Telemetry Parameter Synthesis System (TPSS) that can take any limited chunk of real sensor telemetry from monitored assets, decompose the sensor signals into deterministic and stochastic components, and then generate millions of hours of high-fidelity synthesized telemetry signals that possess exactly the same serial correlation structure and statistical idiosyncrasies (resolution, variance, skewness, kurtosis, auto-correlation content, and spikiness) as the real telemetry signals from the IoT monitored critical assets. The synthesized signals bring significant value-add for ML data science researchers for evaluation and tuning of candidate ML algorithmics and for offline validation of important prognostic functional requirements including sensitivity, false alarm avoidance, and overhead compute cost. The TPSS has become an indispensable tool in Oracle’s ongoing development of innovative diagnostic/prognostic algorithms for dense-sensor predictive maintenance applications in multiple industries.
Real Time Empirical Synchronization of IoT Signals for Improved AI Prognostics
A significant challenge for Machine Learning (ML) prognostic analyses of large-scale time series databases is variable clock skew between/among multiple data acquisition (DAQ) systems across assets in a fleet of monitored assets, and even inside individual assets, where the sheer numbers of sensors being deployed are so large that multiple individual DAQs, each with their own internal clocks, can create significant clock-mismatch issues. For Big Data prognostic anomaly detection, we have discovered and amply demonstrated that variable clock skew issues in the timestamps for time series telemetry signatures cause poor performance for ML prognostics, resulting in high false-alarm and missed-alarm probabilities (FAPs and MAPs). This paper describes a new Analytical Resampling Process (ARP) that embodies novel techniques in the time domain and frequency domain for interpolative online normalization and optimal phase coherence so that all system telemetry time series outputs are available in a uniform format and aligned with a common sampling frequency. More importantly, the “optimality” of the proposed technique gives end users the ability to select between “ultimate accuracy” or “lowest overhead compute cost”, for automated coherence synchronization of collections of time series signatures, whether from a few sensors, or hundreds of thousands of sensors, and regardless of the sampling rates and signal-to-noise (S/N) ratios for those sensors.
Brief Announcement: Persistent Multi-Word Compare-and-Swap
This brief announcement presents a fundamental concurrent primitive for persistent memory – a persistent atomic multi-word compare-and-swap (PMCAS). We present a novel algorithm carefully crafted to ensure that atomic updates to a multitude of words modified by the PMCAS are persisted correctly. Our algorithm leverages hardware transactional memory (HTM) for concurrency control, and has a total of 3 persist barriers in its critical path. We also overview variants based on just the compare-and-swap (CAS) instruction and a hybrid of CAS and HTM.
A Two-List Framework for Accurate Detection of Frequent Items in Data Streams
The problem of detecting the most frequent items in large data sets and providing accurate frequency estimates for those items is becoming more and more important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. An algorithm called FSSA giving an efficient array-based implementation of this framework is presented. An adaptive version of this algorithm is also presented, which adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. Analytical comparison with the FSS algorithm showed that FSSA has smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. A theoretical analysis of space and time complexity for FSSA and its benchmark algorithms was performed. Finally, we showed that the FSS2L framework can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.
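For context, a sketch of the classic Space-Saving scheme that the FSS/FSSA family builds on: keep a bounded set of counters, and let an unmonitored item evict the current minimum and inherit its count as an overestimate. This is the textbook baseline, not the paper's two-list FSSA structure.

```java
// Classic Space-Saving sketch for frequent-item detection with bounded memory.
import java.util.HashMap;
import java.util.Map;

final class SpaceSaving {
    private final int capacity;
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSaving(int capacity) { this.capacity = capacity; }

    void offer(String item) {
        Long c = counts.get(item);
        if (c != null) {
            counts.put(item, c + 1);
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);
        } else {
            // Evict the item with the smallest count; the newcomer takes min + 1.
            String minKey = null; long min = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < min) { min = e.getValue(); minKey = e.getKey(); }
            }
            counts.remove(minKey);
            counts.put(item, min + 1);
        }
    }

    public static void main(String[] args) {
        SpaceSaving ss = new SpaceSaving(2);
        for (String s : new String[]{"a", "b", "a", "c", "a", "a", "b"}) ss.offer(s);
        System.out.println(ss.counts);   // "a" dominates; estimates are upper bounds
    }
}
```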
Closing the Performance Gap Between Volatile and Persistent Key-Value Stores Using Cross-Referencing Logs
Key-Value (K-V) stores are an integral building block of modern datacenter applications. With byte addressable persistent memory (PM) technologies, such as Intel/Micron’s 3D XPoint, on the horizon, there has been an influx of new high performance K-V stores that leverage PM for performance. However, there remains a significant performance gap between PM optimized K-V stores and DRAM resident ones, largely reflecting the gap between projected PM latency relative to that of DRAM. We address that performance gap with Bullet, a K-V store that leverages both the byte-addressability of PM and the lower latency of DRAM, using a technique called cross-referencing logs (CRLs) to keep most PM updates off the critical path. Bullet delivers performance approaching that of DRAM resident K-V stores by maintaining two hash tables, one in the slower (backend) PM and the other in the faster (frontend) DRAM. CRLs are a scalable persistent logging mechanism that keeps the two copies mutually consistent. Bullet also incorporates several critical optimizations, such as dynamic load balancing between frontend and backend threads, support for nonblocking Gets, and opportunistic omission of stale updates in the backend. This combination of implementation techniques delivers performance within 5% of that of DRAM-only key-value stores for realistic (read-heavy) workloads. Our general approach, based on CRLs, is “universal” in that it can be used to turn any volatile K-V store into a persistent one (or vice-versa, provide a fast cache for a persistent K-V store).
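A loose, hedged sketch of the front-end/back-end split described above: an update is recorded in a log (standing in for the persistent cross-referencing log) and applied to a fast front-end table, while a background step later drains the log into the back-end table. All names are illustrative; this is not Bullet's code and it does not model persistent-memory ordering.

```java
// Illustrative two-tier store: fast front end, logged updates drained to a back end.
import java.util.Map;
import java.util.concurrent.*;

final class FrontBackStore {
    private final Map<String, String> front = new ConcurrentHashMap<>(); // stands in for DRAM
    private final Map<String, String> back  = new ConcurrentHashMap<>(); // stands in for PM
    private final BlockingQueue<String[]> log = new LinkedBlockingQueue<>(); // stands in for a CRL

    void put(String k, String v) {
        log.add(new String[]{k, v});   // record the update first (off the critical read path)
        front.put(k, v);               // immediately visible to readers
    }

    String get(String k) { return front.getOrDefault(k, back.get(k)); }

    void drainOnce() throws InterruptedException {     // one back-end worker step
        String[] e = log.take();
        back.put(e[0], e[1]);
    }

    public static void main(String[] args) throws Exception {
        FrontBackStore s = new FrontBackStore();
        s.put("answer", "42");
        System.out.println(s.get("answer"));   // 42, served from the front end
        s.drainOnce();                          // later applied to the back end
    }
}
```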
Placement of Virtual Containers on NUMA systems: A Practical and Comprehensive Model
Our work addresses the problem of placement of threads, or virtual cores, onto physical cores in a multicore NUMA system. Different placements result in varying degrees of contention for shared resources, so choosing the right placement can have a large effect on performance. Prior work has studied this problem, but either addressed hardware with specific properties, leaving us unable to generalize the models to other systems, or modeled much simpler effects than the actual performance in different placements. Our contribution is a general framework for reasoning about workload placement on machines with shared resources. It enables us to build an accurate performance model for any machine with a hierarchy of known shared resources automatically, with only minimal input from the user. Using our methodology, data center operators can minimize the number of NUMA (CPU+memory) nodes allocated for an application or a service, while ensuring that it meets performance objectives.
PerfIso: Performance Isolation for Commercial Latency-Sensitive Services
Large commercial latency-sensitive services, such as web search, run on dedicated clusters provisioned for peak load to ensure responsiveness and tolerate data center outages. As a result, the average load is far lower than the peak load used for provisioning, leading to resource under-utilization. The idle resources can be used to run batch jobs, completing useful work and reducing overall data center provisioning costs. However, this is challenging in practice due to the complexity and stringent tail-latency requirements of latency-sensitive services. Left unmanaged, the competition for machine resources can lead to severe response-time degradation and unmet service-level objectives (SLOs). This work describes PerfIso, a performance isolation framework which has been used for nearly three years in Microsoft Bing, a major search engine, to colocate batch jobs with production latency-sensitive services on over 90,000 servers. We discuss the design and implementation of PerfIso, and conduct an experimental evaluation in a production environment. We show that colocating CPU-intensive jobs with latency-sensitive services increases average CPU utilization from 21% to 66% for off-peak load without impacting tail latency.
An early look at the LDBC social network benchmark's business intelligence workload
In this short paper, we provide an early look at the LDBC Social Network Benchmark's Business Intelligence (BI) workload which tests graph data management systems on a graph business analytics workload. Its queries involve complex aggregations and navigations (joins) that touch large data volumes, which is typical in BI workloads, yet they depend heavily on graph functionality such as connectivity tests and path finding. We outline the motivation for this new benchmark, which we derived from many interactions with the graph database industry and its users, and situate it in a scenario of social network analysis. The workload was designed by taking into account technical "chokepoints" identified by database system architects from academia and industry, which we also describe and map to the queries. We present reference implementations in openCypher, PGQL, SPARQL, and SQL, and preliminary results of SNB BI on a number of graph data management systems.
Analytics with smart arrays: adaptive and efficient language-independent data
This paper introduces smart arrays, an abstraction for providing adaptive and efficient language-independent data storage. Their smart functionalities include NUMA-aware data placement across sockets and bit compression. We show how our single C++ implementation can be used efficiently from both native C++ and compiled Java code. We experimentally evaluate smart arrays on a diverse set of C++ and Java analytics workloads. Further, we show how their smart functionalities affect performance and lead to differences in hardware resource demands on multi-core machines, motivating the need for adaptivity. We observe that smart arrays can significantly decrease the memory space requirements of analytics workloads, and improve their performance by up to 4×. Smart arrays are the first step towards general smart collections with various smart functionalities that enable the consumption of hardware resources to be traded-off against one another.
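One of the smart functionalities mentioned above, bit compression, can be illustrated with a minimal bit-packed array: values of a fixed bit width are stored back to back inside machine words instead of one word each. This is an illustrative Java sketch; the paper's smart arrays are a C++ library that additionally handles NUMA-aware placement and adaptivity.

```java
// Fixed-width bit-packed array: get/set values of `bits` bits inside a long[].
final class BitPackedArray {
    private final long[] words;
    private final int bits;            // width of each element, 1..63
    private final long mask;

    BitPackedArray(int length, int bits) {
        this.bits = bits;
        this.mask = (1L << bits) - 1;
        this.words = new long[(length * bits + 63) / 64];
    }

    void set(int i, long value) {
        long bitPos = (long) i * bits;
        int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
        words[w] = (words[w] & ~(mask << off)) | ((value & mask) << off);
        if (off + bits > 64) {         // element spills into the next word
            int spill = off + bits - 64;
            long spillMask = (1L << spill) - 1;
            words[w + 1] = (words[w + 1] & ~spillMask) | ((value & mask) >>> (bits - spill));
        }
    }

    long get(int i) {
        long bitPos = (long) i * bits;
        int w = (int) (bitPos >>> 6), off = (int) (bitPos & 63);
        long v = words[w] >>> off;
        if (off + bits > 64) v |= words[w + 1] << (64 - off);
        return v & mask;
    }

    public static void main(String[] args) {
        BitPackedArray a = new BitPackedArray(100, 5);   // values 0..31 in 5 bits each
        a.set(0, 31); a.set(12, 1); a.set(13, 7);
        System.out.println(a.get(0) + " " + a.get(12) + " " + a.get(13)); // 31 1 7
    }
}
```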
An NVM Carol
Around 2010, we observed significant research activity around the development of non-volatile memory technologies. Shortly thereafter, other research communities began considering the implications of non-volatile memory on system design, from storage systems to data management solutions to entire systems. Finally, in July 2015, Intel and Micron Technology announced 3D XPoint. It’s now 2018; Intel is shipping its technology in SSD packages, but we’ve not yet seen the widespread availability of byte-addressable non-volatile memory that resides on the memory bus. We can view non-volatile memory technology and its impact on systems through an historical lens revealing it as the convergence of several past research trends starting with the concept of single-level store, encompassing the 1980s excitement around bubble memory, building upon persistent object systems, and leveraging recent work in transactional memory. We present this historical context, recalling past ideas that seem particularly relevant and potentially applicable and highlighting aspects that are novel.
Live Multi-language Development and Runtime Environments
Context: Software development tools should work and behave consistently across different programming languages, so that developers do not have to familiarize themselves with new tooling for new languages. Also, being able to combine multiple programming languages in a program increases reusability, as developers do not have to recreate software frameworks and libraries in the language they develop in and can reuse existing software instead.
Inquiry: However, developers often have a broad choice of tools, some of which are designed for only one specific programming language. Various Integrated Development Environments have support for multiple languages, but are usually unable to provide a consistent programming experience due to different language-specific runtime features. With regard to language integrations, common mechanisms usually use abstraction layers, such as the operating system or a network connection, which are often boundaries for tools and hence negatively affect the programming experience.
Approach: In this paper, we present a novel approach for tool reuse that aims to improve the experience with regard to working with multiple high-level dynamic, object-oriented programming languages. As part of this, we build a multi-language virtual execution environment and reuse Smalltalk’s live programming tools for other languages.
Knowledge: An important part of our approach is to retrofit and align runtime capabilities for different languages as it is a requirement for providing consistent tools. Furthermore, it provides convenient means to reuse and even mix software libraries and frameworks written in different languages without breaking tool support.
Grounding: The prototype system Squimera is an implementation of our approach and demonstrates that it is possible to reuse both development tools from a live programming system to improve the development experience as well as software artifacts from different languages to increase productivity.
Importance: In the domain of polyglot programming systems, most research has focused on the integration of different languages and corresponding performance optimizations. Our work, on the other hand, focuses on tooling and the overall programming experience.
A Parallel and Scalable Processor for JSON Data.
Increasing interest in JSON data has created a need for its efficient processing. Although JSON is a simple data exchange format, its querying is not always effective, especially in the case of large repositories of data. This work aims to integrate the JSONiq extension to the XQuery language specification into an existing query processor (Apache VXQuery) to enable it to query JSON data in parallel. VXQuery is built on top of Hyracks (a framework that generates parallel jobs) and Algebricks (a language-agnostic query algebra toolbox) and can process data on the fly, in contrast to other well-known systems which need to load data first. Thus, the extra cost of data loading is eliminated. In this paper, we implement three categories of rewrite rules which exploit the features of the above platforms to efficiently handle path expressions along with introducing intra-query parallelism. We evaluate our implementation using a large (803GB) dataset of sensor readings. Our results show that the proposed rewrite rules lead to efficient and scalable parallel processing of JSON data.
Persistent Memory Transactions
This paper presents a comprehensive analysis of performance trade-offs between implementation choices for transaction runtime systems on persistent memory. We compare three implementations of transaction runtimes: undo logging, redo logging, and copy-on-write. We also present a memory allocator that plugs into these runtimes. Our microbenchmark-based evaluation focuses on understanding the interplay between various factors that contribute to performance differences between the three runtimes: read/write access patterns of workloads, size of the persistence domain (the portion of the memory hierarchy where the data is effectively persistent), cache locality, and transaction runtime bookkeeping overheads. No single runtime emerges as a clear winner. We confirm our analysis in more realistic settings of three "real world" applications we developed with our transactional API: (i) a key-value store we implemented from scratch, (ii) a SQLite port, and (iii) a persistified version of memcached, a popular key-value store. These findings are not only consistent with our microbenchmark analysis, but also provide additional interesting insights into other factors (e.g. effects of multithreading and synchronization) that affect application performance.
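A toy sketch of the first of the three designs compared above, undo logging: before each transactional store, the old value is recorded so that an interrupted transaction can be rolled back. This is illustrative only; a real persistent-memory runtime also needs persist barriers and crash-consistent allocation, which plain Java cannot express.

```java
// Undo-logging transaction sketch over an array standing in for persistent memory.
import java.util.ArrayDeque;
import java.util.Deque;

final class UndoLogTx {
    private final long[] heap;                                 // stand-in for PM
    private final Deque<long[]> undo = new ArrayDeque<>();     // [index, oldValue] records

    UndoLogTx(long[] heap) { this.heap = heap; }

    void store(int index, long value) {
        undo.push(new long[]{index, heap[index]});   // log the old value first
        heap[index] = value;                         // then mutate in place
    }

    void commit() { undo.clear(); }                  // old values no longer needed

    void abort() {                                   // roll back in reverse order
        while (!undo.isEmpty()) {
            long[] rec = undo.pop();
            heap[(int) rec[0]] = rec[1];
        }
    }

    public static void main(String[] args) {
        long[] heap = new long[4];
        UndoLogTx tx = new UndoLogTx(heap);
        tx.store(0, 7);
        tx.store(1, 9);
        tx.abort();                                   // simulate a failure before commit
        System.out.println(heap[0] + " " + heap[1]);  // 0 0
    }
}
```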
Dominance-Based Duplication Simulation (DBDS)
Compilers perform a variety of advanced optimizations to improve the quality of the generated machine code. However, optimizations that depend on the data flow of a program are often limited by control flow merges. Code duplication can solve this problem by hoisting, i.e. duplicating, instructions from merge blocks to their predecessors. However, finding optimization opportunities enabled by duplication is a non-trivial task that requires compile-time intensive analysis. This imposes a challenge on modern (just-in-time) compilers: Duplicating instructions tentatively at every control flow merge is not feasible because excessive duplication leads to uncontrolled code growth and compile time increases. Therefore, compilers need to find out whether a duplication is beneficial enough to be performed. This paper proposes a novel approach to determine which duplication operations should be performed to increase performance. The approach is based on a duplication simulation that enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. We show how to map duplication candidates into an optimization cost model that allows us to trade-off between different success metrics including peak performance, code size and compile time. We implemented the approach on top of the GraalVM and evaluated it with the benchmarks Java DaCapo, Scala DaCapo, JavaScript Octane and a micro-benchmark suite, in terms of performance, compilation time and code size increase.
Generic Concurrency Restriction - slides for the Dagstuhl seminar
The slides provide an overview of our work on generic concurrency restriction, which has been previously cleared for publication.
Sulong, and Thanks For All the Bugs: Finding Errors in C Programs by Abstracting from the Native Execution Model
In C, memory errors such as buffer overflows are among the most dangerous software errors; as we show, they are still on the rise. Current dynamic bug finding tools that try to detect such errors are based on the low-level execution model of the machine. They insert additional checks in an ad-hoc fashion, which makes them prone to forgotten checks for corner-cases. To address this issue, we devised a novel approach to find bugs during the execution of a program. At the core of this approach lies an interpreter that is written in a high-level language that performs automatic checks (such as bounds checks, NULL checks, and type checks). By mapping C data structures to data structures of the high-level language, accesses are automatically checked and bugs are found. We implemented this approach and show that our tool (called Safe Sulong) can find bugs that have been overlooked by state-of-the-art tools, such as out-of-bounds accesses to the main function arguments. Additionally, we demonstrate that the overheads are low enough to make our tool practical, both during development and in production for safety-critical software projects.
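A minimal illustration of the checking style described above (not Sulong's implementation): when C allocations are modeled by high-level objects, NULL checks and bounds checks fall out of the host language's own safe operations.

```java
// Modeling a C allocation with a Java object so every access is checked automatically.
final class CheckedMemory {
    private final byte[] data;                 // models one C allocation

    CheckedMemory(int size) { this.data = new byte[size]; }

    static byte load(CheckedMemory pointer, int offset) {
        if (pointer == null) throw new IllegalStateException("NULL dereference");
        return pointer.data[offset];           // JVM bounds check catches out-of-bounds reads
    }

    public static void main(String[] args) {
        CheckedMemory m = new CheckedMemory(8);
        System.out.println(load(m, 3));        // 0, a valid in-bounds read
        try {
            load(m, 9);                        // would be a silent buffer overread in C
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("out-of-bounds access detected");
        }
    }
}
```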
It's Time for Secure Languages (slides)
Slides summarising data from the National Vulnerability Database for the past 4 years pointing at the need for better language design.
It's Time for Secure Languages (SPLASH-I slides)
Language designers and developers want better ways to write good code: languages designed with simpler, more powerful abstractions accessible to a larger community of developers. However, language design does not seem to take into account security, leaving developers with the onerous task of writing attack-proof code. In 20 years, we have gone from 25 reported vulnerabilities to 6,000+ vulnerabilities reported in a year. The top two types of vulnerabilities for the past few years have been known for over 15 years. I’ll summarise data on vulnerabilities during 2013-2015 and argue that our languages must take security seriously. Languages need security-oriented constructs, and compilers must let developers know when there is a problem with their code. We need to empower developers with the concept of “security for the masses” by making available languages that do not necessarily require an expert in order to determine whether the code being written is vulnerable to attack or not.
Making collection operations optimal with aggressive JIT compilation
Functional collection combinators are a neat and widely accepted data processing abstraction. However, their generic nature results in high abstraction overheads: Scala collections are known to be notoriously slow for typical tasks. We show that proper optimizations in a JIT compiler can largely eliminate the overheads imposed by these abstractions. Using the open-source Graal JIT compiler, we achieve speedups of up to 20x on collection workloads compared to the standard HotSpot C2 compiler. Consequently, a sufficiently aggressive JIT compiler allows the language compiler, such as Scalac, to focus on other concerns. In this paper, we show how optimizations, such as inlining, polymorphic inlining, and partial escape analysis, are combined in Graal to produce collections code that is optimal with respect to manually written code, or close to optimal. We argue why some of these optimizations are more effectively done by a JIT compiler. We then identify specific use-cases that most current JIT compilers do not optimize well, warranting special treatment from the language compiler.
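The kind of combinator pipeline discussed above, written as a Java stream next to the hand-written loop an aggressive JIT should make it equivalent to. The paper's measurements concern Scala collections under Graal; this Java sketch only shows the shape of the workload.

```java
// Combinator pipeline vs. manual loop computing the same sum of squared even numbers.
import java.util.stream.LongStream;

final class CombinatorsVsLoop {
    static long withCombinators(long[] xs) {
        return LongStream.of(xs).filter(x -> x % 2 == 0).map(x -> x * x).sum();
    }

    static long withLoop(long[] xs) {
        long sum = 0;
        for (long x : xs) {
            if (x % 2 == 0) sum += x * x;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] xs = LongStream.rangeClosed(1, 1_000_000).toArray();
        System.out.println(withCombinators(xs) == withLoop(xs));   // true
    }
}
```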
Design Considerations of Monolithically Integrated Voltage Regulators for Multicore Processors
Presented in this paper are design considerations for a Monolithically Integrated Voltage Regulator (MIVR) targeting a 42 mm² multicore processor test chip taped out in a TSMC 28nm process. This is the first work discussing the utilization of on-die magnetic core inductors to support >50 A of load current. 64 inductors with a switching frequency of 140 MHz are strategically grouped into 8 interleaving phases to achieve 85% efficiency and minimize on-die voltage drop.
Evaluating quality of security testing of the JDK.
In this position paper we describe how mutation testing can be used to evaluate the quality of test suites from a security viewpoint. Our focus is on measuring the quality of the test suite associated with the Java Development Kit (JDK) because it provides the core security properties for all applications. We describe the challenges associated with identifying security-specific mutation operators that are specific to the Java model and ensuring that our solution can be automated for large code-bases like the JDK.
Behavior Based Approach to Misuse Detection of a Simulated SCADA System
This paper presents the initial findings in applying a behavior-based approach for detection of unauthorized activities in a simulated Supervisory Control and Data Acquisition (SCADA) system. Misuse detection of this type utilizes fault-free system telemetry to develop empirical models that learn normal system behavior. Future monitored telemetry sources that show statistically significant deviations from this learned behavior may indicate an attack or other unwanted actions. The experimental test bed consists of a set of Linux-based enterprise servers that were isolated from a larger university research cluster. All servers are connected to a private network and simulate several components and tasks seen in a typical SCADA system. Telemetry sources included kernel statistics, resource usages and internal system hardware measurements. For this study, the Auto Associative Kernel Regression (AAKR) and Auto Associative Multivariate State Estimation Technique (AAMSET) are employed to develop empirical models. The prognostic efficacy of these methods for computer security was evaluated using several groups of signals taken from the available telemetry classes. The Sequential Probability Ratio Test (SPRT) is used along with these models for intrusion detection purposes. The intrusion types tested include host/network discovery, DoS, brute force login, privilege escalation and malicious exfiltration actions. In this study, all intrusion types displayed alterations in the residuals of much of the monitored telemetry and were detected in all signal groups used by both model types. The methods presented can be extended and applied to industries besides nuclear that use SCADA or business-critical networks.
Simulation-based Code Duplication for Enhancing Compiler Optimizations
Compiler optimizations are often limited by control flow, which prohibits optimizations across basic block boundaries. Duplicating instructions from merge blocks to their predecessors enlarges basic blocks and can thus enable further optimizations. However, duplicating too many instructions leads to excessive code growth. Therefore, an approach is necessary that avoids code explosion and still finds beneficial duplication candidates. We present a novel approach to determine which code should be duplicated to improve peak performance. To this end, we analyze duplication candidates for subsequent optimizations by simulating a duplication and analyzing its impact on the compilation unit. This allows a compiler to find those duplication candidates that have the maximum optimization potential.
SIMULATE & DUPLICATE
Poster about simulation-based code duplication (abstract from the associated DocSymp paper). The scope of compiler optimizations is often limited by control flow, which prohibits optimizations across basic block boundaries. Code duplication can solve this problem by extending basic block sizes, thus enabling subsequent optimizations. However, duplicating code for every optimization opportunity may lead to excessive code growth. Therefore, a holistic approach is required that is capable of finding optimization opportunities and classifying their impact. This paper presents a novel approach to determine which code should be duplicated in order to improve peak performance. The approach analyzes duplication candidates for subsequent optimization opportunities. It does so by simulating a duplication operation and analyzing its impact on other optimizations. This allows a compiler to weigh up multiple success metrics in order to choose the code duplication operations with the maximum optimization potential. We further show how to map code duplication opportunities to an optimization cost model that allows us to maximize performance while minimizing code size increase.
Detecting Malicious JavaScript in PDFs Using Conservative Abstract Interpretation
To mitigate the risk posed by JavaScript-based PDF malware, we propose a static analysis technique based on abstract interpretation. Our evaluation shows that our approach can identify 100% of malware with a low rate of false positives.
Improving Parallelism in Hardware Transactional Memory
Hardware transactional memory (HTM) is supported by recent processors from Intel and IBM. HTM is attractive because it can enhance concurrency while simplifying programming. Today's HTM systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to very poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits concurrency. In this paper, we propose very simple architectural changes to the existing requester-wins HTM implementations. The idea is to support a special mode of execution in HTM, called power mode, which can be used to enhance conflict resolution between regular and so-called power transactions. A power transaction can run concurrently with regular transactions that do not conflict with it. This permits higher levels of concurrency in cases when a (regular) transaction cannot make progress due to conflicts and would require a non-speculative fallback path otherwise. Our idea is backward-compatible with existing HTM systems, imposing no additional cost on transactions that do not use the power mode. Furthermore, using power transactions requires no changes to target applications that employ traditional lock synchronization. Using extensive evaluation of micro- and STAMP benchmarks in a transactional memory simulator and real hardware-based emulation, we show that our technique significantly improves the performance of the baseline that does not use power mode, and performs comparably with state-of-the-art related proposals that require more substantial architectural changes.
Persistent Memcached: Bringing Legacy Code to Byte-Addressable Persistent Memory
We report our experience building and evaluating pmemcached, a version of memcached ported to byte-addressable persistent memory. Persistent memory is expected to not only improve overall performance of applications’ persistence tier, but also vastly reduce the “warm up” time needed for applications after a restart. We decided to test this hypothesis on memcached, a popular key-value store. We took the extreme view of persisting memcached’s entire state, resulting in a virtually instantaneous warm up phase. Since memcached is already optimized for DRAM, we expected our port to be a straightforward engineering effort. However, the effort turned out to be surprisingly complex during which we encountered several non-trivial problems that challenged the boundaries of memcached’s architecture. We detail these experiences and corresponding lessons learned.
FastR update: Interoperability, Graphics, Debugging, Profiling, and other hot topics
This talk presents an overview of the current state of FastR in a number of areas that saw significant progress in the last year, including Interoperability, Graphics, Debugging, and Compatibility.
BDgen: A Universal Big Data Generator
This paper introduces BDgen, a generator of Big Data targeting various types of users, implemented as a general and easily extensible framework. It is divided into a scalable backend designed to generate Big Data on clusters and a frontend for user-friendly definition of the structure of the required data, or its automatic inference from a sample data set. In the first release we have implemented generators of two commonly used formats (JSON and CSV) and the support for general grammars. We have also performed preliminary experimental comparisons confirming the advantages and competitiveness of the solution.
Zero-overhead R and C/C++ integration with FastR
C and C++ are traditionally used to improve performance of R applications and packages. While this is usually not necessary when using FastR, because it can run R code at near-native performance, there is a large corpus of existing code that implements critical pieces of functionality in native code. Alternative implementations of R need to simulate the R native API, which is a complex API that exposes many implementation details. Simulating this API requires significant effort, incurs performance overhead, and creates a compilation and optimization barrier between languages. FastR can employ the Truffle framework to run native code, available as LLVM bitcode, inside the optimization scope of the polyglot environment, and thus have it integrated with no optimization or integration barriers.
Trace Register Allocation Policies: Compile-time vs. Performance Trade-offs
Register allocation has to be done by every compiler that targets a register machine, regardless of whether it aims for fast compilation or optimal code quality. State-of-the-art dynamic compilers often use global register allocation approaches such as linear scan. Recent results suggest that non-global trace-based register allocation approaches can compete with global approaches in terms of allocation quality. Instead of processing the whole compilation unit at once, a trace-based register allocator divides the problem into linear code segments, called traces. In this work, we present a register allocation framework that can exploit the additional flexibility of traces to select different allocation strategies based on the characteristics of a trace. This allows fine-grained control over the compile-time vs. peak-performance trade-off. Our framework features three allocation strategies: a linear-scan-based approach that achieves good code quality, a single-pass bottom-up strategy that aims for short allocation times, and an allocator for trivial traces. We present 6 allocation policies to decide which strategy to use for a given trace. The evaluation shows that this approach can reduce allocation time by 3-43% at a peak performance penalty of about 0-9% on average. For systems that do not mainly focus on peak performance, our approach allows adjusting the time spent for register allocation, and therefore the overall compilation time, finding the optimal balance between compile time and peak performance according to an application’s requirements.
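The per-trace policy idea can be illustrated with a small, hypothetical sketch; the three strategy names follow the abstract, but the trace features and thresholds below are invented for illustration.

```java
/**
 * Minimal sketch, not the production implementation: a policy that picks a
 * register-allocation strategy per trace based on simple trace features.
 */
public class TraceAllocationPolicy {

    enum Strategy { TRIVIAL, BOTTOM_UP, LINEAR_SCAN }

    /** Hypothetical trace descriptor; real traces carry much richer information. */
    static final class Trace {
        final int instructionCount;
        final boolean containsLoop;
        final double executionFrequency; // from profiling

        Trace(int instructionCount, boolean containsLoop, double executionFrequency) {
            this.instructionCount = instructionCount;
            this.containsLoop = containsLoop;
            this.executionFrequency = executionFrequency;
        }
    }

    /** Hot or loop-carrying traces get the slower, higher-quality allocator. */
    static Strategy choose(Trace trace, double hotThreshold) {
        if (trace.instructionCount <= 2) {
            return Strategy.TRIVIAL;          // e.g., a trace that is a single unconditional jump
        }
        if (trace.containsLoop || trace.executionFrequency >= hotThreshold) {
            return Strategy.LINEAR_SCAN;      // spend compile time where it pays off
        }
        return Strategy.BOTTOM_UP;            // fast single-pass allocation for cold code
    }

    public static void main(String[] args) {
        System.out.println(choose(new Trace(1, false, 0.01), 0.1));   // TRIVIAL
        System.out.println(choose(new Trace(40, true, 0.02), 0.1));   // LINEAR_SCAN
        System.out.println(choose(new Trace(25, false, 0.03), 0.1));  // BOTTOM_UP
    }
}
```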
Practical partial evaluation for high-performance dynamic language runtimes
Most high-performance dynamic language virtual machines duplicate language semantics in the interpreter, compiler, and runtime system. This violates the principle to not repeat yourself. In contrast, we define languages solely by writing an interpreter. The interpreter performs specializations, e.g., augments the interpreted program with type information and profiling information. Compiled code is derived automatically using partial evaluation while incorporating these specializations. This makes partial evaluation practical in the context of dynamic languages: It reduces the size of the compiled code while still compiling all parts of an operation that are relevant for a particular program. When a speculation fails, execution transfers back to the interpreter, the program re-specializes in the interpreter, and later partial evaluation again transforms the new state of the interpreter to compiled code. We evaluate our approach by comparing our implementations of JavaScript, Ruby, and R with best-in-class specialized production implementations. Our general-purpose compilation system is competitive with production systems even when they have been heavily optimized for the one language they support. For our set of benchmarks, our speedup relative to the V8 JavaScript VM is 0.83x, relative to JRuby is 3.8x, and relative to GNU R is 5x.
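The following toy Java class, which deliberately avoids the real Truffle APIs, illustrates the specialize/deoptimize/re-specialize cycle described above for a single add operation; the node shape and states are invented for this example.

```java
/**
 * Toy sketch of the specializing-interpreter idea (not the Truffle API):
 * a node speculates on integer operands and falls back to a generic case,
 * re-specializing itself, when the speculation fails.
 */
public class SpecializingAddNode {

    enum State { UNINITIALIZED, INT, GENERIC }

    private State state = State.UNINITIALIZED;

    Object execute(Object left, Object right) {
        switch (state) {
            case UNINITIALIZED:
                // First execution: specialize on the observed operand types.
                state = (left instanceof Integer && right instanceof Integer) ? State.INT : State.GENERIC;
                return execute(left, right);
            case INT:
                if (left instanceof Integer && right instanceof Integer) {
                    return (Integer) left + (Integer) right;   // fast path a partial evaluator can compile
                }
                state = State.GENERIC;                          // "deoptimize": speculation failed
                return execute(left, right);
            default:
                return genericAdd(left, right);
        }
    }

    private static Object genericAdd(Object left, Object right) {
        if (left instanceof Number && right instanceof Number) {
            return ((Number) left).doubleValue() + ((Number) right).doubleValue();
        }
        return String.valueOf(left) + right;                    // e.g., string concatenation
    }

    public static void main(String[] args) {
        SpecializingAddNode node = new SpecializingAddNode();
        System.out.println(node.execute(1, 2));       // specializes to INT, prints 3
        System.out.println(node.execute(1.5, 2));     // re-specializes to GENERIC, prints 3.5
    }
}
```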
SOAP 2017 Presentation - An Efficient Tunable Selective Points-to Analysis for Large Codebases
Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. Although object-based context sensitivity is known to improve precision of points-to analysis, scaling it for large Java codebases remains a challenge. In this work, we develop a tunable, client-independent, object-sensitive points-to analysis framework where heap cloning is applied selectively. This approach is aimed at large codebases where standard analysis is typically expensive. Our design includes a pre-analysis that determines program points that contribute to the cost of an object-sensitive points-to analysis. A subsequent analysis then determines the context depth for each allocation site. While our framework can run standalone, it is also possible to tune it – the user of the framework can use the knowledge of the codebase being analysed to influence the selection of expensive program points as well as the process to differentiate the required context-depth. Overall, the approach determines where the cloning is beneficial and where the cloning is unlikely to be beneficial. We have implemented our approach using Soufflé (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs, including OpenJDK, show that our technique is efficient and precise. For the OpenJDK, our analysis reduces runtime by 27% and memory usage by 18% for a negligible loss of precision, while for Jython from the DaCapo benchmark suite, the same analysis reduces runtime by 91% for no loss of precision.
Lenient Execution of C on a JVM -- How I Learned to Stop Worrying and Execute the Code
Most C programs do not strictly conform to the C standard, and often show undefined behavior, e.g., on signed integer overflow. When compiled by non-optimizing compilers, such programs often behave as the programmer intended. However, optimizing compilers may exploit undefined semantics for more aggressive optimizations, thus possibly breaking the code. Analysis tools can help to find and fix such issues. Alternatively, one could define a C dialect in which clear semantics are defined for frequent program patterns whose behavior would otherwise be undefined. In this paper, we present such a dialect, called Lenient C, that specifies semantics for behavior that the standard left open for interpretation. Specifying additional semantics enables programmers to safely rely on otherwise undefined patterns. Lenient C aims towards being executed on a managed runtime such as the JVM. We demonstrate how we implemented the dialect in Safe Sulong, a C interpreter with a dynamic compiler that runs on the JVM.
Towards Understandable Smart Contracts
Blockchains and smart contracts can facilitate trustworthy business processes between participants who do not trust each other. However, understanding smart contracts typically requires programming skills. We outline research towards enabling smart contracts that are understandable by humans too.
Inference of Security-Sensitive Entities in Libraries
Programming languages such as Java and C# execute code with different levels of trust in the same process, and rely on an access control model with fine-grained permissions to protect program code. Permissions are checked programmatically, and rely on programmer discipline. This can lead to subtle errors. To enable automatic security analysis of unauthorised access or information flow, it is necessary to reason about security-sensitive entities in libraries that must be protected by appropriate sanitization/declassification via permission checks. Unfortunately, security-sensitive entities are not clearly identified. In this paper, we investigate security-sensitive entities used in Java-like languages, and develop a static program analysis technique to identify them in large codebases by analysing the patterns of permission checks. Although the technique is generic, our focus is on Java, where checkPermission calls are used to guard potential security-sensitive entities. Our inference analysis uses two parameters, called proximity and coverage, to reduce false positive/negative reports. The usefulness of the analysis is illustrated by the results obtained while checking OpenJDK7-b147 for conformance to the Java Secure Coding Guidelines that relate to confidentiality and integrity requirements.
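The kind of pattern the inference targets can be illustrated with a small, hypothetical Java example: a permission check guarding access to a field marks that field as a candidate security-sensitive entity, while an unguarded accessor would be a potential finding. The class and permission names here are made up.

```java
import java.security.Permission;

/**
 * Illustrative example of the code pattern the analysis looks for: a getter
 * whose return value is guarded by a checkPermission call, marking the
 * underlying field as a security-sensitive entity.
 */
public class ProcessEnvironmentExample {

    private static final String SECRET_HOME = System.getProperty("user.home");

    /** The guarded accessor: the permission check "sanitizes" access to the entity. */
    public static String getUserHome() {
        SecurityManager sm = System.getSecurityManager();
        if (sm != null) {
            Permission p = new RuntimePermission("readUserHome"); // hypothetical permission name
            sm.checkPermission(p);
        }
        return SECRET_HOME;
    }

    /** An unguarded accessor like this one would be reported as a potential leak. */
    static String getUserHomeUnchecked() {
        return SECRET_HOME;
    }
}
```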
Transactional Lock Elision Meets Combining
Flat combining (FC) and transactional lock elision (TLE) are two techniques that facilitate efficient multi-thread access to a sequentially implemented data structure protected by a lock. FC allows threads to delegate their operations to another (combiner) thread, and benefit from executing multiple operations by that thread under the lock through combining and elimination optimizations tailored to the specific data structure. TLE employs hardware transactional memory (HTM) that allows multiple threads to apply their operations concurrently as long as they do not conflict. This paper explores how these two radically different techniques can complement one another, and introduces the HTM-assisted Combining Framework (HCF). HCF leverages HTM to allow multiple combiners to run concurrently with each other, as well as with other, non-combiner threads. This makes HCF a good fit for data structures and workloads in which some operations may conflict with each other while others may run concurrently without conflicts. HCF achieves all that with changes to the sequential code similar to those required by TLE and FC, and in particular, without requiring the programmer to reason about concurrency.
Polyglot Native: Scala, Kotlin, and Other JVM-Based Languages with Instant Startup and Low Footprint
Execution of JVM-based programs uses bytecode loading and interpretation, just-in-time compilation, and monolithic heaps. This causes JVM-based programs to start up slowly with a high memory footprint. In recent years, different projects were developed to address these issues: ahead-of-time compilation for the JVM (JEP 295) improves on JVM startup time while Scala Native and Kotlin/Native provide language-specific solutions by compiling code with LLVM and providing language-specific runtimes. We present Polyglot Native: an ahead-of-time compiler for Java bytecode combined with a low-footprint VM. With Polyglot Native, programs written in Kotlin, Scala, and other JVM-based languages have minimal startup time as they are compiled to native executables. The footprint of compiled programs is minimized by using a chunked heap and reducing necessary program metadata. In this talk, we show the architecture of Polyglot Native and compare it to existing projects. Then, we live-demo a project that compiles code from Kotlin, Scala, Java, and C into a single binary executable. Finally, we discuss intricacies of interoperability between Polyglot Native and C.
Evaluating Quality of Security Testing of the JDK
The document outlines the main challenges in evaluating test suites that check for security properties. Specifically, it considers testing the security properties of the JDK.
Dynamic Adaptation of User Migration Policies in Distributed Virtual Environments
A distributed virtual environment (DVE) consists of multiple network nodes (servers), each of which can host many users that consume CPU resources on that node and communicate with users on other nodes. Users can be dynamically migrated between the nodes, and the ultimate goal for the migration policy is to minimize the average system response time perceived by the users. In order to achieve this, the user migration policy should minimize network communication while balancing the load among the nodes so that CPU resources of the individual nodes are not overloaded. This paper considers a multi-player online game as an example of a DVE and presents an adaptive distributed user migration policy, which uses Reinforcement Learning to tune itself so as to minimize the average system response time perceived by the users. Performance of the self-tuning policy was compared on a simulator with the standard benchmark non-adaptive migration policy and with the optimal static user allocation policy in a variety of scenarios, and the self-tuning policy was shown to greatly outperform both benchmark policies, with performance difference increasing as the network became more overloaded.
Truffle: your favorite language on JVM
Graal/Truffle is a project that aims to build a multi-language, multi-tenant, multi-threaded, multi-node, multi-tooling and multi-system environment on top of the JVM. Imagine that in order to develop a (dynamic) language implementation all you need is to write its interpreter in Java, and immediately you get amazing peak performance, a choice of several carefully tuned garbage collectors, tooling support, high-speed interoperability with other languages and more. In this talk we'll take a look at how Truffle and Graal can achieve this and demonstrate the results on Ruby, JavaScript and R. Particular attention will be given to FastR, the Truffle-based R language implementation, its performance compared to GNU R, and its support for Java interoperability including graphics.
UMASS Data Science Talks
I'll be giving two talks at the UMASS data science event. The first talk is on our multilingual word embedding work. The second talk is on our constrained-inference approach for sequence-to-sequence neural networks. Relevant IP is covered in two patents and both pieces of work have previously been approved for publication (patent ref numbers and archivist ids provided below).
A General Model for Placement of Workloads on Multicore NUMA Systems
The problem of mapping threads, or virtual cores, to physical cores on multicore systems has been studied for over a decade. Despite this effort, there is still no method that will help us decide in real time and for arbitrary workloads the relative impact of different mappings on performance. Prior work has made large strides in this field, but these solutions addressed a limited set of concerns (e.g., only shared caches and memory controllers, or only asymmetric interconnects), assuming hardware with specific properties and leaving us unable to generalize the model to other systems. Our contribution is an abstract machine model that enables us to automatically build a performance prediction model for any machine with a hierarchy of shared resources. In the process of developing the methodology for building predictive models we discovered pitfalls of using hardware performance counters, a de facto technique embraced by the community for decades. Our new methodology does not rely on hardware counters at the expense of trying a handful of additional workload mappings (out of many possible) at runtime. Using this methodology data center operators can decide on the smallest number of NUMA (CPU+memory) nodes to use for the target application or service (which we assume to be encapsulated into a virtual container so as to match the reality of the modern cloud systems such as AWS), while still meeting performance goals. More broadly, the methodology empowers them to efficiently “pack” virtual containers on the physical hardware in a data center.
Pandia: comprehensive contention-sensitive thread placement.
Pandia is a system for modelling the performance of in-memory parallel workloads. It generates a description of a workload from a series of profiling runs, and combines this with a description of the machine's hardware to model the workload's performance over different thread counts and different placements of those threads.
The approach is “comprehensive” in that it accounts for contention at multiple resources such as processor functional units and memory channels. The points of contention for a workload can shift between resources as the degree of parallelism and thread placement changes. Pandia accounts for these changes and provides a close correspondence between predicted performance and actual performance. Testing a set of 22 benchmarks on 2-socket Intel machines fitted with chips ranging from Sandy Bridge to Haswell, we see median differences of 1.05% to 0% between the fastest predicted placement and the fastest measured placement, and median errors of 8% to 4% across all placements.
Pandia can be used to optimize the performance of a given workload, for instance identifying whether or not multiple processor sockets should be used, and whether or not the workload benefits from using multiple threads per core. In addition, Pandia can be used to identify opportunities for reducing resource consumption where additional resources are not matched by additional performance, for instance limiting a workload to a small number of cores when its scaling is poor.
Better Splittable Pseudorandom Number Generators (and Almost As Fast)
We have tested and analyzed the SplitMix pseudorandom number generator algorithm presented by Steele, Lea, and Flood, and have discovered two additional classes of gamma values that produce weak pseudorandom sequences. In this paper we present a modification to the SplitMix algorithm that avoids all three classes of problematic gamma values, and also a completely new algorithm for splittable pseudorandom number generators, which we call TwinLinear. Like SplitMix, TwinLinear provides both a generate operation that returns one (64-bit) pseudorandom value and a split operation that produces a new generator instance that with very high probability behaves as if statistically independent of all other instances. Also like SplitMix, TwinLinear requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with SIMD instruction sets because it has no branches or loops. The TwinLinear algorithm is the result of a systematic exploration of a substantial space of nonlinear mixing functions that combine the output of two independent generators of (perhaps not very strong) pseudorandom number sequences. We discuss this design space and our strategy for exploring it. We used the PractRand test suite (which has provision for failing fast) to filter out poor candidates, then used TestU01 BigCrush to verify the quality of candidates that withstood PractRand. We present results of analysis and extensive testing on TwinLinear (using both TestU01 and PractRand). Single instances of TwinLinear have no known weaknesses, and TwinLinear is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. It is slightly more costly than SplitMix (10 or 11 64-bit arithmetic operations per 64 bits generated, rather than 9) but has a shorter critical path (5 or 6 operations rather than 8). We believe that TwinLinear is suitable for the same sorts of applications as SplitMix, that is, "everyday" scientific and machine-learning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
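For readers unfamiliar with the generate/split structure the abstract refers to, here is a simplified SplitMix-style generator in Java. It is not the TwinLinear algorithm, and the real SplitMix derives and screens child gammas more carefully than this sketch does.

```java
/**
 * A SplitMix-style splittable generator, shown only to illustrate the
 * generate/split structure; gamma handling is simplified.
 */
public final class SplitMixSketch {

    private long seed;          // per-instance state
    private final long gamma;   // odd additive constant; weak gammas are what the paper addresses

    private SplitMixSketch(long seed, long gamma) {
        this.seed = seed;
        this.gamma = gamma | 1L; // gamma must be odd
    }

    public static SplitMixSketch of(long seed) {
        return new SplitMixSketch(seed, 0x9E3779B97F4A7C15L); // golden-ratio-derived gamma
    }

    /** 64-bit finalizer used by the SplitMix construction. */
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** generate: advance the state and return one pseudorandom 64-bit value. */
    public long nextLong() {
        seed += gamma;
        return mix64(seed);
    }

    /** split: derive a new generator that, with very high probability, behaves independently. */
    public SplitMixSketch split() {
        return new SplitMixSketch(nextLong(), nextLong());
    }

    public static void main(String[] args) {
        SplitMixSketch root = SplitMixSketch.of(42);
        SplitMixSketch child = root.split();   // usable by another thread without synchronization
        System.out.println(root.nextLong() + " " + child.nextLong());
    }
}
```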
Polyglot programming in the cloud
Graal polyglot vision presentation
LabelBank: Revisiting Global Perspectives for Semantic Segmentation
Semantic segmentation requires a detailed labeling of image pixels by object category. Information derived from local image patches is necessary to describe the detailed shape of individual objects. However, this information is ambiguous and can result in noisy labels. Global inference of image content can instead capture the general semantic concepts present. We advocate that holistic inference of image concepts provides valuable information for detailed pixel labeling. We propose a generic framework to leverage holistic information in the form of a LabelBank for pixel-level segmentation. We show the ability of our framework to improve semantic segmentation performance in a variety of settings. We learn models for extracting a holistic LabelBank from visual cues, attributes, and/or textual descriptions. We demonstrate improvements in semantic segmentation accuracy on standard datasets across a range of state-of-the-art segmentation architectures and holistic inference approaches.
A Many-core Architecture for In-Memory Data Processing
We live in an information age, with data and analytics guiding a large portion of our daily decisions. Data is being generated at a tremendous pace from connected cars, connected homes and connected workplaces, and extracting useful knowledge from this data is quickly becoming an impractical task. Single-threaded performance has become saturated in the last decade, and there is a growing need for custom solutions to keep pace with these workloads in a scalable and efficient manner. A large portion of the power consumed by analytics workloads goes into bringing data to the processing cores, and we aim to optimize that. We present the Database Processing Unit, or DPU, a shared-memory many-core that is specifically designed for in-memory analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and preprocessing operations. The DPU also provides acceleration for core-to-core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine or ATE. Comparison of a fabricated DPU chip with a variety of state-of-the-art x86 applications shows a performance/Watt advantage of 3x to 16x.
PGX.UI: Visual Construction and Exploration of Large Property Graphs
Transforming existing data into graph formats and visualizing large graphs in a comprehensible way are two key areas of interest of information visualization. Addressing these issues requires new visualization approaches for large graphs that support users with graph construction and exploration. In addition, graph visualization is becoming more important for existing graph processing systems, which are often based on the property graph model. Therefore, this paper presents concepts for visually constructing property graphs from data sources and a summary visualization for large property graphs. Furthermore, we introduce the concept of a graph construction time line that keeps track of changes and provides branching and merging, in a version-control-like fashion. Finally, we present a tool that visually guides users through the graph construction and exploration process.
Building Reusable, Low-Overhead Tooling Support into a High-Performance Polyglot VM
Software development tools that interact with running programs, for instance debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in thinking obsoletes that presumption and enables the delivery of effective tools as a forethought, no longer an afterthought.
A Two-List Framework for Accurate Detection of Frequent Items in Large Data Sets
The problem of detecting the most frequent items in large data sets and providing accurate frequency estimates for those items is becoming more and more important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. Two algorithms implementing this framework are presented: FSSAL and FSSA. An adaptive version of these algorithms is presented, which adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. Analytical comparison with the FSS algorithm showed that FSS2L algorithms have smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. A theoretical analysis of space and time complexity for all considered algorithms was performed. Finally, we showed that FSS2L algorithms can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.
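As background for readers unfamiliar with this family of algorithms, the sketch below shows plain Space-Saving, the scheme that FSS and the two-list framework above build on; it is not the FSS2L implementation from the paper.

```java
import java.util.*;

/**
 * A plain Space-Saving counter for frequent-items detection, included only
 * as background; the two-list algorithms in the paper refine this scheme
 * with a filtering stage.
 */
public class SpaceSaving<K> {

    private final int capacity;
    private final Map<K, Long> counts = new HashMap<>();

    public SpaceSaving(int capacity) {
        this.capacity = capacity;
    }

    /** Process one item of the stream. */
    public void offer(K key) {
        Long c = counts.get(key);
        if (c != null) {
            counts.put(key, c + 1);
        } else if (counts.size() < capacity) {
            counts.put(key, 1L);
        } else {
            // Evict the currently smallest counter and let the new key inherit
            // its count + 1, which bounds the overestimation error.
            K minKey = Collections.min(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
            long minCount = counts.remove(minKey);
            counts.put(key, minCount + 1);
        }
    }

    /** Estimated frequencies of the currently tracked (candidate heavy-hitter) keys. */
    public Map<K, Long> estimates() {
        return Collections.unmodifiableMap(counts);
    }

    public static void main(String[] args) {
        SpaceSaving<String> ss = new SpaceSaving<>(3);
        for (String s : "a a b a c d a b".split(" ")) ss.offer(s);
        System.out.println(ss.estimates());
    }
}
```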
Self-managed collections: Off-heap memory management for scalable query-dominated collections
Explosive growth in DRAM capacities and the emergence of language-integrated query enable a new class of managed applications that perform complex query processing on huge volumes of data stored as collections of objects in the memory space of the application. While more flexible in terms of schema design and application development, this approach typically experiences sub-par query execution performance when compared to specialized systems like DBMS. To address this issue, we propose self-managed collections, which utilize off-heap memory management and dynamic query compilation to improve the performance of querying managed data through language-integrated query. We evaluate self-managed collections using both microbenchmarks and enumeration-heavy queries from the TPC-H business intelligence benchmark. Our results show that self-managed collections outperform ordinary managed collections in both query processing and memory management by up to an order of magnitude and even outperform an optimized in-memory columnar database system for the vast majority of queries.
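The off-heap half of the idea can be sketched in a few lines of Java: records are laid out manually in a direct buffer rather than as individual objects, and queries scan that memory. The dynamic query compilation described in the abstract is omitted, and the record layout below is invented for illustration.

```java
import java.nio.ByteBuffer;

/**
 * Minimal sketch of the off-heap idea behind self-managed collections:
 * fixed-size records live in a direct (off-heap) buffer instead of as
 * individual Java objects, and queries scan the raw memory.
 */
public class OffHeapOrders {

    private static final int RECORD_BYTES = Integer.BYTES + Double.BYTES; // orderId, price
    private final ByteBuffer store;
    private int count;

    public OffHeapOrders(int capacity) {
        store = ByteBuffer.allocateDirect(capacity * RECORD_BYTES); // outside the GC-managed heap
    }

    public void add(int orderId, double price) {
        int offset = count * RECORD_BYTES;
        store.putInt(offset, orderId);
        store.putDouble(offset + Integer.BYTES, price);
        count++;
    }

    /** A simple query in the style of language-integrated query: count orders above a threshold. */
    public long countAbove(double threshold) {
        long matches = 0;
        for (int i = 0; i < count; i++) {
            double price = store.getDouble(i * RECORD_BYTES + Integer.BYTES);
            if (price > threshold) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        OffHeapOrders orders = new OffHeapOrders(1_000);
        orders.add(1, 19.99);
        orders.add(2, 250.00);
        orders.add(3, 75.50);
        System.out.println(orders.countAbove(50.0)); // 2
    }
}
```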
Language-Independent Information Flow Tracking Engine for Program Comprehension Tools
Program comprehension tools are often developed for a specific programming language. Developing such a tool from scratch requires significant effort. In this paper, we report on our experience developing a language-independent framework that enables the creation of program comprehension tools, specifically tools gathering insight from deep dynamic analysis, with little effort. Our framework is language-independent because it is built on top of Truffle, an open-source platform developed in Oracle Labs for implementing dynamic languages in the form of AST interpreters. Our framework supports the creation of a diverse variety of program comprehension techniques, such as query, program slicing, and back-in-time debugging, because it is centered around a powerful information-flow tracking engine. Tools developed with our framework get access to the information flow through a program execution. While it is possible to develop similarly powerful tools without our framework, for example by tracking information flow through bytecode instrumentation, our approach leads to information that is closer to source code constructs, and thus more comprehensible to the user. To demonstrate the effectiveness of our framework, we applied it to two Truffle-based languages, namely Simple Language and TruffleRuby, and we distill our experience into guidelines for developers of other Truffle-based languages who want to develop program comprehension tools for their language.
An Efficient Tunable Selective Points-to Analysis for Large Codebases
Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. Although object-based context sensitivity is known to improve precision of points-to analysis, scaling it for large Java codebases remains a challenge. In this work, we develop a tunable, client-independent, object-sensitive points-to analysis framework where heap cloning is applied selectively. This approach is aimed at large codebases where standard analysis is typically expensive. Our design includes a pre-analysis that determines program points that contribute to the cost of an object-sensitive points-to analysis. A subsequent analysis then determines the context depth for each allocation site. While our framework can run standalone, it is also possible to tune it – the user of the framework can use the knowledge of the codebase being analysed to influence the selection of expensive program points as well as the process to differentiate the required context-depth. Overall, the approach determines where the cloning is beneficial and where the cloning is unlikely to be beneficial. We have implemented our approach using Soufflé (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs, including OpenJDK, show that our technique is efficient and precise. For the OpenJDK, our analysis reduces runtime by 27% and memory usage by 18% for a negligible loss of precision, while for Jython from the DaCapo benchmark suite, the same analysis reduces runtime by 91% for no loss of precision.
Composing Durable Data Structures
This paper presents techniques for composing persistent data structure operations on machines with nonvolatile byte addressable memory. The techniques are applicable to a wide class of nonblocking algorithms.
Composing Durable Data Structures (poster)
Prior solutions for crash consistency on NVM have focused on two major areas. Data structures designed for NVM ensure that their metadata and contents are consistent after a crash, and operations become persistent in a well regulated manner (e.g. they meet the correctness condition durable linearizability [1]). In contrast, transactional systems guarantee that all changes from failure atomic sections of code (e.g. transactions) are entirely visible or entirely dropped after a crash. This work investigates composing operations on durably linearizable data structures into larger failure atomic sections (e.g. transactions). This goal can be seen as an extension of transactional boosting, a technique used in traditional (transient) transactional memory.
Dynamic Compilation and Run-Time Optimization
Lecture slides about "Dynamic Compilation and Run-Time Optimization", held at the University of Augsburg. Contents: technology pioneered in the Self VM (Inline Caching, Deoptimization, ...); Truffle and Graal tutorial.
HeadacheCoach: Towards Headache Prevention by Sensing and Making Sense of Personal Lifestyle Data
Estimates are that almost half of the world’s population has an active primary headache disorder, i.e., one with no underlying illness as its cause. These can start manifesting in early adulthood and can last for the rest of the sufferer’s life. Most specialists concur that sudden changes in daily lifestyle, such as sleep rhythm, nutrition behavior or stress experience, can be valid triggers for headache sufferers. Health care professionals recommend keeping a diary to self-monitor personal headache triggers in order to learn to avoid headache attacks. However, making sense of this data is difficult. Although existing smartphone approaches in the literature have evaluated behavior change support systems for headaches, they have failed to provide appropriate feedback on the collected daily data to show what causes or prevents an individual’s headache attacks. In this paper, we present HeadacheCoach, a smartphone app that tracks headache-triggering lifestyle data and headache attacks on a daily basis, and propose a mixed-method approach to examine which feedback method(s) can best drive the behavior change needed to prevent future headache attacks.
FAD.js: Fast JSON Data Access Using JIT-based Speculative Optimizations
JSON is one of the most popular data encoding formats, with wide adoption in Databases and BigData frameworks, and native support in popular programming languages such as JavaScript/Node.js, Python, and R. Nevertheless, JSON data manipulation can easily become a performance bottleneck in modern language runtimes due to parsing and object materialization overheads. In this paper, we introduce Fad.js, a runtime system for fast manipulation of JSON objects in data-intensive applications. Fad.js is based on speculative just-in-time compilation and on direct access to raw data. Experiments show that applications using Fad.js can achieve speedups up to 2.7x for encoding and 9.9x for decoding JSON data when compared to state-of-the-art JSON manipulation libraries.
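As a rough illustration of what direct access to raw data buys (this is not how Fad.js itself is implemented), the following Java snippet pulls a single field out of raw JSON text without materializing the whole object.

```java
/**
 * Illustrative only: extract one string field directly from raw JSON text,
 * skipping full parsing and object materialization. Works only for flat
 * objects without escaped quotes.
 */
public class RawJsonFieldAccess {

    static String getStringField(String json, String field) {
        String needle = "\"" + field + "\"";
        int i = json.indexOf(needle);
        if (i < 0) return null;
        int colon = json.indexOf(':', i + needle.length());
        int start = json.indexOf('"', colon + 1);
        int end = json.indexOf('"', start + 1);
        return (start < 0 || end < 0) ? null : json.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String record = "{\"id\":42,\"name\":\"ada\",\"city\":\"Zurich\"}";
        // Only the bytes of the requested field are ever turned into a Java object.
        System.out.println(getStringField(record, "city")); // Zurich
    }
}
```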
Machine Learning for Finding Bugs: An Initial Report
Static program analysis is a technique to analyse code without executing it, and can be used to find bugs in source code. Many open source and commercial tools have been developed in this space over the past 20 years. Scalability and precision are of importance for the deployment of static code analysis tools - numerous false positives and slow runtime both make a tool hard to use in development, where integration into a nightly build is the standard goal. This requires one to identify a suitable abstraction for the static analysis, which is typically a manual process and can be expensive. In this paper we report our findings on using machine learning techniques to detect defects in C programs. We use three off-the-shelf machine learning techniques and use a large corpus of programs available for use in both the training and evaluation of the results. We compare the results produced by the machine learning technique against the Parfait static program analysis tool used internally at Oracle by thousands of developers. While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques we used are not suitable replacements for static program analysis tools due to the low precision of the results. This could be due to a variety of reasons, including not using domain knowledge such as the semantics of the programming language, and a lack of suitable data used in the training process.
Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions
We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that for double-precision data, this technique makes an entire LDA machine-learning application about 25% faster than doing a straightforward matrix transposition after using coalesced accesses.
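The scalar baseline that the butterfly-patterned table accelerates looks roughly like the following Java sketch: build partial sums of the relative probabilities once, then draw by binary search. The GPU-specific table layout from the paper is not shown, and the weights used here are arbitrary.

```java
import java.util.SplittableRandom;

/**
 * Reference (scalar) version of the sampling step: a table of partial sums
 * of unnormalized weights, and a draw by binary search over that table.
 */
public class PartialSumSampling {

    /** Inclusive prefix sums of the (unnormalized) weights. */
    static double[] partialSums(double[] weights) {
        double[] sums = new double[weights.length];
        double running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            sums[i] = running;
        }
        return sums;
    }

    /** Draw one index: binary search for the first partial sum exceeding u * total. */
    static int draw(double[] sums, SplittableRandom rng) {
        double target = rng.nextDouble() * sums[sums.length - 1];
        int lo = 0, hi = sums.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sums[mid] <= target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] weights = {0.1, 2.0, 0.5, 1.4};          // e.g., unnormalized topic probabilities
        double[] sums = partialSums(weights);
        SplittableRandom rng = new SplittableRandom(7);
        int[] histogram = new int[weights.length];
        for (int i = 0; i < 100_000; i++) histogram[draw(sums, rng)]++;
        System.out.println(java.util.Arrays.toString(histogram)); // roughly proportional to weights
    }
}
```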
Persistent Memory Transactions
This paper presents a comprehensive analysis of implementation choices for transaction runtime systems optimized for persistent memory. Our work focuses on performance implications of persist barriers, primitives required to persist writes on persistent memory. In the process we introduce a new taxonomy of persistence domains (the portion of the memory hierarchy where data is effectively persistent) that has a significant impact on persist barrier latencies. We present algorithms for undo logging, redo logging, and copy-on-write based transactions, as well as a memory allocator, all optimized to reduce persist barriers per transaction. Our microbenchmarking does a comprehensive sweep of read-write mix ratios in transactions, showing performance trends of the transaction runtimes under different assumptions about persist barrier latencies. No single runtime dominates the rest across the board. However, we pinpoint approximate read-write ratio ranges where specific runtimes outperform the rest. Our analysis highlights the significant influence of multiple factors on performance – transaction-runtime-specific bookkeeping overheads, persist barrier latencies, and cache locality. We find similar performance trade-offs on three “real world” workloads – a key-value store of our own, SQLite, and memcached.
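A schematic undo-log transaction, with the persist barriers reduced to placeholder calls, shows where the barriers that such an analysis counts would sit; this is an illustration, not the runtime evaluated in the paper.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Schematic undo-logging transaction over a plain long[] "heap". In a real
 * persistent-memory runtime the persistBarrier() calls would flush cache
 * lines and fence; here they are placeholders.
 */
public class UndoLogTransaction {

    private final long[] heap;
    private final Deque<long[]> undoLog = new ArrayDeque<>(); // entries: {address, oldValue}

    public UndoLogTransaction(long[] heap) {
        this.heap = heap;
    }

    public void write(int address, long value) {
        undoLog.push(new long[]{address, heap[address]}); // log the old value first
        persistBarrier();                                  // log entry must be durable before the store
        heap[address] = value;
    }

    public void commit() {
        persistBarrier();   // make all in-place updates durable
        undoLog.clear();    // then discard (truncate) the log
        persistBarrier();
    }

    public void abort() {
        while (!undoLog.isEmpty()) {
            long[] entry = undoLog.pop();
            heap[(int) entry[0]] = entry[1];               // roll back in reverse order
        }
        persistBarrier();
    }

    private static void persistBarrier() {
        // Placeholder for a real persist barrier (cache-line write-back + fence).
    }

    public static void main(String[] args) {
        long[] heap = new long[4];
        UndoLogTransaction tx = new UndoLogTransaction(heap);
        tx.write(0, 42);
        tx.write(1, 7);
        tx.abort();                                        // heap is rolled back to all zeros
        System.out.println(heap[0] + " " + heap[1]);       // 0 0
    }
}
```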
Increasing the Robustness of C Libraries and Applications through Run-time Introspection
In C, low-level errors such as buffer overflow and use-after-free are a major problem since they cause security vulnerabilities and hard-to-find bugs. Libraries cannot apply defensive programming techniques since objects (e.g., arrays or structs) lack run-time information such as bounds, lifetime, and types. To address this issue, we devised introspection functions that empower C programmers to access run-time information about objects and variadic function arguments. Using these functions, we implemented a more robust, source-compatible version of the C standard library that validates parameters to its functions. The library functions react to otherwise undefined behavior; for example, when detecting an invalid argument, its functions return a special value (such as -1 or NULL) and set the errno, or attempt to still compute a meaningful result. We demonstrate by examples that using introspection in the implementation of the C standard library and other libraries prevents common low-level errors, while also complementing existing approaches.
SLIDES: It's Time for a New Old Language
Slides for an invited keynote talk at PPoPP
It's Time for a New Old Language
The most popular programming language in computer science has no compiler or interpreter. Its definition is not written down in any one place. It has changed a lot over the decades, and those changes have introduced ambiguities and inconsistencies. Today, dozens of variations are in use, and its complexity has reached the point where it needs to be re-explained, at least in part, every time it is used. Much effort has been spent in hand-translating between this language and other languages that do have compilers. The language is quite amenable to parallel computation, but this fact has gone unexploited. In this talk we will summarize the history of the language, highlight the variations and some of the problems that have arisen, and propose specific solutions. We suggest that it is high time that this language be given a complete formal specification, and that compilers, IDEs, and proof-checkers be created to support it, so that all the best tools and techniques of our trade may be applied to it also.
Dynamic Symbolic Execution for Polymorphism
Symbolic execution is an important program analysis technique that provides auxiliary execution semantics to execute programs with symbolic rather than concrete values. There has been much recent interest in symbolic execution for automatic test case generation and security vulnerability detection, resulting in various tools being deployed in academia and industry. Nevertheless, (subtype or dynamic) polymorphism has been neglected in symbolic execution of object-oriented programs: existing symbolic execution techniques can explore different targets of conditional branches but not different targets of method invocations. We address the problem of how this polymorphism can be expressed in a symbolic execution framework. We propose the notion of symbolic types, which make object types symbolic. With symbolic types, various targets of a method invocation can be explored systematically by mutating the type of the receiver object of the method during automatic test case generation. To the best of our knowledge, this is the first attempt to address polymorphism in symbolic execution. Mutation of method invocation targets is critical for effectively testing object-oriented programs, especially libraries. Our experimental results show that symbolic types are significantly more effective than existing symbolic execution techniques in achieving test coverage and finding bugs and security vulnerabilities in OpenJDK.
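One way to picture the effect of symbolic types (this is a hand-written illustration, not the symbolic execution engine itself) is that the receiver of a polymorphic call is systematically re-typed over the known concrete subtypes, so that each call target is exercised. The Shape, Circle, and Square types below are made up.

```java
import java.util.List;
import java.util.function.Supplier;

/**
 * Concretized illustration of the symbolic-types idea: to cover polymorphic
 * call targets, the receiver's type is "mutated" over the known concrete
 * subtypes and the code under test is exercised once per target.
 */
public class SymbolicTypeEnumeration {

    interface Shape { double area(); }
    static final class Circle implements Shape { public double area() { return Math.PI; } }
    static final class Square implements Shape { public double area() { return 1.0; } }

    /** Code under test: the call shape.area() has multiple possible targets. */
    static String classify(Shape shape) {
        return shape.area() > 2.0 ? "large" : "small";
    }

    public static void main(String[] args) {
        // A symbolic executor would derive this set from the class hierarchy;
        // here it is written out by hand.
        List<Supplier<Shape>> receiverTypes = List.of(Circle::new, Square::new);
        for (Supplier<Shape> make : receiverTypes) {
            Shape receiver = make.get();
            System.out.println(receiver.getClass().getSimpleName() + " -> " + classify(receiver));
        }
    }
}
```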
What makes TruffleRuby run Optcarrot 9 times faster than MRI?
TruffleRuby runs Optcarrot 9 times faster than MRI 2. TruffleRuby is a new optimizing implementation of Ruby. Optcarrot is a NES emulator. MRI 3 aims to run Optcarrot 3 times faster than MRI 2. We will explore the techniques which allow TruffleRuby to achieve high performance in Optcarrot. We'll discuss splitting, inlining, array strategies, Proc elimination, etc.
Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions
Slides for a talk to be given at ACM PPoPP on February 8, 2017. This 25-minute talk builds on the paper as accepted by PPoPP (Archivist 2016-057) and a previous version of the slides presented at NVIDIA GTC 2016 (Archivist 2016-0055). *** We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1024 topics about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double-precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.
Business Process Optimization via Reinforcement Learning
This presentation describes the theory of reinforcement learning and our first results on applying it to simulated business processes. Hopefully, it will impress some clients to give us data from some real business process, for which we will then be able to learn an improved action policy. We would like to present these slides on Feb 2 at the http://www.biwasummit.org/
Towards Scalable Provenance Generation From Points-To Information: An Initial Experiment
Points-to analysis is often used to identify potential defects in code. The usual points-to analysis does not store the justification for the presence of a specific value in the points-to relation. But for points-to analysis to meet the needs of the programmer, the analysis needs to provide the justification for its results. Programmers will use such justification to identify the cause of a defect in the code. In this paper we describe an approach to generate provenance information in the context of points-to analysis. Our solution is to define an abstract notion of data-flow traces that is computed as a post-analysis using points-to information that has already been computed. We implemented our approach in conjunction with the DOOP framework that computes points-to information. We use four benchmarks derived from two versions of the JDK, and use two realistic clients to demonstrate the effectiveness of our solution. For instance, we show that the overhead to compute these data-flow traces is only 25% when compared to the time to compute the original points-to analysis. We also discuss some of the limitations of our approach, especially in generating precise traces.
Machine Learning For Finding Bugs: An Initial Report
Static program analysis is a technique to analyse code without executing it, and can be used to find bugs in source code. Many open source and commercial tools have been developed in this space over the past 20 years. Scalability and precision are of importance for the deployment of static code analysis tools - numerous false positives and slow runtime both make a tool hard to use in development, where integration into a nightly build is the standard goal. This requires one to identify a suitable abstraction for the static analysis, which is typically a manual process and can be expensive. In this paper we report our findings on using machine learning techniques to detect defects in C programs. We use three off-the-shelf machine learning techniques and use a large corpus of programs available for use in both the training and evaluation of the results. We compare the results produced by the machine learning technique against the Parfait static program analysis tool used internally at Oracle by thousands of developers. While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques we used are not suitable replacements for static program analysis tools due to the low precision of the results. This could be due to a variety of reasons, including not using domain knowledge such as the semantics of the programming language, and a lack of suitable data used in the training process.
Fast, Flexible, Polyglot Instrumentation Support for Debuggers and other Tools
Software development tools that interact with running programs, for instance debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in thinking obsoletes that presumption and enables the delivery of effective tools as a forethought, no longer an afterthought. We have extended an open-source multi-language platform with a language-agnostic Instrumentation Framework, including (1) low-level, extremely low-overhead execution event interposition, built directly into the high-performance runtime; (2) shared language-agnostic instrumentation services, requiring minimal per-language specialization; and (3) versatile APIs for constructing many kinds of client tools without modifying the VM. A new design uses this framework to implement debugging services for arbitrary languages (possibly in combination) with little effort from the language implementor. We show that, when optimized, the service has no measurable overhead and generalizes to other kinds of tools. It is now possible for a client in a production environment, with thread safety, to dynamically insert into an executing program an instrumentation probe that incurs near-zero performance cost until actually used to access (or modify) execution state. Other applications include tracing and stepping required by some languages, as well as platform requirements such as the need to timebox script executions. Finally, opening public API access to runtime state encourages advanced tool development and experimentation with much reduced effort.
Secure Information Flow by Access Control: A Security Type System of Dual-Access Labels
Programming languages such as Java and C# execute code with different levels of trust in the same process, and rely on a fine-grained access control model for users to manage the security requirements of program code from different sources. While such a security model is simple enough to be used in practice to protect systems from many hostile programs downloaded over a network, it does not guard against information-based attacks, such as confidentiality and integrity violations. We introduce a novel security model, called Dual-Access Label (DAL), to capture information-based security requirements of programs written in these languages. DAL labels extend the access control model by specifying both the accessibility and capability of program code, and use them to constrain information flows between code from different sources. Accessibility specifies the privileges necessary to access the code while capability indicates the privileges held by the code. DAL's security policy places a two-way obligation on both ends of information flow so that they must have sufficient capability to meet the accessibility of each other. Unlike traditional lattice-based security models, our security model offers more flexible information flow relations induced by the security policy, which does not have to be transitive. It provides both confidentiality and integrity guarantees while allowing cyclic information flows among code with different security labels, as desired in many applications. We present a generic security type system to enforce possibly intransitive information flow policies, including DAL, statically at compile time. Such a security type system provides a new notion of intransitive noninterference that generalizes the standard notion of transitive noninterference in lattice-based security models.
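Under the simplifying assumption that accessibility and capability are plain permission sets and "sufficient capability" means set inclusion, the two-way DAL check can be pictured as follows; the paper's actual formulation is more general, and the privilege names here are invented.

```java
import java.util.Set;

/**
 * Toy encoding of the Dual-Access Label check: each label carries an
 * accessibility set (privileges required to access the code) and a
 * capability set (privileges the code holds).
 */
public class DualAccessLabel {

    final Set<String> accessibility; // privileges required to access this code
    final Set<String> capability;    // privileges this code holds

    DualAccessLabel(Set<String> accessibility, Set<String> capability) {
        this.accessibility = accessibility;
        this.capability = capability;
    }

    /** Two-way obligation: each end must hold enough capability to meet the other's accessibility. */
    static boolean flowAllowed(DualAccessLabel from, DualAccessLabel to) {
        return to.capability.containsAll(from.accessibility)
            && from.capability.containsAll(to.accessibility);
    }

    public static void main(String[] args) {
        DualAccessLabel client  = new DualAccessLabel(Set.of("net"), Set.of("net"));
        DualAccessLabel service = new DualAccessLabel(Set.of("net"), Set.of("net", "db"));
        DualAccessLabel dbLayer = new DualAccessLabel(Set.of("db"),  Set.of("db", "net"));
        System.out.println(flowAllowed(client, service)); // true: both obligations hold
        System.out.println(flowAllowed(dbLayer, client)); // false: client lacks the "db" privilege
    }
}
```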
Machine Learning For Finding Bugs in Source Code: An Initial Report
Static program analysis is a technique to analyse code without executing it, and can be used to find bugs in source code. Many open source and commercial tools have been developed in this space over the past 20 years. Of importance for the deployment of static code analysis tools is the precision of the technique and its scalability – numerous false positives and slow runtime both make a tool hard to use in development, where integration into a nightly build is the standard goal. In this paper we report our findings on using machine learning techniques to detect defects in C programs. We use three off-the-shelf machine learning techniques and use a large corpus of programs available for use in both the training and evaluation of the results. We compare the results produced by the machine learning technique against the Parfait static program analysis tool used internally at Oracle by thousands of developers. While on the surface the initial results were encouraging, further investigation suggests that the machine learning techniques we used are not suitable replacements for static program analysis tools due to the low precision of the results. This could be due to a variety of reasons, including not using domain knowledge and a lack of suitable data used in the training process.
Distributed Join Algorithms on Thousands of Cores
Traditional database operators such as joins are relevant not only in the context of database engines but also as a first step in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that can scale with the available hardware resources. In this paper, we explore the implementation of distributed join algorithms in systems with several thousand cores connected by a low-latency network as used in high performance computing systems or data centers. We compare radix hash join to sort-merge join algorithms and discuss their implementation at this scale. In the paper, we explain how to use MPI to implement joins, show the impact and advantages of RDMA, discuss the importance of network scheduling, and study the relative performance of sorting vs. hashing. The experimental results show that the algorithms we present scale well with the number of cores, reaching a throughput of 48.7 billion input tuples per second on 4096 cores. Furthermore, we identify opportunities for improvements, opening up important directions for future research.
Biscotti and Cannoli: An Initial Exploration into Machine Learning for the Purposes of Finding Bugs in Source Code
Initial exploration experiments and preliminary results from the collaboration with Queensland University of Technology.
Improving the Scalability of Automatic Linearizability Checking in SPIN
Concurrency in data structures is crucial to the performance of multithreaded programs in shared-memory multiprocessor environments. However, greater concurrency also increases the difficulty of verifying correctness of the data structure. Model checking has been used for verifying that concurrent data structures satisfy the correctness condition ‘linearizability’. In particular, ‘automatic’ tools achieve verification without requiring user-specified linearization points. This has several advantages, but is generally not scalable. We examine the automatic checking used by Vechev et al. in [VYY09] to understand the scalability issues of automatic checking in SPIN. We then describe a new, more scalable automatic technique based on these insights, and present the results of a proof-of-concept implementation.
Just-In-Time GPU Compilation of Interpreted Programs with Profile-Driven Specialization
Computer systems are increasingly featuring powerful parallel devices with the advent of manycore CPUs, GPUs and FPGAs. This offers the opportunity to solve large computationally-intensive problems at a fraction of the time of traditional CPUs. However, exploiting this heterogeneous hardware requires the use of low-level programming languages such as OpenCL, which is incredibly challenging, even for advanced programmers. On the application side, interpreted dynamic languages are increasingly becoming popular in many emerging domains for their simplicity, expressiveness and flexibility. However, this creates a wide gap between the nice high-level abstractions offered to non-expert programmers and the low-level hardware-specific interface. Currently, programmers have to rely on specialized high performance libraries or are forced to write parts of their application in a low-level language like OpenCL. Ideally, programmers should be able to exploit heterogeneous hardware directly from their interpreted dynamic languages. In this paper, we present a technique to transparently and automatically offload computations from interpreted dynamic languages to heterogeneous devices. Using just-in-time compilation, we automatically generate OpenCL code at runtime which is specialized to the actual observed data types using profiling information. We demonstrate our technique using R, which is a popular interpreted dynamic language predominantly used in big data analytics. Our experimental results show execution on a GPU yields speedups of over 150x when compared to the sequential FastR implementation and performance is competitive with manually written GPU code. We also show that when taking into account startup time, large speedups are achievable, even when the application runs for as little as a few seconds.
Composing Durable Data Structures
This paper presents techniques for composing persistent data structures on machines with nonvolatile byte addressable memory. The techniques are applicable to a wide class of nonblocking algorithms.
Defense against Cache-Based Side Channel Attacks for Secure Cloud Computing
Cloud computing is a combination of various established technologies, such as virtualization, dynamic elasticity, and broadband Internet, that provides configurable computing resources as a service to users. Resources are shared among many distrusting clients by abstracting the underlying infrastructure using virtualization. While cloud computing has many practical benefits, resource sharing in cloud computing raises the threat of Cache-Based Side Channel Attacks (CSCA). In this paper a solution is proposed to detect CSCA and protect guest Virtual Machines (VMs) from it. Cache miss patterns are analyzed in this solution to detect side channel attacks. A notification channel between the client and the cloud service provider (CSP) is introduced to notify the CSP of the client's consent for running the prevention mechanism. A cache decay mechanism with a random decay interval is used as the prevention mechanism in the proposed solution. The performance of the proposed solution is compared with previous solutions, and the results indicate that this solution has the lowest performance overhead, a constant detection rate, and compatibility with the existing cloud computing model.
On Dynamic Information-Flow Analysis for Object-Oriented Programs
Information-flow security vulnerabilities, such as confidentiality and integrity violations, are real and serious problems found commonly in real-world software. Static analyses for information-flow control have the advantage of providing full coverage compared to dynamic analyses, as all possible security violations in the program need to be identified. On the other hand, dynamic information-flow analyses can offer distinct advantages in precision because they are less conservative than static analyses, rejecting only insecure executions instead of whole programs, and providing additional accuracy via flow- and path-sensitivity compared to static analyses. This talk will highlight some of our attempts to detect information-based security vulnerabilities in Java programs. In particular, we will discuss our investigation of dynamic program analysis for enforcing information-flow security in object-oriented programs. Even though we are able to obtain a soundness result for the analysis by formalising a core language and a generalised operational semantics that tracks explicit and implicit information propagation at runtime, we find it is fundamentally limited and practically infeasible to develop a purely dynamic analysis for information-flow security in the presence of shared objects and aliases.
Practical Partial Evaluation for High-Performance Dynamic Language Runtimes
Most high-performance dynamic language virtual machines duplicate language semantics in the interpreter, compiler, and runtime system, violating the principle to not repeat yourself. In contrast, we define languages solely by writing an interpreter. Compiled code is derived automatically using partial evaluation (the first Futamura projection). The interpreter performs specializations, e.g., augments the interpreted program with type information and profiling information. Partial evaluation incorporates these specializations. This makes partial evaluation practical in the context of dynamic languages, because it reduces the size of the compiled code while still compiling in all parts of an operation that are relevant for a particular program. Deoptimization to the interpreter, re-specialization in the interpreter, and recompilation embrace the dynamic nature of languages. We evaluate our approach comparing newly built JavaScript, Ruby, and R runtimes with current specialized production implementations of those languages. Our general purpose compilation system is competitive with production systems even when they have been heavily specialized and optimized for one language.
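As a rough illustration of the interplay between specialization and partial evaluation, the sketch below (plain Java, not the Truffle API) shows an interpreter node that rewrites itself to an integer-specialized variant after observing its operands; partially evaluating the specialized node then yields compiled code containing only the integer fast path guarded by a deoptimization check. In a real system the rewrite would replace the node in its parent tree; that plumbing is omitted here.

```java
// Minimal self-specialising interpreter node sketch (assumed names, not Truffle's API).
abstract class AddNode {
    abstract Object execute(Object left, Object right);
}

final class UninitializedAddNode extends AddNode {
    @Override Object execute(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            // specialise: a real interpreter would replace this node in the tree
            return new IntAddNode().execute(left, right);
        }
        return new GenericAddNode().execute(left, right);
    }
}

final class IntAddNode extends AddNode {
    @Override Object execute(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            return (Integer) left + (Integer) right;   // fast path kept by partial evaluation
        }
        // speculation failed: fall back (deoptimise) to the generic case
        return new GenericAddNode().execute(left, right);
    }
}

final class GenericAddNode extends AddNode {
    @Override Object execute(Object left, Object right) {
        return ((Number) left).doubleValue() + ((Number) right).doubleValue();
    }
}
```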
Primus Inter Pares: Improving Parallelism in Hardware Transactional Memory
Hardware transactional memory (HTM) is supported by recent processors from Intel and IBM. HTM is attractive because it can enhance concurrency while simplifying programming. Today's HTM systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to very poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits concurrency. In this paper, we propose very simple architectural changes to the existing requester-wins HTM architectures. These changes permit higher levels of concurrency when transactions cannot make progress and require a fallback path. The idea is to support a special mode of execution in HTM, called power mode, which can be used to enhance conflict resolution between regular and so-called power transactions. Our idea is backward-compatible with existing HTM code, imposing no additional cost on transactions that do not use the power mode. In addition, it supports detection of undesired dynamic data sharing, indicating when the data sets of transactions that should be disjoint are not. Using extensive evaluation of micro- and STAMP benchmarks in a transactional memory simulator and real hardware-based emulation, we show that our technique significantly improves performance over the baseline that does not use power mode, and performs similarly to or better than state-of-the-art related proposals that require more substantial architectural changes.
Intrusion Detection of a Simulated SCADA System using Data-Driven Modeling
Supervisory Control and Data Acquisition (SCADA) systems have become integrated into many industries that have a need for control and automation. Examples of these industries include energy, water, transportation, and petroleum. A typical SCADA system consists of field equipment for process actuation and control, along with proprietary communication protocols. These protocols are used to communicate between the field equipment and the monitoring equipment located at a central facility. Given that distribution of vital resources is often controlled by this type of system, there is a need to secure the networked compute and control elements from users with malicious intent. This paper investigates the use of data-driven modeling techniques to identify various types of intrusions tested against a simulated SCADA system. The test bed uses three enterprise servers that were part of a university engineering Linux cluster. These were isolated so that job queries on the cluster would not be reflected in the normal behavior of the test bed, and to ensure that intrusion testing would not affect other components of the cluster. One server acts as a Master Terminal Unit (MTU), which simulates control and data acquisition processes. The other two act as Remote Terminal Units (RTUs), which simulate monitoring and telemetry transmission. All servers use Ubuntu 14.04 as the OS. A separate workstation using Kali Linux acts as a Human Machine Interface (HMI), which is used to monitor the simulation and perform intrusion testing. Monitored telemetry included network traffic, and hardware and software digitized time series signatures. The models used in this research include the Auto-Associative Kernel Regression (AAKR) and the Auto-Associative Multivariate State Estimation Technique (AAMSET) [1, 2]. This type of intrusion detection can be classified as a behavior-based technique, wherein data collected when the system exhibits normal behavior is first used to train and optimize the previously mentioned machine learning models. Any future monitored telemetry that deviates from this normal behavior can be treated as anomalous, and may indicate an attack against the system. Models were tested to evaluate the prognostic effectiveness when monitoring clusters of signals from four classes of telemetry: a combination of all telemetry signals, memory and CPU usage, disk usage, and TCP/IP statistics. Anomaly detection is performed by using the Sequential Probability Ratio Test (SPRT), which is a binary sequential statistical test developed by Wald [3]. This test determines whether the monitored observation has mean or variance shifted from defined normal behavior [4]. For the prognostic security experiments reported in this paper, we established rigorous quantitative functional requirements for evaluating the outcome of the intrusion-signature fault injection experiments. These were a high accuracy for model predictions of dynamic telemetry metrics, and ultralow False Alarm and Missed Alarm Probabilities (FAPs and MAPs)...
SimSPRT-II: Monte Carlo Simulation of Sequential Probability Ratio Test Algorithms for Optimal Prognostic Performance
New prognostic AI innovations are being developed, optimized, and productized for enhancing the reliability, availability, and serviceability of enterprise servers and data centers, a field known as Electronic Prognostics (EP). EP prognostic innovations are now being spun off for prognostic cyber-security applications, and for Internet-of-Things (IoT) prognostic applications in the industrial sectors of manufacturing, transportation, and utilities. For these applications, the function of prognostic anomaly detection is achieved by predicting what each monitored signal “should be” via highly accurate empirical nonlinear nonparametric (NLNP) regression algorithms, and then differencing the optimal signal estimates from the real measured signals to produce “residuals”. The residuals are then monitored with a Sequential Probability Ratio Test (SPRT). The advantage of the SPRT, when tuned properly, is that it provides the earliest mathematically possible annunciation of anomalies growing into time series signals for a wide range of complex engineering applications. SimSPRT-II is a comprehensive parametric Monte Carlo simulation framework for tuning, optimization, and performance evaluation of SPRT algorithms for any type of digitized time-series signal. SimSPRT-II enables users to systematically optimize SPRT performance as a multivariate function of Type-I and Type-II errors, Variance, Sampling Density, and System Disturbance Magnitude, and then quickly evaluate what we believe to be the most important overall prognostic performance metrics for real-time applications: Empirical False and Missed-Alarm Probabilities (FAPs and MAPs), SPRT Tripping Frequency as a function of anomaly severity, and Overhead Compute Cost as a function of sampling density. SimSPRT-II has become a vital tool for tuning, optimization, and formal validation of SPRT-based AI algorithms for applications in a broad range of engineering and security prognostic applications.
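For reference, the core of a Wald SPRT for a positive mean shift in Gaussian residuals fits in a few lines; the sketch below follows the standard textbook formulation (thresholds derived from the target Type-I and Type-II error probabilities) and is not the SimSPRT-II implementation itself.

```java
// Wald's sequential probability ratio test for residuals that are N(0, variance)
// under normal behavior and N(shift, variance) under a fault. The accumulated
// log-likelihood ratio is compared against thresholds derived from alpha (false
// alarm) and beta (missed alarm).
final class MeanShiftSprt {
    enum Decision { CONTINUE, ANOMALY, NORMAL }

    private final double upper, lower;   // ln((1-beta)/alpha), ln(beta/(1-alpha))
    private final double shift, variance;
    private double llr;                  // accumulated log-likelihood ratio

    MeanShiftSprt(double alpha, double beta, double shift, double variance) {
        this.upper = Math.log((1 - beta) / alpha);
        this.lower = Math.log(beta / (1 - alpha));
        this.shift = shift;
        this.variance = variance;
    }

    Decision observe(double residual) {
        llr += (shift / variance) * (residual - shift / 2);      // Gaussian LLR increment
        if (llr >= upper) { llr = 0; return Decision.ANOMALY; }  // trip and restart
        if (llr <= lower) { llr = 0; return Decision.NORMAL; }   // accept and restart
        return Decision.CONTINUE;
    }
}
```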
SimML Framework: Monte Carlo Simulation of Statistical Machine Learning Algorithms for IoT Prognostic Applications
Advanced statistical machine learning (ML) algorithms are being developed, trained, tuned, optimized, and validated for real-time prognostics for internet-of-things (IoT) applications in the fields of manufacturing, transportation, and utilities. For such applications, we have achieved greatest prognostic success with ML algorithms from a class of pattern recognition known as nonlinear, nonparametric regression. To intercompare candidate ML algorithmics to identify the “best” algorithms for IoT prognostic applications, we use three quantitative performance metrics: false alarm probability (FAP), missed alarm probability (MAP), and overhead compute cost (CC) for real-time surveillance. This paper presents a comprehensive framework, SimML, for systematic parametric evaluation of statistical ML algorithmics for IoT prognostic applications. SimML evaluates quantitative FAP, MAP, and CC performance as a parametric function of input signals’ degree of cross-correlation, signal-to-noise ratio, number of input signals, sampling rates for the input signals, and number of training vectors selected for training. Output from SimML is provided in the form of 3D response surfaces for the performance metrics that are essential for comparing candidate ML algorithms in precise, quantitative terms.
Ruby’s C Extension Problem and How We're Solving It
Ruby’s C extensions have so far been the best way to improve the performance of Ruby code. Ironically, they are now holding performance back, because they expose the internals of Ruby and mean we aren’t free to make major changes to how Ruby works. In JRuby+Truffle we have a radical solution to this problem – we’re going to interpret the source code of your C extensions, like how Ruby interprets Ruby code. Combined with a JIT this lets us optimise Ruby but keep support for C extensions.
Points-To Analysis: Provenance Generation
The usual points-to analysis does not store the justification for the presence of a tuple in the points-to result. However, this is required for many client-driven queries, as provenance information gives the client justifications that can be used in other contexts such as debugging. In this presentation, we describe our approach to generating provenance information using the results of a context-sensitive points-to analysis. It has been implemented using the DOOP framework and the Souffle Datalog engine. Our use cases demand that the approach scale to large code bases. We use four benchmarks derived from two versions of the JDK and two realistic clients to demonstrate the effectiveness of our approach.
SLIDES: How to Tell a Compiler What We Think We Know?
Slides for an invited keynote talk at 2016 ACM SPLASH-I
Optimizing R Language Execution via Aggressive Speculation
The R language, from the point of view of language design and implementation, is a unique combination of various programming language concepts. It has functional characteristics like lazy evaluation of arguments, but also allows expressions to have arbitrary side effects. Many runtime data structures, for example variable scopes and functions, are accessible and can be modified while a program executes. Several different object models allow for structured programming, but the object models can interact in surprising ways with each other and with the base operations of R. R works well in practice, but it is complex, and it is a challenge for language developers trying to improve on the current state-of-the-art, which is the reference implementation – GNU R. The goal of this work is to demonstrate that, given the right approach and the right set of tools, it is possible to create an implementation of the R language that provides significantly better performance while keeping compatibility with the original implementation. In this paper we describe novel optimizations backed up by aggressive speculation techniques and implemented within FastR, an alternative R language implementation, utilizing Truffle – a JVM-based language development framework developed at Oracle Labs. We also provide experimental evidence demonstrating effectiveness of these optimizations in comparison with GNU R, as well as Renjin and TERR implementations of the R language.
smalltalkCI: A Continuous Integration Framework for Smalltalk Projects
Continuous integration (CI) is a programming practice that reduces the risk of project failure by integrating code changes multiple times a day. This has always been important to the Smalltalk community, which operates custom integration infrastructures that allow CI testing for Smalltalk projects shared in Monticello repositories or as traditional changesets.
In the last few years, the open hosting platform GitHub has become more and more popular for Smalltalk projects. Unfortunately, there was no convenient way to enable CI testing for those projects.
We present smalltalkCI, a continuous integration framework for Smalltalk. It aims to provide a uniform way to load and test Smalltalk projects written in different Smalltalk dialects. smalltalkCI runs on Linux, macOS, and Windows, and can be used locally as well as on a remote server. In addition, it is compatible with Travis CI and AppVeyor, which allows developers to easily set up free CI testing for their GitHub projects without having to run a custom integration infrastructure.
Matriona: Class Nesting with Parameterization in Squeak/Smalltalk
We present Matriona, a module system for Squeak, a Smalltalk dialect. It supports class nesting and parameterization and is based on a hierarchical name lookup mechanism. Matriona solves a range of modularity issues in Squeak. Instead of a flat class organization, it provides a hierarchical namespace that avoids name clashes and allows for shorter local names. Furthermore, it provides a way to share behavior among classes and modules using mixins and class hierarchy inheritance (a form of inheritance that subclasses an entire class family), respectively. Finally, it allows modules to be externally configurable, which is a form of dependency management decoupling a module from the actual implementation of its dependencies. Matriona is implemented on top of Squeak by introducing a new keyword for run-time name lookups through a reflective mechanism, without modifying the underlying virtual machine. We evaluate Matriona with a series of small applications and demonstrate how its features can benefit modularity when porting a simple application written in plain Squeak to Matriona.
Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle
Although the Java platform has been used as a multi-language platform, most of the low-level languages (such as C, Fortran, and C++) cannot be executed efficiently on the JVM. We propose Sulong, a system that can execute LLVM-based languages on the JVM. By targeting LLVM IR, Sulong is able to execute C, Fortran, and other languages that can be compiled to LLVM IR. Sulong combines LLVM's static optimizations with dynamic compilation to reach a peak performance that is near to the performance achievable with static compilers. For C benchmarks, Sulong's peak runtime performance is on average 1.39x slower (0.79x to 2.45x) compared to the performance of executables compiled by Clang O3. For Fortran benchmarks, Sulong is 2.63x slower (1.43x to 4.96x) than the performance of executables compiled by GCC O3. This low overhead makes Sulong an alternative to Java's native function interfaces. More importantly, it also allows other JVM language implementations to use Sulong for implementing their native interfaces.
How to Tell a Compiler What We Think We Know?
I have been repeatedly quoted (and tweeted) as having remarked more than once over the last decade, "If it's worth telling yourself (or another programmer), it's worth telling the compiler." In this talk, I will try to explain in more detail what I meant by this. In particular, I have noticed that programming languages provide lots of ways to annotate one thing, but not very many good ways to talk about relationships among multiple things (other than regarding one as a "server" to which an annotation is attached and the others as "clients"). As a very simple example, we don't even yet have a relatively standard way to say such simple things as "Thus-and-so value is an identity for this binary operation" or "this operation distributes over that operation". Algebraic constraints are one way to express some of these relationships, but where in a program should they be placed? How can they be generalized and abstracted? Does object-oriented design make this task easier or harder? I am particularly interested in what we might want to say in the future to a compiler that incorporates a full-blown theorem prover. This talk will be a sort of oral essay, raising more questions than it answers.
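To make the gap concrete, here is one hypothetical shape such declarations could take in Java; the @Identity and @DistributesOver annotations below do not exist in any standard library and are shown purely to illustrate the kind of relationships among operations the talk argues for.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotations (assumed names, not an existing API) expressing algebraic
// facts a compiler or theorem prover could check or exploit.
@Retention(RetentionPolicy.CLASS)
@interface Identity { String value(); }          // names the identity element of the operation

@Retention(RetentionPolicy.CLASS)
@interface DistributesOver { String value(); }   // names another operation this one distributes over

final class IntAlgebra {
    @Identity("0")
    static int add(int a, int b) { return a + b; }

    @Identity("1")
    @DistributesOver("add")                      // multiply distributes over add
    static int multiply(int a, int b) { return a * b; }
}
```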
Become Polyglot by learning Java!
In a world running at breakneck speed to embrace JavaScript, it is refreshing to see a project that embraces Java to provide a solution that deals with the new world and even improves it. I describe Truffle, a project that aims to build a multi-language, multi-tenant, multi-threaded, multi-node, multi-tooling and multi-system environment on top of the Java virtual machine, with the goal of forming the fastest and most flexible execution environment on the planet! Learn about Truffle and its Java APIs to become a real polyglot, use the best language for a task, and never ask again: Do I really have to use that crummy language?
FastR - Optimizing and Enhancing R Language Implementation
The current reference implementation of the R language, namely GNU R, is very mature and extremely popular. Nevertheless, alternative implementations are under development with the goal of improving and enhancing the current state-of-the-art. FastR is one such implementation created by Oracle Labs in collaboration with academic partners. FastR aims to deliver a fully compatible R language implementation compiling R programs to efficient native code, but which at the same time constitutes an experimentation platform for enhancing some of the existing R capabilities, for example with respect to parallel execution. FastR is built upon an infrastructure consisting of an optimizing compiler called Graal and of the Truffle framework, which simplifies the creation of new language runtimes that can then interface with Graal. The infrastructure is specifically designed to support the creation of dynamic languages, such as R, by taking advantage of runtime execution profiling and aggressive optimistic optimizations during the compilation process. In this talk I will describe how the Graal/Truffle infrastructure enables some of the optimizations in FastR's runtime and demonstrate how effective these optimizations are in practice, based on an experimental performance evaluation. I will also present our work on enhancing R, in particular with respect to parallel computation capabilities, by supplanting GNU R’s process-based model (as defined in the parallel or snowfall packages) with an API-compatible thread-based model where communication between different parts of a parallel computation occurs over shared-memory channels.
Formal Verification of Division and Square Root Implementations, an Oracle Report
These are the slides that go with OL 2016-0771, a conference paper with the same title.
Dynamic Adaptation of User Migration Policies in Distributed Virtual Environments
A distributed virtual environment (DVE) consists of multiple network nodes (servers), each of which can host many users that consume CPU resources on that node and communicate with users on other nodes. Users can be dynamically migrated between the nodes, and the ultimate goal for the migration policy is to minimize the average system response time perceived by the users. In order to achieve this, the user migration policy should minimize network communication while balancing the load among the nodes so that CPU resources of the individual nodes are not overloaded. This paper considers a multi-player online game as an example of a DVE and presents an adaptive distributed user migration policy, which uses Reinforcement Learning to tune itself so as to minimize the average system response time perceived by the users. Performance of the self-tuning policy was compared on a simulator with the standard benchmark non-adaptive migration policy and with the optimal static user allocation policy in a variety of scenarios; the self-tuning policy was shown to greatly outperform both benchmark policies, with the performance difference increasing as the network becomes more overloaded.
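As one generic illustration of how such a policy can tune itself, the sketch below shows a tabular Q-learning update driven by observed response times; the paper's actual state encoding, action set, and learning algorithm are not specified here, so all names and constants are illustrative.

```java
// Generic tabular Q-learning update (illustrative only): rewards could be negated
// response times observed after a migration decision, and actions could be
// "migrate user to node i" or "keep user in place".
final class MigrationPolicyLearner {
    private final double[][] q;          // q[state][action]
    private final double alpha = 0.1;    // learning rate
    private final double gamma = 0.9;    // discount factor

    MigrationPolicyLearner(int states, int actions) { q = new double[states][actions]; }

    void update(int state, int action, double reward, int nextState) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : q[nextState]) best = Math.max(best, v);
        q[state][action] += alpha * (reward + gamma * best - q[state][action]);
    }

    int bestAction(int state) {
        int arg = 0;
        for (int a = 1; a < q[state].length; a++) if (q[state][a] > q[state][arg]) arg = a;
        return arg;
    }
}
```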
Polyglot on the JVM with Graal
What Went Wrong? Automatic Triage of Precision Loss During Static Analysis of JavaScript
Static analysis tools tend to have insufficient means to debug a complex notion such as precision, which in our experience leads to time-consuming human analysis. We propose to augment the analysis framework so that it keeps track of the loss of precision throughout the analysis. This precision-tracking information brings us one step closer to pinpointing the reason why our analysis fails. In this talk, we will detail our motivation for precision tracking and our experience with it, in the context of static analysis with the SAFE framework aimed at real-world JavaScript applications.
Flying and Decoupling Capacitance Optimization for Area-Constrained On-Chip Switched-Capacitor Voltage Regulators
Switched-capacitor (SC) voltage regulators are widely used in on-chip power management, due to their high efficiency at integer-ratio step-down and their feasibility for integration. Theoretical analysis and optimization for SC DC-DC converters have been presented in prior work; however, the optimization of the different capacitors, namely flying and input/output decoupling capacitors, in SC voltage regulators (SCVRs) under an area constraint has not been addressed. In this work, we propose a methodology to optimize flying and decoupling capacitance for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. Considering both conversion efficiency and droop voltage against fast load transients, the proposed model determines the optimal ratio between flying and decoupling capacitance for a fixed total area. These models are validated with integrated 2:1 SCVR implementations in both 65nm and 32nm CMOS. Experiments show high model accuracy on efficiency and droop modeling for a broad range of flying and decoupling capacitance. The maximum and average error of the predicted optimal ratio between flying and decoupling capacitance is 5% and 1.7%, respectively.
Who reordered my code?!
There is a hidden problem waiting as Ruby becomes 3x faster and starts to support parallel computation - reordering by JIT compilers and CPUs. In this talk, we’ll start by trying to optimize a few simple Ruby snippets. We’ll play the role of a JIT and a CPU and order operations as the rules of the system allow. Then we add a second thread to the snippets and watch it as it breaks horribly. In the second part, we’ll fix the unwanted reorderings by introducing a memory model to Ruby. We’ll discuss in detail how it fixes the snippets and how it can be used to write faster code for parallel execution.
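The failure mode the talk starts from can be reproduced on the JVM, which already has a memory model. In the Java sketch below, the two writes in publish() may be reordered by the JIT or the CPU unless `ready` is volatile, which is exactly the class of bug a Ruby memory model would rule out; this is an analogous JVM example, not one of the talk's Ruby snippets.

```java
// Classic publication race: without `volatile`, a reader may observe ready == true
// while data is still 0, because the two writes in publish() may be reordered.
final class Publication {
    int data = 0;
    volatile boolean ready = false;   // remove `volatile` and the snippet breaks

    void publish() {                  // thread 1
        data = 42;
        ready = true;
    }

    int consume() {                   // thread 2
        while (!ready) { /* spin */ }
        return data;                  // guaranteed to be 42 only because `ready` is volatile
    }
}
```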
A Tale of Two String Representations
Strings are used pervasively in Ruby. If we can make them faster, we can make many apps faster. In this talk, I will be introducing ropes: an immutable tree-based data structure for implementing strings. While an old idea, ropes provide a new way of looking at string performance and mutability in Ruby. I will describe how we replaced a byte array-oriented string representation with a rope-based one in JRuby+Truffle. Then we’ll look at how moving to ropes affects common string operations, its immediate performance impact, and how ropes can have cascading performance implications for apps.
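A minimal rope looks roughly like the Java sketch below: strings are immutable trees of leaf and concatenation nodes, so concatenation allocates one node in O(1) instead of copying bytes. This is only an illustration of the data structure, not JRuby+Truffle's representation.

```java
// Minimal rope sketch: leaves hold flat text, concatenation nodes reference two
// child ropes, and all nodes are immutable.
abstract class Rope {
    abstract int length();
    abstract char charAt(int index);

    static Rope of(String s) { return new Leaf(s); }
    Rope concat(Rope right) { return new Concat(this, right); }   // O(1): no copying

    static final class Leaf extends Rope {
        private final String value;
        Leaf(String value) { this.value = value; }
        int length() { return value.length(); }
        char charAt(int i) { return value.charAt(i); }
    }

    static final class Concat extends Rope {
        private final Rope left, right;
        private final int length;
        Concat(Rope left, Rope right) {
            this.left = left; this.right = right;
            this.length = left.length() + right.length();
        }
        int length() { return length; }
        char charAt(int i) {
            return i < left.length() ? left.charAt(i) : right.charAt(i - left.length());
        }
    }
}
```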
Using LLVM and Sulong for Language C Extensions
Many languages such as Ruby, Python and JavaScript support extension modules written in C, either for speed or to create interfaces to native libraries. Ironically, these extensions can hold back performance of the languages themselves because the native interfaces expose implementation details about how the language was first implemented, such as the layout of data structures. In JRuby+Truffle, an implementation of Ruby, we are using the Sulong LLVM bitcode interpreter to run C extensions on the JVM. By combining LLVM's static optimizations with dynamic compilation, Sulong is fast, but Sulong also gives us a powerful new tool - it allows us to abstract from normal C semantics and to appear to provide the same native API while actually mapping it to our own alternative data structures and implementation. We'll demonstrate Sulong and how we're using it to implement Ruby C extensions.
One Compiler: Deoptimization to Optimized Code
Deoptimization enables speculative compiler optimizations, which are an essential part of nearly every high-performance virtual machine (VM). But it comes with a cost: a separate first-tier interpreter or baseline compiler in addition to the optimizing compiler. Because such a first-tier execution uses a fixed stack frame layout, this affects all VM components that need to walk the stack. We propose to use the optimizing compiler also to compile deoptimization target code, i.e., the non-speculative code where execution continues after a deoptimization. Deoptimization entry points are described with the same scope descriptors used to describe the origin of the deoptimization, i.e., deoptimization is a two-way matching of two scope descriptors describing the same abstract frame. We use this deoptimization approach in a high-performance JavaScript VM written in Java. It strictly uses a one-compiler approach, i.e., all frames on the stack (VM runtime, first-tier execution in a JavaScript AST interpreter, dynamic compilation, deoptimization entry points) originate from the same compiler. Code with deoptimization entry points generated by the optimizing compiler imposes a much smaller overhead than a traditional first-tier execution.
Frappé Bug Trace Overview Slides for Prof Sukyoung Ryu (KAIST University)
Overview slides of the bug trace extensions to Frappé and the Frappé architecture.
Using Domain-Specific Languages for Analytic Graph Databases
Recently, graphs have been drawing a lot of attention, both as a natural data model that captures fine-grained relationships between data entities and as a tool for powerful data analysis that considers such relationships. In this paper, we present a new graph database system that integrates a robust graph storage with an efficient graph analytics engine. Primarily, our system adopts two domain-specific languages, one for describing graph analysis algorithms and the other for graph pattern matching queries. Compared to the API-based approaches in conventional graph processing systems, the DSL-based approach provides users with more flexible and intuitive ways of expressing algorithms and queries. Moreover, the DSL-based approach has significant performance benefits as well, by skipping (remote) API invocation overhead and by applying high-level optimizations in the compiler.
Ahead-of-time Compilation of FastR Functions Using Static Analysis
The FastR project delivers high peak-performance through the use of JIT-compilation, but cannot currently provide this performance for methods on first call. This especially affects startup-performance and performance of applications that only call functions once, possibly with large inputs (i.e. data processing). This project presents an approach and the necessary patterns for implementing an AOT-compilation facility within FastR, enabling compilation of call targets just before being first called. The AOT-compilation produces code that has profiling and specialization information tailored to the expected function argument values for the first call, without needing to execute the function in full. The performance results show a clear and unambiguous performance gain for first-call performance of AOT-compiled functions (up to 4x faster, excluding compilation time). Due to constant compilation time there is the potential for overall startup performance improvement for long-running functions even when compilation time is included. While the static analysis itself imposes almost no overhead, compilation times are up to 1.4x higher than with regularly compiled code, due to the inherent imprecision of the current analysis. Although peak performance is reduced, AOT-compilation can be the solution where faster first-call performance, the possibility of offloading/remote execution, and more performance predictability are important.
Asynchronous Memory Access Chaining
In-memory databases rely on pointer-intensive data structures to quickly locate data in memory. A single lookup operation in such data structures often exhibits long-latency memory stalls due to dependent pointer dereferences. Hiding the memory latency by launching additional memory accesses for other lookups is an effective way of improving performance of pointer-chasing codes (e.g., hash table probes, tree traversals). The ability to exploit such inter-lookup parallelism is beyond the reach of modern out-of-order cores due to the limited size of their instruction window. Instead, recent work has proposed software prefetching techniques that exploit inter-lookup parallelism by arranging a set of independent lookups into a group or a pipeline, and navigating their respective pointer chains in a synchronized fashion. While these techniques work well for highly regular access patterns, they break down in the face of irregularity across lookups. Such irregularity includes variable-length pointer chains, early exit, and read/write dependencies. This work introduces Asynchronous Memory Access Chaining (AMAC), a new approach for exploiting inter-lookup parallelism to hide the memory access latency. AMAC achieves high dynamism in dealing with irregularity across lookups by maintaining the state of each lookup separately from that of other lookups. This feature enables AMAC to initiate a new lookup as soon as any of the in-flight lookups complete. In contrast, the static arrangement of lookups into a group or pipeline in existing techniques precludes such adaptivity. Our results show that AMAC matches or outperforms state-of-the-art prefetching techniques on regular access patterns, while delivering up to 2.3x higher performance under irregular data structure lookups. AMAC fully utilizes the available micro-architectural resources, generating the maximum number of memory accesses allowed by hardware in both single- and multi-threaded execution modes.
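The scheduling idea can be sketched in plain Java for chained hash-table probes: each in-flight lookup keeps its own state, the loop visits the states round-robin, advancing each pointer chain by one node per visit, and a finished slot is immediately refilled with the next key. Java offers no software prefetch instruction, so this shows only the control structure, not the memory-level parallelism the paper exploits, and all names are illustrative.

```java
// Per-lookup state machines for hash-table probes, advanced round-robin so that a
// new lookup can start as soon as any in-flight lookup finishes.
final class AmacProbe {
    static final class Node {
        final int key, value; final Node next;
        Node(int key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    // Probes all keys against a chained hash table, keeping `group` lookups in flight.
    static long probeAll(Node[] buckets, int[] keys, int group) {
        int[] key = new int[group];
        Node[] cursor = new Node[group];
        boolean[] active = new boolean[group];
        int next = 0, live = 0;
        long sum = 0;

        for (int s = 0; s < group && next < keys.length; s++) {     // start the initial group
            key[s] = keys[next]; cursor[s] = buckets[Math.floorMod(key[s], buckets.length)];
            active[s] = true; next++; live++;
        }
        while (live > 0) {
            for (int s = 0; s < group; s++) {
                if (!active[s]) continue;
                Node n = cursor[s];
                if (n != null && n.key != key[s]) {                 // one dependent load per visit
                    cursor[s] = n.next;                             // (a real AMAC kernel prefetches here)
                    continue;
                }
                if (n != null) sum += n.value;                      // hit; a null cursor means a miss
                if (next < keys.length) {                           // refill the slot immediately
                    key[s] = keys[next]; cursor[s] = buckets[Math.floorMod(key[s], buckets.length)]; next++;
                } else {
                    active[s] = false; live--;
                }
            }
        }
        return sum;
    }
}
```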
Adaptive Detection Technique for Cache Based Side Channel Attack using Bloom Filter for Secure Cloud
Security is one of the main concerns in the field of cloud computing. Different users sharing the same physical machines, or even the same software, on a frequent basis makes the cloud vulnerable to many security threats. Side channel attacks are the most probable attacks in the cloud because of physical resource sharing. In the cloud, multiple Virtual Machines (VMs) sharing the same physical machine create a great opportunity to carry out a Cache-based Side Channel Attack (CSCA). In this paper, a novel detection technique for CSCA using a Bloom Filter (BF) is designed. This technique treats a cache miss sequence as a signature of a CSCA and uses a difference mean calculator to generate these signatures. The technique is adaptive, which makes it possible to detect CSCAs with new patterns that have not been observed yet. The Bloom filter is used to reduce the performance overhead to a minimum. The solution is implemented with a cache simulator and proved very effective, as its execution time is much lower than that of a CSCA.
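A minimal Bloom filter for remembering miss-pattern signatures can be sketched as below; the hash mixing and signature encoding are placeholders rather than the paper's, and the point is only that membership queries cost a few bit probes while false positives (but never false negatives) are possible.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into a fixed-size bit array.
final class SignatureBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SignatureBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size); this.size = size; this.hashes = hashes;
    }

    private int index(long signature, int i) {
        long h = signature * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;  // cheap mixing, illustrative
        return (int) Math.floorMod(h ^ (h >>> 31), (long) size);
    }

    void add(long signature) {
        for (int i = 0; i < hashes; i++) bits.set(index(signature, i));
    }

    // false means "definitely not seen"; true means "probably seen" (false positives possible)
    boolean mightContain(long signature) {
        for (int i = 0; i < hashes; i++) if (!bits.get(index(signature, i))) return false;
        return true;
    }
}
```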
The lights in the Tunnel: Coverage Analysis for Formal Verification
As formal verification engineers, the authors always face the challenge of knowing the current status of their test benches. Many questions need to be answered at certain stages of a project: do we need more assertions? Did we over-constrain inputs and thereby drop an important design scenario? Are proof bounds for bounded proofs good enough to catch potential design bugs? For the properties that are fully proven, do they cover the design logic that they were intended to cover? These four most-asked questions don’t have answers without extracting information from formal engines, which is not feasible for general users. However, like coverage from simulation-based verification, formal verification coverage can be defined and used as metrics to measure formal verification progress and completeness. In this paper, the authors will introduce formal verification coverage models and their usages through real-life examples. The four most-asked questions finally have reasonable and acceptable answers supported by these metrics.
Self-Specialising Interpreters and Partial Evaluation
Abstract syntax trees are a simple way to represent programs and to implement language interpreters. They can also be an easy way to produce high-performance dynamic compilers, by combining them with self-specialisation and partial evaluation. Self-specialisation allows the nodes in a program tree to rewrite themselves with more specialised variants in order to increase performance, such as replacing method calls with inline caches or replacing stronger operations with weaker ones based on profiled types. Partial evaluation can then take this specialised abstract syntax tree and produce optimised machine code based on it. We’ll show how these two techniques work and how they have been implemented by Oracle Labs in Truffle and Graal and used in implementations of languages including JavaScript, C, Ruby, R and more.
Malthusian Locks
Applications running in modern multithreaded environments are sometimes overthreaded. The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse. Often, such software also has highly contended locks. We opportunistically leverage the existence of such locks by modifying the lock admission policy so as to intentionally limit the number of distinct threads circulating over the lock in a given period. Specifically, if there are more threads circulating than are necessary to keep the lock saturated (continuously held), our approach will selectively cull and passivate some of those excess threads. We borrow the concept of swapping from the field of memory management and impose concurrency restriction (CR) if a lock is oversubscribed. The resultant admission order is unfair over the short term but we explicitly provide long-term fairness by periodically shifting threads between the set of passivated threads and those actively circulating. Our approach is palliative, but often effective, and in the worst case does no harm.
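A highly simplified sketch of concurrency restriction in front of an ordinary lock is shown below: at most `maxActive` threads circulate over the lock, surplus threads are passivated by parking, and one passivated thread is promoted every `promotionPeriod` releases to provide long-term fairness. The class names, constants, and promotion rule are illustrative, not the paper's lock implementation.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Concurrency restriction sketch: an admission gate limits how many distinct threads
// circulate over the underlying lock; excess threads are parked and rotated back in
// periodically for long-term fairness.
final class ConcurrencyRestrictedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final ConcurrentLinkedDeque<Thread> passive = new ConcurrentLinkedDeque<>();
    private final AtomicInteger active = new AtomicInteger();
    private final int maxActive;
    private final int promotionPeriod;
    private int releases;                               // only touched while holding `lock`

    ConcurrencyRestrictedLock(int maxActive, int promotionPeriod) {
        this.maxActive = maxActive; this.promotionPeriod = promotionPeriod;
    }

    void lock() {
        while (true) {
            int a = active.get();
            if (a < maxActive && active.compareAndSet(a, a + 1)) break;   // admitted
            passive.add(Thread.currentThread());        // passivate: park until promoted
            LockSupport.park(this);
            passive.remove(Thread.currentThread());     // woken (possibly spuriously): retry admission
        }
        lock.lock();
    }

    void unlock() {
        boolean promote = (++releases % promotionPeriod) == 0;
        lock.unlock();
        active.decrementAndGet();
        Thread t = promote ? passive.poll() : null;     // long-term fairness: rotate one thread in
        if (t != null) LockSupport.unpark(t);
    }
}
```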
Efficient analysis using Soufflé - An experience report
Souffle is an open-source programming framework for static program analysis. It enables the analysis designer to express static program analyses on very large code bases, such as a points-to analysis for the Java Development Kit (JDK), which has more than 1.5 million variables and 600 thousand call sites. Souffle employs a Datalog-like language as a domain-specific language for static program analysis. Its finite domain semantics lends itself to efficient execution on parallel hardware using various levels of program specialisation. A specialization hierarchy is applied to a Datalog program. As a result, highly specialized and optimised C++ code is produced that harvests the computational power of modern shared-memory/multi-core computer architectures. We have been using Souffle to explore and develop vulnerability detection analyses on the Java platform, using JDK 7, 8 and 9. These vulnerability detection analyses make use of points-to analysis (reusing parts of the DOOP framework), taint analysis, escape analysis, and other data flow-based analyses. In this talk we report on the types of analyses used, the sizes of the input relations and computed relations, as well as the runtime and memory requirements for the analyses of such large codebases. For the program specialization, we use several translation steps. In each translation step, new optimisation opportunities open up that could not be exploited in the previous translation step. The first translation uses a Futamura projection to translate a declarative Datalog program to an imperative relational program for an abstract machine which we call the Relational Algebra Machine (RAM). The RAM program contains relational algebra operations to compute results produced by clauses, relation management operations to keep track of previous, current and new knowledge in the semi-naive evaluation, and imperative constructs including statement composition for sequencing the operations, and loop constructs with loop exit conditions to express fixed-point computations for recursively-defined relations. It also has support for parallelism. The next translation step translates the optimized RAM program into a C++ program that uses meta-programming techniques with templates. The last translation step is performed by a C++ compiler, which compiles the generated C++ program into an executable binary. Operations for emptiness and existence checks, range queries, insertions and unions are highly efficient because portions of the operations are pushed from runtime to compile-time using meta-programming techniques. We now outline some of the novel aspects of the implementation of Souffle. The first is related to indices. Since indices are costly, a minimal set of indices for a given relation is desired. We solve a discrete optimization problem to create only the indices required for execution, and hence avoid redundancies. The second is the choice of data structures to represent large relations...
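The "previous, current and new knowledge" bookkeeping of semi-naive evaluation, which the RAM program manages explicitly, can be illustrated with a transitive-closure computation in plain Java; Souffle's generated C++ specializes and parallelizes this pattern, so the sketch below shows only the evaluation strategy.

```java
import java.util.HashSet;
import java.util.Set;

// Semi-naive evaluation of transitive closure: `total` holds all facts derived so far,
// `delta` holds the facts new in the current iteration, and the recursive rule joins
// only against the delta so each derivation is attempted once.
final class SemiNaiveTransitiveClosure {
    record Pair(int from, int to) {}

    static Set<Pair> closure(Set<Pair> edges) {
        Set<Pair> total = new HashSet<>(edges);
        Set<Pair> delta = new HashSet<>(edges);
        while (!delta.isEmpty()) {
            Set<Pair> next = new HashSet<>();
            for (Pair p : delta) {                 // path(x, z) :- delta(x, y), edge(y, z).
                for (Pair e : edges) {
                    if (p.to() == e.from()) {
                        Pair derived = new Pair(p.from(), e.to());
                        if (total.add(derived)) next.add(derived);
                    }
                }
            }
            delta = next;                          // only genuinely new facts feed the next round
        }
        return total;
    }
}
```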
One Compiler
The stack of a running Java HotSpot VM has stack frames from multiple compilers (the C compiler, the client compiler, and the server compiler) as well as bytecode interpreter stack frames. That complicates essential VM tasks (stack walking, garbage collection, and deoptimization), increases maintenance costs, and makes porting to new hardware architectures difficult. We argue that a single compiler is sufficient: Using the Graal compiler in different configurations, we can execute Java, JavaScript, and many other languages. The stack only contains a single kind of stack frame: frames from ahead-of-time compiled code, interpreter frames (from an ahead-of-time compiled AST interpreter), frames from just-in-time compiled code, and deoptimized frames (ahead-of-time compiled code with deoptimization entry points). In this talk, we outline the necessary components of such a streamlined system: deoptimization to compiled frames (in contrast to deoptimization to interpreter frames), access to low-level OS data structures directly from Java, and writing the whole runtime system (including the garbage collector) in Java.
Testing Security Properties in Java
In this paper we describe our initial experience of using mutation testing of Java programs to evaluate the quality of test suites from a security viewpoint. Our focus is on measuring the quality of the test suite associated with the Java Development Kit (JDK) because it provides the core security properties for all applications. We define security-specific mutation operators and determine their usefulness by executing some of the test suites that are publicly available. We summarise our findings and also outline some of the key challenges that remain before mutation testing can be used in practice.
Are We Ready for Secure Languages? (CurryOn presentation)
Language designers and developers want better ways to write good code — languages designed with simpler, more powerful abstractions accessible to a larger community of developers. However, language design does not seem to take into account security, leaving developers with the onerous task of writing attack-proof code. In 20 years, we have gone from 25 reported vulnerabilities to 6,883 vulnerabilities. We see some of the most common vulnerabilities happening in commonly used software — cross-site scripting, SQL injections, and buffer overflows. Attacks are becoming sophisticated, often exploiting three or four weaknesses, making it harder for developers to reason about the source of the problem. I’ll overview some recent attacks and argue that our languages must take security seriously. Languages need security-oriented constructs, and compilers must let developers know when there is a problem with their code. We need to empower developers with the concept of “security for the masses” by making available languages that do not necessarily require an expert in order to determine whether the code being written is vulnerable to attack or not.
Efficient and Thread-Safe Objects for Dynamically-Typed Languages
We are in the multi-core era. Dynamically-typed languages are in widespread use, but their support for multithreading still lags behind. One of the reasons is that the sophisticated techniques they use to efficiently represent their dynamic object models are often unsafe in multithreaded environments. This paper defines safety requirements for dynamic object models in multithreaded environments. Based on these requirements, a language-agnostic and thread-safe object model is designed that maintains the efficiency of sequential approaches. This is achieved by ensuring that field reads do not require synchronization and field updates only need to synchronize on objects shared between threads. Basing our work on JRuby+Truffle, we show that our safe object model has zero overhead on peak performance for thread-local objects and only 3% average overhead on parallel benchmarks where field updates require synchronization. Thus, it can be a foundation for safe and efficient multithreaded VMs for a wide range of dynamic languages.
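The core safety rule can be condensed into a few lines of Java: reads of an object's storage never synchronize, while writes take the object's monitor only once the object has been marked as shared. The real object model's storage layout, shape transitions, and sharing analysis are omitted from this sketch, and the names are illustrative.

```java
// Sketch of the "synchronize only shared objects, only on writes" rule.
final class DynObject {
    private volatile boolean shared;                 // set when the object escapes to another thread
    private final Object[] storage = new Object[8];  // fixed-size storage for simplicity

    Object read(int slot) {
        return storage[slot];                        // reads never synchronise
    }

    void write(int slot, Object value) {
        if (!shared) {                               // thread-local object: plain write
            storage[slot] = value;
        } else {
            synchronized (this) {                    // shared object: updates synchronise
                storage[slot] = value;
            }
        }
    }

    void markShared() { shared = true; }             // called when the object becomes globally reachable
}
```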
Gems: shared-memory parallel programming for Node.JS
JavaScript is the most popular programming language for client-side Web applications, and Node.js has popularized the language for server-side computing, too. In this domain, however, the minimal support for parallel programming remains a major limitation. In this paper we introduce a novel parallel programming abstraction called Generic Messages (GEMS). GEMS allow one to combine message passing and shared-memory parallelism, extending the classes of parallel applications that can be built with Node.js. GEMS have customizable semantics and enable several forms of thread safety, isolation, and concurrency control. GEMS are designed as convenient JavaScript abstractions that expose high-level and safe parallelism models to the developer. Experiments show that GEMS outperform equivalent Node.js applications thanks to their usage of shared memory.
Investigating the Performance of Hardware Transactions on a Multi-Socket Machine
The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.
Are We Ready For Secure Languages? (CurryOn slides)
Language designers and developers want better ways to write good code — languages designed with simpler, more powerful abstractions accessible to a larger community of developers. However, language design does not seem to take into account security, leaving developers with the onerous task of writing attack-proof code. In 20 years, we have gone from 25 reported vulnerabilities to 6,883 vulnerabilities. We see some of the most common vulnerabilities happening in commonly used software — cross-site scripting, SQL injections, and buffer overflows. Attacks are becoming sophisticated, often exploiting three or four weaknesses, making it harder for developers to reason about the source of the problem. I’ll overview some recent attacks and argue that our languages must take security seriously. Languages need security-oriented constructs, and compilers must let developers know when there is a problem with their code. We need to empower developers with the concept of “security for the masses” by making available languages that do not necessarily require an expert in order to determine whether the code being written is vulnerable to attack or not.
Toward a More Carefully Specified Metanotation
POPL is known for, among other things, papers that present formal descriptions and rigorous analyses of programming languages. But an important language has been neglected: the metanotation of inference rules and BNF that has been used in over 40% of all POPL papers to describe all the other programming languages. This metanotation is not completely described in any one place; rather, it is a folk language that has grown over the years, as paper after paper tries out variations and extensions. We believe that it is high time that the tools of the POPL trade be applied to the tools themselves. Examination of many POPL papers suggests that as the metanotation has grown, it has diversified to the point that problems are surfacing: different notations are in use for the same operation (substitution); the same notation is in use for different operations; and in some cases, notations for repetition are ambiguous, or require the reader to apply knowledge of semantics to interpret the syntax. All three problems present substantial potential for confusion. No individual paper is at fault; rather, this is the natural result of language growth in a community, producing incompatible dialects. We back these claims by presenting statistics from a survey of all past POPL papers, 1973–2016, and examples drawn from those papers. We propose a set of design principles for metanotation, and then propose a specific version of the metanotation that can be always interpreted in a purely formal, syntactic manner and yet is reasonably compatible with past use. Our goal is to lay a foundation for complete formalization and mechanization of the metanotation.
Modeling and Design of System-in-Package Integrated Voltage Regulator with Thermal Effects
This paper demonstrates a new approach to model the impact of thermal effects on the efficiency of integrated voltage regulators (IVRs) by combining analytical efficiency evaluations with coupled electrical and thermal simulations. An application of the approach shows that a system-in-package solution avoids thermal problems typically observed in other IVR designs. While the evaluation in this paper focuses on the thermal impact on loss in the inductor wiring and the PDN, the developed approach is general enough to also model thermal impacts on the power dissipation in the inductor cores and the buck converter chip.
Parfait Lessons Learnt
Slides for presentation at DECAF'16.
FastR presentation at RIOT 2016 workshop
This is the presentation about FastR at the RIOT 2016 workshop (which is organized by us). The audience consists of members of the core R group, developers of other implementations of the R language, and people developing tooling for R. The main focus of our presence in this workshop is to build credibility and show that we know what we're doing. The contents of this presentation are a combination of the usual Truffle interoperability and Graal introduction, a bit of compiler 101 (the audience does not have a CC background), some bits from our recent paper (2016-0523) and the presentation I gave at useR! (2016-0540).
High-performance R with FastR
R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. While these are straightforward to implement in an interpreter, it is hard to compile R functions to efficient bytecode or machine code. Consequently, applications that spend a lot of time in R code often have performance problems. Common solutions are to try to apply primitives to large amounts of data at once and to convert R code to a native language like C. FastR is a novel approach to solving R’s performance problem. It makes extensive use of the dynamic optimization features provided by the Truffle framework to remove the abstractions that the R language introduces, and can use the Graal compiler to create optimized machine code on the fly. This talk introduces FastR and the basic concepts behind Truffle’s optimization features. It provides examples of the language constructs that are particularly hard to implement using traditional compiler techniques, and shows how to use FastR to improve performance without compromising on language features.
Audio/Video recording of "Zero-Overhead Integration of R, JS, Ruby and C/C++"
Presentation about FastR and language interoperability at the useR! 2016 conference. Stanford is asking for permission to record presentations and publish those recordings.
Zero-Overhead Integration of R, JS, Ruby and C/C++
Presentation about FastR and language interoperability at the useR! 2016 conference.
EPA: A Precise and Scalable Object-Sensitive Points-to Analysis for Large Programs
Points-to analysis is a fundamental static program analysis technique for tools including compilers and bug-checkers. There are several kinds of points-to analyses that trade off precision with runtime. For object-oriented languages including Java, "context-sensitivity" is key to obtaining sufficient precision. A context may be parameterizable, and may consider calls, objects, and types for its construction. Although points-to analysis research has received a lot of attention in the past, scaling object-sensitive points-to analysis for large Java code bases still remains an open research challenge. In this paper, we develop an Eclectic Points-To Analysis (EPA) framework that computes an efficient, selective, object-sensitive points-to analysis that is client independent. This framework parameterizes context sensitivities for different allocation sites in the program. The level of required sensitivity is determined by a pre-analysis. We have implemented our approach using Souffle (a Datalog compiler) and an extension of the DOOP framework. Our experiments on large programs including OpenJDK and Jython show that our technique is efficient and highly precise. For the OpenJDK, an instance of the EPA-based analysis reduces runtime by 27% for a slight loss of precision, while for Jython, the same analysis reduces runtime by 82% for almost no loss of precision.
Model Checking Cache Coherence in System-Level Code
Cache coherence is a key consistency requirement between the shared main memory and individual caches for a multiprocessor framework. Several months ago, we started a project to verify the cache coherence of a system-level C codebase (50,000+ lines), which runs in an environment that does not provide hardware-level guarantees, requiring programmers to ensure correct cache coherence manually through explicit FLUSH and INVALIDATE operations. After initial evaluation and comparison of many model checking tools, we believe that SPIN is the most suitable one. However, pure model checking is not sufficiently scalable to verify such a large codebase. Therefore, we are currently investigating a hybrid model checking solution with some static analysis techniques to reduce the model size via abstraction and program slicing, and restrict the interleavings explored. In this talk, we will share our model checking experiences. In particular, we will discuss (1) our evaluation of different model checking tools, (2) the Promela model we use to verify the cache coherence, (3) initial model checking experience for verifying the coherence in concurrent quicksort algorithm, and (4) the automatic model extraction from large codebase in C.
Fortress Features and Lessons Learned
Slides for an invited keynote talk on June 22, 2016, at the 2016 JuliaCon conference to be held at MIT at the Stata Center. This is an overview of the Fortress programming language, with some comparison to Scala. Many of the slides are taken from two previously approved slide sets (Archivist 2012-0104 and 2012-0284), but some have been updated, and some new slides have been created.
Theorem Proving with ACL2 for Industry Artifacts
This is a set of slides to be presented at PSSV 2016, a workshop in St Petersburg, Russia. The slides present a small part of our formal verification work on SPARC processors and Java programs.
Fast non-intrusive memory reclamation for highly-concurrent data structures
Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. These proposals include (1) exploiting the operating system's memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard lookaside buffer (HLB) that allows a reclaiming thread to query whether there are hazardous pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.
Truffle Tutorial: One VM to Rule Them All
Forget “this language is fast”, “this language has the libraries I need”, and “this language has the tool support I need”. The Truffle framework for implementing managed languages in Java gives you native performance, multi-language integration with all other Truffle languages, and tool support - all of that by just implementing an abstract syntax tree (AST) interpreter in Java. Truffle applies AST specialization during interpretation, which enables partial evaluation to create highly optimized native code without the need to write a compiler specifically for a language. The Java VM contributes high-performance garbage collection, threads, and parallelism support. This tutorial is both for newcomers who want to learn the basic principles of Truffle, and for people with Truffle experience who want to learn about recently added features. It presents the basic principles of the partial evaluation used by Truffle and the Truffle DSL used for type specializations, as well as features that were added recently such as the language-agnostic object model, language integration, and debugging support. Oracle Labs and external research groups have implemented a variety of programming languages on top of Truffle, including JavaScript, Ruby, R, Python, and Smalltalk. Several of them already exceed the performance of the best previously existing implementations of those languages.
Using an Accurate Multi-Mode Chip Power Model to Analyze Power Integrity Differences between On-Board Voltage Regulator Modules (VRMs) and In-Package Integrated Voltage Converts (IVRs)
This is a poster for Design Automation Conference (DAC) 2016. No Oracle proprietary information is shared. This poster is based on the slides in OL# 2016-0016 that have already been approved for external publishing.
Unifying Access Control & Information Flow: A Security Model for Programs Consisting of Trusted and Untrusted Code
We introduce a security model based on dual access control labels (called DAC) that enables both confidentiality and integrity to be expressed in the same program. It is developed in the context of object-oriented languages and considers implicit flows arising from both branching and dynamic dispatch. Our DAC model overcomes the limitations of classical access control models such as those based on stack inspection. Our security model is, in general, neither transitive nor reflexive, and it considers both confidentiality and integrity. Traditional lattice-based security models are a special case of our security model. We show that our model satisfies a non-interference theorem. The theorem simultaneously guarantees that a) from a confidentiality perspective, an attacker cannot distinguish the low-level values associated with two computations that have different high-level inputs, and b) from an integrity perspective, an attacker cannot distinguish the high-level values associated with two computations that have different low-level inputs. We also show that one can give the necessary security guarantees via a static program analysis.
Sulong: Memory Safe and Efficient Execution of LLVM-Based Languages
Memory errors in C/C++ can allow an attacker to read sensitive data, corrupt the memory, or crash the executing process. The well-known list of the top 25 most dangerous software errors published by the SANS Institute, as well as recent security disasters such as Heartbleed, show how important it is to tackle memory safety for C/C++. We present Sulong, an efficient interpreter for LLVM-based languages that runs on the JVM. Sulong guarantees memory safety for C/C++ and other LLVM-based languages by using managed allocations and automatic memory management. Through dynamic compilation, Sulong aims to achieve peak performance close to that of state-of-the-art compilers such as GCC or Clang, which do not produce memory-safe code. By efficiently implementing memory safety, Sulong strives to be a real-world solution for mitigating software security problems.
Sulong - Execution of LLVM-Based Languages on the JVM
For the last decade, the Java Virtual Machine (JVM) has been a popular platform to host languages other than Java. Language implementation frameworks like Truffle allow the implementation of dynamic languages such as JavaScript or Ruby with competitive performance and completeness. However, statically typed languages are still rare under Truffle. We present Sulong, an LLVM IR interpreter that brings all LLVM-based languages, including C, C++, and Fortran, to the JVM in one stroke. Executing these languages on the JVM opens up a wide area of future research, including high-performance interoperability between high-level and low-level languages, combination of static and dynamic optimizations, and memory-safe execution of otherwise unsafe and unmanaged languages.
An Experience Report: Efficient Analysis using Souffle
This abstract summarizes the key aspects of Souffle, which is an open-source Datalog engine used for static program analysis. It describes the overall approach of translating Datalog to C++ using an abstract machine and staged compilation. The novel aspects in Souffle include auto-index generation, representation of large relations, and techniques to exploit caches and parallel cores. It also identifies the issues of query planning and improved parallelism that need further exploration. The presentation will also include our experience in using Souffle in the context of vulnerability detection using points-to and other data flow based analyses.
The Parfait Static Code Analysis Framework -- Lessons Learnt
The Parfait static code analyser was conceived at Sun Labs, now Oracle Labs, in 2007. At the time, the project focused on the detection of defects in C/C++ code. Over the next five years, Parfait matured to include detection of vulnerabilities (not just defects) in C/C++ and Java while meeting the performance and precision standards expected of a commercial tool: Parfait can analyse an operating system codebase of 11 million lines of C code for 39 of the most common defect types in 1.5 hours with a false positive rate of 10%. Today, Parfait is maintained by Oracle as an internal product and is used by thousands of developers at Oracle worldwide.
Finding Glitches Using Formal Methods
Slides for presentation in Industrial Track at Async 2016 Conference (IEEE International Symposium on Asynchronous Circuits and Systems, May 2016). Authors: Yan Peng, Ian W. Jones, and Mark Greenstreet. The increasing scale and complexity of integrated circuits leads to many departures from a pure, synchronous design methodology. Clock-domain crossings, multi-cycle paths, and circuits for test with long combinational logic delays introduce vulnerabilities for glitch-related failures. Conventional simulation techniques can miss glitches because of the large number of value and timing scenarios. We have tried several commercially available tools but have not found a comprehensive solution. This paper presents a concise statement of what it means for a logic circuit to be “glitch free”. This property can be verified using satisfiability solvers. We present our implementation using the ACL2 theorem proving system and some experimental results. This is the final version of the slides presented at Async 2016. They are a slightly trimmed-down version of the slides that have previously been cleared, OL# 2016-0105.
ICT 3612/7204 Database Systems - Graph Databases
Slides for a lecture on Graph Databases at Griffith University (Brisbane, Australia) for third year undergraduate course on Database Management (ICT 3612/7204).
Specializing Ropes for Ruby
Ropes are a data structure for representing character strings via a binary tree of operation-labeled nodes. Both the nodes and the trees constructed from them are immutable, making ropes a persistent data structure. Ropes were designed to perform well with large strings, and in particular, concatenation of large strings. We present our findings in using ropes to implement mutable strings in JRuby+Truffle, an implementation of the Ruby programming language using a self-specializing abstract syntax tree interpreter and dynamic compilation. We extend ropes to support Ruby language features such as encodings and refine operations to better support typical Ruby programs. We also use ropes to work around underlying limitations of the JVM platform in representing strings. Finally, we evaluate the performance of our implementation of ropes and demonstrate that they perform 0.9x – 9.4x as fast as byte array-based string representations in representative benchmarks.
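For readers unfamiliar with ropes, the following self-contained Java sketch shows the basic structure the abstract refers to: immutable leaf and concatenation nodes, so concatenation is O(1) and character access walks the tree. It is illustrative only and omits the Ruby-specific extensions (encodings, refined operations) discussed in the paper.

```java
// A minimal rope sketch (not the JRuby+Truffle implementation): immutable
// nodes, O(1) concatenation, and charAt by walking the tree.
abstract class Rope {
    abstract int length();
    abstract char charAt(int i);

    static Rope of(String s) { return new Leaf(s); }
    Rope concat(Rope other)  { return new Concat(this, other); }

    static final class Leaf extends Rope {
        final String s;
        Leaf(String s) { this.s = s; }
        int length() { return s.length(); }
        char charAt(int i) { return s.charAt(i); }
    }

    static final class Concat extends Rope {
        final Rope left, right; final int len;
        Concat(Rope l, Rope r) { left = l; right = r; len = l.length() + r.length(); }
        int length() { return len; }
        char charAt(int i) { return i < left.length() ? left.charAt(i) : right.charAt(i - left.length()); }
    }

    public static void main(String[] args) {
        Rope r = Rope.of("Hello, ").concat(Rope.of("world"));  // no character data is copied
        System.out.println(r.length() + " " + r.charAt(7));    // prints: 12 w
    }
}
```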
Combining speculative optimizations with flexible scheduling of side-effects
Speculative optimizations allow compilers to optimize code based on assumptions that cannot be verified at compile time. Taking advantage of the specific run-time situation opens up more optimization possibilities. Speculative optimizations are key to the implementation of high-performance language runtimes. Using them requires cooperation between the just-in-time compiler and the runtime system and influences the design and the implementation of both. New speculative optimizations, as well as their application to more dynamic languages, stress these systems far more than current implementations were designed for. We first quantify the run-time and memory-footprint costs caused by their use. We then propose a compiler structure that separates the compilation process into two stages, which helps to deal with these issues without giving up on other traditional optimizations. In the first stage, floating guards can be inserted for speculative optimizations; the guards are then fixed in the control flow at appropriate positions. In the second stage, side-effecting instructions can be moved or reordered. Using this framework we present two optimizations that help reduce the run-time costs and the memory footprint. We study the effects of both stages as well as the effects of these two optimizations in the Graal compiler. We evaluate this on classical benchmarks targeting the JVM: SPECjvm2008, DaCapo, and Scala-DaCapo. We also evaluate JavaScript benchmarks running on the Truffle platform, which uses the Graal compiler. We find that combining both stages can bring up to 84% improvement in performance (9% on average), and our optimization of memory footprint can reduce memory usage by 27% to 92% (45% on average).
Taurus: A Holistic Language Runtime System for Coordinating Distributed Managed-Language Applications
Many distributed workloads in today’s data centers are written in managed languages such as Java or Ruby. Examples include big data frameworks such as Hadoop, data stores such as Cassandra or applications such as the SOLR search engine. These workloads typically run across many independent language runtime systems on different nodes. This setup represents a source of inefficiency, as these language runtime systems are unaware of each other. For example, they may perform Garbage Collection at times that are locally reasonable but not in a distributed setting. We address these problems by introducing the concept of a Holistic Runtime System that makes runtime-level decisions for the entire distributed application rather than locally. We then present Taurus, a Holistic Runtime System prototype. Taurus is a JVM drop-in replacement, requires almost no configuration and can run unmodified off-the-shelf Java applications. Taurus enforces user-defined coordination policies and provides a DSL for writing these policies. By applying Taurus to Garbage Collection, we demonstrate the potential of such a system and use it to explore coordination strategies for the runtime systems of real-world distributed applications, to improve application performance and address tail-latencies in latency-sensitive workloads.
Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches that require language identification, our method is agnostic about the languages in which the documents in the corpus are expressed, and does not rely on parallel corpora to constrain the spaces. Instead we utilize a small set of human-provided word translations, which are often freely and readily available. We can encode such word translations as hard constraints in the model’s objective functions; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code switching so that words across multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, the common structure allows an NLP model trained in a source language to generalize to multiple target languages (achieving up to 80% of the accuracy of models trained with target language data).
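As a rough illustration of what artificial code switching does to the training text (this is not the paper's pipeline; the dictionary, probability, and sentence below are made up), consider:

```java
import java.util.*;

// Illustrative sketch of artificial code switching: with some probability,
// replace a word by a dictionary translation, so words from both languages
// end up sharing contexts in the embedding training corpus.
public class CodeSwitchDemo {
    public static void main(String[] args) {
        Map<String, String> enToEs = Map.of("house", "casa", "dog", "perro", "big", "grande");
        String sentence = "the big dog sleeps in the house";
        Random rnd = new Random();
        double p = 0.5;                      // switching probability (a free parameter)

        StringBuilder out = new StringBuilder();
        for (String w : sentence.split(" ")) {
            String translation = enToEs.get(w);
            out.append(translation != null && rnd.nextDouble() < p ? translation : w).append(' ');
        }
        System.out.println(out.toString().trim());   // e.g. "the grande dog sleeps in the casa"
    }
}
```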
Nesoi: compile time checking of transactional coverage in parallel programs.
In this paper we describe our implementation of Nesoi, a tool for statically checking the transactional requirements of a program. Nesoi categorizes the fields of each instance of an object in the program and reports missing and unrequired transactions at compile time. As transactional requirements are detected at the level of object fields in independent object instances, the fields that need to be considered for possible collisions in a transaction can be cleanly identified, reducing the possibility of false collisions. Running against a set of benchmarks, these fields account for just 2.5% of reads and 17-31% of writes within a transaction. Nesoi is constructed as a plugin for the Scala compiler and is integrated with the dataflow libraries used in the Teraflux project to provide support both for conventional programming modes and for the dataflow + transactions model of the Teraflux project.
Attribute Extraction from Noisy Text Using Character-based Sequence Tagging Models
Attribute extraction is the problem of extracting structured key-value pairs from unstructured data. Similar entity recognition problems are usually solved as a sequence labeling task in which the elements of the sequence are word tokens. While word tokens are suitable for newswire, for many types of data—from social media text to product descriptions—word tokens are problematic because simple regular-expression-based word tokenizers cannot accurately tokenize text that is inconsistently spaced. Instead, we propose a character-based sequence tagging approach that jointly tokenizes and tags tokens. We find that the character-based approach is surprisingly accurate both at tokenizing words and at inferring labels. We also propose an end-to-end system that uses pairwise entity linking models for normalizing the extracted values.
Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching
We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. Our method is agnostic about the languages in which the documents in the corpus are expressed, and does not rely on parallel corpora to constrain the spaces. Instead we utilize a small set of human-provided word translations to artificially induce code switching, thus allowing words in multiple languages to appear in contexts together and share distributional information. We evaluate the embeddings on a new multilingual word analogy dataset. We also find that our embeddings allow an NLP model trained in one language to generalize to another, achieving up to 80% of the accuracy of an in-language model.
CodeSurveyor: Mapping Large-Scale Software to Aid in Code Comprehension
Large codebases — in the order of millions of lines of code (MLOC) — are incredibly complex. Whether fixing a fault, or implementing a new feature, changes to such systems often have unanticipated effects, as it is impossible for a developer to maintain a complete understanding of the code in their head. This paper presents CodeSurveyor, a spatial visualization technique that aims to support code comprehension in large codebases by allowing developers to view large-scale software at all levels of abstraction. It uses a cartographic metaphor to produce an interactive map of a codebase where users can zoom from a view of a system’s high-level architectural components, represented as continents, down to the individual source files and the entities they define, shown as countries and states, respectively. The layout of the produced code map incorporates system dependency data and sizes regions according to a user-configurable metric (line count by default), to create distinctive shapes and positions that serve as strong visual landmarks and keep users oriented. We detail the CodeSurveyor algorithm, show it generates code maps of the Linux kernel (1.4 MLOC) in 1.5 minutes, and evaluate the intuitiveness of the metaphor to software developers and its utility in navigation tasks. Results show the effectiveness of the approach with developers of varying experience levels.
An Efficient and Generic Event-based Profiler Framework for Dynamic Languages
Profilers help programmers analyze their programs and identify performance bottlenecks. We implement a profiler framework that helps compare and analyze programs that implement the same algorithms in different languages. Profiler implementers typically replicate common functionality in each language's profiler; we focus on building a generic profiler framework for dynamic languages to minimize this recurring implementation effort. We implement our profiler in a framework that optimizes abstract syntax tree (AST) interpreters using a just-in-time (JIT) compiler. We evaluate it on ZipPy and JRuby+Truffle, the Python and Ruby implementations in this framework, respectively. We show that our profiler runs faster than the existing profilers in these languages and requires modest implementation effort. Our profiler serves three purposes: 1) it helps users find the bottlenecks in their programs, 2) it helps language implementers improve the performance of their language implementations, and 3) it helps compare and evaluate different languages on cross-language benchmarks.
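A stripped-down version of the event-based idea, where every profiled construct funnels through a generic counting wrapper, might look like the plain-Java sketch below. It is illustrative only; the actual framework instruments Truffle AST nodes and relies on the JIT compiler to remove the overhead.

```java
// A minimal sketch of an event-based profiler (illustrative only): each node
// execution is funneled through a generic "profiler node" that counts
// invocations and accumulated time for its wrapped node.
public class ProfilerDemo {
    interface Node { int execute(int x); }

    static final class ProfiledNode implements Node {
        final String name; final Node child;
        long invocations, nanos;
        ProfiledNode(String name, Node child) { this.name = name; this.child = child; }
        public int execute(int x) {
            long t0 = System.nanoTime();
            int r = child.execute(x);
            nanos += System.nanoTime() - t0;
            invocations++;
            return r;
        }
    }

    public static void main(String[] args) {
        ProfiledNode square = new ProfiledNode("square", v -> v * v);
        int sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += square.execute(i & 15);
        System.out.println(sum + " | " + square.name + ": " + square.invocations
                + " calls, " + square.nanos / 1_000_000 + " ms");
    }
}
```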
Join Size Estimation Subject to Filter Conditions
In this paper, we present a new algorithm for estimating the size of equality join of multiple database tables. The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can then be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions, possibly specified over multiple columns (attributes) of each table. This algorithm makes a single pass over the data and is thus suitable for streaming scenarios. We compare this algorithm analytically to two other previously known sampling approaches (independent Bernoulli Sampling and End-Biased Sampling) and to a novel sketch-based approach. We also compare these four algorithms experimentally and show that results fully correspond to our analytical predictions based on derived expressions for the estimator variances, with Correlated Sampling giving the best estimates in a large range of situations.
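A minimal sketch of the correlated-sampling idea as we read it (illustrative, without filter conditions or the variance analysis): both tables sample a tuple when a shared hash of its join key falls below p, so tuples that would join are sampled together, and the sampled join size scaled by 1/p estimates the true join size.

```java
import java.util.*;

// Illustrative correlated-sampling sketch: a shared hash on the join key
// decides sampling in both tables, so matching tuples are kept together.
public class CorrelatedSamplingDemo {
    static double hash01(int key) {                       // shared hash mapped into [0,1)
        long h = key * 0x9E3779B97F4A7C15L;
        return (h >>> 11) / (double) (1L << 53);
    }

    static long estimateJoinSize(int[] r, int[] s, double p) {
        Map<Integer, Long> rSampleCounts = new HashMap<>();
        for (int k : r) if (hash01(k) < p) rSampleCounts.merge(k, 1L, Long::sum);
        long sampledJoin = 0;
        for (int k : s) if (hash01(k) < p) sampledJoin += rSampleCounts.getOrDefault(k, 0L);
        return Math.round(sampledJoin / p);               // scale up by the sampling rate
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] r = rnd.ints(100_000, 0, 1000).toArray();
        int[] s = rnd.ints(100_000, 0, 1000).toArray();
        // With uniform keys the true join size is about 10,000,000.
        System.out.println("estimated join size: " + estimateJoinSize(r, s, 0.1));
    }
}
```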
Breaking Payloads with Runtime Code Stripping and Image Freezing
Fighting off attacks based on memory corruption vulnerabilities is hard, and a great deal of research has been and is being conducted in this area. In our recent work we take a different approach and look into breaking the payload of an attack. Current attacks assume that they have access to every piece of code and the entire platform API. In this talk, we present a novel defensive strategy that targets this assumption. We built a system that removes unused code from an application process to prevent attacks from using code and APIs that would otherwise be present in the process memory but normally are not used by the actual application. Our system is only active during process creation time and, therefore, incurs no runtime overhead and thus no performance degradation. Our system does not modify any executable files or shared libraries, as all actions are executed in memory only. We implemented our system for Windows 8.1 and tested it on real-world applications. Besides presenting our system, we also show the results of our investigation into code overhead present in current applications.
Callisto-RTS: Fine-Grain Parallel Loops
We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-memory machines. It supports very fine-grained scheduling of parallel loops—down to batches of work of around 1K cycles. Fine-grained scheduling helps avoid load imbalance while reducing the need for tuning workloads to particular machines or inputs. We use per-core iteration counts to distribute work initially, and a new asynchronous request combining technique for when threads require more work. We present results using graph analytics algorithms on a 2-socket Intel 64 machine (32 h/w contexts), and on an 8-socket SPARC machine (1024 h/w contexts). In addition to reducing the need for tuning, on the SPARC machine we improve absolute performance by up to 39% (compared with OpenMP). On both architectures Callisto-RTS provides improved scaling and performance compared with a state-of-the-art parallel runtime system (Galois).
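The per-core distribution idea can be pictured with the following toy Java sketch. It is illustrative only; Callisto-RTS schedules batches of roughly 1K cycles and uses asynchronous request combining rather than this naive stealing loop, and all names and numbers here are invented.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy fine-grained loop scheduler: each worker claims small batches from its
// own per-worker iteration counter first, then steals leftover batches.
public class FineGrainLoopDemo {
    static void parallelFor(int totalIters, int batch, int workers, java.util.function.IntConsumer body)
            throws InterruptedException {
        int per = totalIters / workers;
        AtomicLong[] next = new AtomicLong[workers];
        long[] end = new long[workers];
        for (int w = 0; w < workers; w++) {
            next[w] = new AtomicLong((long) w * per);
            end[w] = (w == workers - 1) ? totalIters : (long) (w + 1) * per;
        }
        Thread[] ts = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int self = w;
            ts[w] = new Thread(() -> {
                for (int v = 0; v < workers; v++) {              // own range first, then steal
                    int victim = (self + v) % workers;
                    long start;
                    while ((start = next[victim].getAndAdd(batch)) < end[victim]) {
                        long stop = Math.min(start + batch, end[victim]);
                        for (long i = start; i < stop; i++) body.accept((int) i);
                    }
                }
            });
            ts[w].start();
        }
        for (Thread t : ts) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong sum = new AtomicLong();
        parallelFor(1_000_000, 1024, 4, i -> sum.addAndGet(i));
        System.out.println(sum.get()); // 499999500000
    }
}
```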
Shoal: smart allocation and replication of memory for parallel programs
Modern NUMA multi-core machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program’s access patterns. However, sub-optimal allocation can significantly impact the performance of parallel programs. We present an array abstraction that allows data placement to be automatically inferred from program analysis, and implement the abstraction in Shoal, a runtime library for parallel programs on NUMA machines. In Shoal, arrays can be automatically replicated, distributed, or partitioned across NUMA domains based on annotating memory allocation statements to indicate access patterns. We further show how such annotations can be automatically provided by compilers for high-level domain-specific languages (for example, the Green-Marl graph language). Finally, we show how Shoal can exploit additional hardware such as programmable DMA copy engines to further improve parallel program performance. We demonstrate significant performance benefits from automatically selecting a good array implementation based on memory access patterns and machine characteristics. We present two case studies: (i) Green-Marl, a graph analytics workload using automatically annotated code based on information extracted from the high-level program and (ii) a manually-annotated version of the PARSEC Streamcluster benchmark.
Building Debuggers and Other Tools: We Can “Have it All” (Position Paper)
Software development tools that “instrument” running programs, notably debuggers, are presumed to demand difficult tradeoffs among performance, functionality, implementation complexity, and user convenience. A fundamental change in our thinking about such tools makes that presumption obsolete. By building instrumentation directly into the core of a high-performance language implementation framework, tool support can be always on, with confidence that optimization will apply uniformly to instrumentation and result in near zero overhead. Tools can be always available (and fast), not only for end user programmers, but also for language implementors throughout development.
Snippets: Taking the High Road to a Low Level
When building a compiler for a high-level language, certain intrinsic features of the language must be expressed in terms of the resulting low-level operations. Complex features are often expressed by explicitly weaving together bits of low-level IR, a process that is tedious, error prone, difficult to read, difficult to reason about, and machine dependent. In the Graal compiler for Java, we take a different approach: we use snippets of Java code to express semantics in a high-level, architecture-independent way. Two important restrictions make snippets feasible in practice: they are compiler specific, and they are explicitly prepared and specialized. Snippets make Graal simpler and more portable while still capable of generating machine code that can compete with other compilers of the Java HotSpot VM.
The Judgment of Forseti: Economic Utility for Dynamic Heap Sizing of Multiple Runtimes
We introduce the FORSETI system, which is a principled approach for holistic memory management. It permits a sysadmin to specify the total physical memory resource that may be shared between all concurrent virtual machines on a physical node. FORSETI models the heap size versus application throughput for each virtual machine, and seeks to maximize the combined throughput of the set of VMs based on concepts from economic utility theory. We evaluate the FORSETI system using a standard Java managed runtime, i.e. OpenJDK. Our results demonstrate that FORSETI enables dramatic reductions (up to 5x) in heap footprint without compromising application execution times.
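The economic-utility framing can be illustrated with a toy greedy allocator that hands each increment of memory to the VM whose throughput model gains the most from it. The curves, numbers, and policy below are invented for illustration; FORSETI's models and optimization are more sophisticated.

```java
import java.util.function.DoubleUnaryOperator;

// Toy utility-driven heap sizing: repeatedly give the next chunk of physical
// memory to whichever VM gains the most throughput from it (marginal utility).
public class HeapSizingDemo {
    public static void main(String[] args) {
        // Assumed throughput-vs-heap-size curves (ops/s as a function of MB).
        DoubleUnaryOperator[] throughput = {
            h -> 1000 * (1 - Math.exp(-h / 512.0)),   // VM 0: benefits from a large heap
            h -> 400  * (1 - Math.exp(-h / 128.0)),   // VM 1: saturates early
        };
        double[] heap = {128, 128};                    // minimum heaps (MB)
        double budget = 2048, step = 64;               // total memory, allocation grain

        for (double used = heap[0] + heap[1]; used + step <= budget; used += step) {
            int best = 0; double bestGain = -1;
            for (int i = 0; i < heap.length; i++) {
                double gain = throughput[i].applyAsDouble(heap[i] + step)
                            - throughput[i].applyAsDouble(heap[i]);
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            heap[best] += step;                        // greedy marginal-utility choice
        }
        System.out.printf("VM0 heap=%.0fMB, VM1 heap=%.0fMB%n", heap[0], heap[1]);
    }
}
```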
Formal Model Checking: From Oblivion to a Pillar of Success
This session will describe how formal methods went from being used opportunistically to occupying a central place in the verification methodology of the RAPID SoC, helping the Oracle team achieve its verification goals of finishing on schedule and achieving first-pass silicon success.
The Influence of Malloc Placement on TSX Hardware Transactional Memory.
The hardware transactional memory (HTM) implementation in Intel's i7-4770 "Haswell" processor tracks the transactional read-set in the L1 (level-1), L2 (level-2) and L3 (level-3) caches and the write-set in the L1 cache. Displacement or eviction of read-set entries from the cache hierarchy or write-set entries from the L1 results in an abort. We show that the placement policies of dynamic storage allocators -- such as those found in common "malloc" implementations -- can influence the conflict miss rate in the L1. Conflict misses -- sometimes called mapping misses -- arise because of less than ideal associativity and represent an imbalanced distribution of active memory blocks over the set of available L1 indices. Under transactional execution, conflict misses may manifest as aborts, representing wasted or futile effort instead of the simple stall that would occur in normal execution mode. Furthermore, when HTM is used for transactional lock elision (TLE), persistent aborts arising from conflict misses can force the offending thread through the so-called "slow path". The slow path is undesirable as the thread must acquire the lock and run the critical section in normal execution mode, precluding the concurrent execution of threads in the "fast path" that monitor that same lock and run their critical sections in transactional mode. For a given lock, multiple threads can concurrently use the transactional fast path, but at most one thread can use the non-transactional slow path at any given time. Threads in the slow path preclude safe concurrent fast path execution. Aborts arising from placement policies and L1 index imbalance can thus result in loss of concurrency and reduced aggregate throughput.
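The index-imbalance effect follows from the cache geometry. Assuming a typical L1 data cache with 64-byte lines and 64 sets (illustrative numbers, not a claim about any particular part), the set index is just a bit-field of the address, so allocations placed at 4 KiB strides all compete for the same set and its few ways:

```java
// Small sketch of why allocator placement matters: with 64-byte lines and 64
// sets, the L1 set index is (addr >> 6) & 63, so 4 KiB-strided blocks collide.
public class CacheIndexDemo {
    static int l1Set(long addr) { return (int) ((addr >>> 6) & 63); }

    public static void main(String[] args) {
        long base = 0x7f00_0000_0000L;
        for (int i = 0; i < 4; i++) {
            long a = base + i * 4096L;     // 4 KiB-strided "allocations"
            System.out.printf("block %d at 0x%x -> L1 set %d%n", i, a, l1Set(a));
        }
    }
}
```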
Trash Day: Coordinating Garbage Collection in Distributed Systems
Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications. In this paper, we show that distributed applications suffer from each node’s language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes. We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.
Making meaningful decisions about time, workload and pedagogy in the digital age: the Course Resource Appraisal Model
This article reports on a design-based research project to create a modelling tool to analyse the costs and learning benefits involved in different modes of study. The Course Resource Appraisal Model (CRAM) provides accurate cost-benefit information so that institutions are able to make more meaningful decisions about which kind of courses—online, blended or traditional face-to-face—make sense for them to provide. The tool calculates the difference between expenses and income over three iterations of the course and presents a pedagogical analysis of the learning experience provided. The article draws on a CRAM analysis of the costs and learning benefits of a massive open online course to show how the tool can illuminate the pedagogical and financial viability of a course of this kind.
Architectural support for task scheduling: hardware scheduling for dataflow on NUMA systems.
To harness the compute resources of many-core systems with tens to hundreds of cores, applications have to expose parallelism to the hardware. Researchers are aggressively looking for program execution models that make it easier to expose parallelism and use the available resources. One common approach is to decompose a program into parallel 'tasks' and allow an underlying system layer to schedule these tasks to different threads. Software-only schedulers can implement various scheduling policies and algorithms that match the characteristics of different applications and programming models. Unfortunately, with large-scale multi-core systems, software schedulers suffer significant overheads as they synchronize and communicate task information over deep cache hierarchies. To reduce these overheads, hardware-only schedulers like Carbon have been proposed to enable task queuing and scheduling to be done in hardware. This paper presents a hardware scheduling approach where the structure provided to programs by task-based programming models can be incorporated into the scheduler, making it aware of a task's data requirements. This prior knowledge of a task's data requirements allows for better task placement by the scheduler, which results in a reduction in overall cache misses and memory traffic, improving the program's performance and power utilization. Simulations of this technique for a range of synthetic benchmarks and components of real applications have shown a reduction in the number of cache misses by up to 72% and 95% for the L1 and L2 caches, respectively, and up to a 30% improvement in overall execution time against FIFO scheduling. This results not only in faster execution and up to 50% less data transfer, reducing the load on the interconnect, but also in lower power consumption.
Shelf space product placement optimizer
A system for optimizing shelf space placement for a product receives decision variables and constraints, and executes a Randomized Search (“RS”) using the decision variables and constraints until an RS solution is below a pre-determined improvement threshold. The system then solves a Mixed-Integer Linear Program (“MILP”) problem using the decision variables and constraints, and using the RS solution as a starting point, to generate a MILP solution. The system repeats the RS executing and MILP solving as long as the MILP solution is not within a predetermined accuracy or does not exceed a predetermined time duration. The system then, based on the final MILP solution, outputs a shelf position and a number of facings for the product.
Java-to-JavaScript translation via structured control flow reconstruction of compiler IR
We present an approach to cross-compile Java bytecodes to JavaScript, building on existing Java optimizing compiler technology. Static analysis determines which Java classes and methods are reachable. These are then translated to JavaScript using a re-configured Java just-in-time compiler with a new back end that generates JavaScript instead of machine code. Standard compiler optimizations such as method inlining and global value numbering, as well as advanced optimizations such as escape analysis, lead to compact and optimized JavaScript code. Compiler IR is unstructured, so structured control flow needs to be reconstructed before code generation is possible. We present details of our control flow reconstruction algorithm. Our system is based on Graal, an open-source optimizing compiler for the Java HotSpot VM and other VMs. The modular and VM-independent architecture of Graal allows us to reuse the intermediate representation, the bytecode parser, and the high-level optimizations. Our custom back end first performs control flow reconstruction and then JavaScript code generation. The generated JavaScript undergoes a set of optimizations to increase readability and performance. Static analysis is performed on the Graal intermediate representation as well. Benchmark results show that medium-sized Java benchmarks such as SPECjbb2005 run with acceptable performance on the V8 JavaScript VM.
Augur: Data-Parallel Probabilistic Modelling
Implementing inference procedures for each new probabilistic model is time-consuming and error-prone. Probabilistic programming addresses this problem by allowing a user to specify the model and automatically generating the inference procedure. To make this practical it is important to generate high performance inference code. In turn, on modern architectures, high performance implies parallel execution. In this paper we present Augur, a probabilistic modelling language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.
Generalized decomposition for non-linear problems
Brief descriptions of RD applications to min cut, max cut, QAP, quadratic programming, and Rosenbrock Functions.
Frappé: Using Clang to Query and Visualize Large Codebases
Frappé is a new tool to support developers with a range of code comprehension queries in multi-million line codebases, from "Does function X or something it calls write to global variable Y?" to "How much code could be affected if I change this macro?". Results are overlaid on a visualisation of the code based on a cartographic map, where the continent/country/state hierarchy corresponds to the code equivalent: high-level architectural components down to individual files and functions. This allows users to visually filter results based on their location and more immediately gauge their number and locality.
Translating Java into LLVM IR to Detect Security Vulnerabilities
Late 2012 and early 2013 saw a spike of new Java vulnerabilities reported in 0-day attacks and used in the wild, allowing bypass of the Java sandbox: unguarded caller-sensitive methods, misuse of doPrivileged, invalid deserialisation, invalid serialisation, and more. Oracle quickly reacted by making patches available and has since increased the scheduled patch update cycle to 4 releases a year. Given the lack of tools available in the market to detect these types of vulnerabilities, and the internal success of the Parfait-for-C static code analysis tool[1] within Oracle, the question was raised whether Parfait could be quickly extended to support the Java language semantics and to detect these new vulnerabilities. In this talk we describe how, in the course of one year, we have been developing and deploying Parfait-for-Java, with the first two of three deployment milestones in place. The Java translator, Jaffa, reuses the LLVM intermediate representation, which Parfait uses as its own intermediate representation, and extends it with metadata to support the semantics of the Java language. Jaffa's translation is done for analysis purposes, not for execution purposes. New analyses for detecting these vulnerabilities encode the Java Secure Coding Guidelines (http://www.oracle.com/technetwork/java/seccodeguide-139067.html). Interaction with the Java Security team is in place, in order to better understand the guidelines themselves and to obtain early feedback on results of the analyses. Staged deployment of Parfait-for-Java provides developers with timely feedback on new code being developed, and provides QA with feedback on existing code. [1] Parfait-for-C was reported at the LLVM Developer Meeting 2009
Deployment of Query Plans on Multicores
Efficient resource scheduling of multithreaded software on multicore hardware is difficult given the many parameters involved and the hardware heterogeneity of existing systems. In this paper we explore the efficient deployment of query plans over a multicore machine. We focus on shared query systems, and implement the proposed ideas using SharedDB. The goal of the paper is to explore how to deliver maximum performance and predictability, while minimizing resource utilization when deploying query plans on multicore machines. We propose to use resource activity vectors to characterize the behavior of individual database operators. We then present a novel deployment algorithm which uses these vectors together with dataflow information from the query plan to optimally assign relational operators to physical cores. Experiments demonstrate that this approach significantly reduces resource requirements while preserving performance and is robust across different server architectures.
Supporting Maintenance and Evolution of Access Control Models in Web Applications
This paper presents an approach to support the maintenance and evolution of Role-Based Access Control (RBAC) models with reverse-engineered Secure UML models. Starting from the Policy Decision Points (PDP) and Policy Enforcement Points (PEP) of an application, our approach statically reverse-engineers the implemented Secure UML model of an application. The secure UML model is then stored in an RDF triple store for easy querying and exploration. In the context of this study, we extracted the Secure UML model of the GRAND Forum, a web-based forum for the members of the GRAND (Graphics, Animation and New Media) NCE (Networks of Centers of Excellence), that is developed and maintained at the University of Alberta. Using three real use-case scenarios, we illustrate how simple queries to the extracted Secure UML can save developers significant amounts of manual work and support them in their access control related maintenance and evolution tasks.
Why Inheritance Anomaly Is Not Worth Solving
Modern computers improve on their predecessors with additional parallelism but require concurrent software to exploit it. Object-orientation is instrumental in simplifying sequential programming; however, in a concurrent setting, programmers adding new methods in a subclass typically have to modify the code of the superclass, which inhibits reuse, a problem known as inheritance anomaly. Researchers have made many efforts over the last two decades to solve the problem by deriving anomaly-free languages. Yet these proposals have not ended up as practical solutions, so one may ask why. In this article, we investigate from a theoretical perspective whether a solution of the problem would introduce extra code complexity. We model object behavior as a regular language, and show that freedom from inheritance anomaly necessitates a language where ensuring Liskov-Wing substitutability becomes a language containment problem, which in our modeling is PSPACE hard. This indicates that we cannot expect programmers to manually ensure that subtyping holds in an anomaly-free language. Anomaly freedom thus predictably leads to software bugs, and we doubt the value of providing it. From the practical perspective, the problem is already solved. Inheritance anomaly is part of the general fragile base class problem of object-oriented programming, which arises due to code coupling in implementation inheritance. In modern software practice, the fragile base class problem is circumvented by interface abstraction to avoid implementation inheritance, and by opting for composition as the means for reuse. We discuss concurrent programming issues with composition for reuse.
Hardware extensions to make lazy subscription safe
Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections “subscribe” to the lock by reading it and checking that it is available. A recent paper proposed using the tempting “lazy subscription” optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. In this extended version of our WTTM 2014 paper, we describe hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support. We also explain how nontransactional loads can be exploited, if available, to further enhance the effectiveness of lazy subscription.
Pitfalls of lazy subscription
Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections “subscribe” to the lock by reading it and checking that it is available. A recent paper proposed using the tempting “lazy subscription” optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. A longer version of this paper proposes hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support.
The Future(s) of Shared Data Structures.
This paper considers how to use futures, a well-known mechanism to manage parallel computations, to improve the performance of long-lived, mutable shared data structures in large-scale multicore systems. We show that futures can enable type-specific optimizations such as combining and elimination, improve cache locality and reduce contention. To exploit these benefits in an effective way, however, it is important to define clear notions of correctness. We propose new extensions to linearizability appropriate for method calls that return futures as results. To illustrate the utility and trade-offs of these extensions, we describe implementations of three common data structures: stacks, queues, and linked lists, designed to exploit futures. Our experimental results show that optimizations enabled by futures lead to substantial performance improvements, in some cases up to two orders of magnitude, compared to well-known lock-free alternatives.
Adaptive Integration of Hardware and Software Lock Elision Techniques.
Transactional Lock Elision (TLE) and optimistic software execution can both improve scalability of lock-based programs. The former uses hardware transactional memory (HTM) without requiring code changes; the latter involves modest code changes but does not require special hardware support. Numerous factors affect the choice of technique, including: critical section code, calling context, workload characteristics, and hardware support for synchronization. The ALE library integrates these techniques, and collects detailed, fine-grained performance data, enabling policies that decide between them at runtime for each critical section execution. We describe an adaptive policy and present experiments on three platforms, two of which support HTM, showing that—without tuning for specific platforms or workload—the adaptive policy is competitive with and often significantly better than hand-tuned static policies.
Debugging At Full Speed
Debugging support for highly optimized execution environments is notoriously difficult to implement. The Truffle/Graal platform for implementing dynamic languages offers an opportunity to resolve the apparent trade-off between debugging and high performance. Truffle/Graal-implemented languages are expressed as abstract syntax tree (AST) interpreters. They enjoy competitive performance through platform support for type specialization, partial evaluation, and dynamic optimization/deoptimization. A prototype debugger for Ruby, implemented on this platform, demonstrates that basic debugging services can be implemented with modest effort and without significant impact on program performance. Prototyped functionality includes breakpoints, both simple and conditional, at lines and at local variable assignments. The debugger interacts with running programs by inserting additional nodes at strategic AST locations; these are semantically transparent by default, but when activated can observe and interrupt execution. By becoming in effect part of the executing program, these “wrapper” nodes are subject to full runtime optimization, and they incur zero runtime overhead when debugging actions are not activated. Conditions carry no overhead beyond evaluation of the expression, which is optimized in the same way as user code, greatly improving the prospects for capturing rarely manifested bugs. When a breakpoint interrupts program execution, the platform automatically restores the full execution state of the program (expressed as Java data structures), as if running in the unoptimized AST interpreter. This then allows full introspection of the execution data structures such as the AST and method activation frames when in the interactive debugger console. Our initial evaluation indicates that such support could be permanently enabled in production environments.
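The wrapper-node idea can be sketched in plain Java (illustrative only, not the Truffle instrumentation API): the wrapper is semantically transparent until the debugger flips its breakpoint flag, and an optimizing compiler can fold the disabled check away.

```java
// Illustrative sketch of a debugger "wrapper node": inserted at a statement,
// transparent by default, and only calls out when a breakpoint is activated.
public class WrapperNodeDemo {
    interface Node { Object execute(); }

    static final class Wrapper implements Node {
        final Node child; final int line;
        volatile boolean breakpointEnabled;               // toggled by the debugger
        Wrapper(Node child, int line) { this.child = child; this.line = line; }
        public Object execute() {
            if (breakpointEnabled) {
                System.out.println("hit breakpoint at line " + line);  // hand control to the debugger
            }
            return child.execute();                        // otherwise semantically transparent
        }
    }

    public static void main(String[] args) {
        Wrapper stmt = new Wrapper(() -> 6 * 7, 3);
        System.out.println(stmt.execute());   // 42, no debugger interaction
        stmt.breakpointEnabled = true;
        System.out.println(stmt.execute());   // prints the breakpoint message first
    }
}
```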
Exploiting Implicit Parallelism in Dynamic Array Programming Languages
We have built an interpreter for the array programming language J. The interpreter exploits implicit data parallelism in the language to achieve good parallel speedups on a variety of benchmark applications. Many array programming languages operate on entire arrays without the need to write loops. Writing without loops simplifies programs. Array programs without loops allow an interpreter to parallelize the execution of the code without complex analysis or input from the programmer. The J programming language includes the usual idioms of operations on arrays of the same size and shape, where the operations can often be performed in parallel for each individual item of the operands. Another opportunity comes from J's reduction operations, where suitable operations can be performed in parallel for all the items of an operand. J has a notion of verb rank, which allows programmers to simplify programs by declaring how operations are applied to operands. The verb rank mechanism allows us to extract further parallelism. Our implementation of an implicitly parallelizing interpreter for J is written entirely in Java. We have written the interpreter in a framework that produces native code for the interpreter, giving good scalar performance. The interpreter itself is responsible for exploiting the parallelism available in the applications. Our results show we attain good parallel speed-up on a variety of benchmarks, including near perfect linear speed-up on inherently parallel benchmarks. We believe that the lessons learned from our approach to exploiting data parallelism in an interpreter can be applied to other interpreted languages as well.
Callisto: Co-Scheduling Parallel Runtime Systems
It is increasingly important for parallel applications to run together on the same machine. However, current performance is often poor: programs do not adapt well to dynamically varying numbers of cores, and the CPU time received by concurrent jobs can differ drastically. This paper introduces Callisto, a resource management layer for parallel runtime systems. We describe Callisto and the implementation of two Callisto-enabled runtime systems—one for OpenMP, and another for a task-parallel programming model. We show how Callisto eliminates almost all of the scheduler-related interference between concurrent jobs, while still allowing jobs to claim otherwise-idle cores. We use examples from two recent graph analytics projects and from SPEC OMP.
Towards Whatever-Scale Abstractions for Data-Driven Parallelism
Increasing diversity in computing systems often requires problems to be solved in quite different ways depending on the workload, data size, and resources available. This kind of diversity is becoming increasingly broad in terms of the organization, communication mechanisms, and the performance and cost characteristics of individual machines and clusters. Researchers have thus been motivated to design abstractions that allow programmers to express solutions independently of target execution platforms, enabling programs to scale from small shared memory systems to distributed systems comprising thousands of processors. We call these abstractions “Whatever-Scale Computing”. In prior work, we have found data-driven parallelism to be a promising approach for solving many problems on shared memory machines. In this paper, we describe ongoing work towards extending our previous abstractions to support data-driven parallelism for Whatever-Scale Computing. We plan to target rack-scale distributed systems. As an intermediate step, we have implemented a runtime system that treats a NUMA shared memory system as if each NUMA domain were a node in a distributed system, using shared memory to implement communication between nodes.
The Case for the Holistic Language Runtime System
We anticipate that, by 2020, the basic unit of warehouse-scale cloud computing will be a rack-sized machine instead of an individual server. At the same time, we expect a shift from commodity hardware to custom SoCs that are specifically designed for use in warehouse-scale computing. In this paper, we make the case that the software for such custom rack-scale machines should move away from the model of running managed language workloads in separate language runtimes on top of a traditional operating system and instead run a distributed language runtime system capable of handling different target languages and frameworks. All applications will execute within this runtime, which performs most traditional OS and cluster manager functionality such as resource management, scheduling and isolation.
Frappé: a code comprehension tool for large codebases
Code comprehension is an integral part of a developer’s everyday programming tasks. Today’s modern, graphical IDEs have many features that facilitate this process. Unfortunately, using these IDEs is often impractical when working with large C/C++ code bases (in the order of millions to tens of millions of lines of code) as they are difficult to integrate with the custom and often complex build systems commonly employed in such projects, and their start-up time and memory usage can be excessive due to the sheer volume of code involved. For these reasons, it has been our experience that developers often fall back to lightweight editors, such as Emacs or Vim, and use regex-based text scanning tools like Grep or CScope to aid in code comprehension tasks. But while these text scanning tools are fast, they are also largely unaware of symbol types, scopes and linking information, handle the C pre-processor poorly, and deal only with direct dependencies. We are developing a code comprehension tool called Frappé that aims to address the limitations of these tools while maintaining a comparable level of scalability and ease-of-integration. Our approach is based around building and incrementally maintaining a dependency graph of the program. The nodes in the graph represent source entities such as functions, global variables, types, macros, files, etc., and the edges represent the relations between them, e.g. calls, reads, writes, uses, contains, etc. Code comprehension questions then become graph-matching problems. This works both for questions involving direct dependencies, like finding all functions that write to a particular global variable (match all function nodes with an outgoing writes edge to the global variable of interest), and for transitive dependencies, like estimating the impact of a code change (match all functions in the transitive closure of calls edges from the function or functions to be modified). Integration with custom builds is made easy by providing wrapper scripts that serve as drop-in replacements for the most common compilers (e.g. gcc, icc, cc, clang). These scripts still execute the native compiler they wrap, but also run a modified version of the clang compiler to write out precise information on the various source entities and dependencies in the given compilation unit. Frappé reads this information in as it becomes available and incrementally constructs (or updates) the dependency graph of the system as it does so. Frappé provides several UIs for exploring the dependency data it generates, including editor plugins for Emacs and Vim and a web UI that allows users to navigate the dependencies in their code and write their own custom queries from their web browser. The web UI overlays query results on a 2D spatial visualisation of the code called a Code Map that gives an immediate general impression of the location, locality and quantity of results. The prototype version of Frappé has seen a positive initial response from internal development organisations. A formal evaluation is planned.
Partial Escape Analysis and Scalar Replacement for Java
Escape Analysis allows a compiler to determine whether an object is accessible outside the allocating method or thread. This information is used to perform optimizations such as Scalar Replacement, Stack Allocation and Lock Elision, allowing modern dynamic compilers to remove some of the abstractions introduced by advanced programming models. The all-or-nothing approach taken by most Escape Analysis algorithms prevents all these optimizations as soon as there is one branch where the object escapes, no matter how unlikely this branch is at runtime. This paper presents a new, practical algorithm that performs control flow sensitive Partial Escape Analysis in a dynamic Java compiler. It allows Escape Analysis, Scalar Replacement and Lock Elision to be performed on individual branches. We implemented the algorithm on top of Graal, an open-source Java just-in-time compiler, and it performs well on a diverse set of benchmarks. In this paper, we evaluate the effect of Partial Escape Analysis on the DaCapo, ScalaDaCapo and SpecJBB2005 benchmarks, in terms of run-time, number and size of allocations and number of monitor operations. It performs particularly well in situations with additional levels of abstraction, such as code generated by the Scala compiler. It reduces the amount of allocated memory by up to 58.5%, and improves performance by up to 33%.
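The following source-level example shows the situation Partial Escape Analysis targets (the transformation itself happens in the compiler IR, not in source; the code is illustrative): the allocation escapes only on a rarely taken branch, so the hot path can be scalar-replaced and the object materialized only where it actually escapes.

```java
// Illustration of a partial-escape situation: the Point escapes only when
// 'log' is true, so a PEA-capable compiler can keep it virtual on the hot path
// and allocate it just before the escaping use.
public class PartialEscapeDemo {
    static final java.util.List<Point> ESCAPED = new java.util.ArrayList<>();
    record Point(int x, int y) {}

    static long distanceSquared(int x, int y, boolean log) {
        Point p = new Point(x, y);        // can stay virtual (scalar-replaced) on the hot path
        if (log) {
            ESCAPED.add(p);               // escapes here only; the allocation is sunk into this branch
        }
        return (long) p.x() * p.x() + (long) p.y() * p.y();
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += distanceSquared(i, i + 1, false);
        System.out.println(sum + ", escaped objects: " + ESCAPED.size());
    }
}
```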
Finding Java Vulnerabilities with the Parfait Static Code Analysis Tool
Poster describing the aims of the Java Vulnerability Detection project, as well as its initial components: a translator of Java source code into the LLVM intermediate representation (Jaffa) and analyses in the Parfait infrastructure to support detection of the vulnerabilities of interest.
One VM to rule them all
Building high-performance virtual machines is a complex and expensive undertaking; many popular languages still have low-performance implementations. We describe a new approach to virtual machine (VM) construction that amortizes much of the effort in initial construction by allowing new languages to be implemented with modest additional effort. The approach relies on abstract syntax tree (AST) interpretation where a node can rewrite itself to a more specialized or more general node, together with an optimizing compiler that exploits the structure of the interpreter. The compiler uses speculative assumptions and deoptimization in order to produce efficient machine code. Our initial experience suggests that high performance is attainable while preserving a modular and layered architecture, and that new high-performance language implementations can be obtained by writing little more than a stylized interpreter.
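The node-rewriting idea can be sketched with a toy example: a specialized add node speculates on integer operands and rewrites itself to a more general node the first time the speculation fails. The class and method names below are hypothetical and not the actual framework's API.

```java
// Toy sketch of an AST node that rewrites itself to a more general node when a
// speculation fails. Names are illustrative, not the actual framework's API.
final class NodeRewritingDemo {
    abstract static class Node { abstract Object execute(); }

    static final class IntAddNode extends Node {
        final Object left, right;
        Node replacement;                                    // node this one rewrote itself to
        IntAddNode(Object l, Object r) { left = l; right = r; }
        @Override Object execute() {
            if (left instanceof Integer a && right instanceof Integer b)
                return a + b;                                // specialized fast path
            replacement = new GenericAddNode(left, right);   // speculation failed: generalize
            return replacement.execute();
        }
    }

    static final class GenericAddNode extends Node {
        final Object left, right;
        GenericAddNode(Object l, Object r) { left = l; right = r; }
        @Override Object execute() {                         // general case, e.g. concatenation
            if (left instanceof Integer a && right instanceof Integer b) return a + b;
            return String.valueOf(left) + right;
        }
    }

    public static void main(String[] args) {
        System.out.println(new IntAddNode(2, 3).execute());   // 5 (stays specialized)
        System.out.println(new IntAddNode("a", 1).execute()); // a1 (rewrites to generic)
    }
}
```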
An intermediate representation for speculative optimizations in a dynamic compiler
We present a compiler intermediate representation (IR) that allows dynamic speculative optimizations for high-level languages. The IR is graph-based and contains nodes fixed to control flow as well as floating nodes. Side-effecting nodes include a framestate that maps values back to the original program. Guard nodes dynamically check assumptions and, on failure, deoptimize to the interpreter that continues execution. Guards implicitly use the framestate and program position of the last side-effecting node. Therefore, they can be represented as freely floating nodes in the IR. Exception edges are modeled as explicit control flow and are subject to full optimization. We use profiling and deoptimization to speculatively reduce the number of such edges. The IR is the core of a just-in-time compiler that is integrated with the Java HotSpot VM. We evaluate the design decisions of the IR using major Java benchmark suites.
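Conceptually, a guard replaces an explicit exception edge with an assumption check that falls back to the interpreter on failure. The snippet below is only a source-level analogy of that control-flow shape (the helper interface is hypothetical; in reality the compiler emits the guard and the framestate transfer, not the programmer).

```java
// Conceptual analogy of a guard plus deoptimization instead of an exception edge.
final class GuardSketch {
    interface Deopt { int continueInInterpreter(); }     // stands in for the framestate transfer

    // Compiled code speculating that the divisor is non-zero, protected by a guard.
    static int speculativeDiv(int a, int b, Deopt deopt) {
        if (b == 0) {                                    // guard: assumption check
            return deopt.continueInInterpreter();        // deoptimize: resume in the interpreter
        }
        return a / b;                                    // fast path, no exception edge needed
    }

    public static void main(String[] args) {
        Deopt interpreter = () -> { System.out.println("deoptimized"); return 0; };
        System.out.println(speculativeDiv(10, 2, interpreter)); // 5
        System.out.println(speculativeDiv(10, 0, interpreter)); // deoptimized, then 0
    }
}
```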
A Joint Model for Discovering and Linking Entities
Entity resolution, the task of automatically determining which mentions refer to the same real-world entity, is a crucial aspect of knowledge base construction and management. However, performing entity resolution at large scales is challenging because (1) the inference algorithms must cope with unavoidable system scalability issues and (2) the search space grows exponentially in the number of mentions. Current conventional wisdom declares that performing coreference at these scales requires decomposing the problem by first solving the simpler task of entity-linking (matching a set of mentions to a known set of KB entities), and then performing entity discovery as a postprocessing step (to identify new entities not present in the KB). However, we argue that this traditional approach is harmful to both entity-linking and overall coreference accuracy. Therefore, we embrace the challenge of jointly modeling entity-linking and entity-discovery as a single entity resolution problem. In order to achieve scalability we (1) present a model that reasons over compact hierarchical entity representations, and (2) propose a novel distributed inference architecture that does not suffer from the synchronicity bottleneck which is inherent in map-reduce architectures. We demonstrate that more test-time data actually improves the accuracy of coreference, and show that the joint approach to coreference is substantially more accurate than traditional entity-linking, reducing error by over 75%.
Assessing Confidence of Knowledge Base Content with an Experimental Study in Entity Resolution
The purpose of this paper is to begin a conversation about the importance and role of confidence estimation in knowledge bases (KBs). KBs are never perfectly accurate, yet without confidence reporting their users are likely to treat them as if they were, possibly with serious real-world consequences. We define a notion of confidence based on the probability of a KB fact being true. For automatically constructed KBs we propose several algorithms for estimating this confidence from pre-existing probabilistic models of data integration and KB construction. In particular, this paper focusses on confidence estimation in entity resolution. A goal of our exposition here is to encourage creators and curators of KBs to include confidence estimates for entities and relations in their KBs.
Improved dataflow executions with user assisted scheduling.
In pure dataflow applications scheduling can have a huge effect on the memory footprint and number of active tasks in the program. However, in impure programs, scheduling not only affects the system resources, but can also affect the overall time complexity and accuracy of the program. To address both of these aspects this paper describes and analyses effective extensions to a dataflow scheduler to allow programmers to provide priority information describing the preferred execution order of a dataflow graph. We demonstrate that even very crude task priority metrics can be extremely effective, providing an average saving of 91% over the worst case scenario and 60% over the best case naive scenario. We also note that by specifying the scheduling information explicitly based on the algorithm, not the hardware, we provide portability to the application.
An Experimental Study of the Influence of Dynamic Compiler Optimizations on Scala Performance
Java Virtual Machines are optimized for performing well on traditional Java benchmarks, which consist almost exclusively of code generated by the Java source compiler (javac). Code generated by compilers for other languages has not received nearly as much attention, which results in performance problems for those languages. One important specimen of "another language" is Scala, whose syntax and features encourage a programming style that differs significantly from traditional Java code. It suffers from the same problem -- its code patterns are not optimized as well as the ones originating from Java code. JVM developers need to be aware of the differences between Java and Scala code, so that both types of code can be executed with optimal performance. This paper presents a detailed investigation of the performance impact of a large number of optimizations on the Scala DaCapo and the Java DaCapo benchmark suites. It describes the optimization techniques and analyzes the differences between traditional Java applications and Scala applications. The results help compiler engineers in understanding the characteristics of Scala. We performed these experiments on the work-in-progress Graal compiler. Graal is a new dynamic compiler for the HotSpot VM which aims to work well for a diverse set of workloads, including languages other than Java.
Constrained Data-Driven Parallelism
In data-driven parallelism, changes to data spawn new tasks, which may change more data, spawning yet more tasks. Computation propagates until no further changes occur. Benefits include increasing opportunities for fine-grained parallelism, avoiding redundant work, and supporting incremental computations on large data sets. Nonetheless, data-driven parallelism can be problematic. For example, convergence times of data-driven single-source shortest paths algorithms can vary by two orders of magnitude depending on task execution order. We propose constrained data-driven parallelism, in which programmers can impose ordering constraints on tasks. In particular, we propose new abstractions for defining groups of tasks and constraining the execution order of tasks within each group. We sketch an initial implementation and present experimental results demonstrating that our approach enables new efficient data-driven implementations of a variety of graph algorithms.
Code Maps: A Scalable Visualisation Technique for Large Codebases
Large codebases (in the order of millions to 10s of millions of lines of code) are notoriously difficult to understand, modify and maintain. They are the product of hundreds of developers working simultaneously over several decades and are often poorly documented. For new developers, building up a workable mental model of these systems can take a considerable amount of time. Software visualisation is one area that seems ideally suited to address this problem, but that in practice, does not see much use beyond high-level, hand-crafted architecture diagrams and node-link diagrams that typically scale poorly to larger systems. We are developing a scalable, spatial visualisation for large codebases based on a world-map metaphor. The core of the idea is mapping the continent/country/state/city/etc. hierarchy to the equivalent in a code base – the high-level architectural components down to the individual files and functions that comprise them – and laying these out so as to maintain the intuitive notion that the proximity of any two entities is proportional to their coupling. Our approach takes as input a dependency graph of the system and, optionally, a predefined abstraction hierarchy to group the low-level source code entities into their higher-level system components. If a hierarchy is not supplied, we recover one from the dependency graph using a graph clustering algorithm. From there we use a combination of force-directed graph layout, implicit surface generation, and Voronoi treemaps to produce a map of the codebase. Our approach allows users to browse the system at any level of detail using the familiar pan/zoom interaction model of web-based mapping services. It provides strong visual landmarks for faster navigation thanks to the distinctive shapes and positions of regions on the map, and lends itself to easy data overlay – the size and colour of regions are proportional to supplied code metrics, while bug locations, search results, and dependency edges can be superimposed. A formal evaluation of the prototype has not yet been undertaken, but initial feedback from internal development organisations has been very positive, particularly for the data overlay capabilities and intuitive zoom-for-detail interaction model.
Min Cut Results from 2013
Randomized Decomposition results against state-of-the-art packages and best known solutions in 2013
Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
Fano's inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon's as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano's result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate---we derive similar optimal thresholds for F-score and BER.
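For reference, the classical inequality being extended and the threshold observation can be written as follows. This is the standard Shannon-entropy form, not the paper's generalized-entropy version.

```latex
% Classical Fano inequality, for any classifier \hat{Y} of Y built from X,
% with P_e = \Pr(\hat{Y} \neq Y) and h the binary entropy:
H(Y \mid X) \;\le\; h(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr)

% Two-class case (|\mathcal{Y}| = 2): the last term vanishes, so the Bayes error e^*
% is lower-bounded via the binary entropy:
h(e^*) \;\ge\; H(Y \mid X), \qquad
e^* \;\ge\; h^{-1}\bigl(H(Y \mid X)\bigr) \ \text{taking the branch on } [0, \tfrac{1}{2}]

% The error-rate-optimal decision rule thresholds the posterior where it crosses 1/2:
\hat{y}(x) = \mathbb{1}\!\left[\Pr(Y = 1 \mid x) > \tfrac{1}{2}\right]
```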
Static Analysis by Elimination
In this paper we describe a program analysis technique for finding value ranges for variables in the LLVM compiler infrastructure. Range analysis has several important applications for embedded systems, including elimination of assertions in programs, automatically deducing numerical stability, eliminating array bounds checking, and integer overflow detection. Determining value ranges poses a major challenge in program analysis because it is difficult to ensure the termination and precision of the program analysis in the presence of program cycles. This work uses a technique where loops are detected intrinsically within the program analysis. Our work combines methods of elimination-based data flow analysis with abstract interpretation. We have implemented a prototype of the proposed framework in the LLVM compiler framework and have conducted experiments with a suite of test programs to show the feasibility of our approach.
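A minimal sketch of the abstract domain such a range analysis iterates over is shown below: intervals with a join operator for merging control flow and a widening operator that forces termination on loops. The names are illustrative and this is not the tool's implementation.

```java
// Minimal sketch of the interval (value-range) abstract domain: join for control-flow
// merges, widening to guarantee termination on program cycles. Illustrative only.
final class Interval {
    static final long NEG_INF = Long.MIN_VALUE, POS_INF = Long.MAX_VALUE;
    final long lo, hi;                         // represents the set {x | lo <= x <= hi}
    Interval(long lo, long hi) { this.lo = lo; this.hi = hi; }

    // Least upper bound: smallest interval covering both operands (control-flow merge).
    Interval join(Interval o) {
        return new Interval(Math.min(lo, o.lo), Math.max(hi, o.hi));
    }

    // Widening: jump straight to +/- infinity on any growing bound so loops terminate.
    Interval widen(Interval next) {
        return new Interval(next.lo < lo ? NEG_INF : lo, next.hi > hi ? POS_INF : hi);
    }

    @Override public String toString() { return "[" + lo + ", " + hi + "]"; }

    public static void main(String[] args) {
        Interval i = new Interval(0, 0);
        Interval afterOneIteration = new Interval(0, 1);  // e.g. a loop keeps growing the bound
        System.out.println(i.join(afterOneIteration));    // [0, 1]
        System.out.println(i.widen(afterOneIteration));   // [0, 9223372036854775807]
    }
}
```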
Graal IR: An Extensible Declarative Intermediate Representation
We present an intermediate representation (IR) for a Java just in time (JIT) compiler written in Java. It is a graph-based IR that models both control-flow and data-flow dependencies between nodes. We show the framework in which we developed our IR. Much care has been taken to allow the programmer to focus on compiler optimization rather than IR bookkeeping. Edges between nodes are declared concisely using Java annotations, and common properties and functions on nodes are communicated to the framework by implementing interfaces. Building upon these declarations, the graph framework automatically implements a set of useful primitives that the programmer can use to implement optimizations.
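The flavour of "edges declared with annotations, primitives derived by the framework" can be sketched in a self-contained way as follows. The annotation and class names here are illustrative only and are not the actual framework's API.

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.*;

// Self-contained sketch: IR input edges are declared as annotated fields, and a tiny
// framework discovers them reflectively. Names are illustrative, not the real API.
final class DeclarativeIrSketch {
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
    @interface Input {}                              // marks a data-flow input edge

    static class Node {
        // Generic primitive built from the declarations: enumerate this node's inputs.
        List<Node> inputs() {
            List<Node> result = new ArrayList<>();
            for (Field f : getClass().getDeclaredFields()) {
                if (f.isAnnotationPresent(Input.class)) {
                    f.setAccessible(true);
                    try { result.add((Node) f.get(this)); }
                    catch (IllegalAccessException e) { throw new AssertionError(e); }
                }
            }
            return result;
        }
    }

    static final class ConstantNode extends Node {
        final int value;
        ConstantNode(int value) { this.value = value; }
    }

    static final class AddNode extends Node {
        @Input Node x;                               // an edge is a plain field plus an annotation
        @Input Node y;
        AddNode(Node x, Node y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) {
        AddNode add = new AddNode(new ConstantNode(1), new ConstantNode(2));
        System.out.println(add.inputs().size());     // 2
    }
}
```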
NUMA-aware reader-writer locks
Non-Uniform Memory Access (NUMA) architectures are gaining importance in mainstream computing systems due to the rapid growth of multi-core multi-chip machines. Extracting the best possible performance from these new machines will require us to re-visit the design of the concurrent algorithms and synchronization primitives which form the building blocks of many of today’s applications. This paper revisits one such critical synchronization primitive – the reader-writer lock.
We present what is, to the best of our knowledge, the first family of reader-writer lock algorithms tailored to NUMA architectures. We present several variations which trade fairness between readers and writers for higher concurrency among readers and better back-to-back batching of writers from the same NUMA node. Our algorithms leverage the lock cohorting technique to manage synchronization between writers in a NUMA-friendly fashion, binary flags to coordinate readers and writers, and simple distributed reader counter implementations to enable NUMA-friendly concurrency among readers. The end result is a collection of surprisingly simple NUMA-aware algorithms that outperform the state-of-the-art reader-writer locks by up to a factor of 10 in our microbenchmark experiments. To evaluate ....
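The "distributed reader counter" ingredient can be sketched as below: readers increment a per-node counter rather than a single shared one, so read acquisitions from different NUMA nodes do not bounce the same cache line. This is a simplified sketch of that one ingredient, not the paper's cohort-based algorithms, and the thread-to-node mapping is a stand-in.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of a reader-writer lock with per-node reader counters.
// Not the paper's algorithm; the node-id mapping below is a stand-in.
final class DistributedReaderWriterLock {
    private final AtomicLong[] readers;              // one slot per NUMA node (ideally padded)
    private final ReentrantLock writerMutex = new ReentrantLock();
    private volatile boolean writerActive = false;

    DistributedReaderWriterLock(int numaNodes) {
        readers = new AtomicLong[numaNodes];
        for (int i = 0; i < numaNodes; i++) readers[i] = new AtomicLong();
    }

    private int nodeOf(Thread t) {                   // stand-in for a real thread->node mapping
        return (int) (t.getId() % readers.length);
    }

    void readLock() {
        int node = nodeOf(Thread.currentThread());
        while (true) {
            readers[node].incrementAndGet();         // announce the read on the local counter
            if (!writerActive) return;               // no writer: proceed
            readers[node].decrementAndGet();         // writer present: back off and wait
            while (writerActive) Thread.onSpinWait();
        }
    }

    void readUnlock() {
        readers[nodeOf(Thread.currentThread())].decrementAndGet();
    }

    void writeLock() {
        writerMutex.lock();                          // one writer at a time
        writerActive = true;                         // block new readers
        for (AtomicLong r : readers)                 // drain readers that got in first
            while (r.get() != 0) Thread.onSpinWait();
    }

    void writeUnlock() {
        writerActive = false;
        writerMutex.unlock();
    }

    public static void main(String[] args) {
        DistributedReaderWriterLock lock = new DistributedReaderWriterLock(4);
        lock.readLock();  lock.readUnlock();
        lock.writeLock(); lock.writeUnlock();
        System.out.println("ok");
    }
}
```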
Reliable peer-to-peer connections
Embodiments of a system and method for establishing reliable connections between peers in a peer-to-peer networking environment. In one embodiment, a reliable communications channel may use transmit and receive windows, acknowledgement of received messages, and retransmission of messages not received to provide reliable delivery of messages between peers in the peer-to-peer environment. In one embodiment, each message may include a sequence number configured for use in maintaining ordering of received messages on a receiving peer. A communications channel may make multiple hops on a network, and different hops in the connection may use different underlying network protocols. Communications channels may also pass through one or more firewalls and/or one or more gateways on the network. A communications channel may also pass through one or more router (relay) peers on the network. The peers may adjust the sizes of the transmit and receive window based upon reliability of the connection.
The Potential to Coordinate Digital Simulations for UK-wide VET:Report to the Commission on Adult Vocational Teaching and Learning
The report provides insight into and analysis of the opportunity and potential of simulation tools for education. In VET, learning and assessment are primarily practice-based. Consequently many colleges build simulations of real world locations, such as kitchens, hairdressing salons, garages, building sites, and farms in land-based colleges. A wide range of digital tools are then used to support, amplify or augment these real-world learning processes, to prepare learners for authentic workplace practice, aid reflection on practice, reinforce their practice-based learning, and to help with revision before assessments. We examine here the particular role of digital simulation technologies alongside other digital applications and conventional methods.
Communities of Practice
Communities of practice was first adopted in education as a theory of learning, and by business, particularly within organizational development, as a knowledge management approach. This chapter reviews the literature on communities of practice. It is organized into five sections which call out the major areas in which communities of practice has had an influence. Each section provides an overview of the literature on communities of practice for the following domains: communities of practice definitions and theory, identities and belonging, learning and teaching methods using the theory of communities of practice, workplace communities of practice, and virtual communities of practice. The chapter finally addresses communities of scientific practice through the use of a case study.
SIMMAT: A Metastability Analysis Tool
Presentation at the IEEE/ACM ICCAD 2012 Workshop on CAD for Multi-Synchronous and Asynchronous Circuits and Systems, 8 November 2012, Hilton San Jose, CA USA.
Compilation Queuing and Graph Caching for Dynamic Compilers
Modern virtual machines for Java use a dynamic compiler to optimize the program at run time. The compilation time therefore impacts the performance of the application in two ways: First, the compilation and the program's execution compete for CPU resources. Second, the sooner the compilation of a method finishes, the sooner the method will execute at compiled speed. In this paper, we present two strategies for mitigating the performance impact of a dynamic compiler. We introduce and evaluate a way to cache, reuse and, at the right time, evict the compiler's intermediate graph representation. This allows reuse of this graph when a method is inlined multiple times into other methods. We show that the combination of late inlining and graph caching is highly effective by evaluating the cache hit rate for several benchmarks. Additionally, we present a new mechanism for optimizing the order in which methods get compiled. We use a priority queue in order to make sure that the compiler processes the hottest methods of the program first. The machine code for hot methods is available earlier, which has a significant impact on the first benchmark. Our results show that our techniques can significantly improve the start-up performance of Java applications. The techniques are applicable to dynamic compilers for managed languages.
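The queuing idea reduces to ordering pending compile requests by a hotness metric so the hottest methods are compiled first. A minimal sketch follows; the task fields and numbers are illustrative only.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Minimal sketch of priority-based compilation queuing: the compiler thread always
// takes the hottest pending method first. Fields and counts are illustrative.
final class CompileQueueSketch {
    record CompileTask(String methodName, long invocationCount) {}

    public static void main(String[] args) throws InterruptedException {
        // Highest invocation count first.
        PriorityBlockingQueue<CompileTask> queue = new PriorityBlockingQueue<>(
                16, (a, b) -> Long.compare(b.invocationCount(), a.invocationCount()));

        queue.add(new CompileTask("List.size", 120));
        queue.add(new CompileTask("HashMap.get", 50_000));   // hottest: compiled first
        queue.add(new CompileTask("Main.init", 3));

        while (!queue.isEmpty()) {
            System.out.println("compiling " + queue.take().methodName());
        }
    }
}
```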
Intransitive noninterference in nondeterministic systems
This paper addresses the question of how TA-security, a semantics for intransitive information-flow policies in deterministic systems, can be generalized to nondeterministic systems. Various definitions are proposed, including definitions that state that the system enforces as much of the policy as possible in the context of attacks in which groups of agents collude by sharing information through channels that lie outside the system. Relationships between the various definitions proposed are characterized, and an unwinding-based proof technique is developed. Finally, it is shown that on a specific class of systems, access control systems with local non-determinism, the strongest definition can be verified by checking a simple static property.
RSSolver: A tool for solving large non-linear, non-convex discrete optimization problems
Describes the initial implementation of Randomized Decomposition (then called Randomized Search) with some numerical results on the price optimization and shelf space optimization problems
R2RML: RDB to RDF Mapping Language, W3C Recommendation
Co-editor of the W3C recommendation on describing a language to map relational databases to RDF datasets.
DFScala: high level dataflow support for Scala.
In this paper we present DFScala, a library for constructing and executing dataflow graphs in the Scala language. Through the use of Scala this library allows the programmer to construct coarse grained dataflow graphs that take advantage of functional semantics for the dataflow graph and both functional and imperative semantics within the dataflow nodes. This combination allows for very clean code which exhibits the properties of dataflow programs, but we believe is more accessible to imperative programmers. We first describe DFScala in detail, before using a number of benchmarks to evaluate both its scalability and its absolute performance relative to existing codes. DFScala has been constructed as part of the Teraflux project and is being used extensively as a basis for further research into dataflow programming.
Low-loss Low-crosstalk Silicon Rib Waveguide Crossing with Tapered Multimode-Interference Design
Abstract: We report the design and fabrication of silicon rib-waveguide crossings based on taper-integrated multimode-interference. Measured devices built in a 130nm SOI CMOS process showed an insertion loss of 0.1dB/crossing, and an extracted crosstalk below -35dB.
A physical design tool for carbon nanotube field-effect transistor circuits
In this article, we present a graphical Computer-Aided Design (CAD) environment for the design, analysis, and layout of Carbon NanoTube (CNT) Field-Effect Transistor (CNFET) circuits. This work is motivated by the fact that such a tool currently does not exist in the public domain for researchers. Our tool has been integrated within Electric, a very powerful yet free CAD system for custom design of Integrated Circuits (ICs). The tool supports CNFET schematic and layout entry, rule checking, and HSpice/VerilogA netlist generation. We provide users with a customizable CNFET technology library with the ability to specify λ-based design rules. We showcase the capabilities of our tool by demonstrating the design of a large CNFET standard cell and component library. HSPICE simulations are also presented for cell library characterization. We hope that the availability of this tool will invigorate the CAD community to explore novel ideas in CNFET circuit design.
MCMCMC: Efficient Inference by Approximate Sampling
Conditional random fields and other graphical models have achieved state of the art results in a variety of NLP and IE tasks including coreference and relation extraction. Increasingly, practitioners are using models with more complex structure—higher tree-width, larger fan-out, more features, and more data—rendering even approximate inference methods such as MCMC inefficient. In this paper we propose an alternative MCMC sampling scheme in which transition probabilities are approximated by sampling from the set of relevant factors. We demonstrate that our method converges more quickly than a traditional MCMC sampler for both marginal and MAP inference. In an author coreference task with over 5 million mentions, we achieve a 13 times speedup over regular MCMC inference.
A Discriminative Hierarchical Model for Fast Coreference at Large Scale
Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks including difficulties scaling to large numbers of mentions and limited representational power. As these drawbacks become increasingly restrictive, the need to replace the pairwise approaches with a more expressive, highly scalable alternative is becoming urgent. In this paper we propose a novel discriminative hierarchical model that recursively partitions entities into trees of latent sub-entities. These trees succinctly summarize the mentions providing a highly compact, information-rich structure for reasoning about entities and coreference uncertainty at massive scales. We demonstrate that the hierarchical model is several orders of magnitude faster than pairwise, allowing us to perform coreference on six million author mentions in under four hours on a single CPU.
Evaluating the Design of the R Language - Objects and Functions for Data Analysis.
R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we assess the success of different language features.
Combining Functional and Imperative Programming for Multicore Software: An Empirical Study Evaluating Scala and Java
Recent multi-paradigm programming languages combine functional and imperative programming styles to make software development easier. Given today’s proliferation of multicore processors, parallel programmers are supposed to benefit from this combination, as many difficult problems can be expressed more easily in a functional style while others match an imperative style. Due to a lack of empirical evidence from controlled studies, however, important software engineering questions are largely unanswered. Our paper is the first to provide thorough empirical results by using Scala and Java as a vehicle in a controlled comparative study on multicore software development. Scala combines functional and imperative programming while Java focuses on imperative shared-memory programming. We study thirteen programmers who worked on three projects, including an industrial application, in both Scala and Java. In addition to the resulting 39 Scala programs and 39 Java programs, we obtain data from an industry software engineer who worked on the same project in Scala. We analyze key issues such as effort, code, language usage, performance, and programmer satisfaction. Contrary to popular belief, the functional style does not lead to bad performance. Average Scala run-times are comparable to Java, lowest run-times are sometimes better, but Java scales better on parallel hardware. We confirm with statistical significance Scala’s claim that Scala code is more compact than Java code, but clearly refute other claims of Scala on lower programming effort and lower debugging effort. Our study also provides explanations for these observations and shows directions on how to improve multi-paradigm languages in the future.
Informative Priors for Markov Blanket Discovery
We present a novel interpretation of information theoretic feature selection as optimization of a discriminative model. We show that this formulation coincides with a group of mutual information based filter heuristics in the literature, and show how our probabilistic framework gives a well-founded extension for informative priors. We then derive a particular sparsity prior that recovers the well-known IAMB algorithm (Tsamardinos & Aliferis, 2003) and extend it to create a novel algorithm, IAMB-IP, that includes domain knowledge priors. In empirical evaluations, we find the new algorithm to improve Markov Blanket recovery even when a misspecified prior was used, in which half the prior knowledge was incorrect.
Solving retail space optimization problem using the randomized search algorithm
Solving the Retail Space Optimization Problem using the Randomized Search Algorithm. An application of RD to shelf-space optimization
Resource-bounded Information Acquisition and Learning
In many scenarios it is desirable to augment existing data with information acquired from an external source. For example, information from the Web can be used to fill missing values in a database or to correct errors. In many machine learning and data mining scenarios, acquiring additional feature values can lead to improved data quality and accuracy. However, there is often a cost associated with such information acquisition, and we typically need to operate under limited resources. In this thesis, I explore different aspects of Resource-bounded Information Acquisition and Learning. The process of acquiring information from an external source involves multiple steps, such as deciding what subset of information to obtain, locating the documents that contain the required information, acquiring relevant documents, extracting the specific piece of information, and combining it with existing information to make useful decisions. The problem of Resource-bounded Information Acquisition (RBIA) involves saving resources at each stage of the information acquisition process. I explore four special cases of the RBIA problem, propose general principles for efficiently acquiring external information in real-world domains, and demonstrate their effectiveness using extensive experiments. For example, in some of these domains I show how interdependency between fields or records in the data can also be exploited to achieve cost reduction. Finally, I propose a general framework for RBIA, that takes into account the state of the database at each point of time, dynamically adapts to the results of all the steps in the acquisition process so far, as well as the properties of each step, and carries them out striving to acquire most information with least amount of resources.
A case for exiting a transaction in the context of hardware transactional memory.
Despite the rapid growth in the area of Transactional Memory (TM), there is a lack of standardisation of certain features. The behaviour of a transactional abort is one such feature. All hardware TM and most software TM designs treat abort as a way of restarting the current transaction. However, an alternative representation of the same functionality has been expressed in some software transactional memories and programming language proposals. These allow the termination of a transaction without restarting. In this paper we argue that similar functionality is required for hardware TM as well. We call this functionality Exit Transaction, in which a programmer can explicitly ask the underlying TM system to move to the end of the transaction without committing it. We discuss how to extend a hardware TM system to support such a feature and our evaluation with two hardware TM systems shows that by using this functionality a speedup of up to 1.35X can be achieved on the benchmarks tested. This is achieved as a result of lower contention for resources and fewer false positives.
“Dual-Purpose” Remateable Conductive Ball-in-Pit Interconnects for Chip Powering and Passive Alignment in Proximity Communication Enabled Multi-Chip Packages
with Hiren Thacker, Ivan Shubin, Ying Luo, Kannan Raj, Ashok Krishnamoorthy and John Cunningham
Yes, There is an "Expertise Gap" in HPC Applications Development
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity increase in High Performance Computing (HPC), where productivity is understood to be a composite of system performance, system robustness, programmability, portability, and administrative concerns. Of these, programmability is the least well understood and perceived to be the most problematic. It has been suggested that an "expertise gap" is at the heart of the problem in HPC application development. Preliminary results from research conducted by Sun Microsystems and other participants in the HPCS program confirm that such an "expertise gap" does exist and does exert a significant confounding influence on HPC application development. Further, the nature of the "expertise gap" appears not to be amenable to previously proposed solutions such as "more education" and "more people." A productivity improvement of the scale sought by the HPCS program will require fundamental transformations in the way HPC applications are developed and maintained.
Selecting Actions for Resource-bounded Information Extraction using Reinforcement Learning
Given a database with missing or uncertain content, our goal is to correct and fill the database by extracting specific information from a large corpus such as the Web, and to do so under resource limitations. We formulate the information gathering task as a series of choices among alternative, resource-consuming actions and use reinforcement learning to select the best action at each time step. We use the temporal difference Q-learning method to train the function that selects these actions, and compare it to an online, error-driven algorithm called SampleRank. We present a system that finds information such as email, job title and department affiliation for the faculty at our university, and show that the learning-based approach accomplishes this task efficiently under a limited action budget. Our evaluations show that we can obtain 92.4% of the final F1, by only using 14.3% of all possible actions.
Grating-Coupler Based Low-Loss Optical Interlayer Coupling
IEEE Group IV Photonics
Applying dataflow and transactions to Lee routing.
Programming multicore shared-memory systems is a challenging combination of exposing parallelism in your program and communicating between the resulting parallel paths of execution. The burden of communication can introduce complexity that is hard to separate from the pure expression of the algorithm and can negate the performance that is gained from parallelism. We are extending the Scala language with dataflow for creating parallelism and transactions for the controlled mutation of shared state. We take an early look at applying this work to Lee's algorithm for routing circuit boards and consider the potential benefits of programming with this system with regard to the elegance of expression and the resulting performance. We show how our approach reduces the number of lines of code and synchronisation operations needed, at the same time as improving real-world performance.
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy than is usual in the feature selection literature; instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples.
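The family of criteria referred to here is commonly written in the following form, stated from the standard formulation in this line of work (see the paper for the exact derivation and the full mapping of published heuristics onto it).

```latex
% Score for a candidate feature X_k given the already-selected set S and class Y:
% a relevancy term corrected by redundancy and conditional-redundancy terms.
J(X_k) \;=\; I(X_k;Y)
  \;-\; \beta \sum_{X_j \in S} I(X_k;X_j)
  \;+\; \gamma \sum_{X_j \in S} I(X_k;X_j \mid Y)

% Particular (\beta, \gamma) choices recover the published heuristics; for example
% the JMI criterion amounts to scoring
J_{\mathrm{JMI}}(X_k) \;=\; \sum_{X_j \in S} I\bigl(X_k X_j ; Y\bigr)
```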
Integration and packaging of a macrochip with silicon nanophotonic links
in press, IEEE Journal of Selected Topics in Quantum Electronics, special issue on Packaging and Integration technologies for Optical MEMS/NEMS, Optoelectronic and Nanophotonic Devices, 2011.
Dense WDM Silicon Photonic Interconnects for Compact High-end Computing Systems
IEEE Photonics Society Winter Topicals, 2011.
Hybrid integrated silicon photonic bridge chips for ultralow energy inter-chip communications
Proceedings, SPIE Photonics West, 2011.
10Gbps, 530 fJ/b Optical Transceiver Circuits in 40nm CMOS
IEEE Symp. VLSI Circ.
Formal Machine-Checked Verification of a Real Transactional Memory Algorithm
Slides for talk at WTTM 2011
Learning to Select Actions for Resource-bounded Information Extraction
Given a database with missing or uncertain information, our goal is to extract specific information from a large corpus such as the Web under limited resources. We cast the information gathering task as a series of alternative, resource-consuming actions to choose from and propose a new algorithm for learning to select the best action to perform at each time step. The function that selects these actions is trained using an online, error-driven algorithm called SampleRank. We present a system that finds the faculty directory pages of top Computer Science departments in the U.S. and show that the learning-based approach accomplishes this task very efficiently under a limited action budget, obtaining approximately 90% of the overall F1 using less than 2% of actions. If we apply our method to the task of filling missing values in a large scale database with millions of rows and a large number of columns, the system can obtain just the required information from the Web very efficiently.
Architecture of the JInterval library
This is a translation from Russian of the paper for the conference "Statistics, Simulation, Optimization - 2011" to be held in Chelyabinsk, Russia. The JInterval library is an interval arithmetic library for Java. It was developed in collaboration with the Altai University, Barnaul, Russia. This paper presents the key architectural decisions made when designing JInterval library. It discusses compliance with functional requirements of the library as well as the current status of JInterval.
25Gb/s 1V-driving CMOS ring modulator with integrated thermal tuning
We report a high-speed ring modulator that has many of the ideal qualities for optical interconnect in future exascale supercomputers. The device was fabricated in a 130nm SOI CMOS process, with 7.5µm ring radius. Its high-speed section, employing a PN junction that works in carrier-depletion mode, enables 25Gb/s modulation and an extinction ratio >5dB with only 1V peak-to-peak driving. Its thermal tuning section allows the device to work in a broad wavelength range, with a tuning efficiency of 0.19nm/mW. Based on microwave characterization and circuit modeling, the modulation energy is estimated at ~7fJ/bit. The whole device fits in a compact 400µm² footprint.
Towards Formally Specifying and Verifying Transactional Memory
Over the last decade, great progress has been made in developing practical transactional memory (TM) implementations, but relatively little attention has been paid to precisely specifying what it means for them to be correct, or formally proving that they are.
In this paper, we present TMS1 (Transactional Memory Specification 1), a precise specification of correct behaviour of a TM runtime library. TMS1 targets TM runtimes used to implement transactional features in an unmanaged programming language such as C or C++. In such contexts, even transactions that ultimately abort must observe consistent states of memory; otherwise, unrecoverable errors such as divide-by-zero may occur before a transaction aborts, even in a correct program in which the error would not be possible if transactions were executed atomically.
We specify TMS1 precisely using an I/O automaton (IOA). This approach enables us to also model TM implementations using IOAs and to construct fully formal and machine-checked correctness proofs for them using well established proof techniques and tools.
We outline key requirements for a TM system. To avoid precluding any implementation that satisfies these requirements, we specify TMS1 to be as general as we can, consistent with these requirements. The cost of such generality is that the condition does not map closely to intuition about common TM implementation techniques, and thus it is difficult to prove that such implementations satisfy the condition.
To address this concern, we present TMS2, a more restrictive condition that more closely reflects intuition about common TM implementation techniques. We present a simulation proof that TMS2 implements TMS1, thus showing that to prove that an implementation satisfies TMS1, it suffices to prove that it satisfies TMS2. We have formalised and verified this proof using the PVS specification and verification system.
Relating similar terms for information retrieval
A resource analyzer selects a resource (e.g., a document) from a grouping of resources. The grouping of resources can be any type of social tagging system used for information retrieval. The selected resource has an assigned uncontrolled tag and an assigned controlled tag. The controlled tag is a term derived from a controlled vocabulary of terms. Having selected the resource for analysis, the resource analyzer identifies a first set of resources in the grouping of resources that have also been assigned the same value as the uncontrolled tag of the selected resource. Similarly, the resource analyzer identifies a second set of resources in the grouping of resources that have also been assigned the same value as the controlled tag. With this information, the resource analyzer then produces a comparison result indicative of a similarity between the first set of resources and the second set of resources.
System Considerations for Capacitive Chip-to-Chip Signaling
This paper is a submission to the IEEE Radio Frequency Integration Technology (RFIT) conference, http://www.ieee-rfit.org. This is an invited submission to be presented at a special session on "Wireless Replacement of Wireline I/O." Conference Date: Nov 30 - Dec 2, 2011; Location: Beijing, China.
The SOM Family: Virtual Machines for Teaching and Research
The talk gives an overview of the development of a family of Smalltalk virtual machine implementations called SOM (Simple Object Machine). The SOM VM, originating from the University of Aarhus, Denmark, has been ported to several programming languages, exploring different objectives. All of the VM implementations focus on providing an easily accessible workbench for teaching, but have also turned out to be a viable research platform. In this talk, each of the SOM VMs will be briefly described along with the results that were achieved in applying it in teaching at both undergraduate and graduate levels as well as research.
Simple Low-Jitter Scheduler
To appear at High Performance Switching and Routing Conference (HPSR), Cartagena, Spain
SampleRank: Training Factor Graphs with Atomic Gradients
We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleRank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23% error reduction on noun-phrase coreference).
Query-Aware MCMC
Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models.
A Network Architecture for the Web of Things
The "Web of Things" is emerging as an exciting vision for seamlessly integrating everyday objects like home appliances, digital picture frames, health monitoring devices and energy meters into the Internet using the Web's well-known stan- dards and blueprints. The key idea is to represent resources on these devices as URIs and use HTTP verbs (GET, PUT, POST, DELETE) as the uniform interface to manipulate them. Unfortunately, practical considerations such as band- width or energy constraints, rewalls/NATs and mobility pose interesting challenges in the realization of this ideal vi- sion. This paper describes these challenges, identies some potential solutions and presents the design and implemen- tation of a gateway-based network architecture to address these concerns. To the best of our knowledge, it represents the rst attempt within the Web of Things community to tackle these issues in a comprehensive manner.
Conversion of Decimal Strings to Floating-Point Numbers
Although floating-point operations are accurately-implemented in modern computers, the conversion of numbers from strings of decimal digits (text) to floating-point is difficult and often inaccurate. The difficulty of this conversion has even been used by hackers to attack weak points in systems. This report explores text-to-floating-point conversion and discusses possible performance improvements.
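One way to see why the conversion is delicate: most decimal strings have no exact binary floating-point representation, so the parser must round to the nearest representable value, and printing that value exactly exposes the gap. The snippet below illustrates this with standard Java library calls.

```java
import java.math.BigDecimal;

// Why decimal-to-binary conversion is delicate: "0.1" has no exact double
// representation, so parsing must round to the nearest representable value.
final class DecimalConversionDemo {
    public static void main(String[] args) {
        double parsed = Double.parseDouble("0.1");

        // Default printing shows the shortest string that round-trips, hiding the rounding.
        System.out.println(parsed);                  // 0.1

        // The exact value actually stored is slightly larger than one tenth.
        System.out.println(new BigDecimal(parsed));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```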
Exploiting CMOS Manufacturing to Reduce Tuning Requirements for Resonant Optical Devices
IEEE Photonics Journal
MUTS: native Scala constructs for software transactional memory.
In this paper we argue that the current approaches to implementing transactional memory in Scala, while very clean, adversely affect the programmability, readability and maintainability of transactional code. These problems occur out of a desire to avoid making modifications to the Scala compiler. As an alternative we introduce Manchester University Transactions for Scala (MUTS), which instead adds keywords to the Scala compiler to allow for the implementation of transactions through traditional block syntax such as that used in “while” statements. This allows for transactions that do not require a change of syntax style and do not restrict their granularity to whole classes or methods. While implementing MUTS does require some changes to the compiler’s parser, no further changes are required to the compiler. This is achieved by the parser describing the transactions in terms of existing constructs of the abstract syntax tree, and the use of Java Agents to rewrite the resulting class files once the compiler has completed. In addition to being an effective way of implementing transactional memory, this technique has the potential to be used as a light-weight way of adding support for additional Scala functionality to the Scala compiler.
Revisiting Condition Variables and Transactions
In the 6th ACM SIGPLAN Workshop on Transactional Computing (Transact’11), June 2011. Citations: 0.
An Evaluation of Asynchronous Stacks
We present an evaluation of some novel hardware implementations of a stack. All designs are asynchronous, fast, and energy efficient, while occupying modest area. We implemented a hybrid of two stack designs that can contain 42 data items with a family of GasP circuits. Measurements from the actual chip show that the chip functions correctly at speeds of up to 2.7 GHz in a 180 nm TSMC process at 2V. The energy consumption per stack operation depends on the number of data movements in the stack, which grows very slowly with the number of data items in the stack. We present a simple technique to measure separately the dynamic and static energy consumption of the complete chip as well as individual data movements in the stack. The average dynamic energy per move in the stack varies between 6pJ and 8pJ depending on the type of move.
Revisiting Condition Variables and Transactions
Prior condition synchronization primitives for memory transactions either force waiting transactions to abort (the retry construct), or force them to commit (also called punctuation in the literature). Although these primitives are useful in some settings, they do not enable programmers to conveniently express idioms that require synchronous communication (e.g., n-way rendezvous operations) between transactions. We present xCondition, a new form of condition variable that neither forces transactions to abort, nor to commit. Instead, an xCondition creates dependencies between the waiting and the corresponding notifying transactions such that the waiter can commit only if the corresponding notifier commits. If waiters and notifiers form dependency cycles (for instance, in synchronous communication idioms), they must commit or abort together. The xCondition construct builds on our earlier work on transaction communicators. We describe how to use xConditions in conjunction with communicators to enable effective coordination and communication between concurrent transactions. We illustrate the use of xConditions, and describe their implementation in the Maxine VM.
Synchronizer Performance in Deep Sub-Micron Technology
17th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2011), April 2011. We show that the performance characteristics of synchronizer circuits track fabrication feature size reductions in a similar manner to the fan-out-of-four, FO4, inverter delay. We compare a variety of flip-flop circuit designs to a reference cross-coupled inverter circuit and show that flip-flops specifically designed for synchronizer use outperform regular data path flip-flops with the progression of fabrication processes. However, care must be taken to compare circuits in each technology, because additional circuit features have often been added to flip-flop cells with each generation of process. These added features, for example to improve test coverage and facilitate clock selection, frequently degrade synchronizer performance. We present a new synchronizer circuit that performs almost as well as the cross-coupled inverter circuit and has reduced sensitivity to voltage supply variation.
Progress in low-power switched optical interconnects
in press, IEEE Journal of Selected Topics in Quantum Electronics, special issue on Green Photonics, 2011.
On The Power of Hardware Transactional Memory to Simplify Memory Management
Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate several new HTM-based algorithms for the “Dynamic Collect” problem that lies at the heart of many modern memory management algorithms. We demonstrate that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches. Despite recent theoretical arguments that HTM provides no worst-case advantages, our results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.
High-efficiency 25Gb/s CMOS ring modulator with integrated thermal tuning
We report a 25Gb/s ring modulator with integrated thermal tuning fabricated in a 130nm CMOS process. With 2Vpp modulation, the optical eye shows >6dB extinction ratio. Modulation energy is estimated <24fJ/bit from circuit modeling.
Max Cut Results, 2011
Results for the Max Cut problem from 2011, compared to the state of the art in 2011
A framework for reasoning about inherent parallelism in modern object-oriented languages
With the emergence of multi-core processors into the mainstream, parallel programming is no longer the specialized domain it once was. There is a growing need for systems to allow programmers to more easily reason about data dependencies and inherent parallelism in general purpose programs. Many of these programs are written in popular imperative programming languages like Java and C#. In this thesis I present a system for reasoning about side-effects of evaluation in an abstract and composable manner that is suitable for use by both programmers and automated tools such as compilers. The goal of developing such a system is both to facilitate the automatic exploitation of the inherent parallelism present in imperative programs and to allow programmers to reason about dependencies which may be limiting the parallelism available for exploitation in their applications. Previous work on languages and type systems for parallel computing has tended to focus on providing the programmer with tools to facilitate the manual parallelization of programs; programmers must decide when and where it is safe to employ parallelism without the assistance of the compiler or other automated tools. None of the existing systems combine abstraction and composition with parallelization and correctness checking to produce a framework which helps both programmers and automated tools to reason about inherent parallelism. In this work I present a system for abstractly reasoning about side-effects and data dependencies in modern, imperative, object-oriented languages using a type and effect system based on ideas from Ownership Types. I have developed sufficient conditions for the safe, automated detection and exploitation of a number of task, data and loop parallelism patterns in terms of ownership relationships. To validate my work, I have applied my ideas to the C# version 3.0 language to produce a language extension called Zal. I have implemented a compiler for the Zal language as an extension of the GPC# research compiler as a proof of concept of my system. I have used it to parallelize a number of real-world applications to demonstrate the feasibility of my proposed approach. In addition to this empirical validation, I present an argument for the correctness of the proposed type system and language semantics, as well as proof sketches for the correctness of the proposed sufficient conditions for parallelization.
A Novel MCM Package Enabling Proximity Communication I-O
I. Shubin, A. Chow, D. Popovic, H. Thacker, M. Giere, R. Hopkins, A. V. Krishnamoorthy, J. G. Mitchell, and J. E. Cunningham, Oracle, San Diego, CA, USA (M. Giere currently with Hewlett-Packard, San Diego, CA).
A novel packaging approach is described that is based on micro-machined features integrated into CMOS chips. Our solution combines two key self-alignment mechanisms for the first time: solder reflow self-alignment and a novel micro-ball and pyramidal pit for passive self-alignment. We report on the demonstration of a MCM package with large footprint semiconductor CMOS chips interconnected by Proximity Communication (PxC), characterization of their high accuracy assembly process, and metrology of the resulting chip misalignment. Our goal is to develop a scalable, lead-free packaging approach by which large NxN PxC-enabled chip arrays are assembled with high precision on organic substrates in a cost effective manner while using industry standard parts and tooling.
Analytical Cache Replacement for Large Caches and Multiple Block Containers
Submitted to the 23rd ACM Symposium on Operating Systems Principles (SOSP), in conjunction with the USENIX conference in Cascais, Portugal.
A 4.6 GHz MDLL with -46dBc reference spur and aperture position tuning
to appear, IEEE International Solid-State Circuits Conference, February 2011.
Experimental studies of the Franz-Keldysh effect in CVD grown GeSi epi on SOI
Electroabsorption from GeSi on silicon-on-insulator (SOI) is expected to have promising potential for optical modulation due to its low power consumption, small footprint, and more importantly, wide spectral bandwidth for wavelength division multiplexing (WDM) applications. Germanium, as a bulk crystal, has a sharp absorption edge with a strong coefficient at the direct band gap close to the C-band wavelength. Unfortunately, when integrated onto silicon, or when alloyed with dilute Si for blueshifting to C-band operation, this strong Franz-Keldysh (FK) effect in bulk Ge is expected to degrade. Here, we report experimental results for GeSi epi grown under a variety of conditions, such as different Si alloy content and selective versus non-selective growth modes, on both silicon and SOI substrates. We compare the measured FK effect to that of bulk Ge material. Reduced pressure CVD growth of GeSi heteroepitaxy with various Si content was studied by different characterization tools: X-ray diffraction (XRD), atomic force microscopy (AFM), secondary ion mass spectrometry (SIMS), Hall measurement and optical transmission/absorption to analyze performance for 1550 nm operation. State-of-the-art GeSi epi with low defect density and low root-mean-square (RMS) roughness was fabricated into p-i-n diodes and tested in a surface-normal geometry. These diodes exhibit a low dark current density of 5 mA/cm2 at 1 V reverse bias, with breakdown voltages of 45 V. Strong electroabsorption was observed in our GeSi alloy with 0.6% Si content, with a maximum absorption contrast of ∆α/α ~5 at 1580 nm at 75 kV/cm.
Using virtual worlds for online role-play
This paper explores the use of virtual worlds to support online role-play as a collaborative activity. It describes some of the challenges involved in building online role-play environments in a virtual world and presents some of the ideas being explored by the project in the role-play applications being developed. Finally, we explore how this can be used within the context of immersive education and 3D collaborative environments.
+SPACES: Serious Games for Role-Playing Government Policies
The paper explores how role-play simulations can be used to support policy discussion and refinement in virtual worlds. Although the work described is set primarily within the context of policy formulation for government, the lessons learnt are applicable to online learning and collaboration within virtual environments. The paper describes how the +Spaces project is using both 2D and 3D virtual spaces to engage with citizens to explore issues relevant to new government policies. It also focuses on the most challenging part of the project, which is to provide environments that can simulate some of the complexities of real life. Some examples of different approaches to simulation in virtual spaces are provided and the issues associated with them are further examined. We conclude that the use of role-play simulations seems to offer the most benefits in terms of providing a generalizable framework for citizens to engage with real issues arising from future policy decisions. Role-plays have also been shown to be a useful tool for engaging learners in the complexities of real-world issues, often generating insights which would not be possible using more conventional techniques.
Immersive Education Spaces using Open Wonderland From Pedagogy through to Practice
This chapter presents a case study of the use of a virtual world environment in UK Higher Education. It reports on the activities carried out as part of the SIMiLLE (System for an Immersive and Mixed reality Language Learning) project to create a culturally sensitive virtual world to support language learning (funded by the UK government JISC program). The SIMiLLE project built on an earlier project called MiRTLE, which created a mixed-reality space for teaching and learning. The aim of the SIMiLLE project was to investigate the technical feasibility and pedagogical value of using virtual environments to provide a realistic socio-cultural setting for language learning interaction. The chapter begins by providing some background information on the Wonderland platform and the MiRTLE project, and then outlines the requirements for SIMiLLE, and how these requirements were supported through the use of a virtual world based on the Open Wonderland virtual world platform. The chapter then presents the framework used for the evaluation of the system, with a particular focus on the importance of incorporating pedagogy into the design of these systems, and how to support good practice with the ever-growing use of 3D virtual environments in formalized education. Finally, the results from the formative and summative evaluations are summarized, and the lessons learnt are presented, which can help inform future uses of immersive education spaces within Higher Education.
A Power-Efficient Network On-Chip Topology
International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip, New York, NY, 2011
A sub-picojoule-per-bit CMOS silicon photonic receiver
Laser Focus World, 2010.
A CAD tool for design and analysis of CNFET circuits
In this paper, we present a graphical computer-aided design (CAD) environment for the design, analysis, and layout of carbon nanotube (CNT) field-effect transistor (CNFET) circuits. This work is motivated by the fact that such a tool currently does not exist in the public domain for researchers. Our tool has been integrated within Electric - a very powerful, yet free CAD system for custom design of integrated circuits (ICs). The tool supports CNFET schematic and layout entry, rule checking, and HSpice/VerilogA netlist generation. We provide users with a customizable CNFET technology library with the ability to specify λ-based design rules. We showcase the capabilities of our tool by demonstrating the design of a CNFET standard cell library and a 16-bit carry-select adder. We hope that the availability of this tool will invigorate the CAD community to explore novel ideas in CNFET circuit design.
Low power silicon photonic transceivers
IEEE Photonics Society Summer Topicals, 2010.
Coupled Data Communication Techniques for High Performance and Low Power
part of the Integrated Circuits and Systems series, A. Chandrakasan, series editor. Springer Verlag, ISBN 978-1-4419-6587-5, 2010.
Macrochip computer systems enabled by silicon photonic interconnects
Proceedings, SPIE Photonics West, vol. 7607: Optoelectronic Interconnects and Component Integration X, 2010.
The integration of silicon photonics and VLSI electronics for computing and switching systems
OSA Photonics in Switching Topical Meeting, 2010.
Employing coherent detection for on-chip six-axis position sensors
9th Annual IEEE Conference on Sensors, November 2010.
On-chip CMOS position sensors using coherent detection
IEEE Asian Solid-State Circuits Conference, November 2010.
Breaking the picojoule-per-bit barrier
Proceedings of the IEEE Photonics Society Annual Meeting, October 2010.
Compacting high-end computing systems with dense WDM silicon photonic interconnects
IEEE Compound Semiconductor IC Symposium, October 2010.
Dynamic Code Evolution for Java
Dynamic code evolution is a technique to update a program while it is running. In an object-oriented language such as Java, this can be seen as replacing a set of classes by new versions. We modified an existing high-performance virtual machine to allow arbitrary changes to the definition of loaded classes. Besides adding and deleting fields and methods, we also allow any kind of changes to the class and interface hierarchy. Our approach focuses on increasing developer productivity during debugging. Changes can be applied at any point a Java program can be suspended. The evaluation section shows that our modifications to the virtual machine have no negative performance impact on normal program execution. The fast in-place instance update algorithm ensures that the performance characteristics of a change are comparable with performing a full garbage collection run. Standard Java development environments are capable of using the code evolution features of our modified virtual machine, so no additional tools are required.
Efficient Coroutines for the Java Platform
Coroutines are non-preemptive light-weight processes. Their advantage over threads is that they do not have to be synchronized because they pass control to each other explicitly and deterministically. Coroutines are therefore an elegant and efficient implementation construct for numerous algorithmic problems. Many mainstream languages and runtime environments, however, do not provide a coroutine implementation. Even if they do, these implementations often have less than optimal performance characteristics because of the tradeoff between run time and memory efficiency. As more and more languages are implemented on top of the Java virtual machine (JVM), many of which provide coroutine-like language features, the need for a coroutine implementation has emerged. We present an implementation of coroutines in the JVM that efficiently handles a large range of workloads. It imposes no overhead for applications that do not use coroutines and performs well for applications that do. For evaluation purposes, we use our coroutines to implement JRuby fibers, which leads to a significant speedup of certain JRuby programs. We also present general benchmarks that show the performance of our approach and outline its run-time and memory characteristics.
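To make the control-transfer idiom in this abstract concrete, here is a minimal sketch using Python generators; it models only the explicit, deterministic hand-off between two cooperating activities and is not the JVM implementation the paper describes.

# Minimal illustration of coroutine-style control transfer using Python
# generators. This models the idiom only; it is not the JVM coroutine
# implementation described in the paper.

def producer(n):
    """Yield n items, handing control back to the consumer after each one."""
    for i in range(n):
        yield i  # explicit, deterministic transfer of control

def consume(n):
    items = []
    for item in producer(n):  # control alternates: producer, consumer, producer, ...
        items.append(item * 2)
    return items

print(consume(5))  # [0, 2, 4, 6, 8]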
Environmental considerations when measuring relative performance of graphics cards.
In this paper we examine some of the environmental conditions that have to be considered when comparing the performance of GPUs to CPUs. The range of these considerations varies greatly, from the differing ages of the hardware used to the effects of running the GPU code before the CPU code within the same binary. The latter of these has some quite surprising effects on the system as a whole. We then go on to test the different hardware performance at matrix multiplication using both their basic linear algebra libraries and hand-coded functions. This is done while respecting the considerations we have described earlier in the paper, and addressing a problem for which the use of the Intel MKL library cannot be argued to be unfair to the CPU.
Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems
This paper addresses the problem of the dynamic scheduling of data-intensive multiprocessor jobs. Each job requires some number of CPUs and some amount of data that needs to be downloaded into a local storage. The completion of each job brings some benefit (utility) to the system, and the goal is to find the optimal scheduling policy that maximizes the average utility per unit of time obtained from all completed jobs. A co-evolutionary solution methodology is proposed, where the utility-based policies for managing local storage and for scheduling jobs onto the available CPUs mutually affect each other’s environments, with both policies being adaptively tuned using the Reinforcement Learning (RL) methodology. The simulation results demonstrate that the performance of the scheduling policies increases significantly as a result of being tuned with RL, to the point that they significantly outperform the best scheduling algorithm suggested in the literature for jobs with soft-deadline utility functions.
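As a rough illustration of the utility-based, RL-tuned scheduling idea summarized above, the sketch below adjusts a value estimate from observed utility per time step and admits jobs with a simple utility-per-CPU heuristic. The state feature, numbers, and update rule are illustrative assumptions, not the paper's formulation.

# Toy sketch of utility-based scheduling tuned by reinforcement learning.
# Jobs have a CPU demand and a utility; all features and constants here are
# illustrative, not the policy proposed in the paper.

ALPHA = 0.1            # learning rate (assumption)
values = {}            # value estimate per discretized state (here: free CPUs)

def estimate(state):
    return values.get(state, 0.0)

def update(state, observed_utility_rate):
    # Move the value estimate toward the utility per time step actually observed.
    values[state] = estimate(state) + ALPHA * (observed_utility_rate - estimate(state))

def schedule(free_cpus, queue):
    """Greedily admit queued jobs while CPUs remain, preferring jobs with the
    highest utility per CPU -- a stand-in for value-guided selection."""
    queue.sort(key=lambda j: j["utility"] / j["cpus"], reverse=True)
    admitted = []
    for job in queue:
        if job["cpus"] <= free_cpus:
            admitted.append(job)
            free_cpus -= job["cpus"]
    return admitted

# One simulated decision step with made-up jobs.
queue = [{"cpus": 2, "utility": 5.0}, {"cpus": 4, "utility": 6.0}, {"cpus": 1, "utility": 3.0}]
running = schedule(8, list(queue))
observed = sum(j["utility"] for j in running) / 10.0  # toy utility per time step
update(8, observed)
print(running, values)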
Optical interconnects in the data center
18th Annual IEEE Symposium on High Performance Interconnects (HOT-I2010), August 2010.
Clocking links in multi-chip packages: a case study
18th Annual IEEE Symposium on High Performance Interconnects, August 2010.
Ultra-low power silicon photonic transceivers for inter/intra-chip interconnects
Proceedings SPIE Optics + Photonics, August 2010.
How good is a span of terms? Exploiting proximity to improve web retrieval
Ranking search results is a fundamental problem in information retrieval. In this paper we explore whether the use of proximity and phrase information can improve web retrieval accuracy. We build on existing research by incorporating novel ranking features based on flexible proximity terms with recent state-of-the-art machine learning ranking models. We introduce a method of determining the goodness of a set of proximity terms that takes advantage of the structured nature of web documents, document metadata, and phrasal information from search engine user query logs. We perform experiments on a large real-world Web data collection and show that using the goodness score of flexible proximity terms can improve ranking accuracy over state-of-the-art ranking methods by as much as 13%. We also show that we can improve accuracy on the hardest queries by as much as 9% relative to state-of-the-art approaches.
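One classic proximity signal of the kind this line of work builds on is the length of the smallest document window covering all query terms; the sketch below computes it. This is a generic illustration only, not the goodness score defined in the paper.

# A simple proximity feature: the length of the smallest window in a document
# that covers all query terms. Generic illustration, not the paper's measure.

def min_cover_window(doc_tokens, query_terms):
    query = set(query_terms)
    best = None
    for start in range(len(doc_tokens)):
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in query:
                seen.add(doc_tokens[end])
            if seen == query:
                span = end - start + 1
                best = span if best is None else min(best, span)
                break
    return best  # None if some query term never occurs

doc = "cheap flights to new york from london".split()
print(min_cover_window(doc, ["new", "york", "flights"]))  # 4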
Wafer-Testing of Optoelectronic-Gigascale CMOS Integrated Circuits
Gigascale integrated (GSI) chips with high-bandwidth, integrated optoelectronic (OE) and photonic components are an emerging technology. In this paper, we present the prospects and opportunities for wafer-testing of chips with electrical and optical I/O interconnects. The issues and requirements of testing OE-GSI chips during high-volume manufacturing are identified and discussed. Two probe substrate technologies based on microelectromechanical systems (MEMS) for simultaneously interfacing a multitude of surface-normal optical I/Os and high-density electrical I/Os are detailed. The first probe substrate comprises vertically compliant probes for contacting electrical I/Os and grating-in-waveguide optical probes for optical I/O coupling. The second MEMS probe module uses microsockets and through-substrate vias (TSVs) to contact pillar-shaped electrical and optical I/Os and to redistribute the signals, respectively.
Optical interconnect for high-end computer systems
IEEE Design and Test of Computers, Vol. 27, No. 4, July/August 2010, pp. 10-19.
Resource-bounded Information Extraction: Acquiring Missing Feature Values On Demand
We present a general framework for the task of extracting specific information "on demand" from a large corpus such as the Web under resource constraints. Given a database with missing or uncertain information, the proposed system automatically formulates queries, issues them to a search interface, selects a subset of the documents, extracts the required information from them, and fills the missing values in the original database. We also exploit inherent dependency within the data to obtain useful information with fewer computational resources. We build such a system in the citation database domain that extracts the missing publication years using limited resources from the Web. We discuss a probabilistic approach for this task and present first results. The main contribution of this paper is to propose a general, comprehensive architecture for designing a system adaptable to different domains.
Silicon photonic network architectures for scalable, power-efficient multi-chip systems
Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture (ISCA), 2010.
Flip-chip integrated silicon photonic bridge chips for sub-picojoule per bit optical links
accepted and to appear at IEEE Electronic Components and Technology Conference (ECTC2010), June 2010.
You Are Not Alone: Breaking Transaction Isolation
In the 3rd International Workshop on Multicore Software Engineering (IWMSE10).
Thesis: Debugging and Profiling of Transactional Programs
Transactional memory (TM) has become increasingly popular in recent years as a promising programming paradigm for writing correct and scalable concurrent programs. Despite its popularity, there has been very little work on how to debug and profile transactional programs. This dissertation addresses this situation by exploring the debugging and profiling needs of transactional programs, explaining how the tools should change to support these needs, and implementing preliminary infrastructure to support this change. Defense date: Tuesday, March 23rd, 4pm, Lubrano Conference Room, CIT Building, Brown University. Includes a few demos for profiling transactional programs using the T-PASS prototype.
A Package Demonstration with Solder Free Compliant Flexible Interconnects.
I. Shubin, A. Chow, J. Cunningham, M. Giere, N. Nettleton, N. Pinckney, J. Shi, J. Simons, and D. Douglas, Oracle, San Diego, CA, USA; E. M. Chow, D. Debruyker, B. Cheng, and G. Anderson, Palo Alto Research Center (PARC), Palo Alto, CA, USA. Flexible, stress-engineered spring interconnects are a novel technology potentially enabling room-temperature assembly approaches to building highly integrated and multi-chip modules (MCMs). Such interconnects are an essential solder-free technology facilitating MCM package diagnostics and rework. Previously, we demonstrated the performance, functionality, and reliability of compliant micro-spring interconnects under temperature cycling, humidity bias and high-current soak. Here, we demonstrate for the first time a package in which the first-level conventional fine-pitch C4 solder bump interconnects are replaced by arrays of microsprings. Dedicated CMOS integrated circuits (ICs) have been assembled onto substrates using these integrated microsprings. Metrology modules on the ICs are designed and used to characterize the connectivity and resistance of each micro-spring site.
A macrochip interconnection network enabled by silicon nanophotonic devices
Journal of Nanoscience and Nanotechnology, Special Issue on Nanophotonics and Nanooptics, Vol. 10, Number 3, March 2010, pp. 1616-1625.
Debugging applications at resource constrained virtual machines using dynamically installable lightweight agents
A system for debugging applications at resource-constrained virtual machines may include a target device configured to host a lightweight debug agent to obtain debug information from one or more threads of execution at a virtual machine executing at the target device, and a debug controller. The lightweight debug agent may include a plurality of independently deployable modules. The debug controller may be configured to select one or more of the modules for deployment at the virtual machine for a debug session initiated to debug a targeted thread, to deploy the selected modules at the virtual machine for the debug session, and to receive debug information related to the targeted thread from the lightweight debug agent during the session.
How Open Source and Collaboration aid Innovation in VLSI CAD
CRAW (Committee on the Status of Women in Computing Research) Distinguished Lecture, February 2010.
High bandwidth and low energy on-chip signaling with adaptive pre-emphasis in 90nm CMOS
Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC2010), February 2010, pp. 182-183.
Ultra-low-energy all-CMOS modulator integrated with driver
Optics Express, Vol. 18, Number 3, 2010, pp. 3059-3070.
Ultralow-power silicon photonic interconnect for high-performance computing systems
The Ultra-performance Nanophotonic Intrachip Communication (UNIC) project aims to achieve unprecedented high-density, low-power, large-bandwidth, and low-latency optical interconnect for highly compact supercomputer systems. This project, which started in 2008, sets extremely aggressive goals on power consumption and footprint for optical devices and the integrated VLSI circuits. In this paper we discuss our challenges and present some of our first-year achievements, including a 320 fJ/bit hybrid-bonded optical transmitter and a 690 fJ/bit hybrid-bonded optical receiver. The optical transmitter was made of a Si microring modulator flip-chip bonded to a 90nm CMOS driver with digital clocking. With only 1.6mW power consumption measured from the power supply voltages and currents, the transmitter exhibits a wide open eye with extinction ratio >7dB at 5Gb/s. The receiver was made of a Ge waveguide detector flip-chip bonded to a 90nm CMOS digitally clocked receiver circuit. With 3.45mW power consumption, the integrated receiver demonstrated -18.9dBm sensitivity at 5Gb/s for a BER of 10^-12. In addition, we discuss our Mux/Demux strategy and present our devices with small footprints and low tuning energy.
A sub-picojoule-per-bit CMOS photonic receiver for densely integrated systems
Optics Express, Vol. 18, Number 1, 2010, pp. 204-211.
A Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi-Processors
International Workshop on Network on Chip Architectures (NoCArc'09), New York, NY, Dec 12, 2009
Ultralow-Power High-Performance Si Photonic Transmitter
We report a 320fJ/bit transmitter made of a Si microring modulator flip-chip bonded to a CMOS driver. The transmitter consumes only 1.6mW power, and exhibits a wide open eye with extinction ratio >7dB at 5Gb/s.
Circuits for silicon photonics on a 'macrochip'
Digest of Technical Papers, IEEE Asian Solid-State Circuits Conference (ASSCC2009), November 2009, pp. 17-20.
A test platform for thermal, electrical, and mechanical characterization of packages
42nd International Symposium on Microelectronics (IMAPS), November 2009.
Improving Software Quality with Parfait
Parfait is a static bug-checking tool for C/C++ source code, designed to be both scalable and precise. Requirements for this tool were derived from interaction with the Solaris(TM) operating system team, where millions of lines of source code must be checked in a time-efficient manner, with minimal noise and a low cost of integration into the build process.
Internally at Sun various software organizations are using Parfait to analyse thousands to millions of lines of code, with over 500 buffer overflows found and fixed. Assisted by its graphical web-based user interface, both developers and managers are able to traverse bug data in a quick and easy way. Internal feedback from the various organizations allows us to improve the tool on a regular basis.
Externally, we and others are using Parfait to analyse open source code, including the open source operating system kernels OpenSolaris(TM), Linux and OpenBSD. Bugs found have been submitted to their respective communities and are normally fixed in a timely fashion. Presentation at the Software Assurance Forum, November 2009.
Benchmarking Static C Bug-Checking Tools
One of the problems with the large number of static bug-checking tools is that it is hard for users (developers and managers) to determine which tool best fits their organisation; quantifying precision of a tool and its scalability is necessary. Precision is the ratio of the number of bugs correctly reported to the total number of bugs reported by a tool. Scalability is the ability of a tool to scale proportionally in runtime relative to the size of the input codebase.
Another problem that quality assurance engineers have with these tools is the lack of information on what bugs are missed in the code; quantifying recall of the tool is also needed. Recall is the ratio of the number of bugs correctly reported by a tool to the total number of bugs in a codebase. Taking into account both, precision and recall, gives a measure of a tool's accuracy. Accuracy is the ability of a bug-checking tool to report correct bugs while at the same time holding back incorrect ones.
In Proceedings of "The Second Static Analysis Tool Exposition (SATE) 2009" Workshop, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-287, June, 2010.
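The precision and recall definitions above translate directly into a short computation; the counts and the F1-style combination below are purely illustrative.

# Precision, recall, and one combined accuracy figure for a bug-checking tool,
# following the definitions above. The counts are invented for illustration.

def precision(correct_reported, total_reported):
    return correct_reported / total_reported

def recall(correct_reported, total_bugs_in_codebase):
    return correct_reported / total_bugs_in_codebase

def f1(p, r):
    # A common way to fold precision and recall into one number; the papers'
    # exact accuracy measure may differ.
    return 2 * p * r / (p + r)

p = precision(correct_reported=40, total_reported=60)        # 20 false positives
r = recall(correct_reported=40, total_bugs_in_codebase=100)  # 60 bugs missed
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.67 0.4 0.5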
Early Experience with a Commercial Hardware Transactional Memory Implementation
We report on our experience with the hardware transactional memory (HTM) feature of two revisions of a prototype multicore processor. Our experience includes a number of promising results using HTM to improve performance in a variety of contexts, and also identifies some ways in which the feature could be improved to make it even better. We give detailed accounts of our experiences, sharing techniques we used to achieve the results we have, as well as describing challenges we faced in doing so. This technical report expands on our ASPLOS paper [9], providing more detail and reporting on additional work conducted since that paper was written.
An ultra-low power all CMOS Si photonic transmitter
OSA Frontiers in Optics, postdeadline session, October 2009.
Adaptive Optimization of the Sun Java Real-Time System Garbage Collector
Garbage collection (GC) is one of the largest sources of unpredictability in Java(TM) applications, and a real-time virtual machine must use garbage collection algorithms that minimize delays to real-time threads and at the same time maximize the overall application’s throughput. In order to achieve the optimal tradeoff between these conflicting objectives, the GC cycle (which needs to take place periodically in order to free the memory no longer used by the application) needs to be triggered at the optimal time: if it is triggered too soon then the application’s throughput will decrease unnecessarily, while if it is triggered too late then the application can run out of free memory and block real-time threads unnecessarily. Starting with Sun Java Real-Time System 2.0 (Java RTS), a new real-time garbage collector (RTGC) is available. One of the key RTGC parameters is the StartupMemoryThreshold, which determines how low the free memory in the system can fall before a garbage collection is triggered. This paper presents a framework for dynamically adapting the StartupMemoryThreshold for achieving the optimal balance between the application’s throughput and pause time, which was integrated into the beta release of Java RTS 2.2. An experimental evaluation of this framework using the SPECjbb2005 benchmark confirmed its effectiveness. This framework can be used in conjunction with any concurrent or time-based incremental garbage collector.
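A minimal sketch of the feedback idea behind adapting a StartupMemoryThreshold-style parameter: trigger GC earlier after a cycle in which the application blocked, and later after a cycle that left plenty of memory free. The step size and slack value are assumptions; this is not the algorithm shipped in Java RTS 2.2.

# Conceptual feedback controller for a GC trigger threshold, expressed as a
# fraction of the heap that must remain free before a collection starts.
# Illustrative only; not the Java RTS adaptation framework.

def adapt_threshold(threshold, blocked, free_at_gc_end, heap_size,
                    step=0.05, slack=0.30):
    if blocked:
        # GC started too late: trigger earlier next cycle.
        threshold = min(threshold + step, 0.9)
    elif free_at_gc_end / heap_size > slack:
        # GC started earlier than necessary: reclaim some throughput.
        threshold = max(threshold - step, 0.05)
    return threshold

t = 0.20  # start GC when free memory falls below 20% of the heap (assumed)
t = adapt_threshold(t, blocked=True, free_at_gc_end=0, heap_size=512)
t = adapt_threshold(t, blocked=False, free_at_gc_end=300, heap_size=512)
print(round(t, 2))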
A Silicon photonic WDM network for high performance macrochip communications
Proceedings, SPIE Photonics West, Vol. 7221: Photonics packaging, integration, and interconnects IX, 2009.
Lazy Continuations for Java Virtual Machines
Continuations, or 'the rest of the computation', are a concept that is most often used in the context of functional and dynamic programming languages. Implementations of such languages that work on top of the Java virtual machine (JVM) have traditionally been complicated by the lack of continuations because they must be simulated. We propose an implementation of continuations in the Java virtual machine with a lazy or on-demand approach. Our system imposes zero run-time overhead as long as no activations need to be saved and restored and performs well when continuations are used. Although our implementation can be used from Java code directly, it is mainly intended to facilitate the creation of frameworks that allow other functional or dynamic languages to be executed on a Java virtual machine. As there are no widely used benchmarks for continuation functionality on JVMs, we developed synthetic benchmarks that show the expected costs of the most important operations depending on various parameters.
Simple Fairness Protocols for Daisy Chain Interconnects
Symposium on High-Performance Interconnects (HotI'09), New York
BegBunch: Benchmarking for C Bug Detection Tools
Benchmarks for bug detection tools are still in their infancy. Though in recent years various tools and techniques were introduced, little effort has been spent on creating a benchmark suite and a harness for a consistent quantitative and qualitative performance measurement. For assessing the performance of a bug detection tool and determining which tool is better than another for the type of code to be looked at, the following questions arise: 1) how many bugs are correctly found, 2) what is the tool's average false positive rate, 3) how many bugs are missed by the tool altogether, and 4) does the tool scale. In this paper we present our contribution to the C bug detection community: two benchmark suites that allow developers and users to evaluate accuracy and scalability of a given tool. The two suites contain buggy, mature open source code; bugs are representative of "real world" bugs. A harness accompanies each benchmark suite to compute automatically qualitative and quantitative performance of a bug detection tool. BegBunch has been tested to run on the Solaris(TM), Mac OS X and Linux operating systems. We show the generality of the harness by evaluating it with our own Parfait and three publicly available bug detection tools developed by others.
Productive Petascale Computing: Requirements, Hardware, and Software
Supercomputer designers traditionally focus on low-level hardware performance criteria such as CPU cycle speed, disk bandwidth, and memory latency. The High-Performance Computing (HPC) community has more recently begun to realize that escalating hardware performance is, by itself, contributing less and less to real productivity—the ability to develop and deploy high-performance supercomputer applications at acceptable time and cost.
The Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) initiative challenged industry vendors to design a new generation of supercomputers that would deliver a 10x improvement in this newly acknowledged but poorly understood domain of real productivity. Sun Microsystems, choosing to abandon customary evolutionary approaches, responded with two revolutionary decisions. The first was to investigate the nature of supercomputer productivity in the full context of use, which includes people, organizations, goals, practices, and skills as well as processors, disks, memory, and software. The second decision was to rethink completely the design of supercomputing systems, informed by productivity-based requirements and driven by recent technological breakthroughs. Crucial to the implementation of these decisions was the establishment of multidisciplinary, closely collaborating teams that conducted research into productivity and developed the many closely intertwined design decisions needed to meet DARPA’s challenge.
Among the most significant results from Sun’s productivity research was a detailed diagnosis of software development as the dominant barrier to productivity improvements in the HPC community. The level of expertise required, combined with the amount of effort needed to develop conventional HPC codes, has already created a crisis of productivity. Even worse, there is no path forward within the existing paradigm that will significantly increase productivity as hardware systems scale up. The same issues also prevent HPC from “scaling out” to a broader class of applications. This diagnosis led to design requirements that address specific issues behind the expertise and effort bottlenecks.
Sun’s design teams explored complex, system-wide tradeoffs needed to meet these requirements in all aspects of the design, including reliability, performance, programmability, and ease of administration. These tradeoffs drew on technological advances in massive chip multithreading, extremely high-performance interconnects, resource virtualization, and programming language design. The outcome was the design for a machine to operate at petascale, with extremely high reliability and a greatly simplified programming model. Although this design supports existing codes and software technologies—crucial requirements—it also anticipates that the greatest productivity breakthroughs will follow from dramatic changes in how HPC codes are developed, changes that require a system of the type designed by Sun’s HPCS team.
A Reinforcement Learning Framework for Utility-Based Scheduling in Resource-Constrained Systems
This paper presents a general methodology for online scheduling of parallel jobs onto multi-processor servers in a soft real-time environment, where the final utility of each job decreases with the job completion time. A solution approach is presented where each server uses Reinforcement Learning for tuning its own value function, which predicts the average future utility per time step obtained from completed jobs based on the dynamically observed state information. The server then selects jobs from its job queue, possibly preempting some currently running jobs and “squeezing” some jobs into fewer CPUs than they ideally require to maximize the value of the resulting server state. The experimental results demonstrate the feasibility and benefits of the proposed approach.
3D visualization of integrated circuits in the Electric VLSI design system
User Track poster, 2009 Design Automation Conference, July 2009.
Computing microsystems based on silicon photonic interconnects
Proceedings of the IEEE, Vol. 97, Issue 7, July 2009, pp. 1337-1361.
Sun Small Programmable Object Technology (Sun SPOTs) and Sensor.Network
Presentation and demo at the Sensor Web Enablement (SWE) working group meeting of the Open Geospatial Consortium (OGC), Cambridge, MA, Jun 23, 2009.
Integrating novel packaging technologies for large-scale computer systems
ASME/Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS Conference (Interpack2009), June 2009.
Flat tree networks
ASME/Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS Conference (Interpack2009), June 2009.
JavaOne Minute with Vipul Gupta
A video demonstrating Sensor.Network filmed live during JavaOne 2009, Jun, 2009.
Generating Transparent, Steerable Recommendations from Textual Descriptions of Items
We propose a recommendation technique that works by collecting text descriptions of the items that we want to recommend and then using this textual aura to compute the similarity between items using techniques drawn from information retrieval. We show how this representation can be used to explain the similarities between items using terms from the textual aura and further how it can be used to steer the recommender. We'll describe a system that demonstrates these techniques and we'll detail some preliminary experiments aimed at evaluating the quality of the recommendations and the effectiveness of the explanations of item similarity.
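A standard information-retrieval way to compute item similarity from text descriptions, of the kind this abstract refers to, is TF-IDF weighting with cosine similarity. The sketch below uses scikit-learn (an assumed dependency) and made-up item descriptions; it is not the authors' system. The terms shared by two similar items can double as a simple explanation of why they were linked.

# Item-to-item similarity from text descriptions via TF-IDF and cosine
# similarity. Illustrative sketch with invented data; scikit-learn assumed.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = {
    "item_a": "acoustic folk guitar with warm vocals and gentle strumming",
    "item_b": "folk singer with acoustic guitar and soft vocals",
    "item_c": "loud electronic dance track with heavy bass",
}

names = list(descriptions)
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([descriptions[n] for n in names])

sims = cosine_similarity(matrix)
print(names[0], names[1], round(sims[0, 1], 2))

# Terms present in both of the two most similar items serve as a crude explanation.
terms = vectorizer.get_feature_names_out()
shared = [t for t in terms
          if matrix[0, vectorizer.vocabulary_[t]] > 0
          and matrix[1, vectorizer.vocabulary_[t]] > 0]
print("shared terms:", shared)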
Proximity interconnect flip-chip package with micron chip-to-chip alignment tolerances
IEEE Electronic Components and Technology Conference (ECTC2009), May 2009.
A modular synchronizing FIFO for NoCs
3rd ACM/IEEE International Symposium on Networks-on-Chip (NOCs2009), May 2009.
Hierarchical Filesystems Are Dead
For over forty years, we have assumed hierarchical file system namespaces. These namespaces were a rudimentary attempt at simple organization. As users have begun to interact with increasing amounts of data and are increasingly demanding search capability, such a simple hierarchical model has outlasted its usefulness. For this reason, we should design file systems whose organizations map to the ways we access and manipulate data now. We present a new file system architecture in which we replace the hierarchical namespace with a tagged, search-based one.
Novel Packaging with Rematable Spring Interconnect Chips for MCMs
IEEE Electronic Components and Technology Conference (ECTC2009), May 2009.
BGA package co-integration of electrical, optical, and capacitive interconnects
IEEE Electronic Components and Technology Conference (ECTC2009), May 2009.
Synchroniser Behaviour and Analysis
15th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2009), May 2009.
The integration of silicon photonics and VLSI electronics for computing systems intra-connect
Proceedings, SPIE Photonics West, Vol. 7220: Silicon photonics IV, 2009.
Communication in macrochips using silicon photonics for high-performance and low-energy computing
5th Annual IEEE Int'l Symposium on VLSI Design, Automation, and Test (VLSI-DAT2009), April 2009.
Enabling Technologies for Multi-Chip Integration using Proximity Communication
5th Annual IEEE Int'l Symposium on VLSI Design, Automation, and Test (VLSI-DAT2009), April 2009.
Exceptions and Transactions in C++
In the 1st USENIX Workshop on Hot Topics in Parallelism (HotPar’09).
Trends from Ten Years of Soft Error Experimentation
Proceedings, System Effects of Logic Soft Errors (SELSE2009), March 2009.
Experiments with a Solar-powered Sun SPOT
Sun SPOTs are small, battery-powered, wireless embedded devices that can autonomically sense and respond to their environment. These devices have the potential to revolutionize a broad spectrum of applications - environmental monitoring, asset tracking, proactive health care, intelligent agriculture, military surveillance, etc. Many of these require the device to run for long periods (months) using a combination of duty cycling and renewable energy sources (e.g., solar panels). This note describes lessons learned while collecting data from a solar-powered SPOT for a period of nearly four weeks.
Anatomy of a Scalable Software Transactional Memory
Existing software transactional memory (STM) implementations often exhibit poor scalability, usually because of nonscalable mechanisms for read sharing, transactional consistency, and privatization; some STMs also have nonscalable centralized commit mechanisms. We describe novel techniques to eliminate bottlenecks from all of these mechanisms, and present SkySTM, which employs these techniques. SkySTM is the first STM that supports privatization and scales on modern multicore multiprocessors with hundreds of hardware threads on multiple chips. A central theme in this work is avoiding frequent updates to centralized metadata, especially for multi-chip systems, in which the cost of accessing centralized metadata increases dramatically. A key mechanism we use to do so is a scalable nonzero indicator (SNZI), which was designed for this purpose. A secondary contribution of the paper is a new and simplified SNZI algorithm. Our scalable privatization mechanism imposes only about 4% overhead in low-contention experiments; when contention is higher, the overhead still reaches only 35% with over 250 threads. In contrast, prior approaches have been reported as imposing over 100% overhead in some cases, even with only 8 threads.
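The SNZI object mentioned above exposes arrive, depart, and query operations. The sketch below implements that interface with a single lock-protected counter purely to show the contract; the whole point of SNZI is to provide this interface without such centralized state, which this sketch deliberately does not attempt.

import threading

# A deliberately simple nonzero indicator: a lock-protected counter that
# answers "is the count nonzero?". Interface illustration only; not the
# scalable SNZI algorithm or the simplified variant presented in the paper.

class SimpleNonzeroIndicator:
    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def arrive(self):
        with self._lock:
            self._count += 1

    def depart(self):
        with self._lock:
            self._count -= 1

    def query(self):
        with self._lock:
            return self._count != 0

ind = SimpleNonzeroIndicator()
ind.arrive()
print(ind.query())  # True
ind.depart()
print(ind.query())  # False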
An Exit Hole method for Verified Solution of IVPs for ODEs using Linear Programming for the Search of Tight Bounds
In his survey [5], Nedialkov stated that "Although high-order Taylor series may be reasonably efficient for mildly stiff ODEs, we do not have an interval method suitable for stiff ODEs." This paper is an attempt to find such a method, based on building a positively invariant set in extended state space. A positively invariant set is treated as a geometric generalization of differential inequalities. We construct a positively invariant set from simpler sets which are not positively invariant, but have an exit hole instead. The exit holes of the simpler sets are suppressed during the construction. This paper considers only sets which are polytopes. Linear interval forms are used to evaluate the projection of the ODE velocity vector onto the normals of the polytope facets. This permits the use of Linear Programming in the search for a tighter positively invariant set. The Exit Hole method is illustrated on the stiff Van der Pol ODE.
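The facet check sketched in this abstract corresponds to the usual sub-tangentiality condition for a polytope; in general form (notation mine, not the paper's): given a polytope P = \{ x : a_i^\top x \le b_i,\ i = 1, \dots, m \} and the ODE \dot{x} = f(t, x), the facet F_i = \{ x \in P : a_i^\top x = b_i \} admits no exit when a_i^\top f(t, x) \le 0 for all x \in F_i. If this holds on every facet, P is positively invariant; a facet on which an interval evaluation of a_i^\top f can be positive is an exit hole that the construction must suppress.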
Modeling, Analysis and Throughput Optimization of a Generational Garbage Collector
One of the garbage collectors in Sun's HotSpot Java(TM) Virtual Machine is known as the generational throughput collector, as it was designed to have a large throughput (the fraction of time spent on the application's work rather than on garbage collection). This paper derives an analytical expression for the throughput of this collector in terms of the following key parameters: the sizes of the "Young" and "Old" memory spaces and the value of the tenuring threshold. Based on the derived throughput model, a practical algorithm ThruMax is proposed for tuning the collector's parameters so as to formally maximize its throughput. This algorithm was implemented as an optional feature in an early release of JDK(TM) 7, and its performance was evaluated for various settings of the SPECjbb2005 workload. A consistent improvement in throughput was demonstrated when the ThruMax algorithm was enabled in JDK.
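For reference, the throughput notion used above, the fraction of time spent on the application's work rather than on collection, is simply \text{throughput} = \frac{t_{\text{app}}}{t_{\text{app}} + t_{\text{GC}}}. The paper's contribution is an analytical expression for this quantity in terms of the Young and Old space sizes and the tenuring threshold, which this plain identity does not attempt to reproduce.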
MiRTLE: a mixed reality teaching & learning environment
This technical report describes a project to create a mixed reality teaching and learning environment using the virtual world toolkit Project Wonderland. The purpose of this document is to provide details about the background to the project, its goals and achievements. The intended audience for this document is educators, educational technologists, and others interested in the educational applications of virtual worlds.
Kinesis: A New Approach to Replica Placement in Distributed Storage Systems
Kinesis is a novel data placement model for distributed storage systems. It exemplifies three design principles: structure (division of servers into a few failure-isolated segments), freedom of choice (freedom to allocate the best servers to store and retrieve data based on current resource availability), and scattered distribution (independent, pseudo-random spread of replicas in the system). These design principles enable storage systems to achieve balanced utilization of storage and network resources in the presence of incremental system expansions, failures of single and shared components, and skewed distributions of data size and popularity. In turn, this ability leads to significantly reduced resource provisioning costs, good user-perceived response times, and fast, parallelized recovery from independent and correlated failures. This article validates Kinesis through theoretical analysis, simulations, and experiments on a prototype implementation. Evaluations driven by real-world traces show that Kinesis can significantly outperform the widely used Chain replica-placement strategy in terms of resource requirements, end-to-end delay, and failure recovery.
Prediction-time Active Feature-value Acquisition for Cost-Effective Customer Targeting
In general, the prediction capability of classification models can be enhanced by acquiring additional relevant features for instances. However, in many cases, there is a significant cost associated with this additional information, driving the need for an intelligent acquisition strategy. Motivated by real-world customer targeting domains, we consider the setting where a fixed set of additional features can be acquired for a subset of the instances at test time. We study different acquisition strategies of selecting instances for which to acquire more information, so as to obtain the most improvement in prediction performance per unit cost. We apply our methods to various targeting datasets and show that we can achieve a better prediction performance by actively acquiring features for only a small subset of instances, compared to a random-sampling baseline.
Dealing with Issues in VLSI Interconnect Scaling
IEEE Expert Now online learning course, ISBN 1-4244-1450-4, 2008.
To Yaeko, on the occasion of her retirement
Photo album, Dec 2008.
Fault-tolerant distributed algorithms on VLSI chips
Dagstuhl Seminar 08371 Proceedings, 2008.
A hardware-assisted concurrent & parallel GC algorithm
Tutorial on the Maxwell algorithm (hardware assistance for concurrent and parallel GC) for an external audience. This is a draft for early release to academic collaborators.
VLSI CAD research at Sun Labs
Japan-America Frontiers of Engineering, National Academy of Engineering Symposium, Nov 2008.
High-radix crossbar switches enabled by proximity communication
Proceedings of the 2008 ACM/IEEE conference on Supercomputing, November 2008.
A differential inequalities method for verified solution of IVPs for ODEs using linear programming for the search of tight bounds
13th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic, and Verified Numerical Computing, October 2008.
VLSI tutorial
Tutorial for Dagstuhl Seminar, Fault-tolerant distributed algorithms in VLSI chips, Sept 2008. (Slides.)
TPE: A network of closely coupled computational elements
Sun Microsystems Memo #SML2008-0443, Sep 2008.
Silicon photonic WDM point-to-point network for multi-chip processor interconnects
5th Annual IEEE International Conference on Group IV Photonics (GFP2008), September 2008.
Optical proximity communication in packaged SiPhotonics
5th Annual IEEE International Conference on Group IV Photonics (GFP2008), September 2008.
Synchrony and Asynchrony in VLSI
Tutorial for Dagstuhl Seminar, Fault-tolerant distributed algorithms in VLSI chips, Sept 2008. (Slides.)
Optical interconnects for present and future high-performance computer systems
16th Annual IEEE Symposium on High Performance Interconnects (HOT-I2008), August 2008.
A Mixed Reality Teaching and Learning Environment
This work-in-progress paper describes collaborative research, taking place on three continents, towards creating a 'mixed reality teaching & learning environment' (MiRTLE) that enables teachers and students participating in real-time mixed and online classes to interact with avatar representations of each other. The longer term hypothesis that will be investigated is that avatar representations of teachers and students will help create a sense of shared presence, engendering a sense of community and improving student engagement in online lessons. This paper explores the technology that will underpin such systems by presenting work on the use of a massively multi-user game server, based on Sun's Project Darkstar and Project Wonderland tools, to create a shared teaching environment, illustrating the process by describing the creation of a virtual classroom. We describe the Shanghai NEC eLearning system that will form the platform for the deployment of this work. As these systems will take on an increasingly global reach, we discuss how cross-cultural issues will affect such systems. We conclude by outlining our future plans to test our hypothesis by deploying this technology on a live system with some 15,000 online users.
Ultrascale Nanophotonic Intrachip Communication for high-performance computing systems
Optical Society of America--Integrated Photonics and Nanophotonics Research and Applications (IPNRA2008), July 2008.
Wonderland with kids
CommunityCorner talk at JavaOne, July 2008.
Introducing EclipseLink
The Eclipse Persistence Services Project, more commonly known as EclipseLink, is a comprehensive open source persistence solution. EclipseLink was started by a donation of the full source code and test suites of Oracle's TopLink product. This project brings the experience of over 12 years of commercial usage and feature development to the entire Java community. This evolution into an open source project is now complete and developers will soon have access to the EclipseLink 1.0 release.
The Energy Cost of SSL in Deeply Embedded Systems
As the number of potential applications for tiny, battery-powered, "mote"-like, deeply embedded devices grows, so does the need to simplify and secure interactions with such devices. Embedding a secure web server (capable of HTTP over SSL, aka HTTPS), enables these devices to be monitored and controlled securely via a user-friendly, browser-based interface.
This paper presents the first empirical energy analysis of the Internet's dominant security protocol, SSL, on highly constrained devices. We have enhanced Sizzle, our tiny-footprint HTTPS stack, with energy conserving features and measured its performance on a Telos mote. We show that the key exchange phase, which consumes much more energy than bulk encryption and authentication, amortizes well over the transmission of a few kilobytes of application data. Such amortization is easily attained with features like session reuse and persistent HTTP(S), both of which are supported by Sizzle. The extra energy cost of encrypting and authenticating application data with SSL is around 15%. With the addition of an application-level, duty-cycle based approach to low-power listening for incoming service requests, a pair of alkaline batteries can power Sizzle for over a year under a variety of application scenarios.
Parfait - Designing a Scalable Bug Checker
We present the design of Parfait, a static layered program analysis framework for bug checking, designed for scalability and precision by improving false positive rates and scaling to millions of lines of code. The Parfait framework is inherently parallelizable and makes use of demand-driven analyses.
In this paper we provide an example of several layers of analyses for buffer overflow, summarize our initial implementation for C, and provide preliminary results. Results are quantified in terms of correctly-reported, false positive and false negative rates against the NIST SAMATE synthetic benchmarks for C code.
In Proceedings of the ACM SIGPLAN Static Analysis Workshop, pgs 4-11, 12 June 2008.
Flow Control in Output Buffered Switch with Input Groups
High Performance Switching and Routing Conference (HPSR'08), Shanghai, China
Using Ontologies and Vocabularies for Dynamic Linking
Ontology-based linking offers a solution to some of the problems with static, restricted, and inflexible traditional Web linking. Conceptual hypermedia provides navigation between Web resources, supported by a conceptual model, in which an ontology's definitions and structure, together with the lexical labels, drive the consistency of link provision and the linking's dynamic aspects. Lightweight standard representations make it possible to use existing vocabularies to support Web navigation and browsing. In this way, the navigation and linking of diverse resources (including those not in our control) based on a community understanding of the domain can be consistently managed.
Multi-threading in Electric, a Java VLSI CAD Tool
Java User Group Presentation, Universidad Andres Bello, Chile May 2008.
Gridless wire routing using cost functions and multiple processors
Sun Microsystems memo #SML2008-0232, May 2008.
Project Sun SPOT: A Java Technology-Enabled Platform for Ubiquitous Computing
Technical Session TS-6495, JavaOne, May 2008. [The Networking section starts at 18 min 57 sec and the Security section at 22 min 46 sec into the video.]
Validated method for IVPs for Ordinary Differential Equations based on Chaplygin's inequalities
Standard numerical methods for initial value problems (IVPs) for ordinary differential equations (ODEs) return an approximate solution only. Validated (also called interval) methods for IVPs for ODEs return an approximate solution together with a rigorous enclosure of the true solution. A widely known validated method for IVPs for ODEs is the interval Hermite-Obreschkoff (IHO) method. This method runs into difficulties on stiff ODEs. The method of Chaplygin's inequalities is less well known. However, it might be more suitable for problems like an interval Spice simulator, because electrical circuits are described in Spice by stiff empirical ODEs which are not smooth enough. This memo describes the IHO and Chaplygin validated methods and studies their stability on the simple ODE dy/dt = -y.
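The bracketing idea behind Chaplygin's inequalities, written out for the test equation mentioned above (a sketch under standard scalar comparison-theorem assumptions, not a reproduction of the memo's derivation): for \dot{y} = -y, if \underline{y}(t) and \overline{y}(t) satisfy \dot{\underline{y}} \le -\underline{y}, \dot{\overline{y}} \ge -\overline{y}, and \underline{y}(0) \le y(0) \le \overline{y}(0), then \underline{y}(t) \le y(t) \le \overline{y}(t) for all t \ge 0. Taking bounds that satisfy the inequalities with equality gives the enclosure [\underline{y}(0)\, e^{-t}, \overline{y}(0)\, e^{-t}], which contracts with t, so a well-behaved validated method should not lose accuracy on this test equation.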
An abstract for the SCAN 2008 Symposium on Scientific Computing, Computer Arithmetic and Verified Numerical Computations
Sun Microsystems Memo #SML2008-0186, April 2008.
Exploiting capacitors in high-performance computer systems
4th Annual IEEE Int'l Symposium on VLSI Design, Automation, and Test (VLSI-DAT2008), April 2008.
Sun Small Programmable Object Technology
Sun Labs Open House, Apr 2008.
This presentation makes extensive use of animations which were lost in the process of converting to PDF. Watch the presentation video if you find the PDF slides confusing. The networking and security section starts roughly 33 min 15 sec into the video.
Surveying external Electric users
Sun Microsystems Memo #SML2008-0114, April 2008.
User-Input Dependence Analysis via Graph Reachability
Security vulnerabilities are software bugs that are exploited by an attacker. Systems software is at high risk of exploitation: attackers commonly exploit security vulnerabilities to gain control over a system, remotely, over the internet. Bug-checking tools have been used with fair success in recent years to automatically find bugs in software. However, for finding software bugs that can cause security vulnerabilities, a bug checking tool must determine whether the software bug can be controlled by user-input.
In this paper we introduce a static program analysis for computing user-input dependencies. This analysis is used as a pre-processing filter to our static bug checking tool, currently under development, to identify bugs that can be exploited as security vulnerabilities. Runtime speed and scalability of the user-input dependence analysis are of key importance if the analysis is used for large commercial systems software.
Our user-input dependency analysis takes both data and control dependencies into account. We extend Static Single Assignment (SSA) form by augmenting phi-nodes with the control dependencies of their arguments. A formal definition of user-input dependency is expressed in a dataflow analysis framework as a Meet-Over-all-Paths (MOP) solution. We reduce the equation system to a sparse equation system by exploiting the properties of SSA. The sparse equation system is solved as a reachability problem, which results in a fast algorithm for computing user-input dependencies. We have implemented a call-insensitive and a call-sensitive version of the analysis. The paper compares their efficiency for various systems codes.
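As a rough sketch of the reachability formulation (illustrative only; the names below are ours and not taken from the Parfait implementation), the analysis can be viewed as marking the values produced by user-input sources and propagating the marks forward over a sparse def-use graph whose edges include the control dependencies attached to phi-node arguments:

    import java.util.*;

    // Illustrative sketch: user-input dependence as forward reachability over a
    // sparse def-use graph. Names and structure are ours, not Parfait's.
    final class TaintReachability {
        private final Map<String, List<String>> defUse = new HashMap<>(); // definition -> its uses

        // Add an edge from a definition (including phi-nodes carrying the control
        // dependencies of their arguments) to one of its uses.
        void addDependence(String def, String use) {
            defUse.computeIfAbsent(def, k -> new ArrayList<>()).add(use);
        }

        // Every value reachable from a user-input source is user-input dependent.
        Set<String> userInputDependent(Collection<String> sources) {
            Set<String> tainted = new HashSet<>(sources);
            Deque<String> work = new ArrayDeque<>(sources);
            while (!work.isEmpty()) {
                String def = work.pop();
                for (String use : defUse.getOrDefault(def, Collections.emptyList())) {
                    if (tainted.add(use)) {
                        work.push(use);
                    }
                }
            }
            return tainted;
        }
    }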
Research in Industrial Labs: How Collaboration Aids Innovation (slides)
CAHSI (Computing Alliance of Hispanic-Serving Institutions) Lecture, March 2008.
OpenSPARC: An Open Platform for Hardware Reliability Experimentation
Proceedings, System Effects of Logic Soft Errors (SELSE2008), March 2008.
Research in Industrial Labs: How Collaboration Aids Innovation, (RealMedia video) (QuickTime video)
CAHSI (Computing Alliance of Hispanic-Serving Institutions) Lecture, March 2008.
Usable Security on Sun SPOTs
Lightning Talk, Java Mobile & Embedded Developer Days, Jan 23-24, 2008.
Dynamic Linking of Web Resources: Customisation and Personalisation
Conceptual Open Hypermedia Service (COHSE) provides a framework that integrates a knowledge service and the open hypermedia link service to dynamically link Web documents via knowledge resources (e.g., ontologies or controlled vocabularies). The Web can be considered a closed hypermedia system: links on the Web are unidirectional, embedded, and difficult to author and maintain. With a Semantic Web architecture, COHSE addresses these limitations by dynamically creating multi-headed links on third-party documents by integrating third-party knowledge resources and third-party services. Openness is therefore a key aspect of COHSE. This chapter first presents how the COHSE architecture is reengineered to support customisation and to create an adaptable open hypermedia system where the user explicitly provides information about himself. It then presents how this architecture is deployed in a portal and discusses how this portal architecture can be extended to turn COHSE from an adaptable system into an adaptive system where the system implicitly infers some information about the user.
High-speed and low-energy capacitively-driven wires
IEEE Journal of Solid-State Circuits, Vol. 43, Issue 1, Jan. 2008, pp. 52-60.
A Reinforcement Learning Framework for Online Data Migration in Hierarchical Storage Systems
Multi-tier storage systems are becoming more and more widespread in the industry. They have more tunable parameters and built-in policies than traditional storage systems, and an adequate configuration of these parameters and policies is crucial for achieving high performance. A very important performance indicator for such systems is the response time of the file I/O requests. The response time can be minimized if the most frequently accessed (“hot”) files are located in the fastest storage tiers. Unfortunately, it is impossible to know a priori which files are going to be hot, especially because the file access patterns change over time. This paper presents a policy-based framework for dynamically deciding which files need to be upgraded and which files need to be downgraded based on their recent access pattern and on the system’s current state. The paper also presents a reinforcement learning (RL) algorithm for automatically tuning the file migration policies in order to minimize the average request response time. A multi-tier storage system simulator was used to evaluate the migration policies tuned by RL, and such policies were shown to achieve a significant performance improvement over the best hand-crafted policies found for this domain.
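To make the control loop concrete, here is a deliberately crude stand-in (our construction, not the paper's algorithm, which tunes its policies with reinforcement learning rather than hill-climbing): files whose recent access rate exceeds a promotion threshold are placed in the fast tier, and the threshold is adjusted between epochs in whichever direction last reduced the observed response time.

    // Toy illustration only: a promotion threshold adjusted by comparing average
    // response times between epochs. It merely shows the shape of the feedback
    // between migration decisions and observed performance.
    final class MigrationPolicySketch {
        private double threshold = 10.0;   // accesses per epoch needed to promote a file
        private double step = 1.0;         // current search direction and step size
        private double lastResponseTime = Double.MAX_VALUE;

        // Decide whether a file should live in the fast tier this epoch.
        boolean promote(double recentAccessRate) {
            return recentAccessRate >= threshold;
        }

        // After each epoch: keep moving while response time improves, reverse and
        // shrink the step when it degrades.
        void observeEpoch(double avgResponseTime) {
            if (avgResponseTime > lastResponseTime) {
                step = -step / 2.0;
            }
            threshold = Math.max(0.0, threshold + step);
            lastResponseTime = avgResponseTime;
        }
    }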
In memoriam Martin Rem
Brief presentation at ASYNC 2008.
Lessons from asynchronous design
Presentation to DARPA Science Research Council, October 2007. (Slides.)
Measuring 6D chip alignment in multi-chip packages
6th Annual IEEE Conference on Sensors (SENSORS2007), October 2007. (Slides)
Research challenges for on-chip interconnection networks
IEEE Micro, Vol. 27, Issue 5, September-October 2007, pp. 96-108.
A Gradient-Based Reinforcement Learning Approach to Dynamic Pricing in Partially-Observable Environments
As more companies are beginning to adopt the e-business model, it becomes easier for buyers to compare prices at multiple sellers and choose the one that charges the best price for the same item or service. As a result, the demand for the goods of a particular seller is becoming more unstable, since other sellers are regularly offering discounts that attract large fractions of buyers. Therefore, it becomes more important for each seller to switch from static to dynamic pricing policies that take into account observable characteristics of the current demand and the state of the seller's resources. This paper presents a Reinforcement Learning algorithm that can tune parameters of a seller's dynamic pricing policy in a gradient direction (thus converging to the optimal parameter values that maximize the revenue obtained by the seller) even when the seller's environment is not fully observable. This algorithm is evaluated using a simulated Grid market environment, where customers choose a Grid Service Provider (GSP) to which they want to submit a computing job based on the posted price and expected delay information at each GSP.
Potentials of Group IV Photonics Interconnects for 'Red-shift' Computing Applications
4th Annual IEEE International Conference on Group IV Photonics (GFP2007), September 2007.
Backlog Aware Low Complexity Schedulers for Input Queued Packet Switches
Symposium on High-Performance Interconnects (Hot Interconnects), Stanford University
Multiterabit Switch Fabrics Enabled by Proximity Communication
Symposium on High-Performance Chips (Hot Chips), Stanford University
Multi-terabit switch fabrics enabled by Proximity Communication
19th Annual Hot Chips Symposium, August 2007.
Optics for next-generation computing systems
Optical Society of America IPNRA, July 2007.
Using horizontal displays for distributed and collocated agile planning
Computer-supported environments for agile project planning are often limited by the capability of the hardware to support collaborative work. We present DAP, a tool developed to aid distributed and collocated teams in agile planning meetings. Designed with a multi-client architecture, it works on standard desktop computers and digital tables. Using digital tables, DAP emulates index card based planning without requiring team members to be in the same room.
RAS by the Yard
Proceedings, International Conference on Dependable Systems and Networking (DSN2007), June 2007. (Slides)
Robust energy-efficient adder topologies
18th IEEE Symposium on Computer Arithmetic (ARITH2007), June 2007, pp. 16-28.
CMOS integration of capacitive, optical, and electrical interconnects
10th IEEE Int'l Interconnect Technology Conference (IITC2007), June 2007, pp. 78-80.
PWWFA: Parallel Wave Front Arbiter for Large Switches
High Performance Switching and Routing Conference (HPSR'07), Brooklyn, New York
Introduction and evaluation of Martlet, a scientific workflow language for abstracted parallelisation.
The workflow language Martlet described in this paper implements a new programming model that allows users to write parallel programs and analyse distributed data without having to be aware of the details of the parallelisation. Martlet abstracts the parallelisation of the computation and the splitting of the data through the inclusion of constructs inspired by functional programming. These allow programs to be written as an abstract description that can be adjusted automatically at runtime to match the data set and available resources. Using this model it is possible to write programs to perform complex calculations across a distributed data set, such as Singular Value Decomposition or Least Squares problems, as well as creating an intuitive way of working with distributed systems. Having described and evaluated Martlet against other functional languages for parallel computation, this paper goes on to look at how Martlet might develop. In doing so it covers both possible additions to the language itself, and the use of JIT compilers to increase the range of platforms it is capable of running on.
Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems
This paper addresses the problem of dynamic scheduling of data-intensive multiprocessor jobs. Each job requires some number of CPUs and some amount of data that needs to be downloaded onto a local storage space before starting the job. The completion of each job brings some benefit (utility) to the system, and the goal is to find the optimal scheduling policy that maximizes the average utility per unit of time obtained from all completed jobs. A co-evolutionary solution methodology is proposed, where the utility-based policies for managing local storage and for scheduling jobs onto the available CPUs mutually affect each other's environments, with both policies being adaptively tuned using the Reinforcement Learning methodology. Our simulation results demonstrate the feasibility of this approach and show that it performs better than the best heuristic scheduling policy we could find for this domain.
Balancing Security and Ease-of-Use on the Sun SPOTs
Sun Labs Open House, Apr, 2007.
A platform for wireless networked transducers
As computers, sensors, and wireless communication have become smaller, cheaper, and more sophisticated, wireless transducer platforms have become a focus of research and commercial interest. This report describes an investigation into such platforms. It presents a new taxonomy of transducer systems, describes the construction of prototypes of a new transducer device designed for ease of application development, and discusses commercialization issues.
Electric VLSI design system
Publicly released poster from the Sun Labs Open House, Sun Microsystems Memo #SML2007-0192, April 2007.
Optical transceiver chips based on co-integration of capacitively-coupled proximity interconnects and VCSELs
IEEE Photonics Technology Letters, Vol. 19, Number 7, April 2007, pp. 453-455.
A Reinforcement Learning Approach to Dynamic Resource Allocation
This paper presents a general framework for performing adaptive reconfiguration of a distributed system based on maximizing the long-term business value, defined as the discounted sum of all future rewards and penalties. The problem of dynamic resource allocation among multiple entities sharing a common set of resources is used as an example. A specific architecture (DRA-FRL) is presented, which uses the emerging methodology of reinforcement learning in conjunction with fuzzy rulebases to achieve the desired objective. This architecture can work in the context of existing resource allocation policies and learn the values of the states that the system encounters under these policies. Once the learning process begins to converge, the user can allow the DRA-FRL architecture to make some additional resource allocation decisions or override the ones suggested by the existing policies so as to improve the long-term business value of the system. The DRA-FRL architecture can also be deployed in an environment without any existing resource allocation policies. An implementation of the DRA-FRL architecture in Solaris 10 demonstrated a robust performance improvement in the problem of dynamically migrating CPUs and memory blocks between three resource partitions so as to match the stochastically changing workload in each partition, both in the presence and in the absence of resource migration costs.
Constrained circuit optimization using logical effort
Sun Microsystems memo #SML2007-0336.
Notes on pulse signaling
13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2007), March 2007, pp. 15-24. (Slides)
On-chip samplers for test and debug of asynchronous circuits
13th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2007), March 2007, pp. 153-62. (Slides)
Deleting Files in the Celeste Peer-to-Peer Storage System
Celeste is a robust peer-to-peer object store built on top of a distributed hash table (DHT). Celeste is a working system, developed by Sun Microsystems Laboratories. During the development of Celeste, we faced the challenge of complete object deletion, and moreover, of deleting "files" composed of several different objects. This important problem is not solved by merely deleting meta-data, as there are scenarios in which all file contents must be deleted, e.g., due to a court order. Complete file deletion in a realistic peer-to-peer storage system has not been previously dealt with due to the intricacy of the problem - the system may experience high churn rates, nodes may crash or have intermittent connectivity, and the overlay network may become partitioned at times. We present an algorithm that eventually deletes all file content, data and meta-data, in the aforementioned complex scenarios. The algorithm is fully functional and has been successfully integrated into Celeste.
A configurable asynchronous pseudorandom bit sequence generator
IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC2007), March 2007, pp. 143-152. (Slides)
Open Source and You
The real value of open-source software is the community it fosters.
High-speed and low-energy capacitively-driven wires
Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC2007), February 2007, pp. 412-3.
Circuit techniques to enable 430 Gb/s/mm2 Proximity Communication
Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC2007), February 2007, pp. 368-9.
Comprehensive Multivariate Extrapolation Modeling of Multiprocessor Cache Miss Rates
Cache miss rates are an important subset of system model inputs. Cache miss rate models are used for broad design space exploration in which many cache configurations cannot be simulated directly due to limitations of trace collection setups or available resources. Often it is not practical to simulate large caches. Large processor counts and the consequent potentially high degree of cache sharing are frequently not reproducible on small existing systems. In this article, we present an approach to building multivariate regression models for predicting cache miss rates beyond the range of collectible data. The extrapolation model attempts to accurately estimate the high-level trend of the existing data, which can be extended in a natural way. We extend previous work by its applicability to multiple miss rate components and its ability to model a wide range of cache parameters, including size, line size, associativity and sharing. The stability of extrapolation is recognized to be a crucial requirement. The proposed extrapolation model is shown to be stable to small data perturbations that may be introduced during data collection. We show the effectiveness of the technique by applying it to two commercial workloads. The wide design space contains configurations that are much larger than those for which miss rate data were available. The fitted data match the simulation data very well. The various curves show how a miss rate model is useful not only for estimating the performance of specific configurations, but also for providing insight into miss rate trends.
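As a much simpler instance of the same idea (ours, not the article's multivariate model), a single miss-rate component can be fitted to a power law in cache size by least squares in log space and then extrapolated beyond the largest simulated configuration:

    // Illustrative power-law fit, missRate ~ a * size^(-b), by ordinary least
    // squares on logarithms. A stand-in for the article's multivariate model.
    final class MissRateExtrapolationSketch {
        final double a, b;

        MissRateExtrapolationSketch(double[] sizes, double[] missRates) {
            int n = sizes.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                double x = Math.log(sizes[i]);
                double y = Math.log(missRates[i]);
                sx += x; sy += y; sxx += x * x; sxy += x * y;
            }
            double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double intercept = (sy - slope * sx) / n;
            b = -slope;              // miss rate falls as cache size grows
            a = Math.exp(intercept);
        }

        // Predict the miss rate of a configuration larger than any in the data.
        double predict(double size) {
            return a * Math.pow(size, -b);
        }
    }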
Resource Partitioning in a Java Operating Environment
Managing the partitioning of resources between uncooperating applications is a fundamental requirement of an operating environment. Traditional operating environments only manage low-level resources which presents an impedance mismatch for internet-facing applications with service levels defined in terms of application-level transactions. The Multi-tasking Virtual Machine (MVM) and associated Resource Management API (RM) provide basic mechanisms for managing multiple applications within a Java operating environment. RM separates mechanism and policy and takes the unusual position of delegating rate-based management of resources to the policy level. This report describes the design and implementation of policies that provide flexible resource partitioning among applications and shows their effectiveness using microbenchmarks and an application level benchmark. The latter demonstrates the partitioning of an application-specific resource among a set of application instances using exactly the same policies as used for machine-level resources.
Personalised Dynamic Links on the Web
Links on the Web are unidirectional, embedded, difficult to author and maintain. With a Semantic Web architecture, COHSE (Conceptual Open Hypermedia System) aims to address these limitations by dynamically creating links on the Web. Here we present how this architecture is extended and modified to support customisation and create an adaptable open system by using third party ontologies and services to discover resources on the Web. We then present the deployment of this in a portal and discuss possible extensions to create an adaptive system to dynamically create personalised links.
COHSE: dynamic linking of web resources
This document presents a description of the COHSE collaborative research project between Sun Microsystems Laboratories and the School of Computer Science at the University of Manchester, UK. The purpose of this document is to summarise the project in terms of the work completed and the results achieved. The focus of the project was an application to enable the dynamic creation of hypertext links between documents on the Web, thus the intended audience for this document comprises those members of academic and industrial research groups whose focus includes the Web in general and the Semantic Web and Hypertext in particular.
Software Productivity Research In High Performance Computing
The challenge of utilizing supercomputers effectively at ever increasing scale is not being met, a phenomenon perceived within the high performance computing (HPC) community as a crisis of "productivity." Acknowledging that a narrow focus on peak machine performance numbers has not served HPC goals well in the past, and that the "productivity" of a computing system is not a well-understood phenomenon, the Defense Advanced Research Projects Agency (DARPA) created the High Productivity Computing Systems (HPCS) program. Industry vendors were challenged to develop a new generation of supercomputers that are dramatically (10 times!) more productive, not just faster, and a community of vendor teams and non-vendor research institutions was challenged to develop an understanding of supercomputer productivity that will serve to guide future supercomputer development and to support productivity-based evaluation of computing systems. The HPCS Productivity Team at Sun Microsystems responded by committing to put the investigation of these phenomena on the soundest scientific basis possible, drawing on well-established research methodologies from relevant fields, many of which are unfamiliar within the HPC community.
Conscientious Software
Software needs to grow up and become responsible for itself and its own future by participating in its own installation and customization, maintaining its own health, and adapting itself to new circumstances, new users, and new uses. To create such software will require us to change some of our underlying assumptions about how we write programs. A promising approach seems to be to separate software that does the work (allopoietic) from software that keeps the system alive (autopoietic).
Research in industrial labs: How collaboration aids innovation
Grace Hopper Celebration of Women in Computing, October 2006. (Slides)
Programming the world with sun SPOTs
We describe the Sun Small Programmable Object Technology, or Sun SPOT. The Sun SPOT is a small wireless computing platform that runs Java directly, with no operating system. The system comes with an on-board set of sensors, I/O pins for easy connection to external devices, and supporting software.
Introspection of a Java Virtual Machine under Simulation
Virtual machines are commonly used in commercially-significant systems, for example, Sun Microsystems' Java and Microsoft's .NET. The virtual machine offers many advantages to the system designer and administrator, but complicates the task of workload characterization: it presents an extra abstraction layer between the application and observed hardware effects. Understanding the behavior of the virtual machine is therefore important for all levels of the system architecture.
We have constructed a tool which examines the state of a Sun Java HotSpot virtual machine running inside Virtutech's Simics execution-driven simulator. We can obtain detailed information about the virtual machine and application without disturbing the state of the simulation. For data, we can answer such questions as: Is a given address in the heap? If so, in which object? Of what class? For code, we can map program counter values back to Java methods and approximate Java source line information. Our tool allows us to relate individual events in the simulation, for example, a cache miss, to the higher-level behavior of the application and virtual machine.
In this report, we present the design of our tool, including its capabilities and limitations, and demonstrate its application on the simulation's cache contents and cache misses.
Multithreading in the Electric VLSI design system
Sun Microsystems memo #SML2006-0316, September 2006.
Martlet: A scientific workflow language for abstracted parallelisation.
This paper describes a work-flow language, ‘Martlet’, for the analysis of large quantities of distributed data. This work-flow language is fundamentally different from other languages as it implements a new programming model. Inspired by the inductive constructs of functional programming, this programming model allows it to abstract the complexities of data and processing distribution. This means the user is not required to have any knowledge of the underlying architecture or how to write distributed programs. As well as making distributed resources available to more people, this abstraction also reduces the potential for errors when writing distributed programs. While this abstraction places some restrictions on the user, it is descriptive enough to describe a large class of problems, including algorithms for solving Singular Value Decompositions and Least Squares problems. Currently this language runs on a stand-alone middleware. This middleware can, however, be adapted to run on top of a wide range of existing work-flow engines through the use of JIT compilers capable of producing other work-flow languages at run time. This makes this work applicable to a huge range of computing projects.
Enterprise Mobility
With the proliferation of wireless technologies and business globalization, mobility of people and devices has become inevitable. The experience of mobile computing in different campuses or drop-in offices also faces challenges of starting up applications and tools, synchronizing filesystems, maintaining one's desktop environment, or even finding a printer location or network services. In this document, we discuss different types of mobility, issues with mobility, and why it is important to consider these issues. Finally, this document discusses a network layer solution for IP mobility for continuous connectivity. It also sheds light on future directions of research on mobility that might be interesting for Sun Microsystems.
Dynamic Tuning of Online Data Migration Policies in Hierarchical Storage Systems using Reinforcement Learning*
Multi-tier storage systems are becoming more and more widespread in the industry. In order to minimize the request response time in such systems, the most frequently accessed ("hot") files should be located in the fastest storage tiers (which are usually smaller and more expensive than the other tiers). Unfortunately, it is impossible to know ahead of time which files are going to be "hot", especially because the file access patterns change over time. This report presents a solution approach to this problem, where each tier uses Reinforcement Learning (RL) to learn its own cost function that predicts its future request response time, and the files are then migrated between the tiers so as to decrease the sum of costs of the tiers involved.
A multi-tier storage system simulator was used to evaluate the migration policies tuned by RL, and such policies were shown to achieve a significant performance improvement over the best hand-crafted policies found for this domain.
*This material is based upon work supported by DARPA under Contract No. NBCH3039002.
Data access and analysis with distributed federated data servers in climateprediction.net.
climateprediction.net is a large public resource distributed scientific computing project. Members of the public download and run a full-scale climate model, donate their computing time to a large perturbed physics ensemble experiment to forecast the climate in the 21st century and submit their results back to the project. The amount of data generated is large, consisting of tens of thousands of individual runs each in the order of tens of megabytes. The overall dataset is, therefore, in the order of terabytes. Access and analysis of the data is further complicated by the reliance on donated, distributed, federated data servers. This paper discusses the problems encountered when the data required for even a simple analysis is spread across several servers and how web service technology can be used; how different user interfaces with varying levels of complexity and flexibility can be presented to the application scientists; how using existing web technologies such as HTTP, SOAP, XML, HTML and CGI can engender the reuse of code across interfaces; and how application scientists can be notified of their analysis' progress and results in an asynchronous architecture.
Knowledge-Driven Hyperlinks: Linking in the Wild
Since Ted Nelson coined the term “Hypertext”, there has been extensive research on non-linear documents. With the enormous success of the Web, non-linear documents have become an important part of our daily life activities. However, the underlying hypertext infrastructure of the Web still lacks many features that Hypertext pioneers envisioned. With advances in the Semantic Web, we can address and improve some of these limitations. In this paper, we discuss some of these limitations, developments in Semantic Web technologies and present a system – COHSE – that dynamically links Web pages. We conclude with remarks on future directions for semantics-based linking.
Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS)
IETF RFC 4492, May 2006.
Policy-based Management of a JDBC Connection Pool
Managing the communication between an application server and a back-end database is essential for scalability and crucial for good performance. The standard mechanism uses a variable-sized pool of connections, but typical application servers provide very rudimentary, implementation-centric pool control mechanisms. This requires administrators to manually translate service level specifications into the pool control mechanism, and adjust these as the load or machine configurations change. We describe the use of a resource management framework to automatically control connection pool parameters based on externally supplied policies. This simplifies the connection pool implementation while at the same time allowing a variety of policies to be applied, including policies that automatically adapt to changing circumstances.
The implementations of two distinct policies are discussed and performance measurements are reported for a contemporary synthetic application benchmark.
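A minimal sketch of the mechanism/policy split described above (interface and names are ours, not the framework's API): the pool reports its observable state and a pluggable policy returns the target number of connections, so policies can be swapped or made adaptive without touching the pool implementation.

    // Illustrative separation of pool mechanism from sizing policy; names are
    // invented for this sketch and do not reflect the framework in the report.
    interface PoolSizingPolicy {
        int targetSize(int busy, int idle, int waiters);
    }

    // Example policy: grow when callers are queuing for connections, shrink when
    // many connections sit idle, and always stay within configured bounds.
    final class AdaptiveSizingPolicy implements PoolSizingPolicy {
        private final int min, max;

        AdaptiveSizingPolicy(int min, int max) { this.min = min; this.max = max; }

        @Override
        public int targetSize(int busy, int idle, int waiters) {
            int target = busy + idle;
            if (waiters > 0) {
                target = busy + waiters;          // demand exceeds supply: grow
            } else if (idle > busy) {
                target = busy + busy / 2 + 1;     // mostly idle: shrink
            }
            return Math.min(max, Math.max(min, target));
        }
    }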
Suite B Enablement in TLS: A Report on Interoperability Testing Between Sun, Red Hat and Microsoft
Invited presentation at NIST's 5th Annual PKI R&D Workshop, Apr 5, 2006 (co-presenters: Robert Relyea, Red Hat and Kelvin Yiu, Microsoft).
Scientific middleware for abstracted parallelisation.
In this paper we introduce a class of problems that arise when the analysis of data split into an unknown number of pieces is attempted. Such analysis falls under the definition of Grid computing, but fails to be addressed by the current Grid computing projects, as they do not provide the appropriate abstractions. We then describe a distributed web service based middleware platform, which solves these problems by supporting construction of parallel data analysis functions for datasets with an unknown level of distribution. This analysis is achieved through the combination of Martlet, a workflow language that uses constructs from functional programming to abstract the parallelisation in computations away from the user, and the construction of supporting middleware. To construct such a supporting middleware it is necessary to provide the capability to reason about the data structures held without restricting their nature. Issues covered in the development of this supporting middleware include the ability to handle distributed data transfer and management, function deployment and execution.
Writing Solaris Device Drivers in Java
We present an experimental implementation of the Java Virtual Machine that runs inside the kernel of the Solaris operating system. The implementation was done by porting an existing small, portable JVM, Squawk, into the Solaris kernel. Our first application of this system is to allow device drivers to be written in Java. A simple device driver was ported from C to Java. Characteristics of the Java device driver and our device driver interface are described.
Design Notes for Electric's Network Consistency Checker
This technical report is a collection of the memos written by members of the VLSI Research Group in 2004 and 2005 about Electric's Network Consistency Checker, NCC. Be warned that these memos are unrefined design notes, not polished papers. For the most part, these memos were written to help us think through problems; as such, they may contain errors and conjectures that have not been fully proven. Despite that, we've created this report so that we can share our ideas and collaborate with people outside of Sun.
The Electric VLSI Design System is an open source electronic design automation system used by the VLSI Research Group to create layout and schematics for integrated circuits.
An asynchronous high-throughput control circuit for proximity communication
12th International Symposium on Asynchronous Circuits and Systems, March 2006. (Slides)
Yes, There is an "Expertise Gap" in HPC Applications Development
Third Workshop on Productivity and Performance in High-End Computing (P-PHEC), 12 February 2006, Austin, Texas
Abstract:
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity increase in High Performance Computing (HPC), where productivity is understood to be a composite of system performance, system robustness, programmability, portability, and administrative concerns. Of these, programmability is the least well understood and perceived to be the most problematic. It has been suggested that an "expertise gap" is at the heart of the problem in HPC application development. Preliminary results from research conducted by Sun Microsystems and other participants in the HPCS program confirm that such an "expertise gap" does exist and does exert a significant confounding influence on HPC application development. Further, the nature of the "expertise gap" appears not to be amenable to previously proposed solutions such as "more education" and "more people." A productivity improvement of the scale sought by the HPCS program will require fundamental transformations in the way HPC applications are developed and maintained.
Circuits without clocks: What makes them tick?
Invited talk, Canadian Undergraduate Technology Conference, January 2006. (Slides)
A Dynamic-Sized Nonblocking Work Stealing Deque
The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton [2] (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based double-ended queues (deques) with low cost synchronization among local and stealing processes. Unfortunately, the algorithm's synchronization protocol is strongly based on the use of fixed size arrays, which are prone to overflows, especially in the multiprogrammed environments for which they are designed. This is a significant drawback since, apart from memory inefficiency, it means that the size of the deque must be tailored to accommodate the effects of the hard-to-predict level of multiprogramming, and the implementation must include an expensive and application-specific overflow mechanism.
This paper presents the first dynamic memory work-stealing algorithm. It is based on a novel way of building non-blocking dynamic-sized work stealing deques by detecting synchronization conflicts based on "pointer-crossing" rather than "gaps between indexes" as in the original ABP algorithm. As we show, the new algorithm dramatically increases robustness and memory efficiency, while causing applications no observable performance penalty. We therefore believe it can replace array-based ABP work stealing deques, eliminating the need for application-specific overflow mechanisms.
*This work was conducted while Yossi Lev was a student at Tel Aviv University, and is derived from his MS thesis [1].
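For readers unfamiliar with the underlying structure, here is a heavily simplified, fixed-size ABP-style deque (our sketch, not the paper's dynamic-sized algorithm; it omits the overflow handling and the tag/ABA machinery that motivate the paper, and glosses over memory-ordering details):

    import java.util.concurrent.atomic.AtomicInteger;

    // Heavily simplified, fixed-size work-stealing deque in the ABP style.
    // The owner pushes and pops at the bottom; thieves steal from the top with a
    // CAS. This sketch has no growth or overflow handling and only shows the shape.
    final class SimpleWorkStealingDeque<T> {
        private final Object[] tasks;
        private volatile int bottom = 0;                         // written only by the owner
        private final AtomicInteger top = new AtomicInteger(0);  // thieves CAS here

        SimpleWorkStealingDeque(int capacity) { tasks = new Object[capacity]; }

        // Owner only: push a task at the bottom end.
        void pushBottom(T task) {
            if (bottom - top.get() >= tasks.length) {
                throw new IllegalStateException("full (this sketch cannot grow)");
            }
            tasks[bottom % tasks.length] = task;
            bottom = bottom + 1;
        }

        // Owner only: pop from the bottom; races with thieves only on the last task.
        @SuppressWarnings("unchecked")
        T popBottom() {
            int b = bottom - 1;
            bottom = b;
            int t = top.get();
            if (b < t) { bottom = t; return null; }              // deque was empty
            T task = (T) tasks[b % tasks.length];
            if (b > t) { return task; }                          // at least two tasks remained
            boolean won = top.compareAndSet(t, t + 1);           // compete for the last task
            bottom = t + 1;
            return won ? task : null;
        }

        // Thieves: steal from the top end with a CAS; null means retry or give up.
        @SuppressWarnings("unchecked")
        T steal() {
            int t = top.get();
            if (t >= bottom) { return null; }                    // appears empty
            T task = (T) tasks[t % tasks.length];
            return top.compareAndSet(t, t + 1) ? task : null;
        }
    }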
An Overview of the Singularity Project
Singularity is a research project in Microsoft Research that started with the question: what would a software platform look like if it was designed from scratch with the primary goal of dependability? Singularity is working to answer this question by building on advances in programming languages and tools to develop a new system architecture and operating system (named Singularity), with the aim of producing a more robust and dependable software platform. Singularity demonstrates the practicality of new technologies and architectural decisions, which should lead to the construction of more robust and dependable systems.
Improving display speed in Electric (TM)
Sun Microsystems Memo #SML2005-0523, October 2005.
A Reinforcement Learning Approach to Dynamic Resource Allocation
This paper presents a general framework for performing reconfiguration of a distributed system based on maximizing the long-term business value, defined as the discounted sum of all future rewards and penalties. The problem of dynamic resource allocation among multiple entities sharing a common set of resources is used as an example.
A specific architecture (DRA-FRL) is presented, which uses the emerging methodology of reinforcement learning in conjunction with fuzzy rulebases to achieve the desired objective. This architecture can work in the context of existing resource allocation policies and learn the values of the states that the system encounters under these policies. Once the learning process begins to converge, the user can allow the DRA-FRL architecture to make some additional resource allocation decisions or override the ones suggested by the existing policies so as to improve the long-term business value of the system. The DRA-FRL architecture can also be deployed in an environment without any existing resource allocation policies.
An implementation of the DRA-FRL architecture in Solaris™ 10 demonstrated a robust performance improvement in the problem of dynamically migrating CPUs and memory blocks between three resource partitions so as to match the stochastically changing workload in each partition, both in the presence and in the absence of resource migration costs.
*This material is based upon work supported by DARPA under Contract No. NBCH3039002.
Challenges in Building a Flat-Bandwidth Memory Hierarchy for a Large-Scale Computer with Proximity Communication
13th Annual IEEE Symposium on High Performance Interconnects, August 2005. (Slides)
Multi-Tier Checkpointing for Peta-Scale Systems
Proceedings, International Conference on Dependable Systems and Networking (DSN2005), July 2005.
Modeling Coordinated Checkpointing for Large-Scale Supercomputers
Proceedings, International Conference on Dependable Systems and Networking (DSN2005), July 2005.
Sizzle: A Standards-based End-to-End Security Architecture for the Embedded Internet
According to popular perception, public-key cryptography is beyond the capabilities of highly constrained, "mote"-like, embedded devices. We show that elliptic curve cryptography not only makes public-key cryptography feasible on these devices, it allows one to create a complete secure web server stack that runs efficiently within very tight resource constraints. Our small-footprint HTTPS stack, nicknamed Sizzle, has been implemented on multiple generations of the Berkeley/Crossbow motes where it runs in less than 4KB of RAM, completes a full SSL handshake in 1 second (session reuse takes 0.5 seconds) and transfers 1 KB of application data over SSL in 0.4 seconds. Sizzle is the world's smallest secure web server and can be embedded inside home appliances, personal medical devices, etc., allowing them to be monitored and controlled remotely via a web browser without sacrificing end-to-end security.
This report is an extended version of a paper that received the 'Mark Weiser Best Paper Award' at the Third IEEE International Conference on Pervasive Computing and Communications (PerCom), Hawaii, March 2005.
Can Software Engineering Solve the HPCS Problem?
Second International Workshop on Software Engineering for High Performance Computing System Applications, St. Louis, Missouri, May 15, 2005
Abstract:
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity improvement. Software Engineering has addressed this goal in other domains and identified many important principles that, when aligned with hardware and computer science technologies, do make dramatic improvements in productivity. Do these principles work for the HPC domain?
This case study collects data on the potential benefits of perfective maintenance in which human productivity (programmability, readability, verifiability, maintainability) is paramount. An HPC professional rewrote four FORTRAN77/MPI benchmarks in Fortran 90, removing optimizations (many improving distributed memory performance) and emphasizing clarity.
The code shrank by 5-10x and is significantly easier to read and relate to specifications. Run time performance slowed by about 2x. More studies are needed to confirm that the resulting code is easy to maintain and that the lost performance can be recovered with compiler optimization technologies, run time management techniques and scalable shared memory hardware.
HPC Needs a Tool Strategy
Second International Workshop on Software Engineering for High Performance Computing System Applications, St. Louis, Missouri, May 15, 2005
Abstract:
The High Productivity Computing Systems (HPCS) program seeks a tenfold productivity increase in High Performance Computing (HPC). A change of this magnitude in software development and maintenance demands a transformation similar to other great leaps in industrial productivity. By analogy, this requires a dramatic change to the "infrastructure" and to the way software developers use it. Software tools such as compilers, libraries, debuggers and analyzers constitute an essential part of the HPC infrastructure, without which codes cannot be efficiently developed nor production runs accomplished.
The underappreciated "HPC software infrastructure" is not up to the task and is becoming less so in the face of increasing scale, complexity, and mission importance. Infrastructure dependencies are seen as significant risks to success, and significant productivity gains remain unrealized. Support models for this infrastructure are not aligned with its strategic value.
To achieve the potential of the software infrastructure, both for stability and for productivity breakthroughs, a dedicated, long-term, client-focused support structure must be established. Goals for tools in the infrastructure would include ubiquity, portability, and longevity commensurate with the projects they support, typically decades. The strategic value of such an infrastructure necessarily transcends individual projects, laboratories, and organizations.
Electric, a VLSI CAD framework using Java technology
Invited talk, Dept. of Computer Science, Catholic University, Santiago, Chile, May 2005. (Slides)
Secure Adhoc Communication
Technical overview of the project.
Innovation Happens Elsewhere: Open Source as Business Strategy
It's a plain fact: regardless of how smart, creative, and innovative your organization is, there are more smart, creative, and innovative people outside your organization than inside. Open source offers the possibility of bringing more innovation into your business by building a creative community that reaches beyond the barriers of the business. The key is developing a web-driven community where new types of collaboration and creativity can flourish. Since 1998 Ron Goldman and Richard Gabriel have been helping groups at Sun Microsystems understand open source and advising them on how to build successful communities around open source projects. In this book the authors present lessons learned from their own experiences with open source, as well as those from other well-known projects such as Linux, Apache, and Mozilla.
Security Issues in Wireless Sensor Networks
Invited presentation at the 10th FBI Information Technology Study Group Workshop, Apr 21, 2005.
Technology Scaling and the Future of Interconnect
Invited talk, 7th IEEE Int'l Workshop on System Level Interconnect Prediction, April 2005. (Slides)
GasP control for domino circuits
IEEE International Symposium on Asynchronous Circuits and Systems, March 2005, pp. 12-22. (Slides)
Proximity communication and time
IEEE International Symposium on Asynchronous Circuits and Systems, March 2005, pp. xii. (Slides)
A Reinforcement Learning Framework for Utility-Based Scheduling in Resource-Constrained Systems
This paper presents a general methodology for scheduling jobs in soft real-time systems, where the utility of completing each job decreases over time. This scheduling problem is known to be NP-hard, requiring a heuristic solution to operate in real-time. We present a utility-based framework for making repeated scheduling decisions based on dynamically observed information about unscheduled jobs and system's resources. This framework generalizes the standard scheduling problem to a resource-constrained environment, where resource allocation (RA) decisions (how many CPUs to allocate to each job) have to be made concurrently with the scheduling decisions (when to execute each job). We then use the discrete-time Optimal Control theory to formulate the optimization problem of finding the scheduling/RA policy that maximizes the average utility per time step obtained from completed jobs. We propose a Reinforcement Learning (RL) architecture for solving the NP-hard Optimal Control problem in real-time, and our experimental results demonstrate the feasibility and benefits of the proposed approach.
A Cryptographic Processor for Arbitrary Elliptic Curves over GF(2^m)
International Journal of Embedded Systems, Feb. 2005. Extended version of the paper that won the Best Paper award at IEEE ASAP 2003.
The use of capability descriptions in a wireless transducer network
This document presents the requirements for a language to describe the capabilities of a transducer in a wireless transducer network (WTN). It provides a survey of existing technologies in this field and concludes with a framework in which the capabilities of a transducer can be employed to assist users in the configuration of a WTN. The intended audience for this paper comprises members of academic and industrial research groups whose focus is networked devices, such as those used in wireless sensor networks.
An object-aware memory architecture
Despite its dominance, object-oriented computation has received scant attention from the architecture community. We propose a novel memory architecture that supports objects and garbage collection (GC). Our architecture is co-designed with a Java Virtual Machine to improve the functionality and efficiency of heap memory management. The architecture is based on an address space for objects accessed using object IDs mapped by a translator to physical addresses. To support this, the system includes object-addressed caches, a hardware GC barrier to allow in-cache GC of objects, and an exposed cache structure cooperatively managed by the JVM. These extend a conventional architecture, without compromising compatibility or performance for legacy binaries.
Our innovations enable various improvements such as: a novel technique for parallel and concurrent garbage collection, without requiring any global synchronization; an in-cache garbage collector, which never accesses main memory; concurrent compaction of objects; and elimination of most GC store barrier overhead. We compare the behavior of our system against that of a conventional generational garbage collector, both with and without an explicit allocate-in-cache operation. Explicit allocation eliminates many write misses; our scheme additionally trades L2 misses for in-cache operations, and provides the mapping indirection required for concurrent compaction.
Sizzle -- SSL on Motes
Invited presentation at U.C. Berkeley's CENTS Retreat, Tahoe, Jan. 2005.
Experiments in Wireless Internet Security
in Statistical Methods in Computer Security, William W. S. Chen, (Editor), Dekker/CRC Press, pp. 33-47.
An 8-Gb/s/pin simultaneously bidirectional transceiver in 0.35-um CMOS
IEEE Journal of Solid-State Circuits, Vol. 39, Issue 11, Nov. 2004, pp. 1894-1908.
Partitioning of Code for a Massively Parallel Machine
Code partitioning is the problem of dividing sections of code among a set of processors for execution in parallel taking into account the communication overhead between the processors. Code partitioning of large amounts of code onto numerous processors requires variations to the classical partitioning algorithms, in part due to the memory and time requirements to partition a large set of data, but also due to the nature of the target machine and multiple constraints imposed by its architectural features.
In this paper, we present our experience in the design of enhancements to the classical multi-level k-way partitioning algorithm to deal with large graphs of over 1 million nodes, 5 constraints, and nodes of irregular size. Our algorithm was implemented to produce code for a massively parallel machine of up to 40,000 processors, and forms part of a hardware description language compiler. The algorithm and the compiler were tested on RTL designs for a next generation SPARC® processor. We present performance results and comparisons for partitioning multi-processor hardware designs.
New division algorithms by digit recurrence
38th Asilomar Conference on Signals, Systems, and Computers, Nov. 2004, pp. 1849-55. (Slides)
Circuits without clocks: What makes them tick?
Invited talk, System-on-Chip Conference (SOC2004), November 2004. (Slides)
Garbage-first garbage collection
Garbage-First is a server-style garbage collector, targeted for multi-processors with large memories, that meets a soft real-time goal with high probability, while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with mutation, to prevent interruptions proportional to heap or live-data size. Concurrent marking both provides collection "completeness" and identifies regions ripe for reclamation via compacting evacuation. This evacuation is performed in parallel on multiprocessors, to increase throughput.
Proximity Communication
IEEE Journal of Solid-State Circuits, Vol. 39, Number 9, September 2004, pp. 1529-36.
A Comparative Study of Persistence Mechanisms for the Java™ Platform
Access to persistent data is a requirement for the majority of computer applications. The Java programming language and associated run-time environment provide excellent features for the construction of reliable and robust applications, but currently these do not extend to the domain of persistent data. Many mechanisms for managing persistent data have been proposed, some of which are now included in the standard Java platforms, e.g., J2SE™ and J2EE™.
This paper defines a set of criteria by which persistence mechanisms may be compared and then applies the criteria to a representative set of widely used mechanisms. The criteria are evaluated in the context of a widely-known benchmark, which was ported to each of the mechanisms, and include performance and scalability results.
Maintaining Object Ordering in a Shared P2P Storage Environment
Modern peer-to-peer (P2P) storage systems have evolved to provide solutions to a variety of burning storage problems. While the first generation provided rather informal file sharing, more recent approaches provide more extensive security, sharing, and archive capabilities.
To be considered a viable storage solution the system must exhibit high availability and data persistence characteristics. In an attempt to provide these, most systems assume a continuously connected and available underlying communication infrastructure. But this is not necessarily the case because equipment failures, denial of service attacks, and just poor (yet common) corporate network design may cause discontinuities and interruptions in the communication service. Any proposed storage solution needs to address such issues transparently.
Storage archival systems can live with discontinuities, as long as the stored data can be uniquely identified. Continuous update systems that allow updating data by multiple writers have harder problems to overcome since the ordering of updates needs to be maintained independently of connectivity conditions. In this paper, we propose a solution for maintaining the ordering even under severe connectivity disruptions, allowing the system to continue functioning while connectivity is disrupted, and to recover from the disruption smoothly when connectivity is restored.
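The paper's ordering mechanism is not reproduced here, but as a familiar point of reference, version vectors are one standard way for replicas that reconnect after a partition to decide whether two updates are ordered or concurrent (and therefore need reconciliation):

    import java.util.HashMap;
    import java.util.Map;

    // Standard version-vector sketch, shown only as background on update ordering
    // across partitions; it is not the mechanism proposed in the paper.
    final class VersionVector {
        private final Map<String, Long> counters = new HashMap<>();

        // Record a local update made by the given replica.
        void recordUpdate(String replicaId) {
            counters.merge(replicaId, 1L, Long::sum);
        }

        // True if this vector is dominated by the other (every entry <= other's).
        boolean happenedBefore(VersionVector other) {
            for (Map.Entry<String, Long> e : counters.entrySet()) {
                if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                    return false;
                }
            }
            return !counters.equals(other.counters);
        }

        // Neither dominates: the updates were concurrent and must be reconciled.
        boolean concurrentWith(VersionVector other) {
            return !happenedBefore(other) && !other.happenedBefore(this);
        }

        // On synchronization, merge by taking the element-wise maximum.
        void merge(VersionVector other) {
            other.counters.forEach((id, c) -> counters.merge(id, c, Math::max));
        }
    }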
Accelerating Next-Generation Public-key Cryptography on General-Purpose CPUs
Hot Chips 16, Aug. 2004. Selected as one of the Best Papers.
Using gated experts in fault diagnosis and prognosis
Three individual experts have been developed, based on extended auto-associative neural networks (E-AANN), Kohonen self-organizing maps (KSOM), and radial basis function based clustering (RBFC) algorithms. An integrated method is then proposed to combine the set of individual experts, managed by a gated experts algorithm which assigns the experts based on their best performance regions. We have used a Matlab Simulink model of a chiller system and applied the individual experts and the integrated method to detect and recover sensor errors. It has been shown that the integrated method achieves better performance in diagnostics and prognostics than each individual expert.
Grid style web services for climateprediction.net.
In this paper we describe an architecture that implements call and pass by reference using asynchronous Web Services. This architecture provides a distributed data analysis environment in which functions can be dynamically described and used.
Challenges and potentials for multiterabit-per-second optical transceivers
Digest of the LEOS Summer Topical Meetings, Biophotonics/Optical Interconnects and VLSI Photonics/WMB Microcavities, June 2004, pp. 28-30.
Scaling J2EE™ Application Servers with the Multi-Tasking Virtual Machine
The Java 2 Platform, Enterprise Edition (J2EE) is established as the standard platform for hosting enterprise applications written in the Java programming language. Similar to an operating system, a J2EE server can host multiple applications, but this is rarely seen in practice due to limitations on scalability, weak inter-application isolation and inadequate resource management facilities in the underlying Java platform. This leads to a proliferation of server instances, each typically hosting a single application, with a consequent dramatic increase in the total memory footprint and more complex system administration. The Multi-tasking Virtual Machine (MVM) solves this problem by providing an efficient and scalable implementation of the isolate API for multiple, isolated tasks, enabling the co-location of multiple server instances in a single MVM process. Isolates also enable the restructuring of a J2EE server implementation as a collection of isolated components, offering increased flexibility and reliability. The resulting system is a step towards a complete and scalable operating environment for enterprise applications.
Transistor sizing: how to control the speed and energy consumption of a circuit
IEEE International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 51-61. (Slides)
Long Wires and Asynchronous Control
Digest of Technical Papers, IEEE International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 240-9. (Slides)
A fast and energy-efficient stack
IEEE International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 7-16. (Slides)
Supporting Per-processor Local-allocation Buffers Using Multi-processor Restartable Critical Sections
One challenge for runtime systems like the Java™ platform that depend on garbage collection is the ability to scale performance with the number of allocating threads. As the number of such threads grows, allocation of memory in the heap becomes a point of contention. To relieve this contention, many collectors allow threads to preallocate blocks of memory from the shared heap. These per-thread local-allocation buffers (LABs) allow threads to allocate most objects without any need for further synchronization. As the number of threads exceeds the number of processors, however, the cost of committing memory to local-allocation buffers becomes a challenge and sophisticated LAB-sizing policies must be employed.
To reduce this complexity, we implement support for local-allocation buffers associated with processors instead of threads using multiprocessor restartable critical sections (MP-RCSs). MP-RCSs allow threads to manipulate processor-local data safely. To support processor-specific transactions in dynamically generated code, we have developed a novel mechanism for implementing these critical sections that is efficient, allows preemption notification at known points in a given critical section, and does not require explicit registration of the critical sections. Finally, we analyze the performance of per-processor LABs and show that, for highly threaded applications, this approach performs better than per-thread LABs, and allows for simpler LAB-sizing policies.
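The sketch below illustrates the basic LAB discipline in plain Java over a simulated heap of addresses: bump-pointer allocation inside a buffer needs no synchronization, and only refilling touches the shared heap. The names are hypothetical, and the per-processor variant's restartable critical sections have no direct Java counterpart.

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of local-allocation-buffer (LAB) bump-pointer allocation over a
    // simulated heap of addresses (this illustrates the per-thread variant).
    class LabAllocatorSketch {
        private final AtomicLong sharedHeapTop = new AtomicLong(0);
        private final long labSize;

        LabAllocatorSketch(long labSize) { this.labSize = labSize; }

        static final class Lab {
            long cursor, limit;   // current bump pointer and end of the buffer
        }

        // Allocate 'bytes' (assumed <= labSize) from the caller's LAB,
        // refilling from the shared heap only when the buffer is exhausted;
        // the refill is the only step that contends with other threads.
        long allocate(Lab lab, long bytes) {
            if (lab.cursor + bytes > lab.limit) {
                long start = sharedHeapTop.getAndAdd(labSize); // contended refill
                lab.cursor = start;
                lab.limit = start + labSize;
            }
            long addr = lab.cursor;
            lab.cursor += bytes;   // synchronization-free bump allocation
            return addr;
        }
    }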
Supernets and snHubs: A Foundation for Public Utility Computing
The notion of procuring computer services from a utility, much the way we get water and electricity and phone service, is not new. The idea at the center of the public utility trend in computer services is to allow firms to focus less on administering and supporting their information technology and more on running their business. Supernets and their implementation as hardware devices (snHubs) are our approach to make networks part of the public utility computing (PUC) infrastructure. The infrastructure is a key to integrating and enabling such "remote access" constituencies as B2B, out-sourcing vendors, and workers who telecommute in a safe and scalable manner. We have designed, developed, and deployed a prototype whose viability is now being demonstrated by a small deployment throughout Sun Microsystems.
Electronic Alignment for Proximity Communication
Digest of Technical Papers, IEEE International Solid-State Circuits Conference, February 2004, pp. 144-5.
Shedding Light on the Hidden Web
The terms Hidden Web, Deep Web and Invisible Web describe those resources on the Web that are in some way unreachable by search engines, and are potentially unusable to other Web systems such as annotation services. These hidden resources make up a significant part of the current Web. We provide firm definitions of the ways in which information can be "hidden", and discuss the challenges that face those working with annotation in the Hidden Web. We do not attempt to provide solutions for these challenges, but a clarification of the terms involved is certainly a step in the right direction.
Circuits without a clock: what makes them tick?
Keynote presentation at the Int'l Conference on Principles of Distributed Systems, December 2003. (Slides)
Logical effort of carry propagate adders
37th Asilomar Conference on Signals, Systems, and Computers, November 2003, pp. 873-878.
Design of JFluid: A Profiling Technology and Tool Based on Dynamic Bytecode Instrumentation
Instrumentation-based profiling has many advantages and one serious disadvantage: usually high performance overhead. This overhead can be substantially reduced if only a small part of the target application (for example, one that has previously been identified as a performance bottleneck) is instrumented, while the rest of the application code runs at full speed. Such an approach can also mitigate scalability issues caused by a high volume of profiling information generated by instrumented code running on behalf of multiple threads. The value of such a profiling technology would increase further if the code could be instrumented and de-instrumented as many times as needed at run time.
In this report we describe in detail the design of an experimental profiling system called JFluid, which includes a modified Java HotSpot™ VM and a GUI tool, and addresses both of the above issues. Our JVM™ supports arbitrary on-the-fly modifications to running Java methods, and can connect with a profiling tool at any moment, without any startup time preparation. Our tool collects, processes and presents profiling data on-line. To perform CPU profiling, it instruments a group of methods defined as an arbitrary "root" method plus all methods that it calls (a call subgraph). It appears that static determination of all methods in a call subgraph is difficult in the presence of virtual methods, but fortunately, with dynamic code hotswapping available, two schemes of dynamic call subgraph revelation and instrumentation can be suggested.
Measurements that we obtained when performing full and partial program profiling using both schemes show that the overhead can be reduced substantially using this technique, and that one of the schemes generally results in a smaller number of instrumented methods and better performance, especially for large applications.
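The effect of such instrumentation can be pictured as injected entry/exit calls around a method body. The hand-written example below illustrates only that shape; JFluid injects equivalent bytecode at run time, and the ProfilerRuntime class here is hypothetical, not the tool's API.

    // Hand-written illustration of what injected entry/exit instrumentation
    // amounts to (hypothetical names, not JFluid's actual runtime classes).
    class ProfilerRuntime {
        private static final ThreadLocal<java.util.ArrayDeque<Long>> stack =
                ThreadLocal.withInitial(java.util.ArrayDeque::new);

        static void methodEntry(String method) {
            stack.get().push(System.nanoTime());
        }

        static void methodExit(String method) {
            long elapsed = System.nanoTime() - stack.get().pop();
            System.out.printf("%s took %d ns%n", method, elapsed);
        }
    }

    class InstrumentedExample {
        // Original method body wrapped by the calls the instrumenter would inject.
        static int rootMethod(int n) {
            ProfilerRuntime.methodEntry("rootMethod");
            try {
                int sum = 0;
                for (int i = 0; i < n; i++) sum += i;
                return sum;
            } finally {
                ProfilerRuntime.methodExit("rootMethod");
            }
        }
    }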
Securing the Web with the Next Generation Public-Key Cryptosystem
Stanford Networking Research Center (SNRC) industry seminar
Sketchpad: A man-machine graphical communication system (Archival reprint edition of Sutherland's 1963 MIT Ph.D. Thesis, with a new foreword by A. Blackwell and K. Rodden)
University of Cambridge Technical Report UCAM-CL-TR-574, September 2003.
Proximity Communication
Proceedings, IEEE Custom Integrated Circuits Conference, September 2003, pp. 469-472.
Co-evolutionary perception-based reinforcement learning for sensor allocation in autonomous vehicles
In this paper we study the problem of sensor allocation in Unmanned Aerial Vehicles (UAVs). Each UAV uses perception-based rules for generalizing decision strategy across similar states and reinforcement learning for adapting these rules to the uncertain, dynamic environment. A big challenge for reinforcement learning algorithms in this problem is that UAVs need to learn two complementary policies: how to allocate their individual sensors to appearing targets and how to distribute themselves as a team in space to match the density and importance of targets underneath. We address this problem using a co-evolutionary approach, where the policies are learned separately, but they use a common reward function. The applicability of our approach to the UAV domain is verified using a high-fidelity robotic simulator. Based on our results, we believe that the co-evolutionary reinforcement learning approach to reducing dimensionality of the action space presented in this paper is general enough to be applicable to many other multi-objective optimization problems, particularly those that involve a tradeoff between individual optimality and team-level optimality.
Inductive Learning for Fault Diagnosis
There is a steadily increasing need for autonomous systems that must be able to function with minimal human intervention to detect and isolate faults, and recover from such faults. In this paper we present a novel hybrid Model based and Data Clustering (MDC) architecture for fault monitoring and diagnosis, which is suitable for complex dynamic systems with continuous and discrete variables. The MDC approach allows for adaptation of both structure and parameters of identified models using supervised and reinforcement learning techniques. The MDC approach will be illustrated using the model and data from the Hybrid Combustion Facility (HCF) at the NASA Ames Research Center.
Circuits without a clock: what makes them tick?
Sun ONEDay 03 presentation, June 2003. (Slides)
A 10-mW 3.6-Gbps I/O transmitter
IEEE Symposium on VLSI Circuits, June 2003, pp. 97-98.
Securing the Web with Next Generation Cryptographic Technologies
Internetworking 2003, San Jose, Jun. 2003.
A Cryptographic Processor for Arbitrary Elliptic Curves over GF(2^m)
We describe a cryptographic processor for Elliptic Curve Cryptography (ECC). ECC is evolving as an attractive alternative to other public-key cryptosystems such as the Rivest-Shamir-Adleman algorithm (RSA) by offering the smallest key size and the highest strength per bit. The cryptographic processor performs point multiplication for elliptic curves over binary polynomial fields GF(2^m). In contrast to other designs that only support one curve at a time, our processor is capable of handling arbitrary curves without requiring reconfiguration. More specifically, it can handle both named curves as standardized by the National Institute for Standards and Technology (NIST) as well as any other generic curves up to a field degree of 255. Efficient support for arbitrary curves is particularly important for the targeted server applications that need to handle requests for secure connections generated by a multitude of heterogeneous client devices. Such requests may specify curves which are infrequently used or not even known at implementation time.
We have implemented the cryptographic processor in a field-programmable gate array (FPGA) running at a clock frequency of 66.4 MHz. Its performance is 6955 point multiplications per second for named curves over GF(2^163) and 3308 point multiplications per second for generic curves over GF(2^163). We have integrated the cryptographic processor into the open source toolkit OpenSSL, which implements the Secure Sockets Layer (SSL), today's dominant Internet security protocol.
This report is an extended version of a paper presented at the IEEE 14th International Conference on Application-specific Systems, Architectures and Processors, The Hague, June 2003 where it received the "Best Paper Award".
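For readers unfamiliar with point multiplication, the sketch below shows the classic double-and-add loop. For brevity it is written over a toy prime-field curve using BigInteger; the processor described above operates over binary polynomial fields GF(2^m), where the point formulas differ, and the curve parameters here are illustrative only.

    import java.math.BigInteger;

    // Minimal double-and-add scalar multiplication over a toy prime-field
    // curve y^2 = x^3 + 2x + 3 (mod 97), using affine coordinates.
    final class EcSketch {
        static final BigInteger P = BigInteger.valueOf(97);   // toy prime modulus
        static final BigInteger A = BigInteger.valueOf(2);    // curve coefficient a
        static final BigInteger[] INFINITY = null;            // point at infinity

        static BigInteger[] add(BigInteger[] p1, BigInteger[] p2) {
            if (p1 == INFINITY) return p2;
            if (p2 == INFINITY) return p1;
            BigInteger x1 = p1[0], y1 = p1[1], x2 = p2[0], y2 = p2[1];
            BigInteger lambda;
            if (x1.equals(x2)) {
                if (y1.add(y2).mod(P).signum() == 0) return INFINITY;   // P + (-P)
                // Doubling: lambda = (3*x1^2 + a) / (2*y1)
                lambda = x1.pow(2).multiply(BigInteger.valueOf(3)).add(A)
                           .multiply(y1.shiftLeft(1).modInverse(P)).mod(P);
            } else {
                // Addition: lambda = (y2 - y1) / (x2 - x1)
                lambda = y2.subtract(y1)
                           .multiply(x2.subtract(x1).modInverse(P)).mod(P);
            }
            BigInteger x3 = lambda.pow(2).subtract(x1).subtract(x2).mod(P);
            BigInteger y3 = lambda.multiply(x1.subtract(x3)).subtract(y1).mod(P);
            return new BigInteger[] { x3, y3 };
        }

        // Scalar multiplication k*G by the classic double-and-add loop.
        static BigInteger[] multiply(BigInteger k, BigInteger[] g) {
            BigInteger[] result = INFINITY, addend = g;
            for (int i = 0; i < k.bitLength(); i++) {
                if (k.testBit(i)) result = add(result, addend);
                addend = add(addend, addend);
            }
            return result;
        }
    }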
Congestion and starvation detection in ripple FIFOs
IEEE International Symposium on Asynchronous Circuits and Systems, May 2003, pp. 36-45. (Slides)
Project JXTA: A Loosely-Consistent DHT Rendezvous Walker
The open-source community Project JXTA defines an open set of standard protocols for ad hoc, pervasive, peer-to-peer (P2P) computing as a common platform for developing a wide variety of decentralized network applications. The following paper describes a loosely-consistent DHT walker approach for searching advertisements and routing queries in the JXTA rendezvous network. The loosely-consistent DHT walker uses a hybrid approach that combines the use of a DHT to index and locate contents with a limited-range walker to resolve inconsistency of the DHT within the dynamic rendezvous network. This proposed DHT approach does not require maintaining consistency across the rendezvous network or a stable super-peer infrastructure, and is well adapted to ad hoc P2P networks with high peer churn rates.
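A rough sketch of the hybrid lookup idea, with illustrative names rather than JXTA APIs: hash the key onto the locally known, possibly stale, ordered list of rendezvous peers, and if the expected peer does not hold the index entry, walk a limited range of its neighbors.

    import java.util.*;

    // Sketch of a hybrid DHT-plus-limited-walker lookup (illustrative only).
    class LooselyConsistentLookupSketch {
        interface RendezvousPeer {
            String id();
            boolean hasIndexEntry(String key);
        }

        // peers: this node's current (possibly stale) ordered peer view.
        static Optional<RendezvousPeer> lookup(List<RendezvousPeer> peers,
                                               String key, int walkRange) {
            if (peers.isEmpty()) return Optional.empty();
            int expected = Math.floorMod(key.hashCode(), peers.size());
            // Probe the expected peer first, then neighbours at distance 1..walkRange.
            for (int d = 0; d <= walkRange; d++) {
                for (int sign : (d == 0 ? new int[] {1} : new int[] {1, -1})) {
                    int idx = Math.floorMod(expected + sign * d, peers.size());
                    RendezvousPeer peer = peers.get(idx);
                    if (peer.hasIndexEntry(key)) return Optional.of(peer);
                }
            }
            return Optional.empty();   // fall back to broader query propagation
        }
    }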
Towards a Java™-Based Enterprise Client for Small Devices
The goal of the work reported here was to explore the use of the Java 2 Micro Edition (J2ME™) platform for applications connected to the enterprise, specifically focusing on Palm-based wireless applications. We found that the Java™ platform on the Palm is still maturing. The Palm itself has been carefully engineered to support small native applications, with a distinctive graphical user interface tuned for its display. Work remains to be done on the Palm to support more complex wireless applications and to make Java-based applications competitive. We also found that wireless enterprise applications in general are somewhat problematic, due to issues of network reliability, availability, bandwidth, and provisioning. Significantly, programming languages and their platforms are not the gating factors to large scale wireless deployment.
This work was performed in 2000 and 2001, before the current commercial deployment of Java-enabled mobile devices and faster wide-area wireless data services (such as GPRS). We hope to repeat our experiments using these technologies.
Computers without clocks
Scientific American, November 2002, pp. 62-69.
Implementation of a third-generation 1.1-GHz 64-bit microprocessor
IEEE Journal of Solid-State Circuits, Vol. 37, Issue 11, November 2002, pp. 1461-1469.
Radioport: A Radio Network for Monitoring and Diagnosing Computer Systems
A radio network is described for configuring, monitoring, and diagnosing the components of a computer system. Such a network offers several advantages: (a) It improves the robustness of the overall system by not having the monitoring functions rely on the interconnect of the monitored system; (b) by broadcasting information, it offers direct communication between the monitoring and monitored components thereby removing dependencies inherent to hierarchical and daisy-chained wired networks; (c) it does not rely on a physical interconnect thereby lowering implementation cost, offering non-intrusive monitoring, and improving reliability thanks to the lack of error- and failure-prone cables and connectors.
This report is an extended version of a paper presented at HOTI 2002, Stanford, California, August 2002. It received the Most Interesting New Topic Award.
The Least Choice First (LCF) Scheduling Method for High-speed Network Switches
We describe a novel method for scheduling high-speed network switches. The targeted architecture is an input-buffered switch with a non-blocking switch fabric. The input buffers are organized as virtual output queues to avoid head-of-line blocking. The task of the scheduler is to decide when the input ports can forward packets from the virtual output queues to the corresponding output ports. Our Least Choice First (LCF) scheduling method selects the input and output ports to be matched by prioritizing the input ports according to the number of virtual output queues that contain packets: The fewer virtual output queues with packets, the higher the scheduling priority of the input port. This way, the number of switch connections and, with it, switch throughput is maximized. Fairness is provided through the addition of a round-robin algorithm.
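A minimal sketch of one LCF matching pass follows, under the simplifying assumption that the round-robin tie-breaking used in the paper for fairness is omitted: repeatedly match the unmatched input port that has the fewest remaining choices among free outputs.

    import java.util.*;

    // Sketch of one Least-Choice-First matching pass (illustrative only).
    class LcfSchedulerSketch {
        // voqOccupied[i][j] == true if input i has packets queued for output j.
        static int[] schedule(boolean[][] voqOccupied) {
            int n = voqOccupied.length;
            int[] match = new int[n];                  // match[i] = output for input i, or -1
            Arrays.fill(match, -1);
            boolean[] outputTaken = new boolean[n];
            boolean[] inputDone = new boolean[n];
            while (true) {
                int bestInput = -1, bestChoices = Integer.MAX_VALUE;
                for (int i = 0; i < n; i++) {
                    if (inputDone[i]) continue;
                    int choices = 0;
                    for (int j = 0; j < n; j++)
                        if (voqOccupied[i][j] && !outputTaken[j]) choices++;
                    if (choices > 0 && choices < bestChoices) {
                        bestChoices = choices;
                        bestInput = i;
                    }
                }
                if (bestInput == -1) break;            // no input can still be matched
                for (int j = 0; j < n; j++) {
                    if (voqOccupied[bestInput][j] && !outputTaken[j]) {
                        match[bestInput] = j;          // grant one of its few choices
                        outputTaken[j] = true;
                        break;
                    }
                }
                inputDone[bestInput] = true;
            }
            return match;
        }
    }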
We present two alternative implementations: A central implementation intended for narrow switches and a distributed implementation based on an iterative algorithm intended for wide switches.
The simulation results show that the LCF scheduler outperforms other scheduling methods such as the parallel iterative matcher [1], iSLIP [12], and the wave front arbiter [16].
This report is an extended version of a paper presented at IPDPS 2002, Fort Lauderdale, Florida, April 2002.
Separated High-bandwidth and Low-latency Communication in the Cluster Interconnect Clint
An interconnect for a high-performance cluster has to be optimized with respect to both high throughput and low latency. To avoid the tradeoff between throughput and latency, the cluster interconnect Clint has a segregated architecture that provides two physically separate transmission channels: a bulk channel optimized for high-bandwidth traffic and a quick channel optimized for low-latency traffic. Different scheduling strategies are applied. The bulk channel uses a scheduler that globally allocates time slots on the transmission paths before packets are sent off. In this way, collisions as well as blockages are avoided. In contrast, the quick channel takes a best-effort approach by sending packets whenever they are available, thereby risking collisions and retransmissions.
Clint is targeted specifically at small- to medium-sized clusters offering a low-cost alternative to symmetric multiprocessor (SMP) systems. This design point allows for a simple and cost-effective implementation. In particular, by buffering packets only on the hosts and not requiring any buffer memory on the switches, protocols are simplified as switch forwarding delays are fixed, and throughput is optimized as the use of a global schedule is now possible.
This report is an extended version of a paper presented at SC2002, Baltimore, Maryland, November 2002.
DCAS-based Concurrent Deques Supporting Bulk Allocation
We present a lock-free implementation of a dynamically sized double-ended queue (deque) that is based on the double compare-and-swap (DCAS) instruction. This implementation improves over the best previous one by allowing storage to be allocated and freed in bulk when the size of the deque changes significantly, and to avoid invocation of the storage allocator at all while the size remains relatively stable. We achieved this implementation in two steps by first solving the easier problem of implementing the deque for a garbage-collected environment, and then applying the Lock-Free Reference Counting methodology we recently proposed in order to achieve a version independent of garbage collection.
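DCAS atomically compares and updates two independent memory locations. No mainstream hardware or standard Java API exposes it, so the sketch below only illustrates its semantics with a global lock; a real DCAS-based deque relies on the hardware primitive itself and is lock-free, which this stand-in is not.

    import java.util.concurrent.atomic.AtomicReference;

    // Illustration of DCAS *semantics* only; a lock is used here purely to
    // make the two-location compare-and-swap appear atomic.
    class DcasSemanticsSketch {
        private static final Object LOCK = new Object();

        static <T> boolean dcas(AtomicReference<T> loc1, T expect1, T new1,
                                AtomicReference<T> loc2, T expect2, T new2) {
            synchronized (LOCK) {
                if (loc1.get() == expect1 && loc2.get() == expect2) {
                    loc1.set(new1);
                    loc2.set(new2);
                    return true;    // both locations updated together
                }
                return false;       // neither location changed
            }
        }
    }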
Adaptive Coordination Among Fuzzy Reinforcement Learning Agents Performing Distributed Dynamic Load Balancing
In this paper we present an adaptive multi-agent coordination algorithm applied to the problem of distributed dynamic load balancing. As a specific example, we consider the problem of dynamic web caching in the Internet. In our general formulation of this problem, each agent represents a mirrored piece of content that tries to move itself closer to areas of the network with a high demand for this item. Each agent in our model uses a fuzzy rulebase for choosing the optimal direction of motion and adjusts the parameters of this rulebase using reinforcement learning. The resulting architecture for multi-agent coordination among fuzzy reinforcement learning agents (MAC-FRL) allows the team of agents to adaptively redistribute its members in the environment to match the changing pattern of demand. We simulate the performance of MAC-FRL and show that it significantly improves performance over non-coordinating agents.
The Repeat Offender Problem: A Mechanism for Supporting Dynamic-sized Lock-free Data Structures
We define the Repeat Offender Problem (ROP). Elsewhere, we have presented the first dynamic-sized lock-free data structures that can free memory to any standard memory allocator -- even after thread failures -- without requiring special support from the operating system, the memory allocator, or the hardware. These results depend on a solution to the ROP problem. Here we present the first solution to the ROP problem and its correctness proof. Our solution is implementable in most modern shared memory multiprocessors.
Dynamic-sized Lockfree Data Structures
We address the problem of integrating lockfree shared data structures with standard dynamic allocation mechanisms (such as malloc and free).
We have two main contributions. The first is the design and experimental analysis of two dynamic-sized lockfree FIFO queue implementations, which extend Michael and Scott's previous implementation by allowing unused memory to be freed. We compare our dynamic-sized implementations to the original on 16-processor and 64-processor multiprocessors. Our experimental results indicate that the performance penalty for making the queue dynamic-sized is modest, and is negligible when contention is not too high. These results were achieved by applying a solution to the Repeat Offender Problem (ROP), which we recently posed and solved.
Our second contribution is another application of ROP solutions. Specifically, we show how to use any ROP solution to achieve a general methodology for transforming lockfree data structures that rely on garbage collection into ones that use explicit storage reclamation.
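For orientation, the sketch below shows a Michael-and-Scott-style lock-free FIFO queue in the garbage-collected setting, where the JVM reclaims retired nodes automatically; the contribution described above is precisely how to free such nodes explicitly and safely when no collector is available.

    import java.util.concurrent.atomic.AtomicReference;

    // Minimal lock-free FIFO queue in the Michael-and-Scott style,
    // relying on the garbage collector to reclaim dequeued nodes.
    class LockFreeQueueSketch<T> {
        private static final class Node<T> {
            final T value;
            final AtomicReference<Node<T>> next = new AtomicReference<>();
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> head, tail;

        LockFreeQueueSketch() {
            Node<T> dummy = new Node<>(null);
            head = new AtomicReference<>(dummy);
            tail = new AtomicReference<>(dummy);
        }

        void enqueue(T value) {
            Node<T> node = new Node<>(value);
            while (true) {
                Node<T> last = tail.get();
                Node<T> next = last.next.get();
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node);   // swing tail (may fail harmlessly)
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next);       // help a lagging enqueuer
                }
            }
        }

        T dequeue() {
            while (true) {
                Node<T> first = head.get();
                Node<T> next = first.next.get();
                if (next == null) return null;            // queue empty
                if (head.compareAndSet(first, next)) return next.value;
            }
        }
    }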
Developing Secure Web Applications for Constrained Devices
Invited presentation at the 11th World Wide Web Conference, Hawaii, May 2002.
On the Design of a New CPU Architecture for Pedagogical Purposes
Ant-32 is a new processor architecture designed specifically to address the pedagogical needs of teaching many subjects, including assembly language programming, machine architecture, compilers, operating systems, and VLSI design. This paper discusses our motivation for creating Ant-32 and the philosophy we used to guide our design decisions and gives a high-level description of the resulting design.
Experiments in Wireless Internet Security
Proc. of IEEE Wireless Communications and Networking Conference (WCNS), Orlando, Mar. 2002.
Composing snippets
in Advances in Concurrency and Hardware Design (ACHD). Springer-Verlag's Lecture Notes, Computer Science, Vol. 2549, eds. J. Cortadella, A. Yakovlev, and G. Rozenberg. Springer-Verlag, 2002.
Experience in the Design, Implementation and Use of a Retargetable Static Binary Translation Framework
Binary translation, the process of translating binary executables, makes it possible to run code compiled for source (input) machine Ms on target (output) machine Mt. Unlike an interpreter or emulator, a binary translator makes it possible to approach the speed of native code on machine Mt. Translated code may still run slower than native code because low-level properties of machine Ms must often be modeled on machine Mt.
The University of Queensland Binary Translation (UQBT) framework is a retargetable framework for experimenting with static binary translation on CISC and RISC machines. The system was built jointly by The University of Queensland and Sun Microsystems Laboratories in order to experiment with translations to and from different machines, to understand how to migrate applications from other UNIX-based platforms to a (SPARC®, Solaris™) platform, and to experiment with translations from the current SPARC architecture to a future, not yet existing, version of the SPARC architecture.
This paper describes the overall design and architecture of the UQBT framework, the goals for the project, the resulting framework, experiences with translations across different machines, and lessons learned.
Towards a Java™-Based Enterprise Client for Small Devices
The goal of the work reported here was to explore the use of the Java 2 Micro Edition (J2ME™) platform for applications connected to the enterprise, specifically focusing on Palm-based wireless applications. We found that the Java™ platform on the Palm is still maturing. The Palm itself has been carefully engineered to support small native applications, with a distinctive graphical user interface tuned for its display. Work remains to be done on the Palm to support more complex wireless applications and to make Java-based applications competitive. We also found that wireless enterprise applications in general are somewhat problematic, due to issues of network reliability, availability, bandwidth, and provisioning. Significantly, programming languages and their platforms are not the gating factors to large scale wireless deployment.
A Transformational Approach to Binary Translation of Delayed Branches with Applications to SPARC® and PA-RISC Instruction Sets
A binary translator examines binary code for a source machine, optionally builds an intermediate representation, and generates code for a target machine. Understanding what to do with delayed branches in binary code can involve tricky case analyses, e.g., if there is a branch instruction in a delay slot. Correctness of a translation is of utmost importance. This paper presents a disciplined method for deriving such case analyses. The method identifies problematic cases, shows the translations for the non-problematic cases, and gives confidence that all cases are considered. The method supports such common architectures as SPARC®, MIPS, and PA-RISC.
We begin by writing a very simple interpreter for the source machine's code. We then transform the interpreter into an interpreter for a target machine without delayed branches. To maintain the semantics of the program being interpreted, we simultaneously transform the sequence of source-machine instructions into a sequence of target-machine instructions. The transformation of the instructions becomes our algorithm for binary translation. We show the translation is correct by reasoning about corresponding states on source and target machines.
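The simplest case of the transformation can be pictured as reordering: on a target machine without delay slots, a delayed branch and its delay-slot instruction are emitted with the slot instruction first. The toy sketch below handles only that case, and assumes the slot instruction does not modify registers the branch reads; the harder cases (annulled branches, control transfers in delay slots) are exactly what the derivation method above enumerates.

    import java.util.*;

    // Toy sketch: reorder (delayed branch, delay-slot instruction) pairs for
    // a target machine without delay slots. Non-problematic case only.
    class DelaySlotSketch {
        static boolean isDelayedBranch(String insn) {
            return insn.startsWith("b");               // toy instruction classifier
        }

        static List<String> translate(List<String> source) {
            List<String> target = new ArrayList<>();
            for (int i = 0; i < source.size(); i++) {
                String insn = source.get(i);
                if (isDelayedBranch(insn) && i + 1 < source.size()) {
                    target.add(source.get(i + 1));     // delay-slot instruction first
                    target.add(insn);                  // then the branch itself
                    i++;                               // the slot has been consumed
                } else {
                    target.add(insn);
                }
            }
            return target;
        }
    }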
Instantiation of this algorithm to the SPARC V8 and PA-RISC V1.1 architectures is shown. Of interest, these two machines share seven of 11 classes of delayed branching semantics; the PA-RISC has three classes which are not available in the SPARC architecture, and the SPARC architecture has one class which is not available in the PA-RISC architecture.
Although the delayed branch is an architectural idea whose time has come and gone, the method is significant to anyone who must write tools that deal with legacy binaries. For example, translators using this method could run PA-RISC on the new IA-64 architecture, or they may enable architects to eliminate delayed branches from a future version of the SPARC architecture.
*This report is a very extended version of TR 440, Department of Computer Science and Electrical Engineering, The University of Queensland, Dec 1998, and describes applications of the technique to translations of SPARC® and PA-RISC codes. This report fully documents the translation algorithms for these machines.
Walkabout: A Retargetable Dynamic Binary Translation Framework
Dynamic compilation techniques have found a renaissance in recent years due to their use in high-performance implementations of the Java™ language. Techniques originally developed for use in virtual machines for such object-oriented languages as Smalltalk are now commonly used in Java virtual machines (JVM™) and Java just-in-time compilers. These techniques have also been applied to binary translation in recent years, most commonly appearing in binary optimizers for a given platform that improve the performance of binary programs while they execute.
The Walkabout project investigates and develops dynamic binary translation techniques that are based on properties of retargetability, ease of experimentation, separation of machine-dependent from machine-independent concerns, and good debugging support. Walkabout is a framework for experimenting with dynamic binary translation ideas, as well as techniques in related areas such as interpreters, instrumentation tools, and optimization.
In this report, we present the design of the Walkabout framework and its initial implementation. Tools generated from this initial framework include disassemblers, machine code interpreters (emulators), and binary rewriting tools for the SPARC® and x86 architectures.
Securing the Wireless Internet
IEEE Communications Magazine, pp. 68-74.
KSSL: Experiments in Wireless Internet Security
Internet enabled wireless devices continue to proliferate and are expected to surpass traditional Internet clients in the near future. This has opened up exciting new opportunities in the mobile e-commerce market. However, data security and privacy remain major concerns in the current generation of "wireless web" offerings. All such offerings today use a security architecture that lacks end-to-end security. This unfortunate choice is driven by perceived inadequacies of standard Internet security protocols like SSL (Secure Sockets Layer) on less capable CPUs and low-bandwidth wireless links.
This report presents our experiences in implementing and using standard security mechanisms and protocols on small wireless devices. We have created new classes for the Java 2 Micro Edition (J2ME™) platform that offer fundamental cryptographic operations such as message digests and ciphers as well as higher level security protocols like SSL. Our results show that SSL is a practical solution for ensuring end-to-end security of wireless Internet transactions even within today's technological constraints.
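The KSSL classes themselves are not reproduced here; as a stand-in, the snippet below uses the standard java.security API to compute a message digest, the kind of primitive whose cost on small devices the report measures.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Stand-in example using the standard Java SE API (not the KSSL classes).
    class DigestExample {
        public static void main(String[] args) throws Exception {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest("wireless transaction".getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            System.out.println("SHA-1: " + hex);
        }
    }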
Parallel Garbage Collection For Shared Memory Multiprocessors
We present a multiprocessor "stop-the-world" garbage collection framework that provides multiple forms of load balancing. Our parallel collectors use this framework to balance the work of root scanning, using static overpartitioning, and also to balance the work of tracing the object graph, using a form of dynamic load balancing called work stealing. We describe two collectors written using this framework: pSemispaces, a parallel semispace collector, and pMarkcompact, a parallel markcompact collector.
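The work-stealing discipline for tracing can be sketched as follows: each GC thread takes work from its own end of its deque and, when that runs dry, steals from the opposite end of another thread's deque. ConcurrentLinkedDeque stands in here for the collector's own work-stealing queues; the names are illustrative, not the framework's API.

    import java.util.concurrent.ConcurrentLinkedDeque;

    // Sketch of the work-stealing discipline used to balance object tracing.
    class TraceWorkStealingSketch {
        final ConcurrentLinkedDeque<Object>[] deques;

        @SuppressWarnings("unchecked")
        TraceWorkStealingSketch(int nThreads) {
            deques = new ConcurrentLinkedDeque[nThreads];
            for (int i = 0; i < nThreads; i++) deques[i] = new ConcurrentLinkedDeque<>();
        }

        // Called by GC thread 'self' to obtain the next object to trace.
        Object nextWork(int self) {
            Object task = deques[self].pollLast();         // own end first
            if (task != null) return task;
            for (int i = 1; i < deques.length; i++) {      // otherwise try to steal
                int victim = (self + i) % deques.length;
                task = deques[victim].pollFirst();         // opposite end of the victim
                if (task != null) return task;
            }
            return null;                                   // no work available right now
        }

        // Called when tracing an object discovers more objects to scan.
        void pushWork(int self, Object obj) {
            deques[self].addLast(obj);
        }
    }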