Scalable Data and Workload Generation for Regression Testing, Performance Evaluation and Benchmarking
Technische Universität Berlin
Oracle Fellowship Recipient
Oracle Principal Investigator
An important part of the maintenance lifecycle of a commercial data management system like Oracle is devoted to the diagnosis of performance regressions observed by customers in a production setting. When trying to reproduce the problematic behavior in a test environment, database developers often face the problem of missing data – even though the database schema, the problematic queries and even the workload itself can be provided by the customer, the actual database instance typically cannot be obtained (for example, due to privacy restrictions or size restrictions). What typically is available immediately, though, is the database catalog, which contains a statistical approximation of the reference database in the form of value distributions, cardinalities and histograms on columns or column groups.
As a fallback solution, developers currently “trick” the optimizer of a test database by feeding it with this customer catalog data in order to obtain the query access paths of the actual production system. Because the underlying data is missing and the database catalog usually lacks crucial information, e.g. on multivariate distributions, synthetic datasets generated in the lab setting are not representative. Thus, finding out how the query access paths actually perform requires further assistance and feedback from the client. Such feedback could be obtained from recorded workloads, performance repository data (AWR in Oracle) and various application data models that maintain extra information about object relations outside the database catalogs. With these extra sources, generating data suitable for testing can become feasible.
In addition to diagnostic purposes, data generation can be useful for evaluating what-if scenarios that may affect a mission-critical system in the future. Generated data, possibly in combination with existing real data, can then be used in a test system along with suitable workloads that simulate the what-if scenario. Such experiments can yield a set of proactive actions to take in anticipation of future growth. Consequently, alongside the generation of new data from catalog information and existing data, the generation of workloads that simulate the anticipated future scenarios becomes very important. This generation can be based on existing workloads that are currently running on the systems and on input from users who define their future expectations about specific, carefully chosen aspects of the current workloads.
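As a rough illustration of combining an existing workload with user expectations, the following minimal Python sketch scales observed request counts by user-declared growth rates. The function name `project_workload`, the request type names, and the compound-growth model are all assumptions made for illustration; they are not part of any Oracle tool or of the project itself.

```python
def project_workload(observed, growth, periods):
    """Project request counts for a what-if scenario.

    observed: dict mapping a request type to its observed count
    growth:   dict mapping a request type to an expected growth rate
              per period (0.10 means +10%); missing types stay flat
    periods:  number of periods to project forward

    Assumption: growth compounds per period. This is one simple
    choice of trend model, not a prescribed one.
    """
    return {
        name: round(count * (1.0 + growth.get(name, 0.0)) ** periods)
        for name, count in observed.items()
    }


# Hypothetical observed mix and user-declared expectation.
observed = {"new_order": 1000, "customer_lookup": 400}
growth = {"new_order": 0.10}
projected = project_workload(observed, growth, periods=2)
```

The projected mix could then drive a workload replay in which each request type is issued in proportion to its projected count.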
We propose to address this problem in two stages: (1) data generation from the database catalog statistics and additional data synopses, such as those that can be extracted from existing data, workloads (sequence mining), performance data (top SQL, sampled session history) and application data models; and (2) workload generation from existing real workloads by declaratively defining trends in specific aspects of those workloads (number of orders, number of customers, trends in request sequences discovered during workload characterization).
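To make the first stage concrete, here is a minimal Python sketch of generating values for a single integer column from a bucketed histogram of the kind a catalog might hold. The bucket layout `(low, high, row_count)`, the function name `sample_column`, and the uniform spread of values inside a bucket are assumptions chosen for illustration; a real catalog histogram has its own format. Sampling each column independently in this way reproduces only the univariate distribution, which is exactly why additional synopses are needed to capture correlations between columns.

```python
import bisect
import random
from itertools import accumulate


def sample_column(histogram, n_rows, seed=0):
    """Draw n_rows integer values that follow a bucketed histogram.

    histogram: list of (low, high, row_count) buckets with inclusive
               integer bounds (a hypothetical layout, see lead-in).
    """
    rng = random.Random(seed)  # seeded for reproducible test data
    # Running totals of row counts, used to pick a bucket by weight.
    cumulative = list(accumulate(count for _, _, count in histogram))
    total = cumulative[-1]
    values = []
    for _ in range(n_rows):
        # Choose a bucket with probability proportional to its row count.
        idx = bisect.bisect_right(cumulative, rng.randrange(total))
        low, high, _ = histogram[idx]
        # Assumption: values are uniformly distributed within a bucket.
        values.append(rng.randint(low, high))
    return values


# Hypothetical histogram: most rows fall into the lowest value range.
hist = [(0, 9, 600), (10, 19, 300), (20, 99, 100)]
column = sample_column(hist, 1000)
```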