Comprehensive Database Support for Time Series
Libera Università di Bolzano
Oracle Fellowship Recipient
Oracle Principal Investigator
Kenny Gross, Architect
Matthias Brantner, Senior Director
Zhen Hua Liu
Efficient management and processing of time series data is extremely important, and its importance will continue to grow rapidly. Time series data is the most important class of data for capturing the dynamics of the world. The amount of such data (usually derived from sensors) is growing rapidly in almost all application areas, e.g., meteorology, finance, and the IoT. While sensor data typically represent regular time series with values at regular time intervals, there is a wide range of ‘event’ data that can be considered irregular time series, e.g., log files. In event data, the measurements follow a temporal sequence, but not necessarily at regular time intervals.
There has been intensive research activity on time series data in the past decade. However, it has almost always been conducted outside (relational) database systems. A wide range of techniques for processing time series data has been developed (e.g., MSET/SPRT, ARP, iSAX), as well as dedicated systems optimized for time series data (e.g., Prometheus, OpenTSDB, HBase). Similarly, array databases provide a data model that natively supports the efficient storage and processing of multidimensional time series (e.g., SciDB, Rasdaman, EXTASCID). Despite these intensive research activities, there have been no serious attempts to integrate the processing of time series data into (relational) database systems, and it therefore remains unclear how to seamlessly integrate time series into RDBMSs.
Processing time series data is a complex, multi-step process: it starts with data ingestion, followed by transformation, analysis/projection, and finally notification that provides the user with ‘actionable’ information. Each step in this process might involve complex operations and mathematics, which are often based on similarity rather than equality. Past research has concentrated on specific operations and individual steps, delivering highly efficient and specialized solutions based on assumptions about the data that are not always met in real applications. For instance, data originating from sensors are often incomplete and noisy and need to be cleaned before more advanced analyses can be applied. Supporting the entire life cycle is challenging, as each step is highly complex and poses specific problems.
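As an illustration, this life cycle can be sketched as a chain of small Python functions. The function names (ingest, clean, analyze, notify) and their simplistic bodies are hypothetical stand-ins for the far more sophisticated operations the project targets; the sketch assumes interior gaps only, with the first and last readings present.

```python
from statistics import fmean

def ingest(raw):
    # Parse (timestamp, value) records; a missing reading arrives as None.
    return [(t, v) for t, v in raw]

def clean(series):
    # Impute a missing value as the mean of its nearest non-missing
    # neighbours (a crude stand-in for proper imputation techniques).
    out = list(series)
    for i, (t, v) in enumerate(out):
        if v is None:
            prev = next(x for _, x in reversed(out[:i]) if x is not None)
            nxt = next(x for _, x in out[i + 1:] if x is not None)
            out[i] = (t, (prev + nxt) / 2)
    return out

def analyze(series, window=3):
    # Rolling mean over the last `window` values, as a stand-in for
    # more advanced analysis or projection operators.
    vals = [v for _, v in series]
    return [fmean(vals[max(0, i - window + 1):i + 1])
            for i in range(len(vals))]

def notify(scores, threshold):
    # Report the positions whose score exceeds a threshold,
    # i.e., the 'actionable' information delivered to the user.
    return [i for i, s in enumerate(scores) if s > threshold]
```

A usage example: `notify(analyze(clean(ingest(raw))), threshold)` runs a series through the whole chain; in a database-integrated solution, each stage would instead be a declarative SQL operator.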
Each phase of processing time series data poses a number of challenges. As time series data frequently come from sensors, they might be incomplete or recorded at different granularity levels or with disparate, and in some cases time-varying, sampling intervals. A pre-processing or transformation step is usually required before more advanced operations can be applied. For instance, missing values need to be imputed, as they diminish prognostic accuracy for some operations and increase false positives and false negatives in anomaly detection; when multiple time series are involved, they first need to be synchronized; and sensor data often require verification and the detection of anomalies. Oracle has developed innovative pre-processing algorithms to discover, flag, and optimally correct sensor anomalies, e.g., MSET/SPRT and ARP. After the pre-processing phase, a large variety of more advanced operators are applied for a thorough analysis of the data. A core operation is computing the similarity between time series. Many different similarity measures have been investigated and should be supported among the core operations, most notably the Euclidean distance and the dynamic time warping distance. Other advanced queries include subsequence mining, pattern matching, motif discovery, summarization, aggregation, abstraction, and situation awareness.
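The two similarity measures named above can be sketched in a few lines of Python. These are the textbook formulations, not any product's implementation: the Euclidean distance compares series of equal length point by point, while dynamic time warping (DTW) finds a minimum-cost alignment between series of possibly different lengths via dynamic programming.

```python
def euclidean_distance(a, b):
    # Point-wise L2 distance; only defined for equal-length series.
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming DTW with the
    # absolute difference as the local cost.
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

Note that DTW tolerates local stretching: a series and a slightly time-stretched copy of it have DTW distance 0, whereas their Euclidean distance would be undefined or large.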
Modern (relational) database systems are extremely powerful analytics engines, yet they provide little support for time series data. For example, Oracle Advanced Analytics (OAA) extends SQL with R and integrates a number of advanced data analysis and data mining algorithms into SQL, e.g., classification, regression, association rule mining, and clustering. The key technology for this is well-optimized user-defined functions (UDFs). Over 15 years ago, Oracle8i Time Series was a commercial product that leveraged the Oracle RDBMS to manage time series data. Due to compatibility problems with BI tools, this time series support was superseded by SQL Analytic Functions (also known as windowing functions), which deliver some basic concepts at the core of processing time series data, e.g., sliding windows. Optimized UDFs and Analytic Functions are basic building blocks for time series solutions, but a typical time series application requires a significant amount of procedural code to control the flow of processes and to manage the end-to-end life cycle of processes composed of UDFs and Analytic Functions. Despite the potential to express a wide range of time series process flows in SQL, little work has been done to identify the essential time series process flows to support in a SQL framework.
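To illustrate the sliding-window support that Analytic Functions already provide, the sketch below computes a moving average in standard SQL. It uses SQLite (which supports window functions since version 3.25) purely as a convenient self-contained stand-in for a full RDBMS, and the table and column names are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (ts INTEGER, val REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [(1, 10.0), (2, 12.0), (3, 11.0), (4, 20.0)])

# Sliding-window average over the current row and the two preceding
# rows, expressed with a SQL analytic (window) function.
rows = con.execute("""
    SELECT ts,
           AVG(val) OVER (ORDER BY ts
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sma
    FROM readings
    ORDER BY ts
""").fetchall()
```

The query is declarative, but anything beyond a single windowed aggregate, such as the multi-step pipelines described above, currently has to be stitched together with procedural code around such queries.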
In light of the above, the overarching objective of this research project is to deeply integrate time series and database technologies. The project will leverage existing technologies from both fields and advance SQL and relational database systems towards a powerful high-level shell for processing time series data in a declarative way.
The project will benefit from Oracle's experience in areas such as high-speed data analytics and specific time series technologies, as well as from the EPIs' experience in extending relational databases with temporal features. More specifically, MSET/SPRT and ARP are powerful and mature time series technologies developed at Oracle that can support several pre-processing operations. The EPIs have many years of experience in developing algorithmic solutions for temporal databases; they were the first to integrate comprehensive query support for temporal data into the kernel of a relational database system. The EPIs have also conducted research on time series data, such as the imputation of missing values and similarity search.