Provenance for Big Data



Principal Investigator

Boris Glavic

Illinois Institute of Technology

Oracle Fellowship Recipient

Seokki Lee, Xing Niu

Oracle Principal Investigator

Dieter Gawlick
Vasudha Krishnaswamy
Zhen Hua Liu


The sheer amount of data available in the Big Data age necessitates techniques, such as classification, that extract meaningful information from this data for human consumption. To understand the validity of extracted information and how it was derived from the input data, a human analyst needs to be able to investigate the extraction process and explore which inputs led to a particular result. Provenance is the technology of choice for reviewing and verifying extracted information in this way. In principle, existing provenance technology already supports this use case. However, the large amount of information returned by provenance requests in the context of Big Data analytics would be practically useless to reviewers. The objective of the project is to understand how to make provenance usable in Big Data environments by developing compact representations of provenance (e.g., based on aggregation, approximation, and relevancy) as well as techniques that enable interactive exploration of, and drill-down into, provenance information in a pay-as-you-go fashion.
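
As a rough illustration of the idea of a compact, aggregation-based representation with drill-down, the following Python sketch groups the provenance of a single result by a coarse attribute and refines one group on request. It is not the project's method: the record layout, the field names (`id`, `source`, `year`), and the helper functions `summarize` and `drill_down` are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical provenance of one result: the input records that contributed
# to it. The field names are illustrative assumptions, not a real schema.
provenance = [
    {"id": 1, "source": "sensor_a", "year": 2013},
    {"id": 2, "source": "sensor_a", "year": 2013},
    {"id": 3, "source": "sensor_a", "year": 2014},
    {"id": 4, "source": "sensor_b", "year": 2014},
]


def summarize(records, attrs):
    """Group provenance records by the values of the given attributes."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in attrs)
        groups[key].append(rec)
    return dict(groups)


def drill_down(records, attrs, key, finer_attr):
    """Refine a single summary group by one additional attribute."""
    members = summarize(records, attrs)[key]
    return summarize(members, list(attrs) + [finer_attr])


# Coarse view: how many inputs per source contributed to the result.
coarse = summarize(provenance, ["source"])
print({k: len(v) for k, v in coarse.items()})

# Pay-as-you-go refinement: expand only the group the analyst asks about.
fine = drill_down(provenance, ["source"], ("sensor_a",), "year")
print({k: len(v) for k, v in fine.items()})
```

The point of the sketch is that the analyst first sees one line per group instead of one line per input, and only the group of interest is expanded into finer detail.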

Only a few approaches study summarization of provenance, and most of them are quite theoretical in nature. Recently, we (a group of researchers at IIT including the EPI) have outlined how generalization may be implemented in a declarative query language. Nonetheless, efficient generalization and aggregation techniques do not yet exist, and current approaches lack generality and tight integration with provenance computation. One major challenge that has yet to be addressed is how to support such techniques without access to all input data, i.e., how to compute (approximate) provenance when some inputs are no longer available. This is critical for Big Data applications, where it is unrealistic to assume that all inputs can be stored indefinitely.