Distinct Value Estimation from a Sample: Statistical Methods vs. Machine Learning
Tomas Karnagel, Suratna Budalakoti, Onur Kocberber, Marc Jolles, Farhan Tauheed, Nipun Agarwal, Alan Wood
12 June 2022
Estimating the number of distinct values (NDV) in a dataset is an important operation in modern database systems for many tasks, including query optimization. In large scale systems, tables often contain billions of rows and wrong optimizer decisions can cause severe deterioration in query performance. Additionally in many situations, such as having large tables or NDV estimation after the application of filters, it is not feasible to scan the entire dataset to compute the number of distinct values. In such cases, the only available option is to use a dataset sample to estimate the NDV. This, however, is not trivial as data properties of the sample usually do not mirror the properties of the full dataset. Approaches in related work have shown that this kind of estimation is connected to large errors. In this paper, we present two novel approaches for the problem of estimating the number of distinct values from a dataset sample. Our first approach presents a novel statistical estimator that shows good and robust results across a broad range of datasets. The second approach is based on Machine Learning (ML), hence being the first time that ML is applied to this problem. Both approaches outperform the state-of-the-art, with the ML approach reducing the average error by 3x for real-world datasets. Beyond pure prediction quality, both our approaches have their own set of advantages and disadvantages, and we show that the right approach actually depends on the specific application scenario.
Venue : The ACM SIGMOD/PODS Conference (Industrial Track), 2022, at Philadelphia, PA, USA