Conference Publication

A Two-List Framework for Accurate Detection of Frequent Items in Data Streams
July 2018

Detecting the most frequent items in large data sets, and providing accurate frequency estimates for those items, is increasingly important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. We present FSSA, an efficient array-based implementation of this framework, together with an adaptive version that adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. An analytical comparison with the FSS algorithm shows that FSSA has smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. We also give a theoretical analysis of the space and time complexity of FSSA and its benchmark algorithms. Finally, we show that the FSS2L framework can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.
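
For readers unfamiliar with the problem setting, the following is a minimal sketch of the classic Space-Saving counter, the family of algorithms that FSS builds on. It is not the FSSA algorithm or the two-list framework from the paper, whose details are not given in this abstract. The function name `space_saving` and the parameter `capacity` are illustrative assumptions.

```python
def space_saving(stream, capacity):
    """Classic Space-Saving sketch (illustrative; NOT the paper's FSSA).

    Tracks at most `capacity` keys and returns
    {key: (estimated_count, max_overestimate)}.
    """
    counts = {}  # key -> estimated frequency (never below the true count)
    errors = {}  # key -> upper bound on how much the estimate overshoots
    for key in stream:
        if key in counts:
            counts[key] += 1
        elif len(counts) < capacity:
            counts[key] = 1
            errors[key] = 0
        else:
            # Table is full: evict the key with the smallest estimate.
            # The newcomer inherits that count as its possible error.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[key] = floor + 1
            errors[key] = floor
    return {k: (counts[k], errors[k]) for k in counts}


result = space_saving(["a"] * 5 + ["b"] * 3 + ["c", "d", "a"], capacity=3)
```

In this example `result["d"]` is `(2, 1)`: the key `d` appears once, but it inherited a count of 1 from the evicted key `c`, so its estimate may overshoot by up to 1. Reducing this kind of overestimation is the sort of frequency estimation error the abstract refers to. The linear scan for the minimum is chosen for brevity; practical implementations use a structure with faster minimum lookup.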

Authors: David Vengerov

Venue: 14th International Conference on Machine Learning and Data Mining (MLDM 2018)

