A Two-List Framework for Accurate Detection of Frequent Items in Large Data Sets

A Two-List Framework for Accurate Detection of Frequent Items in Large Data Sets

David Vengerov

25 March 2017

The problem of detecting the most frequent items in large data sets and providing accurate frequency estimates for those items is becoming more and more important in a variety of domains. We propose a new two-list framework for addressing this problem, which extends the state-of-the-art Filtered Space-Saving (FSS) algorithm. Two algorithms implementing this framework are presented: FSSAL and FSSA. An adaptive version of these algorithms is presented, which adjusts the relative sizes of the two lists based on the estimated number of distinct keys in the data set. Analytical comparison with the FSS algorithm showed that FSS2L algorithms have smaller expected frequency estimation errors, and experiments on both artificial and real workloads confirm this result. A theoretical analysis of space and time complexity for all considered algorithms was performed. Finally, we showed that FSS2L algorithms can be naturally parallelized, leading to a linear decrease in the maximum frequency estimation error.


Venue : ICDM 2017: http://icdm2017.bigke.org/