Bloom filters have recently received a significant amount of interest due to their many applications in the large data and analytics spaces. The original use of the Bloom filter (by Bloom and in the years that followed) was in the form of a data structure that could answer queries of the form “Does x exist in the given set S?” Bloom filters can answer such set membership queries very efficiently with a low false positive rate, and with no false negatives.
This paper tries to present probabilistic analysis of usage of Bloom filters for set intersection questions (queries of the form “Do the sets S1 and S2 overlap?”) instead of set membership questions. This paradigm shift in the usage of Bloom filters is significant, and has happened even though the probabilistic models and the analysis were presented mainly for set membership queries.
The authors consider both the partitioned Bloom filters (in which each hash function maps to a separate partition within the range) and unpartitioned Bloom filters (in which all hash functions map to the entire range). They present some very interesting results. For example, they show that, compared to unpartitioned Bloom filters, partitioned Bloom filters have a lower probability of false positives when one considers set intersection questions.
In the introductory section, the authors also present a very thorough overview of the state of the art in the field of parallel processing and concurrency. Though this is interesting, it is not central to the study of Bloom filters. Parallel processing is a broad field, and not all challenges relevant to parallel processing apply to Bloom filters.
The empirical results section presents the probability values by Bloom filter length, but stops short of actually recommending the number of hash functions and the filter length depending on the sizes of the sets. This could be considered a slightly negative aspect of this otherwise sound paper.
Written in a style that is easy to understand, and using basic concepts in probability, the paper does a good job of deriving the results, and the presented theorems are short and crisp.