SIGIR 2017 Best Paper

1.Introduction

a. Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing.
Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao. 1998. Inverted files versus signature files for text indexing
b. Signature-based approach(BitFunnel) faces some challenges:

  • Excessive Memory Reads
  • Wide Range of Term Frequencies
  • Wide Range of Document Lengths
  • False Positives
  • Hard to Configure

2.Related Work

a. Inverted Indexes
A mapping from each term in the lexicon to the the set of documents containing the term
b. Bit-String Signatures
The signature is essentially the sequence of bits that make up a Bloom filter representing the set of terms in the document.
--1-1
c. Bit-Sliced Signatures
This approach transposes signature vectors from rows to columns to allow multiple documents to be searched simultaneously
--2
d. Bit-Sliced Blocked Signatures
Create shorter rows by assigning multiple documents to each column in the signature matrix.

3.Proposed Methods

A. Higher Row Rank
Each term simultaneously hashes to multiple bit-sliced signatures with different
blocking factors.
A row of rank r ≥ 0 has a blocking factor of 2r
--6
B. Frequency Conscious Bloom Filter
--3
--4
C. Sharding Across Cluster
Sharding by Document Length
--5
D. Handing False Positive
Reject those documents that score low while never rejecting documents that score high.
--7

Performance Model And Optimization

----_20171229185640

4.Experiment Results

----_20171229185823
----_20171229185853

5.Conclusion

In order to address speed and space problems associated with bit-string and bit-sliced signatures. The article introduce higher rank rows to reduce query execution time, employ frequency-conscious signatures to reduce the memory footprint, and shard the corpus to reduce the variability in document size. Compared to inverted lists,BitFunnel improved server query capacity by a factor of 10.