inverted index compression

Published on 11/13,2016

Most techniques for inverted index compression  first replace each docID (except the first in a list) by the difference between it and the preceding docID, called d-gap, and then encode the d-gap using some integer compression algorithm. Using d-gaps instead of docIDs decreases the average value that needs to be compressed, resulting in a higher compression ratio. Of course, these values have to be summed up again during decompression, but this can usually be done very efficiently. Thus, inverted index compression techniques are concerned with compressing sequences of integers whose average value is small inverted index compression techniques are concerned with compressing sequences of integers whose average value is small. The resulting compression ratio depends on the exact properties of these sequences, which depend on the way in which docIDs are assigned to documents. The basic idea here is that if we assign docIDs such that many similar documents (i.e., documents that share a lot of terms) are close to each other in the docID assignment, then the resulting sequence of d-gaps will become more skewed, with large clusters of many small values interrupted by a few larger values, resulting in better compression. In contrast, if docIDs are assigned at random, the distribution of gaps will be basically exponential, and small values will not be clustered together. 


Comments

Leave a Reply

Add comment
Info

unmoderate_note

Comments are moderated to prevent spam. This may cause a delay before your post appears.

 authimage