### inverted index compression

Most techniques for inverted index compression ﬁrst replace each docID (except the ﬁrst in a list) by the diﬀerence between it and the preceding docID, called d-gap, and then encode the d-gap using some integer compression algorithm. Using d-gaps instead of docIDs decreases the average value that needs to be compressed, resulting in a higher compression ratio. Of course, these values have to be summed up again during decompression, but this can usually be done very eﬃciently. Thus, inverted index compression techniques are concerned with compressing sequences of integers whose average value is small inverted index compression techniques are concerned with compressing sequences of integers whose average value is small. The resulting compression ratio depends on the exact properties of these sequences, which depend on the way in which docIDs are assigned to documents. The basic idea here is that if we assign docIDs such that many similar documents (i.e., documents that share a lot of terms) are close to each other in the docID assignment, then the resulting sequence of d-gaps will become more skewed, with large clusters of many small values interrupted by a few larger values, resulting in better compression. In contrast, if docIDs are assigned at random, the distribution of gaps will be basically exponential, and small values will not be clustered together.