Tags: c, algorithm, search, compression, inverted-index

Compressing a list of integers for search access


I have a prefix-to-rows mapping that looks something like this:

{
    "x":   [9],
    "far": [1,2,3,4,5],
    "car": [1,4,5]
}

The key is the indexed search term and the array is a sorted list of the rows that match. Simple enough, a basic inverted index. For this question, let's suppose a-z0-9 characters with a maximum length of three characters (an upper bound of 36 + 36^2 + 36^3 = 47,988 combinations, though probably far fewer in practice, say around 10k).

However, the tricky part is that I may have ~10M rows, and a low-cardinality term could carry a (meaningless) list of all 10M rows. By my calculations, a 10M-row array by itself comes out to 88.9MB uncompressed.
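For what it's worth, that figure reproduces in Python, assuming the row IDs run from 1 to 10,000,000 and the list is serialized as a JSON array (json.dumps's default ", " item separator):

    import json

    # Size of a JSON array holding every row ID from 1 to 10,000,000.
    rows = list(range(1, 10_000_001))
    size_mb = len(json.dumps(rows)) / 1e6
    print(f"{size_mb:.1f} MB")  # -> 88.9 MB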

What would be a suggested way to compress these often-repeated arrays? This must be a very common problem in search, and I'd like to learn more about the best methods for handling large, repetitive prefix maps such as the one above.


Solution

  • I would recommend using something like Roaring Bitmaps (described in the paper "Better bitmap performance with Roaring bitmaps"). There are implementations in Python, Java, and C, and they automatically switch between three container formats for optimal storage density. If you want to implement something similar yourself, it essentially combines:

    1. Arrays (the default)
    2. Bitsets (good for denser collections)
    3. Run-length encoding (store the start of each "run" of consecutive numbers and the length of that run)

    The format works on 32-bit unsigned integers, which comfortably covers 10M row IDs, so no custom solution of your own is necessary.
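    As a minimal sketch (assuming the pyroaring binding from PyPI; the "the" term is made up for illustration), the map from the question could store a compressed bitmap per term instead of a plain list:

        from pyroaring import BitMap

        # One Roaring bitmap of matching row IDs per indexed term.
        index = {
            "x":   BitMap([9]),
            "far": BitMap([1, 2, 3, 4, 5]),
            "car": BitMap([1, 4, 5]),
        }

        # Worst case from the question: a term that matches all 10M rows.
        full = BitMap(range(10_000_000))
        full.run_optimize()            # convert dense chunks to run containers
        index["the"] = full
        print(len(full.serialize()))   # a few KB rather than ~89MB

        # Set operations run directly on the compressed representation:
        print(list(index["far"] & index["car"]))  # [1, 4, 5]

    When persisting the index, each bitmap can be written out with serialize() and loaded back with BitMap.deserialize().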