I have a prefix-to-rows mapping that looks something like this:
{
"x": [9],
"far": [1,2,3,4,5],
"car": [1,4,5]
}
The key is the indexed search term and the array is a sorted list of the rows that have a match. Simple enough, a basic inverted index. For this question, let's assume keys of a-z0-9 characters with a maximum length of three characters (an upper bound of 36+36^2+36^3 = 47,988 combinations, though probably far fewer in practice, say around 10k combinations).
However, the tricky part is that I may have ~10M rows, and low-cardinality terms could end up with a (meaningless) list of all 10M rows. By my calculations, a 10M-row array by itself comes out to 88.9MB uncompressed.
What would be a suggested way to compress these often-repeated arrays? This must be a very common problem in search, and I'd like to learn more about the best way to handle large, repetitive prefix maps like the one above.
I would recommend using something like Roaring Bitmaps (described in this paper). There are implementations in Python, Java, and C, and they automatically switch between 3 different container formats for optimal storage density. If you want to implement something similar yourself, it essentially combines:

- partitioning the 32-bit row-ID space into chunks keyed by the high 16 bits of each ID, and
- storing each chunk's low 16 bits as whichever of a sorted array (sparse chunks), an uncompressed bitmap (dense chunks), or a run-length encoding is smallest, as sketched below.
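If you do want to roll your own, here is a minimal sketch of that chunk-and-container idea in Python. It is not the real Roaring format, just the mechanics, and it assumes the 4096-value threshold Roaring uses for the array-to-bitmap switch:

from array import array

ARRAY_LIMIT = 4096  # Roaring switches a chunk from array to bitmap above 4096 values

def build_containers(row_ids):
    """Group 32-bit row IDs by their high 16 bits and pick a container per chunk."""
    chunks = {}
    for rid in row_ids:
        chunks.setdefault(rid >> 16, []).append(rid & 0xFFFF)

    containers = {}
    for hi, lows in chunks.items():
        if len(lows) <= ARRAY_LIMIT:
            # sparse chunk: sorted uint16 array, 2 bytes per value
            containers[hi] = ("array", array("H", sorted(lows)))
        else:
            # dense chunk: fixed 8 KiB bitmap covering all 65,536 possible low values
            bits = bytearray(8192)
            for lo in lows:
                bits[lo >> 3] |= 1 << (lo & 7)
            containers[hi] = ("bitmap", bits)
    return containers

# A prefix that matches rows 0..999,999 needs only 16 dense chunks (~128 KB),
# and the same math caps a full 10M-row list at roughly 1.2 MB instead of ~90 MB.
containers = build_containers(range(1_000_000))
size = sum(2 * len(c) if kind == "array" else len(c) for kind, c in containers.values())
print(len(containers), "chunks,", size, "bytes")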
The format works on 32-bit unsigned integers, which comfortably covers 10M row IDs without any extra work to customize a solution of your own.
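In practice you would just lean on one of the existing libraries. Here is a quick example with the pyroaring Python binding; the package name and the exact calls (serialize()/deserialize(), the set operators) are assumptions to verify against the version you install:

from pyroaring import BitMap  # assumed binding; the Java and C implementations work similarly

# The worst case from the question: a prefix whose list is all 10M rows.
all_rows = BitMap(range(10_000_000))

# Serialize for storage next to the index key; consecutive IDs compress to a tiny
# fraction of the ~90 MB a raw integer list would take.
blob = all_rows.serialize()
print(len(blob), "bytes serialized,", len(all_rows), "rows")

# Round-trips and set operations work directly on the compressed structure.
restored = BitMap.deserialize(blob)
far = BitMap([1, 2, 3, 4, 5])
car = BitMap([1, 4, 5])
print(list(far & car))  # rows matching both prefixes -> [1, 4, 5]

Per prefix, you would store the serialized bytes in place of the JSON array and deserialize lazily when that prefix is queried.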