
Are Serialized Roaring Bitmaps compressed?


According to its definition, a Roaring bitmap is a "better compressed bitset", which implies it is compressed by nature. But according to my test, the output of Dgraph's Serialized Roaring Bitmap does not appear to be compressed at all:


package main

import (
    "bytes"
    "fmt"
    "math/rand"

    "github.com/RoaringBitmap/roaring"
    "github.com/dgraph-io/sroar"
)

func main() {
    // github.com/dgraph-io/sroar
    sr := sroar.NewBitmap()
    for sr.GetCardinality() < 1000 {
        sr.Set(uint64(rand.Int()))
    }
    buf := sr.ToBuffer()
    fmt.Printf("sroar size: %d\n", len(buf))

    // github.com/RoaringBitmap/roaring
    r := roaring.New()
    for r.GetCardinality() < 1000 {
        r.AddInt(rand.Int())
    }
    var bs bytes.Buffer
    r.WriteTo(&bs)
    fmt.Printf("roar size: %d\n", bs.Len())
}

output:

sroar size: 152704
roar size: 9936

That is, the "original" roaring bitmap needs 9936 bytes for one thousand uint32 values (that library casts int to uint32), which is 4000 bytes of "raw data", while Dgraph's SRB needs 152704 bytes for one thousand uint64 values, which is 8000 bytes of "raw data".

Is it true that a Roaring bitmap is compressed by nature and can be used as-is (without decompressing, etc.)?


Solution

  • Roaring is itself compressed insofar as it is close to the lowest possible entropy for some use cases. If you add close-to-random values, there won't be much compressibility.
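
    As a concrete sketch of this (a complete program using the same github.com/RoaringBitmap/roaring API as the question; the sizes in the comments are rough expectations, not guaranteed numbers), compare 1000 values scattered across the whole 32-bit space with 1000 values confined to a single 16-bit chunk:

        package main

        import (
            "fmt"
            "math/rand"

            "github.com/RoaringBitmap/roaring"
        )

        func main() {
            // 1000 random values scatter across many 16-bit chunks,
            // so most containers hold only a handful of entries each.
            scattered := roaring.New()
            for scattered.GetCardinality() < 1000 {
                scattered.AddInt(rand.Int())
            }

            // 1000 values inside one 16-bit chunk share a single container.
            clustered := roaring.New()
            for clustered.GetCardinality() < 1000 {
                clustered.Add(uint32(rand.Intn(1 << 16)))
            }

            fmt.Println(scattered.GetSerializedSizeInBytes()) // roughly 10 KB
            fmt.Println(clustered.GetSerializedSizeInBytes()) // roughly 2 KB
        }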

    There are three main Roaring container types which underlie the primary data structure users see.

    Runs, Arrays and Bitmaps.

    Runs represent contiguous ranges of numbers such as [1,1000], stored with run-length encoding. This is compressed: instead of representing the range as 1, 2, 3, ..., 1000, it is stored as the pair [1,1000]. The range can span a very large number of values, yet it takes only two integers to represent.
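
    To see this in code (a snippet reusing the imports from the sketch above; AddRange, RunOptimize and GetSerializedSizeInBytes are all part of the roaring Go API), one million consecutive values serialize to a few hundred bytes instead of megabytes:

        r := roaring.New()
        r.AddRange(0, 1000000) // one million consecutive values
        r.RunOptimize()        // convert eligible containers to run containers
        fmt.Println(r.GetSerializedSizeInBytes()) // a few hundred bytes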

    Arrays are sorted, and for most implementations the maximum size of the array is 4096 entries. There is probably some compressibility in this, depending on the distribution of the values: if it is random, not much; if it has some structure, some more compression can be gained.
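
    For example (same assumptions as the snippet above), a few thousand sparse values that all fall below 65536 end up in one sorted array of 16-bit entries, costing about two bytes per value:

        a := roaring.New()
        for i := 0; i < 4000; i++ {
            a.Add(uint32(i * 16)) // 4000 sparse values, all within one 16-bit chunk
        }
        fmt.Println(a.GetSerializedSizeInBytes()) // roughly 8000 bytes, ~2 per value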

    Containers with more than 4096 values are represented by a packed bitset: every group of 64 consecutive values (addressed by the low 6 bits) is packed into one 64-bit int. This is likely compressible given the right distribution. As an extreme example, if you fill it with an alternating skip pattern (a 1, then n zeros, then a 1, and so on), the result would likely compress further, since a compressor could apply run-length encoding to it.
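
    A sketch of the fixed cost (same assumptions as above): once a chunk holds more than 4096 values, it becomes a 65536-bit bitset of 1024 64-bit words, i.e. 8192 bytes, no matter how dense it gets beyond that:

        b := roaring.New()
        for i := 0; i < 65536; i += 2 { // 32768 alternating values in one chunk
            b.Add(uint32(i))
        }
        fmt.Println(b.GetSerializedSizeInBytes()) // about 8 KB: the bitset is fixed-size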

    On top of these is a key structure. Every 32-bit int is split on its first 16 bits: the first (high) 16 bits key into one of the above data structures, and the remaining (low) 16 bits are placed inside that data structure. This yields some size savings, since the high 16 bits are stored only once per container.
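
    The split itself is plain bit arithmetic; a tiny illustration (variable names are mine):

        x := uint32(0x00012345)
        key := uint16(x >> 16)    // high 16 bits: select the container
        low := uint16(x & 0xFFFF) // low 16 bits: stored inside the container
        fmt.Printf("key=%d low=%d\n", key, low) // key=1 low=9029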

    Source: I contribute to Roaring