python hdf ragged rhdf5

Space-efficient data store for a list of lists of lists. Elements are integers, and all lists vary in length


Say my data looks like this

thisList = [
     [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]] ,
     [[21, 2233, 2], [2, 3, 2,1, 32, 22], [3]], 
     [[3]], 
     [[23, 12], [55, 3]],
     ....
]

What is the most space-efficient way to store this type of data?

I looked at NumPy files, but NumPy only supports uniform-length data.

I looked at HDF5, which supports 1D ragged tensors but not 2D:

https://stackoverflow.com/a/42659049/3259896

So one option is to create a separate HDF5 file for every list in thisList, but I would potentially have 10-20 million of those lists.


Solution

  • I ran benchmarks saving a ragged nested list with JSON, BSON, Numpy, and HDF5.

    TLDR: use compressed JSON, because it is the most space efficient and easiest to encode/decode.

    On the synthetic data, here are the results (with du -sh test*):

    4.5M    test.json.gz
    7.5M    test.bson.gz
    8.5M    test.npz
    261M    test_notcompressed.h5
    1.3G    test_compressed.h5
    

    Compressed JSON is the most efficient in terms of storage, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping. BSON comes in second, but the list has to be converted to a mapping first, which complicates encoding and decoding (and negates the encoding/decoding speed benefits of BSON over JSON). NumPy's compressed NPZ format is third best, but as with BSON, the ragged list must be made into a dictionary before saving. HDF5 is surprisingly large, especially when compressed. This is probably because every innermost list becomes its own dataset (hundreds of thousands in this benchmark), and compression adds overhead to each dataset.
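
To illustrate the decoding side of the claim above, here is a minimal round-trip sketch. The `load_json_gz` helper is a hypothetical name that is not part of the original benchmark, and `save_json_gz` is rewritten here with `gzip.open` in text mode only to keep the example self-contained.

```python
import gzip
import json


def save_json_gz(obj, filepath):
    """Write obj as gzip-compressed JSON."""
    with gzip.open(filepath, mode="wt", encoding="utf-8") as f:
        json.dump(obj, f)


def load_json_gz(filepath):
    """Read gzip-compressed JSON back into nested Python lists.

    Hypothetical helper, not part of the original benchmark.
    """
    with gzip.open(filepath, mode="rt", encoding="utf-8") as f:
        return json.load(f)
```

Because JSON represents nested lists directly, the loaded object should compare equal to the original ragged list without any reshaping step.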


    Benchmarks

    Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.

    def get_ragged_list(length=100000):
        """Return ragged nested list."""
        import random
    
        random.seed(42)
        l = []
        for _ in range(length):
            n_sublists = random.randint(1, 9)
            sublist = []
            for i in range(n_sublists):
                subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
                sublist.append(subsublist)
            l.append(sublist)
        return l
    
    def save_json_gz(obj, filepath):
        import gzip
        import json
    
        json_str = json.dumps(obj)
        json_bytes = json_str.encode()
        with gzip.GzipFile(filepath, mode="w") as f:
            f.write(json_bytes)
    
    def save_bson(obj, filepath):
        import gzip
        import bson
    
        d = {}
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                key = f"{ii}/{jj}"
                d[key] = nn
        b = bson.BSON.encode(d)
        with gzip.GzipFile(filepath, mode="w") as f:
            f.write(b)
    
    def save_numpy(obj, filepath):
        import numpy as np
    
        d = {}
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                key = f"{ii}/{jj}"
                d[key] = nn
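        # Note: d is passed positionally, so the whole dict is stored as a
        # single pickled object array named "arr_0" inside the .npz file.
        np.savez_compressed(filepath, d)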
    
    def save_hdf5(obj, filepath, compression="lzf"):
        import h5py
    
        with h5py.File(filepath, mode="w") as f:
            for ii, n in enumerate(obj):
                for jj, nn in enumerate(n):
                    name = f"{ii}/{jj}"
                    f.create_dataset(name, data=nn, compression=compression)
    
    ragged = get_ragged_list()
    
    save_json_gz(ragged, "test.json.gz")
    save_bson(ragged, "test.bson.gz")
    save_numpy(ragged, "test.npz")
    save_hdf5(ragged, "test_notcompressed.h5", compression=None)
    save_hdf5(ragged, "test_compressed.h5", compression="lzf")
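
The BSON and NumPy functions above flatten the ragged list into a mapping keyed by "ii/jj". Reading such a file back means reversing that step. The following is a sketch of one way to do it, assuming the same key scheme; `to_mapping` and `from_mapping` are hypothetical names, not functions from the benchmark.

```python
def to_mapping(nested):
    """Flatten a list of lists of lists into {"i/j": innermost_list}."""
    return {
        f"{i}/{j}": inner
        for i, sub in enumerate(nested)
        for j, inner in enumerate(sub)
    }


def from_mapping(mapping):
    """Rebuild the nested list from {"i/j": innermost_list}."""
    # Sort keys numerically: plain string sorting would put "10/0" before "9/0".
    keys = sorted(mapping, key=lambda k: tuple(int(p) for p in k.split("/")))
    nested = []
    for key in keys:
        i = int(key.split("/")[0])
        # Grow the outer list until index i exists.
        while len(nested) <= i:
            nested.append([])
        nested[i].append(list(mapping[key]))
    return nested
```

One caveat of this key scheme: an outer entry with no inner lists produces no keys, so trailing empty entries cannot be recovered. The synthetic data in this benchmark always has at least one sublist per entry, so that case does not arise here.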
    

    Versions of relevant packages:

    python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
    pymongo bson 3.10.1
    numpy 1.18.2
    h5py 2.10.0