python dill brotli

Serializing, compressing and writing a large object to file in one go takes too much memory


I have a list of very large objects (objects) that I want to compress and save to the hard drive.

My current approach is

import pickle

import brotli
import dill

# serialize list of objects
objects_serialized = dill.dumps(objects, pickle.HIGHEST_PROTOCOL)
# compress serialized string
objects_serialized_compressed = brotli.compress(objects_serialized, quality=1)
# write compressed string to file (output is a file opened in binary write mode)
output.write(objects_serialized_compressed)

However, if objects is very large, this leads to a memory error, since -- for some time -- I simultaneously hold objects, objects_serialized, and objects_serialized_compressed in memory in their entirety.

Is there a way to do this chunk-wise? Presumably the first step -- serializing the objects -- has to be done in one go, but perhaps the compression and writing to file can be done chunk-wise?


Solution

  • After many attempts, I'd try this:

    import brotli
    import dill
    import io
    import pickle
    
    # The following serialized object is 30kb
    objects = ["234r234r234", "3f234f2343f3", "234ff234f234f234rf32"]*5000
    objects_serialized = dill.dumps(objects, pickle.HIGHEST_PROTOCOL)
    
    # Set up a buffer for reading chunks of serialized data
    chunk_size = 1024 * 1024
    buffer = io.BytesIO(objects_serialized)
    
    # Create compressor for repeated use
    compressor = brotli.Compressor(quality=1)
    with open('output.brotli', 'wb') as output:
        # Read chunks from the buffer and compress them
        while True:
            chunk = buffer.read(chunk_size)
            if not chunk:
                break
            compressed_chunk = compressor.process(chunk)
            output.write(compressed_chunk)
    
        # Flush the remaining compressed data
        compressed_remainder = compressor.finish()
        output.write(compressed_remainder)
    
    # The resulting file is about 4kb on my computer.
    # I decompressed and de-serialized it, and got the original object back.
    

    This requires brotli 1.0.9, as provided by pip -- it does not work with brotlipy, as provided by anaconda.
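
The code above still builds objects_serialized in full before compressing it. One possible way to avoid that copy as well is to let the pickler write directly into a small file-like wrapper that compresses each piece as it arrives. The following is only a sketch under assumptions: CompressingWriter, feed and finish are hypothetical names made up for this illustration, and zlib and pickle stand in for brotli and dill so that it runs with the standard library alone. With brotli 1.0.9 one would presumably pass compressor.process and compressor.finish instead, and dill.dump takes a file argument in the same way as pickle.dump.

```python
import io
import pickle
import zlib


class CompressingWriter:
    """Hypothetical write-only file-like object that compresses what is written to it."""

    def __init__(self, sink, feed, finish):
        self._sink = sink      # binary file-like object that receives compressed bytes
        self._feed = feed      # callable: bytes -> compressed bytes (possibly empty)
        self._finish = finish  # callable: () -> trailing compressed bytes

    def write(self, data):
        # Compress the piece handed over by the pickler and pass it on immediately
        self._sink.write(self._feed(data))
        return len(data)

    def close(self):
        # Flush whatever the compressor is still holding back
        self._sink.write(self._finish())


objects = ["234r234r234", "3f234f2343f3", "234ff234f234f234rf32"] * 5000

# Assumption: zlib replaces brotli here only to keep the sketch dependency-free
compressor = zlib.compressobj(level=1)
sink = io.BytesIO()  # stand-in for open('output.zlib', 'wb')

writer = CompressingWriter(sink, compressor.compress, compressor.flush)
# The pickler only needs a write() method, so no full serialized copy is created here
pickle.dump(objects, writer, protocol=pickle.HIGHEST_PROTOCOL)
writer.close()

# Round-trip check: decompress, de-serialize, compare with the original
restored = pickle.loads(zlib.decompress(sink.getvalue()))
```

With this arrangement only the original objects and the pickler's internal output buffer are held in memory at any time. How much this saves in practice depends on how the pickler frames its output, so it is worth measuring on real data.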