I have a list of very large objects, objects, that I want to compress and save to the hard drive.
My current approach is:
import brotli
import dill
import pickle
# serialize list of objects
objects_serialized = dill.dumps(objects, pickle.HIGHEST_PROTOCOL)
# compress serialized string
objects_serialized_compressed = brotli.compress(data=objects_serialized, quality=1)
# write compressed string to file
with open('output.brotli', 'wb') as output:
    output.write(objects_serialized_compressed)
However, if objects is very large, this leads to a memory error, since, for some time, I simultaneously carry objects, objects_serialized, and objects_serialized_compressed around in their entirety.
Is there a way to do this chunk-wise? Presumably the first step, serializing the objects, has to be done in one go, but perhaps the compression and writing to file can be done chunk-wise?
After many attempts, I'd try this:
import brotli
import dill
import io
import pickle
# The following serialized object is 30kb
objects = ["234r234r234", "3f234f2343f3", "234ff234f234f234rf32"]*5000
objects_serialized = dill.dumps(objects, pickle.HIGHEST_PROTOCOL)
# Set up a buffer for reading chunks of serialized data
chunk_size = 1024 * 1024
buffer = io.BytesIO(objects_serialized)
# Create compressor for repeated use
compressor = brotli.Compressor(quality=1)
with open('output.brotli', 'wb') as output:
    # Read chunks from the buffer and compress them
    while True:
        chunk = buffer.read(chunk_size)
        if not chunk:
            break
        compressed_chunk = compressor.process(chunk)
        output.write(compressed_chunk)
    # Flush the remaining compressed data
    compressed_remainder = compressor.finish()
    # The resulting file is about 4 kB on my computer;
    # I decompressed, de-serialized, and retrieved the original object
    output.write(compressed_remainder)
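The round-trip mentioned in the comment can be verified with something along these lines (just a sketch using the one-shot brotli.decompress and dill.loads, so the full decompressed data is briefly held in memory for the check):

import brotli
import dill

with open('output.brotli', 'rb') as f:
    compressed = f.read()

# decompress, de-serialize, and compare against the original list
restored = dill.loads(brotli.decompress(compressed))
assert restored == objects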
This requires brotli 1.0.9, as provided by pip; it does not work with brotlipy, as provided by anaconda.
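As for the question's assumption that the serialization step has to be done in one go: dill.dump (like pickle.dump) only needs an object with a write() method, so a thin wrapper that feeds every written chunk through the streaming compressor would avoid holding objects_serialized in memory at all. A rough sketch, assuming the same brotli.Compressor API as above (CompressingWriter is a made-up helper, not part of brotli or dill):

import brotli
import dill
import pickle

class CompressingWriter:
    # Hypothetical helper: compress each chunk the pickler writes and append it to a file
    def __init__(self, fileobj, quality=1):
        self._fileobj = fileobj
        self._compressor = brotli.Compressor(quality=quality)

    def write(self, data):
        self._fileobj.write(self._compressor.process(data))
        return len(data)

    def close(self):
        # Flush whatever the compressor still buffers
        self._fileobj.write(self._compressor.finish())

with open('output.brotli', 'wb') as f:
    writer = CompressingWriter(f, quality=1)
    dill.dump(objects, writer, pickle.HIGHEST_PROTOCOL)
    writer.close()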