javascript python compression deflate lossless-compression

Using a preset deflate dictionary to reduce compressed archive file size


I have a requirement where text files are sent from one location to another. Both locations are under our control. The nature of the content, and the words that can appear in it, are mostly the same from file to file. This means that if I store the deflate dictionary at both locations once, there is no need to send it with each file.

I have been reading about this for the past week and experimenting with some available code, such as this and this.

However, I am still in the dark.

A few questions I still have:

  1. Can we generate and use a custom deflate dictionary from a preset list of words?
  2. Can we send a file without the deflate dictionary and use a local one?
  3. If not gzip, is there any other compression library that can be used for this purpose?

Some references I stumbled upon so far:

  1. https://medium.com/iecse-hashtag/huffman-coding-compression-basics-in-python-6653cdb4c476
  2. https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/
  3. https://www.euccas.me/zlib/#zlib_optimize_cloudflare_dict

Solution

  • Below are the specific answers I found, along with example code.

    1. Can we generate and use a custom deflate dictionary from a preset list of words?

    Yes, this can be done. A quick example in Python is shown below:

    import zlib
    
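    # Data to compress; this toy example reuses the same bytes as the dictionary
    hello = b'hello'
    
    # Compress with the dictionary (negative wbits selects raw deflate, with no header or checksum)
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
    compress_data = co.compress(hello) + co.flush()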
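To connect this more directly to the "preset list of words" in the question, here is a minimal sketch that joins a word list into a dictionary and compares compressed sizes with and without it. The word list, the sample message, and the helper names `build_dictionary` and `deflate` are made up for illustration. The zlib documentation suggests placing the strings expected to be most common at the end of the dictionary.

```python
import zlib

# Hypothetical preset vocabulary; strings expected to be most common
# should sit toward the end of the dictionary.
PRESET_WORDS = [b"timeout", b"server", b"connection", b"request", b"status"]

def build_dictionary(words):
    """Join the preset words into a single bytes object for use as zdict."""
    return b" ".join(words)

def deflate(payload, dictionary=None):
    """Raw-deflate payload, optionally priming the compressor with a dictionary."""
    if dictionary is None:
        co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    else:
        co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=dictionary)
    return co.compress(payload) + co.flush()

preset = build_dictionary(PRESET_WORDS)
message = b"connection timeout: server request status"

with_dict = deflate(message, preset)
without_dict = deflate(message)

# Sizes in bytes: original, compressed with dictionary, compressed without
print(len(message), len(with_dict), len(without_dict))
```

For short messages made mostly of dictionary words, the version compressed with the dictionary should come out noticeably smaller; the gain shrinks as files get longer, since deflate then finds repetitions inside the data itself.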
    

    2. Can we send a file without the deflate dictionary and use a local one?

    Yes, you can send just the compressed data without the dictionary. In the example above, the compressed data is in compress_data. However, to decompress it, the receiver needs the same zdict value that was passed during compression. Here is how it is decompressed:

    hello = b'hello'  # must be the same bytes that were passed as zdict when compressing
    do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
    data = do.decompress(compress_data)
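The two-location setup from the question can be sketched as a pair of functions that share one dictionary constant. This is only an illustration: `SHARED_DICT`, `pack`, and `unpack` are hypothetical names, and the dictionary content is invented.

```python
import zlib

# Hypothetical shared dictionary, deployed once at both locations.
SHARED_DICT = b"status pending approved rejected customer invoice order"

def pack(text_bytes):
    """Sender side: raw deflate primed with the shared dictionary."""
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=SHARED_DICT)
    return co.compress(text_bytes) + co.flush()

def unpack(blob):
    """Receiver side: must use byte-for-byte the same dictionary."""
    do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=SHARED_DICT)
    return do.decompress(blob) + do.flush()

original = b"customer order approved, invoice pending"
wire_bytes = pack(original)    # only these bytes travel between locations
restored = unpack(wire_bytes)
print(restored == original)
```

One caveat worth keeping in mind: a raw deflate stream (negative wbits) carries no record of which dictionary was used, so a mismatched dictionary on the receiving side may raise an error or may decode to wrong data. The zlib container format (positive wbits) stores an Adler-32 checksum of the dictionary in its header, which may be preferable if the two locations could ever get out of sync.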
    

    A full example, with and without the dictionary:

    import zlib
    
    #Data for compression
    hello = b'hello'
    
    #Compression with dictionary
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
    compress_data = co.compress(hello) + co.flush()
    
    #Compression without dictionary
    co_nodict = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    compress_data_nodict = co_nodict.compress(hello) + co_nodict.flush()
    
    #De-compression with dictionary
    do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
    data = do.decompress(compress_data)
    
    #print compressed output when dict used
    print(compress_data)
    
    #print compressed output when dict not used
    print(compress_data_nodict)
    
    #print decompressed output when dict used
    print(data)
    

    The code above does not work on Unicode strings directly, because zlib operates on bytes. For Unicode text you have to encode it first, for example:

    import zlib
    
    #Data for compression
    unicode_data = 'റെക്കോർഡ്'
    hello = unicode_data.encode('utf-16be')
    
    #Compression with dictionary
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
    compress_data = co.compress(hello) + co.flush()
    ...
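A complete round trip for Unicode text might look like the sketch below. The helper names `pack_text` and `unpack_text` are hypothetical, and the default `utf-8` encoding is an assumption; any encoding works (including `utf-16be`) as long as both sides use the same one for the data and for the dictionary.

```python
import zlib

def pack_text(text, dictionary_text, encoding="utf-8"):
    """Encode text and dictionary to bytes, then raw-deflate with the dictionary."""
    dictionary = dictionary_text.encode(encoding)
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=dictionary)
    return co.compress(text.encode(encoding)) + co.flush()

def unpack_text(blob, dictionary_text, encoding="utf-8"):
    """Inflate with the same dictionary, then decode the bytes back to str."""
    dictionary = dictionary_text.encode(encoding)
    do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=dictionary)
    return (do.decompress(blob) + do.flush()).decode(encoding)

unicode_data = 'റെക്കോർഡ്'

# As in the earlier toy examples, the data doubles as its own dictionary.
blob = pack_text(unicode_data, unicode_data)
print(unpack_text(blob, unicode_data) == unicode_data)
```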
    

    References for a JavaScript-based approach:

    1. How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?
    2. Compression of data with dictionary using zlib in node.js