python c# gzip

Compressing with GZip in C# and decompressing in Python fails


I have a flow where some data (like an image/video) gets compressed using GZip like this:

await using var outputStream = new MemoryStream();
await using var compressionStream = new GZipStream(outputStream, CompressionMode.Compress);

await compressionStream.WriteAsync(payload);
await compressionStream.FlushAsync();

outputStream.Position = 0;
return outputStream.ToArray();

The above code is not from my team but it can be changed if needed.

If I get the output into a base64 string and test decompressing it with this simple code, it works perfectly:

var bytes = Convert.FromBase64String("H4sIAAAAAAAACirOz01VKEmtKAEAAAD//w=="); // "some text"
using var ms = new MemoryStream(bytes);
using var ds = new GZipStream(ms, CompressionMode.Decompress);
using var output = new MemoryStream();
ds.CopyTo(output);
ds.Flush();

var result = output.ToArray();

However, my requirement is to receive the compressed payload in a Python script and decompress it before passing it to another system. I'm not at all familiar with Python, so I made this very simple script:

import base64
import gzip

encodedBase64 = "H4sIAAAAAAAACirOz01VKEmtKAEAAAD//w=="
decodedBytes = base64.standard_b64decode(encodedBase64)
decompressedBytes = gzip.decompress(decodedBytes)

The above fails with: EOFError: Compressed file ended before the end-of-stream marker was reached

I have of course done research and found posts like this Q&A, but nothing has helped. For example, using that answer fails with gzip.BadGzipFile: Not a gzipped file (b'\x00\x00'). Other attempts have yielded different gzip errors.


Solution

  • The GZipStream needs to be disposed before you read its output. The gzip format ends with a footer (a CRC-32 checksum and the uncompressed length), and this is written by Dispose() (and also by Close()) but not by Flush(). I guess this makes sense: Flush() may be called multiple times during the encoding process, so it would be the wrong place to add a footer.

    I rewrote your code to dispose objects at the appropriate points, and also got rid of the async since you're dealing with purely synchronous operations:

    public static string Encode()
    {
        var payload = Encoding.ASCII.GetBytes("some text");
        using (var outputStream = new MemoryStream())
        {
            using (var compressionStream = new GZipStream(outputStream, CompressionMode.Compress))
            {
                compressionStream.Write(payload);
            }
            var result = outputStream.ToArray();
            return Convert.ToBase64String(result);
        }
    }
    

    This produces the output

    H4sIAAAAAAAAAyvOz01VKEmtKAEAur26TwkAAAA=
    

    This is slightly longer than the output you saw, which suggests that it contains the footer Python is expecting. It still decodes to the same result in .NET.

    It's interesting that .NET's GZipStream tolerates a missing footer, while Python's gzip module does not.
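
A minimal sketch of checking both payloads from the Python side, assuming the two base64 strings shown above. The helper names decompress_strict and decompress_tolerant are made up for illustration. The tolerant variant uses zlib.decompressobj, which returns whatever it managed to decode and does not insist on the footer, so treat it as a possible fallback if the C# side cannot be changed, not as a replacement for disposing the stream.

```python
import base64
import gzip
import zlib

# Payload from the question: flushed but never disposed, so the end of the
# stream (including the gzip footer) is missing.
FLUSHED_ONLY = "H4sIAAAAAAAACirOz01VKEmtKAEAAAD//w=="
# Payload produced after the GZipStream was disposed.
DISPOSED = "H4sIAAAAAAAAAyvOz01VKEmtKAEAur26TwkAAAA="


def decompress_strict(encoded: str) -> bytes:
    """Decode base64, then decompress; raises EOFError on a truncated stream."""
    return gzip.decompress(base64.standard_b64decode(encoded))


def decompress_tolerant(encoded: str) -> bytes:
    """Hypothetical fallback: return the data decoded so far, footer or not."""
    raw = base64.standard_b64decode(encoded)
    # MAX_WBITS | 16 tells zlib to expect a gzip header rather than a zlib one.
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return decoder.decompress(raw)


if __name__ == "__main__":
    print(decompress_strict(DISPOSED))
    try:
        decompress_strict(FLUSHED_ONLY)
    except EOFError as exc:
        print("strict decode failed:", exc)
    print(decompress_tolerant(FLUSHED_ONLY))
```

When the footer is missing, the tolerant variant has no checksum to verify, so a payload that was cut short for some other reason would go unnoticed. Disposing the GZipStream on the C# side remains the safer route.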