Tags: python, python-3.x, gzip, zlib, transfer-encoding

Content-Encoding: gzip + Transfer-Encoding: chunked with gzip/zlib gives incorrect header check


How do you manage chunked data with gzip encoding? I have a server which sends data in the following manner:

HTTP/1.1 200 OK\r\n
...
Transfer-Encoding: chunked\r\n
Content-Encoding: gzip\r\n
\r\n
1f50\r\n\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec}\xebr\xdb\xb8\xd2\xe0\xef\xb8\xea\xbc\x03\xa2\xcc\x17\xd9\xc7\xba\xfa\x1e\xc9r*\x93\xcbL\xf6\xcc\x9c\xcc7\xf1\x9c\xf9\xb6r\xb2.H ... L\x9aFs\xe7d\xe3\xff\x01\x00\x00\xff\xff\x03\x00H\x9c\xf6\xe93\x00\x01\x00\r\n0\r\n\r\n

I've tried a few different approaches to this, but there's something I'm forgetting here.

data = b''
depleted = False
while not depleted:
    depleted = True
    for fd, event in poller.poll(2.0):
        depleted = False
        if event == select.EPOLLIN:
            tmp = sock.recv(8192)
            data += zlib.decompress(tmp, 15 + 32)

This gives (I also tried decompressing only the data after \r\n\r\n, obviously):
zlib.error: Error -3 while decompressing data: incorrect header check

So I figured the data should be decompressed once it has been received in full:

        ...
        if event == select.EPOLLIN:
            data += sock.recv(8192)
data = zlib.decompress(data.split(b'\r\n\r\n',1)[1], 15 + 32)

Same error. I also tried decompressing data[:-7] because of the chunk terminator at the very end of the data, as well as data[2:-7] and various other combinations, but with the same error.

I've also tried the gzip module via:

with gzip.GzipFile(fileobj=BytesIO(data), mode='rb') as fh:
    fh.read()

But that gives me "Not a gzipped file".

Even after recording the data as received from the server (headers + data) into a file, and then creating a server socket on port 80 serving that data (again, as is) to the browser, it renders perfectly, so the data is intact. I took this data, stripped off the headers (and nothing else), and tried gzip on the file: [screenshot of the gzip attempt omitted]

Thanks to @mark-adler I produced the following code to un-chunk the chunked data:

unchunked = b''
pos = 0
while pos <= len(data):
    chunkLen = int(binascii.hexlify(data[pos:pos+2]), 16)
    unchunked += data[pos+2:pos+2+chunkLen]
    pos += 2+len('\r\n')+chunkLen

with gzip.GzipFile(fileobj=BytesIO(data[:-7])) as fh:
    data = fh.read()

This produces OSError: CRC check failed 0x70a18ee9 != 0x5666e236, which is one step closer. In short, I clip the data according to these four parts: [image of the four segments omitted]

I'm probably getting there, but not close enough.

Footnote: Yes, the socket handling is far from optimal, but it looks this way because I thought I wasn't getting all the data from the socket, so I implemented a huge timeout and an attempt at a fail-safe with depleted :)


Solution

  • You can't split on \r\n, since the compressed data may contain, and if long enough certainly will contain, that sequence. You need to dechunk first using the length provided (e.g. the first length, 1f50) and feed the resulting chunks to decompress. The compressed data starts with the \x1f\x8b.

    The chunking is hex number, crlf, chunk with that many bytes, crlf, hex number, crlf, chunk, crlf, ..., last chunk (of zero length), [possibly some headers], crlf.
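
For illustration, here is a minimal sketch of that approach, assuming the complete raw response (headers plus chunked body) is already in a bytes object raw_response; the name raw_response and the dechunk helper are made up for this example:

import zlib

def dechunk(body):
    # Reassemble the payload of a chunked transfer-encoded body.
    unchunked = b''
    pos = 0
    while True:
        # The chunk size is a variable-length hex string terminated by CRLF
        # (chunk extensions after a ';' are ignored here).
        eol = body.index(b'\r\n', pos)
        size = int(body[pos:eol].split(b';')[0], 16)
        if size == 0:                      # zero-length chunk marks the end
            break
        start = eol + 2
        unchunked += body[start:start + size]
        pos = start + size + 2             # skip the CRLF after the chunk data
    return unchunked

headers, _, body = raw_response.partition(b'\r\n\r\n')
gzipped = dechunk(body)
data = zlib.decompress(gzipped, 15 + 32)   # wbits=15+32 lets zlib detect the gzip header automatically

If you want to decompress while the data is still arriving, zlib.decompressobj(15 + 32) can be fed each dechunked piece incrementally instead of waiting for the whole body.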