python gzip seek

Do failures seeking backwards in a gzip.GzipFile mean it's broken?


I have files with a small header (8 bytes, say zrxxxxxx) followed by a gzipped stream of data. Reading such files works fine most of the time. However, in very specific cases, seeking backwards fails. Here is a simple way to reproduce it:

from gzip import GzipFile

f = open('test.bin', 'rb')
f.read(8)  # Read zrxxxxxx

h = GzipFile(fileobj=f, mode='rb')
h.seek(8192)
h.seek(8191)  # gzip.BadGzipFile: Not a gzipped file (b'zr')

Unfortunately I cannot share my file, but it looks like any similar file will do.
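
For anyone who wants to try this without the original file, a stand-in can be generated. This is only a sketch: the header bytes come from the description above, while `make_test_file`, the payload pattern, and the temporary path are made up for illustration.

```python
import gzip
import os
import tempfile

HEADER = b"zrxxxxxx"  # 8-byte placeholder header, as described above


def make_test_file(path, payload_size=64 * 1024):
    """Write HEADER followed by a gzip stream; return the uncompressed payload."""
    # Arbitrary, non-constant payload so that offsets can be told apart.
    payload = bytes(i % 251 for i in range(payload_size))
    with open(path, "wb") as out:
        out.write(HEADER)
        out.write(gzip.compress(payload))
    return payload


# Written to a temporary directory here; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.bin")
payload = make_test_file(path)
```

Pointing the reproduction at `path` (or calling `make_test_file('test.bin')`) should give a file with the same layout as mine.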

Debugging the situation, I noticed that DecompressReader.seek (in Lib/_compression.py) sometimes rewinds the original file, which I suspect causes the issue:

#...
# Rewind the file to the beginning of the data stream.
def _rewind(self):
    self._fp.seek(0)
    #...

def seek(self, offset, whence=io.SEEK_SET):
    #...
    # Make it so that offset is the number of bytes to skip forward.
    if offset < self._pos:
        self._rewind()
    else:
        offset -= self._pos
    #...
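
What that `seek(0)` lands on can be checked without involving GzipFile at all. The snippet below is a minimal sketch using an in-memory stand-in for the file; the payload and the variable names are assumptions for illustration.

```python
import gzip
import io

# In-memory stand-in for the file: 8-byte header, then a gzip stream.
f = io.BytesIO(b"zrxxxxxx" + gzip.compress(b"payload " * 1024))

f.read(8)
stream_start = f.tell()  # where the gzip stream actually begins

f.seek(0)
at_zero = f.read(2)      # what a rewind to offset 0 sees

f.seek(stream_start)
at_stream = f.read(2)    # what a rewind to the stream start would see

print(stream_start, at_zero, at_stream)
```

Here `at_zero` is `b'zr'`, the same bytes quoted in the error message, while `at_stream` is the gzip magic number `b'\x1f\x8b'`.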

Is this a bug? Or is it me doing it wrong?

Any simple workaround?


Solution

  • This looks like a bug in Python. When you ask GzipFile to seek backwards, it has to go all the way back to the start of the gzip stream and decompress from there. However, the library does not record the offset of the file object it was given, so instead of rewinding to the start of the gzip stream, it rewinds to the start of the file. There it finds your header (b'zr') where it expects a gzip header, hence the error.

    As for a workaround, you would need to give GzipFile a custom file object with a replaced seek() operation, such that seek(0) goes to the right place. This seemed to work:

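    from gzip import GzipFile

    class shift:
        """Wrap a file so that offset 0 maps to its position at wrap time."""
        def __init__(self, f):
            self.f = f
            self.to = f.tell()  # start of the gzip stream in the real file
        def seek(self, offset):
            # The rewind calls seek(0), an absolute offset, so shift it.
            return self.f.seek(self.to + offset)
        def read(self, size=-1):
            return self.f.read(size)

    f = open('test.bin', 'rb')
    f.read(8)  # Read zrxxxxxx
    s = shift(f)
    h = GzipFile(fileobj=s, mode='rb')
    h.seek(8192)
    h.seek(8191)  # seemed to work: no BadGzipFile here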

    (I don't really know Python, so I'm sure there's a better way. I tried to subclass file so that I would only need to intercept seek(), but it turns out file is not actually a class in Python 3.)
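
A somewhat more complete take on the same idea is to subclass io.RawIOBase instead of writing a bare class, so the wrapper also answers tell(), seekable() and readable(). Treat this as a sketch rather than a fix verified on every Python version: `OffsetFile` and its attribute names are hypothetical, and it assumes the wrapped file is seekable and in blocking mode.

```python
import gzip
import io


class OffsetFile(io.RawIOBase):
    """Present `raw` as if it began at its current position (hypothetical helper)."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self._start = raw.tell()  # offset of the gzip stream in the real file

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, buf):
        # RawIOBase.read() is built on readinto(), so this is all reading needs.
        data = self._raw.read(len(buf))
        buf[:len(data)] = data
        return len(data)

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            offset += self._start  # shift absolute offsets past the header
        return self._raw.seek(offset, whence) - self._start

    def tell(self):
        return self._raw.tell() - self._start


# In-memory stand-in for test.bin: 8-byte header, then a gzip stream.
payload = bytes(i % 251 for i in range(64 * 1024))
f = io.BytesIO(b"zrxxxxxx" + gzip.compress(payload))
f.read(8)  # skip the header

h = gzip.GzipFile(fileobj=OffsetFile(f), mode="rb")
h.seek(8192)
h.seek(8191)  # the backwards seek now rewinds to the stream start
byte_after_seek = h.read(1)
```

If the file is small enough to hold in memory, reading the rest of it after the header and passing `io.BytesIO(f.read())` to GzipFile would likely sidestep the problem as well, since offset 0 of that buffer is the start of the gzip stream.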