c gzip zlib inflate

zlib: uncompressing large file leads to "invalid code lengths set" error


I am trying to use zlib to detect the end of a compressed gz data stream.

I do not need the uncompressed contents. My goal is to get pointers to the beginning and end of a stream. My code works on small files, but fails on large files. I tried allocating more memory for outbuf, with no success. The code is mostly copied and pasted from the zlib examples. What is wrong?

// fopen(filename,"rb")...fread(inbuf, inlen, 1, fd);

int gzip_dctest(unsigned char *inbuf, unsigned int inlen) {

    unsigned int outlen = 262144;
    unsigned char *outbuf = malloc(outlen);
    int ret = 0;

    z_stream infstream;

    /* allocate inflate state */
    infstream.zalloc = Z_NULL;
    infstream.zfree = Z_NULL;
    infstream.opaque = Z_NULL;
    infstream.avail_in = 0;
    infstream.next_in = Z_NULL;

    ret = inflateInit2(&infstream, MAX_WBITS | 32); // gzip/zlib header autodetect
    if (ret != Z_OK) {
        fprintf(stderr, "gzip_test: init fail (%d: %s)\n", ret, infstream.msg);
        return ret;
    }
    
    unsigned int ptr = 0;
    /* decompress until deflate stream ends or end of file */
    do {
        infstream.next_in = inbuf + ptr;
        infstream.avail_in = inlen - ptr;

        if (infstream.avail_in > (outlen / 8)) infstream.avail_in = (outlen / 8); // get chunk size
        if (infstream.avail_in == 0)
            break;

        /* run inflate() on input until output buffer not full */
        do {
            ptr += infstream.avail_in;

            infstream.avail_out = outlen;
            infstream.next_out = outbuf;

            ret = inflate(&infstream, Z_NO_FLUSH);
            if (ret < 0) {
                    fprintf(stderr, "gzip_test: inflate fail at %u (%d: %s)\n", ptr - infstream.avail_in, ret, infstream.msg);
                    return ret;
            }

        } while (infstream.avail_out == 0);
    
        /* done when inflate() says it's done */
    } while (ret != Z_STREAM_END);

    inflateEnd(&infstream);
    return ptr - infstream.avail_in;
}

Example output with a problem file (403709952 uncompressed, 99152355 compressed size):

gzip_test: inflate fail at 2374700 (-3: invalid code lengths set)

"gzip -d" on this file gives no error:

1.cpio.gz:       75.4% -- replaced with 1.cpio

If I compress it again (gzip command), I get another error in my code:

gzip_test: inflate fail at 2381863 (-3: invalid block type)

I expect the code to work for any file size.


Solution

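  • Your ptr += infstream.avail_in; needs to be moved outside of the inner do loop. Then it works fine. As written, every extra pass through the inner loop (taken whenever the output buffer fills) adds the remaining avail_in to ptr again, so ptr runs ahead of the data actually consumed and the next outer iteration skips input.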

    There is no need to throttle avail_in based on the amount of output space. Your inner do loop will just keep going until avail_in is consumed.

    For this to work on archives larger than 4GB, you'd need to use size_t instead of unsigned int for your inlen and ptr, and take care to set avail_in to a value that fits in an unsigned int. I would recommend something much simpler with no ptr, like:

        infstream.next_in = inbuf;
        do {
            infstream.avail_in = inlen > UINT_MAX ? UINT_MAX : inlen;
            inlen -= infstream.avail_in;
            do {
            ...
        return infstream.next_in - inbuf;
    

    Note that next_in is updated by inflate() for you.

    Better still would be to not load the entire .gz file into memory, but rather to read a small buffer at a time and inflate as you go. Then keep track of the number of consumed bytes in a size_t or uintmax_t total.

    You also need to add a free(outbuf); before you return, both on error and on success.

    Note that this will not detect the end of a gzip stream. It will detect the end of a gzip member. A gzip stream can contain multiple members. You would need to loop on the whole thing until you reach the end of the input or encounter an error, with the latter indicating some non-gzip data after the end of the gzip stream.