c++ zlib

Zlib crc function is inconsistent when reading files in larger blocks


I'm using the crc function in zlib to calculate the crc of a large file. I'm mapping the file in 1GB chunks to pass to the crc function, and I noticed that if I change the size of the chunks to 5GB or larger, the crc value returned is no longer the same as the crc value calculated with smaller chunks (1GB - 4GB all return the same crc). Also, for 5GB chunk sizes and larger, the crc calculation is much faster: 3 seconds on the 14GB file vs 22 seconds when using 1-4GB chunks.

Does anyone know why this happens? The manual https://zlib.net/manual.html doesn't seem to mention this type of behaviour for the crc functions.

uLong crc = crc32(0L, Z_NULL, 0);
size_t offset = 0;
size_t page_size = sysconf(_SC_PAGE_SIZE);
const size_t num_maps = 1000000000 / page_size;
const size_t MAX_MAP_SIZE = num_maps * page_size;  // map 1GB to mmap at a time

size_t file_size = fs::file_size(file);
size_t map_size = 0;

while (offset < file_size)
{
    if (file_size - offset > MAX_MAP_SIZE)
        map_size = MAX_MAP_SIZE;
    else
        map_size = file_size - offset;

    void* buf = mmap(NULL, map_size, PROT_READ, MAP_PRIVATE, fd, offset);
    crc = crc32(crc, (unsigned char *)buf, map_size);
    munmap(buf, map_size);
   
    offset += map_size;
}

Solution

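  • The types explain it. zlib's crc32() takes its length as a uInt, which is normally a 32-bit unsigned, so a single call can process at most 4 GiB - 1 bytes. Your size_t map_size is silently truncated modulo 2^32 when it is passed in, so with chunks of 5GB or more only part of each mapping is actually checksummed. That is why the CRC differs and why the run finishes so much sooner.

    You can instead use zlib's crc32_z(), which takes a size_t length: crc = crc32_z(crc, (unsigned char *)buf, map_size);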