I'm using the crc function in zlib to calculate the crc of a large file. I'm mapping the file in 1GB chunks to pass to the crc function, and I noticed that if I change the size of the chunks to 5GB or larger, the crc value returned is no longer the same as the crc value calculated with smaller chunks (1GB - 4GB all return the same crc). Also, for 5GB chunk sizes and larger, the crc calculation is much faster: 3 seconds on the 14GB file vs 22 seconds when using 1-4GB chunks.
Does anyone know why this happens? The manual https://zlib.net/manual.html doesn't seem to mention this type of behaviour for the crc functions.
uLong crc = crc32(0L, Z_NULL, 0);
size_t offset = 0;
size_t page_size = sysconf(_SC_PAGE_SIZE);
const size_t num_maps = 1000000000 / page_size;
const size_t MAX_MAP_SIZE = num_maps * page_size; // map 1GB to mmap at a time
size_t file_size = fs::file_size(file);
size_t map_size = 0;
while (offset < file_size)
{
    if (file_size - offset > MAX_MAP_SIZE)
        map_size = MAX_MAP_SIZE;
    else
        map_size = file_size - offset;
    void* buf = mmap(NULL, map_size, PROT_READ, MAP_PRIVATE, fd, offset);
    crc = crc32(crc, (unsigned char *)buf, map_size);
    munmap(buf, map_size);
    offset += map_size;
}
This is apparent if you simply look at the types. zlib's crc32() function takes an unsigned length. That can only support up to 4GiB-1 for the usual 32-bit unsigned. You can instead use zlib's crc32_z(), which takes a size_t length.
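To see why the 5GB case is both wrong and fast, here is a minimal sketch of the truncation, assuming a 32-bit unsigned length and a 64-bit size_t, as on typical 64-bit platforms. The helper names truncated_length and bytes_skipped are hypothetical and only illustrate the arithmetic; they are not part of zlib.

```cpp
#include <cstdint>

// Hypothetical helper: models passing a 64-bit byte count to a parameter of
// 32-bit unsigned type (an assumption about the platform). The conversion
// keeps only the low 32 bits, i.e. the value modulo 2^32.
constexpr std::uint32_t truncated_length(std::uint64_t requested) {
    return static_cast<std::uint32_t>(requested);
}

// Hypothetical helper: how many bytes of the buffer are never checksummed
// in a single call because of that truncation.
constexpr std::uint64_t bytes_skipped(std::uint64_t requested) {
    return requested - truncated_length(requested);
}
```

Under those assumptions, truncated_length(5000000000) is 705032704, so each 5GB call would checksum only about 0.7GB of the mapping and skip the rest. That would account for both the different CRC and the shorter run time. Lengths up to 4GB (4000000000) fit in 32 bits and pass through unchanged. In the loop from the question, the corresponding change would be to call crc32_z(crc, (unsigned char *)buf, map_size) in place of crc32().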