python gzip fastq cmp file-comparison

Python 2.7 filecmp.cmp returns false even though the gzipped files are identical


I'm comparing a bunch of fastq.gz files. Each file is ~4 GB:

if filecmp.cmp(f1, f2, shallow=False):

It returns False, meaning f1 and f2 are different. But when I unzip the files and compare them with diff/comm, I get no output. I tried both shallow=True and shallow=False. I'm trying to print out the difference, but it runs out of memory:

diff=difflib.ndiff((gzip.open(f1)).readlines(),(gzip.open(f2)).readlines())
print [i for i in diff if i.startswith('+')]

Is it because the files are gzipped? Any ideas on how to compare them without unzipping them? (Each file is about 200M lines.)

Thank you!


Solution

  • In general, you need to compare the uncompressed output. That is the only way to definitively determine whether two gzip files have the same uncompressed contents. filecmp.cmp compares the raw compressed bytes, so it reports a difference whenever the compressed streams differ, even if the data inside is the same. The files could have been compressed with different compression levels or different gzip software, giving different compressed results. The only guarantee is that when you compress and then decompress, you get the original input. There is no guarantee whatsoever that decompressing and then recompressing gives you back the original compressed file.

    If you are in control of the gzip process, using the same code and the same compression levels and other options, you can still get different output due to the header contents. The headers may have different timestamps, different file names, or other variations. In that case you can skip the header of each file (using RFC 1952 as your guide to where the header ends), and then compare the remainder of each. Given the stated conditions, the remainders of the two files will then be identical.

    Another thing you can do, again if you are in control of the compression and you know that each gzip file consists of a single gzip member, is check the last eight bytes of each file. Those eight bytes are the four-byte CRC-32 of the uncompressed data followed by the length of the uncompressed data modulo 2^32. If they are not identical, then the uncompressed data is different. If they are the same, then the contents may be identical, so you would then need to decompress and compare, or use the method above. This saves a lot of time, because you almost never have to decompress gzip files whose uncompressed contents differ.
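
The approach above can be sketched roughly as follows. This is a minimal illustration, not code from the original answer: the names `read_trailer` and `gzip_contents_equal`, the `single_member` flag, and the 1 MiB chunk size are all assumptions made for the example. It decompresses both files in fixed-size chunks, so memory use stays constant regardless of file size, and it only uses the eight-byte trailer shortcut when the caller states that each file is a single gzip member. It is written for Python 3 and should also run on Python 2.7.

```python
import gzip
import struct

CHUNK = 1 << 20  # decompressed bytes to read per step (illustrative value)


def read_trailer(path):
    """Return (crc32, length mod 2**32) from the last eight bytes of a gzip file.

    Only meaningful when the file consists of a single gzip member.
    """
    with open(path, "rb") as fh:
        fh.seek(-8, 2)  # 2 means "relative to the end of the file"
        # Both trailer fields are little-endian unsigned 32-bit integers.
        return struct.unpack("<II", fh.read(8))


def gzip_contents_equal(path1, path2, single_member=False):
    """Compare the uncompressed contents of two gzip files chunk by chunk."""
    # Quick rejection: a differing CRC or length proves the contents differ.
    # Skipped unless the caller asserts that each file is a single member.
    if single_member and read_trailer(path1) != read_trailer(path2):
        return False

    f1 = gzip.open(path1, "rb")
    f2 = gzip.open(path2, "rb")
    try:
        while True:
            b1 = f1.read(CHUNK)
            b2 = f2.read(CHUNK)
            if b1 != b2:
                return False
            if not b1:  # both streams are exhausted at the same point
                return True
    finally:
        f1.close()
        f2.close()
```

With this sketch, `gzip_contents_equal(f1, f2, single_member=True)` would take the place of the `filecmp.cmp` call in the question, provided the single-member assumption actually holds for the files being compared.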