I have downloaded 1000 Genomes .vcf files from the 1000 Genomes website using:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502//*.gz
I tried to use gzip to decompress these files, but they expanded to a much larger size than I expected. For instance, the first file (for chromosome 1) was 1.1 GB compressed but expanded to 65.78 GB.
Thinking it might be an issue with gzip, I tried two other methods: one was to run the annotation tool snpEff directly on the .gz file, and the other was to use zcat to decompress the file. However, in both cases the resulting file sizes were similarly huge.
I am assuming this cannot be right, but do not know why this is the case. Has anyone experienced anything similar?
I checked out the chromosome 1 file and it's fine; I presume all the rest are as well. Yes, data that redundant can compress that much. It's only compressed about 60:1 (65.78 GB / 1.1 GB ≈ 60), whereas gzip is capable of compressing as much as 1032:1.
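If you want to check the ratio yourself without writing the expanded file to disk, something along these lines should do it (chr1.vcf.gz stands in for the actual chromosome 1 filename from the release):

zcat chr1.vcf.gz | wc -c   # uncompressed byte count, streamed, nothing hits the disk
ls -l chr1.vcf.gz          # compressed byte count; divide the two for the ratio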
The stream is broken up into individually gzipped pieces of 64 KB of uncompressed data each for the purpose of indexing. (The associated .tbi files contain the locations of each piece in the big gzip file.) Had they just compressed it as a single stream, or with index points a good bit farther apart, it would have compressed about 68:1.
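That block structure is the BGZF flavor of gzip produced by bgzip, and the .tbi files are tabix indexes, which is why you rarely need the fully decompressed file at all. With the matching .tbi sitting next to the .gz (you can rebuild one with tabix -p vcf if you didn't download it), something like this pulls a single region straight out of the compressed file (region and filenames are just examples):

tabix -h chr1.vcf.gz 1:1000000-2000000 > region.vcf   # -h keeps the VCF header

And as you found, tools like snpEff read the .vcf.gz directly, so there's no need to ever materialize the 65 GB uncompressed file on disk.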