Tags: compression, theory, filesize, floppy

Calculating size of a theoretical text file


I'm writing an article about the Census Bureau's population projections through 2060. The underlying data is a 3.3 MB .csv file when uncompressed.

The file contains 539,781 values of 5-7 digits each and totals 3,455,372 characters. When I gzip the file it comes down to 1,550,063 bytes, or 1.47 MB.
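
For reference, the comparison can be reproduced with a short Python sketch along these lines (the filename projections.csv is just a placeholder for the actual data file):

    # Measure the uncompressed and gzipped size of the file.
    # "projections.csv" is a placeholder name; gzip.compress at
    # level 9 should closely match what the gzip command line produces.
    import gzip

    with open("projections.csv", "rb") as f:
        data = f.read()

    compressed = gzip.compress(data, compresslevel=9)

    print(f"uncompressed: {len(data):,} bytes")
    print(f"gzipped:      {len(compressed):,} bytes")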

I want to be able to truthfully state that it would fit on a 3.5-inch floppy, whose maximum capacity is 1.44 MB (1,474,560 bytes). This is just a reference point for readers, not advice that needs accompanying instructions.

Is there a way to calculate the theoretical compressed size of a text file from the character count above? And if we actually had a 3.5-inch floppy and a drive for it, would it be possible to get this file onto the disk without information loss? Thanks!


Solution

  • No, it is not possible to estimate the compressed size of a file from its character count alone. Different strings compress with different degrees of efficiency: a string made purely of one repeated character will compress far more readily than a string of randomly generated characters (see the sketches at the end of this answer).

    In information theory, there is a concept called Kolmogorov complexity, which is (more or less) the length of the smallest description needed to reconstruct a string. Not all strings can be compressed into shorter strings, it is impossible to build a general algorithm that finds the Kolmogorov complexity of an arbitrary string, and once a string gets sufficiently long it is impossible to prove that you have found its optimal encoding. (A rough, computable stand-in is sketched at the end of this answer.)

    Hope this helps!
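
    To see how much the content matters, here is a minimal sketch (assuming Python and its standard library) that compresses two strings of identical length: one maximally repetitive, one random.

        # Two inputs with the same character count compress to very
        # different sizes: a run of one repeated byte versus random text.
        import random
        import string
        import zlib

        n = 1_000_000
        repetitive = b"a" * n
        random_text = "".join(
            random.choices(string.ascii_letters + string.digits, k=n)
        ).encode()

        print(len(zlib.compress(repetitive, 9)))   # on the order of a kilobyte
        print(len(zlib.compress(random_text, 9)))  # still hundreds of kilobytes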
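
    Kolmogorov complexity itself is not computable, but one rough, computable stand-in is the zero-order Shannon entropy of the file's character distribution. Note that this only bounds codes that treat each character independently; compressors like gzip also exploit repeated substrings and can do better, so treat it as an estimate, not a guarantee. (The filename projections.csv is again a placeholder.)

        # Zero-order Shannon entropy of the byte distribution: a rough
        # estimate, NOT the Kolmogorov complexity (which is uncomputable)
        # and not a hard lower bound for gzip, which also exploits
        # repeated substrings.
        import math
        from collections import Counter

        with open("projections.csv", "rb") as f:  # placeholder filename
            data = f.read()

        counts = Counter(data)
        total = len(data)
        entropy_bits_per_char = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )

        print(f"entropy: {entropy_bits_per_char:.3f} bits per character")
        print(f"implied size: {total * entropy_bits_per_char / 8:,.0f} bytes")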