c++algorithmcompressionhuffman-codelossless-compression

Compute the compression ratio for Huffman algorithm


I used this code HuffmanCodeto compress this text

AndwhenthishappenswhenweallowfreedomringwhenweletitringfromeveryvillageandeveryhamletfromeverystateandeverycitywewillbeabletospeedupthatdaywhenallofGod'schildrenblackmenandwhitemenJewsandGentilesProtestantsandCatholicswillbeabletojoinhandsandsinginthewordsoftheoldNegrospiritual

When I save this text in txt file, the size of the file was 1K (original file). Then, I saved the output of the Huffman algorithm that represent the compressed data in another txt file(I copied the output from the command prompt and then pasted it into a file), I found the size of encoded data in the file was 3K!

This is the compressed data that should be used in decode_file() to get on the original data.

0110001111100101111110001100111010100001011100100001110111000111000110011100010111110001100111011111100110110111011010011111111100001110010001010100111101001101111110110000111110001100111011111100101110010100111101000110111111011000011110000110100111101100110011100001100000110011011110111011110111000010011011110010110011001110000110000000011101111101101110010101111000011010011110110011001110000110000000101010110110101001101111001011001100111000011000000000110111101000000111111001111101111011101111001010011011100101011100101001000010110001100100010101100001100011010000111011010010111010000011111000110011101101101110110100111100000010001000101011000100010000011000101111011010100111001110110010101111010000110110011111110110011101101111001011111100010111101010011110110011100110010110011111001011011110010100001001001110101001111011100001000001011001101001010100001010101101111010100010110111100101000010101101101000010100101101110000110010111110111101110110110111001010001101110111001010111000110110100100011010110011001000111111001101000111011110010100100110111011110010101101001001111110110000011111101010000110011111010000110101001001001111001010000110001001011010101100100100110000001101000010110001011100110111101001100001101101101101

How do I deal with the compressed data correctly in order to get a size smaller than the original file size? to compute the

CR = original data / compressed data

Because I am trying to calculate the compression ratio, but it is illogical that the size of the compressed data is greater than the size of the original data. Any advice,please?

Update:

To compute the compression ratio:

  1. The size of input: if the input was:
    int input[]={1,2,3,4,1,2,3,1,2,3,4,5,2};
    int inputsize = 13 * sizeof(int);
  1. The size of output:
    symbol :1 freq : 3
    symbol :2 freq : 4
    symbol :3 freq : 3 
    symbol :4 freq : 2 
    symbol :5 freq : 1 
    (1 * 3) + (2 * 4) + (3 * 4) + (4 * 2)+(5*1)
    3+ 8 + 12 + 8 + 5 = 36 bits/8 = ~5 bytes.
    int outputsize = 5;
  1. CRatio = inputsize /outputsize

Would this method be correct?


Solution

  • Your input is 278 characters, and your output is 1246 bits. If you assume each character takes eight bits, then you compressed the input to 56% of its original size. Seems about right. (Though see note below.)

    If you are measuring your output as one byte for each 0 and each 1, then indeed it will appear to have expanded. But only because you are storing the result incorrectly. You need to store one bit per bit (eight bits in each byte), not one bit per byte.

    Though the output length looks about right for Huffman coding, it does not have everything a decoder will need to decode the bits. To get a real measure of the compression, you need to include a description of the Huffman code as well as the encoded data. That could take on the order of another 300 bits.