c++huffman-codehuffman-tree

How to store string generated from Huffman tree to a text file


I am trying to implement Huffman Coding in C++ for text file compression. I am able to build huffman tree from the frequencies of each character in the file. When I try to traverse the tree and get huffman codes for different characters, I am storing the huffman codes as string, so the output string is getting bigger than the input string.

unordered_map<char, string> encoding;

void store_huffman_codes(Node* root, string s){
    if(root == NULL) return;
    if(root->val != '$') encoding[root->val] = s;
    store_huffman_codes(root->left, s + '0');
    store_huffman_codes(root->right, s + '1');
}

unordered_map<char, int> m;
for(char c : test) m[c]++;
priority_queue<pair<int, Node*>, vector<pair<int, Node*>>, greater<pair<int, Node*>>> pq;
for(auto x : m){
    Node* temp = new Node(x.first);
    pq.push({x.second, temp});
}
while(pq.size() > 1){
    pair<int, Node*> a = pq.top(); pq.pop();
    pair<int, Node*> b = pq.top(); pq.pop();
    Node* temp = new Node('$');
    int val = a.first + b.first;
    temp->left = a.second; temp->right = b.second;
    pq.push({val, temp});
}
Node* root = pq.top().second;
store_huffman_codes(root, "");

string output = "";
for(char c : test){
    output += encoding[c];
}

How to store the codes in binary rather than string?


Solution

  • The point here is that you use entire bytes for storing just one single bit.

    Instead you should compress multiple bits into one single byte; there arises a question with, though: How to handle the unused bits you cannot fill for not having sufficient data (i.e. data length not being a multiple of byte size in bits)?

    You could do something similar to utf-8 for encoding multi-byte sequences: The number of leading one bits in a byte indicates the number of unused bits. Advantage: All information required for encoding is stored in one single byte. Disadvantage: You only can use 7 bits to encode in all bytes preceding the last one – which probably over-weighs the advantage.

    Alternatively you store the number of used or unused bits in a separate byte; my recommendation: Number of unused bits in the very first data byte and skipping the unused bytes right at the beginning (i.e. least significant bits in second byte of the output), which might then look as follows:

    uint8_t byte = (8 - numberOfNodes % 8) % 8;
    // assuming you tracked...
    // second modulo maps 8 to 0, if that occurs
    
    // output byte to file (number of unused bits)!
    
    unsigned index = byte;
    byte = 0;
    
    auto append = [&encoded, &byte, &index](uint8_t bit)
    {
        byte |= bit << index;
        if(++index == 8)
        {
            encoded.push_back(byte);
            index = 0;
            byte = 0;
        }
    }
    
    // replace s + 'X' by append(X)
    
    

    At this point you'll notice that, additionally to the already encoded data, you need to forward byte and index from one recursive call to the next as well; doing so by parameters appears unhandy to me, though, instead I recommend writing a dedicated class for the entire process:

    class Encoder
    {
    public:
        // suitable interface allowing to add new bytes
    
        // part of the public interface is a member function to trigger encoding,
        // e.g. one of:
        std::vector<uint8_t> const& encode();
        bool /* or some kind of error code */
        writeToFile(std::string path);
    
    private:
        Node* root; // implementation detail, should not be accessible from outside!
    
        std::vector<uint8_t> encoded;
        // preferably instead of std::string – you're not storing text, but binary data!
    
        uint8_t byte;
        unsigned int index;
    
        // now instead of the lambda above I'd actually prefer a member function:
        void append(uint8_t bit)
        {
            // ...
        }
    };
    

    encode would now calculate and append the first byte indicating the number of unused bits and initialise byte and index appropriately as shown before, then start iterating recursively over the nodes, beginning with root, just as you did yourself, too – with the minimal change applied as indicated above.

    With this, decoding gets just as simple: Read this initial byte, initialise some index to this number and start iterating the further bytes, for each one getting the bit by (byte & 1u << index++) != 0 or alternatively by uint8_t bit = byte & 1u; ++index; byte >>= 1; (though building the tree top down might not be the most efficient variant, but at least it's rather easy to implement).