Tags: java, compression, byte, huffman-code, extended-ascii

Java - How to handle special characters when compressing bytes (Huffman encoding)?


I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all the bytes in the file and put them into a byte array. After that I create an int[] array that stores the frequency of each byte (with the index being the ASCII code).

It does include the extended ASCII table, since the size of the int array is 256. However, I encounter issues as soon as I read a special character in my file (i.e. characters with an ASCII value higher than 127). I understand that a byte is signed and will wrap around to a negative value as soon as it crosses 127 (and an array index obviously can't be negative), so I tried to counter this by converting it to an unsigned value when I specify my index for the array (array[myByte & 0xFF]).
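
To illustrate the wrap-around I mean (this is just a standalone snippet, not part of my program):

public class ByteMaskDemo {
    public static void main(String[] args) {
        byte b = (byte) 200;            // the raw byte 0xC8
        System.out.println(b);          // prints -56 (sign-extended)
        System.out.println(b & 0xFF);   // prints 200 (usable as an array index)
    }
}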

This kind of worked, but it gave me the wrong ASCII value (for example, if the correct ASCII value for the character is 134, I instead got 191 or something). The even more annoying part is that I noticed that special characters are split into 2 separate bytes, which I feel will cause problems later (for example when I try to decompress).

How do I make my program compatible with every single type of character? (This program is supposed to be able to compress/decompress pictures, mp3s, etc.)

Maybe I am taking the wrong approach to this, but I don't know what the right approach is. Please give me some tips for structuring this.

Tree:

package CompPck;

import java.util.TreeMap;

abstract class Tree implements Comparable<Tree> {
    public final int frequency; // the frequency of this tree
    public Tree(int freq) { frequency = freq; }

    // compares on the frequency
    @Override
    public int compareTo(Tree tree) {
        return Integer.compare(frequency, tree.frequency); // safe against int overflow
    }
}

class Leaf extends Tree {
    public final int value; // the character this leaf represents

    public Leaf(int freq, int val) {
        super(freq);
        value = val;
    }
}

class Node extends Tree {
    public final Tree left, right; // subtrees

    public Node(Tree l, Tree r) {
        super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}

Build tree method:

public static Tree buildTree(int[] charFreqs) {
    PriorityQueue<Tree> trees = new PriorityQueue<Tree>();

    // create a leaf for every byte value that actually occurs
    for (int i = 0; i < charFreqs.length; i++) {
        if (charFreqs[i] > 0) {
            trees.offer(new Leaf(charFreqs[i], i));
        }
    }

    //assert trees.size() > 0;

    // repeatedly merge the two least frequent trees
    while (trees.size() > 1) {
        Tree a = trees.poll();
        Tree b = trees.poll();

        trees.offer(new Node(a, b));
    }
    return trees.poll();
}

Compression method:

public static void compress(File file) {
    try {
        Path path = Paths.get(file.getAbsolutePath());
        byte[] content = Files.readAllBytes(path);
        TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
        File nF = new File(file.getName() + "_comp");
        nF.createNewFile();
        BitFileWriter bfw = new BitFileWriter(nF);

        int[] charFreqs = new int[256];

        // read each byte and record the frequencies
        for (byte b : content) {
            charFreqs[b & 0xFF]++;
            System.out.println(b & 0xFF);
        }

        // build tree
        Tree tree = buildTree(charFreqs);

        // build TreeMap
        fillEncodeMap(tree, new StringBuffer(), treeMap);

    } catch (IOException e) {
        e.printStackTrace();
    }
}
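
(fillEncodeMap isn't shown above; a minimal sketch of such a method, assuming it simply records the path to each leaf as a string of '0'/'1' characters, could look like this:)

// Sketch only: walk the tree, recording '0' for left and '1' for right,
// and store the resulting code word for each leaf's byte value.
private static void fillEncodeMap(Tree tree, StringBuffer prefix, TreeMap<Integer, String> map) {
    if (tree instanceof Leaf) {
        map.put(((Leaf) tree).value, prefix.toString());
    } else if (tree instanceof Node) {
        Node node = (Node) tree;

        prefix.append('0');
        fillEncodeMap(node.left, prefix, map);
        prefix.deleteCharAt(prefix.length() - 1);

        prefix.append('1');
        fillEncodeMap(node.right, prefix, map);
        prefix.deleteCharAt(prefix.length() - 1);
    }
}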

Solution

  • Encodings matter

    If I take the character "ö" and read it in my file it will now be represented by 2 different values (191 and 182 or something like that) when its actual ASCII table value is 148.

    That really depends on which kind of encoding was used to create your text file. Encodings determine how text is stored.

    Please note that the basic ASCII table itself doesn't really define anything for that kind of character. ASCII only uses 7 bits, and so only maps 128 codes.

    ASCII table

    Part of the problem is that, in layman's terms, "ASCII" is sometimes used to describe extensions of ASCII as well (e.g. Latin-1).


    History

    There's actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren't enough, each region started using the 8th bit to add their language-specific characters, leading to all kinds of compatibility issues.

    Then there was some kind of consortium that made an inventory of all characters in all possible languages (and beyond). That set is called "Unicode". It contains not just 128 or 256 characters, but thousands of them.

    From that point on you need more advanced encodings to cover them. UTF-8 is one of the encodings that covers the entire Unicode set, and it does so while being kind-of backwards compatible with ASCII.

    Each ASCII character is still mapped in the same way, but when 1 byte isn't enough, the 8th bit is used to indicate that a 2nd byte will follow, which is the case for the ö character.
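
    You can see this directly from Java. The snippet below is just a demonstration (not part of your compressor); it prints the raw bytes of "ö" under two different encodings:

    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            String s = "ö";
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                System.out.println(b & 0xFF);   // prints 195, then 182: two bytes in UTF-8
            }
            for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) {
                System.out.println(b & 0xFF);   // prints 246: a single byte in Latin-1
            }
        }
    }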


    Tools

    If you're using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.

    Notepad++


    In programming

    Having said that, your current Java source reads bytes; it's not reading characters. And I would think it's a plus that it works on byte level, because then it can support all encodings. Maybe you don't need to work on character level at all.

    However, maybe it does matter for your specific algorithm. Let's say you've written an algorithm that is only supposed to handle Latin-1 encoding. Then it's really going to work on "character level" and not on "byte level". In that case, consider reading directly into a String or char[].

    Java can do the heavy lifting for you in that case. There are readers in Java that will let you read a text file directly into Strings/char[]. However, in those cases you should of course specify an encoding when you use them. Internally a single Java char is a 16-bit (2-byte) UTF-16 code unit.

    Trying to convert bytes to characters manually is a tricky business. Unless you're working with plain old ASCII of course. The moment you see a value above 0x7F (127), (which are presented by negative values in byte) you're no longer working with simple ASCII. Then consider using something like: new String(bytes, StandardCharsets.UTF_8). There's no need to write a decoding algorithm from scratch.