java encoding byte bit huffman-code

My program reads my Huffman-encoded file incorrectly. Most of the bytes end up being '11111101', even though they're not. Why is that?


I'm currently working on a Huffman decoder in Java. I've been stuck on incorrect decoding for a few days now, and only yesterday I realised it's because my program reads the input incorrectly! I downloaded external software to read the file bit by bit, and the bits my program reads turn out to be completely different from what they should be. Could you please help me resolve this issue?

Here's the Java code I use to check which input bits I read:

public static void main(String[] args) {
    ...
    try {
        FileReader plikin = new FileReader(property);
        BufferedReader pinh = new BufferedReader(plikin);
        FileWriter plikout = new FileWriter("out.txt");
        BufferedWriter pout = new BufferedWriter(plikout);
        printRemainingBits(pinh);
        pout.close();
    }
    ...
}

static void printRemainingBits(BufferedReader pinh) throws IOException {
    System.out.println("\nRemaining Bits:");
    int c;
    while ((c = pinh.read()) != -1) {
        printCharBits((char) c);
    }
}

public static void printCharBits(char c) {
    for (int i = 7; i >= 0; i--) {
        int bit = (c >> i) & 1;
        System.out.print(bit);
    }
    System.out.print(" ");
}

For the file I'm testing with, the correct bit representation is:

00000100 10011011 01110011 00101001 11100110 01110010 00110111 11100100 01111101 01111111

But my Java code reads them like this:

00000100 11111101 01110011 00101001 11111101 01110010 00110111 11111101 01111101 01111111

I really don't know what to do. Thank you in advance!


Solution

  • You are reading your file as if it is text, but it's not text.

    There is a special Unicode character:

    U+FFFD � REPLACEMENT CHARACTER used to replace an unknown, unrecognized, or unrepresentable character

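    Note that 11111101 is 0xFD, the lowest 8 bits of that value. FileReader decodes the file's bytes into characters, and every byte that is not valid in the charset in use (apparently UTF-8; in your sample, each byte with the high bit set, such as 10011011) comes back as U+FFFD. Your printCharBits only prints the low 8 bits of each char, so you see 11111101.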

    Rather than using FileReader and BufferedReader, you should be using FileInputStream and BufferedInputStream. That way, your binary data won't be corrupted by replacement characters and other Unicode decoding.
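
A minimal sketch of the difference, assuming the file is being decoded as UTF-8 (which is what the output in the question suggests). The class and method names (BitDump, toBits, dumpBytes, dumpChars) are made up for illustration and are not from the question's code. It pushes the ten sample bytes through an InputStream and through a Reader so the two results can be compared; if a file path is passed as the first argument, it dumps that file through FileInputStream and BufferedInputStream instead.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class BitDump {

    // Formats the low 8 bits of value, most significant bit first.
    static String toBits(int value) {
        StringBuilder sb = new StringBuilder(8);
        for (int i = 7; i >= 0; i--) {
            sb.append((value >> i) & 1);
        }
        return sb.toString();
    }

    // Binary read: InputStream.read() returns one raw byte (0..255), or -1 at end of stream.
    static String dumpBytes(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append(toBits(b)).append(' ');
        }
        return sb.toString().trim();
    }

    // Text read, as in the question: Reader.read() returns a decoded char, not a byte.
    static String dumpChars(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append(toBits(c)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Real file: wrap FileInputStream in BufferedInputStream and read raw bytes.
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                System.out.println(dumpBytes(in));
            }
            return;
        }

        // The ten bytes from the question, held in memory for the comparison.
        byte[] sample = {
            0x04, (byte) 0x9B, 0x73, 0x29, (byte) 0xE6,
            0x72, 0x37, (byte) 0xE4, 0x7D, 0x7F
        };

        System.out.println("bytes: " + dumpBytes(new ByteArrayInputStream(sample)));

        // Assumption: UTF-8 decoding, standing in for whatever charset FileReader picked.
        Reader asText = new InputStreamReader(
                new ByteArrayInputStream(sample), StandardCharsets.UTF_8);
        System.out.println("chars: " + dumpChars(asText));
    }
}
```

Run without arguments, the "bytes" line should match the correct bit representation from the question, while the "chars" line shows 11111101 in place of the three bytes that are not valid UTF-8. In the decoder itself, only the stream-based read is needed.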