JTidy not handling some characters correctly


Certain characters get mangled after I call Tidy.parse. Two examples are: ’ instead of ' and ∼ instead of ~

I'm guessing these came from Word or something similar, but Tidy handles them very badly. Specifically, it converts each character into a series of individual entities for accented characters, which then get turned into meaningless junk later in my process. I'm sure there are others, but these are the ones I have found so far. Is there a known way to convert these characters beforehand, or to make Tidy leave them alone?

        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.setForceOutput(true);
        tidy.parse(inputStream, outputStream);

Solution

  • After printing out the config, I could see that the input and output encodings were not set to UTF-8 as I had assumed, so I just had to add this before the call to parse:

    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
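
The mangled sequences in the question are consistent with UTF-8 bytes being decoded with a single-byte Windows encoding. Here is a minimal sketch that reproduces the effect with the standard library only, without JTidy. The class and method names are hypothetical, and windows-1252 as the wrong charset is an assumption based on the symptoms, not something confirmed from the JTidy config.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Hypothetical helper: encode the text as UTF-8, then decode those bytes
    // as windows-1252 (assumed wrong charset), the way a misconfigured reader would.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Same bytes, decoded with the matching charset: the text survives intact.
    static String decodeCorrectly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // right single quotation mark, common in Word text
        String tilde = "\u223C"; // tilde operator

        // Each character is three bytes in UTF-8, so misdecoding yields three characters.
        System.out.println(misdecode(quote).length());
        System.out.println(misdecode(tilde).length());

        // Decoding with the right charset gives back a single character.
        System.out.println(decodeCorrectly(quote).equals(quote));
    }
}
```

Both characters are three bytes long in UTF-8, which is why each one turns into three unrelated characters when the bytes are read one at a time. Setting the input and output encodings to UTF-8 before calling tidy.parse avoids the mismatch.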