java unicode jsoup character jxl

Weird character conversion, need help clarifying


I am writing a program that takes data extracted from a web page into an Excel sheet and then prints it to a text note. However, I have run into a weird problem: between the Excel sheet and the text note, one character changes. The - has turned into a ? . My attempted fix was to iterate through the word and, when I reach the ? , change it to a - . I've tried using Unicode values that I found online and done a

.replace("(question mark unicode) ", " - ") 

to no avail. Does anyone have any idea why it is doing that? And can you confirm the Unicode values for ? and - ? For example, the word "Leo‑III 1.3" is now "Leo?III 1.3". Thank you for any help.


Solution

  • The ? is the result of a character set encoding issue, which can occur at many places in the data pipeline.

    It may even be that the string itself is valid and the ? only appears when the string is printed.

    To find out what the actual character value is, try running this code to print the Unicode escape for all non-ASCII characters found in the string:

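    import java.util.TreeSet;

    // Prints each distinct non-ASCII character in s as a Unicode escape, in sorted order.
    public static void printNonAscii(String s) {
        TreeSet<Character> nonAscii = new TreeSet<>();
        // Drop CR, LF and printable ASCII (0x20-0x7E); whatever remains is collected.
        for (char ch : s.replaceAll("[\r\n\\x20-\\x7E]", "").toCharArray())
            nonAscii.add(ch);
        for (char ch : nonAscii)
            System.out.printf("\\u%04X  %s%n", (int) ch, ch);
    }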
    

    Test (source in UTF-8)

    printNonAscii("Foo ? \uFFFD ç ñ © ¼");
    

    Output

    \u00A9  ©
    \u00BC  ¼
    \u00E7  ç
    \u00F1  ñ
    \uFFFD  �
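As a follow-up, here is a minimal sketch of how the ? can arise and two possible ways around it. It assumes that the character in "Leo‑III" is U+2011 (NON-BREAKING HYPHEN), which is what the example in the question appears to contain; run printNonAscii on your own data to confirm. The class name HyphenFix and its helper methods are made up for illustration, and ISO-8859-1 only stands in for whatever charset your output is actually using.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HyphenFix {

    // Assumption: the odd character is U+2011 (NON-BREAKING HYPHEN).
    // Replaces it with the plain ASCII hyphen-minus (U+002D).
    public static String normalizeHyphens(String s) {
        return s.replace('\u2011', '-');
    }

    // Encodes to ISO-8859-1 and decodes again. ISO-8859-1 cannot represent
    // U+2011, so String.getBytes substitutes the charset's replacement byte '?'.
    public static String roundTripLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        String word = "Leo\u2011III 1.3";

        // How the '?' can appear: a lossy charset somewhere in the pipeline.
        System.out.println(roundTripLatin1(word));   // Leo?III 1.3

        // Option 1: swap the character for an ASCII hyphen before writing.
        System.out.println(normalizeHyphens(word));  // Leo-III 1.3

        // Option 2: keep the character and write the file explicitly as UTF-8.
        Path out = Files.createTempFile("note", ".txt");
        Files.write(out, word.getBytes(StandardCharsets.UTF_8));
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(word));   // true
    }
}
```

Note that once the character has been turned into a real ? it cannot be told apart from a genuine question mark, so the replacement has to happen before the lossy step, not after it. That is probably why replacing on the ? did not work reliably.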