java utf-8 character-encoding utf-16 saxparser

Parsing UTF-8 XML using DefaultHandler: when / how does it become UTF-16 in Java?


I have a Java program that was working perfectly in Corretto 17, but is now having character set encoding issues in Corretto 25.

I am reading UTF-8 encoded XML from an external API. The code is quite simple: I open an HttpURLConnection and I have a class that extends DefaultHandler:

URI uri = new URI(url);
URL authenticatedURL = uri.toURL();
HttpURLConnection connection = (HttpURLConnection) authenticatedURL.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Authorization", "Bearer " + BEARER_TOKEN);
// connection.setRequestProperty("Accept-Charset", "UTF-8");  // This line of code seems to have no impact.
[...]
InputStream connectionInputStream = connection.getInputStream();
InputSource connectionInputSource = new InputSource(connectionInputStream);
connectionInputSource.setEncoding(StandardCharsets.UTF_8.displayName());
parser.parse(connectionInputSource, dh);
// parser.parse(connection.getInputStream(), dh);  // This 1 line seems to work the same as the above 4 lines.

My understanding is that Java uses UTF-16 for all Strings, but that it also assumes some inputs (e.g. XML) will be in UTF-8, for instance in the Attributes class used by DefaultHandler. I'm assuming this UTF-8 assumption is why explicitly setting the charset / encoding in the code above makes no difference. Is this correct?

The issue I'm having is I don't understand when / how the UTF-8 I read in is converted to UTF-16. For instance, in my extension of DefaultHandler, attributes.getValue() seems to return a UTF-8 encoded String, but Float.parseFloat() works perfectly:

public void startElement(String uri, String localName, String qName, Attributes attributes) {
  if ("name".equals(qName)) {
    if ("primary".equals(attributes.getValue("type"))) {
      if (attributeHasValue(attributes, "value")) {
        primaryName = attributes.getValue("value");  // Seems to store UTF-8 string.
      }
    }
  } else if ("averageweight".equals(qName)) {
    if (attributeHasValue(attributes, "value")) {
      averageWeight = Float.parseFloat(attributes.getValue("value"));
    }
  [...]

Outputs (when I print the values to System.out.println):

Primary name = Orl�ans   // NOT GOOD!
Average weight = 3.0137  // But Floats and Integers are parsed just fine.

I suppose I have to explicitly convert the value returned by attributes.getValue() from UTF-8 to UTF-16, is that correct?

But, if so, why are numeric values being parsed correctly?

And, I currently assume the qName and localName parameters are provided in UTF-16, is that correct?

I'm just confused in general / don't have the right mental model of what the SAXParser + DefaultHandler are doing encoding-wise, because I don't understand how most of my code is working if the encoding is wrong everywhere.
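
One way to separate "the parser decoded it wrongly" from "the console displayed it wrongly" is to look at the code points of the parsed value instead of its printed form. The following is a minimal sketch, not the original program: it assumes an in-memory UTF-8 document in place of the API response, and the class name SaxDecodeCheck and the helpers parseName and codePoints are hypothetical names introduced only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDecodeCheck {

    /** Parses UTF-8 encoded XML and returns the "value" attribute of the first <name> element. */
    public static String parseName(byte[] utf8Xml) throws Exception {
        String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if ("name".equals(qName) && result[0] == null) {
                    // The parser has already decoded the bytes; this is an ordinary Java String.
                    result[0] = attributes.getValue("value");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(utf8Xml), handler);
        return result[0];
    }

    /** Formats each code point as U+XXXX, so the result does not depend on the console encoding. */
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample document standing in for the real API response.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<item><name type=\"primary\" value=\"Orl\u00e9ans\"/></item>";
        String name = parseName(xml.getBytes(StandardCharsets.UTF_8));
        // Only ASCII is printed here, so any console shows it correctly.
        System.out.println(codePoints(name));
    }
}
```

If the fourth code point is U+00E9, the String held in memory is correct and any garbling is introduced later, when the String is encoded for output.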


Solution

  • Thanks to Slaw for putting me on the right path. It turns out adding -Dstdout.encoding=UTF-8 to the Run configuration (VM arguments) in Eclipse fixed the issue.

    When I didn't set that flag, the standard output encoding was Cp1252:

    System.out.println("Standard Output Encoding: " + System.getProperty("stdout.encoding"));
    
    Outputs "Standard Output Encoding: Cp1252" before adding "-Dstdout.encoding=UTF-8"
    Outputs "Standard Output Encoding: UTF-8" after adding "-Dstdout.encoding=UTF-8"
    

    Searching further on the internet, I found the following related resources:

    1. Eclipse bug report #530 in 2023.

    2. Note that this bug only applies to Java 18+ because of the changes in JEP 400, which is why my Java (Corretto) update from 17 to 25 triggered the issue.

    So, with the flag, all my code works perfectly fine once more. No issues with the character encodings, logic, output, etc.

    [Side note: The above bug was closed in July 2023, so this only happened because I've been neglecting my Eclipse updates for far longer than I'd thought.]
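
The symptom can presumably be reproduced without Eclipse by writing a String through a PrintStream that uses one charset and reading the bytes back with another. This is a sketch under the assumption that the console was interpreting Cp1252 bytes as UTF-8; the class name StdoutEncodingDemo and the helper printAndReadBack are hypothetical, and "windows-1252" is the canonical Java name for Cp1252.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StdoutEncodingDemo {

    /** Prints text through a PrintStream using writerCharset, then decodes the bytes as readerCharset. */
    public static String printAndReadBack(String text, Charset writerCharset, Charset readerCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, writerCharset)) {
            out.print(text);
        }
        return new String(sink.toByteArray(), readerCharset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        Charset utf8 = StandardCharsets.UTF_8;
        String name = "Orl\u00e9ans";   // a correctly decoded String, as returned by the parser
        String number = "3.0137";

        // Mismatch: stream encodes as Cp1252, reader (the console) decodes as UTF-8.
        String garbled = printAndReadBack(name, cp1252, utf8);
        System.out.println("name survives mismatch:   " + garbled.equals(name));
        System.out.println("contains U+FFFD:          " + (garbled.indexOf('\uFFFD') >= 0));

        // ASCII characters have the same bytes in both charsets, so numbers are unaffected.
        System.out.println("number survives mismatch: " + printAndReadBack(number, cp1252, utf8).equals(number));

        // Matching charsets, which is what -Dstdout.encoding=UTF-8 restores.
        System.out.println("name survives UTF-8:      " + printAndReadBack(name, utf8, utf8).equals(name));
    }
}
```

This also suggests why Float.parseFloat() and the numeric output were never affected: the Strings in memory were correct all along, and only non-ASCII characters change bytes between Cp1252 and UTF-8.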