java xml soap unmarshalling castor

Unmarshalling XML with foreign characters requires entity declarations with Castor


I have two applications that have to work in conjunction with one another. One is an application I built myself using Java 17, the other is something I have far less control over, one that uses Java 8.

My Java 17 application must take a POJO and marshal it to XML, sending it to the Java 8 application via SOAP. The Java 8 application then receives this XML and unmarshals with Castor into a POJO, and then works on it.

With foreign (non-ASCII) characters, however, all of this falls apart. I have made sure the encoding on both sides is set to UTF-8. In the Java 17 application, I use Jakarta XML Binding (JAXB) to marshal the POJO to XML.

        JAXBContext context = JAXBContext.newInstance(PojoClass.class);
        Marshaller marshaller = context.createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_ENCODING, "utf-8");
        marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, false);
        //marshaller.setProperty("org.glassfish.jaxb.characterEscapeHandler", new NonAsciiEscapeHandler());
        StringWriter sw = new StringWriter();
        marshaller.marshal(pojoClassInstance, sw);
        return sw.toString();

I have tried adding a separate character escape handler to convert all foreign characters to their hex values. As far as I understand, non-ASCII characters written as hexadecimal numeric character references (for example, &#xd5;) do not need entity declarations.
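
A minimal sketch of what such a handler does, written without any JAXB dependency so it can be run on its own. The class and method names are hypothetical; the assumption is that in the real application the same loop sits inside an implementation of the JAXB runtime's CharacterEscapeHandler interface, registered through the property shown commented out above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class NumericReferenceEscaper {

    // Same parameter shape as a character escape handler callback (assumed).
    static void escape(char[] ch, int start, int length, boolean isAttVal, Writer out)
            throws IOException {
        int end = start + length;
        int i = start;
        while (i < end) {
            // Read whole code points so surrogate pairs become one reference.
            int cp = Character.codePointAt(ch, i, end);
            i += Character.charCount(cp);
            if (cp > 0x7F) {
                // Non-ASCII: hexadecimal numeric character reference.
                out.write("&#x" + Integer.toHexString(cp) + ";");
            } else if (cp == '&') {
                out.write("&amp;");
            } else if (cp == '<') {
                out.write("&lt;");
            } else if (cp == '>') {
                out.write("&gt;");
            } else if (cp == '"' && isAttVal) {
                out.write("&quot;");
            } else {
                out.write(cp);
            }
        }
    }

    // Convenience wrapper for element text (not attribute values).
    static String escape(String text) {
        StringWriter sw = new StringWriter();
        try {
            escape(text.toCharArray(), 0, text.length(), false, sw);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // StringWriter does not throw in practice
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u00D5888 Sample Rd."));
    }
}
```

The output contains only ASCII characters, so it is independent of the transport encoding.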

In the Java 8 application, Castor unmarshals the XML and Breeze later parses it. As a result, I get this error in com.tbf.xml.XmlObjectFactory: "The entity 'Otilde' was referenced, but not declared." Yes, I know this means that I need to declare it as an XML entity, but why does it do that when I explicitly send the hex values? And if I don't send the hex values, I send the UTF-8 encoded foreign characters, which should NOT be randomly converted to entities.
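
The difference between the two forms can be reproduced with the JDK's built-in XML parser alone. XML predefines only five named entities (amp, lt, gt, apos, quot); HTML names such as Otilde must be declared in a DTD, while numeric character references need no declaration. The sketch below uses a hypothetical helper class, not anything from Castor or Breeze, and only illustrates the parser behaviour.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityReferenceCheck {

    // Returns the root element's text, or null if the parser rejects the input.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Numeric character reference: accepted without any declaration.
        System.out.println(parseText("<AddrLine1>&#xd5;888 Sample Rd.</AddrLine1>"));
        // HTML named entity: rejected, because no DTD declares it.
        System.out.println(parseText("<AddrLine1>&Otilde;888 Sample Rd.</AddrLine1>"));
    }
}
```

So if the receiving side reports an undeclared entity, something between the wire and the parser has rewritten the text into the named form.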

Can someone please help me either: 1. Get the Java 8 application with Castor to accept the UTF-8 or hex values and not complain about undeclared entities, or 2. Use the JAXB marshaller in Java 17 to add entity declarations in the generated XML? Thank you!

EDIT: In response to a comment, this is a sample of what the code above produces after turning foreign characters into hex (with the NonAsciiEscapeHandler active, that is, not commented out).

<PostalAddr>
    <AddrLine1>&#xd5;888 Sample Rd.</AddrLine1>
    <AddrLine2>&#xc8;021 Apartment 55</AddrLine2>
    <CityNm>&#xf3;Doyle</CityNm>
    <PostalZIPCd>99999</PostalZIPCd>
    <StateProvCd>AK</StateProvCd>
</PostalAddr>

This is then turned into a SOAP object and sent via Apache HttpClient.

try (CloseableHttpClient client = HttpClients.createDefault()) {
    HttpPost httpPost = new HttpPost(url);
    // Encode the request body as UTF-8
    HttpEntity entity = new StringEntity(xmlMessage, StandardCharsets.UTF_8);
    httpPost.setEntity(entity);
    httpPost.setHeader("Content-type", "application/soap+xml");
    httpPost.setHeader("Accept", "application/soap+xml");
    httpPost.addHeader("Accept-Charset", "utf-8");
    httpPost.setHeader("SOAPAction", "SOAP");

    CloseableHttpResponse response = client.execute(httpPost);
    // ... response handling omitted
}

However, I included a sample here to see what happens when I take the HttpEntity in the same Java 17 application and extract it.

HttpEntity entity2 = httpPost.getEntity();
String entityContents = EntityUtils.toString(entity2, StandardCharsets.UTF_8);

And when I do so, this is what the XML shows:

<PostalAddr>
    <AddrLine1>Õ888 Sample Rd.</AddrLine1>
    <AddrLine2>È021 Apartment 55</AddrLine2>
    <CityNm>óDoyle</CityNm>
    <PostalZIPCd>99999</PostalZIPCd>
    <StateProvCd>AK</StateProvCd>
</PostalAddr>

When unmarshalling in Java 8 with Castor, though, this is what appears:

<PostalAddr>
    <AddrLine1>&Otilde;888 Sample Rd.</AddrLine1>
    <AddrLine2>&Egrave;021 Apartment 55</AddrLine2>
    <CityNm>&oacute;Doyle</CityNm>
    <PostalZIPCd>99999</PostalZIPCd>
    <StateProvCd>AK</StateProvCd>
</PostalAddr>

This then causes problems down the line when Breeze is used to parse this.

But either way, I believe Castor's unmarshalling is the cause of this discrepancy. I need Castor to not unmarshal it as such, and actually unmarshal with either the hex codes or the actual characters.


Solution

  • All right, I was finally able to solve this with the help of two of my coworkers.

    The problem was that when the XML was processed in Java 8, all non-ASCII characters were being converted to HTML named entities such as &Otilde;, which XML does not predefine, and no entity declaration was added. Digging into the Java 8 application - an application that is very old and held together with paperclips and glue - it turns out that the base class of everything the XML was unmarshalled to would run the method StringEscapeUtils.escapeHtml() (from org.apache.commons.lang.StringEscapeUtils) on any string element.

    public void setValue(String s) {
        value = StringEscapeUtils.escapeHtml(s);
    }
    

    However, I could not just get rid of this, as this would break the application.

    In the end, @nullpointer's suggestion of running StringEscapeUtils.escapeJava() turned out to be partially correct. I did not realize I had to add the dependency org.apache.commons:commons-text to get the non-deprecated StringEscapeUtils. (Yes, he did mention it, but I accidentally overlooked it. I apologize.)

    In my Java 17 application, when marshalling the XML, my NonAsciiEscapeHandler class turned out to be useful, but instead of converting all foreign characters to hex, I did this:

    // 0x7F is the last ASCII code point; anything above it is non-ASCII
    if (ch > 0x7F) {
        writer.write(StringEscapeUtils.escapeJava(String.valueOf(ch)));
        continue;
    }
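
    For context, here is a dependency-free sketch of that branch inside a complete loop. The class and method names are hypothetical, and String.format is only a stand-in for escapeJava() so the example runs without commons-text; for a single non-ASCII character both are expected to yield the same backslash-u form. Escaping of markup characters such as & and < is left out here.

    ```java
    public class UnicodeEscapeWriter {

        // Replaces every non-ASCII UTF-16 code unit with a Java-style Unicode escape.
        static String escapeNonAscii(String text) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                char ch = text.charAt(i);
                if (ch > 0x7F) {
                    // Stand-in for StringEscapeUtils.escapeJava(String.valueOf(ch))
                    out.append(String.format("\\u%04X", (int) ch));
                    continue;
                }
                // ASCII passes through unchanged in this sketch.
                out.append(ch);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(escapeNonAscii("\u00D5888 Sample Rd."));
        }
    }
    ```

    Working per code unit is deliberate: Java-style escapes represent characters outside the Basic Multilingual Plane as two consecutive escapes, one per surrogate.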
    

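    This writes each non-ASCII character as a Java-style Unicode escape (for example, \u00D5 for Õ). Because such an escape consists only of ASCII characters, escapeHtml() leaves it untouched. In the Java 8 application, the base string element then had to be modified. While I still had to keep the escapeHtml() call in setValue(), I modified the getValue() method: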

    if(isEscaped(value)) {
        return StringEscapeUtils.unescapeJava(value);
    } else {
        return value;
    }
    
    // etc etc.
    
    private boolean isEscaped(String str) {
        return !str.equals(StringEscapeUtils.unescapeJava(str));
    }
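
    To illustrate the receiving side, here is a small stand-alone sketch of the unescaping step. It is a hand-rolled, hypothetical stand-in that handles only backslash-u sequences; the real unescapeJava() also interprets other backslash sequences, so a value that legitimately contains a backslash could be altered by it.

    ```java
    public class UnicodeUnescape {

        // Turns each backslash-u escape followed by four hex digits back into its character.
        static String unescapeUnicode(String value) {
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < value.length()) {
                if (value.startsWith("\\u", i)
                        && i + 6 <= value.length()
                        && isHex(value, i + 2)) {
                    out.append((char) Integer.parseInt(value.substring(i + 2, i + 6), 16));
                    i += 6;
                } else {
                    out.append(value.charAt(i));
                    i++;
                }
            }
            return out.toString();
        }

        // True if the four characters starting at 'from' are all hex digits.
        private static boolean isHex(String s, int from) {
            for (int k = from; k < from + 4; k++) {
                if (Character.digit(s.charAt(k), 16) < 0) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(unescapeUnicode("\\u00D5888 Sample Rd."));
        }
    }
    ```

    A value without escapes comes back unchanged, so in this sketch a separate isEscaped() check would be optional.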
    

    Again, the Java 8 application used org.apache.commons.lang.StringEscapeUtils while the Java 17 application used org.apache.commons.text.StringEscapeUtils. Thank you everyone for your suggestions and help!