
Getting Apache Camel to handle the encoding declared in an XML file


I'm trying to parse a UTF-16 encoded document using the Apache Camel Splitter with xtokenize, which delegates to Woodstox (com.ctc.wstx.sr.BasicStreamReader). I cannot know the encoding of a file before I read it; currently some files are UTF-16 and others are UTF-8:

.split().xtokenize(getToken(), 'w', NAMESPACES)

The problem I encounter is that Camel tells Woodstox which encoding to use:

String charset = IOHelper.getCharsetName(exchange);

This returns the default, UTF-8, so BasicStreamReader tries to read the BOM bytes as UTF-8 and fails with:

com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '�' (code 65533 / 0xfffd) in prolog; expected '<'

As described in https://www.w3.org/TR/xml/#sec-guessing, an XML parser such as Woodstox should be able to autodetect the file encoding, if only Camel let it do the work.
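For reference, the autodetection described in that appendix amounts to looking at the first few bytes of the stream. Below is a minimal sketch of that idea in plain Java. The class and method names are hypothetical (they are not part of Camel or Woodstox), it covers only the UTF-8/UTF-16/UTF-32 cases, and it does not go on to read the encoding declaration as a real parser would.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Camel or Woodstox. */
final class XmlEncodingSniffer {

    /** Returns the encoding family suggested by the first bytes, or null if nothing matches. */
    static String guess(byte[] b) {
        // Byte order marks first. UTF-32 is tested before UTF-16 because
        // FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // No mark: look at how the leading "<?" (or "<") is laid out in memory.
        if (startsWith(b, 0x00, 0x00, 0x00, 0x3C)) return "UTF-32BE";
        if (startsWith(b, 0x3C, 0x00, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        // "<?xm" in an ASCII-compatible encoding; a real parser would now
        // read the encoding declaration to find out which one exactly.
        if (startsWith(b, 0x3C, 0x3F, 0x78, 0x6D)) return "UTF-8";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16 = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-16\"?><a/>"
                .getBytes(StandardCharsets.UTF_16LE);
        byte[] utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf16)); // prints UTF-16LE
        System.out.println(guess(utf8));  // prints UTF-8
    }
}
```

The point is only that the parser needs the raw bytes to do this. Once a Reader with a fixed charset has been put in front of it, the information is gone.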

Is there a way to avoid implementing the encoding detection myself?


Solution

  • I created a Camel JIRA ticket: https://issues.apache.org/jira/browse/CAMEL-11846. As my comments there show, there is no easy way to split UTF-16 XML with Camel without knowing in advance that it is UTF-16.

    Subclassing XMLTokenExpressionIterator (which is an ExpressionAdapter) and switching it to an InputStream works at first, but there are several other places (XSLT, XPath, and conversion to StaxSource) where it breaks for the same reason.

    As a workaround, I find it easier to let XmlStreamReader determine the encoding in advance (it does so at initialization) and then set the Exchange.CHARSET_NAME header or property.
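A rough sketch of that workaround follows. To keep it runnable without extra dependencies it swaps in the JDK's own StAX reader in place of XmlStreamReader, and a plain Map stands in for the exchange headers; both substitutions are for illustration only. The class name, the method names, and the "CamelCharsetName" key (which I assume is the value behind Exchange.CHARSET_NAME) are assumptions and should be checked against the Camel version in use.

```java
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper for illustration; not part of Camel. */
final class CharsetPreDetector {

    // Assumed value of Exchange.CHARSET_NAME; verify against your Camel version.
    static final String CHARSET_HEADER = "CamelCharsetName";

    /** Lets the parser look at the raw bytes and report the encoding it settled on. */
    static String detectEncoding(byte[] xml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Only the prolog is of interest here, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                // Encoding detected from the input, if the parser exposes it.
                String encoding = reader.getEncoding();
                if (encoding == null) {
                    // Otherwise fall back to the encoding declared in the XML declaration.
                    encoding = reader.getCharacterEncodingScheme();
                }
                // XML without any hint is treated as UTF-8.
                return encoding != null ? encoding : "UTF-8";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not read the XML prolog", e);
        }
    }

    /** Builds the header map that a Camel Processor would copy onto the message. */
    static Map<String, Object> headersFor(byte[] xml) {
        Map<String, Object> headers = new HashMap<>();
        headers.put(CHARSET_HEADER, detectEncoding(xml));
        return headers;
    }
}
```

In a real route this logic would sit in a Processor placed before the split, so that the charset is already on the exchange by the time the tokenizer asks for it. Note that the body has to be re-readable (a byte array or a cached stream), since detection consumes the beginning of the input.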