I'm trying to parse a UTF-16 encoded document using the Apache Camel Splitter with xtokenize, which delegates to Woodstox (com.ctc.wstx.sr.BasicStreamReader). I cannot know the encoding of a file before I read it; currently some files are UTF-16 and others are UTF-8. The route uses:
.split().xtokenize(getToken(), 'w', NAMESPACES)
The problem I encounter is that Camel tells Woodstox which encoding to use:
String charset = IOHelper.getCharsetName(exchange);
It falls back to UTF-8 as the default encoding, so BasicStreamReader tries to read the BOM bytes as UTF-8 and fails with:
com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '�' (code 65533 / 0xfffd) in prolog; expected '<'
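To illustrate where the replacement character in that message comes from, here is a minimal JDK-only sketch (no Camel or Woodstox involved). The class name and the sample document are made up for illustration; it decodes the leading bytes of a UTF-16LE document as UTF-8, which is presumably what happens once the default charset is applied.

```java
import java.nio.charset.StandardCharsets;

class BomAsUtf8Demo {

    // Decodes the given bytes as UTF-8 and returns the first resulting char.
    static char firstCharAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).charAt(0);
    }

    public static void main(String[] args) {
        // "<a/>" encoded as UTF-16LE, preceded by the BOM bytes FF FE.
        byte[] body = "<a/>".getBytes(StandardCharsets.UTF_16LE);
        byte[] utf16 = new byte[body.length + 2];
        utf16[0] = (byte) 0xFF;
        utf16[1] = (byte) 0xFE;
        System.arraycopy(body, 0, utf16, 2, body.length);

        // 0xFF is never valid in UTF-8, so the decoder substitutes U+FFFD.
        char first = firstCharAsUtf8(utf16);
        System.out.printf("first char: U+%04X%n", (int) first); // prints "first char: U+FFFD"
    }
}
```

U+FFFD is code point 65533, the same value reported in the exception, so the parser never gets to see the '<' it expects in the prolog.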
As specified in https://www.w3.org/TR/xml/#sec-guessing, an XML parser such as Woodstox should be able to autodetect the file encoding, if only Camel let it do the work.
Is there a way to avoid implementing the encoding detection myself?
I created a Camel JIRA ticket: https://issues.apache.org/jira/browse/CAMEL-11846. As my comments there show, there is no easy way to split UTF-16 XML with Camel without knowing in advance that it is UTF-16.
Subclassing XMLTokenExpressionIterator (which is an ExpressionAdapter) and switching to an InputStream does work for the splitter itself, but there are several other places (XSLT, XPath, and conversion to StaxSource) where it will break for the same reason.
As a workaround, I find it easier to let XmlStreamReader determine the encoding in advance (this happens when the reader is initialized) and then set the Exchange.CHARSET_NAME header or property.
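The idea behind this workaround can be sketched roughly as follows. This is a simplified, JDK-only stand-in for the detection XmlStreamReader performs, not its actual implementation: it looks only at the byte order mark and ignores the XML declaration. The names `CharsetSniffer` and `sniffCharset` are hypothetical, and the UTF-8 fallback is an assumption chosen to match Camel's default. It requires Java 9 or later for `readNBytes`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {

    // Peeks at the first bytes of a mark-supporting stream, guesses the
    // charset from the byte order mark, and resets the stream afterwards
    // so the real parser can still read it from the beginning.
    static String sniffCharset(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3);
        in.reset();

        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        // FE FF (big-endian) or FF FE (little-endian): Java's "UTF-16"
        // decoder reads the BOM itself to pick the byte order.
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            return "UTF-16";
        }
        // Assumed fallback: everything else is treated as UTF-8.
        return "UTF-8";
    }

    public static void main(String[] args) throws IOException {
        // U+FEFF encoded as UTF-16LE yields the BOM bytes FF FE.
        byte[] doc = "\uFEFF<a/>".getBytes(StandardCharsets.UTF_16LE);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        String charset = sniffCharset(in);
        // In a Camel route this value would then be stored before the split,
        // e.g. exchange.setProperty(Exchange.CHARSET_NAME, charset).
        System.out.println(charset); // prints UTF-16
    }
}
```

Using mark/reset keeps the message body intact, so the same stream can be handed to the splitter afterwards. In a real route the detection would presumably live in a processor placed ahead of the split() call.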