One of my JUnit tests uses (behind the scenes) the Woodstox parser.
When I run the test from within Eclipse, the test succeeds as expected.
But running the same test on the command line, using
mvn clean test -Dtest=com.example.MyClassTest#someParserTest
causes the test to fail with the following exception:
Error on line 114 column 21
SXXP0003: Error reported by XML parser: Invalid UTF-8 middle byte 0x3f (at char #4174, byte #3999)
...
at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:55)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:961)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4580)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1063)
at com.ctc.wstx.sax.WstxSAXParser.fireEvents(WstxSAXParser.java:524)
at com.ctc.wstx.sax.WstxSAXParser.parse(WstxSAXParser.java:452)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
at net.sf.saxon.event.Sender.send(Sender.java:171)
at net.sf.saxon.jaxp.IdentityTransformer.transform(IdentityTransformer.java:363)
I took a look at the to-be-parsed InputStream. The InputStreams are identical in both cases.
Also, there is no "line 114 column 21" in the InputStream. Line 114 ends on column 11.
How can I investigate what causes the different behavior?
It turned out that a library I used made wrong assumptions about the environment's default character encoding (also called the platform's default charset).
In the Eclipse environment, calling Charset.defaultCharset() returned UTF-8, while in the command line environment it returned CP1252.
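A quick way to compare the two environments is to print the charset settings from inside the JVM that runs the test. The following is a minimal sketch; the class name DefaultCharsetCheck and the describe() helper are made up for illustration, and only Charset.defaultCharset() and the file.encoding system property come from the JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the library in question.
class DefaultCharsetCheck {

    /** Returns a one-line summary of the charset settings of the running JVM. */
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Run (or log) this once from Eclipse and once from the Maven build
        // and compare the two lines.
        System.out.println(describe());
    }
}
```

Calling describe() from a test's setup method and logging the result should show whether the IDE and the command line really run with different defaults.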
Many standard and third-party Java APIs behave differently depending on the platform's default charset, among them:
String.getBytes()
ByteArrayOutputStream.toString()
XMLOutputFactory.createXMLStreamWriter(OutputStream stream)
IOUtils.toString(InputStream input)
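To illustrate why the default matters, here is a small sketch that encodes the same string once as UTF-8 and once as CP1252, then checks both byte arrays with a strict UTF-8 decoder. The class name, the sample text and the isValidUtf8 helper are assumptions for the demonstration. The CP1252 bytes end in 0xE9 0x3F, and since 0xE9 opens a multi-byte sequence in UTF-8, a following 0x3f is rejected. That is the same kind of failure as the "Invalid UTF-8 middle byte 0x3f" above, although the exact bytes in my case may have come about differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; simulates the two defaults explicitly
// instead of relying on the platform's default charset.
class CharsetMismatchDemo {

    // Stand-in for the command-line default (CP1252).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9?"; // "café?" written with an escape
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asCp1252 = text.getBytes(CP1252);

        System.out.println("UTF-8 bytes valid as UTF-8:  " + isValidUtf8(asUtf8));
        System.out.println("CP1252 bytes valid as UTF-8: " + isValidUtf8(asCp1252));
    }
}
```

Any code path that produces bytes with the platform default and later hands them to a parser expecting UTF-8 can run into this, which would explain why the test passes in one environment and fails in the other.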
To resolve my issue, I had to update that library to explicitly use the correct character set:
String.getBytes(StandardCharsets.UTF_8)
ByteArrayOutputStream.toString(StandardCharsets.UTF_8.name())
XMLOutputFactory.createXMLStreamWriter(OutputStream stream, StandardCharsets.UTF_8.name())
IOUtils.toString(InputStream input, StandardCharsets.UTF_8)
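As a rough sketch of what the fixed calls look like in context, the following round trip uses only the explicit-charset variants of String.getBytes and ByteArrayOutputStream.toString, so the result should not depend on the platform default. The class and method names are hypothetical and are not taken from the library I patched.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical example class for illustration only.
class ExplicitCharsetRoundTrip {

    /** Encodes the text to bytes and decodes it again, always as UTF-8. */
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(bytes, 0, bytes.length);

        try {
            // The no-argument toString() would use the platform default here.
            return out.toString(StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"
        System.out.println(original.equals(roundTrip(original)));
    }
}
```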