I have an XML file which starts with <?xml version="1.0" encoding="iso-8859-2"?>. I read it the following way:
SAXParserFactory.newInstance().newSAXParser().parse(is, handler);
where is is an InputStream and handler is some arbitrary handler.
Then I get this exception:
org.apache.harmony.xml.ExpatParser$ParseException: At line 41152, column 17: not well-formed (invalid token)
Actually there is a degree sign at that position, enclosed in a CDATA like this:
<![CDATA[something °]]>
Using the charset iso-8859-2, the parser should accept almost any character, including this one. This seems not to be the case. What am I doing wrong?
EDIT
I'm doing all this on Android.
Weird: it seems that the parser completely ignores the encoding attribute. I converted the file to UTF-8 while leaving the header as is, and now my program can read it without error. Why is that??
(I'm creating the InputStream like this: new BufferedInputStream(new FileInputStream(filename)), i.e. without a Reader, so that cannot be the cause of the error.)
I worked around the error by detecting the encoding manually. I peeked at the XML header and looked for the encoding attribute (if present), extracted its value as a String, created a Java Charset object from it with Charset.forName(), then created a Reader with that charset and an InputSource over that Reader, like this:
String encoding;
Charset charset;
[...]
Reader reader = new BufferedReader(new InputStreamReader(inputStream, charset));
InputSource inputSource = new InputSource(reader);
inputSource.setEncoding(encoding);
SAXParserFactory.newInstance().newSAXParser().parse(inputSource, myHandler);
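For illustration, the header-peeking step elided above could look roughly like the following. This is only a minimal sketch, not the exact code I used: the class name XmlEncodingSniffer, the method detectCharset, the 1024-byte peek limit and the regular expression are all hypothetical choices. It assumes the stream supports mark/reset (a BufferedInputStream does) and that the declared encoding is ASCII-compatible, so it would not handle UTF-16 files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {
    // Assumption: the XML declaration fits within the first 1024 bytes.
    private static final int PEEK_LIMIT = 1024;

    // Matches the encoding pseudo-attribute inside <?xml ... ?>.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Peeks at the start of the stream, extracts the declared encoding and
     * rewinds the stream so the parser still sees the whole document.
     * Returns the fallback charset if no encoding is declared.
     */
    public static Charset detectCharset(InputStream in, Charset fallback)
            throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int total = 0;
        int n;
        // Read until the buffer is full or the stream ends.
        while (total < head.length
                && (n = in.read(head, total, head.length - total)) > 0) {
            total += n;
        }
        in.reset();

        // ISO-8859-1 maps every byte to one char, so decoding cannot fail;
        // the declaration itself only contains ASCII characters.
        String prolog = new String(head, 0, total, Charset.forName("ISO-8859-1"));
        Matcher m = ENCODING.matcher(prolog);
        if (m.find()) {
            // Throws an unchecked exception if the name is illegal or unsupported.
            return Charset.forName(m.group(1));
        }
        return fallback;
    }
}
```

With something like this, charset would be the result of XmlEncodingSniffer.detectCharset(inputStream, Charset.forName("UTF-8")) and encoding would be charset.name(), where inputStream is the BufferedInputStream from above.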
Unfortunately I still don't know why the encoding could not be recognized automatically by the parser.