I am trying to parse a document that is formatted similarly to XML (but is not strictly XML). It works for everything except when I reach an HTML entity like &ldquo;. Then I get an error and everything crashes. How can I work around this?
Edit: Here's the error and the line it happens on:
03-25 17:56:26.540: W/System.err(21265): org.apache.harmony.xml.ExpatParser$ParseException: At line 68, column 354: undefined entity
<F_S_INGREDIENTS>Pale ale malt (well-modified and suitable for single-temperature infusion mashing); American hops; American yeast that can give a clean or slightly fruity profile. Generally all-malt, but mashed at lower temperatures for high attenuation. Water character varies from soft to moderately sulfate. Versions with a noticeable Rye character (“RyePA”) should be entered in the Specialty category.</F_S_INGREDIENTS>
I've narrowed it down to the quotes around “RyePA”, which appear as &ldquo; and &rdquo; in the raw file.
&ldquo; is a valid HTML entity, but not a valid XML entity. XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so you aren't going to be able to parse it with a stock XML parser.
The defineEntityReplacementText() method on XmlPullParser looks promising, although it means switching from the Expat-based parser to a pull parser. If you can't get that to work for you, you can simply read the string into memory (if it's not too big) and, before you hand it off to the parser, replace the text yourself:
String s = xml.replace("&ldquo;", "\"").replace("&rdquo;", "\"");
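For a slightly fuller picture, here is a minimal sketch of the pre-processing approach using only the standard JDK DOM parser. The class name HtmlEntityPreprocessor, the sanitize() and parseText() helpers, and the two-entry mapping table are all made up for illustration; extend the table with whatever HTML entities your data actually contains. Instead of plain double quotes it substitutes numeric character references (&#8220; and &#8221;), which any XML parser accepts and which keep the original curly quotes.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class HtmlEntityPreprocessor {

    // Assumed mapping: HTML-only named entity -> XML numeric character reference.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("&ldquo;", "&#8220;"); // left double quotation mark
        REPLACEMENTS.put("&rdquo;", "&#8221;"); // right double quotation mark
    }

    // Rewrites the known HTML entities so a stock XML parser will accept the text.
    static String sanitize(String xml) {
        String result = xml;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            // String.replace does literal (non-regex) replacement of every occurrence.
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Sanitizes the input, parses it, and returns the root element's text content.
    static String parseText(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc =
                builder.parse(new InputSource(new StringReader(sanitize(xml))));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<F_S_INGREDIENTS>Versions with a noticeable Rye character "
                + "(&ldquo;RyePA&rdquo;) should be entered in the Specialty category."
                + "</F_S_INGREDIENTS>";
        System.out.println(parseText(raw));
    }
}
```

The same sanitize() step should work in front of the SAX parser you are already using: run it on the string first, then pass the result to the parser as before. Keep in mind that this holds the whole document in memory, so it only suits files of modest size.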