I am attempting to parse the JMDict_e.xml file from the JMDict project using VTD-XML. However, I am running into a parsing error.
The only error message that appears is:
ParserException: com.ximpleware.EntityException: Errors in Entity: Illegal entity char
A short excerpt from the XML looks like this:
<entry>
<ent_seq>1279770</ent_seq>
<k_ele>
<keb>構成要素</keb>
</k_ele>
<r_ele>
<reb>こうせいようそ</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<pos>&adj-no;</pos>
<field>&comp;</field>
<gloss>components</gloss>
<gloss>elements</gloss>
<gloss>parts</gloss>
</sense>
</entry>
I believe the illegal characters are the ampersands in the pos fields. Is there a way to make VTD-XML not treat these ampersands as special characters? Or is there a different approach to this problem?
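VTD-XML only recognizes the five built-in XML entities (&amp;, &lt;, &gt;, &apos;, and &quot;). The entities in your excerpt, such as &n; and &adj-no;, are not among them, so as far as the parser is concerned they are invalid. You will probably need to fix those references before feeding the file to the parser.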
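One possible way to do that fix is to escape the ampersand of every named entity reference that is not one of the five built-in ones, so that &n; becomes &amp;n; and the parser sees ordinary text. The following is only a minimal sketch of that preprocessing step, not something taken from VTD-XML itself: the class name EntityEscaper and the method escapeCustomEntities are made up for illustration, and it assumes the document is small enough to hold in memory as a String.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of VTD-XML): escapes custom entity
// references so that a parser limited to the built-in entities accepts them.
public class EntityEscaper {

    // Matches a named reference such as &n; or &adj-no;, but skips the five
    // predefined XML entities. Numeric references (&#123;) never match,
    // because the name must start with a letter or underscore.
    private static final Pattern CUSTOM_ENTITY = Pattern.compile(
            "&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][A-Za-z0-9._-]*);");

    // Rewrites &name; as &amp;name; for every custom entity reference.
    public static String escapeCustomEntities(String xml) {
        return CUSTOM_ENTITY.matcher(xml).replaceAll("&amp;$1;");
    }

    public static void main(String[] args) {
        String sample = "<pos>&n;</pos><pos>&adj-no;</pos><gloss>a &amp; b</gloss>";
        // prints <pos>&amp;n;</pos><pos>&amp;adj-no;</pos><gloss>a &amp; b</gloss>
        System.out.println(escapeCustomEntities(sample));
    }
}
```

The escaped text can then be converted to bytes and handed to VTDGen via setDoc(byte[]) and parse(boolean) as usual. The trade-off is that the entity names survive as literal text (for example "&n;") rather than being expanded to their definitions, so if you need the expansions you would have to map the names to their definitions yourself.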