I have XML that looks as follows:
<StartTag>
<MyValueTag>And the value itself contains a < bracket that makes the XML invalid</MyValueTag>
</StartTag>
The XML contains a '<' character that makes the XML invalid.
The easiest way would be to fix the source of the XML, but unfortunately I don't have control over how the XML is created. It contains messages like "The value is < than 10", where the < is supposed to mean "less than".
Is there any way I can check the XML for things like this and escape those characters?
I tried looking at this post, where the author suggested using JTidy. But when I tried it, it didn't remove the <:
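import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.w3c.tidy.Tidy;

// data is the String that holds the raw (broken) XML
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setWraplen(Integer.MAX_VALUE);
tidy.setPrintBodyOnly(true);
tidy.setXmlOut(true);
tidy.setSmartIndent(true);
ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
// Parse the input and write the tidied result to outputStream
tidy.parseDOM(inputStream, outputStream);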
Because the XML is not well-formed, you aren't going to be able to use a conforming XML parser to read it and fix it. If you can't get the authors of the software that writes the file to fix the bug, then you will have to come up with some application-specific solution.
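As for checking whether a given message has the problem at all, one option is to let the JDK's own parser reject it. The following is only a sketch: the class name WellFormedCheck and the method isWellFormed are made up for illustration, and it only tells you that the document is broken, not how to repair it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
public class WellFormedCheck {

    /** Returns true if the JDK's DOM parser accepts the text as well-formed XML. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A stray '<' in element text ends up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "<StartTag><MyValueTag>The value is < than 10</MyValueTag></StartTag>";
        System.out.println(isWellFormed(broken));
    }
}
```

A check like this can be used to decide which incoming messages need the application-specific repair step and which can go straight to the normal parser.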
For example, if you knew that the stray < char only occurs in the text of a <MyValueTag> element, and if you knew that no other elements could occur as children of <MyValueTag>, then it would be pretty easy to write a program that recognizes the start and end tags and replaces any < characters that occur between them with &lt;.
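A minimal sketch of that idea in Java, using the <MyValueTag> element from the question. It assumes the element never has attributes, child elements, comments, or CDATA sections, so every < between the start and end tag must be stray text. The class name StrayBracketEscaper and the method escapeInside are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class StrayBracketEscaper {

    /**
     * Escapes every '<' found in the text of the given element.
     * Assumption: the element has no attributes and no markup inside it.
     */
    public static String escapeInside(String xml, String tagName) {
        String tag = Pattern.quote(tagName);
        // Reluctant match from a start tag to the nearest matching end tag.
        Pattern element = Pattern.compile("(<" + tag + ">)(.*?)(</" + tag + ">)", Pattern.DOTALL);
        Matcher m = element.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only the text between the tags is rewritten; the tags are kept as-is.
            String fixedText = m.group(2).replace("<", "&lt;");
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + fixedText + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String broken = "<StartTag>\n<MyValueTag>The value is < than 10</MyValueTag>\n</StartTag>";
        System.out.println(escapeInside(broken, "MyValueTag"));
    }
}
```

With the sample message, the element text becomes "The value is &lt; than 10", and the result can then be handed to a regular XML parser. Text that is already escaped is left alone, because only literal < characters are replaced.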
Of course, if the problem isn't that simple, then the solution won't be that simple; but hopefully, you can make it simpler than solving the general problem for XML.
After you've fixed a few files "by hand," stop and ask yourself, "How did I know that < char needed to be escaped?" Then write a program that operates on that same knowledge.