I'm having a hard time escaping XML so that it can be processed in Java. I'm using JTidy to escape unwanted characters, but I'm struggling to remove "<" and ">" from values such as <tag> capacity < 1000 </tag>.
I'm using the code below to escape the input:
public String CleanXML(String input) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-16");
    tidy.setOutputEncoding("UTF-16");
    tidy.setWraplen(Integer.MAX_VALUE);
    tidy.setXmlOut(true);
    tidy.setSmartIndent(true);
    tidy.setXmlTags(true);
    tidy.setMakeClean(true);
    tidy.setForceOutput(true);
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    StringReader in = new StringReader(input);
    StringWriter out = new StringWriter();
    tidy.parse(in, out);
    return out.toString();
}
Use the following function:
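import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

public String CleanXML(String input) {
    final Matcher matcher = TAG_REGEX.matcher(input);
    String result = input;
    while (matcher.find()) {
        String value = matcher.group(1);
        // Drop everything except letters, digits, and whitespace.
        String valueReplace = value.replaceAll("[^a-zA-Z0-9\\s]", "");
        // Strings are immutable, so the result of replace() must be assigned.
        result = result.replace(value, valueReplace);
    }
    return result;
}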
It uses a regular expression search to find the values between the tags, then removes every character that is not a letter, a digit, or whitespace. The regular expression and the basic idea come from "Java regex to extract text between tags".
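As a rough, self-contained sketch of the same idea, the class below rebuilds the string match by match with Matcher.appendReplacement, so only text inside <tag>...</tag> is touched. The class name TagValueCleaner and the method names stripTagValues and escapeTagValues are made up for illustration. The second method is an assumption on my part, not part of the approach above: it escapes "&", "<" and ">" as XML entities instead of deleting them, which may be closer to what the question needs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class TagValueCleaner {
    private static final Pattern TAG_REGEX =
            Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

    // Removes every character that is not a letter, digit, or whitespace
    // from the text between <tag> and </tag>.
    static String stripTagValues(String input) {
        Matcher matcher = TAG_REGEX.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String cleaned = matcher.group(1).replaceAll("[^a-zA-Z0-9\\s]", "");
            // quoteReplacement keeps '$' and '\' in the value from being
            // interpreted as group references.
            matcher.appendReplacement(out,
                    Matcher.quoteReplacement("<tag>" + cleaned + "</tag>"));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Assumed alternative: escape instead of delete. Note that input which
    // is already escaped (e.g. "&amp;") would be escaped a second time.
    static String escapeTagValues(String input) {
        Matcher matcher = TAG_REGEX.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String escaped = matcher.group(1)
                    .replace("&", "&amp;")   // must run first
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            matcher.appendReplacement(out,
                    Matcher.quoteReplacement("<tag>" + escaped + "</tag>"));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<tag> capacity < 1000 </tag>";
        System.out.println(stripTagValues(xml));  // <tag> capacity  1000 </tag>
        System.out.println(escapeTagValues(xml)); // <tag> capacity &lt; 1000 </tag>
    }
}
```

Both methods only handle the literal element name "tag" and assume the value itself never contains the text "</tag>". The escaped output could then be passed on to JTidy or an XML parser.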