I am writing a function for use in my unit tests. I want to compare XML files, but because one of them is created by a third-party library, I want to ignore any differences caused by indentation. So I wrote the following function:
private String normalizeXML(String xmlString) {
    String res = xmlString.replaceAll("[ \t]+", " ");
    // Leading whitespace is inconsistent in the resulting XML.
    res = res.replaceAll("^\\s+", "");
    return res.trim();
}
However, this function does not remove the leading whitespace on each line of the XML.
When I write the function in this way (difference in the first regex):
private String normalizeXMLs(String xmlString) {
    String res = xmlString.replaceAll("\\s+", " ");
    // Leading whitespace is inconsistent in the resulting XML.
    res = res.replaceAll("^\\s+", "");
    return res.trim();
}
It does remove the indentation, but it also collapses the XML onto a single line, which makes it very hard to see the differences when a comparison fails.
I cannot work out why the first implementation does not remove the leading whitespace. Any ideas?
EDIT: Even more interesting, if I apply only this single replacement:
String res = xmlString.replaceAll("^\\s+", "");
This line does not remove any of the indentation!
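The likely cause is that in a Java regex `^` matches only at the very start of the input unless MULTILINE mode is enabled, for example with the inline flag `(?m)`. If the string starts with `<`, then `^\\s+` has nothing to match and the indentation on later lines is never reached. Note also that `\s` includes line terminators, which is why the `\\s+` variant collapses everything onto one line. A minimal sketch of a per-line strip follows; the class and method names are hypothetical and only for illustration:

```java
public class XmlIndentStripper {

    // Hypothetical helper: removes leading spaces and tabs from every line.
    public static String stripIndentation(String xml) {
        // (?m) enables MULTILINE, so ^ also matches right after each line terminator.
        // [ \t]+ is used instead of \s+ so the line breaks themselves are kept.
        return xml.replaceAll("(?m)^[ \\t]+", "");
    }

    public static void main(String[] args) {
        String xml = "<root>\n    <child>text</child>\n\t<other/>\n</root>";
        System.out.println(stripIndentation(xml));
    }
}
```

This only addresses indentation at the start of each line; differences in attribute order, line wrapping, or whitespace inside elements would still show up in a plain string comparison.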
Rather than manipulating the string representations, it would be safer to use a dedicated XML comparison tool such as XMLUnit, which lets you define exactly which differences are significant and which are not. Modifying XML data with regular expressions is rarely a good idea; use a proper XML parser that knows the rules of well-formed XML.
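If adding a library is not an option, the parser-based idea can be sketched with the DOM classes that ship with the JDK. This is not XMLUnit's API, and the class and method names below are hypothetical. It assumes that whitespace-only text between elements is insignificant, which may not hold for documents with mixed content:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XmlWhitespaceCompare {

    // Hypothetical helper: true if both documents have equal root elements
    // once whitespace-only text nodes (indentation, line breaks) are removed.
    public static boolean equalIgnoringIndentation(String left, String right) {
        Document a = parse(left);
        Document b = parse(right);
        // isEqualNode compares names, attributes (order-independent) and children.
        return a.getDocumentElement().isEqualNode(b.getDocumentElement());
    }

    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Merge adjacent text nodes before inspecting them.
            doc.getDocumentElement().normalize();
            stripBlankText(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Recursively removes text nodes that contain only whitespace.
    private static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }
}
```

In a unit test this could be used as `assertTrue(XmlWhitespaceCompare.equalIgnoringIndentation(expected, actual))`. A dedicated tool such as XMLUnit will usually also report where the documents differ, which this sketch does not.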