java regex removing-whitespace

Remove indentation of xml files


I am writing a function for use in my unit tests. I want to compare XML files, but since one of them is created by a third-party library, I want to rule out any differences caused by indentation. So I wrote the following function:

private String normalizeXML(String xmlString) {
    String res = xmlString.replaceAll("[ \t]+", " ");
    // leading whitespaces are inconsistent in the resulting xmls.
    res = res.replaceAll("^\\s+", "");
    return res.trim();
}

However, this function does not remove the leading whitespace on each line of the XML.

When I write the function this way (the only difference is the first regex):

private String normalizeXMLs(String xmlString) {
    String res = xmlString.replaceAll("\\s+", " ");
    // leading whitespaces are inconsistent in the resulting xmls.
    res = res.replaceAll("^\\s+", "");
    return res.trim();
}

This does get rid of the indentation, but it also collapses the XML into a single line, which makes it very hard to read the differences when a comparison fails.

I cannot work out why the first implementation does not remove the leading whitespace. Any ideas?

EDIT: Even more interesting is what happens if I apply only a single replacement:

String res = xmlString.replaceAll("^\\s+", "");

This line does not remove any of the indentation!
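
A likely explanation for this behaviour: in `java.util.regex`, `^` matches only at the very start of the input unless MULTILINE mode is enabled, for example with the inline flag `(?m)`. If the XML string starts with `<`, then `^\\s+` has nothing to match. The sketch below illustrates this; the class name, method name, and sample XML are hypothetical, and it assumes indentation consists of spaces and tabs only.

```java
class IndentationDemo {

    // Hypothetical helper: strips leading spaces and tabs from every line.
    // (?m) enables MULTILINE mode, so ^ also matches right after each line
    // terminator instead of only at the start of the whole input.
    static String stripIndentation(String xml) {
        return xml.replaceAll("(?m)^[ \\t]+", "");
    }

    public static void main(String[] args) {
        String xml = "<root>\n    <child>text</child>\n\t<other/>\n</root>";

        // Without (?m), ^ anchors only at index 0, where this input has no whitespace,
        // so the string comes back unchanged.
        String unchanged = xml.replaceAll("^\\s+", "");
        System.out.println(unchanged.equals(xml));

        // With (?m), the indentation of every line is removed and line breaks are kept.
        System.out.println(stripIndentation(xml));
    }
}
```

Using `[ \\t]` rather than `\\s` in the multiline pattern keeps the line terminators intact, so the result stays readable line by line.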


Solution

  • Rather than manipulating the string representations, it would be safer to use a dedicated XML comparison tool such as XMLUnit, which lets you define exactly which differences are significant and which are not. Modifying XML data with regular expressions is rarely a good idea; use a proper XML parser that knows all the rules of what makes well-formed XML.
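
To illustrate the parser-based approach without any extra dependency, here is a minimal sketch that uses only the JDK's DOM API. It is not XMLUnit and offers none of its configurability. It assumes that whitespace-only text nodes are insignificant, which may not hold for mixed-content documents. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlCompareSketch {

    // Parses an XML string into a DOM tree; wraps checked exceptions for brevity.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Recursively removes text nodes that contain nothing but whitespace
    // (typically indentation and line breaks between elements).
    static void stripWhitespaceNodes(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceNodes(child);
            }
            child = next;
        }
    }

    // Assumption: two documents count as equal if their element trees match
    // once whitespace-only text nodes have been dropped.
    static boolean sameIgnoringIndentation(String a, String b) {
        Node rootA = parse(a).getDocumentElement();
        Node rootB = parse(b).getDocumentElement();
        stripWhitespaceNodes(rootA);
        stripWhitespaceNodes(rootB);
        rootA.normalize();
        rootB.normalize();
        return rootA.isEqualNode(rootB);
    }

    public static void main(String[] args) {
        String indented = "<root>\n    <child id=\"1\">text</child>\n</root>";
        String compact = "<root><child id=\"1\">text</child></root>";
        System.out.println(sameIgnoringIndentation(indented, compact));
    }
}
```

In a unit test, a boolean check like this only says whether the documents differ, not where. A dedicated tool such as XMLUnit is better suited when a readable description of the differences is needed.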