java xml regex

Regex for quoting unquoted XML attributes


Edit: The 100% correct theory is that you don't want to do this at all. However, I have accepted the answer that helped the most.

I'm being given ugly XML by a client who promises to fix it. In the meantime I need to clean it up myself. I'm looking for a regex to use in Java that adds quotes around unquoted attributes. A general solution would be better, but so far only one attribute is broken, so the regex can refer specifically to "attr1". The value of the attribute is unknown, so I can't include it in the search.

<tag attr1 = VARIABLETEXT>
<tag attr1 = VARIABLETEXT>not quoted</tag>
<tag attr1 = VARIABLETEXT attr2 = "true">
<otherTag>buncha junk</otherTag>
<tag attr1 = "VARIABLETEXT">"quoted"</tag>

Should turn into

<tag attr1 = "VARIABLETEXT">
<tag attr1 = "VARIABLETEXT">not quoted</tag>
<tag attr1 = "VARIABLETEXT" attr2 = "true">
<otherTag>buncha junk</otherTag>
<tag attr1 = "VARIABLETEXT">"quoted"</tag>

EDIT: Thank you very much for telling me not to do what I'm trying to do. However, this isn't some random, anything-goes XML where I'll run into all the "don't do it" issues. I have read the other threads. I'm looking for specific help with a specific hack.


Solution

  • OK, given your constraints, you could:

    Search for

    <tag attr1\s*=\s*([^" >]+)
    

    and replace with

    <tag attr1 = "\1"
    

    So, in Java, that could be (according to RegexBuddy):

    String resultString = subjectString.replaceAll("<tag attr1\\s*=\\s*([^\" >]+)", "<tag attr1 = \"$1\"");
    

    EDIT: Simplified regex a bit more.
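
For anyone who wants to try this quickly, here is a minimal sketch that wraps the same regex in a small helper and runs it over the sample lines from the question. The class name `QuoteAttr1` and the method `quoteAttr1` are made up for illustration; only the pattern and the replacement string come from the answer above. It assumes the unquoted value contains no spaces, no `>` and no double quotes. A single-quoted value would also match and end up wrapped in an extra pair of double quotes, so check that your input has none.

```java
import java.util.regex.Pattern;

public class QuoteAttr1 {
    // Same pattern as in the answer, compiled once so it can be reused.
    // Group 1 captures the unquoted value: a run of characters that are
    // not a double quote, a space, or '>'.
    private static final Pattern UNQUOTED_ATTR1 =
            Pattern.compile("<tag attr1\\s*=\\s*([^\" >]+)");

    // Hypothetical helper: returns the input with unquoted attr1 values quoted.
    // Values that are already double-quoted do not match and are left alone.
    static String quoteAttr1(String xml) {
        return UNQUOTED_ATTR1.matcher(xml).replaceAll("<tag attr1 = \"$1\"");
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "<tag attr1 = VARIABLETEXT>",
                "<tag attr1 = VARIABLETEXT>not quoted</tag>",
                "<tag attr1 = VARIABLETEXT attr2 = \"true\">",
                "<otherTag>buncha junk</otherTag>",
                "<tag attr1 = \"VARIABLETEXT\">\"quoted\"</tag>");
        System.out.println(quoteAttr1(input));
    }
}
```

Note that the replacement also normalizes the spacing around `=` to a single space on each side, so `attr1=VALUE` comes out as `attr1 = "VALUE"`. Running the helper a second time changes nothing, since already-quoted values no longer match.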