java xml tag-soup

TagSoup breaks well-formed XML


While cleaning an XML file with TagSoup I got unexpected results: it orphaned some child elements by closing their parent tag too soon. It also lowercased the parent tag's name.

Before tagsoup:

<Objects>
    <Object>
      <ObjectID>240</ObjectID>
      [...]
      <Status>Not Ready</Status>
      <Title>Some description which includes word/word, 22,000</Title>
      <Url>http://example.com/withquerystring?id=240&amp;other=1&amp;url=http%3A%2F%2Fredirected.example.com%2F40</Url>
      [...]
      <Owner>
        <Name>JOHN MARSHALL, MR</Name>
      </Owner>
    </Object>
    <Object>
      <ObjectID>122</ObjectID>
      [...]

After tagsoup:

<Objects>
    <object>
      <ObjectID>240</ObjectID>
      [...]
      <Status>Not Ready</Status>
    </object>
    <Title>Some description which includes word/word, 22,000</Title>
    <Url>http://example.com/withquerystring?id=240&amp;other=1&amp;url=http%3A%2F%2Fredirected.example.com%2F40</Url>
    [...]
    <Owner>
        <Name>JOHN MARSHALL, MR</Name>
    </Owner>
    <object>
      <ObjectID>122</ObjectID>
      [...]

I'm working on a Java project that uses these classes:

import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

I'm using Java 6.

Any idea why this happens?
For a valid XML file, I would expect the output to be the same file (perhaps with minor details changed, but not the structure), wouldn't it?


Solution

  • TagSoup is intended as an HTML parser whose job is to clean up poor HTML. For tag names that are defined by HTML, TagSoup knows which elements are allowed inside which other elements and will try to correct any that are wrongly nested. Also remember that in HTML, unlike XML, tag names are not case sensitive.

    In this case it seems to have decided that it knows what object and title mean in HTML (an embedded object of some kind and the title of the page, respectively), and it knows that title is not allowed inside object, so it closes object before title begins. ObjectID and Status are not known HTML element names, so it gives them the benefit of the doubt and leaves them alone.
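    If the input is already well-formed XML, one option is to skip TagSoup and use the JDK's own XML parser, which applies no HTML nesting rules. The following is a minimal sketch under that assumption, not a drop-in replacement for the asker's code: the class name XmlRoundTrip, its helper methods, and the shortened SAMPLE string are hypothetical, and only standard javax.xml and org.w3c.dom calls are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XmlRoundTrip {

    // Hypothetical sample, shortened from the input shown in the question.
    static final String SAMPLE =
        "<Objects><Object><ObjectID>240</ObjectID>"
        + "<Status>Not Ready</Status>"
        + "<Title>Some description</Title>"
        + "</Object></Objects>";

    // Parse with the JDK's XML parser; element names stay case sensitive
    // and no HTML-specific nesting rules are applied.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Write the DOM back out with an identity transform.
    static String serialize(Document doc) {
        try {
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Node title = doc.getElementsByTagName("Title").item(0);
        // Title is still a child of Object, with its original casing.
        System.out.println(title.getParentNode().getNodeName()); // prints "Object"
        System.out.println(serialize(doc));
    }
}
```

    Unlike TagSoup, this parser rejects malformed input instead of repairing it, so TagSoup remains the better fit when the source really is messy HTML.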