python parsing beautifulsoup elementtree sgml

Parse self-closing tags missing the '/'


I'm trying to parse some old SGML code using BeautifulSoup4 and build an Element Tree with the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:

<element1>
    <element2 attr="0">
    <element3>Data</element3>
</element1>

When I parse the data, it ends up like:

<element1>
    <element2 attr="0">
        <element3>Data</element3>
    </element2>
</element1>

What I'd like is for the parser to treat such an element as self-closing when it finds no closing tag for it, instead of assuming that everything after it is a child and placing the closing tag as late as possible, like so:

<element1>
    <element2 attr="0"/>
    <element3>Data</element3>
</element1>

Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.


Solution

  • What I ended up doing was extracting from the DTD all empty elements whose end tag can be omitted (e.g. <!ELEMENT elem_name - o EMPTY >), creating a list from those elements, then using a regex to close all the tags in the list. The resulting text is then passed to the XML parser.

    Here's a boiled down version of what I'm doing:

    import re
    from lxml.html import soupparser
    from lxml import etree as ET
    
    empty_tags = ['elem1', 'elem2', 'elem3']
    
    markup = """
    <elem1 attr="some value">
    <elem2/>
    <elem3></elem3>
    """
    
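    # Self-close every listed tag, with or without attributes. An empty
    # open/close pair such as <elem3></elem3> is collapsed as well.
    for t in empty_tags:
        markup = re.sub(r'(<{0}(?:\s+[^>/]*)?)>\s*(?:</{0}>)?\n?'.format(t), r'\1/>\n', markup)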
    
    tree = soupparser.fromstring(markup)
    print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
    

    The output should be:

    <elem1 attr="some value"/>
    <elem2/>
    <elem3/>
    

    (This will actually be enclosed in <html> tags, but the parser adds those in.)

    It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.

    It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.
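
The DTD step described above (collecting the names of EMPTY elements whose end tag may be omitted) could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `empty_elements_from_dtd` and `close_empty_tags` are made up for this example, the DTD is assumed to spell declarations out literally (parameter entities such as `%name;` are not expanded), and tag names are matched case-sensitively. The closing regex is a slight variation on the one above: it uses a lookbehind so that attribute values containing `/` are tolerated, but it still assumes no attribute value contains `>`.

```python
import re

# Matches SGML declarations like <!ELEMENT br - o EMPTY > as well as name
# groups like <!ELEMENT (img | hr) - O EMPTY>. "- O" means the start tag is
# required and the end tag may be omitted.
_EMPTY_DECL = re.compile(
    r'<!ELEMENT\s+(?P<names>\([^)]*\)|[^\s()>]+)\s+-\s+O\s+EMPTY\s*>',
    re.IGNORECASE,
)


def empty_elements_from_dtd(dtd_text):
    """Return the names of EMPTY elements declared in dtd_text, in order."""
    names = []
    for match in _EMPTY_DECL.finditer(dtd_text):
        # Split a name group such as "(img | hr)" into its individual names.
        names.extend(re.findall(r'[^\s()|,&]+', match.group('names')))
    # Drop duplicates while keeping the first-seen order.
    return list(dict.fromkeys(names))


def close_empty_tags(markup, names):
    """Rewrite <name ...> and <name ...></name> as <name .../> for each name."""
    for name in names:
        pattern = r'(<{0}(?:\s[^>]*?)?)(?<!/)>(?:\s*</{0}>)?'.format(re.escape(name))
        markup = re.sub(pattern, r'\1/>', markup)
    return markup


# Hypothetical DTD fragment and markup, for illustration only.
dtd = """
<!ELEMENT doc - - (title, (para|br|img)*)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT br - o EMPTY >
<!ELEMENT (img | hr) - O EMPTY>
"""

names = empty_elements_from_dtd(dtd)
fixed = close_empty_tags('<doc><br><img src="a/b.png">\n<hr></hr><br/></doc>', names)
print(names)
print(fixed)
```

The fixed-up string can then be handed to the parser exactly as in the snippet above. Tags that are already self-closed are skipped because the lookbehind refuses a `>` that directly follows a `/`.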