Tag unrecognized during iterparsing using lxml


I have a really weird problem with lxml. I am trying to parse my XML file with iterparse as follows:

for event, elem in etree.iterparse(input_file, events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                write_in_some_file
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                write_in_some_file

It is pretty simple and works almost perfectly. In short, it goes through my XML file; if an element is a <tuv>, it checks whether the language attribute is 'en' or 'de', then checks whether the <tuv> has a <seg> child, and if so, writes its value into a file.

There is ONE <seg> in the file that seems not to exist: elem.find('seg') returns None for it. It is <seg>! keine Spalten und Ventile</seg>, shown in its context below.

I don't understand why this tag, which looks perfectly fine, causes a problem (I can't use its .text). Note that every other tag is found without issue.

<tu tuid="235084307" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! no gaps and valves</seg>
</tuv>
<tuv xml:lang="de">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! keine Spalten und Ventile</seg>
</tuv>
</tu>

Solution

  • In the lxml docs there is this warning:

    WARNING: During the 'start' event, any content of the element, such as the descendants, following siblings or text, is not yet available and should not be accessed. Only attributes are guaranteed to be set.

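    This likely explains why find() usually works but occasionally returns None: whether the children have already been parsed when the 'start' event is delivered depends on how much of the input the parser has read so far.

    Instead of calling find() on the tuv element to get the seg element, change your "if" statement to match seg on the "end" event.

    You can then use getparent() to read the xml:lang attribute value from the parent tuv.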

    Example ("test.xml" with an additional "tu" element for testing)

    <tus>
        <tu tuid="235084307" datatype="Text">
            <prop type="score">1.67647</prop>
            <prop type="score-zipporah">0.6683</prop>
            <prop type="score-bicleaner">0.7813</prop>
            <prop type="lengthRatio">0.740740740741</prop>
            <tuv xml:lang="en">
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
                <seg>! no gaps and valves</seg>
            </tuv>
            <tuv xml:lang="de">
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
                <seg>! keine Spalten und Ventile</seg>
            </tuv>
        </tu>
        <tu tuid="235084307A" datatype="Text">
            <prop type="score">1.67647</prop>
            <prop type="score-zipporah">0.6683</prop>
            <prop type="score-bicleaner">0.7813</prop>
            <prop type="lengthRatio">0.740740740741</prop>
            <tuv xml:lang="en">
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
                <seg>! no gaps and valves #2</seg>
            </tuv>
            <tuv xml:lang="de">
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
                <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
                <seg>! keine Spalten und Ventile #2</seg>
            </tuv>
        </tu>
    </tus>
    

    Python 3.x

    from lxml import etree
    
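    # Both events are requested as in the question; only "end" events are acted on.
    for event, elem in etree.iterparse("test.xml", events=("start", "end")):

        # On "end" the seg element is complete, so its text is available.
        if elem.tag == "seg" and event == "end":
            # The parent tuv has not ended yet, but its attributes are already set.
            current_lang = elem.getparent().get("{http://www.w3.org/XML/1998/namespace}lang")
            if current_lang == "en":
                print(f"Writing en text \"{elem.text}\" to file...")
            elif current_lang == "de":
                print(f"Writing de text \"{elem.text}\" to file...")
            else:
                print(f"Unable to determine language. Not writing \"{elem.text}\" to any file.")

        # Release fully processed elements to keep memory use low on large files.
        if event == "end":
            elem.clear()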
    

    Printed Output

    Writing en text "! no gaps and valves" to file...
    Writing de text "! keine Spalten und Ventile" to file...
    Writing en text "! no gaps and valves #2" to file...
    Writing de text "! keine Spalten und Ventile #2" to file...
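
A variation that stays closer to the loop in the question is to keep calling find('seg') on the tuv element, but only once its "end" event arrives, when all of its children have been parsed. The sketch below is one way to do that, not the only one. The function name collect_segments, the SAMPLE document, and returning a list instead of writing to files are assumptions made for illustration. The fallback import of the standard library's ElementTree is also an assumption, added only so the sketch runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed above
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs;
    # iterparse, find, get and clear are used the same way in both modules.
    import xml.etree.ElementTree as etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def collect_segments(source, languages=("en", "de")):
    """Return (lang, text) pairs for each <seg> in a <tuv> of a wanted language."""
    pairs = []
    # Only "end" events: when </tuv> is reported, its children are fully parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "tuv":
            continue
        lang = elem.get(XML_LANG)
        seg = elem.find("seg")
        if lang in languages and seg is not None:
            pairs.append((lang, seg.text))
        # Free the finished <tuv> subtree to keep memory use low on large files.
        elem.clear()
    return pairs


# Hypothetical, shortened version of the data from the question.
SAMPLE = b"""<tus>
  <tu tuid="1">
    <tuv xml:lang="en"><prop type="x">p</prop><seg>! no gaps and valves</seg></tuv>
    <tuv xml:lang="de"><prop type="x">p</prop><seg>! keine Spalten und Ventile</seg></tuv>
  </tu>
</tus>"""

if __name__ == "__main__":
    for lang, text in collect_segments(io.BytesIO(SAMPLE)):
        print(f"{lang}: {text}")
```

Only tuv elements are cleared here, and only after their seg child has been read, so the text is never discarded before it is used.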