python xml elementtree iterparse

Processing large XML files when only the attributes of the root's children are relevant


I'm new to XML and Python, and I hope I have phrased my problem correctly:

I have XML files that are about one gigabyte in size. The files look like this:

<test name="LongTestname" result="PASS">
    <step ID="0" step="NameOfStep1" result="PASS">
        Stuff I don't care about
    </step>
    <step ID="1" step="NameOfStep2" result="PASS">
        Stuff I don't care about
    </step>
</test>

For fast analysis, I want to get the name and the result of each step, i.e. of the children of the root element. "Stuff I don't care about" stands for lots of nested elements.

I have already tried the following:

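import xml.etree.ElementTree as ET

# Parses the whole file into memory before anything can be inspected
tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)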

Here I get a memory error because the files are too big.

Then I tried:

stepAndResult = []
for event, elem in ET.iterparse(pathToSteps, events=("start", "end")):
    # Attributes are already available on the "start" event
    if elem.tag == "step" and event == "start":
        stepAndResult.append([elem.attrib['step'], elem.attrib['result'], "System1"])
    elem.clear()

This works but is really slow. I guess it iterates through all elements, which takes a very long time.

Then I found a solution looking like this:

try:
    tree = ET.iterparse(pathToSteps, events=("start", "end"))
    _, root = next(tree)
    print('ROOT:', root.tag)
except:
    print("ERROR: Unable to open and parse file !!!")

for child in root:
    print(child.attrib)

But this prints only the attributes of the first step.

Is there a way to speed up the working solution? Since I'm pretty new to this, I would appreciate a complete example, or a reference with an example that I can use to figure it out myself.


Solution

  • I think you're on the right track with iterparse().

    Maybe try specifying the step element name in the tag argument and only processing "start" events...

    from lxml import etree
    
    for event, elem in etree.iterparse("input.xml", tag="step", events=("start",)):
        print(elem.attrib)
        elem.clear()
    

    EDIT: For some reason I thought you were using lxml and not ElementTree. My answer would require you to switch to lxml.