xml python-3.x elementtree celementtree iterparse

How can I remove XML parts with iterparse with parents included using ElementTree in Python?


I have multiple large XML files that I need to import and iterate through. They all share the same tree structure, which looks something like the sample below; in the real files, Start has more child elements besides Id. What I would like to do is pass in a list of Ids that I know are wrong and remove each matching report from the XML file. One report is everything between a <T> and its closing </T>.

<Header>
        <Header2>
           <Header3>
           <T>
              <Start> 
                <Id>abcd</Id>
              </Start>
           </T>
           <T>
              <Start> 
                <Id>qrlf</Id>
              </Start>
           </T>
           </Header3>
        </Header2>
</Header>

What I have so far:

from xml.etree import cElementTree as ET

file_path = '/path/to/my_xml.xml'
to_remove = []
root = None
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'end':
        if elem.tag == 'Id':
            new_root = elem
            #print([elem.tag for elem in new_root.iter()])
            for elem2 in new_root.iter('Id'):
                id = elem2.text
                if id == 'abcd':
                    print(id)
                    to_remove.append(new_root)
root = elem
for item in to_remove:
    root.remove(item)

The code above doesn't work: root is the whole XML tree starting at Header, and root.remove() only looks at the direct children of the element it is called on. It therefore cannot find the subelement I want to remove, because that element's parent is Header3, not Header.
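A minimal sketch of that failure, assuming a small in-memory copy of the sample structure (the variable names are illustrative only). Element.remove() raises ValueError when the element is not a direct child, and succeeds when it is called on the actual parent:

```python
import xml.etree.ElementTree as ET

# In-memory copy of the sample document from the question.
root = ET.fromstring(
    "<Header><Header2><Header3>"
    "<T><Start><Id>abcd</Id></Start></T>"
    "<T><Start><Id>qrlf</Id></Start></T>"
    "</Header3></Header2></Header>"
)

target = root.find(".//T")  # the first report; its parent is Header3

try:
    root.remove(target)  # Header is not the direct parent of T
    removed_from_root = True
except ValueError:
    removed_from_root = False

# Removing through the real parent works.
parent = root.find(".//Header3")
parent.remove(target)

remaining_ids = [e.text for e in root.iter("Id")]
```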

So the desired output would be:

<Header>
        <Header2>
           <Header3>
           <T>
              <Start> 
                <Id>qrlf</Id>
              </Start>
           </T>
           </Header3>
        </Header2>
</Header>

Going forward, I will need to remove not a single value but thousands of them, so the input will be a list. I used a single value here only because it makes the problem easier to show. Any help is appreciated.


Solution

  • I think you can use

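    import xml.etree.ElementTree as ET

    ids_to_remove = {'abcd'}  # a set keeps membership tests fast

    elements_to_remove = []

    # With no events argument, iterparse reports only 'end' events.
    for event, element in ET.iterparse('file.xml'):
        # findtext returns None instead of raising when Start/Id is missing
        if element.tag == 'T' and element.findtext('Start/Id') in ids_to_remove:
            elements_to_remove.append(element)
        if element.tag == 'Header3':
            # Header3 has ended, so all of its T children are complete.
            for el in elements_to_remove:
                element.remove(el)
                el.clear()
            elements_to_remove.clear()
        if element.tag == 'Header':
            root = element

    ET.dump(root)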
    

    I haven't tested how that works with huge files. It collects all elements to be removed first and only removes them once the end of Header3 is reached. I am not sure whether the ElementTree API offers a way to detach an element directly in the branch that matches the T element, since an element holds no reference to its parent. Perhaps the following, which remembers the parent when it opens, frees the element earlier:

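    import xml.etree.ElementTree as ET

    ids_to_remove = {'abcd', 'baz', 'bar'}  # a set keeps membership tests fast

    for event, element in ET.iterparse('file.xml', events=['start', 'end']):
        # findtext returns None instead of raising when Start/Id is missing
        if event == 'end' and element.tag == 'T' and element.findtext('Start/Id') in ids_to_remove:
            header3.remove(element)  # detach the report from its parent right away
            element.clear()
        if event == 'start' and element.tag == 'Header3':
            header3 = element  # remember the parent as soon as it opens
        if element.tag == 'Header':
            root = element

    ET.dump(root)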
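
The same idea can be sketched without hard-coding the Header3 parent, by keeping a stack of the currently open elements so that the parent of any element that just ended is the top of the stack. This is only a sketch under assumptions: remove_reports, record_tag and id_path are hypothetical names introduced here, and the function still builds the full tree of the reports that are kept, so it does not by itself bound memory use.

```python
import io
import xml.etree.ElementTree as ET


def remove_reports(source, bad_ids, record_tag="T", id_path="Start/Id"):
    """Parse `source` and return its root with matching records detached."""
    bad_ids = set(bad_ids)  # fast lookups even for thousands of ids
    stack = []              # currently open elements, innermost last
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = element  # the first element opened is the root
            stack.append(element)
            continue
        stack.pop()  # drop the element that just ended
        if element.tag != record_tag or not stack:
            continue
        record_id = (element.findtext(id_path) or "").strip()
        if record_id in bad_ids:
            stack[-1].remove(element)  # stack[-1] is the direct parent
    return root


# Usage with an in-memory copy of the sample from the question.
sample = io.BytesIO(
    b"<Header><Header2><Header3>"
    b"<T><Start><Id>abcd</Id></Start></T>"
    b"<T><Start><Id>qrlf</Id></Start></T>"
    b"</Header3></Header2></Header>"
)
cleaned = remove_reports(sample, ["abcd"])
kept_ids = [e.text for e in cleaned.iter("Id")]
```

The result can then be saved with ET.ElementTree(cleaned).write(...), using whatever output path suits the files being processed.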