Is there a way to get only specific data from similar tags inside an XML file?

I have an XML file that contains a lot of data. It is badly formatted, with multiple values packed into a single attribute.

<Person> 
    <GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
   </GenericItem>
    <GenericItem string="Hardworking and leader of the chicken division">
Summary
    </GenericItem>
    <GenericItem link ="person.com/john01">
Profile
    </GenericItem>
 </Person>
<Person> 
    <GenericItem html="Name:Anna&lt;br/&gt;ID: ID-002&lt;br/&gt;Position: Fryer&lt;a href=&quot;mailto: anna@person.com&quot;&gt;anna@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
   </GenericItem>
    <GenericItem string="Chicken fryer of the month">
Summary
    </GenericItem>
    <GenericItem link ="person.com/anna02">
Profile
    </GenericItem>
 </Person>
<Person> 
    <GenericItem html="Name:Kent&lt;br/&gt;ID: ID-003&lt;br/&gt;Position: Cleaner&lt;a href=&quot;mailto: kent@person.com&quot;&gt;kent@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-02">
Employee:
   </GenericItem>
    <GenericItem string="chicken and office cleaner">
Summary
    </GenericItem>
    <GenericItem link ="person.com/kent03">
Profile
    </GenericItem>
 </Person>

This is not all of the data, as the full file would be too much to post. What I want to get is just the "Name", "ID", and "Position". Everything else inside that GenericItem is not needed and should be removed, and the GenericItem elements with the "string" and "link" attributes are useless, so I want to delete them. I tried using ElementTree's del, but it does not remove either of them.

import xml.etree.ElementTree as ET
tree = ET.parse('NewestReport.xml')
root = tree.getroot()
for GenericItem in tree.findall('GenericItem'):
    del(GenericItem.attrib['string'])
for neighbor in root.iter('GenericItem'):
    print(neighbor.attrib)

Is there any other method I can try?

Solution

You need to HTML-parse the attribute values.

Your best bet is switching from the built-in ElementTree to lxml, because that includes both an XML and an HTML parser, and proper XPath support.
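As a quick illustration of the HTML parser part, here is a minimal sketch that parses a single attribute value on its own. The string is the first Person's @html value after XML unescaping, hardcoded here purely for demonstration.

```python
from lxml import etree as ET

# First @html value from the sample input, after XML unescaping (hardcoded for the demo).
value = ('Name:John<br/>ID: ID-001<br/>Position: Manager'
         '<a href="mailto: john@person.com">john@person.com</a>'
         '<br/>Division: chicken-01')

fragment = ET.fromstring(value, ET.HTMLParser())

# Every text node in the parsed fragment, in document order.
texts = [t.strip() for t in fragment.xpath('.//text()')]
print(texts)

# Only the "key: value" style text nodes; the e-mail link text has no colon.
fields = [t.strip() for t in fragment.xpath('.//text()[contains(., ":")]')]
print(fields)
```

The second list is what the colon filter keeps: the link text is dropped, while Division is still included.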

Below, I'm parsing your test input as XML, and each @html attribute separately as HTML. After that, picking the text nodes that contain a ':' seems to be a good first approximation; it skips the e-mail link text but still keeps Division, so filter by key if you only want some fields. Of course you can dissect the HTML tree differently.

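from lxml import etree as ET

# One HTML parser instance, reused for every @html attribute value
html_parser = ET.HTMLParser()

# The outer document is parsed as regular XML
tree = ET.parse('test.xml')

# Each <Person> directly under the root element
for person in tree.xpath('./Person'):
    print('-' * 40)
    for html in person.xpath('./GenericItem/@html'):
        # The attribute value is itself HTML, so parse it separately
        data = ET.fromstring(html, html_parser)
        # Keep only the text nodes that look like "key: value"
        for text in data.xpath('.//text()[contains(., ":")]'):
            print(text.strip())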

prints

----------------------------------------
Name:John
ID: ID-001
Position: Manager
Division: chicken-01
----------------------------------------
Name:Anna
ID: ID-002
Position: Fryer
Division: chicken-01
----------------------------------------
Name:Kent
ID: ID-003
Position: Cleaner
Division: chicken-02
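
If you only want "Name", "ID", and "Position", and also want the "string" and "link" items gone, the same idea can be extended roughly like this. This is a sketch under assumptions, not a definitive implementation: the input is inlined and shortened to one Person, the People wrapper element is made up (an XML document needs a single root), and WANTED and people are names chosen for illustration.

```python
from lxml import etree as ET

# Shortened sample input. The <People> wrapper is an assumption, since an
# XML document needs exactly one root element.
XML = b'''<People>
  <Person>
    <GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">Employee:</GenericItem>
    <GenericItem string="Hardworking and leader of the chicken division">Summary</GenericItem>
    <GenericItem link="person.com/john01">Profile</GenericItem>
  </Person>
</People>'''

WANTED = ('Name', 'ID', 'Position')  # the fields named in the question
html_parser = ET.HTMLParser()
root = ET.fromstring(XML)

people = []
for person in root.xpath('./Person'):
    # Remove whole GenericItem elements that carry a "string" or "link" attribute.
    for item in person.xpath('./GenericItem[@string or @link]'):
        person.remove(item)

    record = {}
    for html in person.xpath('./GenericItem/@html'):
        data = ET.fromstring(html, html_parser)
        for text in data.xpath('.//text()[contains(., ":")]'):
            # Split "key: value" at the first colon and keep only wanted keys.
            key, _, value = text.partition(':')
            if key.strip() in WANTED:
                record[key.strip()] = value.strip()
    people.append(record)

print(people)
```

Note the difference from the attempt in the question: del on .attrib only deletes an attribute (and raises KeyError on items that lack it), and findall('GenericItem') only looks at direct children of the root. Calling remove() on the parent element is what deletes the element itself.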