I have an XML file that contains a lot of data. It is in a very bad format, with multiple values packed inside one attribute.
<Person>
<GenericItem html="Name:John<br/>ID: ID-001<br/>Position: Manager<a href="mailto: john@person.com">john@person.com</a><br/>Division: chicken-01">
Employee:
</GenericItem>
<GenericItem string="Hardworking and leader of the chicken division">
Summary
</GenericItem>
<GenericItem link ="person.com/john01">
Profile
</GenericItem>
</Person>
<Person>
<GenericItem html="Name:Anna<br/>ID: ID-002<br/>Position: Fryer<a href="mailto: anna@person.com">anna@person.com</a><br/>Division: chicken-01">
Employee:
</GenericItem>
<GenericItem string="Chicken fryer of the month">
Summary
</GenericItem>
<GenericItem link ="person.com/anna02">
Profile
</GenericItem>
</Person>
<Person>
<GenericItem html="Name:Kent<br/>ID: ID-003<br/>Position: Cleaner<a href="mailto: kent@person.com">kent@person.com</a><br/>Division: chicken-02">
Employee:
</GenericItem>
<GenericItem string="chicken and office cleaner">
Summary
</GenericItem>
<GenericItem link ="person.com/kent03">
Profile
</GenericItem>
</Person>
This is not all of the data, as the full file would be too much to post. What I want to get is just the "Name", "ID", and "Position". That means anything else inside the GenericItem html attribute is not needed and should be removed. The GenericItem elements with a "string" or "link" attribute are useless to me, and I want to delete them. I tried using ElementTree's del, but it does not remove either of them.
import xml.etree.ElementTree as ET
tree = ET.parse('NewestReport.xml')
root = tree.getroot()
for GenericItem in tree.findall('GenericItem'):
    del(GenericItem.attrib['string'])
for neighbor in root.iter('GenericItem'):
    print(neighbor.attrib)
Is there any other method I can try?
You need to HTML-parse the attribute values.
Your best bet is switching from the built-in ElementTree to lxml, because that includes both an XML and an HTML parser, and proper XPath support.
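As a rough illustration of what full XPath support buys you for the deletion part of the question, the sketch below drops every GenericItem that carries a string or link attribute. The inline sample is a cut-down, hypothetical version of your file: it assumes a wrapping Root element and that the markup inside the html attribute is entity-escaped, since the file would not be well-formed XML otherwise.

```python
from lxml import etree as ET

# Cut-down sample. Assumptions (not confirmed by the question): a <Root>
# wrapper exists and the markup inside @html is entity-escaped.
SAMPLE = b"""<Root>
<Person>
<GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager">
Employee:
</GenericItem>
<GenericItem string="Hardworking and leader of the chicken division">
Summary
</GenericItem>
<GenericItem link="person.com/john01">
Profile
</GenericItem>
</Person>
</Root>"""

root = ET.fromstring(SAMPLE)

# One XPath expression selects every GenericItem that has @string or @link.
for item in root.xpath('//GenericItem[@string or @link]'):
    # An element is removed through its parent, not with del on .attrib.
    item.getparent().remove(item)

remaining = root.xpath('//GenericItem')
print(len(remaining), list(remaining[0].attrib))
```

Two things are worth noting about the original attempt: del on GenericItem.attrib['string'] would only delete the attribute and leave the element in place, and findall('GenericItem') only looks at direct children of the element it is called on, so it never reaches items nested inside Person.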
Here I'm parsing your test input as XML, and each @html attribute separately as HTML. After that, picking the text nodes that contain a ':' seems to be a good first approximation. Of course you can dissect the HTML tree differently.
from lxml import etree as ET

html_parser = ET.HTMLParser()
tree = ET.parse('test.xml')

for person in tree.xpath('./Person'):
    print('-' * 40)
    # each @html value is an HTML fragment stored as a plain string
    for html in person.xpath('./GenericItem/@html'):
        data = ET.fromstring(html, html_parser)
        # keep only the text nodes that look like "key: value"
        for text in data.xpath('.//text()[contains(., ":")]'):
            print(text.strip())
prints
----------------------------------------
Name:John
ID: ID-001
Position: Manager
Division: chicken-01
----------------------------------------
Name:Anna
ID: ID-002
Position: Fryer
Division: chicken-01
----------------------------------------
Name:Kent
ID: ID-003
Position: Cleaner
Division: chicken-02
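If only Name, ID and Position are wanted, one possible follow-up is to split each text node at its first colon and keep just those keys. This is a sketch under assumptions, not part of the code above: the WANTED set, the extract_fields helper and the inline sample (with an assumed Root wrapper and entity-escaped @html value) are made up for illustration.

```python
from lxml import etree as ET

WANTED = {'Name', 'ID', 'Position'}  # hypothetical filter, taken from the question

# Assumed shape of the file: a <Root> wrapper and entity-escaped markup in @html.
SAMPLE = b"""<Root>
<Person>
<GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
</GenericItem>
</Person>
</Root>"""

html_parser = ET.HTMLParser()


def extract_fields(html):
    """Return the wanted 'key: value' pairs found in one @html value."""
    fields = {}
    data = ET.fromstring(html, html_parser)
    for text in data.xpath('.//text()[contains(., ":")]'):
        key, _, value = text.partition(':')  # split at the first colon only
        if key.strip() in WANTED:
            fields[key.strip()] = value.strip()
    return fields


root = ET.fromstring(SAMPLE)
people = [extract_fields(html)
          for html in root.xpath('./Person/GenericItem/@html')]
print(people)
```

Collecting the values into one dict per person, rather than printing them, makes it straightforward to write them out afterwards in whatever cleaner format is needed.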