
Is there a way to extract only specific data from similar tags in an XML file?


I have an XML file that contains a lot of data. It is badly formatted: a single attribute holds multiple values.

<Person> 
    <GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
   </GenericItem>
    <GenericItem string="Hardworking and leader of the chicken division">
Summary
    </GenericItem>
    <GenericItem link ="person.com/john01">
Profile
    </GenericItem>
 </Person>
<Person> 
    <GenericItem html="Name:Anna&lt;br/&gt;ID: ID-002&lt;br/&gt;Position: Fryer&lt;a href=&quot;mailto: anna@person.com&quot;&gt;anna@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">
Employee:
   </GenericItem>
    <GenericItem string="Chicken fryer of the month">
Summary
    </GenericItem>
    <GenericItem link ="person.com/anna02">
Profile
    </GenericItem>
 </Person>
<Person> 
    <GenericItem html="Name:Kent&lt;br/&gt;ID: ID-003&lt;br/&gt;Position: Cleaner&lt;a href=&quot;mailto: kent@person.com&quot;&gt;kent@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-02">
Employee:
   </GenericItem>
    <GenericItem string="chicken and office cleaner">
Summary
    </GenericItem>
    <GenericItem link ="person.com/kent03">
Profile
    </GenericItem>
 </Person>

This is not all of the data, as the full file would be too much to show. What I want to keep is just the "Name", "ID", and "Position". Everything else inside that GenericItem is not needed and should be removed. The GenericItem elements with the "string" and "link" attributes are useless, and I want to delete them. I tried using del with ElementTree, but it does not remove either of them.

import xml.etree.ElementTree as ET
tree = ET.parse('NewestReport.xml')
root = tree.getroot()
for GenericItem in tree.findall('GenericItem'):
    del(GenericItem.attrib['string'])
for neighbor in root.iter('GenericItem'):
    print(neighbor.attrib)

Is there another method I can try?


Solution

  • You need to HTML-parse the attribute values.

    Your best bet is switching from the built-in ElementTree to lxml, because that includes both an XML and an HTML parser, and proper XPath support.

    Here I'm parsing your test input as XML, and each @html attribute separately as HTML. This assumes the Person elements are wrapped in a single root element, which a well-formed XML file requires. After that, picking the text nodes that contain a ':' seems to be a good first approximation. Of course you can dissect the HTML tree differently.

    from lxml import etree as ET
    
    html_parser = ET.HTMLParser()
    
    tree = ET.parse('test.xml')
    
    for person in tree.xpath('./Person'):
        print('-' * 40)
        for html in person.xpath('./GenericItem/@html'):
            data = ET.fromstring(html, html_parser)
            for text in data.xpath('.//text()[contains(., ":")]'):
                print(text.strip())
    

    prints

    ----------------------------------------
    Name:John
    ID: ID-001
    Position: Manager
    Division: chicken-01
    ----------------------------------------
    Name:Anna
    ID: ID-002
    Position: Fryer
    Division: chicken-01
    ----------------------------------------
    Name:Kent
    ID: ID-003
    Position: Cleaner
    Division: chicken-02
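
The output above still includes Division, and the GenericItem elements carrying "string" and "link" are left in the tree. The attempt in the question finds nothing because `tree.findall('GenericItem')` only looks at direct children of the root, and `del item.attrib['string']` would remove an attribute, not the element. Below is a minimal sketch of one way to do both clean-up steps. It swaps in the standard library (`xml.etree.ElementTree` plus a regular expression) instead of lxml's HTML parser, so it only suits HTML as simple as the sample. The `<People>` root element, the `WANTED` tuple, and the helper names `extract_fields` and `slim_down` are assumptions made for illustration, not part of the original file.

```python
import re
import xml.etree.ElementTree as ET

# Labels to keep from the html attribute (assumption: exact label spelling).
WANTED = ("Name", "ID", "Position")

# One record from the question, wrapped in a hypothetical <People> root.
SAMPLE = """<People>
  <Person>
    <GenericItem html="Name:John&lt;br/&gt;ID: ID-001&lt;br/&gt;Position: Manager&lt;a href=&quot;mailto: john@person.com&quot;&gt;john@person.com&lt;/a&gt;&lt;br/&gt;Division: chicken-01">Employee:</GenericItem>
    <GenericItem string="Hardworking and leader of the chicken division">Summary</GenericItem>
    <GenericItem link="person.com/john01">Profile</GenericItem>
  </Person>
</People>"""


def extract_fields(html_value, wanted=WANTED):
    """Return {label: value} for the wanted labels in one html attribute value."""
    # Drop the <a ...>...</a> link entirely, then split the rest on <br/> tags.
    without_links = re.sub(r"<a\b.*?</a>", "", html_value, flags=re.S)
    fields = {}
    for chunk in re.split(r"<br\s*/?>", without_links):
        label, sep, value = chunk.partition(":")
        if sep and label.strip() in wanted:
            fields[label.strip()] = value.strip()
    return fields


def slim_down(root):
    """Keep only the wanted fields; remove GenericItems without an html attribute."""
    for person in root.iter("Person"):
        # Iterate over a copy, because elements are removed inside the loop.
        for item in list(person.findall("GenericItem")):
            if "html" in item.attrib:
                fields = extract_fields(item.attrib["html"])
                item.set("html", "<br/>".join(f"{k}: {v}" for k, v in fields.items()))
            else:
                # Elements are removed through their parent, not with del.
                person.remove(item)
    return root


if __name__ == "__main__":
    root = slim_down(ET.fromstring(SAMPLE))
    print(ET.tostring(root, encoding="unicode"))
```

For a real file, `ET.parse('test.xml').getroot()` could be passed to `slim_down` in place of the parsed sample, followed by `tree.write(...)` to save the result. If the embedded HTML is messier than in the sample, parsing each attribute with lxml's HTML parser as shown above is likely the more robust choice.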