
Parsing XML with cElementTree


I have been tasked with rewriting some old XML parsing code in Python, and I stumbled onto the joy that is cElementTree. I love it because I can do so much in so few lines.

My experience with XPath is not extensive, and this question is about drilling further down into the structure.

I have this in test.xml:

<?xml version="1.0"?>
   <ownershipDocument>
     <issue>
         <ic>0000030305</ic>
         <iname>DUCOMM</iname>
         <its>DCP</its>
     </issue>
     <ndt>
         <ndtran>
             <tc>
                 <tft>4</tft>
                 <tc>P</tc>
                 <esi>0</esi>
             </tc>
         </ndtran>
         <ndtran>
             <tc>
                 <tft>4</tft>
                 <tc>P</tc>
                 <esi>0</esi>
             </tc>
          </ndtran>
     </ndt>
 </ownershipDocument>

I wrote this script in Python:

import xml.etree.cElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print root.tag
print root.attrib
for child in root:
    print(child.tag, child.attrib)

for issue in root.findall('issue'):
    ic = issue.find('ic').text
    iname= issue.find('iname').text
    print(ic,iname)

This gives me :

ownershipDocument
{}
('issue', {})
('ndt', {})
('0000030305', 'DUCOMM')

That successfully gets me the info I need in the "issue".

The problem is that I need to access multiple "ndtran" nodes (inside the "ndt" node). While parsing I can extract the "tft", "tc" and "esi" values as groups, but I need to iterate over each outer "tc" node, extract its "tft", "tc" and "esi" values, insert them into a database, and then move on to the next "tc" node and do it again.

This is what I tried in order to iterate over each of them:

for tc in root.findall("./ndt/ndtran/tc"):
    tft = tc.find('tft').text
    tc = tc.find('tc').text
    esi = tc.find('esi').text
    print(tft,tc,esi)

This almost gets me there (I think), but it raises an error:

esi = tc.find('esi').text
AttributeError: 'int' object has no attribute 'text'

I hope that makes sense. I believe what I am after is the DOM parsing methodology which is fine since these documents aren't that big.

I appreciate any advice or pointers in the right direction.


Solution

  • You rebound the loop variable `tc` to a string on the previous line, so it no longer refers to the element:

    for tc in root.findall("./ndt/ndtran/tc"):
        tft = tc.find('tft').text
        tc = tc.find('tc').text
       #^^ use different variable name here
        esi = tc.find('esi').text
             #^^ at this point, `tc` is no longer referencing the outer <tc> elements
    

    Coincidentally, Python strings also have a find() method, which returns an int (-1) when the substring is not found; hence the 'int' object has no attribute 'text' error.
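
For reference, here is a minimal sketch of the loop with the name clash removed. It is written for Python 3 and imports xml.etree.ElementTree rather than cElementTree (the latter was deprecated in Python 3.3 and removed in 3.9). The names XML, extract_rows, tc_elem and tc_code are made up for illustration, and the document is a trimmed copy of test.xml inlined as a string so the example runs on its own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of test.xml from the question, inlined so the sketch is self-contained.
XML = """<?xml version="1.0"?>
<ownershipDocument>
  <ndt>
    <ndtran>
      <tc><tft>4</tft><tc>P</tc><esi>0</esi></tc>
    </ndtran>
    <ndtran>
      <tc><tft>4</tft><tc>P</tc><esi>0</esi></tc>
    </ndtran>
  </ndt>
</ownershipDocument>"""


def extract_rows(root):
    """Return one (tft, tc, esi) tuple of strings per outer <tc> element."""
    rows = []
    for tc_elem in root.findall("./ndt/ndtran/tc"):
        tft = tc_elem.find("tft").text
        # The inner <tc> value gets its own name, so tc_elem keeps
        # pointing at the outer element for the next lookup.
        tc_code = tc_elem.find("tc").text
        esi = tc_elem.find("esi").text
        rows.append((tft, tc_code, esi))
    return rows


root = ET.fromstring(XML)
rows = extract_rows(root)
for row in rows:
    print(row)

# Why the original raised AttributeError: after the rebinding, `tc` was the
# string "P", and str.find returns the int -1 when the substring is absent.
missing = "P".find("esi")
```

Each tuple in rows corresponds to one outer "tc" node, so the database insert mentioned in the question could go inside the loop, or be run over rows afterwards.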