I have been tasked with rewriting some old XML parsing code in Python, and I stumbled into the joy that is cElementTree. I love it because I can do so much in so few lines.
My experience with XPath is not that extensive, and this question is more about drilling further down the structure.
I have this in test.xml:
<?xml version="1.0"?>
<ownershipDocument>
  <issue>
    <ic>0000030305</ic>
    <iname>DUCOMM</iname>
    <its>DCP</its>
  </issue>
  <ndt>
    <ndtran>
      <tc>
        <tft>4</tft>
        <tc>P</tc>
        <esi>0</esi>
      </tc>
    </ndtran>
    <ndtran>
      <tc>
        <tft>4</tft>
        <tc>P</tc>
        <esi>0</esi>
      </tc>
    </ndtran>
  </ndt>
</ownershipDocument>
I wrote this script in Python :
import xml.etree.cElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()
print root.tag
print root.attrib
for child in root:
    print(child.tag, child.attrib)
for issue in root.findall('issue'):
    ic = issue.find('ic').text
    iname = issue.find('iname').text
    print(ic, iname)
This gives me :
ownershipDocument
{}
('issue', {})
('ndt', {})
('0000030305', 'DUCOMM')
That successfully gets me the info I need in the "issue".
The problem is that I need to access multiple "ndtran" nodes (in the "ndt" node). While parsing, I can extract the "tft", "tc" and "esi" values as groups, but I need to iterate over each "tc" node, extract its "tft", "tc" and "esi" values, insert them into a database, and then move on to the next "tc" node and do it again.
What I tried to use to iterate over each of these was this :
for tc in root.findall("./ndt/ndtran/tc"):
    tft = tc.find('tft').text
    tc = tc.find('tc').text
    esi = tc.find('esi').text
    print(tft,tc,esi)
This almost gets me there (I think), but it raises an error:
esi = tc.find('esi').text
AttributeError: 'int' object has no attribute 'text'
I hope that makes sense. I believe what I am after is the DOM parsing methodology which is fine since these documents aren't that big.
I appreciate any advice or pointers in the right direction.
You were rebinding the loop variable tc to a string on the previous line:
for tc in root.findall("./ndt/ndtran/tc"):
    tft = tc.find('tft').text
    tc = tc.find('tc').text
    #^^ use different variable name here
    esi = tc.find('esi').text
    #^^ at this point, `tc` is no longer referencing the outer <tc> elements
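A minimal sketch of the corrected loop, assuming Python 3's xml.etree.ElementTree and the same document as test.xml in the question (inlined here as a string so the snippet runs on its own). The names XML, extract_rows, tc_node and tc_value are illustrative choices, not from the original code:

```python
import xml.etree.ElementTree as ET

# Same structure as test.xml in the question, inlined for self-containment.
XML = """<?xml version="1.0"?>
<ownershipDocument>
  <issue><ic>0000030305</ic><iname>DUCOMM</iname><its>DCP</its></issue>
  <ndt>
    <ndtran><tc><tft>4</tft><tc>P</tc><esi>0</esi></tc></ndtran>
    <ndtran><tc><tft>4</tft><tc>P</tc><esi>0</esi></tc></ndtran>
  </ndt>
</ownershipDocument>"""


def extract_rows(root):
    """Return one (tft, tc, esi) tuple per outer <tc> element."""
    rows = []
    for tc_node in root.findall("./ndt/ndtran/tc"):
        tft = tc_node.find("tft").text
        # Distinct name, so tc_node keeps pointing at the outer <tc> element.
        tc_value = tc_node.find("tc").text
        esi = tc_node.find("esi").text
        rows.append((tft, tc_value, esi))
    return rows


root = ET.fromstring(XML)
rows = extract_rows(root)
for row in rows:
    print(row)  # ('4', 'P', '0') for each <ndtran>
```

Each tuple in rows could then be handed to whatever database insert you use before moving on to the next node.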
By an interesting coincidence, Python strings also have a find() method, which returns an int (-1 when the substring is not found), hence the 'int' object has no attribute 'text' error.
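To see where the int comes from, here is a small sketch that reproduces the error without any XML at all. It assumes only that tc.find('tc').text returned the string 'P', as it would for the sample document; the variable names are illustrative:

```python
value = "P"                  # what tc.find('tc').text yields for the sample XML
result = value.find("esi")   # this is str.find, not Element.find
print(type(result).__name__, result)  # int -1

try:
    result.text              # an int has no .text attribute
except AttributeError as exc:
    message = str(exc)
    print(message)           # 'int' object has no attribute 'text'
```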