I have an XML file in which I want to count the tags named 'neighbor'. To be more specific, I want to count only the neighbor tags that are direct children of any of the country tags.
Here are the contents of my xml file:
<?xml version="1.0"?>
<data>
    <country name="Austria">
        <rank>1</rank>
        <year>2008</year>
        <neighbor name="Liechtenstein"/>
        <neighbor name="Switzerland"/>
        <neighbor name="Italy"/>
    </country>
    <country name="Iceland">
        <hasnoneighbors/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia"/>
        <someothertag>
            <neighbor name="Germany"/>
        </someothertag>
    </country>
    <neighbor name="Jupiter"/>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <neighbor name="Costa Rica"/>
        <neighbor name="Colombia"/>
        <country name="SubCountry">
            <rank>12</rank>
            <year>2023</year>
            <neighbor name="NeighborOfSubCountry"/>
        </country>
    </country>
</data>
The expected result should be 7. Germany and Jupiter should be left out of the total of 9 tags.
I've written the following piece of code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
totalneighbors = 0
neighborlist = []
for country in root.iter('country'):
    print(f'Country {country.attrib["name"]} contains these neighbors:')
    for index, neighbor in enumerate(country.findall('neighbor')):
        neighborname = neighbor.attrib['name']
        print(f'neighbor no {index+1}, with name {neighbor.attrib["name"]}')
        neighborlist.append(neighbor.attrib['name'])
    print(f"total for this country is {index+1}\n")
    totalneighbors += index+1
print(f'total nr of neighbors in country-nodes is {totalneighbors} according to index-counting')
print(f"but the neighborlist says it's {len(neighborlist)}")
I wanted to count the tags with Python's enumerate, but it's giving me the wrong result (10 instead of 7). I put another way of counting in the code, by adding the 'findall' results to a list and then using the length of that list. This does give me the correct number.
After adding some print statements to the code, I figured out where things go wrong: Iceland has no neighbors, but the print statement still shows a total of 3. It looks as if the index from the previous loop was never reset, and it just uses that value again, even though 'findall' should find nothing.
So my question is: What am I doing wrong? Why does 'enumerate' not give me 0 when 'findall' finds nothing? Am I using it wrong? Or is it just not possible when combined with an empty search result?
I hope someone can clarify what's going wrong here.
The problem lies in Iceland not having a neighbor, as you said. The first country has three neighbors, so index has the value 2 after the first inner for loop finishes. For Iceland the loop body never executes, because findall returns an empty list, so enumerate never assigns a new value to index. Python does not reset loop variables between loops, so index still holds the value left over from the previous country, and index+1 adds 3 again.
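The effect can be reproduced without any XML. Here is a minimal sketch (the list contents and variable names are made up for illustration) showing that a for loop over an empty iterable leaves its loop variables untouched:

```python
# Hypothetical data: the first "country" has three neighbors, the second has none.
neighbors_per_country = [["a", "b", "c"], []]

seen = []
for neighbors in neighbors_per_country:
    for index, name in enumerate(neighbors):
        pass  # the loop body does not matter here
    # After the empty inner loop, index still holds 2 from the previous country.
    seen.append(index)

print(seen)  # [2, 2]
```

If the very first country had no neighbors, index would not exist yet and reading it would raise a NameError instead.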
You can set index to -1 before the inner for loop. Then index+1 is 0 for a country without neighbors, so nothing is added to totalneighbors and your code gives the expected result:
# ...
    print(f'Country {country.attrib["name"]} contains these neighbors:')
    index = -1  # reset, so a country without neighbors adds 0
    for index, neighbor in enumerate(country.findall('neighbor')):
        # remainder of the code
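If you only need the count, another option is to skip enumerate and take the length of the list that findall returns. This is a sketch rather than a drop-in replacement: the XML is a shortened copy of the file from the question (rank and year tags removed), inlined as a string so the example runs on its own.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, inlined for illustration.
XML = """<data>
  <country name="Austria">
    <neighbor name="Liechtenstein"/>
    <neighbor name="Switzerland"/>
    <neighbor name="Italy"/>
  </country>
  <country name="Iceland">
    <hasnoneighbors/>
  </country>
  <country name="Singapore">
    <neighbor name="Malaysia"/>
    <someothertag>
      <neighbor name="Germany"/>
    </someothertag>
  </country>
  <neighbor name="Jupiter"/>
  <country name="Panama">
    <neighbor name="Costa Rica"/>
    <neighbor name="Colombia"/>
    <country name="SubCountry">
      <neighbor name="NeighborOfSubCountry"/>
    </country>
  </country>
</data>"""

root = ET.fromstring(XML)

totalneighbors = 0
for country in root.iter('country'):
    # findall returns a list, so len() is 0 for a country without neighbors.
    count = len(country.findall('neighbor'))
    print(f'{country.attrib["name"]}: {count}')
    totalneighbors += count

print(totalneighbors)  # 7
```

Because the count is computed fresh for every country, there is no leftover state from the previous iteration.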
But overall, I recommend using the lxml package and XPath.
You can find the docs here: https://lxml.de/parsing.html
For your purpose, a path expression is the simplest option, because a single query selects exactly the neighbor tags you want. You can find more information on XPath here: https://www.w3schools.com/xml/xpath_intro.asp
The code using lxml would look something like this:
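from lxml import etree

tree = etree.parse("/path/to/file.xml")  # parse() returns an ElementTree, not the root element
# Matches every neighbor that is a direct child of a country, at any depth
neighbors = tree.findall(".//country/neighbor")
print(len(neighbors))  # 7 for the file in the question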
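If installing lxml is not an option, the standard library's xml.etree.ElementTree understands the same path expression. The sketch below swaps in ElementTree for lxml and uses a reduced, inlined version of the question's XML that keeps only the tricky cases (a neighbor nested in another tag, a neighbor outside any country, and a nested country):

```python
import xml.etree.ElementTree as ET

# Reduced version of the question's XML, inlined for illustration.
XML = """<data>
  <country name="Singapore">
    <neighbor name="Malaysia"/>
    <someothertag><neighbor name="Germany"/></someothertag>
  </country>
  <neighbor name="Jupiter"/>
  <country name="Panama">
    <neighbor name="Colombia"/>
    <country name="SubCountry">
      <neighbor name="NeighborOfSubCountry"/>
    </country>
  </country>
</data>"""

root = ET.fromstring(XML)

# ".//country" selects country elements at any depth below the root;
# "/neighbor" then keeps only their direct neighbor children.
neighbors = root.findall(".//country/neighbor")
names = [n.get("name") for n in neighbors]

print(len(neighbors))
print(names)
```

Germany and Jupiter are skipped because their parent is not a country, while the neighbor of the nested SubCountry is still counted.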
Hope this helps.