Python lxml iterparse() is skipping the first event


I am using iterparse() from Python's lxml to parse a large XML file and extract the relevant data. This works fine, except that the data for the first node is not captured. The same thing happens when I filter on the tag "way" instead (not shown in this snippet). Why is the first element skipped?

tree = etree.iterparse(state_file_xml, events=("start", "end"),tag=('node'))

context = iter(tree)

event, root = context.next()

nodes = {}
for event, elem in context:

    if ((event == 'end') and (elem.tag == 'node')) :
        id = elem.get("id")
        lat = float(elem.get("lat"))
        lon = float(elem.get("lon"))
        nodes[id] = [lat,lon]

My XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="Overpass API 0.7.55.4 3079d8ea">
<note>The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
<meta osm_base="2018-11-09T21:23:02Z"/>

  <way id="46916568">
      <nd ref="286427634"/>
      <nd ref="3371562694"/>
      <nd ref="3371562693"/>
      <nd ref="1044837456"/>
      <nd ref="1299487829"/>
      <nd ref="1299487860"/>
      <nd ref="284132018"/>
      <tag k="highway" v="secondary"/>
      <tag k="lit" v="yes"/>
      <tag k="maxspeed" v="50"/>
      <tag k="name" v="Zürcherstrasse"/>
      <tag k="surface" v="asphalt"/>
  </way>

  <node id="30228243" lat="47.4030908" lon="8.4049015"/>
  <node id="283533527" lat="47.4016971" lon="8.4036696"/>
  <node id="284132018" lat="47.4034413" lon="8.4042634"/>
  <node id="286427571" lat="47.4037481" lon="8.4058661"/>
  <node id="286427634" lat="47.4043045" lon="8.4032429"/>
  <node id="318217124" lat="47.4044289" lon="8.4054211"/>
  <node id="428076175" lat="47.4027948" lon="8.4045078"/>
  <node id="460527594" lat="47.4027445" lon="8.4055605"/>
  <node id="460527973" lat="47.4029993" lon="8.4040697"/>
  <node id="984783907" lat="47.4027808" lon="8.4054934"/>

Solution

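  • context.next() consumes the first event. Because of tag=('node'), that event belongs to the first node, so the variable named root is actually the first node and not the root element: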

    In [14]: tree = etree.iterparse(state_file_xml, events=("start", "end"),tag=('node'))
    
    In [15]: context = iter(tree)
    
    In [16]: event, root = next(context)
    
    In [17]: root.attrib
    Out[17]: {'id': '30228243', 'lon': '8.4049015', 'lat': '47.4030908'}
    

    (I changed context.next() to next(context) so that the code works with both Python 2 and Python 3.)


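    By the way, iterparse returns an iterator, so context = iter(tree) is unnecessary. Since you only need to process each node once, events=("end",) suffices, and with the tag filter in place every element the loop receives is already a node, so the check on elem.tag can go as well: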

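    import lxml.etree as ET
    
    # state_file_xml is the input file from the question.
    # Only "end" events for <node> elements are reported.
    context = ET.iterparse(state_file_xml, events=("end",), tag="node")
    nodes = {}
    
    for event, elem in context:
        node_id = elem.get("id")  # avoid shadowing the built-in id()
        lat = float(elem.get("lat"))
        lon = float(elem.get("lon"))
        nodes[node_id] = [lat, lon]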
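
To check this behaviour without the original file, here is a minimal sketch that parses a shortened copy of the sample data from memory. The names SAMPLE, first_event and collect_nodes are made up for illustration, and io.BytesIO stands in for state_file_xml; it assumes lxml is installed.

```python
import io

import lxml.etree as ET

# Shortened copy of the sample data from the question (illustrative only).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <way id="46916568">
    <nd ref="284132018"/>
  </way>
  <node id="30228243" lat="47.4030908" lon="8.4049015"/>
  <node id="283533527" lat="47.4016971" lon="8.4036696"/>
  <node id="284132018" lat="47.4034413" lon="8.4042634"/>
</osm>
"""


def first_event(xml_bytes):
    """Return (event, tag, id) for the first event iterparse reports."""
    context = ET.iterparse(
        io.BytesIO(xml_bytes), events=("start", "end"), tag="node"
    )
    # With the tag filter, <osm> and <way> never produce events,
    # so the first item comes from the first <node>.
    event, elem = next(context)
    return event, elem.tag, elem.get("id")


def collect_nodes(xml_bytes):
    """Collect {id: [lat, lon]} for every <node>, without calling next() first."""
    nodes = {}
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("end",), tag="node")
    for _event, elem in context:
        nodes[elem.get("id")] = [float(elem.get("lat")), float(elem.get("lon"))]
    return nodes


if __name__ == "__main__":
    print(first_event(SAMPLE))
    print(collect_nodes(SAMPLE))
```

The first call shows what the initial next(context) hands back, and the second shows that looping over the iterator directly yields all three nodes, including the first one.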