python lxml iterparse

lxml iterparse in python can't handle namespaces


from lxml import etree
import StringIO

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
docs = etree.iterparse(data,tag='a')
a,b = docs.next()


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
  File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
StopIteration

This works fine until I add the namespace to the root node. Any ideas for a workaround, or for the correct way of doing this? I need the parsing to be event-driven because the files are very large.


Solution

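  • When there is a namespace attached, the tag isn't a, it's {http://some.random.schema}a. The tag filter compares against that fully qualified name, so tag='a' matches nothing and the iterator ends immediately with StopIteration. Try this (Python 3):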

    from lxml import etree
    from io import BytesIO
    
    xml = '''\
    <root xmlns="http://some.random.schema">
      <a>One</a>
      <a>Two</a>
      <a>Three</a>
    </root>'''
    data = BytesIO(xml.encode())
    docs = etree.iterparse(data, tag='{http://some.random.schema}a')
    for event, elem in docs:
        print(f'{event}: {elem}')
    

    or, in Python 2:

    from lxml import etree
    from StringIO import StringIO
    
    xml = '''\
    <root xmlns="http://some.random.schema">
      <a>One</a>
      <a>Two</a>
      <a>Three</a>
    </root>'''
    data = StringIO(xml)
    docs = etree.iterparse(data, tag='{http://some.random.schema}a')
    for event, elem in docs:
        print event, elem
    

    This prints something like:

    end: <Element {http://some.random.schema}a at 0x10941e730>
    end: <Element {http://some.random.schema}a at 0x10941e8c0>
    end: <Element {http://some.random.schema}a at 0x10941e960>
    

    As @mihail-shcheglov pointed out, a namespace wildcard can also be used: {*}a matches a in any namespace or in none:

    from lxml import etree
    from io import BytesIO
    
    xml = '''\
    <root xmlns="http://some.random.schema">
      <a>One</a>
      <a>Two</a>
      <a>Three</a>
    </root>'''
    data = BytesIO(xml.encode())
    docs = etree.iterparse(data, tag='{*}a')
    for event, elem in docs:
        print(f'{event}: {elem}')
    

    See lxml.etree docs for more.
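
Since the question mentions very large files, here is a minimal sketch of how the namespaced tag filter might be combined with the usual free-as-you-go pattern. The helper name iter_texts and its parameters are hypothetical, made up for illustration; etree.QName is used only to build the {namespace}name string instead of formatting it by hand.

```python
from io import BytesIO
from lxml import etree

# Namespace taken from the sample document in the question.
NS = "http://some.random.schema"

def iter_texts(source, namespace, local_name):
    """Yield the text of each matching element, releasing memory as we go.

    Hypothetical helper for illustration; not part of lxml.
    """
    # QName(...).text gives the '{namespace}local_name' form used by the tag filter.
    tag = etree.QName(namespace, local_name).text
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem.text
        # Empty the element, then delete siblings that were already processed,
        # so the partially built tree does not keep growing with the file.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

xml = b'<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>'
texts = list(iter_texts(BytesIO(xml), NS, "a"))
print(texts)
```

Whether the sibling cleanup is needed depends on the document's shape. For a flat list of records under one root, as in the question, it is what keeps memory use roughly constant.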