I would like to read quite big XML as a stream. But could not find any way to use my old XPathes to find elements. Previously files were of moderate size, so in was enough to:
all_elements = []
for xpath in list_of_xpathes:
all_elements.append(etree.parse(file).getroot().findall(xpath))
Now I am struggling with iterparse. Ideally the solution would be to compare path of current element with desired xpath:
import lxml.etree as et
xml_file = r"my.xml" # quite big xml, that i should read
xml_paths = ['/some/arbitrary/xpath', '/another/xpath']
all_elements = []
iter = et.iterparse(xml_file, events = ('end',))
for event, element in iter:
for xpath in xml_paths:
if element_complies_with_xpath(element, xpath):
all_elements.append(element)
break
How is it possible to implement element_complies_with_xpath function using lxml?
If first part of the xpath can be extracted then the rest could be tested as follows. Instead of a list of strings, a dict of <first element name>: <rest of the xpath>
could be used. Parent element could be used as dict key also.
Full xpath: /some/arbitrary/xpath
dict : {'some': './arbitrary/xpath'}
import lxml.etree as et
def element_complies_with_xpath(element, xpath):
children = element.xpath(xpath)
print([ "child:" + x.tag for x in children])
return len(children) > 0
xml_file = r"/home/lmc/tmp/test.xml" # quite big xml, that i should read
xml_paths = [{'membership': './users/user'}, {'entry':'author/name'}]
all_elements = []
iter1 = et.iterparse(xml_file, events = ('end',))
for event, element in iter1:
for d in xml_paths:
if element.tag in d and element_complies_with_xpath(element, d[element.tag]):
all_elements.append(element)
break
print([x.tag for x in all_elements])
count()
xpath function could be used also
def element_complies_with_xpath(element, xpath):
children = element.xpath(xpath)
print( f"child exist: {children}")
return children
xml_file = r"/home/luis/tmp/test.xml" # quite big xml, that i should read
xml_paths = [{'membership': 'count(./users/user) > 0'}, {'entry':'count(author/name) > 0'}]