I would like to use an XPath expression to get a list of lists (or sequence of sequences) that groups the extracted XML attribute values by parent element, in document order.
Here are my attempts so far, using a minimal example:
import elementpath, lxml.etree
xml = '''<a>
<b c="1">
<d e="3"/>
<d e="4"/>
</b>
<b c="2">
<d e="5"/>
<d e="6"/>
</b>
</a>'''
tree = lxml.etree.fromstring(str.encode(xml))
xpath1 = '/a/b/d/@e'
xpath2 = 'for $b in (/a/b) return concat("[", $b/string-join(d/@e, ", "), "]")'
print('1:', elementpath.select(tree, xpath1))
print('2:', elementpath.select(tree, xpath2))
print('3:', [['3', '4'], ['5', '6']])
Which outputs:
1: ['3', '4', '5', '6']
2: ['[3, 4]', '[5, 6]']
3: [['3', '4'], ['5', '6']]
xpath1 returns a flattened list/sequence, with no grouping by parent element.
xpath2 is the closest I have come so far, but it gives the sub-arrays as strings rather than arrays.
Option 3 is what I am after.
Is anyone able to advise on a better way of doing this with just an XPath expression?
Thanks, Mark
ElementPath supports XPath 3.1 with XPath/XDM arrays, so I think what you want, in terms of XPath, is
/a!array { b ! array { d/@e/string() } }
which should give [["3","4"],["5","6"]].
That is the output with SaxonC HE (12.3) of
from saxonche import PySaxonProcessor
xml = '''<a>
<b c="1">
<d e="3"/>
<d e="4"/>
</b>
<b c="2">
<d e="5"/>
<d e="6"/>
</b>
</a>'''
with PySaxonProcessor(license=False) as saxon:
    xdm_doc = saxon.parse_xml(xml_text=xml)
    xpath_processor = saxon.new_xpath_processor()
    xpath_processor.set_context(xdm_item=xdm_doc)
    xdm_value = xpath_processor.evaluate_single('/a!array { b ! array { d/@e/string() } }')
    print(xdm_value)
At that stage you don't have a Python list of lists, however, but rather a PyXdmItem that is an XDM array of arrays. To get a nested Python list I think you can do
list_of_lists = [inner_array.head.as_list() for inner_array in xdm_value.as_list()]
print(list_of_lists)
I will need to check whether ElementPath allows that too, perhaps a bit more elegantly; the simplest I have found is
import elementpath, lxml.etree
from elementpath.xpath3 import XPath3Parser
xml = '''<a>
<b c="1">
<d e="3"/>
<d e="4"/>
</b>
<b c="2">
<d e="5"/>
<d e="6"/>
</b>
</a>'''
tree = lxml.etree.fromstring(str.encode(xml))
array_of_arrays = elementpath.select(tree, '/a!array { b ! array { d/@e/string() } }', parser=XPath3Parser)
print(array_of_arrays)
list_of_lists = [array.items() for array in array_of_arrays[0].items()]
print(list_of_lists)
giving [['3', '4'], ['5', '6']] for the final print(list_of_lists).
Alternatively, using a sequence of arrays in XPath gives you a list of arrays in Python, which you can more easily convert into a list of lists:
sequence_of_arrays = elementpath.select(tree, '/a/b ! array { d/@e/string() }', parser=XPath3Parser)
print(sequence_of_arrays)
list_of_lists = [array.items() for array in sequence_of_arrays]
print(list_of_lists)
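If the nesting depth varies, the two list comprehensions above could be folded into one recursive helper. The following is only a sketch: to_nested_lists is a hypothetical name, and it assumes nothing beyond what the snippets above show, namely that an XPath array object exposes an items() method returning its members. StandInArray is a made-up stand-in used so that the example runs without ElementPath installed; with ElementPath you would pass the result of elementpath.select(...) instead.

```python
def to_nested_lists(value):
    """Recursively turn array-like XPath results into plain Python lists.

    Assumption: array objects expose an items() method returning their
    members, as the XPath array objects in the snippets above do.
    """
    if isinstance(value, (list, tuple)):
        return [to_nested_lists(v) for v in value]
    items = getattr(value, "items", None)
    # Dicts also have items(); leave them alone, they are not arrays.
    if callable(items) and not isinstance(value, dict):
        return [to_nested_lists(v) for v in items()]
    return value


class StandInArray:
    """Hypothetical stand-in for an XPath array, used only for this demo."""

    def __init__(self, *members):
        self._members = list(members)

    def items(self):
        return list(self._members)


# Mimics the shape returned for '/a!array { b ! array { d/@e/string() } }':
# a one-item sequence holding an array of arrays.
result = [StandInArray(StandInArray("3", "4"), StandInArray("5", "6"))]
nested = to_nested_lists(result)
print(nested[0])
```

Because the helper also walks ordinary lists, the same call should handle the sequence-of-arrays variant, where the outer container is already a Python list.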