python xpath xpath-2.0

Extracting a list of lists with XPath


I would like to use an XPath expression to get a list of lists (or sequence of sequences) that groups the extracted XML values by parent element, in order.

Here are my attempts so far, using a minimal example:

import elementpath, lxml.etree
xml = '''<a>
<b c="1">
  <d e="3"/>
  <d e="4"/>
</b>
<b c="2">
  <d e="5"/>
  <d e="6"/>
</b>
</a>'''
tree = lxml.etree.fromstring(str.encode(xml))
xpath1 = '/a/b/d/@e'
xpath2 = 'for $b in (/a/b) return concat("[", $b/string-join(d/@e, ", "), "]")'
print('1:', elementpath.select(tree, xpath1))
print('2:', elementpath.select(tree, xpath2))
print('3:', [['3', '4'], ['5', '6']])

This outputs:

1: ['3', '4', '5', '6']
2: ['[3, 4]', '[5, 6]']
3: [['3', '4'], ['5', '6']]

xpath1 returns a flattened list/sequence, with no grouping by parent element.

xpath2 is the closest I have come so far, but it returns each sub-array as a string rather than as an array.

Option 3 is what I am after.

Is anyone able to advise on a better way of doing this with just an XPath expression?

Thanks, Mark


Solution

  • ElementPath supports XPath 3.1 with XPath/XDM arrays, so I think what you want, in terms of XPath, is

    /a!array { b ! array { d/@e/string() } }
    

    which should give [["3","4"],["5","6"]].

    That is the output with SaxonC HE (12.3) of

    from saxonche import PySaxonProcessor
    
    xml = '''<a>
    <b c="1">
      <d e="3"/>
      <d e="4"/>
    </b>
    <b c="2">
      <d e="5"/>
      <d e="6"/>
    </b>
    </a>'''
    
    with PySaxonProcessor(license=False) as saxon:
        xdm_doc = saxon.parse_xml(xml_text=xml)
        xpath_processor = saxon.new_xpath_processor()
        xpath_processor.set_context(xdm_item=xdm_doc)
        xdm_value = xpath_processor.evaluate_single('/a!array { b ! array { d/@e/string() } }')
        print(xdm_value)
    

    At that stage you don't have a Python list of lists, however, but rather a PyXdmItem that is an XDM array of arrays. To get a nested Python list, I think you can do

        list_of_lists = [inner_array.head.as_list() for inner_array in xdm_value.as_list()]
        print(list_of_lists)
    

    ElementPath allows that too, perhaps a bit more elegantly; the simplest approach I have found is

    import elementpath, lxml.etree
    from elementpath.xpath3 import XPath3Parser
    
    xml = '''<a>
    <b c="1">
      <d e="3"/>
      <d e="4"/>
    </b>
    <b c="2">
      <d e="5"/>
      <d e="6"/>
    </b>
    </a>'''
    
    tree = lxml.etree.fromstring(str.encode(xml))
    
    array_of_arrays = elementpath.select(tree, '/a!array { b ! array { d/@e/string() } }', parser=XPath3Parser)
    
    print(array_of_arrays)
    
    list_of_lists = [array.items() for array in array_of_arrays[0].items()]
    
    print(list_of_lists)
    

    giving [['3', '4'], ['5', '6']] for the final print(list_of_lists).

    Alternatively, using a sequence of arrays in XPath gives you a list of arrays in Python, which is easier to convert into a Python list of lists:

    sequence_of_arrays = elementpath.select(tree, '/a/b ! array { d/@e/string() }', parser=XPath3Parser)
    
    print(sequence_of_arrays)
    
    list_of_lists = [array.items() for array in sequence_of_arrays]
    
    print(list_of_lists)
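
If the "just an XPath" requirement can be relaxed, the grouping can also come from Python rather than from XPath arrays. The following is only a minimal sketch under that assumption: it uses the standard library's xml.etree.ElementTree (not ElementPath or SaxonC, so only the limited XPath subset that module supports), runs one path to select each parent b, and then a second, relative path per parent to collect the e attribute values of its d children.

```python
import xml.etree.ElementTree as ET

xml = '''<a>
<b c="1">
  <d e="3"/>
  <d e="4"/>
</b>
<b c="2">
  <d e="5"/>
  <d e="6"/>
</b>
</a>'''

root = ET.fromstring(xml)

# The outer path selects each parent <b> in document order; the inner path is
# evaluated relative to that parent, so the nesting of the comprehension
# (not the XPath expression itself) is what produces the grouping.
list_of_lists = [[d.get('e') for d in b.findall('d')] for b in root.findall('b')]

print(list_of_lists)
```

This prints [['3', '4'], ['5', '6']], the same shape as option 3 in the question. The trade-off is that the structure is hard-coded in two path steps instead of being expressed in one XPath 3.1 expression.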