python xml lxml

Get xpath of all nodes in XML tree with attributes - Python


Suppose I have the following test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<test:myXML xmlns:test="http://com/my/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Parent>
  <FirstNode name="FirstNodeName"></FirstNode>
  <Child1>Test from Child1</Child1>
  <SecondNode name="SecondNodeName" type="SecondNodeType">
    <Child2>
      <GrandChild>Test from GrandChild</GrandChild>
    </Child2>
  </SecondNode>
</Parent>
</test:myXML>

I'd like to iterate over the whole tree, and get the path of each node, including the attributes. I am able to iterate over the tree and retrieve the path to each node as follows:

from lxml import etree

xmlDoc = etree.parse("test.xml")
root = xmlDoc.getroot()

for node in xmlDoc.iter():
    print("path: ", xmlDoc.getpath(node))

As expected, this prints out:

path:  /test:myXML
path:  /test:myXML/Parent
path:  /test:myXML/Parent/FirstNode
path:  /test:myXML/Parent/Child1
path:  /test:myXML/Parent/SecondNode
path:  /test:myXML/Parent/SecondNode/Child2
path:  /test:myXML/Parent/SecondNode/Child2/GrandChild

However, as I mentioned, I'd like to somehow print the attributes of said node, and its parents, along with its path. For example, if I want to print the element "Child2", then I'd like for the attributes of each of its parent elements to be displayed as well. Something like:

path:  /test:myXML/Parent/SecondNode{name="SecondNodeName" type="SecondNodeType"}/Child2

Is this possible? I'm not too fussed about the namespaces of the root element if that makes it easier.


Solution

  • I don't know of a built-in lxml method that does this, but with all the enforced "working from home" going on, I figured I might as well try to come up with something. It's inelegant, but it seems to do the job.

    Try this on your actual code and see if it works:

    att = """
    <test:myXML xmlns:test="http://com/my/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Parent>
      <FirstNode name="FirstNodeName"></FirstNode>
      <Child1>Test from Child1</Child1>
      <SecondNode name="SecondNodeName" type="SecondNodeType">
        <Child2>
          <GrandChild>Test from GrandChild</GrandChild>
        </Child2>
      </SecondNode>
    </Parent>
    </test:myXML>
    """
    
    from lxml import etree
    
    root = etree.fromstring(att)      # root element
    tree = etree.ElementTree(root)    # tree wrapper, needed for getpath()

    # Maps each plain XPath to the same path with {attributes} added.
    decorated = {}

    # iter() walks in document order, so a parent is always seen before its children.
    for node in root.iter():
        xp = tree.getpath(node)
        # Split off the last step, e.g. "/test:myXML/Parent" + "SecondNode".
        parent_xp, step = xp.rsplit("/", 1)
        if node.attrib:
            # Build {key="value" key="value"} for this node.
            ats = " ".join(f'{k}="{v}"' for k, v in node.attrib.items())
            step += "{" + ats + "}"
        # Reuse the parent's decorated path; the root has no parent entry.
        decorated[xp] = decorated.get(parent_xp, "") + "/" + step
        print(decorated[xp])

    Output:

    /test:myXML
    /test:myXML/Parent
    /test:myXML/Parent/FirstNode{name="FirstNodeName"}
    /test:myXML/Parent/Child1
    /test:myXML/Parent/SecondNode{name="SecondNodeName" type="SecondNodeType"}
    /test:myXML/Parent/SecondNode{name="SecondNodeName" type="SecondNodeType"}/Child2
    /test:myXML/Parent/SecondNode{name="SecondNodeName" type="SecondNodeType"}/Child2/GrandChild
    

    If it works and you're interested, I can try to explain what all this does...
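
For a single node rather than the whole tree, the same idea can be sketched as a small function that walks up through the node's ancestors. This is only a sketch under a few assumptions: `attributed_path` and the sample document are hypothetical names chosen for illustration, not part of lxml, and the `{...}` notation is the display format from the question, not valid XPath.

```python
from lxml import etree


def attributed_path(node):
    """Return the node's path with {key="value" ...} after each step that has attributes.

    Hypothetical helper for illustration; not part of lxml.
    """
    tree = node.getroottree()
    # iterancestors() yields parent first, root last, so reverse it and add the node.
    chain = list(reversed(list(node.iterancestors()))) + [node]
    steps = []
    for element in chain:
        # Last step of the plain path as lxml writes it, e.g. "leaf" or "item[2]".
        step = tree.getpath(element).rsplit("/", 1)[-1]
        if element.attrib:
            attrs = " ".join(f'{key}="{value}"' for key, value in element.attrib.items())
            step += "{" + attrs + "}"
        steps.append(step)
    return "/" + "/".join(steps)


# Hypothetical sample: the second <item> follows a nested element, so its
# parent is not the node that comes immediately before it in document order.
SAMPLE = b"""
<root>
  <group id="g1">
    <item>
      <leaf/>
    </item>
    <item kind="last"/>
  </group>
</root>
"""

root = etree.fromstring(SAMPLE)
paths = [attributed_path(node) for node in root.iter()]
for path in paths:
    print(path)
```

Because each path is rebuilt from the node's own ancestors, the second `item` still shows the attributes of `group`, regardless of which node was visited just before it. The trade-off is one `getpath()` call per ancestor, which may be slow on very deep or very large trees.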