
Find the index of a child in lxml


I am using Python 3.12 and lxml.

I want to find a particular tag, and I can do it with elem.find("tag"). elem is of type Element.

But I want to move the children of this child into the parent, at the position where the child was. For that, I need the index of the child, and I can't find a way to get that index.

lxml's API description has the _Element.index() method, but I have no idea how to get an _Element instance from an Element instance.

Please advise how to determine that index. (Using a loop instead of find() can do that but I'd like a neater way).

EDIT: here is a sample XML element

<parent>
  <child-a/>
  <container>
     <child-b/>
     <child-c/>
  </container>
  <child-d/>
  <child-e/>
</parent>

I am writing code that finds <container>, which is a child of <parent> but whose position I don't know in advance (there can be several of them too), moves its children into the parent where it was, and then deletes <container>, to get this:

<parent>
  <child-a/>
  <child-b/>
  <child-c/>
  <child-d/>
  <child-e/>
</parent>

So, I can find <container> using parent.find(). But to move its children into the same place under <parent> I need to have the index of <container>, as the insert() method requires an index. For now I use this kludge:

    while True:
        index = None
        found = None
        for i in range(len(parent)):
            if parent[i].tag =="container":
                found = parent[i]
                index = i
                break
        if found is None:
            break

        offset = 0
        while len(found) > 0:
            parent.insert(index+offset,found[0])
            offset+=1
        parent.remove(found)

I know that offset is redundant, since one could just increment index; I kept it for aesthetic reasons. But the loop itself is quite a kludge. Here is what I would do if Element had an index() method, but it doesn't:

    found = parent.find("container")
    while found is not None:
        index = parent.index(found)
        offset = 0
        while len(found) > 0:
            parent.insert(index+offset,found[0])
            offset+=1
        parent.remove(found)
        found = parent.find("container")

But Element.index() does not exist; _Element.index() exists but I don't know how to access _Element.


Solution

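  • You can use list(container) (or the deprecated container.getchildren()) to get all the children and addnext() to put them, one by one, after the container; afterwards you can remove the (already) empty container. The loop needs reversed() because each addnext() call places the child right after the container, so the children have to be added last-first to end up in the correct order.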

    parent = container.getparent()
    
    #for child in reversed(container.getchildren()):
    for child in reversed(container):
        container.addnext(child)
    
    parent.remove(container)    
    

    Full working example which I was using for tests (with some extra comments):

    html = '''
    <parent>
      <child-a/>
      <container><child-b/><child-c/></container>
      <child-d/>
      <container><child-e/><child-f/></container>
      <child-g/>
    </parent>
    '''
    
    import lxml.html
    
    tree = lxml.html.fromstring(html)
    
    for container in tree.findall('container'):
        parent = container.getparent()
        #for child in reversed(container.getchildren()):  # getchildren() - deprecated
        #for child in reversed(container):
        for child in container.iterchildren(reversed=True):
            #child.tail = None   # clean indentations  # elements in one line
            #child.tail = "\n"   # clean indentations  # next tag starts in first column
            #child.tail = container.tail   # clean indentations 
            container.addnext(child)
        parent.remove(container)    
    
    # https://lxml.de/apidoc/lxml.etree.html#lxml.etree.indent
    import lxml.etree as ET
    #ET.indent(tree, space='    ')  # clean all indentations - use 4 spaces
    #ET.indent(tree, space='....')  # clean all indentations - use 4 dots - looks like TOC (Table Of Contents) in book :)
    ET.indent(tree)   # clean all indentations - use (default) 2 spaces
    
    #html = lxml.html.tostring(tree, pretty_print=True).decode() 
    #html = lxml.html.tostring(tree).decode() 
    html = lxml.html.tostring(tree, encoding='unicode')  # 'unicode' returns str; 'utf-8' would return bytes
    
    print(html)
    

    Result without child.tail = container.tail and without ET.indent(tree):

    <parent>
      <child-a></child-a>
      <child-b></child-b><child-c></child-c><child-d></child-d>
      <child-e></child-e><child-f></child-f><child-g></child-g>
    </parent>
    

    Result with child.tail = container.tail or with ET.indent(tree):

    <parent>
      <child-a></child-a>
      <child-b></child-b>
      <child-c></child-c>
      <child-d></child-d>
      <child-e></child-e>
      <child-f></child-f>
      <child-g></child-g>
    </parent>
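
The tail-copying variant can also be wrapped in a small helper. This is only a sketch: `unwrap` is a name made up for illustration, and it uses the XML parser from `lxml.etree` instead of `lxml.html`, on the assumption that the data in the question is plain XML.

```python
from lxml import etree


def unwrap(container):
    """Move the children of `container` into its place, then drop it."""
    parent = container.getparent()
    # Snapshot the children first; addnext() moves each one out of the container.
    for child in reversed(list(container)):
        child.tail = container.tail   # reuse the container's trailing whitespace
        container.addnext(child)
    parent.remove(container)


xml = '''<parent>
  <child-a/>
  <container><child-b/><child-c/></container>
  <child-d/>
  <container><child-e/><child-f/></container>
  <child-g/>
</parent>'''

root = etree.fromstring(xml)

# findall() returns a list, so the tree can be changed while looping over it.
for container in root.findall('container'):
    unwrap(container)

tags = [child.tag for child in root]
print(etree.tostring(root, encoding='unicode'))
```

Taking a snapshot with `list(container)` is a cautious choice here; `container.iterchildren(reversed=True)` from the example above should behave the same way.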
    

    Doc: addnext(), addprevious(), lxml.etree.indent(), getparent(), getchildren(), iterchildren(reversed=True)
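
As for the index itself: as far as I know, `lxml.etree.Element` is only a factory function, and the objects it returns (like those returned by `find()` or by the parsers) are `_Element` instances, so `index()` can be called on them directly. A minimal sketch of the loop from the question written that way, assuming the sample XML with no whitespace between the tags:

```python
from lxml import etree

xml = (
    "<parent><child-a/>"
    "<container><child-b/><child-c/></container>"
    "<child-d/><child-e/></parent>"
)
parent = etree.fromstring(xml)

found = parent.find("container")
while found is not None:
    # Elements produced by lxml are _Element objects, so index() is available.
    index = parent.index(found)
    # insert() moves a child that already has a parent, so take a
    # snapshot with list(found) before the container starts to empty.
    for offset, child in enumerate(list(found)):
        parent.insert(index + offset, child)
    parent.remove(found)
    found = parent.find("container")

tags = [child.tag for child in parent]
print(tags)
```

The `is not None` test matters because truth-testing an lxml element checks whether it has children, so an empty `<container/>` would end a `while found:` loop without being removed.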