treetraversal

Select elements between two elements in HTML, at different levels


Alternative headline: traverse all nodes in a tree between two nodes.

I have a DOM tree (an ordered tree, where child elements at each level have a specific order). I am given start and stop nodes in the tree. The start is guaranteed to be before the stop in a depth-first traversal of the tree. I want to find/traverse all the nodes between these two efficiently. This means that if I can determine that a node between the two does NOT include the stop node, I do not want to recurse through all its descendants.

Is this a well-known (named?) problem, that has algorithms for the traversal, that perhaps computes shared ancestors or such to determine whether traverse a node's descendants?

Example 0 (same level):

<body>
  <p>ignore this</p>
  <div id="start">start here</div>
  <p id="A">visit A</p>
  <p id="B">visit B</p>
  <p id="C">visit C <span>but <b>not</b> <i>all this</i></span></p>
  <div id="stop">stop here</div>
  <p>ignore this</p>
</body>

In Example 0, I want to visit the following, in order:

Example 1 (stop level is deeper):

<body>
  <p>ignore this</p>
  <div><h2><span id="start">start here</span> A1</h2> A2</div>
  <section id="B"><many><more><levels><of><content>…</many></section>
  <section id="C">
     <div id="D"><many><more><levels><of><content>…</many></div>
     <div id="E">
       <div id="F"><many><more><levels><of><content>…</many></div>
       <div id="stop">stop here</div>
       <p>ignore this</p>
     </div>
  </section>
  <p>ignore this</p>
</body>

In Example 1, I want to visit the following, in order:

Example 2 (stop is higher):

<body>
  <p>ignore this</p>
  <div>
    <h2>
      <span id="start">start here</span>
      <div id="A">visit A</div>
      <section id="B"><many><more><levels><of><content>…</many></section>
    </h2>
    <section id="C"><many><more><levels><of><content>…</many></section>
  </div>
  <section id="D"><many><more><levels><of><content>…</many></section>
  <div id="stop">stop here</div>
  <p>ignore this</p>
</body>

In Example 2, I want to visit the following, in order:


Solution

  • Here's an algorithm that seems to accomplish the goal, in pseudocode:

    results   = []
    ancestors = get_ancestors_of(stop_node)
    current   = start_node
    
    while current and current != stop_node:
        if current in ancestors:
            # if this node contains the stop node, dive in
            current = current.children[0]
        else:
            if current != start_node:
                results.add(current)
    
            if current.next_sibling:
                current = current.next_sibling
    
            else:
                # pop up until you find the next "aunt" node
                while current and !current.next_sibling:
                    # assumes the root node returns null/false for its parent
                    current = current.parent
    
                if current:
                    current = current.next_sibling
    
    return results
    

    Here it is in Ruby, as a method that takes a block and yields nodes as it finds them:

    # Traverse nodes between two in the tree, yielding each to the block
    # but without traversing the descendants that are wholly between the two.
    def traverse_between(start_node, end_node)
        # Get all ancestors of the end_node
        end_node_ancestors = end_node ? end_node.ancestors.to_set : Set.new
    
        current = start_node
    
        while current && current != end_node
            # Traverse children if the end node is in there somewhere.
            # We don't yield ancestors of the end_node, since
            # they implicitly contain the end_node and possibly more.
            if end_node_ancestors.include?(current)
                current = current.element_children.first
            else
                yield current unless current == start_node
    
                if current.next_sibling
                    current = current.next_sibling
                else
                    # Move up to the next ancestor with a sibling; stop if we reach the root
                    while current && !current.next_sibling
                        current = current.parent
                        return if current.is_a?(Nokogiri::XML::Document)
                    end
                    current = current.next_sibling if current
                end
            end
        end
    end