perl xml-libxml

How to get the hierarchy structure of an element using XML::LibXML in Perl


For example, given this HTML snippet: <div class="c1"><span class="c2"><b class="c3"/></span></div> the expected hierarchy structure for the b element is: div.c1 span.c2 b.c3


Solution

  • XML::LibXML::Node provides the parentNode method, which returns a node's parent. So you can locate the node of interest (b) and then walk up to the top of the tree, collecting the needed information from each node along the way. For the desired element.class format:

    use warnings;
    use strict;
    use feature 'say';
    
    use XML::LibXML;
    
    my $xml = q(<div class="c1"><span class="c2"><b class="c3"/></span></div>);
    
    my $doc = XML::LibXML->load_xml(string => $xml);
    
    my @hier;
    
    my ($node) = $doc->findnodes('//b');  # only first such node assigned
    
    unshift @hier, join '.', $node->nodeName, $node->getAttribute('class');
    
    while (my $parent = $node->parentNode) {
        last if $parent->nodeType == XML_DOCUMENT_NODE;  # reached the document node, the top of the tree
    
        unshift @hier, join '.', $parent->nodeName, $parent->getAttribute('class');
        $node = $parent;
    }
    
    say for @hier;
    

    The getAttribute method is in the XML::LibXML::Element class.
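
The same walk can also be expressed with the XPath ancestor-or-self axis instead of an explicit parentNode loop. The following is only a sketch under a few assumptions: hierarchy_of is a hypothetical helper name (not part of XML::LibXML), the extra <i/> element is added to the sample just to show an element without a class attribute, and it relies on findnodes returning the matched nodes in document order (outermost first).

```perl
use strict;
use warnings;
use feature 'say';

use XML::LibXML;

# Hypothetical helper (not part of XML::LibXML): returns the "name.class"
# labels for an element and all of its ancestor elements, root first.
sub hierarchy_of {
    my ($node) = @_;

    # ancestor-or-self::* matches element nodes only, so the document
    # node at the top of the tree is never included.
    return map {
        my $class = $_->getAttribute('class');
        # Fall back to the bare element name when there is no class value
        (defined $class && length $class)
            ? join('.', $_->nodeName, $class)
            : $_->nodeName;
    } $node->findnodes('ancestor-or-self::*');
}

# Sample input: the snippet from the question plus an <i/> without a class
my $xml = q(<div class="c1"><span class="c2"><b class="c3"/><i/></span></div>);
my $doc = XML::LibXML->load_xml(string => $xml);

# Print one hierarchy line per matching element
for my $node ($doc->findnodes('//b | //i')) {
    say join ' ', hierarchy_of($node);
}
```

Checking that a class value is present avoids the "uninitialized value" warning that join would otherwise raise for elements without that attribute, and wrapping the logic in a subroutine lets it run on every match rather than only the first.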