php html regex href domdocument

Get href values from all <a> tags, including nested <a> tags


I've been searching for hours (I couldn't find a duplicate) and have tried many different approaches, using both RegEx (regular expressions) and DOMDocument, without success.

Non-Standard HTML Code:

<a class="SOMECLASS" href="javascript:__FUNCTION(SOME_HREF_INSIDE)" onclick="SOME_JS_FUNCTION();" id="SOME_ID" style="SOME_STYLE">
    <a href="SOME_URL_3">SOME TEXT</a>
</a>

The problem is that I'm trying to get the URL "SOME_URL_3", but whether I parse with RegEx or DOMDocument, parsing stops as soon as it encounters the first href. Since the second "a" tag is nested inside the first one, the parser treats the two as a single tag.

I observed that browsers seem to automatically separate the tags when parsing, as follows.

Before:

<a href="SOME_URL">
    <a href="SOME_URL_2">
    </a>
</a>

After:

<a href="SOME_URL">
</a>
<a href="SOME_URL_2">
</a>

I have not been able to replicate this browser behavior in PHP.

Previous Attempt:

$dom = new DOMDocument();
@$dom->loadHTML($result);

foreach($dom->getElementsByTagName('a') as $link) { 
    $href_count = 0;
    $attrs = array();

    for ($i = 0; $i < $link->attributes->length; ++$i) {
        $node = $link->attributes->item($i);
        if ($node->nodeName == "href") {
            $attrs[$node->nodeName][$href_count] = $node->nodeValue;
            $href_count++;
            if ($href_count >= 2) {
                echo "A second href has been found";
            }
        }
    }

    echo "<pre>";
    var_dump($attrs);
    echo "</pre>";
}

As you may expect, it unfortunately doesn't work, otherwise I wouldn't be here asking for help...

Please do not hesitate to share your knowledge, any help or suggestion will be greatly appreciated!


Update:

I forgot to specify in my initial question that the answer should still capture the href from standard/non-nested "a" tags. My goal is to extend my existing HTML parser so that it retrieves the URL from every href attribute. My initial code used only RegEx, and I wasn't able to capture an additional href from within nested "a" tags. The solution I'm looking for would capture the href from both nested and standard/non-nested "a" tags. Brandon White's solution is great for nested hrefs only. However, it would be resource-intensive to use two different RegEx patterns (nested/non-nested) and parse the entire HTML content twice. An ideal solution would be a single RegEx that captures both at the same time, if that is possible.
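
For the single-pass RegEx described above, one possible sketch is a pattern that matches only opening <a> tags, so it never has to pair a tag with its closing </a> and nesting stops mattering. The helper name extract_anchor_hrefs and the sample markup are assumptions made for illustration, not part of the existing parser. The pattern assumes quoted href values and no ">" character inside earlier attribute values, so treat it as a starting point rather than a complete HTML parser.

```php
<?php
// Hypothetical helper: collect every href found in an opening <a ...> tag,
// in document order, with a single pass over the markup.
function extract_anchor_hrefs(string $html): array
{
    // "<a" followed by a word boundary (so <abbr> is skipped), any earlier
    // attributes, whitespace, then href="..." or href='...'.
    // Only opening tags are matched, so nested and non-nested tags are
    // handled the same way.
    $pattern = '/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    return $matches[2]; // Second capture group: the href values
}

$html = '<a class="x" href="javascript:f(1)" id="i"><a href="SOME_URL_3">SOME TEXT</a></a>'
      . '<a href="SOME_URL_5">plain</a>';

$hrefs = extract_anchor_hrefs($html);
print_r($hrefs);
```

The values come back exactly as written in the markup, so any HTML entities in them are not decoded.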


Solution

  • The following solution extracts all <a> tag href values, whether nested or not.

    UPDATE 2024-03-16:

    As highlighted by @mickmackusa, it appears that DOMDocument now automatically parses nested <a> tags as sibling tags, which makes the previous solution irrelevant.

    It is now possible to extract all nested and standard/non-nested <a> tags, simply by iterating over the result of the getElementsByTagName('a') function, as shown below.

    Updated Solution:

    $result = <<<HTML
    <a href="SOME_URL">
        <a href="SOME_URL_2">
            <a href="SOME_URL_DEEP">
            </a>
        </a>
    </a>
    
    <a href="SOME_URL3">
        <a href="SOME_URL_4">
        </a>
    </a>
    
    <a href="SOME_URL_5">
    </a>
    <a href="SOME_URL_6">
    </a>
    
    HTML;
    
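    $dom = new DOMDocument();
    @$dom->loadHTML($result); // @ suppresses warnings about the malformed markup
    $output = []; // Initialize so print_r() works even when no <a> tag is found
    foreach ($dom->getElementsByTagName('a') as $link) {
        $output[] = $link->getAttribute('href');
    }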
    
    echo "<pre>";
    print_r($output);
    echo "</pre>";
    

    Output:

    <pre>Array
    (
        [0] => SOME_URL
        [1] => SOME_URL_2
        [2] => SOME_URL_DEEP
        [3] => SOME_URL3
        [4] => SOME_URL_4
        [5] => SOME_URL_5
        [6] => SOME_URL_6
    )
    </pre>
    

    For those looking to discard any href from the <a> tags containing nested <a> tags, you may refer to @mickmackusa's answer below.


    ORIGINAL ANSWER:

    While the original goal was to retrieve all <a> tags, whether nested or not, the solution below ignored the href of any <a> tag that contained a nested <a> tag. I actually ended up using a slightly different version, which retrieved all <a> tags.

    However, since DOMDocument now automatically parses nested <a> tags as sibling tags, the condition present in the foreach loop is never met and the result always contains all <a> tags.
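
    Since the behavior described above depends on how the installed parser treats nested <a> tags, a quick check can show which case applies on a given setup. This is a minimal sketch, assuming a hypothetical two-tag sample; it uses an XPath query to count <a> elements that still contain another <a>, and makes no claim about which result a particular PHP or libxml version produces.

    ```php
    <?php
    // Hypothetical sample: one <a> tag nested inside another.
    $html = '<a href="OUTER"><a href="INNER">text</a></a>';

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // Collect parse warnings instead of using @
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Count <a> elements that still have an <a> descendant after parsing.
    $xpath = new DOMXPath($dom);
    $nested = $xpath->query('//a[.//a]')->length;

    echo $nested > 0
        ? "Nested <a> tags were kept nested.\n"
        : "Nested <a> tags were parsed as siblings.\n";

    // In both cases each <a> element is visited once, in document order.
    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    print_r($hrefs);
    ```

    If the check reports nested tags, the substr_count() condition in the original solution can still be met on that setup.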

    Original Solution:

    $result = <<<HTML
    <a href="SOME_URL">
        <a href="SOME_URL_2">
        </a>
    </a>
    
    <a href="SOME_URL3">
        <a href="SOME_URL_4">
        </a>
    </a>
    
    <a href="SOME_URL_5">
    </a>
    <a href="SOME_URL_6">
    </a>
    
    HTML;
    
    $dom = new DOMDocument();
    @$dom->loadHTML($result); // @ suppresses warnings about the malformed markup
    $output = []; // Initialize so print_r() works even when no <a> tag is found
    foreach ($dom->getElementsByTagName('a') as $link) {
        $tag_html = $dom->saveHTML($link); // Serialize the tag itself plus any nested children
        
        if (substr_count($tag_html, "href") > 1) { // More than one "href" in the tag's HTML, so it wraps a nested <a>
            preg_match_all('/href="([^"]*)"/is', $tag_html, $link_output, PREG_SET_ORDER);
            $output[] = $link_output[1][1]; // Second match, first capture group: the nested href
        } else { // Not a nested tag
            $output[] = $link->getAttribute('href'); // Output the tag's own href
        }
    }
    
    echo "<pre>";
    print_r($output);
    echo "</pre>";
    

    Original Output:

    <pre>Array
    (
        [0] => SOME_URL_2
        [1] => SOME_URL_4
        [2] => SOME_URL_5
        [3] => SOME_URL_6
    )
    </pre>
    

    Updated Output:

    <pre>Array
    (
        [0] => SOME_URL
        [1] => SOME_URL_2
        [2] => SOME_URL3
        [3] => SOME_URL_4
        [4] => SOME_URL_5
        [5] => SOME_URL_6
    )
    </pre>