php regex web-scraping canonical-link

A regexp to retrieve either og:url meta or link rel="canonical"


I'm trying to write a script to scrape the canonical URL from a remote page. I'm not a professional developer, so if something in my code is ugly, any explanation would (and will) be appreciated.

What I'm trying to do is look for either of these tags:

    <meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />
    <link rel="canonical" href="http://www.another-canonical-url.com/is-here" />

... and extract the URL from it.

My code so far:

    $content = file_get_contents($url);
    $content = strtolower($content);
    $content = preg_replace("'<style[^>]*>.*</style>'siU",'',$content);  // strip css
    $content = preg_replace("'<script[^>]*>.*</script>'siU",'',$content); // strip js
    $split = explode("\n",$content); // Separate each line

    foreach ($split as $k => $v) // For each line
    {
        if (strpos(' '.$v,'<meta') || strpos(' '.$v,'<link')) // If contains a <meta or <link
        {
        // Check with regex and if found, return what I need (the URL)
        }
    }
    return $split_content;

I've been fighting with regex for hours trying to figure out how to do this, but it seems to be well beyond my knowledge.

Would someone know how I need to define this rule? Also, does my script seem okay to you, or is there room for improvement?

Thanks a bunch!


Solution

  • Using DOMDocument, this is how you can get the property and content attributes:

    $html = '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />';
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $attr = array();
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttributes()) {
            foreach ($meta->attributes as $attribute) {
                $attr[$attribute->nodeName] = $attribute->nodeValue;
            }
        }
    }
    
    print_r($attr);
    

    Output:

    Array
    (
        [property] => og:url
        [content] => http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html
    )
    

    You can get the same for the second URL:

    $html = '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />';
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $attr = array();
    foreach ($dom->getElementsByTagName('link') as $link) {
        if ($link->hasAttributes()) {
            foreach ($link->attributes as $attribute) {
                $attr[$attribute->nodeName] = $attribute->nodeValue;
            }
        }
    }
    
    
    print_r($attr);
    

    Output:

    Array
    (
        [rel] => canonical
        [href] => http://www.another-canonical-url.com/is-here
    )
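
On a full page there are usually several `<meta>` and `<link>` tags, so the loops above would keep overwriting `$attr` with whichever tag comes last. A minimal sketch of how the two lookups could be combined with `DOMXPath` to pick out only the tags in question follows. The function name `extractCanonicalUrl` is hypothetical, and checking `og:url` before `rel="canonical"` is an assumption, since the question does not say which one should win. The match on attribute values is exact and case-sensitive, so something like `rel="Canonical"` would not be found.

```php
<?php

/**
 * Hypothetical helper: return the URL from <meta property="og:url"> or,
 * failing that, from <link rel="canonical">. Returns null if neither exists.
 */
function extractCanonicalUrl(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse; loadHTML() does not accept an empty string
    }

    $dom = new DOMDocument();

    // Real pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Assumed priority: og:url first, then rel="canonical".
    $queries = [
        '//meta[@property="og:url"]/@content',
        '//link[@rel="canonical"]/@href',
    ];

    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $url = trim($nodes->item(0)->nodeValue);
            if ($url !== '') {
                return $url;
            }
        }
    }

    return null;
}

// Usage with the <link> tag from the question:
$html = '<html><head><link rel="canonical" href="http://www.another-canonical-url.com/is-here" /></head><body></body></html>';
echo extractCanonicalUrl($html), "\n";
```

With a remote page, the string returned by `file_get_contents($url)` can be passed in directly. Note that it should not go through `strtolower()` first, because the path part of a URL can be case-sensitive.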