I'm trying to write a script that scrapes the canonical URL from a remote page. I'm not a professional developer, so if something in my code is ugly, any explanation would be appreciated.
What I'm trying to do is look for either of these tags:
<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />
<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />
... and extract the URL out of it.
My code so far:
$content = file_get_contents($url);
$content = strtolower($content);
$content = preg_replace("'<style[^>]*>.*</style>'siU",'',$content); // strip css
$content = preg_replace("'<script[^>]*>.*</script>'siU",'',$content); // strip js
$split = explode("\n",$content); // Separate each line
foreach ($split as $k => $v) // For each line
{
if (strpos(' '.$v,'<meta') || strpos(' '.$v,'<link')) // If contains a <meta or <link
{
// Check with regex and if found, return what I need (the URL)
}
}
return $split_content;
I've been fighting with regex for hours trying to figure out how to do this, but it seems to be well beyond my knowledge.
Would someone know how I need to define this rule? Also, does my script seem okay to you, or is there room for improvement?
Thanks a bunch!
Using DOMDocument
This is how you can get the property and content:
$html = '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('meta') as $meta) {
if ($meta->hasAttributes()) {
foreach ($meta->attributes as $attribute) {
$attr[$attribute->nodeName] = $attribute->nodeValue;
}
}
}
print_r($attr);
Output:
Array
(
[property] => og:url
[content] => http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html
)
You can get the same for the second URL:
$html = '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('link') as $link) {
if ($link->hasAttributes()) {
foreach ($link->attributes as $attribute) {
$attr[$attribute->nodeName] = $attribute->nodeValue;
}
}
}
print_r($attr);
Output:
Array
(
[rel] => canonical
[href] => http://www.another-canonical-url.com/is-here
)
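To tie the two snippets together, here is a minimal sketch of a single helper that works on a whole page rather than on one tag. The function name findCanonicalUrl, the choice to prefer the link tag over og:url, and the sample $html string are my own assumptions, not something from the question. It reads the attributes with getAttribute() instead of collecting them all into an array, and it does not lowercase the page first, since that would also lowercase the URL and could break case-sensitive paths. It assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper: returns the canonical URL found in $html, or null if none.
function findCanonicalUrl(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() refuses an empty string
    }

    $dom = new DOMDocument;
    // Real-world pages are rarely valid HTML; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: prefer <link rel="canonical" href="...">.
    foreach ($dom->getElementsByTagName('link') as $link) {
        $href = $link->getAttribute('href'); // '' when the attribute is missing
        if (strtolower($link->getAttribute('rel')) === 'canonical' && $href !== '') {
            return $href;
        }
    }

    // Otherwise fall back to <meta property="og:url" content="...">.
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $content = $meta->getAttribute('content');
        if (strtolower($meta->getAttribute('property')) === 'og:url' && $content !== '') {
            return $content;
        }
    }

    return null;
}

// Usage with an inline sample page (in the question this would be the result of file_get_contents($url)).
$html = '<html><head>'
      . '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />'
      . '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />'
      . '</head><body></body></html>';

echo findCanonicalUrl($html), "\n"; // prints http://www.another-canonical-url.com/is-here
```

Because the parser handles tags that span several lines, scripts, and styles, there is no need to strip anything or split the page line by line beforehand.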