php html web-crawler simple-html-dom base-tag

Changing the Base URL for crawled links


I am crawling links from some websites with Simple HTML DOM. However, I have run into the problem that many websites use relative links instead of full URLs.

So what happens is that I crawl the links, and output them directly onto my website, but each link leads to www.mydomain.com/somearticle instead of www.crawleddomain.com/somearticle.

I have done some digging and found out about the BASE tag. Since I am crawling from multiple sites, I cannot just set one base tag for my website, because the base would have to change from output to output. So I looked for a way to apply a base tag to a single div only. I stumbled upon this answer.

I tried manually prepending the base URL as shown below, but that did not work:

echo ('http://www.baselink.com/' . strip_tags($post, '<p><a>'));

I also tried the second option, the correct_urls($html, $baseurl) function, but apparently that function does not exist.

Is there any way to set the base URL for the relative URLs (or prepend it to them) inside a for-loop in PHP?

Here is the output

And here is the code I am using:

<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.buzzfeed.com/trending?country=en-us";

$html = new simple_html_dom();

$html->load_file($target_url);

$posts = $html->find('ul[class=list--numbered trending-posts trending-posts-now]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
  $post = $posts[$i];
  $post->find('div[class=trending-post-text]',0)->outertext = "";
  echo strip_tags ($post, '<p><a>');  
}
?>
</div>
</div>

Solution

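  • You need a library that converts relative hrefs to absolute URLs; the snippet below uses phpUri.

    After loading the page and before echoing anything, rewrite every link:

    include_once('phpuri.php');
    
    // Parse the crawled page's URL once; it serves as the base for every link.
    $uri = phpUri::parse($target_url);
    
    // Replace each href with its absolute form, resolved against the base.
    foreach($html->find('a[href]') as $a){
      $a->href = $uri->join($a->href);
    }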
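
If adding a dependency is not an option, the same idea can be approximated with PHP's built-in parse_url(). The following is only a minimal sketch, not phpUri's implementation: resolve_url() is a hypothetical helper name, it assumes the base is an absolute http or https URL, and it does not cover every case in RFC 3986.

```php
<?php
// resolve_url() is a hypothetical helper, not part of Simple HTML DOM or phpUri.
// Assumption: $base is an absolute http(s) URL.
function resolve_url(string $base, string $rel): string
{
    // Empty, or already absolute (starts with a scheme such as "https:" or "mailto:").
    if ($rel === '' || preg_match('#^[a-z][a-z0-9+.\-]*:#i', $rel)) {
        return $rel;
    }

    $b        = parse_url($base);
    $scheme   = $b['scheme'] ?? 'http';
    $origin   = $scheme . '://' . ($b['host'] ?? '') . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Protocol-relative link (//host/path): reuse the base scheme.
    if (strpos($rel, '//') === 0) {
        return $scheme . ':' . $rel;
    }
    // Query-only or fragment-only link: keep the base path.
    if ($rel[0] === '?') {
        return $origin . $basePath . $rel;
    }
    if ($rel[0] === '#') {
        return $origin . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $rel;
    }

    // Separate the path from any query/fragment so only the path is normalised.
    $cut     = strcspn($rel, '?#');
    $relPath = substr($rel, 0, $cut);
    $suffix  = substr($rel, $cut);

    // Root-relative paths replace the base path; others are appended to its directory.
    $path = ($relPath[0] === '/')
        ? $relPath
        : substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;

    // Collapse "." and ".." segments.
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = ''; // a trailing dot segment refers to a directory
    }

    return $origin . '/' . ltrim(implode('/', $out), '/') . $suffix;
}
```

With Simple HTML DOM it would slot into the same loop as above, replacing the phpUri call: $a->href = resolve_url($target_url, $a->href);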