Tags: php, regex, relative-path, relative-url

file_get_contents() - Fix relative URLs


I am trying to display a website to a user after downloading it with PHP. This is the script I am using:

<?php
$url = 'http://stackoverflow.com/pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
//Fix relative URLs
$site = str_replace('src="','src="' . $url,$site);
$site = str_replace('url(','url(' . $url,$site);
//Display to user
echo $site;
?>

So far this script works a treat, except for a few major problems with the str_replace function. The problem comes with relative URLs. Suppose our made-up pagecalledjohn.php includes an image of a cat. It is a PNG and, as I see it, it can be placed on the page using 6 different URLs:

1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png" (not applicable in this case, but added anyway!)
5. src="/cat.png"
6. src="cat.png"

Is there a way, using PHP, to search for src=" and replace it with the URL of the page being downloaded (filename removed), while leaving options 1, 2 and 3 untouched and adjusting the procedure slightly for 4, 5 and 6?


Solution

  • Rather than trying to change every path reference in the source code, why don't you simply inject a <base> tag into the document's <head> to specifically indicate the base URL against which all relative URLs should be resolved?

    https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

    This can be achieved using your DOM manipulation tool of choice. The example below would show how to do this using DOMDocument and related classes.

    $target_domain = 'http://stackoverflow.com/';
    $url = $target_domain . 'pagecalledjohn.php';
    //Download page
    $site = file_get_contents($url);
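    // loadHTML() is an instance method; calling it statically throws an Error on PHP 8
    $dom = new DOMDocument();
    
    // @ silences the warnings libxml emits for real-world, non-validating markup
    if($site === false || @$dom->loadHTML($site) === false) {
        // something went wrong downloading the page or loading HTML to DOM Document
        // provide error messaging and exit
        exit('Could not download or parse ' . $url);
    }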
    
    // find <head> tag
    $head_tag_list = $dom->getElementsByTagName('head');
    // there should only be one <head> tag
    if($head_tag_list->length !== 1) {
        throw new Exception('Wow! The HTML is malformed without single head tag.');
    }
    $head_tag = $head_tag_list->item(0);
    
    // find first child of head tag to later use in insertion
    $head_has_children = $head_tag->hasChildNodes();
    if($head_has_children) {
        $head_tag_first_child = $head_tag->firstChild;
    }
    
    // create new <base> tag
    $base_element = $dom->createElement('base');
    $base_element->setAttribute('href', $target_domain);
    
    // insert new base tag as first child to head tag
    if($head_has_children) {
        $base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
    } else {
        $base_node = $head_tag->appendChild($base_element);
    }
    
    echo $dom->saveHTML();
    

    At the very minimum, if you truly want to modify all path references in the source code, I would HIGHLY recommend doing so with DOM manipulation tools (DOMDocument, DOMXPath, etc.) rather than regex. I think you will find it a much more stable solution.
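
If you do go down the route of rewriting each path, the following is a minimal sketch of how that might look with DOMDocument and DOMXPath. The helper names resolve_url() and absolutize_src() are hypothetical (they are not part of PHP or of the code above), and the sketch makes simplifying assumptions: it only touches src attributes, it does not normalise "../" segments, and it does not handle url(...) references inside CSS.

```php
<?php
// Hypothetical helper: resolve a src value against the URL of the downloaded page.
// Covers the six forms listed in the question; "../" segments are NOT normalised.
function resolve_url(string $ref, string $base): string
{
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Forms 2 and 3: already absolute (this also leaves data:, mailto: etc. alone)
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $ref)) {
        return $ref;
    }
    // Form 1: protocol-relative, so reuse the page's scheme
    if (strpos($ref, '//') === 0) {
        return $scheme . ':' . $ref;
    }
    // Form 5: root-relative
    if (strpos($ref, '/') === 0) {
        return $origin . $ref;
    }
    // Forms 4 and 6: relative to the page's directory (filename removed)
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $ref;
}

// Hypothetical helper: rewrite every src attribute in $html to an absolute URL.
function absolutize_src(string $html, string $page_url): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $loaded   = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($loaded === false) {
        throw new RuntimeException('Could not parse HTML');
    }

    $xpath = new DOMXPath($dom);
    // Select every element that carries a src attribute (img, script, iframe, ...)
    foreach ($xpath->query('//*[@src]') as $node) {
        $node->setAttribute('src', resolve_url($node->getAttribute('src'), $page_url));
    }
    return $dom->saveHTML();
}
```

Usage would be along the lines of echo absolutize_src($site, $url); in place of the two str_replace calls from the question. The <base> approach remains simpler, since the browser then resolves href, src and CSS url(...) references by itself.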