java php jquery archive webarchive

Manipulate linked files in the HTML dynamically


I run a backup website, something like the Wayback Machine. When I return the archived HTML, the linked resources (images, JavaScript files, CSS files, etc.) are still loaded from the original web server instead of my server. I want to rewrite those links so that they are loaded from my server. I see two approaches:

  1. Do it server-side using Java or PHP; I can use either. For instance, in Java I could use jsoup to parse the HTML and replace the links.
  2. Do it client-side using jQuery.

The second approach means my server does not have to spend effort parsing the HTML. However, I think that as soon as the page starts loading, the files will begin downloading from the original server, and the user's bandwidth would be wasted.

On the other hand, if I could somehow determine whether an image was downloaded successfully from the original server, I could skip the download from my server and let the user keep the copy from the original server.
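One way this idea could be sketched in plain JavaScript is to let the browser try the original server first and switch to the archived copy only when the image fails to load. The endpoint `ARCHIVE_PREFIX` and the helper name `useArchiveOnError` are assumptions made for illustration, not part of any existing API.

```javascript
// Hypothetical archive endpoint (an assumption); replace with the real one.
const ARCHIVE_PREFIX = "http://mysite.com/view/content?url=";

// Attach a fallback so the archived copy is requested only if the
// original server fails to deliver the image.
function useArchiveOnError(img) {
  const originalSrc = img.src; // in a browser this is already an absolute URL
  if (!originalSrc) return;    // nothing to fall back from

  function fallback() {
    img.onerror = null; // avoid looping if the archived copy also fails
    img.src = ARCHIVE_PREFIX + encodeURIComponent(originalSrc);
  }

  img.onerror = fallback;

  // An image that failed before this script ran has already fired its
  // error event, so check its state once by hand.
  if (img.complete && img.naturalWidth === 0) fallback();
}

// In a browser, after the DOM is ready:
// document.querySelectorAll("img").forEach(useArchiveOnError);
```

This only covers images. A failed stylesheet or script would need its own handling, and a slow original server still costs the user a wait before the fallback starts.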

What is your suggestion for this?

Update

I should clarify how relative and absolute links are handled. The links on my service are stored as absolute paths. However, the HTML documents may contain both types of links.

In short, what I need to do is convert the relative links in the HTML to absolute links and then send them to my website as the URL argument.
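A minimal sketch of that conversion, assuming the archive serves content from a hypothetical endpoint of the form `http://mysite.com/view/content?url=...` and that the original URL of the archived page is known. The function name `toArchiveUrl` is made up for illustration; `URL` and `encodeURIComponent` are standard JavaScript.

```javascript
// Hypothetical archive endpoint (an assumption); replace with the real one.
const ARCHIVE_ENDPOINT = "http://mysite.com/view/content?url=";

// Resolve a possibly relative link against the archived page's original
// URL, then pass the absolute URL to the archive as the url argument.
function toArchiveUrl(link, originalPageUrl) {
  const absolute = new URL(link, originalPageUrl);

  // Leave mailto:, javascript:, data: and similar links untouched.
  if (absolute.protocol !== "http:" && absolute.protocol !== "https:") {
    return link;
  }

  // Encode the target so its own query string and slashes survive
  // being nested inside another URL.
  return ARCHIVE_ENDPOINT + encodeURIComponent(absolute.href);
}

// Example, with a made-up page URL:
console.log(toArchiveUrl("img/logo.png", "http://example.com/blog/post.html"));
// prints http://mysite.com/view/content?url=http%3A%2F%2Fexample.com%2Fblog%2Fimg%2Flogo.png
```

Because the target is percent-encoded, the server side has to decode the `url` argument before looking up the stored absolute path.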


Solution

  • If the links are relative, you could add a <base> tag with jQuery.

    $(function () {
        var base = $("<base>", {
            href: "http://www.your-new-website.com/"
        });
        $("head").append(base);
    });
    

    UPDATED

    jQuery is not the best solution because every resource is requested twice: once from the original server when the page first loads, and again from the new server after jQuery rewrites the img and css links. Nonetheless, this should work.

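    function replaceDomain(href) {
        // Elements without the attribute (e.g. <a name="...">) return undefined
        if (!href) return href;
        // Host of the archived site. document.domain is the host serving the
        // page, so set this explicitly if the page is served from the archive.
        var originalDomain = document.domain;
        var newDomain = "mysite.com/view/content?url=http://" + originalDomain;
        // Links that do not contain the original host are assumed to be root-relative paths
        if (href.indexOf(originalDomain) === -1) href = "http://" + originalDomain + href;
        return href.replace(originalDomain, newDomain);
    }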
    $(function () {
        //convert links
        $('a').each(function () {
            $(this).attr("href", replaceDomain($(this).attr("href")));
        });
    
        //convert imgs
        $('img').each(function () {
            $(this).attr("src", replaceDomain($(this).attr("src")));
        });
    
        //convert css links
        $('link').each(function () {
            $(this).attr("href", replaceDomain($(this).attr("href")));
        });
    });