php mysql hyperlink file-sharing link-checking

Check that a link works and, if not, visually identify it as broken


I am working on a project that lists file-sharing URLs from the likes of Oron, filespost, depositfiles etc. and reports sharing of copyrighted material to the identified content owners and rights holders in my network.

To improve the service, which currently consists of a table populated from a MySQL database with some filters built into the PHP, I want to be able to identify the links that have ceased to function.

My thought is that when the data is retrieved from the MySQL database, each entry in the download URL column (the URL to the file or file-host page) will be checked to see whether it still leads to the file-sharing page that lets users start the download. If the link works and the file can be downloaded, the link text or cell colour should turn green; if the site displays "file not found" or similar, the link text or cell background colour should turn red.

At present there is no quick and easy visual representation of active or inactive links.

I have a simple validation on the URL based on whether a 404 error is received, but I quickly realised that won't work: these sites don't return a 404 or even redirect; they change the dynamically generated page to say the file is not available, the file has been removed, etc.

I have also incorporated a link-checker script that uses a third-party file-share link-checking service, but this would require manual checks and manual updating of the database.

I have also tried searching for specific fields or words on the page, but given the range of sites, and the broader range of terms used on them, this too has proven inaccurate and difficult to implement for all links.

It would also be helpful if URLs could then be filtered by active status. I'm guessing that if the colour change were managed by a link class or cell class style (e.g. link-dead or link-active), I could filter the column based on that class. I think I can do this myself, so help with this last filtering step is not strictly required.

Any help would be greatly appreciated.


Solution

  • As the sites you want to check are created by different people, there is unlikely to be a one-liner to detect if a link is broken or not over a vast number of sites.

    I suggest that you create a simple function for each site that detects if the link is broken for that particular site. When you want to check a link, you would decide which function to run on the external site's HTML based on the domain name.

    You can use parse_url() to extract the domain/host from the file links:

    // Get your url from the database. Here I'll just set it:
    $file_url_from_database = 'http://example.com/link/to/file?var=1&hello=world#file';
    
    $parsed_link = parse_url($file_url_from_database);
    $domain = $parsed_link['host']; // $domain now equals 'example.com'
    

    You could store the function names in an associative array and call them that way:

    function check_domain_com(){ ... }
    function check_example_com(){ ... }
    
    $link_checkers = array();
    $link_checkers['domain.com'] = 'check_domain_com';
    $link_checkers['example.com'] = 'check_example_com';
    

    or store the functions in the array (PHP >=5.3).

    $link_checkers = array();
    $link_checkers['domain.com'] = function(){ ... };
    $link_checkers['example.com'] = function(){ ... };
    

    and call these with

    if (isset($link_checkers[$domain])) {
        // call the function stored under the index 'example.com'
        call_user_func($link_checkers[$domain]);
    } else {
        throw new Exception("I don't know how to check the domain $domain");
    }
    

    Alternatively you could just use a bunch of if statements

    if($domain == 'domain.com')
        check_domain_com();
    else if($domain == 'example.com')
        check_example_com(); // this function is called
    

    The functions could return a boolean (true or false; 0 or 1) to use, or themselves call another function if needed (for example, to add an extra CSS class to broken links).
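    A single per-site checker might look like the sketch below. The "dead link" marker phrases are assumptions for illustration; for each real site you would inspect its "file removed" page and pick phrases that reliably identify a dead link:

    ```php
    <?php
    // Pure helper: decide from fetched HTML whether the file is gone.
    // The marker strings here are hypothetical examples.
    function example_com_page_is_dead($html)
    {
        $dead_markers = array('file not found', 'has been removed');
        foreach ($dead_markers as $marker) {
            if (stripos($html, $marker) !== false) {
                return true;
            }
        }
        return false;
    }

    // The checker fetches the page and returns true for a working link.
    function check_example_com($url)
    {
        $html = @file_get_contents($url);
        if ($html === false) {
            return false; // request failed entirely; treat as dead
        }
        return !example_com_page_is_dead($html);
    }
    ```

    Keeping the marker-matching logic in its own function makes each site's rules easy to test without making live HTTP requests.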

    I did something similar recently, though I was fetching metadata for stock photography from multiple sites. I used an abstract class because I had a few functions to run for each site.

    As a side note, it would be wise to store the last checked date in your database and limit the checking rate to something like 24 or 48 hours (or further apart depending on your needs).
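    For example, assuming a hypothetical `last_checked` DATETIME column on your links table (NULL meaning never checked), the "is this link due for a re-check?" decision can be isolated in a small helper; the 24-hour default is just an assumption:

    ```php
    <?php
    // Pure helper: is this link due for a re-check?
    // $last_checked_ts and $now_ts are Unix timestamps.
    function needs_check($last_checked_ts, $now_ts, $interval_hours = 24)
    {
        if ($last_checked_ts === null) {
            return true; // never checked before
        }
        return ($now_ts - $last_checked_ts) >= $interval_hours * 3600;
    }

    // In the checking script you would then select only stale rows, e.g.:
    //   SELECT id, url FROM links
    //   WHERE last_checked IS NULL
    //      OR last_checked < NOW() - INTERVAL 24 HOUR
    // ...and after each check:
    //   UPDATE links SET is_active = ?, last_checked = NOW() WHERE id = ?
    ```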


    Edit to clarify implementation a little:

    As making HTTP requests to other websites is potentially very slow, you will want to check and update link statuses independently of page loads, for example from a scheduled (cron) script that runs the checks and writes the results back to the database.

    As people can easily click a link to check its current state, it would be redundant to let them click a button on your page to check it (nothing against the idea, though).

    Note that the potentially resource-heavy update-all script should not be executable (accessible) via web.