web-crawler tracking

How to track all website activity and filter out web robot data


I'm doing very rudimentary tracking of page views by logging URLs, referral codes, sessions, times, etc., but I'm finding it's getting bombarded with robots (Google, Yahoo, etc.). I'm wondering what an effective way is to filter out these hits, or to avoid logging them in the first place.

I've experimented with robot IP lists, etc., but this isn't foolproof.
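
For reference, the usual alternative to IP lists is plain substring matching on the user-agent string. A minimal sketch of what I mean, where looks_like_robot() and log_page_view() are hypothetical names and the token list is necessarily incomplete (which is exactly the problem):

    <?php
    // Crude bot detection: match known crawler tokens in the user-agent.
    // The token list is illustrative only, and bots can spoof their
    // user-agent entirely, which is why this approach isn't foolproof.
    function looks_like_robot($userAgent)
    {
        $tokens = array('googlebot', 'slurp', 'bingbot', 'crawler', 'spider', 'bot');
        $userAgent = strtolower($userAgent);
        foreach ($tokens as $token) {
            if (strpos($userAgent, $token) !== false) {
                return true;
            }
        }
        return false;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!looks_like_robot($ua)) {
        log_page_view($_SERVER['REQUEST_URI']);  // stand-in for the existing logging
    }
    ?>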

Is there some kind of robots.txt, htaccess, PHP server-side code, JavaScript, or other method that can "trick" robots or ignore non-human interaction?


Solution

  • Just to add: a technique you can employ within your interface is to use JavaScript to encapsulate the actions that lead to certain user-interaction view/counter increments. For a very rudimentary example, most robots will not (and typically cannot) follow:

    <a href="javascript:viewItem(4)">Chicken Farms</a>
    
    function viewItem(id)
    {
        window.location.href = 'www.example.com/items?id=' + id + '&from=userclick';
    }
    

    Because those clicks are tagged, each one yields a request such as

    www.example.com/items?id=4&from=userclick
    

    That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks (users with JavaScript disabled can't follow the link, and search engines won't index the target page), and of course it really depends on what you're trying to achieve.
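
    On the server side, the counter increment can then key off that parameter. A minimal sketch, assuming a hypothetical record_click() helper that does whatever logging you already have in place:

    <?php
    // Only count views that arrived via the JavaScript click handler;
    // a crawler fetching the bare URL won't carry the from=userclick marker.
    if (isset($_GET['from']) && $_GET['from'] === 'userclick') {
        $itemId = (int) $_GET['id'];  // cast defensively before logging
        record_click($itemId);        // hypothetical stand-in for your logger
    }
    ?>

    Requests without the marker (crawlers, direct fetches, bookmarks) still serve the page as normal; they just don't increment the counter.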