java, network-programming, grails

How to detect top legit search engine bots?


I want to develop a robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise the usual user registration/login is required to view that content.

Note that I also make use of cookies to let users access some content without being registered. So if cookies are disabled on the client browser, no content except the front page is offered. But I heard that search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?

One suggestion I heard is to do a reverse lookup on the IP address, and if it resolves to, for example, googlebot.com, then do a forward DNS lookup; if that returns the original IP, the client is legitimate and not someone impersonating Googlebot. I am using Java on a Linux server, so I am looking for a Java-based solution.
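
Something along these lines is what I have in mind, just a rough sketch using the standard InetAddress API (the googlebot.com/google.com suffixes are an assumption on my part):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class BotVerifier {

        // Rough sketch: verify a claimed Googlebot by a reverse DNS lookup
        // followed by a forward lookup of the returned host name.
        public static boolean isVerifiedGooglebot(String clientIp) {
            try {
                InetAddress addr = InetAddress.getByName(clientIp);

                // Reverse lookup: resolve the IP address to a host name.
                String host = addr.getCanonicalHostName();
                if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
                    return false; // reverse record is not in Google's domains
                }

                // Forward lookup: the host name must resolve back to the same IP.
                for (InetAddress forward : InetAddress.getAllByName(host)) {
                    if (forward.getHostAddress().equals(clientIp)) {
                        return true;
                    }
                }
            } catch (UnknownHostException e) {
                // Treat DNS failures as "not verified".
            }
            return false;
        }
    }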

I only want to let in the top search engine spiders such as Google, Yahoo, Bing, and Alexa, and keep the others out to reduce server load. But it is very important that the top spiders index my site.


Solution

  • For a more complete answer to your question, you can't rely on only one approach. The problem is the conflicting nature of what you want to do. Essentially you want to allow good bots to access your site and index it so you can appear on search engines; but you want to block bad bots from sucking up all your bandwidth and stealing your information.

    First line of defense:

    Create a robots.txt file at the root of your site. See http://www.robotstxt.org/ for more information. This will keep good, well-behaved bots in the areas of the site that make the most sense. Keep in mind that robots.txt relies on the User-Agent string if you provide different behavior for one bot vs. another. See http://www.robotstxt.org/db.html
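
    A minimal robots.txt might look something like this (the path and sitemap URL are just placeholders for your own layout):

        # Placeholder example: keep compliant crawlers out of the members-only
        # area and advertise a sitemap for the public pages.
        User-agent: *
        Disallow: /members/
        Sitemap: http://www.example.com/sitemap.xml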

    Second line of defense:

    Filter on User-Agent and/or IP address. I've already been criticized for suggesting that, but it's surprising how few bots disguise who and what they are, even the bad ones. Again, it's not going to stop all bad behavior, but it provides a level of due diligence. More on leveraging User-Agent later.
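
    In a Java web app, that check can live in a servlet Filter. A rough sketch (the bot tokens here are purely illustrative, and you would combine this with the DNS verification from the question before trusting a "good" bot):

        import java.io.IOException;
        import javax.servlet.Filter;
        import javax.servlet.FilterChain;
        import javax.servlet.FilterConfig;
        import javax.servlet.ServletException;
        import javax.servlet.ServletRequest;
        import javax.servlet.ServletResponse;
        import javax.servlet.http.HttpServletRequest;
        import javax.servlet.http.HttpServletResponse;

        // Sketch: reject requests whose User-Agent identifies as a crawler
        // that is not on the whitelist; everything else passes through.
        public class BotUserAgentFilter implements Filter {

            private static final String[] ALLOWED_BOTS = { "googlebot", "bingbot", "slurp" };
            private static final String[] BOT_HINTS = { "bot", "crawler", "spider" };

            @Override
            public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                    throws IOException, ServletException {
                HttpServletRequest request = (HttpServletRequest) req;
                HttpServletResponse response = (HttpServletResponse) res;
                String ua = request.getHeader("User-Agent");
                String lower = (ua == null) ? "" : ua.toLowerCase();

                boolean looksLikeBot = false;
                for (String hint : BOT_HINTS) {
                    if (lower.contains(hint)) { looksLikeBot = true; break; }
                }

                if (looksLikeBot) {
                    for (String allowed : ALLOWED_BOTS) {
                        if (lower.contains(allowed)) {
                            chain.doFilter(req, res);  // whitelisted crawler: let it through
                            return;
                        }
                    }
                    response.sendError(HttpServletResponse.SC_FORBIDDEN);  // unknown crawler
                    return;
                }

                chain.doFilter(req, res);  // ordinary browser traffic
            }

            @Override public void init(FilterConfig config) { }
            @Override public void destroy() { }
        }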

    Third line of defense:

    Monitor your Web server's access logs. Use a log analyzer to figure out where the bulk of your traffic is coming from. These logs include both the IP address and the User-Agent string, so you can detect how many instances of a bot are hitting you, and whether it is really who it says it is: see http://www.robotstxt.org/iplookup.html

    You may have to whip up your own log analyzer to find out the request rate from different clients. Anything above a certain threshold (maybe 10 requests/second) would be a candidate for rate limiting later on.
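
    That analyzer does not have to be fancy. Something like the following is enough to spot the heavy hitters (assuming a common Apache-style log where the client IP is the first field; the log path and threshold are placeholders, and counting total hits per IP is only a rough proxy for rate unless you also bucket by timestamp):

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.stream.Stream;

        // Sketch: count requests per client IP in an access log and print
        // the ones above a threshold, as candidates for rate limiting.
        public class AccessLogCounter {

            public static void main(String[] args) throws IOException {
                Map<String, Integer> hitsPerIp = new HashMap<>();
                try (Stream<String> lines = Files.lines(Paths.get("/var/log/httpd/access_log"))) {
                    lines.forEach(line -> {
                        String ip = line.split(" ", 2)[0];  // first field = client IP
                        hitsPerIp.merge(ip, 1, Integer::sum);
                    });
                }
                hitsPerIp.entrySet().stream()
                         .filter(e -> e.getValue() > 1000)  // arbitrary threshold
                         .sorted((a, b) -> b.getValue() - a.getValue())
                         .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
            }
        }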

    Leveraging User Agent for Alternative Site Content:

    An approach we had to take to protect our users from even legitimate bots hammering our site was to split traffic based on the User-Agent. Basically, if the User-Agent was a known browser, the client got the full-featured site. If it was not a known browser, it was treated as a bot and given a set of simple HTML files with just the meta information and links needed to do its job. The bots' HTML files were statically generated four times a day, so there was no processing overhead. You can also render RSS feeds, which serve the same function, instead of stripped-down HTML.
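
    A crude version of that split could be another servlet Filter. The sketch below inverts the check and sniffs for bot tokens rather than maintaining a full browser list (the tokens and the /static-bot path mapping are purely illustrative):

        import java.io.IOException;
        import javax.servlet.Filter;
        import javax.servlet.FilterChain;
        import javax.servlet.FilterConfig;
        import javax.servlet.ServletException;
        import javax.servlet.ServletRequest;
        import javax.servlet.ServletResponse;
        import javax.servlet.http.HttpServletRequest;

        // Sketch: anything that looks like a crawler gets the pre-generated,
        // stripped-down HTML instead of the dynamic site.
        public class BotContentFilter implements Filter {

            private static final String[] BOT_TOKENS = { "bot", "crawler", "spider", "slurp" };

            @Override
            public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                    throws IOException, ServletException {
                HttpServletRequest request = (HttpServletRequest) req;
                String ua = request.getHeader("User-Agent");
                String lower = (ua == null) ? "" : ua.toLowerCase();

                boolean isBot = (ua == null);
                for (String token : BOT_TOKENS) {
                    if (lower.contains(token)) { isBot = true; break; }
                }

                if (isBot) {
                    // Hand off to the static, bot-friendly pages (illustrative mapping).
                    request.getRequestDispatcher("/static-bot" + request.getServletPath())
                           .forward(req, res);
                    return;
                }

                chain.doFilter(req, res);  // known browser: full dynamic site
            }

            @Override public void init(FilterConfig config) { }
            @Override public void destroy() { }
        }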

    Final Note:

    You only have so many resources, and not every legitimate bot is well behaved (i.e., it may ignore robots.txt and put a lot of stress on your server). You will have to update your approach over time. For example, if one IP address turns out to be a custom search bot that your client (or their client) made, you may have to resort to rate-limiting that IP address instead of blocking it completely.
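
    A per-IP rate limiter for that case can be as simple as a counter per time window. This is an in-memory sketch only (a real deployment would need eviction and, if clustered, shared state); call allow(ip) at the top of request handling and refuse the request when it returns false:

        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.atomic.AtomicInteger;

        // Sketch: fixed-window rate limiter, e.g. at most 10 requests/second per IP.
        public class IpRateLimiter {

            private static final int MAX_PER_WINDOW = 10;
            private static final long WINDOW_MILLIS = 1000;

            private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

            public boolean allow(String ip) {
                long now = System.currentTimeMillis();
                Window w = windows.compute(ip, (key, old) ->
                        (old == null || now - old.start >= WINDOW_MILLIS) ? new Window(now) : old);
                return w.count.incrementAndGet() <= MAX_PER_WINDOW;
            }

            private static final class Window {
                final long start;
                final AtomicInteger count = new AtomicInteger();
                Window(long start) { this.start = start; }
            }
        }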

    Essentially you are trying to strike a good balance between serving your users and keeping your site available to search engines. Do enough to keep your site responsive to users, and only resort to the more advanced tactics as necessary.