My web site has a database lookup; filling out a CAPTCHA gives you 5 minutes of lookup time. There is also some custom code to detect any automated scripts. I do this as I don't want someone data mining my site.
The problem is that Google does not see the lookup results when it crawls my site. If someone is searching for a string that is present in the result of a lookup, I would like them to find this page by Googling it.
The obvious solution to me is to use the PHP variable $_SERVER['HTTP_USER_AGENT']
to bypass the CAPTCHA and custom security code for the Google bots. My question is whether this is sensible or not.
People could then use Google's cache to view the lookup results without having to fill out the CAPTCHA, but would Google's own script detection methods prevent them from data mining these pages?
Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT']
appear as Google to bypass the security measures?
Or would there be some way for people to make $_SERVER['HTTP_USER_AGENT'] appear as Google to bypass the security measures?
Definitely. The user agent is laughably easy to forge. See e.g. User Agent Switcher for Firefox. It's also easy for a spam bot to set its user agent header to the Google bot.
It might still be worth a shot, though. I'd say just try it out and see what the results are. If you get problems, you may have to think about another way.
An additional way to recognize the Google bot could be the IP range(s) it uses. I don't know whether the bot uses defined IP ranges - it could be that that's not the case, you'd have to find out.
Update: it seems to be possible to verify the Google Bot by analyzing its IP. From Google Webmaster Central: How to verify Googlebot
Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:
host 66.249.66.1 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com crawl-66-249-66-1.googlebot.com has address 66.249.66.1
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.