
When to use 'http://' or 'http://www.' when scraping?

I am scraping a small number of sites with the Ruby Anemone gem.

Anemone.crawl("") do |anemone|
  anemone.on_every_page do |page|
  end
end

Depending on the site, some require 'www' to be present in the URL while others require that it be omitted. How can I configure or code the crawler so that it knows which form of the URL to use?


  • You can't know in advance, so do what you'd do while sitting in front of a browser.

    Try one form: check that you get a connection, that the response is a 200, and that the page title doesn't contain "error". If all of those checks pass, consider it good.

    If not, try the other.

    The problem with using a canned spider/crawler is that you have to work around its code when the situation differs from what its authors expected when they wrote the software.
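
The try-one-then-the-other check described above could be sketched roughly like this, using only Ruby's standard library. The helper names (candidate_urls, looks_good?, pick_url) are made up for illustration and are not part of Anemone. The sketch assumes plain http://, does not follow redirects, and treats anything other than a 200 as a failure.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build the bare and the 'www.' form of a host name.
def candidate_urls(host)
  bare = host.sub(/\Awww\./, '')
  ["http://#{bare}/", "http://www.#{bare}/"]
end

# Hypothetical helper: true if the URL connects, answers 200, and the
# page title does not contain "error". `fetch` is injectable so the
# check can be exercised without touching the network.
def looks_good?(url, fetch: Net::HTTP.method(:get_response))
  response = fetch.call(URI(url))
  return false unless response.code == '200'

  title = response.body.to_s[/<title[^>]*>(.*?)<\/title>/mi, 1].to_s
  !title.match?(/error/i)
rescue SocketError, SystemCallError, Timeout::Error
  # DNS failure, refused connection, or timeout: treat as "no connection".
  false
end

# Hypothetical helper: first candidate that passes all checks, or nil.
def pick_url(host, fetch: Net::HTTP.method(:get_response))
  candidate_urls(host).find { |url| looks_good?(url, fetch: fetch) }
end
```

With something like this in place, the value returned by pick_url for a given host (or nil, if neither form works) could be passed to Anemone.crawl instead of a hard-coded URL. A site that redirects one form to the other answers with a 3xx for the redirecting form, so this sketch would skip it and settle on the form that answers 200 directly.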