ruby-on-rails ruby web-crawler anemone

When to use 'http://' or 'http://www.' when scraping?


I am scraping a small number of sites with the Ruby Anemone gem.

require 'anemone'

Anemone.crawl("http://www.somesite.com") do |anemone|
  # Called once for each page the crawler fetches
  anemone.on_every_page do |page|
    ...
  end
end

Depending on the site, some require 'www' to be present in the URL while others require that it be omitted. How can I configure the crawler, or code it, so that it knows which form of the URL to use?


Solution

  • You can't know in advance, so do what you would do while sitting in front of a browser.

    Try one form of the URL: check that you get a connection, that the response is a 200, and that the page title doesn't have "error" in it. If all of those checks pass, consider it good.

    If not, try the other.

    The problem with using a canned spider/crawler is that you have to work around its code when the situation is different from what its authors expected when they wrote the software.
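
The try-one-then-the-other check could be sketched roughly like this, using only Ruby's standard library before handing the result to Anemone. The helper names (host_variants, page_ok?, working_url) are made up for illustration and are not part of Anemone, and the title test is a crude regex that assumes an HTML page. Note that a redirect (301/302) counts as a failure here, since only a 200 is accepted.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: return the URL as given, followed by its
# www / non-www counterpart.
def host_variants(url)
  uri = URI.parse(url)
  alt = uri.dup
  alt.host = if uri.host.start_with?('www.')
               uri.host.sub(/\Awww\./, '')
             else
               "www.#{uri.host}"
             end
  [uri.to_s, alt.to_s]
end

# Hypothetical helper: true if we can connect, get a 200, and the
# <title> (if any) does not mention "error".
def page_ok?(url)
  response = Net::HTTP.get_response(URI.parse(url))
  return false unless response.is_a?(Net::HTTPOK)

  title = response.body.to_s[/<title[^>]*>(.*?)<\/title>/im, 1].to_s
  !title.match?(/error/i)
rescue StandardError
  # Connection refused, DNS failure, timeout, etc. all count as "not OK".
  false
end

# Return the first variant that passes the check, or nil if neither does.
# The check is injectable so it can be swapped out or stubbed.
def working_url(url, check: method(:page_ok?))
  host_variants(url).find { |candidate| check.call(candidate) }
end
```

You would then call something like start = working_url("http://somesite.com") and pass start to Anemone.crawl only if it is not nil. Depending on the sites involved, you may want to loosen or tighten page_ok?, for example by following redirects.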