I am scraping a small number of sites with the ruby anemone gem.
Anemone.crawl("http://www.somesite.com") do |anemone|
  anemone.on_every_page do |page|
    ...
  end
end
Depending on the site, some require 'www' to be present in the URL while others require that it be omitted. How can I configure the crawler, or write my own check, so that it knows which form of the URL to use?
You can't know in advance, so do something similar to what you'd do sitting in front of a browser.
Try one form: check that you get a connection, check that you got a 200 response, then check whether the title contains "error". If none of those checks fail, consider it good.
If any of them fail, try the other form.
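A minimal sketch of that probe, assuming plain Net::HTTP and a made-up helper name working_url; the error-title check is just one example of "does this page look wrong", so adjust it to your sites:

require "net/http"
require "uri"

# Try the bare host and the www form; return the first URL that
# connects, answers 200, and whose title doesn't look like an error page.
def working_url(host)
  ["http://#{host}", "http://www.#{host}"].find do |candidate|
    begin
      response = Net::HTTP.get_response(URI(candidate))
      response.is_a?(Net::HTTPOK) && response.body !~ /<title>[^<]*error/i
    rescue SocketError, Errno::ECONNREFUSED, Timeout::Error
      false # no connection at all, so try the other form
    end
  end
end

Anemone.crawl(working_url("somesite.com")) do |anemone|
  anemone.on_every_page do |page|
    ...
  end
end

Note that Net::HTTP.get_response does not follow redirects, so a site that 301s from one form to the other will fail the 200 check and the other form will be tried, which is usually what you want here.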
The problem with using a canned spider/crawler is that you have to work around its code whenever your situation differs from what the authors anticipated when they wrote the software.