[SOLVED] Anemone Crawler skip_links

Anemone Crawler skip_links_like not obeyed

I am using Anemone to crawl a massive site that, to make things worse, serves the same content in several language versions.

There is domain.com/ for the main language and domain.com/de/, domain.com/es/ and so on for the other languages, so I decided to exclude these from the crawl like so:

crawler = Anemone::Core.new('http://domain.com', skip_query_strings: true)
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

However, when I look at what is being crawled (via a puts page.url in the on_every_page do |page| block), I can see that it is still crawling all of the language variations.

I've even tried adding this:

crawler.focus_crawl{|page| page.links.reject{|i| !i.to_s.match(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/).nil? }}

This was meant to remove the language links from the list of pages to be crawled next.

Any suggestions?

Solution

It turns out that skip_links_like matches its patterns against the path of each link, not the full URL, so you can only match on the part that comes after the domain. Instead of this:

crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

I had to use this:

crawler.skip_links_like(/(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

Or, showing just the regex differences:

Wrong: .+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*

Right: ^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
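The difference can be checked without running a crawl. This is a minimal sketch using only Ruby's standard URI library, and it assumes that Anemone tests each pattern against the path of the link. The skipped? helper, the STRICT pattern and the sample URLs are hypothetical and exist only for illustration.

```ruby
require 'uri'

# The two patterns from the post, unchanged.
WRONG = /(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/
RIGHT = /(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/

# Hypothetical stricter variant: the language code must be a whole path segment.
STRICT = /(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int)(\/|$))|(\.(jpg|pdf|png|jpeg)$)/

# Hypothetical helper: applies a pattern to the path only,
# which is what the crawler is assumed to do internally.
def skipped?(url, pattern)
  !(URI(url).path =~ pattern).nil?
end

urls = %w[
  http://domain.com/de/products
  http://domain.com/products
  http://domain.com/design
  http://domain.com/images/logo.png
]

urls.each do |url|
  puts format('%-36s wrong=%-5s right=%-5s strict=%s',
              url,
              skipped?(url, WRONG),
              skipped?(url, RIGHT),
              skipped?(url, STRICT))
end
```

The path of http://domain.com/de/products is just /de/products, so the ".+com\/" prefix never has anything to match, while the "^\/" version does. Note that the corrected pattern also matches any path that merely begins with one of the language codes, such as /design. If the site has pages like that, requiring a slash or the end of the path after the code (as in the STRICT sketch) avoids skipping them.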