ruby anemone

Anemone Crawler skip_links_like not obeyed


I am using Anemone to crawl a massive site that, to make things worse, serves the same content in several different language versions.

The main language lives at domain.com/ and the other languages at domain.com/de/, domain.com/es/ and so on, so I decided to exclude the language versions from the crawl like so:

crawler = Anemone::Core.new('http://domain.com', skip_query_strings: true)
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)

However, when I print each crawled URL with puts page.url inside the on_every_page do |page| block, I can see that it is still crawling all of the language variations.

I've even tried adding this, to remove the language links from the list of pages to crawl next:

crawler.focus_crawl{|page| page.links.reject{|i| !i.to_s.match(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/).nil? }}

That made no difference either.

Any suggestions?


Solution

  • It turns out that skip_links_like matches its patterns against the path of each link rather than the full URL, so a pattern can only match on what comes after the domain. Instead of this:

    crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
    

    I had to use this:

    crawler.skip_links_like(/(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
    

    Or, showing just the difference in the regex:

    Wrong: .+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*

    Right: ^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
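The difference between the two patterns can be checked without running a crawl. The following is a minimal sketch that assumes, as the answer describes, that only the path of each link is tested against the pattern. The skipped? helper and the STRICT_PATH_PATTERN constant are hypothetical names introduced for illustration and are not part of Anemone; the sketch uses only Ruby's standard URI library to extract the path.

```ruby
require 'uri'

# Pattern from the question: it expects the host ("...com/") to be in the string.
FULL_URL_PATTERN = /(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/

# Pattern from the answer: anchored at the start of the path.
PATH_PATTERN = /(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/

# Hypothetical stricter variant (an assumption, not from the answer): the
# language code must be a whole path segment, so "/design" is not treated as "/de".
STRICT_PATH_PATTERN = /\A\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int)(\/|\z)|\.(jpg|pdf|png|jpeg)\z/

# Hypothetical helper: true when the pattern matches the path part of the URL.
# This stands in for the crawler's skip check; it is not Anemone's own code.
def skipped?(url, pattern)
  !(URI(url).path =~ pattern).nil?
end

puts skipped?('http://domain.com/de/page', FULL_URL_PATTERN)  # the path has no "com/" in it
puts skipped?('http://domain.com/de/page', PATH_PATTERN)
puts skipped?('http://domain.com/design', PATH_PATTERN)       # also starts with "/de"
puts skipped?('http://domain.com/design', STRICT_PATH_PATTERN)
```

Note that the pattern from the answer also skips any path that merely begins with one of the language codes, such as /design or /italy. If the site has such pages, requiring a slash or the end of the path after the code, as in the hypothetical stricter variant, avoids skipping them.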