I am using Anemone to crawl a massive site that, to make things worse, has the same content in several different language versions.
There is domain.com/ for the main language and domain.com/de/, domain.com/es/ and so on for the other languages, so I decided to exclude these from the crawl like so:
crawler = Anemone::Core.new('http://domain.com', skip_query_strings: true)
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
However, when I print each URL with puts page.url inside the on_every_page do |page| block, I can see that it is still crawling all of the language variations.
I've even tried adding this:
crawler.focus_crawl{|page| page.links.reject{|i| !i.to_s.match(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/).nil? }}
The intent was to remove the language links from the list of pages to be crawled next.
Any suggestions?
It turns out that the skip_links_like method matches its patterns against the path of each link, not the full URL, meaning you can only match on the part that comes after the domain. So instead of this:
crawler.skip_links_like(/(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
I had to use this:
crawler.skip_links_like(/(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/)
Or, looking at just the regex differences:
Wrong: .+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
Right: ^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*
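To see why the first pattern never fires, here is a minimal sketch that uses only Ruby's standard uri library, with no Anemone involved. It assumes, as described above, that the skip patterns are tested against the path portion of each link. The skipped? helper and the example URLs are hypothetical and exist only for illustration.

```ruby
require 'uri'

# Patterns copied verbatim from the answer above.
WRONG = /(.+com\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/
RIGHT = /(^\/(fi|de|it|no|se|en-bm|dk|fr|ie|en-nz|es|int).*)|(\.(jpg|pdf|png|jpeg)$)/

# Hypothetical helper, not part of Anemone: imitates a path-only match
# by testing the pattern against the path of the URL (e.g. "/de/about").
def skipped?(url, pattern)
  URI(url).path.match?(pattern)
end

# Example URLs are made up for this sketch.
%w[
  http://domain.com/de/about
  http://domain.com/products
  http://domain.com/img/logo.png
].each do |url|
  puts format('%-32s wrong=%-5s right=%s',
              url, skipped?(url, WRONG), skipped?(url, RIGHT))
end
```

Because the path of http://domain.com/de/about is just /de/about, the ".+com\/" prefix has nothing to match, while the "^\/" version matches from the start of the path. The file-extension alternative works under both patterns since it only looks at the end of the string. Note that, as written, the language alternative also matches paths such as /design, because nothing marks the end of the language code; appending something like (\/|$) after the group would be one way to tighten it.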