web-scraping web-crawler nutch

Does an open, easily extensible web crawler exist?


I am looking for a web crawler that is mature and can be easily extended. I am interested in the following features... or in the possibility of extending the crawler to support them:

Each of the things above can be done individually without much effort, but I am interested in a solution that provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am unsure about the state of the project. Do you have any experience with it? Can you recommend alternatives?


Solution

  • A quick search on GitHub turned up Anemone, a web spider framework written in Ruby that seems to fit your requirements, particularly extensibility.
    Hope it goes well!
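
To make "extensible" a bit more concrete, here is a rough sketch of the callback-hook pattern that spider frameworks typically build on. It is not Anemone's API: `TinyCrawler`, `on_page`, `skip`, the `SITE` hash, and the `example.test` URLs are all made up for illustration, and the fetcher is an in-memory lookup rather than real HTTP so the snippet runs offline with only the standard library.

```ruby
require "set"
require "uri"

# Hypothetical minimal crawler (not Anemone). The fetcher is injected so the
# crawl logic can be extended or exercised without touching the network.
class TinyCrawler
  def initialize(fetcher)
    @fetcher = fetcher      # callable: URL string -> HTML string, or nil
    @page_hooks = []        # blocks run on every fetched page
    @skip_patterns = []     # regexes for URLs that must not be fetched
  end

  # Register a block that receives (url, html) for each fetched page.
  def on_page(&block)
    @page_hooks << block
    self
  end

  # Register a pattern; URLs matching it are never fetched.
  def skip(pattern)
    @skip_patterns << pattern
    self
  end

  # Breadth-first crawl from start_url; returns fetched URLs in visit order.
  def crawl(start_url)
    visited = Set.new
    queue = [start_url]
    order = []
    until queue.empty?
      url = queue.shift
      next if visited.include?(url)
      next if @skip_patterns.any? { |pattern| pattern.match?(url) }
      visited << url
      html = @fetcher.call(url)
      next if html.nil?
      order << url
      @page_hooks.each { |hook| hook.call(url, html) }
      queue.concat(extract_links(url, html))
    end
    order
  end

  private

  # Naive href extraction with a regex; a real crawler would use an HTML parser.
  def extract_links(base, html)
    html.scan(/href="([^"]+)"/).flatten.map { |href| URI.join(base, href).to_s }
  end
end

# In-memory "site" standing in for HTTP responses (made-up URLs).
SITE = {
  "http://example.test/" => '<a href="/a">a</a> <a href="/private/x">x</a>',
  "http://example.test/a" => '<a href="/">home</a> <a href="/b">b</a>',
  "http://example.test/b" => "no links here",
  "http://example.test/private/x" => "secret"
}.freeze

# Wires up one page hook and one skip rule, then crawls the fake site.
def demo_crawl
  seen = []
  crawler = TinyCrawler.new(->(url) { SITE[url] })
  crawler.on_page { |url, _html| seen << url }
  crawler.skip(%r{/private/})
  visited = crawler.crawl("http://example.test/")
  [visited, seen]
end
```

New behaviour (filtering, storage, custom parsing) is added by registering more blocks rather than editing the crawl loop, which is the property worth checking for in whichever framework you pick.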