Tags: mongodb, ruby-on-rails-3.1, mongoid, web-crawler, anemone

Anemone with Rails and MongoDB


I am preparing to write my first web crawler, and it looks like Anemone makes the most sense. There is built-in support for MongoDB storage, and I am already using MongoDB via Mongoid in my Rails application. My goal is to store the crawled results and then access them later via Rails. I have a couple of concerns:

1) At the end of this page, it says that "Note: Every storage engine will clear out existing Anemone data before beginning a new crawl." I would expect this to happen at the end of the crawl if I were using the default memory storage, but shouldn't the records be persisted to MongoDB indefinitely so that duplicate pages are not crawled next time the task is run? If they are wiped "before beginning a new crawl", then should I just run my Rails logic before the next crawl? If so, then I would end up having to check for duplicate records from the previous crawl.

2) This is the first time I have really thought about using MongoDB outside the context of Rails models. It looks like the records are created using the Page class, so can I later just query these as I normally would using Mongoid? I guess it is just considered a "model" once it has an ORM providing the fancy methods?


Solution

  • Great questions.

    1) It depends on what your goal is.

    In most cases this default makes sense. One does a crawl with Anemone and examines the data.

    When you do a new crawl, the old data should be erased so that the data from the new crawl can replace it.

    If you don't want that to happen, you could point the storage engine at a new collection before starting the new crawl, along the lines of the sketch below.
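
    For example, something like this should keep each crawl's data in its own collection. I'm assuming here that Anemone::Storage.MongoDB accepts a database handle and a collection name, and the collection naming scheme is just an illustration, so check both against the storage factory in your Anemone version:

    require 'anemone'
    require 'mongo'

    # Old mongo driver API (mongo gem 1.x, as Anemone's MongoDB storage expects);
    # adjust for newer drivers.
    db = Mongo::Connection.new.db('anemone')

    # Use a per-crawl collection so a new crawl doesn't clear previous results.
    collection = "pages_#{Time.now.strftime('%Y%m%d')}"

    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.storage = Anemone::Storage.MongoDB(db, collection)
    end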

    2) Mongoid won't create the model classes for you.

    You need to define models so that Mongoid knows to create a class for the collection, and optionally define the fields that each of the documents has so that you can use the dot accessor methods out of the box.

    Something like:

    class Page
      include Mongoid::Document
      field :url, type: String     # I'm guessing; check what kind of docs Anemone produces
      field :aliases, type: Array
      # ... any other fields the crawled documents contain
    end
    

    It will probably need to include a field for whatever the storage engine writes for each page, but please just take a look at what type (string, array, whatever) the storage engine is storing them as and don't make assumptions.
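
    Once that model exists, you can query the crawled documents the same way you would query any other Mongoid model. A rough sketch, assuming Anemone wrote to the pages collection, which is also the collection Mongoid maps the Page class to by default:

    Page.count                                  # how many pages the crawl stored
    Page.where(url: /example\.com/).each do |page|
      puts page.url
    end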

    Good luck!