python, web-scraping, scrapy

Single Scrapy Project vs. Multiple Projects


I have a dilemma about how to store all of my spiders. These spiders will be fed into Apache NiFi via a command-line invocation, with items read from stdin. I also plan to have a subset of these spiders return single-item results using scrapyrt on a separate web server. I will need to create spiders across many different projects with different item models, but they will all have similar settings (like using the same proxy).

My question is what is the best way to structure my scrapy project(s)?

  1. Place all spiders in the same repository. This provides an easy way to make base classes for Item loaders and Item pipelines (see the sketch after this list).
  2. Group the spiders for each project I am working on into separate repositories. This has the advantage of keeping Items the focal point of each project and preventing any one project from getting too large, but common code, settings, spider monitors (spidermon), and base classes cannot be shared. This still feels the cleanest to me, even though there is some duplication.
  3. Package only the spiders I plan to run non-realtime in the NiFi repo and the realtime ones in a separate repo. This keeps the spiders with the projects that will actually use them, but it still muddies which spiders are used with which projects.
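
For option 1, a single repository makes it easy to share defaults across every spider. As a minimal sketch (the class and field names below are illustrative, not taken from the question), a common item loader and a shared validation pipeline could look like this, with the pipeline enabled once via the project's ITEM_PIPELINES setting:

    from itemloaders.processors import MapCompose, TakeFirst
    from scrapy.exceptions import DropItem
    from scrapy.loader import ItemLoader

    class BaseLoader(ItemLoader):
        # default cleaning rules every spider in the repository inherits
        default_input_processor = MapCompose(str.strip)
        default_output_processor = TakeFirst()

    class RequiredFieldsPipeline:
        # shared validation: drop any item that is missing a 'url' field
        def process_item(self, item, spider):
            if not item.get("url"):
                raise DropItem(f"missing url in item from {spider.name}")
            return item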

It feels like the right answer is #2. Spiders related to a specific program should live in their own Scrapy project, just as when you create a web service for project A, you don't throw project B's endpoints into the same service simply because that is where all of your services live, even if some settings end up duplicated. Arguably, the common code and base classes could be distributed through a separate package (sketched below).
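
If you do split into separate projects, the common settings and base classes can live in one small internal package that every project imports. A rough sketch, assuming a hypothetical package name shared_scrapy_commons (none of these names come from the question):

    # shared_scrapy_commons/settings.py
    COMMON_SETTINGS = {
        "DOWNLOAD_DELAY": 0.5,
        "HTTPPROXY_ENABLED": True,   # e.g. the shared proxy mentioned above
        "RETRY_TIMES": 3,
    }

    # project_a/settings.py
    from shared_scrapy_commons.settings import COMMON_SETTINGS

    globals().update(COMMON_SETTINGS)    # apply the shared defaults first
    BOT_NAME = "project_a"               # then project-specific overrides
    SPIDER_MODULES = ["project_a.spiders"]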

What do you think? How are you all structuring your Scrapy projects to maximize reusability? Where do you draw the line between keeping spiders in the same project and splitting them into separate projects? Is it based on your Item model or the data source?


Solution

  • Jakob, in the Google Groups thread titled "Single Scrapy Project vs. Multiple Projects for Various Sources", recommended:

    whether spiders should go into the same project is mainly determined by the type of data they scrape, and not by where the data comes from.

    Say you are scraping user profiles from all your target sites, then you may have an item pipeline that cleans and validates user avatars, and one that exports them into your "avatars" database. It makes sense to put all spiders into the same project. After all, they all use the same pipelines because the data always has the same shape no matter where it was scraped from. On the other hand, if you are scraping questions from Stack Overflow, user profiles from Wikipedia, and issues from Github, and you validate/process/export all of these data types differently, it would make more sense to put the spiders into separate projects.

    In other words, if your spiders have common dependencies (e.g. they share item definitions/pipelines/middlewares), they probably belong into the same project; if each of them has their own specific dependencies, they probably belong into separate projects.

    Pablo Hoffman, one of the developers of Scrapy, responded in another thread, "Scrapy spider vs project", with:

    ...recommend to keep all spiders into the same project to improve code reusability (common code, helper functions, etc).

    We've used prefixes on spider names at times, like film_spider1, film_spider2, actor_spider1, actor_spider2, etc. And sometimes we also write spiders that scrape multiple item types, as it makes more sense when there is a big overlap on the pages crawled.
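
To make the naming convention concrete, here is a small sketch of what a prefixed spider that also yields more than one item type might look like (the spider name, URL, and selectors are placeholders, not taken from the thread):

    import scrapy

    class FilmSpider1(scrapy.Spider):
        # the film_ prefix groups related spiders in the output of `scrapy list`
        name = "film_spider1"
        start_urls = ["https://example.com/films"]

        def parse(self, response):
            # one spider can emit several item types when the crawled pages overlap
            yield {"type": "film", "title": response.css("h1::text").get()}
            for actor in response.css(".cast .actor::text").getall():
                yield {"type": "actor", "name": actor}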