Tags: python, process, scrapy, web-crawler, twisted

Is there a way of running two spiders with a CrawlerRunner / CrawlerProcess and saving the results in two separate files?


I have two scrapy spiders in two different scripts

Spiders
 Spider1.py
 Spider2.py

An example of the code in the spiders is as follows:

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

class Spider(scrapy.Spider):
    # some code

runner = CrawlerRunner(
    settings={'FEEDS': {'../input/next.csv': {'format': 'csv'}}})
runner.crawl(Spider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

I am running both spiders from a separate script with the following code:

import runpy as r

def run_webscraper():
    r.run_path(path_name='Spider1.py')
    r.run_path(path_name='Spider2.py')

if __name__ == '__main__':
    run_webscraper()

When I run the spiders, Spider1 runs and saves its results to the corresponding CSV file, but when Spider2 executes I get the following error:

twisted.internet.error.ReactorNotRestartable

Any ideas on how to fix the code so that the two spiders run and save their results in separate files (spider1.csv, spider2.csv)?

Is this actually possible?


Solution

  • I believe you can do this by defining a custom_settings attribute within each spider, like this:

    spider1:

    class Spider1(scrapy.Spider):
      name='spider1'
      custom_settings = {
        'FEEDS': {
          'spider1.csv': {
            'format': 'csv'
          }
        }
      }
    

    spider2:

    class Spider2(scrapy.Spider):
      name='spider2'
      custom_settings = {
        'FEEDS': {
          'spider2.csv': {
            'format': 'csv'
          }
        }
      }