
Scrapy - ReactorAlreadyInstalledError when using TwistedScheduler


I have the following Python code that starts an APScheduler/TwistedScheduler cron job to run a spider.

Using one spider was not a problem and worked great. However, using two spiders results in the error: twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.

I did find a related question that uses CrawlerRunner as the solution. However, I'm using a TwistedScheduler object, so I do not know how to get this working with multiple cron jobs (multiple add_job() calls).

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

process = CrawlerProcess(get_project_settings())
# Start the crawler in a scheduler
scheduler = TwistedScheduler(timezone="Europe/Amsterdam")
# Use cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10)
scheduler.add_job(process.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Use cron job; runs the full spider every Monday, Thursday and Saturday at 04:35
scheduler.add_job(process.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()
process.start(False)  # stop_after_crawl=False: keep the reactor running between jobs

Solution

  • I'm now using a BlockingScheduler in combination with Process and CrawlerRunner, and enabling logging via configure_logging().

    from multiprocessing import Process
    
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from scrapy.utils.log import configure_logging
    from apscheduler.schedulers.blocking import BlockingScheduler
    
    from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
    from myprojectscraper.spiders.my_spider import MySpider
    
    from twisted.internet import reactor
    
    # Wrap the CrawlerRunner in a Process, so each crawl runs in its own
    # child process with its own reactor
    class CrawlerRunnerProcess(Process):
        def __init__(self, spider):
            Process.__init__(self)
            self.runner = CrawlerRunner(get_project_settings())
            self.spider = spider
    
        def run(self):
            deferred = self.runner.crawl(self.spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run(installSignalHandlers=False)
    
    # The wrapper to make it run multiple spiders, multiple times
    def run_spider(spider):
        crawler = CrawlerRunnerProcess(spider)
        crawler.start()
        crawler.join()
    
    # Enable logging when using CrawlerRunner
    configure_logging()
    
    # Start the crawler in a scheduler
    scheduler = BlockingScheduler(timezone="Europe/Amsterdam")
    # Use cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10)
    scheduler.add_job(run_spider, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
    # Use cron job; runs the full spider every Monday, Thursday and Saturday at 04:35
    scheduler.add_job(run_spider, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
    scheduler.start()
    

    The script no longer exits immediately (it blocks), and I now get the following output as expected:

    2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
    2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
    2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
    2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
    2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Scheduler started
    2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Looking for jobs to run
    2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2022-04-01 00:10:00+02:00 (in 4775.280995 seconds)
    

    Since we are using a BlockingScheduler, start() is a blocking call: the script does not exit, and the scheduler keeps running the jobs indefinitely.
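For context on why this works: Twisted allows only one reactor per process, and it cannot be restarted once stopped. Because every CrawlerRunnerProcess is a separate OS process, each scheduled run gets a fresh interpreter and can install, run, and stop its own reactor. A minimal stdlib-only sketch of that isolation pattern (no Scrapy or Twisted involved; the job function and message are illustrative):

```python
from multiprocessing import Process, Queue

def job(results: Queue) -> None:
    # Runs in a fresh child process: any "install once per process"
    # resource (such as Twisted's reactor) can be set up anew here,
    # and is torn down when the child exits.
    results.put("ran in a fresh child process")

if __name__ == "__main__":
    results = Queue()
    p = Process(target=job, args=(results,))
    p.start()
    p.join()          # mirrors crawler.join() in run_spider()
    print(results.get())
```

Each call to run_spider() in the solution above repeats this start/join cycle, which is why the second, third, and later crawls never collide with an already-installed reactor.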