scrapy celery twisted

Issue when running Scrapy from a script in Celery: The installed reactor does not match the requested one


There are many ways to start a Scrapy spider from a script (docs), but doing it inside a Celery task is somewhat more complicated.

What I want is a function that starts Scrapy with the settings from the settings.py file.

My setup looks something like this:

import os
import traceback
from twisted.internet import reactor
from celery import shared_task
from billiard.context import Process
from scrapy.crawler import CrawlerRunner
from my_spider.spiders.spider import MySpider
from scrapy.utils.project import get_project_settings

@shared_task
def start_scrapy(link):
    run_spider(link)

def run_spider(link):
    def _crawl(spider, *args, **kwargs):
        try:
            os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "my_spider.settings")
            settings = get_project_settings()
            settings.update(
                {
                    "FEEDS": {
                        f"output-{args[0]}.json": {
                            "format": "json",
                            "encoding": "utf-8",
                            "overwrite": True,
                        },
                    },
                }
            )
            runner = CrawlerRunner(settings)
            deferred = runner.crawl(spider, *args, **kwargs) 
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
        except Exception as e:
            print(f"Exception: {e}")
            traceback.print_exc()

    process = Process(target=_crawl, args=(MySpider, link))
    process.start()
    process.join()

Without the os.environ.setdefault line it works, but then the spider does not pick up the settings from the settings.py file.

For some reason the error is silent, so I had to add print calls with flush=True inside the Scrapy source code. That is how I found the following error: The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

Of course in the spider settings I have

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Using prints, I have checked that get_project_settings returns the desired settings. I have also tried calling asyncioreactor.install() before importing the reactor, but to no avail.

I use macOS.

The Celery configuration looks like this (celery.py):

import os
from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "listings.settings")

app = Celery("listings")

app.config_from_object("django.conf:settings", namespace="CELERY")
app.conf.update(
    worker_concurrency=4,
    worker_prefetch_multiplier=1,
)
app.autodiscover_tasks()

settings.py

REDIS_HOST = "localhost"
REDIS_PORT = 6379

CELERY_BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
CELERY_RESULT_BACKEND = f"redis://{REDIS_HOST}:{REDIS_PORT}"
CELERY_TASK_TRACK_STARTED = True
CELERY_RESULT_EXTENDED = True
CELERY_RESULT_EXPIRES = 360

Package versions:

celery==5.3.6
Scrapy==2.11.1
Twisted==23.10.0
Django==4.2.4

Please help me find the solution to the problem.


Solution

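  • With a clear mind I've come to a fairly logical solution: remove the Twisted reactor from the settings by calling settings.delete("TWISTED_REACTOR") after get_project_settings(). As far as I can tell, Scrapy then no longer requests a specific reactor and accepts the SelectReactor that is already installed.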