There are many ways to start a Scrapy spider from a script (docs), but it gets complicated once Celery is involved.
What I want is a function that starts Scrapy with the settings from the project's settings.py file.
My setup looks something like this:
import os
import traceback

from twisted.internet import reactor
from celery import shared_task
from billiard.context import Process
from scrapy.crawler import CrawlerRunner
from my_spider.spiders.spider import MySpider
from scrapy.utils.project import get_project_settings


@shared_task
def start_scrapy(link):
    run_spider(link)


def run_spider(link):
    def _crawl(spider, *args, **kwargs):
        try:
            os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "my_spider.settings")
            settings = get_project_settings()
            settings.update(
                {
                    "FEEDS": {
                        f"output-{args[0]}.json": {
                            "format": "json",
                            "encoding": "utf-8",
                            "overwrite": True,
                        },
                    },
                }
            )
            runner = CrawlerRunner(settings)
            deferred = runner.crawl(spider, *args, **kwargs)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
        except Exception as e:
            print(f"Exception: {e}")
            traceback.print_exc()

    process = Process(target=_crawl, args=(MySpider, link))
    process.start()
    process.join()
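For context, the billiard Process wrapper is there because a Twisted reactor can only be started once per process, so each crawl needs a fresh one. billiard's Process mirrors the stdlib multiprocessing API, so the pattern looks roughly like this (a minimal sketch with a stand-in crawl function instead of the real Scrapy machinery):

```python
from multiprocessing import Process, Queue  # billiard.context.Process has the same API


def _crawl(result_queue, link):
    # Stand-in for the real crawl: in the actual task this is where
    # CrawlerRunner / reactor.run() would execute, once, in this fresh process.
    result_queue.put(f"crawled: {link}")


def run_spider(link):
    queue = Queue()
    process = Process(target=_crawl, args=(queue, link))
    process.start()
    process.join()  # block until the child process finishes
    return queue.get()
```

Because the reactor lives and dies with the child process, the Celery worker itself never starts (or tries to restart) a reactor.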
Now, without the os.environ line it works, but then it doesn't pick up the settings from the settings.py file.
For some reason the error is silent, so I had to put prints with flush=True inside the Scrapy source code. That is how I found the following error:
The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)
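A child process's traceback does not reach the Celery worker log by itself, which is why the error appears silent. One way to surface it without patching prints into the Scrapy source is to capture the traceback in the child and hand it back to the parent through a queue (a sketch using stdlib multiprocessing; billiard exposes the same API):

```python
import traceback
from multiprocessing import Process, Queue


def _child(error_queue):
    try:
        # Stand-in for the failing crawl setup.
        raise RuntimeError("reactor mismatch")
    except Exception:
        # format_exc() captures the full traceback as a string the parent can log.
        error_queue.put(traceback.format_exc())


def run_child():
    errors = Queue()
    process = Process(target=_child, args=(errors,))
    process.start()
    process.join()
    return errors.get(timeout=5)
```

In the real task, the parent would log (or re-raise) the returned traceback so the failure shows up in the Celery worker output.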
Of course, in the spider settings I have:
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
And with prints I have checked that get_project_settings returns the desired settings.
I've also tried calling asyncioreactor.install() and only then importing the reactor, but to no avail.
I use macOS.
The Celery configuration looks like this:
celery.py
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "listings.settings")

app = Celery("listings")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.conf.update(
    worker_concurrency=4,
    worker_prefetch_multiplier=1,
)
app.autodiscover_tasks()
settings.py
REDIS_HOST = "localhost"
REDIS_PORT = 6379
CELERY_BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
CELERY_RESULT_BACKEND = f"redis://{REDIS_HOST}:{REDIS_PORT}"
CELERY_TASK_TRACK_STARTED = True
CELERY_RESULT_EXTENDED = True
CELERY_RESULT_EXPIRES = 360
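As an aside on how these Django settings reach Celery: with namespace="CELERY", config_from_object only considers settings carrying that prefix, strips it, and lowercases the rest, so CELERY_BROKER_URL configures the broker_url option while REDIS_HOST is ignored. The mapping is roughly this (an illustrative sketch, not Celery's actual code):

```python
from typing import Optional


def to_celery_option(setting_name: str, namespace: str = "CELERY") -> Optional[str]:
    """Map a prefixed Django setting name to the Celery option it configures."""
    prefix = namespace + "_"
    if not setting_name.startswith(prefix):
        return None  # settings without the prefix are not passed to Celery
    return setting_name[len(prefix):].lower()
```

So to_celery_option("CELERY_BROKER_URL") gives "broker_url", while to_celery_option("REDIS_HOST") gives None.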
Package versions:
celery==5.3.6
Scrapy==2.11.1
Twisted==23.10.0
Django==4.2.4
Please help me find the solution to the problem.
So, with a clear mind I've come to a fairly logical solution: remove the Twisted reactor setting before creating the runner:
settings.delete("TWISTED_REACTOR")
With that setting gone, CrawlerRunner no longer requests the asyncio reactor, so the default reactor already installed in the child process is accepted and the mismatch error disappears.