Tags: python, scrapy, fastapi

Run Scrapy from a script when it gets a request


I have a FastAPI server listening on an endpoint. After receiving a POST request, it should use Scrapy to grab some data, depending on the data it got from that request.

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Request(BaseModel):
    someIDs: List[str]


process = CrawlerProcess(get_project_settings())

app = FastAPI()


@app.post("/")
def home(request: Request):
    process.crawl('rt_criteria', ids=request.someIDs)
    process.start()  # the script will block here until the crawling is finished
    return {"crawled": True}

# uvicorn main:app --reload

This code runs as I expect the first time, but on the second request I get a

twisted.internet.error.ReactorNotRestartable

error on:

process.start()

Where should I call this, and how can I fix the error?


Solution

  • I solved this problem using background tasks in FastAPI.

    (Under the hood, a sync task like this is run in a thread pool after the response has been sent.)

    from fastapi import FastAPI, BackgroundTasks
    
    # *** Not changed codes ***
    
    @app.post("/")
    async def home(request: Request, bt: BackgroundTasks):
        process.crawl('rt_criteria', ids=request.someIDs)
        # Changed lines below: start the crawl in a background task;
        # stop_after_crawl=False keeps the reactor alive for later requests
        bt.add_task(process.start, stop_after_crawl=False)
        return {"crawled": True}
    
    # uvicorn main:app --reload
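
  • Another common workaround, if you don't want to keep one long-lived reactor, is to run each crawl in a fresh child process: the Twisted reactor can only be started once per process, so giving every crawl its own process sidesteps ReactorNotRestartable entirely. Here is a minimal sketch of that pattern; the body of `run_crawl` is a stand-in for the real `CrawlerProcess` calls, so the sketch runs even without Scrapy installed:

    ```python
    import multiprocessing


    def run_crawl(ids):
        # In the real app this body would do:
        #   process = CrawlerProcess(get_project_settings())
        #   process.crawl('rt_criteria', ids=ids)
        #   process.start()
        # Each call lives in its own process, so the reactor
        # is started exactly once there and then discarded.
        print(f"crawling {len(ids)} ids")


    def crawl_in_subprocess(ids):
        # A brand-new process per request: no reactor restart ever happens
        p = multiprocessing.Process(target=run_crawl, args=(ids,))
        p.start()
        p.join()  # block until the crawl finishes


    if __name__ == "__main__":
        crawl_in_subprocess(["id1", "id2"])  # first call works
        crawl_in_subprocess(["id3"])         # so does the second
    ```

    In the endpoint you would then call `crawl_in_subprocess(request.someIDs)` instead of `process.start()`; dropping the `join()` (or moving the call into a background task) makes it fire-and-forget so the response returns immediately.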