I'm trying to create an application that displays information to the user from an asynchronous scraper. The scraper must run independently and continuously. When the user visits the relevant page of the site, they should be automatically subscribed to the data stream of the independently running scraper.
For example: the scraper runs in a `while True` loop. It collects, processes, and sends the processed data. A user who visits the site should see the data the scraper produced during its most recent iteration. When the scraper has collected, processed, and returned data again, the user's view should update automatically, and so on in a loop.
Unfortunately, I can’t show you the project code due to confidentiality.
Below I describe the methods I found and tried to apply. All of them involve Django Channels, multiprocessing, or threads in some way.
Multiprocessing and Pipe. It was suggested to run the script (in my case, the asynchronous scraper) as a separate process and use a Pipe to pass its data to another script (in my case, a Django view).
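A minimal sketch of what that suggestion amounts to, as I understand it; `scrape_once()` is a hypothetical placeholder for the real collect-and-process step:

```python
import multiprocessing as mp
import time

def scrape_once():
    # Hypothetical placeholder for one collect/process iteration of the real scraper.
    return {"ts": time.time(), "items": []}

def scraper_process(conn):
    # Runs forever in its own process and pushes every result through the pipe.
    while True:
        conn.send(scrape_once())
        time.sleep(5)

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    mp.Process(target=scraper_process, args=(child_conn,), daemon=True).start()
    # The Django side would poll parent_conn; here we just read two results.
    print(parent_conn.recv())
    print(parent_conn.recv())
```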
WebSockets. Create a WS endpoint and have the client connect to it.
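If I went that route, the usual Django Channels pattern (as far as I understand it) is a consumer that joins a group, with the scraper broadcasting into that group through the channel layer. A rough sketch, where the group name "scraper" and the event shape are my own assumptions:

```python
import json
from channels.generic.websocket import AsyncWebsocketConsumer

class ScraperConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # Every connected browser joins the same group.
        await self.channel_layer.group_add("scraper", self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard("scraper", self.channel_name)

    async def scraper_update(self, event):
        # Invoked when the scraper does group_send({"type": "scraper.update", "data": ...}).
        await self.send(text_data=json.dumps(event["data"]))
```

The scraper itself would then push each result with `await get_channel_layer().group_send("scraper", {"type": "scraper.update", "data": data})` after every iteration.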
SSE. This option seemed the most correct to me, since I don't need a two-way connection as with WebSockets; I only need the client to update when the information on the server updates. While looking for a way to implement SSE, I came across the daphne module. Following the short guide linked below, I connected the module and it works, but only with the example from the guide itself. The idea is that a separate view function generates some data in an endless loop and streams it to the user, and each new user is a separate thread on the server. I was unable to launch my scraper as a third-party module inside that loop: it throws RuntimeError: asyncio.run() cannot be called from a running event loop. Here is the link to the guide (it's brief): https://www.photondesigner.com/articles/server-sent-events-daphne?ref-yt-server-sent-events-daphne
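For context, the RuntimeError comes from calling asyncio.run() while daphne's event loop is already running; inside the streaming view the scraper coroutine has to be awaited (or scheduled with asyncio.create_task) instead. A rough sketch of what I mean, assuming Django 4.2+ under ASGI, with `scrape_once()` again a hypothetical stand-in for one scraper iteration:

```python
import asyncio
import json
from django.http import StreamingHttpResponse

async def scrape_once():
    # Hypothetical stand-in for one iteration of the real async scraper.
    await asyncio.sleep(1)
    return {"items": []}

async def sse_view(request):
    async def event_stream():
        while True:
            data = await scrape_once()  # await it here; do NOT call asyncio.run()
            yield f"data: {json.dumps(data)}\n\n"

    return StreamingHttpResponse(event_stream(), content_type="text/event-stream")
```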
Which of these approaches is the most appropriate, and how do I connect my scraper to it correctly? Maybe I missed some other method entirely, I don't know.
Solving the problem was easier than I expected.
Take Redis: run the scraper script separately and have it write its data to the database; another script (the Django side) reads the data back from the database.
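Roughly, the two sides look like the sketch below; the key name "scraper:latest" and the `scrape_once()` helper are my own placeholders, and I'm using the redis-py client:

```python
# scraper.py — runs as its own process, entirely outside Django.
import asyncio
import json
import time
import redis.asyncio as redis

async def scrape_once():
    # Hypothetical stand-in for one collect/process iteration.
    return {"ts": time.time(), "items": []}

async def main():
    r = redis.Redis()
    while True:
        data = await scrape_once()
        # Overwrite the latest result so readers always see the newest iteration.
        await r.set("scraper:latest", json.dumps(data))
        await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(main())
```

```python
# views.py — the Django side just reads whatever the scraper wrote last.
import json
import redis
from django.http import JsonResponse

r = redis.Redis()

def latest(request):
    raw = r.get("scraper:latest")
    return JsonResponse(json.loads(raw) if raw else {})
```

The client can then poll that endpoint, or the SSE view from above can read the same key on each iteration.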
I don’t know how “correct” this is, but that’s how it is for now.
Initially, I tried to implement this using Channels and Celery, at the suggestion of user Sagun Devkota.