Tags: python, multithreading, multiprocessing, python-asyncio

Crawling APIs - concurrency approach for periodic HTTP calls


I'm working on a piece of code that hits multiple web APIs (hardware devices that expose APIs reporting the machine's status) without blocking while one call is waiting for its response; as soon as a response arrives, it is emitted on a websocket. One requirement is not to overwhelm the APIs, so each should be hit roughly once per 5 seconds for as long as the main process is running.

The important part I'm struggling with is how to even approach this. What I've done so far: the main process spawns a separate thread per API; each thread hits its API, emits the response to the websocket, calls time.sleep(5), and repeats. The main process is responsible for starting new "workers", killing ones that are no longer needed, and restarting ones that should be running but aren't (e.g. because of an exception). I have no idea whether multithreading is the way to go here; let's say I aim to "crawl" through 300 APIs.
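The thread-per-API loop described above might look roughly like the sketch below. `fetch_status` and `emit_to_websocket` are stand-ins for the real HTTP call and websocket emit; the stop event lets the main process shut a worker down without waiting out a full sleep interval.

```python
import threading
import time

def fetch_status(url):
    # Placeholder for the real HTTP call (e.g. requests.get(url).json()).
    return {"url": url, "status": "ok"}

def emit_to_websocket(payload):
    # Placeholder for the real websocket emit.
    print(payload)

def watch_api(url, stop_event, interval=5.0):
    """Long-lived worker: poll one API until the main process signals stop."""
    while not stop_event.is_set():
        try:
            emit_to_websocket(fetch_status(url))
        except Exception as exc:
            print(f"worker for {url} failed: {exc}")
        # wait() instead of time.sleep() so the worker wakes up promptly on stop
        stop_event.wait(interval)

stop = threading.Event()
t = threading.Thread(target=watch_api, args=("http://machine-1/status", stop),
                     daemon=True)
t.start()
time.sleep(0.1)
stop.set()   # main process decides this worker is no longer needed
t.join()
```

The main process keeps one `(thread, stop_event)` pair per watched API, which is exactly the bookkeeping the question describes.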

Is spawning long-lived workers the right way to achieve this? Should those be processes instead? Should the main process instead coordinate short-lived threads that each make one API call and die, repeating every 5 seconds per API (which seems much harder to maintain)? If the last option, how do I handle cases where a response takes more than 5 seconds to arrive?

Some people talk about Python's asyncio as if it were the golden solution for every problem, but I don't understand how it could fit into mine.

Can someone point me in the right direction?


Solution

  • Let me rephrase this and tell me whether I'm right:

    I want to visit ~300 APIs frequently such that each is hit approximately every 5 seconds. How do I approach this and what worker/process management should I use?

    There are basically two different approaches:

    1. Spawn a thread for each API that is currently being watched (i.e. touched frequently) -- only feasible if at any given time only a subset of your total number of possible APIs is being watched.
    2. Set up a worker pool where all workers consume the same queue and have a management process fill the queue according to the time restrictions -- probably better when you always want to watch all possible APIs.
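    The second approach can be sketched with the stdlib `queue` module. The API list, `poll_api`, and the pool size are invented for illustration; a `None` sentinel per worker shuts the pool down:

    ```python
    import queue
    import threading

    API_ADDRESSES = [f"http://machine-{i}/status" for i in range(1, 4)]

    def poll_api(addr):
        # Placeholder for the real HTTP call + websocket emit.
        return f"polled {addr}"

    def worker(q):
        while True:
            addr = q.get()
            if addr is None:          # sentinel: shut this worker down
                q.task_done()
                break
            try:
                poll_api(addr)
            finally:
                q.task_done()

    q = queue.Queue()
    workers = [threading.Thread(target=worker, args=(q,), daemon=True)
               for _ in range(2)]
    for w in workers:
        w.start()

    # The management process would refill the queue on its own schedule
    # (e.g. every 5 seconds); here we enqueue one round and shut down.
    for addr in API_ADDRESSES:
        q.put(addr)
    q.join()
    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()
    ```

    Note the pool size is decoupled from the number of APIs: a handful of workers can service all 300 addresses, since each task is short.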

    Edit after your first comment:

    You know the number of APIs you want to watch, so the queue's length should never grow beyond that number. You can also scan the queue in your main process periodically and check whether an API address you want to add is already in there, and skip appending it a second time.

    To avoid hitting APIs too frequent, you can add target timestamps along with the API address to the queue (e.g. as a tuple) and have the worker wait until that time is reached before firing the query to that API. This will slow down the consumption of your entire queue but will maintain a minimum delay between to hits of the same API. If you choose to do so, you just have to make sure that (a) the API requests always respond in a reasonable time, and (b) that all API addresses are added in a round-robin manner to the queue.