python, python-asyncio, fastapi, python-multithreading, gil

What happens to the asyncio event loop when multiple CPU-bound tasks run concurrently in a ThreadPoolExecutor given Python’s GIL?


I'm working on an asynchronous Python application (using FastAPI/Starlette/asyncio) that needs to offload synchronous, CPU-bound tasks to a thread pool (ThreadPoolExecutor) to avoid blocking the event loop.

I understand that Python's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time per process. But I want to clarify how this affects the asyncio event loop when multiple CPU-bound tasks (say 10) are submitted concurrently to the thread pool.


Scenario


What I Understand


My Questions

  1. Is this understanding accurate regarding how the GIL contention affects the asyncio event loop?
  2. Does Python’s GIL and time-slicing mechanism indeed cause the event loop to be “starved” or blocked temporarily when multiple CPU-bound threads are running?
  3. Are there any internal scheduling mechanisms or priorities in CPython that favor the event loop thread over worker threads in such scenarios?
  4. Are there recommended best practices or architectural patterns to avoid this problem, aside from moving CPU-bound tasks to ProcessPoolExecutor or external services?

Context

I am considering using ThreadPoolExecutor in FastAPI to run some blocking CPU-bound tasks asynchronously but want to understand the implications on event loop responsiveness when multiple such tasks run concurrently.


Thank you in advance for any clarifications or insights!


Solution

  • In short, yes, your understanding is correct. When a thread holds the Global Interpreter Lock (GIL), even a thread executing a CPU-bound (rather than I/O-bound) task via ThreadPoolExecutor, as you mentioned, no other thread can run Python bytecode until the GIL is voluntarily released or the interpreter asks the holder to drop it once the switch interval has passed. That includes the event loop, or, more precisely, the thread in which the event loop runs; in this case, that is the main thread, i.e., the thread from which the Python interpreter was started (each Python process is created with one).

    As mentioned in this answer, which I would recommend having a look at, if a CPU-bound operation (or an I/O-bound one that does not voluntarily release the GIL) is executed inside a thread and the GIL has not been released after 5 ms, Python will (automatically) request that the current thread release the GIL. The 5 ms value is the interpreter's default thread switch interval, which can be obtained as follows:

    import sys
    
    print(sys.getswitchinterval())  # 0.005
    

    This floating-point value determines the ideal duration of the "timeslices" allocated to concurrently running Python threads, and it can be configured using sys.setswitchinterval(interval) (the interval is given in seconds). Also, as noted in the linked documentation:

    Please note that the actual value can be higher, especially if long-running internal functions or methods are used. Also, which thread becomes scheduled at the end of the interval is the operating system's decision. The interpreter doesn't have its own scheduler.

    So, yes, "there's no guarantee the event loop will get immediate priority", and there is currently no way to make priority requests for particular threads, in order, for instance, to favor the main thread (in which the event loop is running, in this case) over worker threads.
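    That said, one knob you could experiment with is the switch interval itself: lowering it makes GIL hand-offs more frequent, which may reduce the worst-case latency the event loop experiences under thread contention, at the cost of more context-switching overhead. The 1 ms value below is purely illustrative and should be benchmarked against your own workload:

    import sys
    
    # Default is 0.005 (5 ms). A smaller interval requests GIL hand-offs more
    # often, which can lower event-loop latency when worker threads are busy,
    # but also increases context-switching overhead.
    sys.setswitchinterval(0.001)
    
    print(sys.getswitchinterval())  # 0.001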

    It should also be noted that this automatic GIL release (mentioned earlier) is best-effort, not guaranteed. Certain built-in functions implemented in C, such as pow(), may not release the GIL in some cases (e.g., when performing a single very large computation that never returns to the bytecode evaluation loop, where the switch-interval check happens), essentially blocking every other thread, and in this case the entire server, while they are running:

    pow(365,100000000000000)  # this would not release the GIL
    

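    To observe this behavior outside of a web framework, here is a minimal, self-contained sketch (the task and the timing values are made up purely for illustration) that measures how much the event loop's "heartbeat" is delayed while a CPU-bound function runs in the default thread pool:

    import asyncio
    import time
    
    
    def cpu_bound():
        # A pure-Python loop: the GIL is handed over roughly every switch
        # interval, so the heartbeat below keeps ticking, just with higher
        # latency than usual.
        total = 0
        for i in range(20_000_000):
            total += i
        return total
    
    
    async def heartbeat():
        # Prints how late each 100 ms tick fires while the worker thread runs
        for _ in range(10):
            start = time.perf_counter()
            await asyncio.sleep(0.1)
            delay = time.perf_counter() - start - 0.1
            print(f"heartbeat delayed by {delay * 1000:.1f} ms")
    
    
    async def main():
        loop = asyncio.get_running_loop()
        ticker = asyncio.create_task(heartbeat())
        # run_in_executor(None, ...) uses the loop's default ThreadPoolExecutor
        await loop.run_in_executor(None, cpu_bound)
        await ticker
    
    
    asyncio.run(main())

    Replacing the loop inside cpu_bound() with a single pow(365, 100000000000000) call makes the difference obvious: the heartbeat stops entirely until the call returns, because the GIL is never handed over.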
    Therefore, in such cases, it would be more appropriate to use a ProcessPoolExecutor instead of ThreadPoolExecutor or FastAPI/Starlette's external threadpool (see the linked answer above, as well as this answer, for more details on that). You could alternatively increase the number of server workers/processes, as noted at the bottom of the first linked answer above (please take a look at that part for more details and the possible constraints). Either way, a separate process also means a separate GIL. In the case of server workers, each worker has its own event loop running in the main thread of its process. Thus, depending on the number of server workers, several requests can be served in parallel without blocking the server.

    Example

    On a side note, in real-world scenarios, it might be best to create a single reusable ProcessPoolExecutor, as explained and demonstrated in the linked answer above, instead of creating a new pool every time the endpoint is called, as the example below does for simplicity; a sketch of the reusable-pool pattern is given after the example. Please take a look at that answer for more details and examples on that.

    from fastapi import FastAPI
    import concurrent.futures
    import asyncio
    from multiprocessing import current_process
    from threading import current_thread
    
    
    app = FastAPI()
    
    
    def cpu_bound_task():
        # Print which process/thread runs the task, so you can see that
        # ThreadPoolExecutor keeps it in the same process, while
        # ProcessPoolExecutor runs it in a separate one
        pid = current_process().pid
        tid = current_thread().ident
        thread_name = current_thread().name
        process_name = current_process().name
        print(f"{pid} - {process_name} - {tid} - {thread_name}")
        # A single long-running C-level call that never releases the GIL
        pow(365, 100000000000000)
    
    
    # this WILL block the event loop (because of `pow()`)
    @app.get("/blocking")
    async def blocking():
        loop = asyncio.get_running_loop()
        with concurrent.futures.ThreadPoolExecutor() as pool:
            res = await loop.run_in_executor(pool, cpu_bound_task)
        return "OK"
        
     
    # this WON'T block the event loop
    @app.get("/non-blocking")
    async def non_blocking():
        loop = asyncio.get_running_loop()
        with concurrent.futures.ProcessPoolExecutor() as pool:
            res = await loop.run_in_executor(pool, cpu_bound_task)
        return "OK"
        
        
    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app)
    

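    Following up on the side note above, a minimal sketch of the reusable-pool pattern (assuming FastAPI's lifespan handler; storing the pool on app.state is just one illustrative choice, and the linked answer shows a fully worked version) could look like this:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    from contextlib import asynccontextmanager
    
    from fastapi import FastAPI, Request
    
    
    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Create the pool once at startup and reuse it for every request
        app.state.executor = ProcessPoolExecutor()
        yield
        # Shut it down gracefully when the server stops
        app.state.executor.shutdown()
    
    
    app = FastAPI(lifespan=lifespan)
    
    
    def cpu_bound_task():
        return sum(i * i for i in range(10_000_000))
    
    
    @app.get("/compute")
    async def compute(request: Request):
        loop = asyncio.get_running_loop()
        # Reuse the shared pool instead of creating a new one per request
        res = await loop.run_in_executor(request.app.state.executor, cpu_bound_task)
        return {"result": res}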
    If your cpu_bound_task() does not involve functions such as pow() that hold the GIL for their entire duration, you could still choose to use ThreadPoolExecutor. Although this would keep the event loop from being blocked outright, it would not give you the performance improvement you would expect from running CPU-bound tasks in parallel, since only one thread executes Python bytecode at a time. ThreadPoolExecutor should be preferred for blocking I/O-bound tasks instead.

    As for whether "multiple CPU-bound tasks running concurrently in a threadpool could potentially monopolize the GIL, thus delaying the event loop and causing increased latency and reduced responsiveness": this needs to be tested to say with certainty, and it will depend on the system resources available on your server, the expected traffic (the number of requests submitted within a certain time period), and the nature of the CPU-bound tasks. Should this ever become an issue, you could implement a queue or limiting mechanism to control how many requests are processed concurrently (or in parallel), thus minimizing potential delays and improving responsiveness; see this answer, for instance (more options are provided later on), as well as the sketch below.
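    For instance, a minimal sketch of such a limiting mechanism, using an asyncio.Semaphore (the limit of 2 is arbitrary and should be tuned to your workload; Python 3.10+ is assumed, where asyncio primitives no longer bind to an event loop at creation time), might look like this:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    
    from fastapi import FastAPI
    
    app = FastAPI()
    
    # Allow at most 2 CPU-bound tasks in flight at any given time; further
    # requests wait on the semaphore instead of piling more work onto the
    # GIL (or onto the process pool).
    semaphore = asyncio.Semaphore(2)
    executor = ProcessPoolExecutor()
    
    
    def cpu_bound_task():
        return sum(i * i for i in range(10_000_000))
    
    
    @app.get("/limited")
    async def limited():
        loop = asyncio.get_running_loop()
        async with semaphore:
            res = await loop.run_in_executor(executor, cpu_bound_task)
        return {"result": res}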

    Further, as explained in the second linked answer earlier, note that when using ThreadPoolExecutor, the max_workers argument defaults to None, meaning the number of worker threads is computed as min(32, os.cpu_count() + 4) (in Python 3.13 this changed to min(32, (os.process_cpu_count() or 1) + 4)). For ProcessPoolExecutor, on the other hand, max_workers defaults to os.process_cpu_count(). If, for instance, your machine has 4 physical cores, each with hyperthreading, Python will see 8 CPUs and will allocate 12 threads (8 CPUs + 4) to the pool by default. This number of worker threads may not match your project's requirements, and it may need to be adjusted.

    While it rarely makes sense to raise ProcessPoolExecutor's worker count above the default described above, it can make sense for ThreadPoolExecutor, as it is common to have more threads than CPUs (physical or logical) in a system. The reason is that threads are mainly used for I/O-bound tasks (which spend most of their time waiting for relatively slow resources and can therefore be interleaved during I/O waits), not CPU-bound tasks. Note, though, that with too many threads active at once, your program may spend more time context switching than actually executing tasks. I would therefore suggest running benchmark tests, similar to the ones described in the linked answer earlier, comparing the overall execution time, and then choosing the number of threads that gives approximately the best performance (e.g., ThreadPoolExecutor(max_workers=n)).
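    As a quick illustration (the numbers printed will of course depend on your machine, and the explicit limit of 20 below is purely illustrative), you can inspect the defaults and set an explicit limit like this:

    import os
    from concurrent.futures import ThreadPoolExecutor
    
    cpus = os.cpu_count()
    print(cpus)  # e.g., 8 on a 4-core machine with hyperthreading
    
    # Default worker-thread count used by ThreadPoolExecutor (pre-3.13 formula)
    print(min(32, (cpus or 1) + 4))  # e.g., 12
    
    # An explicit limit chosen after benchmarking
    pool = ThreadPoolExecutor(max_workers=20)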

    It should also be noted that one could instead run the CPU-bound tasks in the background (returning an immediate response to the client with a unique ID assigned to their request) and implement a polling mechanism, so that the client can check on the status of the task and retrieve the result once it is completed (see Solution 2 of this answer and the citations provided there for working examples; a rough sketch follows below), or use websockets instead, as shown here, as well as here and here.
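    A rough, in-memory sketch of that polling pattern (the endpoint names, the jobs dict, and the shared ProcessPoolExecutor are all illustrative assumptions; the linked answers cover production-ready variants) might look like this:

    import asyncio
    import uuid
    from concurrent.futures import ProcessPoolExecutor
    
    from fastapi import FastAPI, HTTPException
    
    app = FastAPI()
    executor = ProcessPoolExecutor()
    jobs: dict[str, dict] = {}  # in-memory store; use a proper store in production
    
    
    def cpu_bound_task():
        return sum(i * i for i in range(10_000_000))
    
    
    @app.post("/tasks")
    async def submit_task():
        task_id = str(uuid.uuid4())
        loop = asyncio.get_running_loop()
        future = loop.run_in_executor(executor, cpu_bound_task)
        jobs[task_id] = {"status": "running", "result": None}
    
        def _store_result(fut, task_id=task_id):
            # Runs in the event loop thread once the worker process finishes
            try:
                jobs[task_id] = {"status": "done", "result": fut.result()}
            except Exception as exc:
                jobs[task_id] = {"status": "failed", "result": str(exc)}
    
        future.add_done_callback(_store_result)
        # Respond immediately; the client polls GET /tasks/{task_id} for the result
        return {"task_id": task_id}
    
    
    @app.get("/tasks/{task_id}")
    async def get_task(task_id: str):
        job = jobs.get(task_id)
        if job is None:
            raise HTTPException(status_code=404, detail="Unknown task ID")
        return job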

    GIL Becomes Optional in Python 3.13

    I should also mention that Python 3.13 introduced an experimental free-threaded build of CPython, in which the GIL can be disabled. As noted in the docs:

    This is an experimental feature and therefore is not enabled by default. The free-threaded mode requires a different executable, usually called python3.13t or python3.13t.exe. Pre-built binaries marked as free-threaded can be installed as part of the official Windows and macOS installers, or CPython can be built from source with the --disable-gil option.

    Free-threaded execution allows for full utilization of the available processing power by running threads in parallel on available CPU cores. While not all software will benefit from this automatically, programs designed with threading in mind will run faster on multi-core hardware. The free-threaded mode is experimental and work is ongoing to improve it: expect some bugs and a substantial single-threaded performance hit.

    With free-threaded CPython, threads can execute Python code truly in parallel, making multi-threaded applications much more efficient, especially for CPU-bound tasks. However, as mentioned in the docs, this is still an experimental feature, and one should expect bugs as well as a single-threaded performance hit.
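    If you want to check at runtime whether you are actually running a free-threaded build with the GIL disabled (assuming Python 3.13+; on older versions these attributes simply do not exist, hence the fallbacks), you could use something like:

    import sys
    import sysconfig
    
    # 1 if this interpreter was built with --disable-gil, 0 (or None) otherwise
    print(sysconfig.get_config_var("Py_GIL_DISABLED"))
    
    # True if the GIL is currently enabled; falls back to True on older versions
    print(getattr(sys, "_is_gil_enabled", lambda: True)())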


    As for articles and tutorials to look at, I would suggest starting with the official FastAPI tutorial, as well as having a look at Starlette's docs. Beyond that, I would recommend going through all the linked answers above (and the references included in them), as they will help you better understand many of the concepts and technologies used in FastAPI/Starlette.