Tags: python, multiprocessing, python-asyncio, gil

Is there a way to release the GIL for pure functions using pure python?


I think I must be missing something; this seems so right, but I can't see a way to do this.

Say you have a pure function in Python:

from math import sin, cos

def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    return (x, y)

Is there some built-in functionality or library that provides a wrapper which can release the GIL during the function's execution?

In my mind I am thinking of something along the lines of

from math import sin, cos
from somelib import pure

@pure
def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    return (x, y)

Why do I think this might be useful?

Because multithreading, which is currently attractive only for I/O-bound programs, would also become attractive for CPU-bound functions like this one once they run long enough. Doing something like

from math import sin, cos
from somelib import pure
from asyncio import run, gather, create_task

@pure  # releases GIL for f
async def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t)
    return (x, y)


async def main():
    step_size = 0.1
    result = await gather(*[create_task(f(t * step_size))
                            for t in range(0, round(10 / step_size))])
    return result

if __name__ == "__main__":
    results = run(main())
    print(results)
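
For comparison, here is a rough sketch of the same fan-out expressed with real OS threads via the standard library's concurrent.futures (nothing from the hypothetical somelib). This is the setting in which a GIL-releasing f would actually use multiple cores; today these threads still take turns holding the GIL for pure-Python work:

from concurrent.futures import ThreadPoolExecutor
from math import sin, cos

def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t)
    return (x, y)

if __name__ == "__main__":
    step_size = 0.1
    ts = [t * step_size for t in range(round(10 / step_size))]
    with ThreadPoolExecutor() as pool:
        # results stay in the same process: no pickling, no extra copies
        results = list(pool.map(f, ts))
    print(results)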

Of course, multiprocessing offers Pool.map, which can do something very similar. However, if the function returns a non-primitive / complex type, the worker has to serialize it and the main process HAS to deserialize it and build a new object, forcing a copy. With threads, the child thread passes a pointer and the main thread simply takes ownership of the object. Much faster (and cleaner?).
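
For reference, a minimal sketch of that multiprocessing variant (standard library only): every tuple f returns is pickled in the worker, sent over a pipe, and rebuilt as a new object in the parent.

from multiprocessing import Pool
from math import sin, cos

def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t)
    return (x, y)

if __name__ == "__main__":
    step_size = 0.1
    ts = [t * step_size for t in range(round(10 / step_size))]
    with Pool() as pool:
        # each returned tuple crosses the process boundary as pickled bytes
        results = pool.map(f, ts)
    print(results)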

To tie this to a practical problem I encountered a few weeks ago: I was doing a reinforcement learning project, which involved building an AI for a chess-like game. For this, I was simulating the AI playing against itself for more than 100,000 games, each game returning the resulting sequence of board states (a numpy array). Generating these games runs in a loop, and I use the data to train a stronger version of the AI on each iteration. Here, re-creating ("malloc") the sequence of states for each game in the main process was the bottleneck. I experimented with re-using existing objects, which is a bad idea for many reasons, but it didn't yield much improvement.
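
To make that copy concrete: round-tripping an array through pickle, which is roughly what a worker-to-parent transfer does, always yields a new object backed by a freshly allocated buffer (the shape and dtype below are invented for the illustration).

import pickle
import numpy as np

states = np.zeros((200, 8, 8), dtype=np.int8)   # stand-in for one game's board states
clone = pickle.loads(pickle.dumps(states))      # roughly what Pool does with a return value

print(clone is states)                          # False: a brand-new Python object
print(clone.ctypes.data == states.ctypes.data)  # False: a separately allocated buffer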

Edit: This question differs from "How to run functions in parallel?" because I am not just looking for any way to run code in parallel (I know this can be achieved in various ways, e.g. via multiprocessing). I am looking for a way to let the interpreter know that nothing bad will happen when this function is executed in a parallel thread.


Solution

  • Is there a way to release the GIL for pure functions using pure python?

    In short, the answer is no, because those functions aren't pure at the level at which the GIL operates.

    The GIL serves not just to protect Python objects from being updated concurrently by Python code; its primary purpose is to prevent the CPython interpreter from performing a data race (which is undefined behavior, i.e. forbidden, in the C memory model in which CPython executes) while accessing and updating global and shared data. This includes Python-visible singletons such as None, True, and False, but also all globals such as modules, shared dicts, and caches. On top of that comes their metadata, such as reference counts and type objects, as well as shared data used internally by the implementation.
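
    As a small illustration (CPython behaviour), two unrelated functions that merely mention the literal 16 already share a process-wide object, because the interpreter caches small integers:

    def a(): return 16
    def b(): return 16

    print(a() is b())    # True on CPython: both resolve to the same cached int
                         # object, so even "read-only" code touches shared state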

    Consider the provided pure function:

    def f(t):
        x = 16 * sin(t) ** 3
        y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
        return (x, y)
    

    The dis tool reveals the operations that the interpreter performs when executing the function:

    >>> dis.dis(f)
      2           0 LOAD_CONST               1 (16)
                  2 LOAD_GLOBAL              0 (sin)
                  4 LOAD_FAST                0 (t)
                  6 CALL_FUNCTION            1
                  8 LOAD_CONST               2 (3)
                 10 BINARY_POWER
                 12 BINARY_MULTIPLY
                 14 STORE_FAST               1 (x)
                 ...
    

    To run the code, the interpreter must access the global symbols sin and cos in order to call them. It accesses the integers 2, 3, 4, 5, 13, and 16, which are all cached and therefore also global. In case of an error, it looks up the exception classes in order to instantiate the appropriate exceptions. Even when these global accesses don't modify the objects, they still involve writes because they must update the reference counts.
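
    One way to observe those hidden writes (illustrative, CPython-specific): merely binding another name to the global sin changes the function object's reference count, even though nothing about sin was "modified":

    import sys
    from math import sin

    before = sys.getrefcount(sin)
    alias = sin                      # a purely "read-only" use of the global...
    after = sys.getrefcount(sin)
    print(before, after)             # ...still wrote to sin's refcount: after == before + 1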

    None of that can be done safely from multiple threads without synchronization. While it is conceivably possible to modify the Python interpreter to implement truly pure functions that don't touch global state, it would require significant changes to the internals and would affect compatibility with existing C extensions, including the immensely popular scientific ones. This last point is the principal reason why removing the GIL has proven so difficult.